
Turning turbulence into transcripts

IBM worked with Australia’s Royal Flying Doctor Service to aid clinicians on the ground and in the air using IBM Granite's high-performing speech recognition technology.

Have you ever tried to dictate a text message to a friend, or asked a voice assistant a question, only to find out that your phone completely misunderstood you? Chances are, if you’ve used any major smartphone software in the last decade, you have. Often, that’s because these systems can’t decipher speech in imperfect situations. And most of the time, the real world is less than perfect.

New applications of IBM’s Granite speech model show, however, that even in loud and trying scenarios, the model can understand people far more easily than other major language models. This has the potential to unlock myriad new use cases for LLMs where the primary input is speech.

One example can be found in the skies above Australia. The Royal Flying Doctor Service (RFDS) is a nonprofit that operates across the country to provide critical access to medical professionals and services for those far from city centers. Australia’s major population centers are primarily located around its coasts and away from the arid plains and outback in the middle, but there are still people who live and work in isolated areas. The RFDS’s small planes can access people living in remote communities, and those working at mining sites or farms in the outback, who need urgent medical care or a regional care clinic.

As with any modern medical procedure, the RFDS’s clinicians record everything they’re doing with a patient, both for accountability and future care considerations. Every time a bag of saline is pulled from a compartment, or a vial of morphine is drawn, the clinicians onboard the cramped and loud aluminum tubes are recording all that information onto a tablet, along with patient care notes. These notes become part of an electronic health record that follows the patient through their care journey.

Usually, clinicians type out everything they’ve done for a patient after they’ve finished treating them. But for many patients in emergency medical situations, there isn’t a lot of downtime on these flights — they might need constant monitoring or attention while in transit. IBM Client Engineering developed a prototype in a hackathon organized by RFDS that uses speech and visual AI to streamline patient information into the electronic health record application. The team at IBM found that their solution cut the administrative time clinicians spent per patient from 28 minutes down to just 2. This automated system keeps the clinician’s records accurate and gives them far more time to spend on critical patient care.

Automation like this is especially valuable in emergency situations where the clinician’s focus on the patient is critical. There can be one or two flight clinicians aboard dealing with up to two patients per trip. Writing up what each of them is doing for each patient takes time away from hands-on care.

RFDS had asked IBM Client Engineering to demonstrate how AI could be used safely to improve patient care and flight clinicians’ workload. With the solution IBM came up with, the clinicians wouldn’t have to wait for spare moments between attending to patients or risk forgetting to write something down. But tiny aircraft are notoriously loud — would any model be robust enough to pick out a clinician’s voice over the din of the engines and transcribe it effectively? “It’s basically a tiny cigar tube in the sky,” said Phil Downey, a technical product manager at IBM Research, working with the Client Engineering team and RFDS.

IBM Research and Client Engineering started testing IBM’s own Granite-Speech, a model specially tuned for speech recognition, to build out their idea. They created a simple workflow around it: a user records audio, which is sent to a transcription system running the Granite model on the vLLM inference engine, and then on to an interface where the user or someone else can read what was said. The team found that the Granite model was particularly adept at picking out the flight clinicians’ and pilots’ voices on audio recorded by RFDS while flying their aircraft at altitude. The recordings had all the background engine noise and static that is usually present in flight, captured with no noise-cancelling technology on an aircraft that wasn’t insulated against noise in any way.
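The workflow described above can be sketched as a small pipeline with a pluggable transcription backend. This is a minimal illustration, not the team's actual code: the function and type names are invented, and the stub transcriber stands in for a real call to a Granite-Speech model served by vLLM.

```python
# Sketch of the workflow: capture audio clips, send each to a transcription
# backend, and collect readable entries for a display step. All names here
# are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TranscriptEntry:
    clip_id: str
    text: str

def run_pipeline(clips: list[tuple[str, bytes]],
                 transcribe: Callable[[bytes], str]) -> list[TranscriptEntry]:
    """Send each recorded clip to the transcriber and collect the results."""
    entries = []
    for clip_id, audio in clips:
        entries.append(TranscriptEntry(clip_id=clip_id, text=transcribe(audio)))
    return entries

# Stub transcriber; in the real system this would POST the audio to the
# vLLM server hosting the Granite-Speech model and return its output.
def fake_transcribe(audio: bytes) -> str:
    return f"<{len(audio)} bytes transcribed>"

entries = run_pipeline([("clip-1", b"\x00" * 16)], fake_transcribe)
print(entries[0].text)  # prints "<16 bytes transcribed>"
```

Keeping the transcriber pluggable means the same pipeline can run against a cloud deployment of the full-size model or a local one, which is what the team went on to test.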

The IBM team tested their system running in the cloud with the full-size Granite-Speech model, as well as on very run-of-the-mill hardware. Using the smaller 2B version of the model on a Windows 11 machine with a five-year-old Intel i9-12900K CPU and a three-year-old NVIDIA GeForce RTX 4060 Ti, they had the model running locally with ease. Downey and teammate Srikanth Koneru noted that at most, the tiny 2B model running locally used less than 18GB of system memory and 6GB of GPU memory, and no more than 10% of the CPU throughout the process.

The team’s tests showed that it was possible to run a system like this onboard the RFDS’s aircraft. Because of the noise, the staff aboard already wear headsets with microphones to communicate, and already have tablets and computers on board their flights. And as of last year, the Australian government mandated that all pathology reports collected by clinicians must be filed into My Health Record, the country’s electronic health record system.

Once other IBM researchers heard about the work with RFDS, they wanted to see just how capable Granite could be in difficult-to-hear environments. George Saon, a distinguished research scientist leading IBM Research’s AI speech strategy, replicated Downey’s work with the recordings taken from the plane. Along with Luis Lastras, the director of language technologies at IBM Research, the two compared how other major models from the top AI players fared on the same recordings. The Granite model was able to understand the clinicians’ recordings considerably better than any other model they tested, with most others only picking up a few phrases here and there.

When asked why the Granite model performed so much more effectively than others, Saon said it was part of the team’s generally rigorous methods. “It was just down to the way we train models,” Saon said. Many of the enterprise use cases IBM’s clients would use speech-recognition models for could involve poor-quality audio. Whether that’s a customer service bot listening to a customer who has a poor cell connection, or a facilities manager trying to file a report in a loud server room or factory, there are myriad places where watsonx and Granite models could be expected to work.

But this model wasn’t specifically trained to handle this airborne task. “Granite completely aces this,” said Lastras. “This is truly out of distribution — it was never trained for this.”

The Granite team trains many of its models by masking sections of its training data; in this case that could mean parts of words are garbled or masked out in audio recordings used in training data, and when the model chooses the correct word, it’s rewarded. The goal is to make these models more robust for less-than-ideal situations, while not affecting how they work when sound quality is good. “Speech recognition is not a solved problem, even though everyone takes it for granted,” Saon argued. “As soon as you hit somewhere noisy, like an airplane, or a restaurant or car with crosstalk, there are issues — people are still working hard on those.”

IBM Client Engineering in Australia and IBM Research are working with the RFDS to turn this concept into a reality. The team believes simple additions could make this concept even stronger, such as having the system recognize specific words as flags to send pertinent information to the right people. Downey said the team envisions a doctor talking to the system about the medicines they’ve administered and having that part of the transcribed report go directly to the hospital’s pharmacy department, or having notes about additional treatment go to the attending physicians. This would be processed downstream, potentially by the receiving hospital’s computers once the report has come in.
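The keyword-flag idea Downey describes could look something like the sketch below: scan each transcribed sentence for trigger words and route it to the matching downstream recipient. The keyword map and department names here are illustrative assumptions, not part of any actual RFDS system.

```python
# Hypothetical keyword-to-department routing for a transcribed report.
# First matching keyword decides where a sentence is sent.
ROUTES = {
    "administered": "pharmacy",
    "morphine": "pharmacy",
    "follow-up": "attending_physicians",
    "treatment": "attending_physicians",
}

def route_transcript(transcript: str) -> dict[str, list[str]]:
    """Group transcript sentences by the department a keyword flags."""
    routed: dict[str, list[str]] = {}
    for sentence in transcript.split("."):
        sentence = sentence.strip()
        if not sentence:
            continue
        for keyword, dept in ROUTES.items():
            if keyword in sentence.lower():
                routed.setdefault(dept, []).append(sentence)
                break  # first match wins
    return routed

report = "Administered 5mg morphine at 14:02. Patient stable. Needs follow-up imaging."
print(route_transcript(report))
```

In practice this routing would run downstream, as the article notes, on the receiving hospital's systems rather than in the aircraft.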

This scenario is just one of many where Granite-powered speech-recognition systems could change how we work and live. Anyone can try out the underlying open-source models on Hugging Face right now. And the potential upsides of bringing technology like this to the most remote locations could be massive. “If we can make it mobile, it makes a lot of sense,” Downey said. “There’s a big use case for it in broader use cases across health and other industries.”

IBM Research has since released granite-4.0-1b-speech on Hugging Face. It’s not only IBM’s smallest LLM-based ASR model to date, but also currently the number one open-weights model on the OpenASR leaderboard for English speech recognition accuracy. It introduces additional language support as well as keyword biasing through simple prompting, making it easy for developers to customize specific words and expand the capabilities of our specialized small models.
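The exact prompt format for keyword biasing is model-specific, so the helper below is only an illustrative sketch of the general idea: folding likely domain terms (here, hypothetical medical vocabulary) into the transcription instruction so the model favors them over acoustically similar words.

```python
# Illustrative prompt builder for keyword biasing; the wording and the
# vocabulary list are assumptions, not the model's documented format.
def build_biased_prompt(keywords: list[str]) -> str:
    """Fold likely domain terms into a transcription instruction."""
    hint = ", ".join(keywords)
    return f"Transcribe the audio. Pay attention to these likely terms: {hint}."

prompt = build_biased_prompt(["saline", "morphine", "intubation"])
print(prompt)
```

For the actual prompt syntax, the model card on Hugging Face is the authoritative reference.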
