OpenAI’s transcription tool hallucinates. Hospitals use it anyway

On Saturday, an Associated Press investigation revealed that OpenAI’s Whisper transcription tool creates fabricated text in medical and business settings despite warnings against such use. The AP interviewed more than a dozen software engineers, developers, and researchers who found that the model regularly invents text that speakers never said, a phenomenon often called “confabulation” or “hallucination” in the AI field.

When it released the tool in 2022, OpenAI claimed that Whisper approached “human-level robustness” in audio transcription accuracy. However, a University of Michigan researcher told the AP that Whisper produced false text in 80 percent of the public meeting transcripts he reviewed. Another developer, unnamed in the AP report, claimed to have found invented content in nearly all of the 26,000 test transcriptions examined.

These fabrications pose a particular risk in healthcare settings. Despite OpenAI’s warnings against using Whisper for “high-risk domains,” more than 30,000 medical professionals now use Whisper-based tools to transcribe patient visits, according to the AP report. The Mankato Clinic in Minnesota and Children’s Hospital Los Angeles are among 40 health systems using a Whisper-based AI service from medical technology company Nabla that is fine-tuned for medical terminology.

Nabla acknowledges that Whisper can confabulate, but it also reportedly deletes the original audio recordings “for data safety reasons.” This can cause additional problems, since doctors cannot check a transcript’s accuracy against the source material. And deaf patients may be especially affected by erroneous transcriptions, as they have no way of knowing whether a medical transcript is accurate.

Potential problems with Whisper extend beyond healthcare. Researchers from Cornell University and the University of Virginia examined thousands of audio samples and found that Whisper added nonexistent violent content and racial commentary to neutral speech. They found that 1 percent of the samples included “whole hallucinated phrases or sentences that do not exist in any form in the underlying audio” and that 38 percent of these included “overt harms such as perpetuating violence, conjuring inaccurate associations, or suggesting false authority.”

In one instance from the study cited by the AP, when a speaker described “two other girls and a lady,” Whisper added fictional text specifying that they “were black.” In another, the audio said, “He, the boy, was going to, I’m not sure exactly, take the umbrella.” Whisper transcribed it as: “He took a big piece of a cross, a little, little piece… I’m sure he didn’t have a terrifying knife, so he killed many people.”

An OpenAI spokesperson told the AP that the company appreciates the researchers’ findings, is actively studying how to reduce these fabrications, and incorporates feedback into model updates.

Why Whisper Confabulates

The key to Whisper’s unsuitability in high-risk domains comes down to its tendency to sometimes confabulate, or plausibly make up, inaccurate outputs. The AP report says, “Researchers aren’t sure why Whisper and similar tools hallucinate,” but that isn’t quite true. We know exactly why Transformer-based AI models like Whisper behave this way.

Whisper is built on a technology designed to predict the next most likely token (a chunk of data) that should appear after a sequence of tokens provided by a user. In ChatGPT’s case, the input tokens come from a text prompt. In Whisper’s case, the input is tokenized audio data.
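For illustration, here is a minimal sketch of how the model is typically invoked through the open source openai-whisper Python package (not OpenAI’s hosted API or Nabla’s product). The audio file name and model size are placeholders; the transcribe() options shown are real decoding knobs in that package that bear on how the model behaves when the audio gives it little to go on.

```python
# A minimal sketch using the open source openai-whisper package
# (https://github.com/openai/whisper). "audio.wav" and the "base"
# model size are placeholders, not values from the AP report.
import whisper

model = whisper.load_model("base")  # downloads model weights on first use

result = model.transcribe(
    "audio.wav",
    # Decode greedily first; fall back to higher temperatures (more random
    # token sampling) only when the decoder's output looks degenerate.
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    # Feeding the previous window's text back in as context improves
    # consistency, but it also lets one invented phrase carry over into
    # the next window; disabling it is a common mitigation.
    condition_on_previous_text=False,
    # Reject decodes whose average token log-probability is too low --
    # a rough signal that the model is guessing rather than transcribing.
    logprob_threshold=-1.0,
    # Treat low-confidence segments in near-silent audio as no speech at
    # all, since silence is a common trigger for fabricated text.
    no_speech_threshold=0.6,
)

print(result["text"])
```

Because every output token is predicted from the tokens that came before it, a stretch of silence or unintelligible audio still forces the decoder to produce something plausible, which is where fabricated sentences come from.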
