Whisper-Medusa: An Open-Source Speech Recognition Model That Runs 50% Faster Than OpenAI’s Whisper

aiOla, an Israeli AI startup, has released an open-source speech recognition model called Whisper-Medusa. According to the company, the model is 50% faster than OpenAI’s Whisper, a gain it attributes to a “multi-head attention” architecture. Whisper-Medusa predicts more tokens at a time than its predecessor, which speeds up speech-to-text conversion. The model’s code and weights are available on Hugging Face under an MIT license that permits both research and commercial use.

Whisper-Medusa builds on the success of Whisper, which has become the gold standard in speech recognition. Whisper is widely used in sectors like healthcare and fintech, where it handles tasks such as transcription and powers multimodal AI systems. aiOla’s goal with Whisper-Medusa is to offer the same capability at a higher speed.

To achieve this, aiOla modified Whisper’s architecture by adding what it calls a multi-head attention mechanism. Whisper’s own transformer layers already use multi-head attention in the usual sense of attending to several representation subspaces at once; the change here is a set of additional token prediction modules that let the model predict several tokens per decoding step instead of one. aiOla reports a 50% increase in prediction speed and says Whisper-Medusa maintains the same level of accuracy as Whisper.
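Why predicting several tokens per step saves time can be illustrated with a toy decoder. The sketch below is not aiOla’s implementation: the function names, the example sentence, and the rule of keeping only drafted tokens that the base model agrees with are all assumptions made for illustration. It simply counts how many decoding steps are needed with and without extra drafting heads.

```python
from typing import Callable, List, Tuple

EOS = "<eos>"
# Toy transcript that the base model would produce one token at a time.
REFERENCE = "please check the valve pressure on line four".split() + [EOS]


def base_next_token(prefix: List[str]) -> str:
    """Stand-in for the base decoder: returns the next token for a prefix."""
    return REFERENCE[len(prefix)] if len(prefix) < len(REFERENCE) else EOS


def perfect_drafts(prefix: List[str], k: int) -> List[str]:
    """Hypothetical extra heads that always guess the next k tokens correctly."""
    return REFERENCE[len(prefix):len(prefix) + k]


def noisy_drafts(prefix: List[str], k: int) -> List[str]:
    """Hypothetical extra heads where only the first guess is ever right."""
    return (perfect_drafts(prefix, 1) + ["???"] * k)[:k]


def decode(
    next_token: Callable[[List[str]], str],
    draft: Callable[[List[str], int], List[str]],
    num_heads: int,
    max_len: int = 32,
) -> Tuple[List[str], int]:
    """Greedy decoding; returns the tokens and the number of decoding steps."""
    tokens: List[str] = []
    steps = 0
    while len(tokens) < max_len and (not tokens or tokens[-1] != EOS):
        steps += 1
        # The base model always contributes one token per step.
        tokens.append(next_token(tokens))
        if tokens[-1] == EOS:
            break
        # Extra heads draft further tokens; keep the prefix the base model
        # agrees with. A real system would check all drafts in one batched
        # pass; calling next_token per draft here is only for readability.
        for drafted in draft(tokens, num_heads):
            if len(tokens) >= max_len or drafted != next_token(tokens):
                break
            tokens.append(drafted)
            if drafted == EOS:
                break
    return tokens, steps


baseline, baseline_steps = decode(base_next_token, perfect_drafts, num_heads=0)
fast, fast_steps = decode(base_next_token, perfect_drafts, num_heads=2)
mixed, mixed_steps = decode(base_next_token, noisy_drafts, num_heads=2)
print(baseline_steps, fast_steps, mixed_steps)  # 9 3 5
```

All three runs produce the same transcript; only the step count changes. Because rejected drafts are discarded, the output matches what the base model would have produced alone, which is one way a speed-up can leave accuracy unchanged.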

The training process for Whisper-Medusa involved a machine-learning approach called weak supervision. aiOla froze the main components of Whisper and used audio transcriptions generated by the model as labels to train additional token prediction modules. Currently, Whisper-Medusa can predict ten tokens at a time, but aiOla plans to expand to a larger version capable of predicting 20 tokens at a time.
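The labeling step described above can be sketched in a few lines. This is a minimal illustration, not aiOla’s training code: `toy_frozen_transcribe`, the clip name, and the scheme of shifting the transcript by one extra position per head are hypothetical. The point is that the targets for the new prediction modules come from the frozen model’s own output rather than from human-written transcripts.

```python
from typing import Callable, Dict, List, Tuple

PAD = "<pad>"


def shifted_targets(labels: List[str], num_heads: int, pad: str = PAD) -> Dict[int, List[str]]:
    """Build one target sequence per extra head.

    Head k is asked to predict the token k positions further ahead than the
    base model's target, so its targets are the labels shifted left by k.
    """
    n = len(labels)
    return {
        k: [labels[t + k] if t + k < n else pad for t in range(n)]
        for k in range(1, num_heads + 1)
    }


def make_examples(
    clips: List[str],
    frozen_transcribe: Callable[[str], List[str]],
    num_heads: int,
) -> List[Tuple[str, Dict[int, List[str]]]]:
    """Pair each clip with head targets derived from machine-generated labels."""
    examples = []
    for clip in clips:
        # Weak supervision: the label is the frozen model's own transcript,
        # not a human-verified one.
        labels = frozen_transcribe(clip)
        examples.append((clip, shifted_targets(labels, num_heads)))
    return examples


# Hypothetical stand-in for the frozen base model.
FAKE_TRANSCRIPTS = {"clip_001.wav": ["open", "the", "main", "valve"]}


def toy_frozen_transcribe(clip: str) -> List[str]:
    return FAKE_TRANSCRIPTS[clip]


examples = make_examples(["clip_001.wav"], toy_frozen_transcribe, num_heads=2)
print(examples[0][1][1])  # ['the', 'main', 'valve', '<pad>']
print(examples[0][1][2])  # ['main', 'valve', '<pad>', '<pad>']
```

In an actual training loop, only the parameters of the new heads would be updated against these targets while the base model’s weights stay fixed.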

According to Gill Hetz, aiOla’s VP of research, improving the speed and latency of large language models (LLMs) is easier than doing the same for automatic speech recognition systems. The multi-head attention approach employed by aiOla delivers faster prediction while maintaining accuracy, and faster recognition and transcription could allow speech applications to respond in real time.

Whisper-Medusa has been tested on real enterprise data to confirm its accuracy in real-world scenarios. Hetz did not disclose whether any company has early access to the model, but he emphasized the benefits that real-time speech-to-text capabilities can bring to individuals and companies, including increased productivity, reduced operational costs, and faster content delivery.

In conclusion, aiOla’s Whisper-Medusa offers a significant improvement in speech recognition speed without compromising accuracy. By releasing the model as open source, aiOla encourages innovation and collaboration within the AI community. The advancements in speech recognition brought by Whisper-Medusa have the potential to enhance productivity, reduce costs, and deliver real-time responses in various industries.