French AI company Mistral has launched an open-source text-to-speech model named Voxtral TTS, designed for voice AI assistants and enterprise applications such as customer support. The launch puts Mistral in direct competition with companies including ElevenLabs, Deepgram, and OpenAI.
Voxtral TTS supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. Mistral built it in response to customer demand for a flexible speech model that can run on a range of edge devices, with the aim of keeping costs low without sacrificing performance.
Pierre Stock, VP of science operations at Mistral AI, said, “Our customers have been asking for a speech model. So we built a small-sized speech model that can fit on a smartwatch, a smartphone, a laptop, or other edge devices.” He emphasized that while the model is competitively priced, it delivers state-of-the-art performance.
The model can adapt to a custom voice from an audio sample of less than five seconds, capturing subtle characteristics such as accents and speech irregularities. Voxtral TTS, which is based on Ministral 3B, can also switch languages without losing voice quality, making it suitable for real-time translation and dubbing.
The model’s latency figures are notable. For a 10-second sample of 500 characters, it has a time-to-first-audio (TTFA) of 90 milliseconds and a real-time factor (RTF) of 6x, meaning it can render that 10-second clip in approximately 1.6 seconds.
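The arithmetic behind those figures can be sketched in a few lines of Python. This is only an illustration: it assumes that "6x" means audio is generated six times faster than it plays back, and the function and variable names are hypothetical rather than part of any Mistral API.

```python
def render_time_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock time needed to synthesize a clip.

    Assumes `speedup` is audio duration divided by processing time,
    so a value of 6.0 corresponds to the "6x" figure quoted above.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


# Figures quoted in the article.
clip_seconds = 10.0    # length of the generated audio
clip_characters = 500  # length of the input text
speedup = 6.0          # "RTF of 6x" (assumed to be a speedup factor)
ttfa_seconds = 0.090   # time-to-first-audio

total = render_time_seconds(clip_seconds, speedup)
chars_per_second = clip_characters / total

print(f"Full clip rendered in about {total:.2f} s")  # 1.67
print(f"Throughput of roughly {chars_per_second:.0f} characters per second")
print(f"Playback can begin after about {ttfa_seconds * 1000:.0f} ms")
```

Under that assumption the full clip takes about 1.67 seconds to generate, in line with the roughly 1.6 seconds quoted. The TTFA figure matters separately: in a streaming setup, a listener would start hearing audio after about 90 milliseconds rather than waiting for the whole clip.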
The launch follows Mistral’s release earlier this year of two transcription models, one aimed at large batch processing and the other at low-latency, real-time use cases. Voxtral TTS is part of Mistral’s strategy to offer enterprises a comprehensive suite of voice products.
Stock outlined future plans, stating, “We plan to have an end-to-end platform that can handle multimodal streams of input, including audio, text, and image.” The platform is intended to give the systems it is integrated into richer information to work with.