Xiaomi has announced an update to its MiMo voice AI platform with the launch of the MiMo-V2.5-TTS series and MiMo-V2.5-ASR. The company describes the new lineup as a full-link voice model system designed for the agent era, covering both speech output and speech input.

The launch follows Xiaomi’s MiMo-V2-TTS model introduced in March, which focused on detailed control over tone, emotion, and speaking style.

The Xiaomi MiMo-V2.5-TTS lineup includes three separate models and is available for a limited time at no cost through Xiaomi’s MiMo Open Platform.

The base MiMo-V2.5-TTS model includes preset voices and supports adjustments for speech rate, tone, and emotion.

MiMo-V2.5-TTS-VoiceDesign allows users to create entirely new voice timbres using a short input sentence.

MiMo-V2.5-TTS-VoiceClone is designed to reproduce a specific voice using a small number of samples while maintaining consistency across different speaking styles and instructions.

Xiaomi said the models can interpret natural language instructions instead of requiring structured parameters.

Users can describe how a voice should sound in plain language, similar to directing a voice actor. The system also supports layered script-style inputs for use cases such as game characters and audio dramas, allowing separate control of character traits, scenes, and dialogue.

Inline audio tags are also supported, letting users adjust emotion or delivery at specific points within a sentence. These tags can be mixed in the same text and are said to work in both Chinese and English.

Xiaomi is also releasing MiMo-V2.5-ASR as an open-source speech recognition model.

The company said the model is designed for real-world scenarios such as bilingual conversations, regional dialects, and noisy environments.

Supported Chinese dialects include Wu, Cantonese, Minnan, and Sichuanese. The model can switch between Chinese and English without preset language tags. It can also recognize song lyrics even when vocals are mixed with music.

For meetings and multi-speaker environments, the system is designed to transcribe overlapping conversations with speaker separation.

Xiaomi said the model can maintain accuracy in high-noise settings and with far-field audio capture.

MiMo-V2.5-ASR also includes built-in phonetics and context-based punctuation, reducing the need for post-processing.

Xiaomi said the model delivers state-of-the-art or near-state-of-the-art results on benchmarks covering bilingual recognition, dialect processing, and code-switching tasks.

The TTS models are available through Xiaomi’s platform and can be tested in MiMo Studio. The ASR model is available with open-source weights and code for direct use or customization.
