Voice is becoming the interface, and Microsoft just shipped two models built to power it. As part of the seven-model MAI family announced at Build 2026, MAI-Voice-2 handles expressive text-to-speech across 15 languages, and MAI-Transcribe-1.5 handles speech-to-text across 43 languages. Together they form a complete audio loop: listen, understand, and respond in a natural voice.
The headline numbers are striking. MAI-Transcribe-1.5 can transcribe an hour of audio in under 15 seconds at a 2.4% Word Error Rate, and MAI-Voice-2 produces speech that listeners preferred over its predecessor 72% of the time, with voice cloning from as little as five seconds of reference audio. For anyone building call centers, audiobooks, accessibility tools, or voice agents, this is a serious production stack.
This guide covers both models in depth: capabilities, benchmarks, keyword biasing, consent guardrails, and how to build voice agents on Microsoft Foundry. For the full MAI lineup, see our Microsoft MAI models developer guide.
1. MAI-Voice-2: Expressive Text-to-Speech
MAI-Voice-2 is Microsoft's most expressive text-to-speech model to date, and a significant jump from MAI-Voice-1 across the dimensions that matter for production voice: fidelity, language coverage, speaker consistency, and emotional range. It expands from English-only to 15 languages, and Microsoft says the new languages keep the naturalness and expressiveness of the English voices.
The standout features for developers:
- Granular emotion control via emotion tags such as sad, whispered, or excited, plus role styles like motivational trainer or sports commentator
- Stable speaker identity across long-form content, so a single voice holds up across an entire audiobook, podcast, or lecture
- Code-switching for select language pairs such as Hindi-English and Spanish-English, switching mid-sentence without losing prosody or speaker identity
- Strong preference results - MAI-Voice-2 was preferred over MAI-Voice-1 72.1% of the time across roughly 2,500 listening tests
Near-human quality
In speaker-similarity tests, Microsoft reports MAI-Voice-2 output is difficult to distinguish from recordings of the same voice. Across 11 languages and roughly 2,222 responses, 45.5% of listeners preferred the generated speech, 44% preferred the real human recording, and 10.5% called it a tie, effectively a coin flip between synthetic and human.
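To see why a 45.5% versus 44% split reads as a coin flip, here is a rough significance check on the quoted figures. It is only a sketch: it assumes each response is an independent vote and drops the ties, which is a simplification of however Microsoft actually ran the study.

```python
import math

# Figures quoted in the text above (vendor-reported).
TOTAL_RESPONSES = 2222
PREFER_SYNTHETIC = 0.455
PREFER_HUMAN = 0.44


def preference_z_score(total, share_a, share_b):
    """z-score for the hypothesis 'A and B are equally preferred', ignoring ties."""
    decided = total * (share_a + share_b)        # responses that picked a side
    observed = share_a / (share_a + share_b)     # A's share of the decided votes
    standard_error = math.sqrt(0.25 / decided)   # SE of a proportion when p = 0.5
    return (observed - 0.5) / standard_error


z = preference_z_score(TOTAL_RESPONSES, PREFER_SYNTHETIC, PREFER_HUMAN)
# |z| below 1.96 means the gap is not significant at the usual 5% level.
print(f"z = {z:.2f}, significant: {abs(z) > 1.96}")
```

Under those assumptions the gap between synthetic and human sits well inside ordinary sampling noise, which is consistent with the "difficult to distinguish" claim.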
2. Voice Cloning & Consent Guardrails
MAI-Voice-2 supports zero-shot voice prompting: with just 5 to 60 seconds of reference audio, developers can create a custom voice in Microsoft Foundry across all supported languages, with no retraining or fine-tuning required. That makes it practical for companies to bring a consistent brand voice into their products without maintaining a separate voice model.
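Before uploading a reference clip, it is cheap to check locally that it falls inside the 5 to 60 second window. The sketch below uses Python's standard `wave` module and assumes the clip is an uncompressed PCM WAV file; the function names are ours, and Foundry may apply its own format and quality checks on top.

```python
import wave

MIN_REFERENCE_SECONDS = 5.0    # lower bound quoted in the text
MAX_REFERENCE_SECONDS = 60.0   # upper bound quoted in the text


def wav_duration_seconds(source):
    """Duration of a PCM WAV file, given a path string or a binary file object."""
    with wave.open(source, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_usable_reference_clip(duration_seconds):
    """True when the clip length falls inside the quoted 5 to 60 second window."""
    return MIN_REFERENCE_SECONDS <= duration_seconds <= MAX_REFERENCE_SECONDS
```

A typical call would be `is_usable_reference_clip(wav_duration_seconds("reference.wav"))`, where the file name is a placeholder.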
Microsoft pairs this capability with strict consent controls. Consent is enforced at the system level, so only authorized, licensed voices can be synthesized in production, and Microsoft states no unlicensed voice cloning is possible. Access to the voice-creation feature requires an application.
Plan for consent in your design
Voice cloning carries real legal and ethical risk. Microsoft's system-level consent enforcement helps, but your application still needs a clear process for capturing and storing voice-owner consent, and for documenting which voices are authorized. Build that into your product flow from the start, not as an afterthought.
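One way to make consent a first-class part of the product flow is a small registry that every synthesis call has to pass through. This is a minimal in-memory sketch, not Microsoft's mechanism: `ConsentRecord`, `ConsentRegistry`, `evidence_uri`, and `require_consent` are hypothetical names, and a real system would persist records and keep an audit log.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass(frozen=True)
class ConsentRecord:
    voice_id: str                          # your own identifier for the custom voice
    owner_name: str                        # person whose voice was recorded
    granted_at: datetime
    expires_at: Optional[datetime] = None  # None means no expiry was agreed
    evidence_uri: str = ""                 # hypothetical pointer to the signed consent


class ConsentRegistry:
    """Tracks which voices are currently authorized for synthesis."""

    def __init__(self):
        self._records: Dict[str, ConsentRecord] = {}

    def grant(self, record: ConsentRecord) -> None:
        self._records[record.voice_id] = record

    def revoke(self, voice_id: str) -> None:
        self._records.pop(voice_id, None)

    def is_authorized(self, voice_id: str, now: Optional[datetime] = None) -> bool:
        record = self._records.get(voice_id)
        if record is None:
            return False
        now = now or datetime.now(timezone.utc)
        return record.expires_at is None or now < record.expires_at


def require_consent(registry: ConsentRegistry, voice_id: str) -> None:
    """Call this before any synthesis request; raises if consent is missing."""
    if not registry.is_authorized(voice_id):
        raise PermissionError(f"No valid consent on file for voice '{voice_id}'")
```

Keeping the check in one function makes it hard to add a new synthesis path that forgets it, and revocation takes effect on the next call.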
3. MAI-Transcribe-1.5: Speech-to-Text
MAI-Transcribe-1.5 is Microsoft's most accurate multilingual speech-to-text model, with best-in-class Word Error Rate across 43 languages on the FLEURS benchmark. It expanded coverage from 25 to 43 languages without sacrificing accuracy, and Microsoft positions it as the fastest, most efficient, and most cost-effective transcription model among the hyperscalers.
| Metric | MAI-Transcribe-1.5 | MAI-Transcribe-1 |
|---|---|---|
| Languages | 43 | 25 |
| Overall WER (Artificial Analysis) | 2.4% | 2.6% |
| FLEURS ranking | #1 | #1 |
| Speed (1 hour of audio) | Under 15 seconds | ~53 seconds |
The speed jump is the headline: at under 15 seconds per hour of audio, Microsoft reports that MAI-Transcribe-1.5 is up to five times faster on long audio than models like Gemini 3.1, Scribe v2, and GPT-4o-Transcribe. The model is also optimized for messy real-world conditions such as noisy backgrounds, which is exactly where many transcription models fall down.
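The table's speed figures are easier to compare as a real-time factor, meaning seconds of audio processed per second of wall-clock time. The arithmetic below uses only the numbers quoted above; because the newer figure is "under 15 seconds", its results are lower bounds.

```python
AUDIO_SECONDS = 60 * 60          # one hour of audio

NEW_PROCESSING_SECONDS = 15      # MAI-Transcribe-1.5: "under 15 seconds"
OLD_PROCESSING_SECONDS = 53      # MAI-Transcribe-1: "~53 seconds"


def real_time_factor(audio_seconds, processing_seconds):
    """Seconds of audio transcribed per second of processing time."""
    return audio_seconds / processing_seconds


new_rtf = real_time_factor(AUDIO_SECONDS, NEW_PROCESSING_SECONDS)
old_rtf = real_time_factor(AUDIO_SECONDS, OLD_PROCESSING_SECONDS)
generation_speedup = OLD_PROCESSING_SECONDS / NEW_PROCESSING_SECONDS

print(f"MAI-Transcribe-1.5: at least {new_rtf:.0f}x real time")
print(f"MAI-Transcribe-1:   about {old_rtf:.1f}x real time")
print(f"Generation-over-generation speedup: at least {generation_speedup:.1f}x")
```

On these figures the new model runs at 240x real time or better, roughly 3.5 times faster than its predecessor.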
4. Keyword Biasing for Domain Vocabulary
The feature most likely to matter in production is keyword biasing. Generic transcription models routinely mangle the words that matter most: people and product names, medical terms, internal acronyms, and customer-specific vocabulary. MAI-Transcribe-1.5 lets you supply a list of domain-specific keywords and biases its predictions toward them.
Crucially, the model does not blindly force matches. It uses the surrounding context to decide when biasing should apply, so it improves recognition of specialized vocabulary while keeping accuracy on general speech. Microsoft reports a 30% reduction in Word Error Rate on the FLEURS multilingual benchmark when keyword biasing is used. In its example, names like Aoife, Xochitl, and Soren that a baseline model garbled were transcribed correctly once supplied as keywords.
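To measure the effect of keyword biasing on your own recordings, compute Word Error Rate before and after. The sketch below is the standard word-level edit-distance definition of WER. The sample sentences are invented for illustration and are not Microsoft's transcripts, and reading the 30% figure as a relative reduction is our assumption.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


def relative_reduction(before, after):
    """Fractional drop in WER, e.g. 0.10 -> 0.07 is a 30% relative reduction."""
    return (before - after) / before


reference = "Aoife called Xochitl and Soren"
baseline = "Eva called so chill and Soren"       # invented garbled output
biased = "Aoife called Xochitl and Soren"        # invented corrected output

print(word_error_rate(reference, baseline), word_error_rate(reference, biased))
```

Run it over a held-out set of real calls with and without your keyword list, so the comparison reflects your vocabulary rather than a benchmark's.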
```http
# Transcription with keyword biasing (illustrative)
POST https://<your-foundry-endpoint>/transcribe
Authorization: Bearer <FOUNDRY_API_KEY>
Content-Type: application/json

{
  "model": "mai-transcribe-1.5",
  "audio_url": "https://.../call-recording.wav",
  "language": "en",
  "keywords": [
    "Aoife", "Xochitl", "Soren", "Niamh",
    "MAI-Transcribe", "Foundry", "CPT-99457"
  ]
}
```

Microsoft also flagged what is coming next: speaker diarization (who said what in multi-speaker audio), a native streaming API for real-time transcription, and continued language expansion. The streaming API in particular will matter for live voice agents, since the current model is batch-first.
5. Building a Voice Agent Loop
The two models are designed to compose. A typical voice agent loop chains transcription, reasoning, and speech synthesis, with a language model in the middle that you can swap as needed. Microsoft's own DuoAI demo shows this exact pattern, combining MAI-Transcribe-1.5, MAI-Voice-2, and MAI-Image-2.5 into a multi-agent conversation.
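The loop itself is small once the three stages are treated as swappable functions. This is a structural sketch only: `run_turn` and the stub functions are hypothetical stand-ins, and in a real agent the transcriber and synthesizer would wrap your Foundry calls while the reasoner wraps whichever language model you choose.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]                      # (role, text)
Transcriber = Callable[[bytes], str]           # audio in, text out
Reasoner = Callable[[List[Message]], str]      # conversation in, reply text out
Synthesizer = Callable[[str], bytes]           # text in, audio out


def run_turn(audio_in: bytes, history: List[Message], transcribe: Transcriber,
             reason: Reasoner, synthesize: Synthesizer) -> bytes:
    """One listen, understand, respond cycle; appends both sides to history."""
    user_text = transcribe(audio_in)
    history.append(("user", user_text))
    reply_text = reason(history)
    history.append(("assistant", reply_text))
    return synthesize(reply_text)


# Stubs so the loop can be exercised without any network calls.
def stub_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")               # pretend the audio is already text


def stub_reason(history: List[Message]) -> str:
    return f"You said: {history[-1][1]}"


def stub_synthesize(text: str) -> bytes:
    return text.encode("utf-8")                # pretend the text is already audio


conversation: List[Message] = []
reply_audio = run_turn(b"hello", conversation, stub_transcribe, stub_reason, stub_synthesize)
```

Passing the stages in as arguments is what makes the middle model swappable, and it lets you test the loop end to end before any speech service is wired up.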
Until the native streaming transcription API ships, design real-time agents around the batch-first model with short audio chunks, and budget for the latency that introduces. For the orchestration and guardrail side of agent design, see our multi-agent orchestration patterns guide.
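Chunking for a batch-first model mostly comes down to choosing span boundaries. The sketch below splits a recording into fixed-length chunks with a small overlap so words at a boundary are not cut in half; the 5 second chunk and 0.5 second overlap are illustrative defaults, not Microsoft recommendations.

```python
def chunk_spans(total_seconds, chunk_seconds=5.0, overlap_seconds=0.5):
    """Return (start, end) times in seconds covering the audio with overlapping chunks."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap_seconds must be in [0, chunk_seconds)")

    spans = []
    step = chunk_seconds - overlap_seconds     # how far each new chunk advances
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:               # last chunk reached the end of the audio
            break
        start += step
    return spans


print(chunk_spans(12.0))
```

Shorter chunks lower the wait before the first transcript arrives but add per-request overhead and give the model less context, so the chunk length is the main knob in the latency budget. Overlapping regions will be transcribed twice, so the merge step needs to drop duplicated words.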
6. Access & What's Next
Both models are available in Microsoft Foundry and the MAI Playground. MAI-Voice-2 is also being integrated into VS Code and the Dynamics 365 Contact Center, while MAI-Transcribe-1.5 is being integrated into Copilot, Teams, GitHub, and the Dynamics 365 Contact Center. A MAI-Voice-2-Flash variant is coming for lower-cost, ultra-efficient synthesis.
- MAI-Voice-2 model card and Foundry docs for the TTS API and cookbook
- MAI-Transcribe-1.5 model card and Foundry docs for the transcription API, including keyword biasing
- DuoAI on MAI Playground to try both models in a live multi-agent conversation
7. Why Lushbinary for Voice AI
Production voice systems are deceptively hard. Latency budgets, consent management, multilingual edge cases, and graceful failure handling all have to work together. Lushbinary builds voice and speech applications end-to-end, from contact-center automation to accessibility tools and branded assistants.
- Voice agent development - full transcribe, reason, and speak loops built on MAI or a routed mix of models
- Consent & compliance - voice-cloning consent workflows and audit trails that hold up to scrutiny
- Multilingual deployment - tuning for accuracy and naturalness across the languages your users actually speak
- Foundry & Azure integration - secure, scalable deployment with monitoring and cost controls
Free Consultation
Building a voice agent, contact-center assistant, or accessibility feature? Lushbinary will scope your use case, recommend the right speech models, and map a realistic build plan, no obligation.
8. Frequently Asked Questions
What languages does MAI-Voice-2 support?
MAI-Voice-2 supports 15 languages and a range of locales, including English (US and Australia), Italian, French, German, Hindi, Spanish (Spain and Mexico), Portuguese (Brazil and Portugal), Korean, Chinese (Simplified), Turkish, Russian, Thai, Dutch, Romanian, and Hungarian. It also supports code-switching for pairs like Hindi-English and Spanish-English.
Can MAI-Voice-2 clone a voice?
Yes. MAI-Voice-2 supports zero-shot voice prompting from 5 to 60 seconds of reference audio across all supported languages, with built-in consent guardrails. Consent is enforced at the system level so only authorized, licensed voices can be synthesized in production; no unlicensed voice cloning is possible.
How accurate and fast is MAI-Transcribe-1.5?
MAI-Transcribe-1.5 achieves a 2.4% Word Error Rate on the Artificial Analysis leaderboard (ranked #3) and best-in-class WER across 43 languages on FLEURS. It can transcribe an hour of audio in under 15 seconds, up to five times faster on long audio than competing models, and Microsoft reports that keyword biasing can reduce WER by up to 30%.
How many languages does MAI-Transcribe-1.5 cover?
MAI-Transcribe-1.5 covers 43 languages, up from 25 in the previous version, and Microsoft says it expanded coverage by 18 languages without compromising accuracy. It maintains best-in-class Word Error Rate on the FLEURS multilingual benchmark.
Where can I use MAI-Voice-2 and MAI-Transcribe-1.5?
Both are available in Microsoft Foundry and the MAI Playground. MAI-Voice-2 is also integrating into VS Code and the Dynamics 365 Contact Center, while MAI-Transcribe-1.5 is being integrated into Copilot, Teams, GitHub, and Dynamics 365 Contact Center. You can experiment with both in the DuoAI demo on MAI Playground.
Sources
- Microsoft AI - Introducing MAI-Voice-2
- Microsoft AI - Introducing MAI-Transcribe-1.5
- Microsoft AI - Building a hill-climbing machine
Content was rephrased for compliance with licensing restrictions. Language coverage, accuracy, speed, and preference figures sourced from official Microsoft AI announcements as of June 2, 2026. All figures are vendor-reported and may change - always verify on Microsoft's website.

