After releasing its image generation model MAI-Image-2 on March 18, Microsoft followed on April 2 with two speech models, MAI-Transcribe-1 and MAI-Voice-1. Filling in both image and audio capabilities in such a short window is widely read as a significant step forward for its multimodal AI strategy. These three models are not scattered updates; they form a complete puzzle, from visual generation to speech understanding to speech output, and show that Microsoft is trying to build foundational AI capabilities that can be embedded directly into enterprise workflows.
Microsoft MAI-Image-2 targets commercial image generation
MAI-Image-2, launched by Microsoft on March 18, clearly emphasizes commercial use over purely creative generation. Compared with earlier image models that leaned toward entertainment or experimental purposes, MAI-Image-2 focuses on output consistency and semantic accuracy: it can keep composition consistent and detail complete even under complex prompts. This makes it better suited to scenarios such as brand marketing assets, product visuals, and advertising design.
For enterprises, the value of this kind of model lies not in whether it can generate stunning images, but in whether it can consistently produce “usable and controllable” content, and that is exactly the core MAI-Image-2 strengthens.
Clipto, watch out! Microsoft releases meeting transcription model MAI-Transcribe-1
MAI-Transcribe-1, released immediately afterward on April 2, focuses on speech understanding. Its positioning is clear: a foundation-layer technology that converts speech into structured text data. It can handle real-time speech input, maintain high recognition accuracy across multiple languages and accents, and withstand a fair amount of background noise.
These capabilities are especially critical in enterprise settings. Meeting transcripts, customer service call logs, and media content organization all depend on stable speech-to-text quality. Once audio data can be accurately converted into text, subsequent search, summarization, and analysis workflows can be fully automated; that is the key role MAI-Transcribe-1 plays within the overall AI architecture.
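To make the “structured text” idea concrete, here is a minimal sketch of that kind of automation step. The endpoint URL, request fields, and response schema are all assumptions for illustration; this article does not document an actual MAI-Transcribe-1 API.

```python
# Hypothetical sketch: post an audio file to a speech-to-text service and
# search the structured transcript it returns. The endpoint URL and the
# response schema are assumptions, not a documented MAI-Transcribe-1 API.
import requests

ENDPOINT = "https://example.invalid/mai-transcribe-1/transcribe"  # placeholder

with open("meeting.wav", "rb") as f:  # WAV/MP3/FLAC per the spec table below
    resp = requests.post(ENDPOINT, files={"audio": f}, data={"language": "auto"})
resp.raise_for_status()
# Assumed shape: {"segments": [{"start": 0.0, "end": 4.2, "text": "..."}]}
segments = resp.json()["segments"]

# Once speech is text, downstream search is ordinary string processing.
def search(segments, keyword):
    return [s for s in segments if keyword.lower() in s["text"].lower()]

for hit in search(segments, "budget"):
    print(f'{hit["start"]:7.1f}s  {hit["text"]}')
```

The point is the second half: after transcription, search, summarization, and analysis are ordinary text processing, which is why stable speech-to-text quality unlocks the rest of the workflow.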
MAI-Voice-1 brings AI voices to customer service and podcasts
On the output side sits MAI-Voice-1, which handles speech generation. The model’s focus is bringing AI-generated speech closer to human performance, including naturalness of tone, rhythm, and emotion. That makes it usable in scenarios such as customer service voices, AI assistants, video and audio dubbing, and even podcast production. Compared with the more mechanical speech synthesis of the past, MAI-Voice-1 emphasizes adjustable voice tone and style, so that voice is no longer just a tool for delivering information but an interface capable of communication and expression.
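The spec table below lists SSML as an accepted input, and SSML’s standard prosody and break elements are the usual way to express this kind of tone and rhythm control. A minimal sketch, assuming MAI-Voice-1 honors standard SSML markup (the voice id, endpoint, and payload fields here are illustrative):

```python
# Hypothetical sketch: requesting emotionally styled speech via SSML.
# The endpoint, voice id, and payload fields are assumptions; prosody
# and break are standard SSML elements for rate, pitch, and pausing.
import requests

ssml = """
<speak>
  <voice name="mai-voice-1-en">  <!-- illustrative voice id -->
    <prosody rate="95%" pitch="+2st">
      Thanks for calling! <break time="300ms"/>
      How can I help you today?
    </prosody>
  </voice>
</speak>
"""

resp = requests.post(
    "https://example.invalid/mai-voice-1/synthesize",  # placeholder URL
    json={"ssml": ssml, "format": "mp3"},
)
resp.raise_for_status()
with open("greeting.mp3", "wb") as out:
    out.write(resp.content)  # assumed: raw audio bytes in the response body
```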
A roundup of Microsoft’s three “see, hear, speak” AI models
Viewed in the same context, the three releases show that Microsoft’s rollout is not a single-point breakthrough but a rapid push toward multimodal integration. MAI-Image-2 handles visual generation, MAI-Transcribe-1 is responsible for speech understanding, and MAI-Voice-1 completes speech generation; together, they form the basic capability structure of “see, hear, speak.”
Once these capabilities are combined with existing language models and cloud services, they can form a complete AI workflow, with everything from data input to understanding, generation, and output carried out within the same system, as the sketch below illustrates.
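As a thought experiment, that “hear, understand, speak” loop reduces to plain glue code once each stage exposes an API. Every endpoint, field, and response shape below is a hypothetical placeholder, not a published interface:

```python
# Hypothetical sketch of an end-to-end "hear -> understand -> speak" loop.
# All URLs, fields, and response shapes are illustrative placeholders.
import requests

BASE = "https://example.invalid"  # placeholder service host

def transcribe(audio_path: str) -> str:
    with open(audio_path, "rb") as f:
        r = requests.post(f"{BASE}/mai-transcribe-1/transcribe", files={"audio": f})
    r.raise_for_status()
    return r.json()["text"]  # assumed response field

def summarize(text: str) -> str:
    # Stand-in for any language model; the prompt field is illustrative.
    r = requests.post(f"{BASE}/llm/complete",
                      json={"prompt": f"Summarize this meeting:\n{text}"})
    r.raise_for_status()
    return r.json()["completion"]

def speak(text: str, out_path: str) -> None:
    r = requests.post(f"{BASE}/mai-voice-1/synthesize", json={"text": text})
    r.raise_for_status()
    with open(out_path, "wb") as out:
        out.write(r.content)

# Input -> understanding -> generation -> output, all in one pipeline.
speak(summarize(transcribe("meeting.wav")), "summary.mp3")
```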
| Features | MAI-Transcribe-1 (speech to text) | MAI-Voice-1 (text to speech) | MAI-Image-2 (text to image) |
| --- | --- | --- | --- |
| Main functions | Converts speech into verbatim transcripts | Generates natural, fluent, and emotionally expressive speech | Generates images based on text descriptions |
| Release date | April 2, 2026 | April 2, 2026 | March 18, 2026 |
| Key technologies and features | High noise resistance, automatic language identification | Emotion control, voice cloning (Voice Prompting) | Diffusion-based architecture, high realism |
| Supported languages | English, Chinese, Spanish, and 25 other languages | Currently English only (to be expanded to 10+ languages) | Primarily text input (multilingual support not specified) |
| Pricing model | $0.36 per hour of audio | $22.00 per million characters | Varies by deployment platform (e.g., MAI Playground) |
| Input/output limits | Input: WAV, MP3, FLAC | Input: plain text or SSML | Output: up to 1024×1024 pixels |
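Using the listed prices, back-of-envelope cost math is straightforward (assuming the rates apply linearly with no volume tiers, which the table does not specify):

```python
# Back-of-envelope cost estimate from the pricing rows above.
# Assumes flat linear pricing with no volume tiers.
TRANSCRIBE_PER_HOUR = 0.36       # USD per hour of audio
VOICE_PER_MILLION_CHARS = 22.00  # USD per million characters

hours_of_meetings = 40           # e.g., one month of team meetings
chars_spoken = 250_000           # e.g., generated voice responses

cost = (hours_of_meetings * TRANSCRIBE_PER_HOUR
        + (chars_spoken / 1_000_000) * VOICE_PER_MILLION_CHARS)
print(f"${cost:.2f}")  # 40*0.36 + 0.25*22 = 14.40 + 5.50 = $19.90
```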