Microsoft readies new MAI voice and image models for Build 2026

Microsoft is reportedly preparing to unveil several new models at its Build 2026 developer conference, all grouped under the MAI branding the company has been building out in recent months. The lineup includes MAI-Image-2.5, a generative image model; MAI-Transcribe-1.5, a speech-to-text model; and MAI-Voice-2, a voice model described as supporting multiple languages.
The MAI line represents Microsoft's effort to develop AI models in-house rather than relying entirely on third-party providers, including its close partner OpenAI. Earlier MAI models were introduced quietly through Azure AI Foundry and Microsoft's API offerings, positioning them as practical tools for enterprise developers rather than consumer-facing products.
A multilingual voice model is particularly notable given the competitive landscape in speech synthesis and real-time translation. Accurate, natural-sounding voice generation across languages remains a difficult problem, and enterprise demand for such capabilities in products like Teams, Copilot, and customer service tooling is significant. MAI-Transcribe-1.5 likewise fits into Microsoft's broader push to improve real-time and asynchronous transcription across its productivity suite.
Build 2026 is shaping up to be a dense event for AI announcements, and the MAI model family will likely be positioned as part of Microsoft's Azure AI platform strategy. Developers using Azure AI Foundry would be a primary audience, and the models could become available through standard API access shortly after the event. Whether the image model competes directly with offerings like DALL-E or takes a different approach, such as focusing on editing or enterprise document workflows, will not be known until the official unveiling.

