gen‑ai.news

MiniMax Releases MiniMax M3 with MSA Architecture Supporting 1M-Token Context, Native Multimodality, and Agentic Coding

MiniMax has released M3, a new foundation model whose core technical contribution is the MiniMax Sparse Attention (MSA) architecture. MSA is designed to cut the computational cost of long-context processing. It allows the model to work over context windows of up to one million tokens without the quadratic scaling that standard attention runs into at that length. Because dense attention compares every token with every other token, doubling the context roughly quadruples the attention computation. For tasks involving long documents, extended conversations, or large codebases, avoiding that blow-up could be a meaningful practical improvement.
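To make the scaling argument concrete, the short sketch below counts query-key score computations per attention layer. It compares dense attention with a generic sparse scheme in which each query attends to at most `k` keys. This is not a description of MSA, whose internals the announcement does not detail. The sparse pattern and the value `k = 4096` are illustrative assumptions, and the function names are hypothetical.

```python
# Rough count of query-key score computations in one attention layer.
# Dense attention scores every query against every key: n * n pairs.
# The sparse variant is a generic, hypothetical scheme (e.g. a sliding
# window or top-k selection) where each query scores at most k keys.
# It is NOT MiniMax's MSA; k is an illustrative assumption.

def dense_scores(n: int) -> int:
    """Number of query-key pairs scored by full (dense) attention."""
    return n * n


def sparse_scores(n: int, k: int) -> int:
    """Pairs scored when each query attends to at most k keys."""
    return n * min(k, n)


if __name__ == "__main__":
    for n in (8_000, 128_000, 1_000_000):
        d = dense_scores(n)
        s = sparse_scores(n, k=4_096)
        print(f"n={n:>9,}  dense={d:.2e}  sparse(k=4096)={s:.2e}  ratio={d / s:,.0f}x")
```

Going from a 128K to a 1M context is about 7.8 times more tokens. Under dense attention that means roughly 61 times more score computations, while the fixed-`k` sparse count grows only linearly with `n`.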

The model is natively multimodal, meaning image and video understanding are built into the architecture rather than added through separate modules or adapter layers. This approach is generally expected to produce more coherent cross-modal reasoning than systems where vision capabilities are bolted on after the fact. MiniMax M3 also supports computer use, the ability to interpret and interact with graphical interfaces. This capability has become a point of competition among frontier model developers over the past year.

Agentic coding is another stated focus of M3. This goes beyond standard code completion: the model takes multi-step actions within a development environment, running code, reading the output, and iterating toward a goal. The combination of a long context window and agentic coding support makes M3 particularly relevant for software development workflows where a model needs to keep large amounts of code or documentation in view at once.
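The act, observe, iterate cycle described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not M3's actual agent interface. `propose_next_attempt` is a hypothetical stand-in for a model call that just walks through a fixed list of candidate programs. `run_code` executes each candidate in a Python subprocess, and the loop stops once the program's output matches the goal.

```python
import subprocess
import sys
from typing import Optional, Tuple

# Hypothetical candidates a model might propose; the first has a bug.
CANDIDATES = [
    "print(sum(range(10))",   # syntax error: missing closing parenthesis
    "print(sum(range(10)))",  # corrected version
]


def propose_next_attempt(step: int, last_output: str) -> str:
    """Hypothetical model call: return the next program to try.

    A real agent would condition on last_output (e.g. an error trace);
    this stub ignores it and walks through CANDIDATES in order.
    """
    return CANDIDATES[min(step, len(CANDIDATES) - 1)]


def run_code(source: str) -> Tuple[int, str]:
    """Run a Python snippet in a subprocess; return (exit code, stdout + stderr)."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True, text=True, timeout=10,
    )
    return result.returncode, result.stdout + result.stderr


def agent_loop(goal_output: str, max_steps: int = 5) -> Optional[Tuple[int, str]]:
    """Act, observe, iterate until the output matches goal_output or steps run out."""
    last_output = ""
    for step in range(max_steps):
        source = propose_next_attempt(step, last_output)  # act
        code, last_output = run_code(source)              # observe
        if code == 0 and last_output.strip() == goal_output:
            return step + 1, source                       # goal reached
    return None                                           # gave up


if __name__ == "__main__":
    print(agent_loop(goal_output="45"))
```

In this toy run, the first attempt fails with a syntax error and the loop feeds that result back. The second attempt prints `45` and ends the loop after two steps. A long context window matters here because a real agent would carry every prior attempt and its output forward into the next model call.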

MiniMax is a Chinese AI company best known for its image and video generation models, including the Hailuo video generation system, and it has also released earlier language models. M3 continues the company's push into general-purpose language and reasoning capabilities alongside its media generation work. The release adds another capable long-context model to a space that includes offerings from Anthropic, Google, and others. The MSA architecture is worth watching as a potential approach to making million-token context practical at inference time rather than just theoretically supported.