gen‑ai.news

NVIDIA Releases Cosmos 3: A Two-Tower Mixture-of-Transformers Foundation Model Unifying Physical Reasoning, World Generation, and Action Generation

NVIDIA has released Cosmos 3, a foundation model built around a two-tower architecture that pairs an autoregressive vision-language model with a diffusion-based generative component. The goal is to unify three capabilities that have typically required separate systems: reasoning about physical environments, generating video of those environments, and producing action sequences suitable for robotic or autonomous systems.

The architecture uses a Mixture-of-Transformers design, which lets different parameter subsets specialize for different tasks without requiring entirely separate model weights for each function. The autoregressive tower handles perception and language-grounded reasoning, while the diffusion tower generates coherent, physically plausible video. Connecting the two towers lets the model condition generation on structured reasoning rather than treating reasoning and generation as independent processes.
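To make the idea of specialized parameter subsets concrete, here is a toy sketch of one common reading of a Mixture-of-Transformers layer: all tokens share a single attention step, then each token passes through the feed-forward parameters of its own tower. This is an illustration under stated assumptions, not NVIDIA's implementation. The function names, the tower labels "reasoning" and "generation", and the scale-and-bias stand-in for a feed-forward network are all hypothetical.

```python
import math

# Hypothetical per-tower parameters: a (scale, bias) pair stands in for
# each tower's feed-forward weights. Real models use full weight matrices.
TOWER_PARAMS = {
    "reasoning": (2.0, 0.0),
    "generation": (0.5, 1.0),
}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shared_attention(vectors):
    """Dot-product self-attention in which every token attends to every
    token, regardless of which tower it belongs to."""
    outputs = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) for key in vectors]
        weights = softmax(scores)
        mixed = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(len(query))
        ]
        outputs.append(mixed)
    return outputs


def tower_ffn(vector, tower):
    """Apply the parameters that belong to the token's own tower."""
    scale, bias = TOWER_PARAMS[tower]
    return [scale * x + bias for x in vector]


def mot_layer(vectors, towers):
    """One toy layer: shared attention, then tower-specific processing."""
    attended = shared_attention(vectors)
    return [tower_ffn(v, t) for v, t in zip(attended, towers)]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0]]
    print(mot_layer(tokens, ["reasoning", "generation"]))
```

The shared attention step is what lets generation tokens see the reasoning tokens, while the per-tower parameters keep each function's weights specialized.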

The release is aimed at physical AI research, a category that includes robotics, autonomous vehicles, and simulation environments where understanding cause and effect in the physical world matters as much as visual quality. Earlier Cosmos models from NVIDIA focused primarily on world simulation and video generation; Cosmos 3 extends that scope by integrating action generation, meaning the model can not only predict what a scene looks like but also suggest plausible agent behaviors within it.

NVIDIA is releasing the model as open weights, continuing the pattern set by earlier Cosmos releases. For researchers working on embodied AI or simulation-based training pipelines, a single model that can reason, generate, and act within a shared representational framework reduces the engineering overhead of connecting multiple specialized systems. How well the unified approach holds up against purpose-built models in each individual domain remains to be tested, but the architectural direction reflects a broader trend in the field toward tighter integration between language reasoning and generative video.

