NVIDIA Releases Cosmos 3: A Two-Tower Mixture-of-Transformers Foundation Model Unifying Physical Reasoning, World Generation, and Action Generation

NVIDIA has released Cosmos 3, a foundation model built around a two-tower architecture that pairs an autoregressive vision-language model with a diffusion-based generative component. The goal is to unify three capabilities that have typically required separate systems: reasoning about physical environments, generating video of those environments, and producing action sequences suitable for robotic or autonomous systems.
The architecture uses a Mixture-of-Transformers design, which allows different parameter subsets to specialize for different tasks without requiring entirely separate model weights for each function. The autoregressive tower handles perception and language-grounded reasoning, while the diffusion tower generates coherent, physically plausible video. Connecting the two towers lets the model condition generation on structured reasoning rather than treating reasoning and generation as independent processes.
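To make the idea of specialized parameter subsets concrete, here is a minimal toy sketch of one Mixture-of-Transformers-style layer in plain Python. It is not NVIDIA's implementation, and nothing in it is taken from the Cosmos 3 code or weights. It assumes a common formulation in which every token is tagged with the tower it belongs to, each tower owns its own projection and feed-forward parameters, and a single attention step runs over all tokens so that information can flow between towers. The names `TOWER_PARAMS`, `mot_layer`, `"reasoning"`, and `"generation"` are hypothetical, and the scalar weights stand in for what would be full weight matrices in a real model.

```python
import math

# Hypothetical per-tower parameters. Real models use weight matrices;
# a single scalar per sub-layer keeps the routing logic easy to follow.
TOWER_PARAMS = {
    "reasoning": {"proj": 1.0, "ffn": 2.0},
    "generation": {"proj": 0.5, "ffn": -1.0},
}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mot_layer(tokens, towers):
    """Apply one toy layer: tower-specific projection, shared attention,
    tower-specific feed-forward with a residual connection.

    tokens: list of equal-length float vectors
    towers: list of tower names, one per token
    """
    dim = len(tokens[0])

    # 1. Each token is projected with the parameters of its own tower.
    projected = [
        [TOWER_PARAMS[tower]["proj"] * v for v in vec]
        for vec, tower in zip(tokens, towers)
    ]

    # 2. Attention is shared: every token attends to every token,
    #    regardless of which tower produced it.
    mixed = []
    for query in projected:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in projected
        ]
        weights = softmax(scores)
        mixed.append(
            [sum(w * vec[i] for w, vec in zip(weights, projected)) for i in range(dim)]
        )

    # 3. The feed-forward step (a scaled ReLU here) is again tower-specific.
    output = []
    for original, attended, tower in zip(tokens, mixed, towers):
        ffn = [max(0.0, TOWER_PARAMS[tower]["ffn"] * v) for v in attended]
        output.append([o + f for o, f in zip(original, ffn)])
    return output


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
towers = ["reasoning", "reasoning", "generation"]
result = mot_layer(tokens, towers)
```

The point of the sketch is the routing pattern: steps 1 and 3 look up parameters by tower, while step 2 operates on the full sequence. How Cosmos 3 actually divides parameters between its towers, and where the towers exchange information, is not specified in the announcement as summarized here.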
The release is aimed at physical AI research, a category that includes robotics, autonomous vehicles, and simulation environments where understanding cause and effect in the physical world matters as much as visual quality. Earlier Cosmos models from NVIDIA focused primarily on world simulation and video generation; Cosmos 3 extends that scope by integrating action generation, meaning the model can not only predict what a scene will look like but also suggest plausible agent behaviors within it.
NVIDIA is releasing the model as open weights, continuing the pattern set by earlier Cosmos releases. For researchers working on embodied AI or simulation-based training pipelines, a single model that can reason, generate, and act within a shared representational framework reduces the engineering overhead of connecting multiple specialized systems. How well the unified approach holds up against purpose-built models in each individual domain remains to be tested, but the architectural direction reflects a broader trend in the field toward tighter integration between language reasoning and generative video.

