May 22, 2026 · Video

Google Launches Gemini Omni for Video Generation and Editing

Google introduced Gemini Omni Flash at its I/O 2026 conference, positioning it as a model that can accept any combination of text, image, audio, or video input and produce edited or generated video as output. The conversational editing interface lets users refine results through back-and-forth prompts rather than rewriting full generation instructions from scratch, a workflow shift that brings video creation closer to how people interact with text-based LLMs.

Google's pitch for the model's realism rests partly on its training foundation. The company argues that grounding the model in broad factual knowledge (physics, history, cultural context) helps it handle things like fluid dynamics, lighting interaction, and gravity more convincingly than models trained on visual data alone. Whether that claim holds up at scale remains to be tested by users and benchmarks outside Google's own demos.

The model also supports user-defined visual language, meaning creators can specify a style, motion character, or effects palette and have it applied consistently across a generation. It also offers digital avatars that speak in the user's own voice, a feature framed partly as a safety mechanism to govern how a person's likeness is used in AI-generated content.

All output from Gemini Omni carries an embedded SynthID watermark, connecting it to the broader provenance infrastructure Google has been building. The model is rolling out initially to AI Ultra subscribers, with broader access expected to follow.

Read at Google Blog →

