Google Launches Gemini Omni for Video Generation and Editing

Google introduced Gemini Omni Flash at its I/O 2026 conference, positioning it as a model that can accept any combination of text, image, audio, or video input and produce edited or generated video as output. The conversational editing interface lets users refine results through back-and-forth prompts rather than rewriting full generation instructions from scratch, a workflow shift that brings video creation closer to how people interact with text-based LLMs.
Google's pitch for the model's realism rests partly on its training foundation. The company argues that grounding the model in broad factual knowledge - physics, history, cultural context - helps it handle things like fluid dynamics, lighting interaction, and gravity more convincingly than models trained on visual data alone. Whether that claim holds up at scale remains to be tested by users and benchmarks outside Google's own demos.
The model also supports user-defined visual language, meaning creators can specify a style, motion character, or effects palette and have it applied consistently across a generation. It also includes digital avatars that speak in the user's own voice, a feature framed partly as a safety mechanism for governing how a person's likeness is used in AI-generated content.
All output from Gemini Omni carries an embedded SynthID watermark, connecting it to the broader provenance infrastructure Google has been building. The model is rolling out initially to AI Ultra subscribers, with broader access expected to follow.


