In a recent podcast interview, Google DeepMind CEO Demis Hassabis revealed plans to eventually combine the company’s Gemini AI models with its Veo video-generating models to improve Gemini’s understanding of the physical world.
According to Hassabis, Gemini was designed to be multimodal from its inception, with the goal of creating a “universal digital assistant” capable of assisting users in real-world scenarios. “We’ve always built Gemini, our foundation model, to be multimodal from the beginning,” Hassabis explained, “And the reason we did that [is because] we have a vision for this idea of a universal digital assistant, an assistant that […] actually helps you in the real world.”
The AI industry is witnessing a shift towards “omni” models that can process and generate multiple forms of media, such as audio, images, and text. Google’s latest Gemini models can produce audio, images, and text, while OpenAI’s ChatGPT can create images, including Studio Ghibli-style art. Amazon has also announced plans to launch an “any-to-any” model later this year, further illustrating this trend.
These omni models require vast amounts of training data, including images, videos, audio, and text. Hassabis indicated that Veo’s video data is primarily sourced from YouTube, a platform owned by Google. “Basically, by watching YouTube videos — a lot of YouTube videos — [Veo 2] can figure out, you know, the physics of the world,” Hassabis stated. Google had previously informed TechCrunch that its models “may be” trained on “some” YouTube content in accordance with its agreement with YouTube creators.
Notably, Google broadened its terms of service last year, in part so it could tap more data, including YouTube content, to train its AI models. The change lets the company bolster its AI capabilities by drawing on its vast repository of online content.