Google Strikes Back: Veo 2, Imagen 3 & Gemini 2.0

Google delivers SOTA AI models: Veo 2 for AI video generation, Imagen 3 for image generation, & Gemini 2.0 Experimental models – Flash, Advanced and Flash Thinking. Plus Whisk and Gemini-based apps

Dec 20, 2024

Figure 1. Gemini 2.0 scales new heights. AI image generated by Imagen 3 on Google’s ImageFX.

Introduction

Google really delivers.

After two years of playing catch-up to OpenAI, Google has made waves in the AI world over the past two weeks with its new Gemini 2.0 models and related tool releases. This is not limited to one area: across the board, Google now has leading AI models along with a steady stream of innovative, useful products and features. Let’s dive in.

Veo 2 Elevates AI Video Generation

Google’s just-released AI video generation model, Veo 2, sets a new SOTA standard with astounding quality, better physical realism and motion handling, and more refined user control.

Realism: Veo 2 generates videos with greater realism and fewer visual artifacts than its predecessor. It recreates real-world physics and human movement and expression more convincingly than other AI models, and it produces smoother, more natural motion, addressing the sometimes glitchy movement seen in earlier models.

Figure 2. Still from a video of flamingos. Veo 2 is particularly good at photo-realistic nature videos, as if it were sharing a Planet Earth clip.

Resolution and control: Veo 2 supports video generation at resolutions up to 4K, a significant differentiator from other AI models, including Sora (1080p), and it can generate longer videos. Users have greater control over camera angles, movements, and shot styles, which lets Veo 2 create cinematic, photo-realistic videos with consistent characters:

Veo 2 understands the unique language of cinematography: ask it for a genre, specify a lens, suggest cinematic effects and Veo 2 will deliver — at resolutions up to 4K, and extended to minutes in length.

Reviews on social media have been glowing, with users praising Veo 2 for its visual fidelity, realism, accurate representation of motion, and ability to follow detailed prompts:

This came out of nowhere from Google. I knew Veo 2 would be fantastic, but I was not expecting anything like this at all in terms of quality. The physics in these videos are awesome. It's miles ahead of other video tools. – Jerrod Lew

Benchmarks also indicate Veo 2 is SOTA. In head-to-head comparisons, Veo 2 strongly outperformed competitors like OpenAI's Sora Turbo and Kling 1.5 on both overall preference and prompt adherence.

Veo 2 is not fully released yet, but you can join the waitlist for early access through Google Labs FX Kitchen.

Figure 3. Veo 2 demo video, a photo-realistic video of a beekeeper tending beehives.

Imagen 3

Google has come a long way in image generation with its updated Imagen 3 model, which generates stunning high-resolution images with enhanced detail and quality. This week, Google announced an improved Imagen 3, with better-composed images, more diverse art styles, better text rendering, improved color balance, and more.

Specifically, Imagen 3 now produces brighter, more vibrant images with improved color balance and fidelity. The model can render a wider range of art styles with greater accuracy, from photorealism to impressionism, abstract art, and anime. Text rendering has also been enhanced. Finally, Imagen 3 can be prompted to evoke specific photographer styles, camera angles, lighting, and fine textures, resulting in more visually compelling images.

As a result, Imagen 3 can create high-resolution images with exceptional clarity and fine detail. As our cover image (Figure 1) shows, Imagen 3’s photorealistic AI images look like professional photography. Imagen 3 achieved state-of-the-art results in human-rated comparisons and on the DrawBench benchmark against other image generation models, winning on visual quality, prompt accuracy, and appeal.

Imagen 3 is now available on ImageFX, and developers can access it via an API on Google’s Vertex AI.

Whisk for Image Remixing

A neat companion to the image generation model is Whisk, a tool for remixing images. Provide a subject image, add scene and style images, and Whisk will blend them into a new image.

Figure 4. A café on a wintry mountain rendered in Japanese art / anime style, using Google’s Whisk to mix multiple images.

Gemini-Exp-1206

When Google announced Gemini 2.0 last week, it released only Gemini 2.0 Flash Experimental, leaving open the question of when a stronger model might arrive. Just a week later, on Dec 17th, the other shoe dropped with the announcement that “2.0 Experimental Advanced” was available on Gemini Advanced:

Gemini Advanced subscribers can try out gemini-exp-1206, our latest experimental model. Significantly improved performance on coding, math, reasoning, instruction following + more.

Gemini 2.0 Experimental Advanced currently holds the top position on the Chatbot Arena leaderboard, showing significantly improved performance on complex tasks compared to previous Gemini models and slightly better performance than Gemini 2.0 Flash Experimental. Whatever it is named, it seems to be akin to a Gemini 2.0 Pro: a stronger model than Gemini 2.0 Flash but with similar multimodal capabilities.

Gemini 2.0 Experimental Advanced has real strengths on complex reasoning. It is particularly adept at handling coding challenges and solving mathematical problems; it can follow complex, multi-step instructions and plan; it has long context understanding and supports multimodal reasoning. It also supports function-calling and native tool use, making it suitable for agentic AI applications.
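To make the function-calling idea concrete, here is a minimal sketch of the application side of the loop: the model emits a structured request naming a tool and its arguments, and your code runs the tool and hands the result back. Everything here is illustrative. The get_weather tool, the TOOLS registry, and the dispatch helper are hypothetical names, and the call shape (a name plus an args object) is an assumption modeled loosely on how such calls are commonly represented, not a statement of Google's exact API.

```python
import json
from typing import Any, Callable, Dict


def get_weather(city: str) -> Dict[str, Any]:
    """Hypothetical local tool; returns canned data instead of calling a real service."""
    return {"city": city, "forecast": "sunny", "high_c": 21}


# Registry mapping the tool names a model may request to local Python callables.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {"get_weather": get_weather}


def dispatch(function_call: Dict[str, Any]) -> Dict[str, Any]:
    """Run the tool named in a model-issued call and wrap the result for the next turn."""
    name = function_call["name"]
    args = function_call.get("args", {})
    if name not in TOOLS:
        # Report unknown tools back to the model rather than raising.
        return {"name": name, "response": {"error": f"unknown tool: {name}"}}
    return {"name": name, "response": TOOLS[name](**args)}


# Simulated model output (assumed shape); a real application would read this
# from the model's reply instead of hard-coding it.
simulated_call = {"name": "get_weather", "args": {"city": "Zurich"}}
result = dispatch(simulated_call)
print(json.dumps(result))
```

In an agentic application, the wrapped result would be sent back to the model as the next message so it can continue planning or produce a final answer.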

Gemini 2.0 Flash Thinking Experimental

Just when you thought it was over... we’re introducing Gemini 2.0 Flash Thinking, a new experimental model that unlocks stronger reasoning capabilities and shows its thoughts. - Logan Kilpatrick, Google

Figure 5. Gemini 2.0 Flash Experimental, Experimental 1206, and Flash Thinking are all available to try out on Google’s AI Studio.

Google’s ‘one more thing’ this week was Gemini 2.0 Flash Thinking, a reasoning model that uses test-time compute to enhance reasoning. Gemini 2.0 Flash Thinking is based on 2.0 Flash and explicitly shows its thoughts, unlike OpenAI’s o1 model. Showing reasoning traces is valuable, both for understanding how the model reasons (perhaps useful in learning situations) and for spotting and correcting errors.

The Gemini 2.0 Flash Thinking model is dramatically faster than o1, thanks to a faster underlying model. Alex Volkov of ThursdAI shows a SimpleBench result where Flash Thinking closely matches o1’s results with an order of magnitude less latency. Deedy Das on X says:

Google really cooked with Gemini 2.0 Flash Thinking. It thinks AND it's fast AND it's high quality. Not only is it #1 on LMArena on every category, but it crushes my goto Math riddle in 14s—5x faster than any other model that can solve it!

Building it on the multimodal 2.0 Flash enables it to reason on images and more. Overall, it’s really good, but when I tested it on a chess puzzle by uploading a graphic of a chess position, it flubbed the puzzle because it didn’t recognize the board position accurately.

More benchmarking is needed to tease out how good Gemini 2.0 Flash Thinking is as a reasoning engine, but the Chatbot Arena scores show Gemini 2.0 Flash Thinking besting o1-preview and o1-mini on Hard Prompts. The advantages of speed, multimodality and reasoning traces make it even more helpful.

Gemini-powered AI Apps

With Gemini 2.0, Google is launching a more aggressive effort to build useful agentic applications. Google touted several of them in their announcement: Project Astra, Project Mariner, and Jules. We mentioned these in our prior weekly. While these are in beta, other Gemini-based AI tools are available to use now.

Deep Research, based on Gemini 1.5 Pro, is available on Gemini Advanced. Give it a prompt asking it to research a topic, and it will scour dozens of websites, then analyze and summarize what it finds into a research report. Deep Research is now my go-to research gofer.

Google’s powerful and popular NotebookLM document-to-podcast tool got enhanced. Last week, Google announced that NotebookLM is getting a new look, audio interactivity, and a premium version called NotebookLM Plus.

Google’s AI Studio is a test kitchen for applications using Gemini AI models. It hosts several AI models and AI applications to try out:

Stream Realtime lets you run a multimodal live AI interaction, where the AI can see your screen or camera, like Project Astra, and answer questions about it.
LearnLM 1.5 Pro Experimental is a tutoring AI tool, for guided learning or homework help with an AI.
Video Analyzer lets you get AI help to caption, make a transcript, summarize, analyze or repurpose a video. This is extremely helpful for content creators but has many other use cases as well.

Conclusion

Google’s releases have upstaged OpenAI’s 12 Days of releases by delivering the most powerful AI models available to use now.

Consumers can try out the latest AI models on Gemini Advanced (paid subscription), which offers Gemini 2.0 Experimental Advanced, Gemini 2.0 Flash Experimental, and Deep Research.

The free option to try out the experimental AI models is Google’s AI Studio, which hosts Gemini 2.0 Experimental Advanced, Flash Experimental, and Flash Thinking.

For developers, API access to the Gemini 2.0 experimental AI models via Google’s AI Studio is free for now (albeit rate-limited). Google is playing the long game: they want people to use their products and are making it as easy as possible.
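For a sense of what that API access can look like, here is a minimal sketch that builds (but does not send) a text-generation request using only the Python standard library. It assumes the public Gemini REST endpoint and request body shape as I understand them, uses the gemini-exp-1206 model name mentioned above, and expects an API key created in AI Studio; build_request and send are hypothetical helper names, and the endpoint and payload should be checked against Google's current documentation before relying on them.

```python
import json
import urllib.request

# Assumed base URL for the Gemini REST API; verify against the official docs.
API_ROOT = "https://generativelanguage.googleapis.com/v1beta"


def build_request(prompt: str, model: str = "gemini-exp-1206",
                  api_key: str = "YOUR_API_KEY") -> urllib.request.Request:
    """Build a POST request asking the given model to respond to a text prompt."""
    url = f"{API_ROOT}/models/{model}:generateContent?key={api_key}"
    # Assumed payload shape: a list of contents, each holding text parts.
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and return the parsed JSON reply (requires a valid key)."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


# Build a request with a placeholder key and inspect it without touching the network.
req = build_request("Summarize this week's AI news in one sentence.")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Calling send(req) with a real key would perform the request; because the free tier is rate-limited, a real client would also want to handle rate-limit errors by waiting and retrying.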

As of the final writing of this article, OpenAI has just finished its 12 Days of releases, announcing a preview of o3, its next and most powerful reasoning AI model. However, it’s only a preview. For AI models you can use right now, Google offers the best.

AI Changes Everything
