AI Week in Review 25.03.29
GPT-4o native image generation, Gemini 2.5 Pro, DeepSeek V3 0324, Qwen QVQ-Max, Qwen2.5-Omni-7B, Qwen2.5-VL-32B, Reve Image 1.0, Ideogram 3.0, ARC-AGI-2 benchmark, OpenAI adds MCP support.

Top Tools
This week’s most viral release is OpenAI’s native image generation in ChatGPT with GPT-4o, which lets users generate detailed, accurate images directly from text prompts. Because the capability is native to the multimodal GPT-4o, the model interprets prompts more faithfully and produces detailed, life-like imagery from natural-language edits. It also supports highly accurate refining and restyling of images through conversation with GPT-4o, while maintaining character consistency.
This new capability has excited users and sparked widespread experimentation online, with next-level faithful text rendering a standout: users have generated images of Wikipedia pages, websites and UIs with correct text, branding visuals, cartoons, and memes, spanning hyper-realistic renders to distinct artistic styles such as Studio Ghibli.
Users on X have had fun using GPT-4o native image generation to faithfully re-render real photos in Ghibli style, or in other looks such as Lego, producing charming family portraits and adding to the feature’s virality.

OpenAI CEO Sam Altman said "GPUs are melting" due to high demand for image generation in ChatGPT. The feature is available to Pro and Plus users, but it is quite slow, taking one to two minutes per image.
OpenAI has made other improvements to GPT-4o recently, claiming these updates make GPT-4o “more intuitive, creative, and collaborative, with enhanced instruction-following, smarter coding capabilities, and a clearer communication style.”
AI Tech and Product Releases
Google released Gemini 2.5 Pro Experimental, reclaiming the top position on the Chatbot Arena leaderboard with a multimodal model that integrates enhanced reasoning and a 1 million-token context window. Benchmark results show significant gains across academic and human-evaluation metrics, including a SOTA 18% on Humanity’s Last Exam. We shared more details on Gemini 2.5 Pro in “Gemini 2.5 Pro & DeepSeek V3 - Best Models Yet.”
DeepSeek released DeepSeek-V3 0324, an update to its 685B-parameter Mixture-of-Experts V3 base model that improves reasoning and coding. The updated model matches SOTA models like GPT-4.5 and Claude 3.7 Sonnet on benchmarks such as GPQA, AIME, and LiveCodeBench. DeepSeek-V3 0324 is open source, released under the MIT License, with weights available on HuggingFace.
Qwen introduced Qwen2.5-Omni-7B, a multimodal voice and video chat model capable of processing text, image, audio, and video inputs and generating both text and natural-sounding speech outputs. Despite its relatively small size, Qwen2.5-Omni-7B demonstrates robust multimodal understanding and audio generation, making it a great voice chat model you can run locally, even in Colab; a sketch follows below.
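For the curious, here is a minimal sketch of generating a spoken reply, following the pattern on the HuggingFace model card at release; the Qwen2_5OmniModel and Qwen2_5OmniProcessor class names and the qwen-omni-utils helper are assumptions based on that card and may differ in later transformers versions.

```python
# Minimal sketch per the Qwen2.5-Omni-7B model card (API may have changed since).
import soundfile as sf
from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info  # pip install qwen-omni-utils

model = Qwen2_5OmniModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto")
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversation = [
    {"role": "user",
     "content": [{"type": "text", "text": "Introduce yourself out loud."}]},
]
text = processor.apply_chat_template(
    conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
inputs = processor(text=text, audios=audios, images=images, videos=videos,
                   return_tensors="pt", padding=True).to(model.device)

# Per the model card, generate() returns text token ids plus a speech waveform.
text_ids, audio = model.generate(**inputs, use_audio_in_video=False)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```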
The Qwen team released QVQ-Max, an AI visual reasoning model based on Qwen-Max:
QVQ-Max is a visual reasoning model that possesses both “vision” and “intellect.” It doesn’t just recognize the content in images; it combines this information to analyze, reason, and even complete creative tasks.
QVQ-Max is useful for parsing complex charts, solving spatial physics and geometry problems, and handling many everyday tasks that require visual reasoning or image comprehension.
The Qwen team also released Qwen2.5-VL-32B, a “smarter and lighter” Vision Language Model (VLM) focused on fine-grained visual understanding combined with reasoning and better alignment:
“Qwen2.5-VL-32B has focused on optimizing subjective experience and mathematical reasoning through reinforcement learning.”
Qwen2.5-VL-32B posts superior performance on visual reasoning (scoring 70 on MMMU) as well as textual reasoning (68 on MMLU-Pro), outperforming recent strong models such as Mistral Small 3.1 24B and Gemma 3 27B across benchmarks.
Reve launched Reve Image 1.0, a new diffusion-based AI image generation model that claims state-of-the-art performance in prompt adherence, hyper-precise detail, realism, and text generation. Reve Image 1.0 is currently ranked #1 for "image generation quality" by Artificial Analysis with an Arena ELO of 1144, edging out contenders Recraft V3, Imagen 3, and Flux 1.1 Pro. Travis Davids enthuses on X:
Aesthetics are excellent! Solid prompt adherence as well. Censorship is also quite relaxed. The results can look very cinematic!
The Reve Image 1.0 model is available through Reve’s chat interface with a free preview for now, later moving to pay-per-image.

Ideogram launched version 3.0 of its AI image generation model with an impressive demo video, emphasizing significant improvements in text rendering, logo design, and overall image realism. Ideogram 3.0 also boasts SOTA performance in human evaluations, and it introduces a “Style References” feature that allows users to upload images to guide the generated aesthetic.

The ARC Prize team has introduced the new ARC-AGI-2 benchmark to raise the bar on AGI testing, evaluating AI on tasks that are straightforward for humans but challenging for AI models.
ARC-AGI-2 is even harder for AI (in particular, AI reasoning systems), while maintaining the same relative ease for humans. Pure LLMs score 0% on ARC-AGI-2, and public AI reasoning systems achieve only single-digit percentage scores.
Are you smarter than an AI? Try some of the ARC-AGI-2 puzzles and find out.
OpenAI announced support for the Model Context Protocol (MCP) in its Agents SDK, with plans to extend this support to the ChatGPT desktop app and the Responses API. As more servers are developed, MCP is gaining further momentum: Microsoft launched a Playwright-MCP server, and Weights & Biases released an MCP server integrated with its Weave platform to enhance LLM observability and evaluation.
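Connecting an agent to an MCP server via the Agents SDK looks roughly like this; a minimal sketch assuming the Python openai-agents package and the reference filesystem MCP server, following the SDK’s documented pattern:

```python
# Minimal sketch: an Agents SDK agent using tools from an MCP server.
# Assumes `pip install openai-agents` and Node.js for the reference server.
import asyncio
from agents import Agent, Runner
from agents.mcp import MCPServerStdio

async def main():
    # Launch the reference filesystem MCP server over stdio.
    async with MCPServerStdio(
        params={"command": "npx",
                "args": ["-y", "@modelcontextprotocol/server-filesystem", "./docs"]}
    ) as fs_server:
        agent = Agent(
            name="FileAssistant",
            instructions="Use the filesystem tools to answer questions.",
            mcp_servers=[fs_server],  # the agent can now call the server's tools
        )
        result = await Runner.run(agent, "List the files in the docs folder.")
        print(result.final_output)

asyncio.run(main())
```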
Prince Canuma released version 0.0.3 of MLX-Audio, an audio package for speech synthesis on Apple Silicon devices, built on Apple's MLX framework. The update supports multiple text-to-speech models, including Kokoro, Sesame 1B, Orpheus, and Suno Bark, enabling high-quality speech and sound effect generation.
Observe.AI launched VoiceAI agents to automate customer interactions in contact centers. The VoiceAI agents can handle a range of customer inquiries and integrate with existing systems, aiming to improve customer experience and reduce costs compared to human agents.
Groq and PlayAI partnered to launch an advanced text-to-speech model, Dialog, on Groq's platform. Dialog offers natural-sounding speech and low latency, is available in English and Arabic, and outperforms existing systems in user-preference tests.
OtterAI is launching an AI Meeting Agent that can participate in meetings and verbally respond to queries; it is also launching an autonomous Sales Development Representative agent.
Microsoft is introducing AI-powered "deep research" tools, Researcher and Analyst, to Microsoft 365 Copilot. The Researcher tool uses OpenAI's Deep Research model with advanced search, while Analyst is built on OpenAI's o3-mini model for data analysis.
AI Research News
Anthropic shares conclusions from LLM interpretability studies in a blog post “Tracing the thoughts of a large language model,” including some intriguing findings:
Claude sometimes thinks in a conceptual space that is shared between languages.
Claude will plan what it will say many words ahead and write to get to that destination.
The blog post is based on a detailed Anthropic paper, On the Biology of a Large Language Model.
The Qwen team presented technical details on their multimodal Qwen2.5-Omni model in the Qwen2.5-Omni Technical Report. One innovation is the “Thinker-Talker” architecture to concurrently generate text and speech. In this framework, Thinker is an LLM tasked with text generation, while Talker is a dual-track autoregressive model that uses hidden representations from the Thinker to produce audio output. This and other innovations in Qwen2.5-Omni yield excellent performance in speech instruction following and speech generation.
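To make the division of labor concrete, here is a toy sketch of the described data flow. The module shapes, layer counts, and vocabulary sizes are illustrative assumptions, not Qwen’s actual implementation, and the real Talker decodes audio tokens autoregressively in two tracks:

```python
# Toy sketch of the Thinker-Talker data flow described in the report.
# All shapes and sizes are illustrative, not Qwen's actual architecture.
import torch
import torch.nn as nn

class Thinker(nn.Module):
    """Stand-in LLM: returns text logits plus hidden states for the Talker."""
    def __init__(self, vocab=32000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, input_ids):
        hidden = self.backbone(self.embed(input_ids))
        return self.lm_head(hidden), hidden  # text logits + hidden representations

class Talker(nn.Module):
    """Stand-in audio-token model conditioned on the Thinker's hidden states."""
    def __init__(self, audio_vocab=4096, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.audio_head = nn.Linear(dim, audio_vocab)

    def forward(self, thinker_hidden):
        return self.audio_head(self.proj(thinker_hidden))  # audio-token logits

thinker, talker = Thinker(), Talker()
ids = torch.randint(0, 32000, (1, 16))   # a dummy prompt
text_logits, hidden = thinker(ids)        # Thinker: text generation
audio_logits = talker(hidden)             # Talker: speech from hidden states
# A codec decoder (vocoder) would turn sampled audio tokens into a waveform.
```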
AI Business and Policy
Elon Musk announced the merger of social media company X and AI company xAI:
@xAI has acquired @X in an all-stock transaction. The combination values xAI at $80 billion and X at $33 billion ($45B less $12B debt). … xAI and X’s futures are intertwined. Today, we officially take the step to combine the data, models, compute, distribution, and talent. This combination will unlock immense potential by blending xAI’s advanced AI capability and expertise with X’s massive reach.
Nvidia is testing fully AI-generated ads with partners like Unilever to cut ad production costs, while Meta and Google are betting on AI video generation to serve endlessly personalized video ads. Brands will upload creative assets and demographic targeting information to generate custom ads.
AI Opinions and Articles
Joanne Jang, who leads model behavior at OpenAI, shares OpenAI’s thinking on setting policy for new AI capabilities, explaining why OpenAI’s new image generation is less censored and buttoned-down than before:
We’re shifting from blanket refusals in sensitive areas to a more precise approach focused on preventing real-world harm. The goal is to embrace humility: recognizing how much we don't know and positioning ourselves to adapt as we learn.
They mention “seeing risks clearly but not losing sight of everyday value to users.” However, I would translate that as: OpenAI sees less-censored competition and doesn’t want to lose its user base to alternatives.
In “The Unbelievable Scale of AI’s Pirated-Books Problem,” The Atlantic detailed the massive trove of copyrighted books, journal articles, and other text used to train Meta’s Llama models and many other LLMs.