AI Week in Review 25.05.03

Qwen3, Phi-4-reasoning, Meta AI app, Llama API, Claude Integrations, Amazon Nova Premier, Olmo 2 1B, FutureHouse AI scientist, Gemini image editing, Kling Instant Film Effect, DeepSeek Prover V2.

May 03, 2025

Figure 1. Gadgetify on X shows off how the Kling Instant Film Effects feature works, turning photos into Pixar-style animations.

Top Tools

Alibaba's Qwen team launched Qwen 3, an open-source family of hybrid reasoning models, featuring four dense models (from 0.6B to 32B parameters) and two Mixture-of-Experts (MoE) architecture models. As we reported previously, Qwen 3 models are SOTA for their size, especially on math, coding, and reasoning tasks. The flagship model, Qwen3-235B-A22B, an MoE model with 235B total and 22B active parameters, outperforms DeepSeek R1 and other frontier models on benchmarks. Qwen 3 models support 119 languages, offer up to 128K context length, and allow users to switch "thinking mode" on for reasoning or off for faster responses.

This week, the Qwen team also released Qwen2.5-Omni-3B, a lightweight multimodal AI model designed to run on consumer hardware while maintaining performance across text, audio, image, and video.

Microsoft introduced new models in the Phi-4 series: Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning. The 14B-parameter Phi-4-reasoning, its 14B variant Phi-4-reasoning-plus, and the smaller Phi-4-mini-reasoning are small, open-weight AI reasoning models designed for complex reasoning tasks in math, science, coding, and logic. These open source (MIT license) models perform as well as DeepSeek R1 and o3-mini on certain benchmarks, yet they can be run locally.

Figure 2. Phi-4-reasoning matches DeepSeek R1 on math benchmarks despite being much smaller.

Both the smaller Qwen 3 models and Phi-4 reasoning models are high-performing AI models you can run locally. Caveat: They can overthink. I asked Phi-4-reasoning to write a sonnet about songbirds, and it chewed through thousands of tokens of reasoning, only to forget the original request and produce a mediocre result.

AI Tech and Product Releases

Meta held their first LlamaCon this week, and while they didn’t announce any major AI models, they announced a Llama API, a new Meta AI app, and open source AI protection tools at the event:

Meta launched a standalone Meta AI app powered by its Llama 4 model. The Meta AI app, available on iOS, web, and Ray-Ban Meta glasses, features a voice conversation interface, image generation, and a “Discover” feed for sharing personal AI interactions.
Meta will make Llama models available via an API. Meta partnered with Cerebras Systems to power the new Llama API with significantly faster AI inference speeds.
Meta released AI guardrails models and tools, including Llama Guard 4 and Llama Prompt Guard 2.
Meta CEO Mark Zuckerberg also indicated on the Dwarkesh Patel podcast that a Llama 4 reasoning model is coming soon.

Anthropic has launched Claude Integrations, connecting Claude to tools and apps (as MCP servers) and expanding tasks Claude can accomplish. The release already includes integrations with Atlassian’s Jira and Confluence, Zapier, Cloudflare, Intercom, Asana, Square, Sentry, PayPal, Linear, and Plaid.

When you connect your tools to Claude, it gains deep context about your work—understanding project histories, task statuses, and organizational knowledge—and can take actions across every surface.

Anthropic also announced Advanced Research, with expanded search in combination with Integrations. These features, along with increased rate limits for Claude Code, are in beta release for higher subscriber tiers.

Amazon has released Nova Premier, its largest and most capable AI model in the Nova family, on the Bedrock platform. It supports processing text, images, and videos with a 1 million token context window. While excelling at knowledge retrieval and visual understanding, it is not a reasoning model and falls behind frontier AI reasoning models on coding and math benchmarks.

OpenAI updated ChatGPT's web search capabilities to improve online shopping experiences. The enhancements include personalized product recommendations, images, reviews, and direct purchase links, aiming to streamline the shopping process for users.

The Allen Institute for AI (Ai2) released Olmo 2 1B, a small AI model available under an Apache 2.0 license that reportedly outperforms similarly sized models from Google, Meta, and Alibaba on benchmarks.

FutureHouse, an Eric Schmidt-backed non-profit AI lab with a mission to build an AI Scientist, has launched an AI platform aimed at accelerating scientific discovery. The platform launches with four AI agents: a general-purpose agent (Crow), a literature review agent (Falcon), a prior art search agent (Owl), and an experiment planning agent (Phoenix).

Google Gemini is rolling out native image editing, allowing users to modify AI-generated and uploaded images. The feature adds invisible watermarks to AI-created or AI-edited images. Google is also expanding access to its AI Mode in Search, removing the waitlist in the U.S. and testing an AI Mode tab.

Kling AI has introduced Instant Film Effects, a feature that converts photos into Pixar-style short video animations in a “3D Polaroid Style,” enabling users to turn family photos and pet pictures into personal animated videos.

Figure 3. Still from a Kling AI Instant Film Effect feature output.

Salesforce AI Research is addressing the issue of AI's 'jagged intelligence' in enterprise settings by introducing new benchmarks, models, and frameworks. These include the SIMPLE dataset and CRMArena for measuring consistency, SFR-Embedding for contextual understanding, xLAM V2 for action prediction, and SFR-Guard for AI safety.

OpenAI’s “GlazeGate” Incident: OpenAI rolled back a recent GPT-4o update following reports of “AI sycophancy” in the model. The update, intended to enhance personality, instead made GPT-4o excessively flattering and agreeable to the point of validating harmful ideas, prompting user and expert criticism. OpenAI acknowledged the unintended behavior, referred to as "GlazeGate," and promised additional fixes to improve the model's personality and response accuracy.

AI Research News

DeepSeek released DeepSeek-Prover-V2, an open-source LLM for formal theorem proving. The 671B-parameter DeepSeek-Prover-V2 is trained from DeepSeek-V3-Base with reinforcement learning (RL) to combine informal and formal mathematical reasoning in a unified model that writes proofs in the formal language Lean 4. DeepSeek shared technical details on the model and training process in a preprint, “DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition.”
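
For readers who have not seen Lean 4, here is a toy illustration of what "subgoal decomposition" can look like in a formal proof. It is our own minimal sketch, not taken from the DeepSeek paper or from the model's output: the theorem name and statement are hypothetical, and it relies only on `Nat.add_comm` from Lean's core library. Each `have` line states and proves a small intermediate fact, and the last line combines them.

```lean
-- Hypothetical toy theorem, chosen only to illustrate the proof style.
theorem sum_reorder (a b c : Nat) : a + b + c = c + (b + a) := by
  -- Subgoal 1: swap the first two summands.
  have h1 : a + b = b + a := Nat.add_comm a b
  -- Subgoal 2: move c to the front of the sum.
  have h2 : b + a + c = c + (b + a) := Nat.add_comm (b + a) c
  -- Combine: rewrite the goal with both facts; it then closes by reflexivity.
  rw [h1, h2]
```

The appeal of this style for an AI prover is that each subgoal can be attempted and checked independently by the Lean compiler.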

Carnegie Mellon researchers have proposed an open-source interoperability protocol for AI agents called LOKA. Presented in “LOKA Protocol: A Decentralized Framework for Trustworthy and Ethical AI Agent Ecosystems,” LOKA is a layered architecture that governs AI agents' identity, accountability, and ethics, facilitating secure communication, ethical reasoning, and compliance across diverse AI systems.

AI Business and Policy

Apple is partnering with Anthropic on a new "vibe-coding" software platform that will use AI to write, edit, and test code on behalf of programmers.

Amazon's AI-powered Alexa+ has rolled out to over 100,000 users, according to CEO Andy Jassy. The upgraded digital assistant aims for more natural conversations and agentic abilities but is missing some key features introduced in February. “We have a lot more functionality that we plan to add in coming months,” Jassy acknowledged on Amazon's earnings call.

Reddit's AI-powered chatbot, Answers, has gained 1 million weekly active users since its beta launch in December. The chatbot provides answers and summaries from Reddit posts. Reddit plans deeper integration of Answers into the platform.

Google's AdSense is reportedly placing ads within user chats on third-party AI chatbots, a move tested with iAsk and Liner. This strategy aims to capitalize on the increasing use of AI chatbots for search and information, potentially mitigating the impact on Google's traditional search business.

Court documents have revealed that Meta forecasts between $2 billion and $3 billion in generative AI revenue in 2025, potentially reaching $1.4 trillion by 2035. The documents also indicated Meta's significant investment in AI product groups, with a GenAI budget exceeding $900 million in 2024 and expected to surpass $1 billion this year.

Microsoft has warned of potential AI service disruptions in the current quarter, as data center demand is outstripping its ability to bring new data centers online. This is despite plans to invest $80 billion in data centers this year.

Experiments suggest that Anthropic's Claude models can be 20–30% more expensive than GPT models in enterprise settings, primarily due to Anthropic's tokenizer breaking down input into more tokens.
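
To see why tokenizer differences translate directly into cost, here is a back-of-the-envelope sketch. All numbers are hypothetical placeholders, not actual Anthropic or OpenAI prices or measured token counts, and the function names are ours. It assumes the per-token price is held equal, so the only difference is how many tokens the same text becomes.

```python
def input_cost(num_tokens: int, price_per_million: float) -> float:
    """Dollar cost of an input of `num_tokens` tokens at a per-million-token price."""
    return num_tokens * price_per_million / 1_000_000


def cost_increase(baseline_tokens: int, token_overhead: float, price_per_million: float) -> float:
    """Fractional cost increase when a tokenizer emits `token_overhead` more tokens
    for the same text, with the per-token price held equal (an assumption)."""
    baseline = input_cost(baseline_tokens, price_per_million)
    inflated_tokens = round(baseline_tokens * (1 + token_overhead))
    inflated = input_cost(inflated_tokens, price_per_million)
    return inflated / baseline - 1


# Hypothetical inputs: 1,000,000 baseline tokens, 25% more tokens, $3 per million tokens.
increase = cost_increase(1_000_000, 0.25, 3.0)
print(f"Cost increase: {increase:.0%}")
```

With equal per-token prices, the cost increase simply tracks the token overhead, which is why a more verbose tokenizer can make a model more expensive in practice even when list prices look comparable.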

AI infrastructure startup Astronomer has secured $93 million in funding to expand its AI-powered data orchestration platform, Astro, which is built on Apache Airflow and automates complex data workflows in AI deployments.

Structify has exited stealth with $4.1 million in seed funding to automate data preparation for AI using its DoRa model. The company's platform aims to gather, clean, and structure unstructured web data, addressing a bottleneck for data scientists.

Microsoft is preparing to host xAI’s Grok AI model on its Azure AI Foundry service, making it available to customers and internal teams.

AI is eating the internet: The Wikimedia Foundation is integrating generative AI to assist human editors with tedious tasks on Wikipedia, such as research, translation, and onboarding.

A California bill could disrupt OpenAI’s transition to a for-profit structure, and OpenAI is suggesting possible links between the bill's supporters and Elon Musk. The company questioned whether the Coalition for AI Nonprofit Integrity (CANI), which supports the bill, is coordinating with Musk, who has previously filed lawsuits against OpenAI.

AI Opinions and Articles

In a recent Stratechery interview, Meta CEO Mark Zuckerberg envisioned a future where AI handles all aspects of advertising, from creative generation to targeting and measurement. Low-cost “infinite creativity” from AI could potentially eliminate the need for ad agencies. It’s unlikely AI will fully replace ad agencies, but if AI can displace many jobs, AI could also disrupt whole businesses and industries, including the ad industry.

Computer scientist Stuart Russell has some AI predictions:

1) Scaling up LLMs alone won’t lead to AGI.
2) Big AI labs already realize this and are exploring new methods; AI is likely to surpass humans within a decade.
3) Governments probably won’t act on AI safety until a major incident.
4) Best case: a disaster like Chernobyl will force action. Worst case: We lose control permanently.

A Look Back

OpenAI has retired GPT-4 in ChatGPT, replacing it with GPT-4o as of May 1, 2025. While GPT-4 is retired from the standard interface, it remains accessible via API. It’s the end of an era.

When GPT-4 launched in March 2023, it kicked generative AI progress into high gear, set off hype and AI safety panic, put OpenAI at the forefront of AI progress, and set the bar on frontier AI models. GPT-4 has been hugely important in accelerating the generative AI revolution.

GPT-4 also inspired this “AI Changes Everything” Substack. One of our first articles, “Boom! GPT-4 has arrived!”, heralded GPT-4’s release. Our most read article is “GPT-4 System Prompt Revealed.” GPT-4 is retired, but its influence is still felt.

AI Changes Everything