AI Week in Review 25.04.19
o3 & o4-mini, GPT-4.1, GPT-4.1 mini & nano, Gemini 2.5 Flash, Claude Research & Workspace, Grok memory, ChatGPT Memory w Search, GLM-4 32B, Kling 2.0 video gen, Cohere Embed 4, BitNet b1.58 2B.

Top Tools
OpenAI released o3 and o4-mini, their latest and most advanced AI reasoning models yet, capable of reasoning through multi‑step problems, handling vision inputs, thinking with images, executing code, and integrating with tools like browsers autonomously. The o3 model is state-of-the-art on challenging math, science, and coding benchmarks, while o4-mini offers excellent performance at a more reasonable price and speed.
Our most recent article covering the releases of GPT-4.1, o3 and o4-mini noted that the o3 and o4-mini aren’t simply better at reasoning, they are more powerful by combining reasoning, multimodality, and tool use in novel ways. We called this capability “perhaps the most important milestone in the evolution of AI models since o1 itself.”
Two days prior, OpenAI released GPT‑4.1, GPT‑4.1 mini, and GPT‑4.1 nano in its API. GPT‑4.1 and its lighter variants mini and nano offer major improvements over GPT-4o, with stronger coding performance, better instruction following, and an extended 1 million token context window. The GPT-4.1 models fill gaps where GPT-4o was starting to lag behind competitors’ models, such as in coding. Faster and less expensive than GPT-4o, GPT-4.1 mini and nano are OpenAI’s most cost-effective and affordable models.
Along with the GPT-4.1 models, OpenAI launched Codex CLI, a lightweight terminal-based coding agent that runs locally to support software development. This is similar to Anthropic’s Claude Code. It is open source and available to download, but the catch is you need to use OpenAI’s AI models to run it.
Google introduced Gemini 2.5 Flash, a hybrid reasoning model with near-SOTA performance, significantly improved over Gemini 2.0 Flash, while maintaining a smaller model size, low cost, and customizable performance. Benchmark results are excellent: 78.3% on GPQA Diamond, 63% on LiveCodeBench, 88% on AIME 2024, and 12.1% on Humanity’s Last Exam. Developers can disable reasoning entirely or cap reasoning to a specific "thinking budget" to manage costs. Gemini 2.5 Flash is available in preview via Google AI Studio, Vertex AI, and the Gemini app.
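The “thinking budget” is set per request. A minimal sketch of what a Gemini API generateContent request body might look like, assuming the public REST field names (the prompt text is made up):

```python
import json

# Hedged sketch: a request body for Gemini 2.5 Flash that caps reasoning cost.
# Setting thinkingBudget to 0 disables reasoning entirely; a positive value
# caps the number of reasoning tokens the model may spend.
request_body = {
    "contents": [
        {"role": "user", "parts": [{"text": "Explain CRDTs in two sentences."}]}
    ],
    "generationConfig": {
        "thinkingConfig": {"thinkingBudget": 0}  # 0 = no reasoning tokens
    },
}
print(json.dumps(request_body, indent=2))
```

Raising the budget trades latency and cost for quality on harder prompts, which is the knob developers use to tune Flash per workload.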

AI Tech and Product Releases
Anthropic rolled out Research, a feature similar to Deep Research that provides detailed, citation‑backed responses through multi‑step web and internal-context search and iterative reasoning. They also introduced Google Workspace integration, which connects Claude with Gmail, Calendar, and Docs to expand its personal context awareness and support automated tasks:
By connecting Google Workspace, Claude can securely search emails, review documents, and see your calendar commitments – eliminating the need to manually upload files or repeatedly provide context about your work and schedule.
These updates aim to position Claude as a "true virtual collaborator" for enterprise users prioritizing quick, well-researched answers by leveraging both web data and user workspace data.

OpenAI is upgrading ChatGPT’s memory feature with Memory with Search, which allows ChatGPT to draw on details from past conversations when searching the web to inform user queries. The update enhances ChatGPT's memory tool and aims to differentiate it from rival chatbots.
OpenAI launched Flex processing for cheaper, slower AI tasks. Flex processing is a new API option that offers lower prices (reduces API costs by half) for OpenAI's o3 and o4-mini reasoning models in exchange for slower response times.
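Flex is selected per request via the service tier setting. A minimal sketch of the request payload, assuming the `service_tier` field and treating the prompt as a placeholder (Flex requests can also be slower to start or time out under load, so batch jobs should retry):

```python
import json

# Hedged sketch: an API request payload opting into Flex processing for a
# non-urgent batch task. "service_tier": "flex" trades latency for roughly
# half-price tokens on the o3 / o4-mini reasoning models.
payload = {
    "model": "o4-mini",
    "input": "Classify this week's support tickets by topic.",
    "service_tier": "flex",  # cheaper, slower; suited to async/batch work
}
print(json.dumps(payload))
```

The same request without `service_tier` runs at standard pricing and latency, so Flex is a drop-in switch for workloads that can wait.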
xAI’s Grok can now remember your past conversations to provide more personalized responses. Grok’s new memory feature is similar to those in ChatGPT and Gemini; users can view, manage, and delete memories from past conversations for personalized control. It is available in beta on the Grok website and iOS/Android apps (excluding the EU/UK) and will be added to X soon.
Z.ai, formerly ChatGLM, open‑sourced the GLM‑4‑32B‑0414 model series, marking a major expansion of its multilingual, multimodal ChatGLM lineup with three 32B parameter models: the GLM-4-32B general-purpose model, the GLM-Z1-32B reasoning model, and GLM-Z1-Rumination-32B for long-form, in-depth reasoning to power their Deep Research feature. GLM‑4 models were pre‑trained on 15 trillion tokens. GLM-Z1-32B outperforms DeepSeek R1, Qwen QwQ-32B, and o1-mini on key benchmarks, with 80.4% on AIME 24 and 66.1% on GPQA, excellent performance for a 32B model. All model variants and documentation are available on GitHub and Hugging Face.
Mistral AI launched Classifier Factory, a no‑code/low‑code platform to build custom classifiers using compact Mistral models and simplified training pipelines. The service supports a range of enterprise use cases - content moderation, intent detection, sentiment analysis, data clustering, fraud detection, and more - via API integration and detailed cookbooks.
Cohere unveiled Embed 4, a multimodal embedding model that encodes text, images, tables, and diagrams into unified vectors for enterprise search and RAG applications. Embed 4 supports documents up to 128K tokens and over 100 languages. The model is accessible via Cohere’s API, Python SDK, and the Azure AI Foundry.
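The point of unified vectors is that retrieval over mixed media reduces to one nearest-neighbor search. A toy sketch of that downstream step, with made-up 3-dimensional vectors standing in for real Embed 4 outputs (which are much higher-dimensional):

```python
import math

# Illustrative only: once text and image documents share one embedding space,
# a query vector can be compared against all of them with cosine similarity.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings for a PDF report and a diagram image.
doc_vectors = {
    "q2_report.pdf": [0.9, 0.1, 0.2],
    "arch_diagram.png": [0.1, 0.8, 0.3],
}
query_vec = [0.85, 0.15, 0.25]  # hypothetical embedding of "quarterly revenue"
best = max(doc_vectors, key=lambda name: cosine(query_vec, doc_vectors[name]))
print(best)  # the report lands closest to the query
```

In production the same comparison runs inside a vector database rather than a Python loop, but the geometry is identical.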
KlingAI released Kling 2.0 Master, their newest text-to-video AI model, with significant improvements over Kling 1.6, including better prompt adherence, enhanced dynamics, and improved visual aesthetics. Kling 2.0 features improved physics and lighting and more fluid, natural movement, with a greater range of motion and more dramatic character expressions. They’ve also upgraded their image generation to KOLORS 2.0, adding multi-element editing, image editing, and restyling. The update is available on KlingAI.

Nvidia launched its AI‑Q Blueprint, offering pre‑defined, customizable workflows that integrate accelerated computing and partner storage to help developers build agentic AI systems with reasoning capabilities. This toolkit serves as a foundation for enterprises seeking to create AI‑driven digital workforces that can handle complex, high‑accuracy tasks efficiently.
Moveworks has launched an AI Agent Marketplace for the enterprise, with over 100 pre-built agents across HR, sales, finance, and IT that connect to third-party platforms and can be configured to meet specific needs. ServiceNow, which is acquiring Moveworks, also has an agent library, and both companies believe their marketplaces can coexist.
AI Research News
AI Startup CTGT claims to bypass AI bias and censorship with a new method. The technique directly modifies internal features responsible for censorship in LLMs without sacrificing accuracy or performance. Testing shows the method improves DeepSeek's response rate to controversial prompts from 32% to 96%.
Microsoft researchers have developed BitNet b1.58 2B4T, the largest 1-bit AI model to date, trained from scratch on a four‑trillion‑token corpus. Presented in the “BitNet b1.58 2B4T Technical Report,” the 2B parameter BitNet model quantizes weights into three values, making it memory and computationally efficient and runnable on CPUs like Apple's M2. The model matches full‑precision counterparts on benchmarks while slashing memory use by up to 82% and energy consumption by up to 70%. BitNet b1.58 2B4T is open source and available on Hugging Face.
Researchers from Hong Kong UST introduced HM‑RAG: Hierarchical Multi‑Agent Multimodal Retrieval Augmented Generation, a framework leveraging multiple agents and multimodal retrieval to enhance generation tasks. By structuring prompts and context hierarchically and dynamically routing information between specialized agent modules, this approach aims to improve the relevance and coherence of multimodal outputs.
OpenAI open‑sourced MRCR (Multi‑Round Coreference Resolution), a long‑context dataset that tests a model’s ability to distinguish multiple “needles” in synthetically generated dialogues, along with Graphwalks for multi‑hop reasoning. These long‑context comprehension benchmarks are more insightful than a simple needle-in-a-haystack (NIAH) test, and they were used to demonstrate GPT‑4.1’s superior long context understanding.
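To see why MRCR is harder than plain needle-in-a-haystack, consider a toy version (not OpenAI's actual generator): several near-identical “needles” are buried in filler dialogue, and the task asks for a specific one by position, so simple keyword matching is not enough.

```python
# Illustrative MRCR-style probe: three similar needles in a long dialogue.
# The model must return, say, the SECOND poem, which requires tracking order
# and coreference across the whole context, not just matching a keyword.
needles = [f"POEM #{i}: roses are red (variant {i})" for i in range(1, 4)]
filler = ["user: how's the weather?", "assistant: sunny."] * 50

dialogue = (
    filler[:30] + [needles[0]] + filler[30:60] + [needles[1]]
    + filler[60:] + [needles[2]]
)

# Ground truth for the probe "return the 2nd poem":
target = [line for line in dialogue if line.startswith("POEM")][1]
print(target)
```

A NIAH test with one unique needle is solvable by retrieval alone; multiple confusable needles force genuine long-context tracking, which is what separates models like GPT‑4.1 on this benchmark.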
AI Business and AI Policy News
Windsurf, the maker of an AI coding assistant, is reportedly in talks to be acquired by OpenAI for $3 billion. Acquiring Windsurf (formerly Codeium) would expand OpenAI's AI coding capabilities and capitalize on the growing AI-assisted "vibe coding" trend, putting OpenAI in competition with other AI coding assistant providers, including GitHub Copilot and Cursor.
Before pursuing Windsurf, OpenAI approached Anysphere, the maker of Cursor, in 2024 and again earlier this year about a potential acquisition, but the talks failed. Anysphere is reportedly in talks to raise capital at about a $10 billion valuation instead.
AI safety experts are raising concerns that a “race to the bottom” on AI safety and transparency is developing amid AI model competition. Some concerns raised recently:
Google's Gemini 2.5 Pro AI safety report is sparse on details, experts say. The report lacks key information, making it hard to verify Google's safety commitments and assess model risks.
Google hasn't released a report for its Gemini 2.5 Flash model yet, although Google says the report is "coming soon."
OpenAI partner Metr suggests it had limited time to test OpenAI's new o3 model. Metr found o3 showed a "high propensity" to "cheat" on tests, even when aware of misaligned behavior, and another partner, Apollo Research, observed deceptive behavior in o3 and o4-mini.
OpenAI has implemented a bio-threat monitoring system for its o3 and o4-mini AI models, trained on OpenAI's content policies to prevent them from providing instructions related to creating biological and chemical threats. The system identifies risky prompts and instructs the models to refuse related advice, achieving a 98.7% success rate in internal tests.
Chatbot Arena, the AI model benchmarking project, is forming Arena Intelligence Inc. The new company will provide resources to improve the platform while remaining a neutral testing ground. Chatbot Arena has partnered with companies such as OpenAI, Google, and Anthropic to allow its community to evaluate flagship models.
Stargate, the $500B AI data center project, is considering expanding beyond the U.S. to the U.K., Germany, and France. While initially focused on boosting U.S. AI infrastructure, the project is reportedly weighing international expansion.
Trump administration is considering restricting Chinese AI lab DeepSeek. The potential restrictions would limit DeepSeek's access to Nvidia's AI chips and possibly bar Americans from using its AI services, amid concerns over IP theft and competition.
AI Opinions and Articles
The latest viral ChatGPT trend is doing ‘reverse location search’ from photos. Users are leveraging OpenAI's new o3 and o4-mini models to identify locations from images. While not always accurate, the models' reasoning and web search capabilities can potentially expose sensitive information from photos and raise privacy concerns.
VentureBeat calls the recent Sam Altman TED 2025 interview the “most uncomfortable and important AI interview of the year.” In it, Altman revealed that OpenAI has reached 800 million weekly active users, growth so rapid that the company faces challenges like GPU shortages:
Altman painted a picture of a company struggling to keep up with its own success, noting that OpenAI’s GPUs are “melting” due to the popularity of its new image generation features. “All day long, I call people and beg them to give us their GPUs. We are so incredibly constrained,” he said.
Altman addressed ethical concerns, intellectual property considerations, and the company's shift towards a for-profit model.