AI Week in Review 25.05.17
AlphaEvolve, Codex AI coding agent, AM-Thinking-v1, Falcon-Edge, Windsurf SWE-1, GPT-4.1 in ChatGPT, Nous Research Psyche, INTELLECT-2 32B, ARI Enterprise, HealthBench. Grok went off the rails.

Top Tools
OpenAI launched Codex, a cloud-based AI coding agent for performing software engineering tasks like writing code, implementing features, and fixing bugs. Codex provides a "vibe coding" experience where users can interact with the Codex AI agent in a ChatGPT-like interface, and review, edit, and approve any suggested code changes. Codex allows multiple agents to work on coding tasks in parallel.
The system is powered by a custom model, codex-1, which is built on o3 and specifically trained for coding using end-to-end RL on real-world software engineering tasks. Codex-1 improves over o3 on SWE benchmarks. OpenAI also announced updates to their previously released Codex CLI, which will be powered by a new codex-mini model.
Codex is being released as a research preview for ChatGPT subscribers; Pro, Enterprise and Team users get it now, and Plus and Edu will get access soon. Try it here.

AI Tech and Product Releases
The AI lab at Chinese real-estate platform Beike (Ke.com) has released AM-Thinking-v1, a new open-source 32B dense reasoning model that is SOTA on math and code benchmarks: 85.3% on AIME 2024, 70.3% on LiveCodeBench v5, and 92.5% on Arena-Hard, competitive with o1 and o3-mini despite being just a 32B model. The model supports hybrid reasoning via a "/think" reasoning toggle and is optimized for local use. AM-Thinking-v1 is available on HuggingFace.
The team shared a technical paper “AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale” to explain their training process:
Starting from Qwen2.5-32B base model, we apply SFT using a cold-start dataset that encourages a "think-then-answer" pattern and builds initial reasoning capability. During RL, we incorporate difficulty-aware query selection and a two-stage training procedure to ensure both training stability and progressive improvement in performance.
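The "think-then-answer" pattern above typically means the model emits its chain of thought inside delimiters before the final answer, which a client then strips. A minimal sketch of that parsing step, assuming the common `<think>...</think>` convention (AM-Thinking-v1's exact delimiters may differ):

```python
import re

def split_think_answer(text: str) -> tuple[str, str]:
    """Split a model response into (reasoning, answer).

    Assumes the widely used convention of wrapping chain-of-thought in
    <think>...</think> tags; the model's actual delimiters may differ.
    """
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        # No reasoning block found: treat the whole text as the answer.
        return "", text.strip()
    reasoning = m.group(1).strip()
    answer = text[m.end():].strip()
    return reasoning, answer

resp = "<think>2 + 2 is 4.</think>The answer is 4."
reasoning, answer = split_think_answer(resp)
```

With a "/no_think"-style toggle the reasoning block is simply absent, and the same parser falls through to returning the full text as the answer.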
The Falcon team at Abu Dhabi’s TII has launched Falcon-Edge, a family of 1B and 3B ternary BitNet LLMs for edge deployment (Blog, HF-1B, HF-3B). Both models are available on HuggingFace.
Windsurf (formerly Codeium) launched its SWE-1 family of AI software engineering models, including SWE-1-lite and SWE-1-mini for all users, and SWE-1 for paid users. These models offer flow awareness, sharing the developer's working context, and aim to support the entire software engineering process rather than just code completion.
OpenAI is rolling out GPT-4.1 and GPT-4.1 mini AI models to ChatGPT, with GPT-4.1 mini becoming the new default for all users, replacing the GPT-4o series. These models, supporting a one million token context window, offer improved coding, instruction following, and efficiency over GPT-4o.
Nous Research has announced Psyche, a decentralized cooperative-training network that allows for distributed, permissionless, collaborative AI model training:
Psyche is an open infrastructure that democratizes AI development by decentralizing training across underutilized hardware. Building on DisTrO and its predecessor DeMo, Psyche reduces data transfer by several orders of magnitude, making distributed training practical.
The Psyche dashboard shows the status of distributed training on a 40B parameter AI model.
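The bandwidth savings Psyche cites come from compressing what each node communicates per step. DisTrO/DeMo's actual algorithm is transform-based momentum decoupling, which is more involved; the general idea of trading a full gradient for a small summary can be sketched with simple top-k sparsification (an illustrative stand-in, not the DisTrO method):

```python
import heapq

def topk_compress(grad, k):
    """Illustrative top-k gradient sparsification: transmit only the k
    largest-magnitude entries as (index, value) pairs instead of the
    full vector. NOT DisTrO's actual algorithm -- just a simple way to
    see how sparse summaries cut data transfer.
    """
    idx = heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i]))
    return [(i, grad[i]) for i in sorted(idx)]

def decompress(pairs, n):
    """Reconstruct a dense vector, zero-filling untransmitted entries."""
    out = [0.0] * n
    for i, v in pairs:
        out[i] = v
    return out

g = [0.01, -2.0, 0.3, 0.0, 1.5]
msg = topk_compress(g, 2)  # only 2 of 5 entries cross the network
```

At realistic scales (billions of parameters, k a tiny fraction of them), this kind of compression is what makes training over consumer internet links plausible at all.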
Prime Intellect released INTELLECT-2, the first 32B AI model trained through globally distributed reinforcement learning. The training used a decentralized framework similar in spirit to Nous Research's Psyche, demonstrating the feasibility of decentralized training and deployment of AI models. They shared further details on training in a Technical Report.
You.com launched ARI Enterprise, which boasts 4 times greater depth and breadth than the previous version and can connect to internal corporate data sources. You.com bills ARI as “the first professional-grade research agent” and claims it outperforms OpenAI on deep research benchmarks. The company is open-sourcing its benchmarking methodology.
Notion integrated GPT-4.1 and Claude 3.7 into its platform for new AI enterprise capabilities. The all-in-one AI toolkit includes AI meeting notes, enterprise search, research mode, and model switching to reduce context switching.
The AI agent ecosystem keeps growing, with new tools supporting enterprises that deploy AI agents:
Patronus AI launched Percival, an AI agent monitoring platform to automatically identify AI failures to improve AI reliability and governance. Percival detects over 20 failure modes and suggests optimizations, differentiating itself through agent-based architecture and episodic memory.
Frontegg.ai has launched its AI Agent Builder to simplify identity and access management for AI agents, providing authentication, integrations, security, and authorizations. This platform enables developers to build enterprise-ready agentic automation products by handling essential access controls.
Microsoft is testing a "Hey Copilot" voice activation feature for Windows 11 Copilot. Beta testers can try hands-free access to the AI app in Windows 11.
Google I/O is next week, promising AI advancements and product updates on Gemini AI models and AI agents like Astra and Project Mariner.
The Wall Street Journal reports Meta has again postponed the release of its flagship Llama 4 Behemoth AI model. Despite massive training runs, Llama 4 Behemoth is still not meeting performance targets, delaying the release to at least the fall. Meta has not publicly committed to a new timeline for the product's launch.
AI Research News
Google DeepMind announced AlphaEvolve, an AI agent for algorithm discovery and optimization. AlphaEvolve has already delivered earth-shattering results:
When presented with 50 open problems across mathematical analysis, geometry, combinatorics, and other fields, AlphaEvolve rediscovered the best known solutions in 75% of cases and improved upon existing solutions in 20%.
AlphaEvolve shattered a 56-year-old record by discovering a way to multiply 4x4 complex-valued matrices using 48 scalar multiplications, improving on Strassen’s 1969 algorithm.
It also advanced mathematical knowledge with an improved lower bound for the kissing number problem in 11 dimensions.
AlphaEvolve has paid for itself by improving Google’s data center compute orchestration, finding a better heuristic that clawed back 0.7% of Google's compute capacity, saving millions.
AlphaEvolve builds on DeepMind's previous AI successes, combining evolutionary methods, RL-based learning, and coding LLMs:
Using an evolutionary approach, continuously receiving feedback from one or more evaluators, AlphaEvolve iteratively improves the algorithm, potentially leading to new scientific and practical discoveries.
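The propose-evaluate-keep loop described above can be sketched in a few lines. In the real system an LLM mutates programs and the evaluators run real workloads; in this toy stand-in the "candidate" is just a numeric vector and mutation is random noise:

```python
import random

def evolve(evaluator, init, generations=200, pop=8, seed=0):
    """Minimal evolutionary loop in the spirit of AlphaEvolve: propose
    mutated candidates, score each with an evaluator, keep the best.

    AlphaEvolve mutates *programs* via an LLM and keeps a population of
    diverse candidates; here mutation is Gaussian noise on a vector and
    only the single best survivor is retained -- a toy stand-in.
    """
    rng = random.Random(seed)
    best, best_score = init, evaluator(init)
    for _ in range(generations):
        for _ in range(pop):
            # Propose a mutated candidate near the current best.
            cand = [x + rng.gauss(0, 0.1) for x in best]
            score = evaluator(cand)
            if score > best_score:
                best, best_score = cand, score
    return best, best_score

# Toy evaluator: maximize -(x - 3)^2, whose optimum is x = 3.
best, score = evolve(lambda v: -(v[0] - 3.0) ** 2, [0.0])
```

The evaluator is the crucial ingredient: because every candidate is scored by executing it against a ground-truth metric (a math check, a scheduler simulation), the loop can only accept verified improvements, which is why its discoveries hold up.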

OpenAI announced HealthBench, a physician-crafted benchmark for evaluating AI models in realistic medical situations:
Built in partnership with 262 physicians who have practiced in 60 countries, HealthBench includes 5,000 realistic health conversations, each with a custom physician-created rubric to grade model responses.
They released an associated paper “HealthBench: Evaluating Large Language Models Towards Improved Human Health” with the benchmark.
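The rubric grading works roughly as follows: each physician-written criterion carries a point value (negative for harmful behaviors), a grader judges which criteria a response meets, and the total is normalized by the maximum achievable points. A simplified sketch (see the paper for the exact scoring rules):

```python
def rubric_score(criteria, met):
    """Score a model response against a physician-written rubric, in the
    style of HealthBench: each criterion carries points (negative for
    harmful behaviors), and the earned total is normalized by the
    maximum achievable, clipped at zero. Simplified sketch.

    criteria: dict mapping criterion description -> point value
    met: set of criteria the grader judged the response to satisfy
    """
    max_points = sum(p for p in criteria.values() if p > 0)
    earned = sum(p for name, p in criteria.items() if name in met)
    return max(0.0, earned / max_points) if max_points else 0.0

# Hypothetical rubric for one conversation (illustrative, not from the benchmark):
rubric = {
    "advises seeing a clinician": 5,
    "asks about symptom duration": 3,
    "recommends unproven treatment": -4,  # harmful behavior, negative points
}
score = rubric_score(rubric, {"advises seeing a clinician"})
```

Because every one of the 5,000 conversations has its own rubric, the benchmark rewards context-appropriate behavior rather than multiple-choice recall.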
AI Business and Policy
Just in time for graduation, Cursor is offering university students one year of free Cursor Pro.
Poe's report reveals shifting AI model preferences, with OpenAI and Google gaining ground while Anthropic declines. Usage of specialized reasoning models is growing rapidly, indicating a shift toward precision and reasoning over raw text generation.
OpenAI is planning a 5-gigawatt data center campus in Abu Dhabi with G42, potentially one of the world's largest AI infrastructure projects spanning 10 square miles. The UAE could import 500,000 Nvidia AI chips annually starting in 2025 for this campus.
Cohere doubled revenue this quarter, finding success in enterprise AI deployments. In addition, Cohere has acquired Ottogrid, an enterprise platform for automating market research.
Coders were hit hardest in Microsoft’s 2,000-person layoff in Washington state, with over 40% of laid-off employees in software engineering roles. These cuts come after the CEO said AI now writes up to 30% of the company’s code, though Microsoft declined to comment on whether AI was a factor.
AI startup funding news:
Cognichip emerges from stealth with $33M to speed up chip design with AI. Cognichip is developing a physics-informed AI model to accelerate chip development, potentially cutting production times by 50%.
AI video generation tool Hedra raised $32M for its character-focused platform that allows users to create videos with AI-generated characters and transfer styles across media.
AI video creation startup Moonvalley has raised a total of $53 million from investors, adding $10 million to its previous $43 million. Moonvalley's Marey model offers customization and aims to protect users from copyright challenges through data licensing and safeguards.
Legal tech startup Harvey is in talks to raise over $250M at a $5B valuation.
Perplexity is reportedly in discussions to raise $500 million, valuing the company at $14 billion.
SoundCloud revised its terms of use following backlash over an AI model training clause, clarifying that user content and artist uploads will not be used to train generative AI models that might replicate content or likenesses without consent.
Anthropic admitted its Claude AI chatbot hallucinated an erroneous citation in a legal filing in its dispute with music publishers, altering the details of a genuine article. The company's lawyers called it an "honest citation mistake," stating that a manual check failed to catch the AI-hallucinated error.
Chip maker Arm is reorganizing its offerings around complete compute platforms for AI and introducing new product families organized by market and performance tiers.
AI Opinions and Articles
Elon Musk's xAI addressed concerns after Grok began inserting unsolicited references to South Africa’s "white genocide" in responses to unrelated user queries. xAI attributed this behavior to an "unauthorized modification" of Grok's system prompt and has since open-sourced the prompts on GitHub to enhance transparency.
This incident, like OpenAI’s sycophancy issue, highlights the potential for AI misalignment through manipulation of AI model system prompts. AI safety expert Esben Kran of Apart Research has developed the DarkBench benchmark to detect manipulative "dark patterns" in LLM behavior and surface these AI safety risks.