AI Week In Review 25.02.22
Grok 3, Perplexity's R1 1776, Microsoft Muse and WHAM, ZeroBench, HuggingFace Ultra-scale Playbook, OmniParser v2, Step-Video-T2V, Figure AI Helix, Verdict LLM-as-Judge, fastVIDEO.

AI Tech and Product Releases
xAI announced Grok 3 in a livestream with Elon Musk and the Grok 3 development leads. Grok 3 is a SOTA frontier AI model, scoring particularly well on reasoning benchmarks in math (93% on AIME 2024) and coding (79% on LiveBench), and it comes with a Thinking mode (reasoning) and a Deep Search mode (with web search).
As mentioned in our “Grok 3 is a Colossus” article, Grok 3 was trained on Colossus, the massive 200,000-GPU supercluster that was stood up in just a few months. Grok 3 is available now to X Premium users and on Grok.com.

Perplexity AI released R1 1776, a fine-tuned version of DeepSeek-R1 retrained to reduce censorship and bias, particularly on topics related to China. Perplexity AI identified approximately 300 censored topics and undertook retraining of the model using Nvidia's NeMo 2.0 framework. Evaluations show that R1 1776 effectively reduces censorship bias while maintaining R1’s strong reasoning capabilities. The open source R1 1776 model is available on Hugging Face.
Microsoft has released Muse, the first generative AI model designed for gameplay ideation. Muse is built on the World and Human Action Model (WHAM), a model that generates video gameplay, both game visuals and controller actions, offering unprecedented flexibility in game design. This technology has the potential to revolutionize game development by reducing costs, accelerating prototyping, and enabling new creative possibilities such as on-the-fly game world-building.
A paper in Nature called “World and Human Action Models towards gameplay ideation” details the WHAM model that Muse is built on. The WHAM 1.6B model is trained on over 1 billion images and actions from real gameplay, and it can generate diverse gameplay sequences consistent with 3D world mechanics and physics. Muse weights and a demonstrator have been released via Azure AI Foundry Labs.
Chinese AI lab DeepSeek plans to open-source portions of its online services’ code as part of an “open source week” event next week. The company will openly release five code repositories:
Daily unlocks are coming soon. No ivory towers - just pure garage-energy and community-driven innovation.
ZeroBench is a new visual benchmark of tough challenges to assess reasoning capabilities of multimodal LLMs and VLMs. The benchmark is presented in the paper “ZeroBench: An Impossible Visual Benchmark for Contemporary Large Multimodal Models.” Unlike traditional benchmarks that often rely on specialized knowledge, ZeroBench focuses on evaluating models' ability to solve puzzles and problems that require logical reasoning and common sense in a visual domain, similar to those encountered in everyday life. ZeroBench is hard: All multi-modal AI models, even AI reasoning models, score zero on ZeroBench.
Hugging Face released "Ultra-Scale Playbook: Training LLMs on GPU Clusters," a free, open-source guide that empowers the AI community to train LLMs on a massive scale. This comprehensive resource covers a wide range of techniques for optimizing model training, including data parallelism, tensor parallelism, pipeline parallelism, and ZeRO optimization.
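As a toy illustration of the data-parallelism idea the playbook covers (this is an illustrative sketch, not code from the playbook), each worker computes gradients on its own data shard and the results are averaged, as an all-reduce would do across GPUs:

```python
import numpy as np

def data_parallel_step(params, shards, grad_fn, lr=0.1):
    """One data-parallel SGD step: each 'worker' computes gradients on its
    own data shard, gradients are averaged (a stand-in for all-reduce),
    and every replica applies the same update."""
    grads = [grad_fn(params, shard) for shard in shards]  # per-worker backward pass
    avg_grad = np.mean(grads, axis=0)                     # simulated all-reduce
    return params - lr * avg_grad

# Toy problem: minimize squared distance of params to the data mean.
grad_fn = lambda p, x: 2 * (p - x.mean(axis=0))
shards = [np.array([[1.0, 2.0]]), np.array([[3.0, 4.0]])]  # two "workers"
params = np.zeros(2)
for _ in range(100):
    params = data_parallel_step(params, shards, grad_fn)
print(params)  # converges toward the global data mean [2.0, 3.0]
```

Because every replica sees the same averaged gradient, all copies of the model stay in sync, which is the core invariant behind data parallelism (and what ZeRO then optimizes by sharding optimizer state).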
Microsoft released OmniParser v2, an AI tool that enhances LLMs' understanding of and interaction with graphical user interfaces (GUIs) and can turn an LLM into a Computer Use Agent. OmniParser v2 parses GUI screenshots into structured, machine-readable data, enabling LLMs to understand interactive elements and perform tasks within the GUI. OmniParser v2 can be tried in this Gradio demo and is available on HuggingFace, with code on GitHub.
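To make the parse-then-act pattern concrete, here is a hypothetical sketch of how structured screen-parse output might be consumed by an agent loop. The field names and helper functions are illustrative assumptions, not OmniParser's actual API:

```python
# Hypothetical parse-then-act sketch: a parser turns a screenshot into
# structured interactive elements, and the agent picks one to act on.
from dataclasses import dataclass

@dataclass
class UIElement:
    id: int
    role: str    # e.g. "button", "textbox"
    label: str   # caption the parser attached to the element
    bbox: tuple  # (x, y, width, height) in screen pixels

def find_element(elements, query):
    """Pick the element whose label matches the query
    (simple substring match standing in for an LLM's choice)."""
    for el in elements:
        if query.lower() in el.label.lower():
            return el
    return None

def click_target(el):
    """Return the center of the element's bounding box as click coordinates."""
    x, y, w, h = el.bbox
    return (x + w // 2, y + h // 2)

screen = [
    UIElement(0, "textbox", "Search field", (100, 40, 400, 30)),
    UIElement(1, "button", "Submit order", (520, 400, 120, 40)),
]
target = find_element(screen, "submit")
print(target.id, click_target(target))  # 1 (580, 420)
```

The key idea is that once the screenshot becomes structured data, "click the submit button" reduces to selecting an element and emitting its coordinates, which any LLM can do with tool calls.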

Chinese AI lab StepFun has released Step-Video-T2V, a 30B parameter text-to-video model that produces high-quality short videos (544 x 992 resolution, up to 10 seconds long) from text prompts. It’s available as an open-weights AI model on HuggingFace and to try out on the StepFun website.
StepFun published a Step-Video-T2V Technical Report explaining the architecture and training of their text-to-video model. It incorporates advanced video compression, transformer architecture, and human preference optimization to improve efficiency and visual quality.
Figure AI announced Helix, a Vision-Language-Action (VLA) model that enables its humanoid robots to follow voice commands for household tasks. As Figure shared in their blog post, Helix processes natural language commands and performs tasks by visually assessing the environment in real time:
Helix is a first-of-its-kind "System 1, System 2" VLA model for high-rate, dexterous control of the entire humanoid upper body.
Google is pulling its AI assistant Gemini from the main Google app for iOS devices, requiring users to download the standalone Gemini app to access Gemini.
It’s coming! From Chubby on X and The Verge:
OpenAI’s GPT-4.5 is expected to launch as early as next week, codenamed Orion. GPT-5 is anticipated in late May, with significant updates, including the integration of the o3 reasoning model.
AI Research News
Our AI research highlights article for this week covered AI research that advances use of AI for science:
Evo 2: Genome modeling and design across all domains of life
Accelerating scientific breakthroughs with an AI co-scientist
AI for modelling infectious disease epidemics
To underscore the achievement of the AI co-scientist, one news article mentioned that Google’s AI Co-Scientist stunned researchers by solving a superbug origin problem in just two days.
Verdict is a framework for composing and evaluating LLM judges, enabling researchers to create more robust and reliable evaluation metrics for AI systems. This framework allows the combination of different LLMs to assess AI performance, particularly in complex tasks that require subjective judgment.
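A minimal sketch of the judge-composition idea: several judges each score a response, and their verdicts are aggregated. The judge functions and aggregation below are illustrative stand-ins, not Verdict's actual API:

```python
from statistics import mean

def aggregate_judges(judges, response, threshold=0.5):
    """Run each judge on the response and combine their scores.
    Averaging plus a pass threshold stands in for the richer
    ensembling/debate strategies a framework like Verdict offers."""
    scores = [judge(response) for judge in judges]
    avg = mean(scores)
    return {"scores": scores, "mean": avg, "pass": avg >= threshold}

# Stand-in judges: real ones would each be an LLM call with a rubric prompt.
length_judge  = lambda r: 1.0 if len(r.split()) >= 5 else 0.0
keyword_judge = lambda r: 1.0 if "because" in r.lower() else 0.0

verdict = aggregate_judges([length_judge, keyword_judge],
                           "The answer is 4 because 2 + 2 = 4.")
print(verdict)  # {'scores': [1.0, 1.0], 'mean': 1.0, 'pass': True}
```

Composing multiple judges this way reduces the variance and idiosyncratic bias of any single LLM judge, which is the motivation for frameworks like Verdict.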
UCSD’s Hao AI Lab has developed fastVIDEO, a technique that significantly accelerates video generation with Diffusion Transformers (DiTs). By optimizing the attention mechanism with its newly released sliding tile attention (STA), fastVIDEO reduces video generation time by up to 3x without compromising quality.
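The tiled-locality idea can be sketched with a toy attention mask: tokens (video patches) attend only within their own tile and neighboring tiles rather than globally, which is where the speedup comes from. This is an illustrative sketch of the general concept, not the released STA kernel:

```python
import numpy as np

def sliding_tile_mask(n_tokens, tile=4, window=1):
    """Boolean attention mask where each token attends only to tokens in
    its own tile and in tiles within `window` tiles of it, instead of all
    n_tokens. Dense attention costs O(n^2) pairs; this keeps far fewer."""
    mask = np.zeros((n_tokens, n_tokens), dtype=bool)
    for i in range(n_tokens):
        ti = i // tile                              # which tile token i is in
        lo = max(0, (ti - window) * tile)           # first attendable token
        hi = min(n_tokens, (ti + window + 1) * tile)  # one past the last
        mask[i, lo:hi] = True
    return mask

m = sliding_tile_mask(16, tile=4, window=1)
print(m.sum(), "of", m.size, "attention pairs kept")  # 160 of 256
```

Because tile adjacency is symmetric, the mask is symmetric too, and the kept-pair count grows linearly in the number of tokens rather than quadratically.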
AI Business and Policy
AI use is growing: OpenAI reports 400 million weekly active users for ChatGPT, up from 300 million in December 2024. On the B2B side, OpenAI has doubled its paying enterprise user base to 2 million since September 2024, and developer API traffic has doubled over the past six months.
Mistral’s AI assistant, Le Chat, reached one million downloads a couple of weeks after its initial release, topping the free downloads chart on the iOS App Store in France.
Norwegian robotics firm 1X unveiled its latest home robot, Neo Gamma. Designed for limited in-home testing, the humanoid robot performs household tasks like making coffee and vacuuming, featuring a softer, knitted nylon exterior to enhance safety during human-robot interactions.
Sakana AI walked back claims of an AI system that could speed up training of certain models by up to 100x. Users found the system resulted in a 3x slowdown instead. Sakana admitted the system exploited loopholes and is revising its claims and materials.
AI coding assistants such as GitHub Copilot may appear to boost productivity, but a new GitClear report analyzing 211 million lines of code found a significant decline in code quality and reuse. Surveys indicate that while AI can speed up code reviews and documentation, it often leads to more debugging time and more security issues than human-written code.
AI hardware startup Humane was acquired by HP for $116 million. HP bought the company for its AI software expertise for future AI integration projects, but the Humane AI Pin is no more.
Google Launches ‘Career Dreamer’ AI Tool to Help Explore Career Paths. The tool uses AI to match users' experiences, skills, and interests with potential career paths, offering a visual web of possibilities and aiding in crafting career identity statements for resumes or interviews. Career Dreamer is currently available as an experiment in the U.S.
AI startup fundraising news:
Sanas, a startup that develops AI software that adjusts speakers' accents in real time to reduce accent bias, secured a $65 million funding round that values the company at over $500 million.
AI coding startup Codeium is raising a new round of funding at a $2.85 billion valuation, just six months after its last $150 million raise. Codeium has a focus on enterprise clients for software development.
Mercor, an AI recruiting startup founded by three 21-year-olds, has raised $100 million. Mercor uses AI to automate resume screening and candidate matching, aiming to remove bias from hiring processes while automating routine tasks.
Arize is an AI observability platform that evaluates AI products and AI models during development and monitors them post-launch for errors. Arize AI recently raised $70 million, bringing total funding to over $130 million.
AI Opinions and Articles
LLMs hit a point in the last six months where we’ve transitioned from LLM-as-a-service to LLM-as-a-commodity. What does this mean? For 99.9% of use cases, the laws of supply and demand dictate that the lowest-priced API wins.
I agree. With every new AI model release, the choices widen and get better and cheaper. There is no moat.