AI Changes Everything

AI Week in Review 26.06.19

Patrick McGuinness — Fri, 19 Jun 2026 18:22:48 GMT

Figure 1. xAI released Grok Imagine Video 1.5 to general availability. It can generate high quality 720p audio-video clips in under a minute.

Top Tools

Z.ai launched GLM-5.2, a 753-billion parameter Mixture-of-Experts (MoE) open-weights AI model for long-horizon coding and engineering tasks. GLM-5.2 features a 1-million-token context window, reasoning controls, and support for coding tasks across entire codebases; it also utilizes the IndexShare architecture to reduce per-token compute FLOPs by up to 2.9 times. GLM-5.2 demonstrates high performance on benchmarks like SWE-bench Pro (62.1%), Terminal-Bench 2.1 (81.0%), and FrontierSWE (74.4%), rivaling frontier models like GPT-5.5 and Claude Opus 4.8.

Independent evaluations by Artificial Analysis confirm GLM-5.2 as the leading open weights AI model on the Artificial Analysis Intelligence Index. They also show that GLM-5.2 is notably token-hungry, consuming roughly 43,000 output tokens per standard Index task, up from 26,000 tokens used by GLM-5.1. GLM-5.2 is priced at only $1.40/$0.26/$4.40 per 1M input/cache hit/output tokens, so despite the token-hungry reasoning, it is the lowest-cost frontier-level AI model, substantially cheaper than proprietary rivals.
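As a rough check on the cost claim, the output-token spend per Index task can be estimated from the figures quoted above. This is a back-of-the-envelope sketch only: it counts output tokens, ignores input and cache-hit charges, and the function name is illustrative.

```python
# Figures quoted above; output tokens only, input and cache-hit costs ignored.
OUTPUT_PRICE_PER_M = 4.40      # USD per 1M output tokens (GLM-5.2)
TOKENS_PER_TASK_52 = 43_000    # approx. output tokens per Index task, GLM-5.2
TOKENS_PER_TASK_51 = 26_000    # approx. output tokens per Index task, GLM-5.1


def output_cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost in USD of generating `tokens` output tokens."""
    return tokens / 1_000_000 * price_per_million


cost_per_task = output_cost_usd(TOKENS_PER_TASK_52, OUTPUT_PRICE_PER_M)
token_growth = TOKENS_PER_TASK_52 / TOKENS_PER_TASK_51

print(f"GLM-5.2 output cost per task: ${cost_per_task:.3f}")
print(f"Output tokens vs GLM-5.1: {token_growth:.2f}x")
```

Under these assumptions a task costs roughly 19 cents in output tokens, even though GLM-5.2 emits about 1.65 times as many tokens per task as GLM-5.1.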

Figure 2. GLM-5.2 shows superior performance to GLM-5.1 and even Claude Opus 4.7 at High effort. Greater reasoning effort with more token use leads to higher performance.

GLM-5.2 is positioned as a powerful open model focused on agentic software engineering that developers can run and build on. The GLM-5.2 model is available through Z.ai’s Coding Plan, their ZCode agent, their Z.ai chatbot, and via open weights on HuggingFace.

Figure 3. Z.ai’s open weights AI model GLM-5.2 passes the Simon Willison ‘pelican on a bicycle’ test.

AI Tech and Product Releases

Sina Weibo researchers released VibeThinker-3B, a 3B parameter model matching flagship reasoning performance. The model scores 94.3 on AIME 2026 and 80.2 on LiveCodeBench v6, a remarkable result for a 3B model that matches top-tier AI models such as Claude Opus 4.5. This has stimulated conversation about benchmark and scaling limits.

The Technical Report on VibeThinker-3B shows that verifiable reasoning capabilities can be compressed into much smaller models than those required for open-domain knowledge, calling this the Parametric Compression-Coverage Hypothesis. This could have huge implications for how much further we can compress AI reasoning and improve AI model efficiency.

xAI released Grok Imagine Video 1.5, its image-to-video system for generating short clips with synchronized sound, to general availability. The update features improved motion physics, better audio synchronization (same-pass audio and speech generation), and nearly doubled generation speeds for 720p videos; its fast mode produces 6-second 720p videos in about 25 seconds. The release also adds workflow features such as Projects, multiple parallel agents, and search. Results are compelling:

You type a prompt or upload an image, and it turns it into a realistic 720p video, up to 15 seconds long, with actual dialogue and sound effects. All in just 25 seconds.

Google made Gemini Omni available through an API and positioned it as a leading video model. Gemini Omni is Google’s unified any-to-any system for text, image, video, audio, and music generation and editing. Google’s model page says Omni performs strongly on video editing, text-to-video, image-to-video, and reference-to-video, and reports top results on MovieGenBench for overall preference and instruction following. The model is meant for iterative multimodal video creation, including continuation, reference-based edits, and consistency across turns.

Anthropic has overhauled Claude Design, introducing enhanced canvas controls for easier element manipulation and brand-compliant design system imports from GitHub or local files. The update expands integration and export capabilities to platforms like Adobe, Canva, and Vercel, while also implementing shared usage limits across Anthropic’s product suite. Additionally, a new `/design-sync` command enables seamless, bidirectional workflow synchronization between Claude Design and Claude Code.

Nvidia released XR AI in public beta, a developer framework for building multimodal AI agents that run on AR glasses and extended-reality devices. The system connects video, audio, depth, pose, and sensor data with enterprise retrieval, AI models, agent orchestration, and accelerated inference. This enables hands-free AI assistance in laboratories, factories, hospitals, and design workflows.

HumanLayer launched its Agentic IDE for teams working in complex codebases. Declaring they are on a mission to ‘solve the AI slop code problem’, HumanLayer is aiming their HumanLayer Agentic IDE at structured, team-based AI software development rather than one-shot vibe coding. It includes a collaboration platform and software-factory building blocks designed to help engineers ship 3x faster while maintaining code quality and standards.

OpenRouter introduced Fusion, which combines access to multiple AI models behind one API call. Fusion sends one prompt to a panel of models, has a judge model compare the outputs, and then synthesizes a final answer from the combined results. In OpenRouter’s reported DRACO tests, fused panels outperformed individual models, and a lower-cost panel came within about 1 percentage point of Fable 5 while costing about half as much.
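The fan-out, judge, and synthesize flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not OpenRouter's API or implementation: the `fuse` function, the prompt wording, and the stub models are all hypothetical, and each "model" is just a callable from prompt string to response string.

```python
from typing import Callable, Dict

# A "model" here is any callable mapping a prompt string to a response string.
Model = Callable[[str], str]


def fuse(prompt: str, panel: Dict[str, Model], judge: Model, synthesizer: Model) -> str:
    """Fan a prompt out to a panel, have a judge compare, then synthesize."""
    # 1. Send the same prompt to every panel model.
    candidates = {name: model(prompt) for name, model in panel.items()}

    # 2. Ask the judge model to compare the candidate answers.
    listing = "\n".join(f"[{name}] {text}" for name, text in candidates.items())
    critique = judge(f"Question: {prompt}\nCandidates:\n{listing}\nCompare them.")

    # 3. Synthesize a final answer from the candidates plus the judge's notes.
    return synthesizer(
        f"Question: {prompt}\nCandidates:\n{listing}\nJudge notes: {critique}\n"
        "Write one final answer."
    )


# Stub models so the sketch runs offline; real use would wrap API calls.
panel = {
    "model_a": lambda p: "Paris",
    "model_b": lambda p: "Paris, France",
}
judge = lambda p: "Both agree on Paris; model_b adds the country."
synthesizer = lambda p: "Paris (France)"

final = fuse("What is the capital of France?", panel, judge, synthesizer)
print(final)
```

The trade-off is that one fused request costs several model calls plus the judge and synthesis steps, which is why the choice of a lower-cost panel matters for the price comparison above.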

Midjourney announced Midjourney Medical, a new business to build a full-body ultrasound scanner called Ultrasonic CT that can do whole-body ultrasound scans in as little as 60 seconds, then offer it as a service in Midjourney Medical spas. The ultrasound system uses thousands of ultrasonic transducers to build a 3D anatomical map. It seems like a big leap to go from image generation into medical imaging hardware and services, but Midjourney Medical notes that large data volumes would make AI useful for processing and reconstruction.

Figure 4. Midjourney Medical’s ultrasound system can do whole-body scans and map a number of anatomical features and medical conditions.

Samsung announced a new AI-powered pet health feature for mobile devices during the VivaTech 2026 conference in Paris. Developed in collaboration with the platform Lifet, the feature uses AI to analyze photos of pets to detect conditions such as obesity and periodontal disease.

Tokyo-based AI startup Sakana AI has launched its first commercial product, Sakana Marlin, an autonomous AI research agent that works for up to eight hours to deliver deeply researched 100-page strategy reports and executive slides. The Sakana Marlin platform is designed exclusively for enterprise use and features a strict data policy ensuring customer inputs are never used for model training without consent.

OpenAI updated its platform deprecation notices for older GPT-5 and o3 model snapshots. Older GPT-5 and o3 snapshots will be removed from the API on December 11, 2026, while older GPT Image models and other legacy model families also have scheduled removals.

AI Research News

Researchers from multiple US Universities released SciAgentArena, a benchmark for evaluating AI agents in realistic scientific research scenarios. The benchmark includes roughly 200 tasks with stepwise verification, and an interactive agent-agnostic environment to assess AI agents. Benchmark results show that current agents are useful for well-specified data-analysis workflows but weaker at novel insight generation, exploration, and robust open-ended scientific reasoning.

Google DeepMind published “From AGI to ASI,” which defines Artificial Superintelligence (ASI) as systems surpassing large human organizations in capability and investigates the transition from AGI to ASI. It explores four potential development pathways: scaling, paradigm shifts, recursive self-improvement, and multi-agent collectives. Each path has bottlenecks, and AI progress may accelerate continuously rather than in a single step change. Their roadmap shows ASI is attainable in the near future, requiring a global effort to prepare for coming transformative societal shifts.

AI Business and Policy

SpaceX bought Cursor’s parent company, Anysphere, in a $60 billion stock deal. The acquisition followed the massive SpaceX IPO and consolidates the AI landscape, bringing Cursor’s AI coding application and data into xAI’s broader AI model and AI infrastructure efforts. This acquisition integrates Cursor’s user base into SpaceX’s AI unit while bolstering Cursor’s position, as Cursor’s market share among AI coding tools slipped to 26% amid intense competition from tools like Claude Code.

Enterprise software ecosystems are undergoing an aggressive shift in pricing structures as CIOs push back against traditional seat-based subscription models in favor of consumption-based or outcome-focused metrics. Because autonomous AI agents operate independently of human headcount, software vendors are rewriting their commercial terms to charge based on API token volume, compute utilization, or verified task completion.

DeepSeek raised more than $7.4 billion in a funding round that valued the company at more than $50 billion, making it the most valuable Chinese AI startup. Its founder, Liang Wenfeng, invested around $3 billion in the fundraise. He previously held nearly 90% of the company before the financing round. A government-backed fund invested around $150 million.

At the G7 meeting, French President Emmanuel Macron urged the U.S. to share cutting-edge AI and called for democratic cooperation on regulation, in the wake of U.S. restrictions on Anthropic’s Fable 5 and Mythos 5 models. Macron criticized unilateral restrictions on Anthropic’s models as too nationalist.

Likewise, European Commission President Ursula von der Leyen said it is in both U.S. and EU interests for Europe to have access to the best AI models. The EU wants shared access to frontier AI capabilities under common safety standards rather than a drift toward nationalist AI controls.

OpenAI CEO Sam Altman and other AI leaders supported an international coalition for AI safety standards at an AI CEOs and leaders meeting at the G7 summit, with AI tech leaders proposing international cooperation with democratic oversight over AI deployments.

Anthropic and Tata Consultancy Services announced a partnership to bring Claude to regulated industries. TCS will provide Claude to 50,000 employees in 56 countries, build Claude-powered products for financial services, healthcare, public-sector, aviation, telecom, and life-sciences clients, and join the Claude Partner Network.

Anthropic also announced a multi-year global alliance with DXC Technology. DXC will train tens of thousands of Claude-certified forward-deployed engineers and integrate Claude into systems used by banks, airlines, insurers, manufacturers, and government agencies. DXC notes that Claude was used to generate more than 95% of the code for DXC OASIS, its AI-native managed-services orchestration platform.

AI Opinions and Articles

Jeff Bezos argued AI will ultimately create labor shortages rather than mass unemployment. Speaking at VivaTech, Bezos framed AI as a productivity accelerator that will expand the economy and create new kinds of work, a sharply optimistic contrast to surveys showing widespread job-loss concern. His argument captures executive optimism that AI opens the door to more opportunities, rather than treating economic possibilities as static.

“I promise you every single person in this audience has had an idea for a new business or a new product or a new device that they wish they could manufacture, and that idea stayed in your head and went nowhere. And the reason it stayed in your head and went nowhere is because it’s too hard to do, and it wasn’t worth it.

If we can accelerate the dream build loop, all of the ideas will then become possible. And then we end up being limited not by our capabilities, but by our imaginations.” – Jeff Bezos

AI Week in Review 26.06.13

Patrick McGuinness — Sun, 14 Jun 2026 02:20:06 GMT

Figure 1. Still from video generated by Avataar AI’s Varya model, which offers low-cost AI video generation with an Indian cultural twist.

Top Tools

Fable 5’s capabilities exceed those of any model we’ve ever made generally available. It is state-of-the-art on nearly all tested benchmarks of AI capability ... The longer and more complex the task, the larger Fable 5’s lead over our other models.

Anthropic launched Claude Fable 5, its first public Mythos-class model, alongside the highly restricted Claude Mythos 5 model. Fable 5 is easily the most intelligent AI model yet released, with SOTA benchmarks on knowledge work (1932 on GDPval-AA), agentic coding (80.3% on SWE-Bench Pro, 88% on TerminalBench 2.1), reasoning (59% on Humanity’s Last Exam), and top positions across several capability leaderboards, including Agent Arena.

These models are excellent for long-horizon agentic work; use cases like Riley Brown re-implementing the whole Lovable interface in 2 prompts bear this out.

Figure 2. Fable 5 is a next-level state-of-the-art AI model for long-range agentic tasks.

To position Fable 5 as a premier autonomous agentic system made safe for general use, Anthropic introduced significant safeguards, redirecting high-risk cyber, biology, chemistry, and model-distillation requests away from the frontier model to Opus 4.8. They shared details in their Claude Fable 5 and Claude Mythos 5 System Card.

But Anthropic went further in the original Fable 5, silently sabotaging Fable 5 on requests that might relate to competitive AI development. This hidden output degradation led to significant backlash and criticism about trust and evaluation integrity. Anthropic then reversed course and changed the behavior so flagged requests visibly fall back to Opus 4.8 with explicit reasons for API users.

But for now, Fable 5 is gone. Anthropic abruptly suspended access to Claude Fable 5 and Claude Mythos 5 after it received a U.S. government export-control directive barring access by foreign nationals. Because the order applied broadly to foreigners, including foreign-national Anthropic employees, the company said it had to disable the models for all customers.

In his recent “Policy on the AI Exponential” blog post, Dario Amodei advocated for an Advanced AI Framework for overseeing models that would allow Government to block AI model releases. However, Anthropic insists that the Government’s ban on Fable 5 is based on a misunderstanding of its model’s risks due to a report of a jailbreak. They say, “We believe this is a misunderstanding and are working to restore access as soon as possible.”

Frontier AI models, like airplanes, should be required to go through technical testing and auditing, and their release should be blocked or reversed as a threat to public safety if they do not meet high standards of safety – Dario Amodei

AI Tech and Product Releases

Apple used WWDC26 to unveil their next generation of Apple Intelligence and a rebuilt Siri AI.

Apple announced its third generation of Apple Foundation Models, a family of five custom models developed in collaboration with Google. The AFM 3 models include on-device models and cloud models:

AFM 3 Core, a 3B dense model for on-device use.
AFM 3 Core Advanced, a multimodal 20B parameter sparse MoE for multimodal device tasks.
AFM 3 Cloud, a server-side workhorse model for speed and performance.
AFM 3 Cloud Image, for image generation and editing in photo-editing and Image Playground.
AFM 3 Cloud Pro, for demanding agentic and reasoning use cases.

The AFM 3 framework is designed to power contextual, multi-platform AI experiences across the Apple ecosystem, leveraging both on-device hardware and Apple’s secure private cloud servers.

To support custom AI, Apple is also introducing Core AI, a new framework for running custom AI models on Apple silicon and Apple devices.

Apple presented the new Siri AI as far more capable at using personal context, app actions, and on-screen information, and more deeply integrated across Apple’s products - iPhone, iPad, Mac, Apple Watch, and Vision Pro. The new Siri AI adds web-based world knowledge and Visual Intelligence, and it is built around App Intents and App Schemas so apps can expose content and actions in natural language. This helps the new Siri AI handle multi-step requests like finding specific photos, organizing emails, and taking actions across apps.

Apple updated Image Playground with their newest AI model, AFM 3 Cloud Image, so it is capable of producing improved photorealistic and stylized graphics, more in line with competitive image generation tools.

Google launched DiffusionGemma, an experimental 26B Mixture of Experts (MoE) model with 3.8B active parameters that uses text diffusion and is released under an open-source Apache 2.0 license. DiffusionGemma utilizes text diffusion to generate 256-token text blocks simultaneously, which yields up to six times faster local generation speeds (over 1,000 tokens per second on a single H100). The speed makes it ideal for interactive workflows like in-line editing of code and documents, but it has lower overall output quality compared to the 26B Gemma 4 model.

Google introduced Gemini 3.5 Live Translate, a speech-to-speech model for near real-time voice translation in more than 70 languages. Gemini 3.5 Live Translate features a single continuously streaming audio model rather than a stitched pipeline of speech recognition, translation, and text-to-speech components. This model preserves tone, pacing, and expressiveness while translating with sub-second latency. This update is rolling out to Google Translate and Google Meet for live translation, with developer integrations available via the Gemini Live API.

Moonshot AI released Kimi K2.7-Code, an open-source update on the Kimi K2 1T parameter MoE architecture that claims a 30% reduction in thinking-token usage compared to its K2.6 predecessor. K2.7-Code is an open weights model available on HuggingFace and via the Kimi Code platform. While Moonshot AI reports performance gains on internal benchmarks, independent evaluations on KernelBench-Hard showed regressions in specific GPU kernel optimization tasks. Elliot Arledge’s benchmarking assessment is “K2.7 is more honest but not more capable” than its K2.6 predecessor on CUDA kernel coding.

Cohere released North Mini Code, an open-source 30B agentic coding model aimed at developers who want AI coding agents that can be run and improved outside closed proprietary systems. The model is the company’s first model for developers and is available on HuggingFace under the Apache 2.0 license.

Google introduced new Gemini features tailored for small businesses, including a direct Google Business Profile connection and proactive Business notebooks. The update allows Gemini to integrate with Google Business Profiles to access customer reviews, questions, and performance data. New Business notebooks provide a centralized space to organize workflows and generate content based on specific business context.

Cognition launched FrontierCode, a tougher coding benchmark designed to measure whether an AI-generated pull request is production-quality. Built from 150 original tasks, the benchmark emphasizes evaluation criteria such as scope control, regression safety, and test quality. Fable 5 posted the highest score (46.3%) versus Opus 4.8 (34.3%) and GPT-5.5 (25.5%).

Figure 3. FrontierCode gives us a new level-up for AI coding benchmarks, evaluating AI coding models on end-to-end production-worthy code generation.

Xiaomi’s MiMo AI team has open-sourced MiMo Code V0.1.0, a terminal-native AI coding harness that Xiaomi claims can outperform Claude Code on long tasks. The assistant utilizes a cross-session memory architecture with a dedicated checkpoint-writer subagent to maintain context during long-horizon, multi-step tasks.

Avataar AI from India launched a new video model called Varya that uses distillation from Alibaba’s Wan 2.2 and generates video ten times faster than the original at a cost of under a penny per second. Varya captures Indian cultural nuances through tuning on curated data for the India market. It can be accessed at the Varya platform and will be released as an open-weight model on India’s AIKosh portal.

Deezer introduced a tool to identify AI-generated tracks in streaming playlists. The free online detector supports 27 languages and scans music from 20 platforms, including Spotify, Apple Music, and YouTube Music. Deezer reports that 44% of all new music uploaded to its platform is AI-generated.

AI Research News

Google recently published “Accelerating scientific discovery with Co-Scientist” in Nature, an account of how the Co-Scientist AI system is designed to help solve complex problems in the life sciences. The tool uses specialized agents to generate, debate, and refine new hypotheses through three distinct phases of idea generation, peer review, and refinement. One case of Co-Scientist was helping to identify new drug repurpose candidates and synergistic combination therapies for acute myeloid leukemia.

UC Berkeley researchers launched Agents’ Last Exam (ALE) Benchmark, which evaluates AI agents on long-horizon professional workflows. OpenAI’s GPT-5.5 leads the ALE Leaderboard with a score of 24.0%, beating Anthropic’s Claude Fable 5.

Artificial Analysis developed AA-AgentPerf, a new hardware performance benchmark that measures how many concurrent AI agents a system can sustain. Nvidia’s GB300 sets a new standard for agentic AI workload performance, over 20 times better than the H200.

Figure 4. Nvidia’s GB300 is the clear leader in serving AI agentic workloads, with 20 times the capacity of the prior generation H200.

AI Business and Policy

SpaceX launched their IPO into the stock market stratosphere, rising on its debut to a valuation over $2 trillion by the market close. SpaceX is rising in part due to xAI, the Colossus AI data center it is renting to Anthropic, and its claims of pursuing a $4 trillion AI opportunity.

SpaceX is combining their AI and space opportunity with a proposed satellite designed to host AI supercomputers in orbit. The engineering specs describe about 150 kW peak power per satellite, and Musk claims the challenge is not harder than some other things they are doing. I doubt this idea will come to fruition soon; it seems it has been hyped up lately for IPO buzz.

OpenAI announced that it had confidentially submitted a draft S-1 to the SEC, giving the company the option to go public while emphasizing that timing has not been decided. The announcement said OpenAI expects the filing to leak and still sees tradeoffs between remaining private and preparing for a public offering.

OpenAI announced that it will acquire Ona, a company focused on secure cloud execution and orchestration. Ona’s technology will help Codex expand from a session-bound developer tool into a persistent agent environment that supports long-running software and knowledge-work tasks.

OpenAI and Oracle announced that OCI customers will be able to access OpenAI models and Codex using existing Oracle cloud commitments.

OpenAI announced support for the EU Code of Practice on Transparency of AI-generated content, tying the move to its provenance and content-authenticity work. This guides AI providers in standards and tools to help users distinguish synthetic media from human-created content.

Apple said that Siri AI will be delayed on iOS 27 and iPadOS 27 in the European Union because of the Digital Markets Act. Apple said EU users will still be able to access Siri AI on macOS 27 and visionOS 27, but that iPhone, iPad, and watchOS access will not arrive on the same timeline because of unresolved regulatory concerns.

OpenAI published a threat report claiming PRC-linked influence operations are targeting AI debates in the United States, including around data center buildout. The OpenAI threat report frames the incidents as Chinese-based covert influence operations aimed at shaping political and public opinion. The report relates this to a spike in anti-datacenter social media activity and says OpenAI identified accounts using ChatGPT as part of those campaigns.

Anthropic introduced Claude Corps, a national fellowship program for early-career people interested in using AI for public benefit and community impact.

AI Opinions and Articles

Reuters reports on broad anxiety about AI among the US public, citing a Reuters/Ipsos poll showing high levels of concern about AI use and job displacement. With AI usage skyrocketing and OpenAI and Anthropic moving toward public listings, the investor appetite for AI companies is colliding with greater public unease about AI’s economic effects.

If you have AI fears or AI FOMO, the best move is to learn AI. OpenAI introduced three new OpenAI Academy courses: AI Foundations, Applied AI Foundations, and Agents and Workflows. The courses are available to ChatGPT users and will help individuals and organizations move to using AI in repeatable AI workflows and agent-assisted work.

AI Week in Review 26.06.06

Patrick McGuinness — Sat, 06 Jun 2026 21:56:19 GMT

Figure 1. Image generation from Reve’s Reve 2, which uses a layout representation to combine high-quality details and fine-grained control on image outputs.

Top Tools

Beyond these models, we’re building a superintelligence lab – a system and an approach we believe will define the next phase of AI. - Microsoft AI

The top AI model announcement for this week has been Microsoft announcing a new family of seven in-house MAI models at Build 2026. The new AI model lineup spans reasoning, coding, image generation, transcription, and voice, and includes MAI-Thinking-1, MAI-Code-1-Flash, MAI-Image-2.5, MAI-Transcribe-1.5, MAI-Voice-2, and Flash variants for image and voice.

MAI-Thinking-1, Microsoft’s flagship reasoning model, is a Mixture of Experts model with 1T total parameters and 35B active parameters. Comparing it to Sonnet 4.6, Microsoft says it was trained from the ground up without third-party distillation and is competitive in its class on coding (52.8% on SWE-Bench Pro, but only 46% on Terminal Bench 2.0) and mathematical reasoning benchmarks.

Almost as impressive as the model itself is Microsoft’s 109-page technical report “MAI-Thinking-1: Building a Hill-Climbing Machine,” which shares details on the model architecture and how Microsoft AI trained their model. This is the most open an American AI lab has been about their work in some time.

Microsoft also highlighted MAI-Image-2.5, including a Flash variant, as its new image model that ranks number two on Arena for image editing. Microsoft is rolling it out to support PowerPoint visuals and OneDrive Photos editing tools.

Some of Microsoft’s other Build announcements:

Microsoft unveiled the Surface RTX Spark Dev Box that uses Nvidia’s RTX Spark superchip to enable users to run powerful AI models on local Windows machines.
Microsoft introduced their ‘always on’ AI agent Microsoft Scout, their entry into the general-purpose local AI agent category, based on the OpenClaw framework and OpenShell.
Microsoft introduced Work IQ APIs, a context layer for autonomous task execution and enterprise customization that leverages Microsoft’s ecosystem with Work, Fabric, Foundry, and Web IQ components.
Microsoft unveiled the Majorana 2 quantum chip that features qubits that are 1,000 times more reliable than previous generations.
Microsoft introduced Microsoft Execution Containers (MXC) to secure AI agents. MXC is a policy-driven execution layer built into Windows that allows developers and administrators to define sandbox environments and enforce access boundaries for AI agents.

The bigger picture is that Microsoft is making a strategic shift toward in-house superintelligence development, directly challenging leading AI labs by becoming one. Microsoft also found itself behind the curve with its chatbot-based Copilot suite and is now trying to catch up with AI agent offerings and support.

AI Tech and Product Releases

Nvidia made several announcements at Computex, including several new AI models. Nvidia released Nemotron 3 Ultra, a 550B parameter sparse MoE open-weights model with 55B active parameters, designed for long-context and agentic workloads. This Mixture-of-Experts model utilizes a hybrid Transformer-Mamba architecture, which supports a 1-million-token context window as well as faster and lower-cost inference for agentic workloads. Nemotron 3 Ultra is being released with model weights, training assets, datasets, and related tooling. It is available on Amazon SageMaker JumpStart and other platforms.
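To make the sparse-MoE figures concrete, the share of parameters active per token can be computed directly from the counts quoted in this issue. A minimal sketch: it uses only the reported total and active parameter counts (in billions) and says nothing about real compute or memory cost, which depend on details not covered here.

```python
# (total, active) parameter counts in billions, as quoted in this issue.
MODELS = {
    "Nemotron 3 Ultra": (550, 55),
    "MAI-Thinking-1": (1000, 35),
    "Mellum2": (12, 2.5),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Fraction of a sparse MoE model's parameters used per token."""
    return active_b / total_b


for name, (total, active) in MODELS.items():
    share = active_fraction(total, active)
    print(f"{name}: {share:.1%} of parameters active per token")
```

By this measure Nemotron 3 Ultra activates about a tenth of its weights per token, which is the basic reason sparse MoE models can be large in total size yet comparatively cheap to run.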

Nvidia also released Nemotron 3.5 ASR, a 600M parameter multilingual streaming speech recognition model. This model uses a cache-aware FastConformer-RNNT architecture to deliver high-quality speech-to-text in both streaming and batch mode transcriptions. Nemotron 3.5 ASR supports 40 language locales and adds punctuation and capitalization to transcripts.

Nvidia announced RTX Spark, a new Windows PC platform built for local AI agents. Built on the same hardware used in Nvidia’s DGX Spark, the Windows RTX Spark delivers up to 1 petaflop of AI performance and up to 128GB of unified memory, offering the ability to run large and powerful AI models on Windows laptops and compact desktops.

We wrote more on RTX Spark in “RTX Spark: AI Comes Home to the PC” as well as other Nvidia announcements, including the release of Cosmos 3, the latest iteration of their open-source frontier omni model for physical AI, which integrates world generation, physical reasoning, and action generation into a single framework.

MiniMax announced M3, a native multimodal AI model with a 1 million token context window that is frontier-class at coding and agentic AI. MiniMax touts M3’s impressive benchmarks such as 59.0% on SWE-Bench Pro and 66.0% on Terminal Bench 2.1, competing with Gemini 3.1 Pro and GPT 5.5. MiniMax promises a fuller technical release with open weights in 10 days. When it does release, it will be SOTA for open weight AI models. In the meantime, access is via their API (at $0.60/$2.40 per million input/output tokens) and on the MiniMax platform.

OpenAI upgraded GPT-Rosalind for life sciences with new capabilities aimed at enterprise-scale biology, genomics, medicinal chemistry, and drug-discovery workflows. The new GPT-Rosalind model release pairs GPT-5.5-style coding and tool use with life-sciences reasoning and adds research and analysis plugins in Codex. This positions GPT-Rosalind as a domain-specific scientific AI workbench with provenance, tools, and controlled access. OpenAI is expanding access to eligible research organizations through a trusted-access model.

OpenAI is significantly updating its Codex agentic AI platform for non-coding knowledge work. Codex is adding six role-specific plugins - data analytics, creative production, sales, product design, public-equity investing, and investment banking - that integrate over 60 business applications, such as Salesforce and Figma, to automate complex enterprise workflows.

Codex is also being updated with a Sites feature for hosting interactive, semi-private web applications and an Annotations feature for in-place content editing and refinement. These enhancements are designed to expand Codex’s utility for non-technical workplace users by deeply integrating it with existing professional tools and workflows. Codex has more than 5 million weekly users and non-developers make up about 20% of usage.

Google released Gemma 4 12B, a new natively open multimodal model in the Gemma 4 family. With 12B parameters, performance that is SOTA for its size, and ability to process audio natively, it is a great laptop-runnable AI model for local AI use. Google also updated their Gemma 4 lineup with Quantization-Aware Training (QAT) to reduce memory footprints and help quantized Gemma models perform better.

Ideogram released Ideogram 4.0 as its first open-weight text-to-image diffusion transformer foundation model. The model is a 9.3B-parameter text-to-image system trained from scratch; it uses Qwen3-VL-8B-Instruct as its text encoder and is built around structured JSON prompts with optional layout and color controls.

Reve released Reve 2, a new text-to-image model centered on layout-aware generation and editing. As Reve says:

Layout is a structured, hierarchical description of an image where every element has a location, a size, a local description, and other optional attributes like image references or color. A layout is an image’s backbone — separating semantic intent from pixel rendering, much like HTML is to a webpage or SVG to a vector image.

Reve says the system separates planning from rendering, represents images in a structured form that makes individual elements addressable, and renders at native 4K resolution for more precise editing control.

Figure 2. Reve defines an image using a layout structure that can be directly edited. With this, users can control the creation and editing of images more precisely.
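To make the idea of a layout concrete, here is a minimal Python sketch of a hierarchical layout in which every element has a location, a size, a local description, and optional attributes. The class and field names (`LayoutElement`, `children`, `find`) and the example scene are illustrative assumptions, not Reve's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LayoutElement:
    """One addressable element of a hypothetical image layout."""
    name: str
    description: str              # local description of this element
    x: float                      # left edge, as a fraction of image width
    y: float                      # top edge, as a fraction of image height
    width: float
    height: float
    color: Optional[str] = None   # optional attribute
    children: list["LayoutElement"] = field(default_factory=list)

    def find(self, name: str) -> Optional["LayoutElement"]:
        """Depth-first search for an element by name."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None


scene = LayoutElement(
    name="scene", description="a poster for a jazz festival",
    x=0.0, y=0.0, width=1.0, height=1.0,
    children=[
        LayoutElement("title", "bold headline text", 0.1, 0.05, 0.8, 0.15, color="#ffcc00"),
        LayoutElement("saxophonist", "musician in silhouette", 0.3, 0.3, 0.4, 0.6),
    ],
)

# Editing the layout changes intent without touching pixels: move the title down.
title = scene.find("title")
title.y = 0.8
```

Because each element is addressable by name, an edit such as moving the title touches one node of the tree, and a renderer (not shown) would then regenerate the image from the updated structure.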

LMArena launched Agent Arena, a benchmark and comparison platform for AI agents rather than one-shot chat prompts. The platform evaluates agents built from models, tools, and frameworks across real-world tasks, and the launch included a public release of 2,000 pairwise agent battles and user preference data.

ChatGPT is rolling out a new memory architecture called Dreaming, a more scalable memory synthesis system for ChatGPT designed to keep user context fresh, relevant, and correct over longer time periods. OpenAI says Dreaming improves how ChatGPT carries forward context, follows user preferences, and stays current. Memory for AI is evolving from explicit saved notes toward automated synthesis across conversations. The feature is available to Plus and Pro users in the U.S., with broader rollout planned for coming weeks.

JetBrains open-sourced Mellum2, a 12B MoE model with 2.5B active parameters per token that is positioned for production AI workloads such as routing, summarization, and intermediate reasoning over natural language and code. The model was built from scratch and released under Apache 2.0 with weights available on HuggingFace.

H Company launched Holo3.1, a family of local computer-use agent models ranging from 0.8B to 35B-A3B. H Company says the release improves robustness across web, desktop, and mobile environments, and raises AndroidWorld results from 67% to 79.3% on its 35B-A3B model.

Alibaba released Qwen3.7-Plus, a multimodal model with frontier-level performance and a 1-million token context window. It is 60% cheaper than the previous text-only Qwen3.7-Max, but the release marks a departure from Alibaba’s open-source strategy, as the model is available only via proprietary APIs.

AI Research News

Anthropic published “When AI builds itself,” a report on recursive self-improvement and AI-driven AI R&D. AI has progressed from chatbot to single agents to multiple autonomous AI agents, and now Anthropic lays out the next step, where AI agents “close the loop” and AI development becomes substantially automated. They show early evidence of this trend by noting Anthropic itself is shipping 8 times more code per person than in previous years. This near-future AI trend implies both AI acceleration and huge leaps in productivity in some companies.

Figure 3. Anthropic engineers are becoming vastly more productive by leveraging AI tools. Most of Claude Code is written using Claude Code itself.

AI Business and Policy

Anthropic has filed confidentially for an IPO following an oversubscribed $65 billion fundraise at a $965 billion valuation. The IPO depends on SEC review and market conditions, but it is expected later this year and could be the second trillion-dollar IPO following the SpaceX IPO. Anthropic’s annualized revenue surpassed $47 billion in May, up from roughly $9 billion at the end of 2025.

SpaceX is becoming a hyperscale AI compute provider as it preps for its historic IPO on June 12. SpaceX has secured a deal to provide Google with approximately 110,000 Nvidia GPUs and related components from October 2026 through June 2029. Google will pay $920 million per month to secure bridge capacity for surging demand on its Gemini Enterprise AI platform. The deal follows a similar computing agreement between Anthropic and SpaceX.
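For a sense of scale, the reported terms can be turned into rough totals. The Python sketch below assumes the payment is a flat $920 million in every calendar month from October 2026 through June 2029 inclusive, and it attributes the whole payment to the GPUs even though the deal also covers related components, so the per-GPU figure is an upper bound rather than a reported price.

```python
from datetime import date


def months_inclusive(start: date, end: date) -> int:
    """Count calendar months from start's month through end's month."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


MONTHLY_PAYMENT_USD = 920e6   # reported monthly payment
GPU_COUNT = 110_000           # reported approximate GPU count

months = months_inclusive(date(2026, 10, 1), date(2029, 6, 1))
total_usd = MONTHLY_PAYMENT_USD * months
per_gpu_month_usd = MONTHLY_PAYMENT_USD / GPU_COUNT

print(f"{months} months, ${total_usd / 1e9:.2f}B total")
print(f"about ${per_gpu_month_usd:,.0f} per GPU per month")
```

Under these assumptions the contract runs 33 months and totals roughly $30 billion.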

OpenAI called for global action on youth AI safety through a dedicated AI Safety Institute. OpenAI is advocating for the establishment of an international institute to provide continuous oversight and standardized guidance for youth AI safety.

President Trump signed an executive order on AI innovation and AI security. The new Executive Order NSPM-11 emphasizes promoting appropriate AI adoption and AI innovation for national security, while coordinating with the private sector on security risks. It also directs federal agencies to prioritize AI-related cybersecurity, establish an AI cybersecurity clearinghouse with voluntary industry collaboration, and expand federal cybersecurity hiring pathways.

Senior U.S. officials have been discussing with AI firms such as OpenAI the potential for the federal government to acquire shares in those companies. Giving the U.S. government equity stakes in AI companies could have seismic consequences. While the arrangement could fund public purposes like dividend payments, critics warn that government ownership could create conflicts of interest in technology regulation.

OpenAI and other AI leaders have backed synthetic DNA screening rules, with executives and scientists from leading US AI labs signing a letter urging U.S. lawmakers to require screening of synthetic DNA and RNA orders. The DNA-screening letter expresses the concern that advanced AI is lowering the knowledge barrier for designing dangerous biological materials, making gene-synthesis oversight more urgent.

Character.AI and Google reached settlements with families over teen suicide claims, notifying a Florida federal court of a mediated settlement to resolve all claims. The litigation included a high-profile lawsuit alleging that the chatbot encouraged a 14-year-old to commit suicide.

Meanwhile, OpenAI is responding to a lawsuit filed by the family of a teenager who died by suicide, arguing that the chat logs used in the allegations ‘require more context.’ They also claim the chatbot frequently directed the user to crisis resources and that the incident resulted from improper use of the platform.

The Linux Foundation unveiled plans for the Tokenomics Foundation to address rising AI token costs. The Tokenomics Foundation charter is to establish open industry standards, benchmarks, and best practices for the economical use of AI infrastructure. The new standards body aims to establish a framework for tracking, auditing, and optimizing AI token usage and billing.

The New York State legislature passed a one-year moratorium on large new data centers. The bill directs an environmental agency to assess the electricity, water, and land usage of large-scale facilities.

Meta has built data centers in tents to accelerate AI infrastructure deployment. The company has constructed six “rapid deployment structures” in New Albany, Ohio, to reduce construction time by half. These structures will house AI chips and are powered by 200 megawatts of modular gas turbines.

RTX Spark: AI Comes Home to the PC

Patrick McGuinness — Thu, 04 Jun 2026 03:39:27 GMT

Figure 1. Nvidia CEO Jensen Huang shows off RTX Spark laptops.

Jensen’s Computex Keynote

“It started with a spark, an idea to reimagine the PC for the first time in 40 years. For the age of AI, what becomes of our personal computer in a world of agents?” – Jensen Huang

Nvidia CEO Jensen Huang gave another gangbuster keynote at the Computex conference in Taiwan on Monday, declaring “useful AI has arrived” and calling this the Age of Agents, a new computing era where AI is delivered through running AI agents that reason, plan, and use tools.

Jensen Huang had real announcements behind his hype, introducing several new AI models, AI chips, and systems. Some of the key highlights:

Nvidia announced the Vera CPU, explicitly designed for AI agent use and able to manage massive graphics processing units (GPUs) inside localized agentic loops. Vera integrates 88 custom “Olympus” ARM-based cores and utilizes LPDDR5X memory to achieve 40% lower peak memory latency, 50% faster core-to-core communication, and 1.8x performance over prior CPUs.
Nvidia announced that the Vera Rubin system is now in full production. Vera Rubin is Nvidia’s latest generation of AI supercomputer that combines GPUs and CPUs specifically engineered for agentic AI. It marks a major milestone for Nvidia, the dominant AI infrastructure company.
To support AI infrastructure buildout, Nvidia introduced the DSX blueprint, a reference design for building and operating AI factories that integrates hardware, cooling, power, and networking for maximum revenue-generating compute.
Nvidia introduced Nemotron 3 Ultra, an open-source Mixture of Experts (MoE) AI model featuring 550B total parameters and 55B active parameters designed for agentic use. Nemotron 3 Ultra utilizes a novel hybrid State Space Model (SSM) architecture that makes it faster and cheaper for AI inference, yet with performance comparable to leading open AI models like MiniMax M2.7.
Nvidia is providing an enterprise toolkit that includes models, orchestration harnesses like Open Shell, and access to CUDA X libraries that act as specialized skills for agents, to empower AI users to build their own agent harnesses and systems.
Nvidia continues to push AI models and tools for physical AI, including updates to the Isaac GR00T platform and a reference design for humanoid robotics. For the robotics, autonomous vehicle, and physical system sectors, Nvidia launched Cosmos 3, an “Omni” multimodal foundation model trained on 20 trillion tokens of images, audio, video, action data, and text. They also announced the Alpamayo 2 open model designed for reasoning in autonomous vehicles, which enables level 4 autonomous driving for robotaxis.

However, the most interesting announcement was the RTX Spark Superchip and the Agentic PC, Nvidia’s effort to reinvent the personal computer with a 1 petaflop superchip that combines a Blackwell GPU and Grace CPU to run local AI agents natively.

RTX Spark Details

“We’re reinventing the personal computer, for creating, for gaming, for agents. This is the dawn of a new personal computing revolution, and it starts with NVIDIA RTX Spark.” – Jensen Huang

Fabricated on TSMC’s 3-nanometer process and packing 70 billion transistors, the system-on-chip RTX Spark merges a Blackwell-architecture RTX GPU featuring 6,144 CUDA cores with a custom 20-core Grace CPU. Boasting one petaflop of local AI performance and 128 GB of high-bandwidth unified memory, the hardware allows personal AI assistants to run locally, securely, and continuously inside Windows agent sandboxes.

Designed in collaboration with Microsoft and MediaTek, RTX Spark is engineered to support agent-centric computing on consumer PCs and laptops, with integrated GPU performance comparable to a dedicated desktop RTX 5070 graphics card. Achieving that tier of graphics compute on an integrated platform represents a massive step forward for thin-and-light Windows laptops.

Consumer laptops and PCs can be equipped with 32 GB of RAM for general productivity and gaming, but top-tier configurations with 128 GB of unified memory can run larger AI models and agentic AI systems natively.
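A rough way to see why the memory tier matters is to estimate how much memory a model's weights alone occupy at a given precision. The Python sketch below is a back-of-the-envelope estimate under stated assumptions: it counts only weights (parameters times bits per weight), treats 1 GB as 1e9 bytes, and reserves 20% of memory as headroom for the operating system, KV cache, and activations. The headroom figure and function names are illustrative assumptions, not Nvidia guidance; the parameter counts are ones quoted elsewhere in this newsletter.

```python
def weights_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: int,
         memory_gb: float, headroom: float = 0.8) -> bool:
    """True if the weights fit in the assumed usable share of memory."""
    return weights_gb(params_billion, bits_per_weight) <= memory_gb * headroom


# (model, parameters in billions, bits per weight)
examples = [
    ("12B model at 16-bit", 12, 16),
    ("12B model at 4-bit", 12, 4),
    ("218B model at 4-bit", 218, 4),
    ("550B model at 4-bit", 550, 4),
]

for label, params, bits in examples:
    size = weights_gb(params, bits)
    print(f"{label}: {size:.0f} GB, "
          f"fits 32 GB: {fits(params, bits, 32)}, "
          f"fits 128 GB: {fits(params, bits, 128)}")
```

By this estimate a 12B model fits comfortably even in a 32 GB machine, while the largest open models remain out of reach of a single 128 GB device even at 4-bit precision.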

With its unified memory design and integrated GPU, Nvidia’s RTX Spark gives the Windows ecosystem performance and efficiency competitive with Apple’s M-series chips. Finally, there will be an AI PC that can give Apple’s M5-equipped MacBook Pro a run for its money.

Figure 2. The heart of the AI PC is the RTX Spark, a Grace-Blackwell CPU-GPU SoC that supports 128 GB of unified memory to run large AI models.

The AI PC, Again

And the thing that I am just incredibly pleased, incredibly honored is that 100% of the world’s PC industry has joined us to reinvent the PC. A new line, a new beginning. – Jensen Huang

This isn’t the first time an ARM-based Windows PC has been attempted. The first ARM Windows device was Surface RT in 2012; later came Windows 11 on ARM powered by Qualcomm’s Snapdragon chips.

Nor is it the first AI PC. After the ChatGPT moment, Intel launched their bid for the AI PC chipset while Microsoft offered the Copilot+ PC. The problem was that these prior efforts lacked the performance needed for rigorous AI workloads and were eclipsed by the better performance of Apple’s MacBooks.

Nor is it Nvidia’s first attempt at an AI desktop box. Nvidia has been shipping the DGX Spark since 2025, a Linux-based desktop box for AI that uses a GB10 Grace-Blackwell chip with similar specs to the RTX Spark.

RTX Spark brings the Windows OS that DGX Spark lacks, turning Nvidia’s Grace-Blackwell superchip into an engine for Windows laptops and PCs. RTX Spark is a significant performance boost over prior ARM PC efforts, which gives it a better chance to crack that market.

One thing in favor of RTX Spark is that it’s built for gaming and graphics applications as well as AI workloads. Leveraging Nvidia Blackwell-class GPUs and its CUDA stack, the RTX Spark offers Deep Learning Super Sampling (DLSS) upscaling and frame generation technologies to accelerate games. Microsoft has worked with Windows tools developers to make their apps work on ARM-based Windows machines. Nvidia has also pushed Adobe to rebuild its Adobe Premiere Pro editing engine around GPU-accelerated computing rather than relying heavily on the CPU.

There are still many open questions. Reviewers have noted that Nvidia did not share full benchmarks, leaving users to assess gaming performance from demo applications. Microsoft has announced the Surface Laptop Ultra, but neither it nor other PC makers will deliver devices until later this year.

The main barrier to adoption could be cost. DGX Spark with the GB10 chip and 128 GB of memory already costs $4700, and a similarly-loaded RTX Spark will undoubtedly be in the same range, putting it out of reach for many buyers. On the other hand, cutting the memory down to avoid high component pricing will limit the system’s utility with AI models.

Conclusion

With a market cap of about $5.4 trillion, Nvidia is worth more than any company on the planet, almost $1 trillion above its closest U.S. peers. It has placed itself at the center of the AI revolution by making critical and correct bets about the technology stack needed to support AI.

Nvidia has earned huge success building the core GPU chips and the AI supercomputers that go into AI factories in data centers. Now, Nvidia is going after a market led by others, the PC market, bringing AI home to the PC. Wall Street is recognizing the threat it poses to others; this announcement at Computex on Monday moved Nvidia up and Intel and AMD down.

Still, behind the hype, what RTX Spark is really doing is bringing an ARM-based Grace-Blackwell CPU-GPU integrated SoC to Windows. Is this what users need? I think so, but buyers will decide, and devices won’t be out until later this year, when they’ll compete with Apple’s M6. It’s not a sure thing.

One Reddit commenter opined:

It [RTX Spark] is a pipe cleaner product - attempt to get this “NV-made Arm-based SoC for laptops” to market and be a thing for software developers. It is late, underpowered and probably overpriced (based on DGX Spark pricing and current RAM & storage prices) but it lays the groundwork for the next gen chip that might be better.

Whatever the market reaction, Nvidia will keep pursuing it, because Jensen’s vision of an “AI supercomputer for your home AI agent” is on target, and Nvidia will keep iterating until they get it right.

If and when it takes hold, we will likely see the end of the desktop PC graphics card and the rise of unified memory, so larger AI models can be run on laptops and PCs. We will also see a class of home devices used not as personal computers but as Agent Computers, running OpenClaw or Hermes 24/7.

Nvidia moving beyond the data center and moving to the edge tells us something important: AI is reinventing the PC just like it is reinventing every part of the information technology stack. Smaller devices like phones and laptops will run advanced AI models and AI agents, literally bringing AI home.

AI Week in Review 26.05.30

Patrick McGuinness — Sun, 31 May 2026 00:01:31 GMT

Figure 1. Photo-realistic output from MAI Image 2.5.

Top Tools

One of the most prominent improvements in Opus 4.8 is its honesty. … Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims. … Opus 4.8 is around four times less likely than its predecessor to allow flaws in code it has written to pass unremarked. - Anthropic

Anthropic released Claude Opus 4.8, a frontier AI model upgrade to Opus 4.7, with stronger coding, agentic, and professional work performance. On benchmarks, Opus 4.8 achieves a state-of-the-art 1890 on GDPval-AA for knowledge work and 69.2% on SWE-Bench Pro. Anthropic also touts improvements in its alignment and honesty, with the model less likely to hallucinate success or make unverified claims.

The improvements over Opus 4.7 are solid but incremental, and they have kept standard pricing unchanged from the prior version. They also launched effort controls and a faster and cheaper fast mode that can be used for high-throughput workloads.

Figure 2. Claude Opus 4.8 has continued to advance the frontier of AI models with SOTA performance on coding, reasoning and knowledge work tasks, making it a great AI model for use in Claude Code and Claude Cowork.

Anthropic also launched dynamic workflows in Claude Code for large tasks such as codebase-scale migrations. When users prompt for a complex task, Claude breaks the target down into subtasks and assigns sub-agents to the work. Claude dynamically runs tens to hundreds of parallel sub-agents in a single session, checking its work via internal agent critique before final output.
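The pattern described here (split a large task, fan out to parallel sub-agents, review the results, then merge) can be sketched in a few lines of Python with `asyncio`. This is a generic illustration of fan-out and fan-in under stated assumptions, not Claude Code's implementation: `run_subagent`, `critique`, and `orchestrate` are hypothetical stand-ins that make no model calls, and the concurrency cap of 8 is arbitrary.

```python
import asyncio


async def run_subagent(subtask: str) -> str:
    """Stand-in for a sub-agent; a real one would call a model and tools."""
    await asyncio.sleep(0)  # yield control, simulating I/O-bound work
    return f"done: {subtask}"


def critique(result: str) -> bool:
    """Stand-in review step: accept any result that reports completion."""
    return result.startswith("done:")


async def orchestrate(subtasks: list[str], max_parallel: int = 8) -> list[str]:
    """Run sub-agents concurrently, capped at max_parallel, then filter."""
    limit = asyncio.Semaphore(max_parallel)

    async def guarded(subtask: str) -> str:
        async with limit:
            return await run_subagent(subtask)

    # gather() returns results in the same order as the submitted subtasks.
    results = await asyncio.gather(*(guarded(s) for s in subtasks))
    return [r for r in results if critique(r)]


subtasks = [f"migrate module {i}" for i in range(20)]
accepted = asyncio.run(orchestrate(subtasks))
print(len(accepted), "subtasks accepted")
```

The semaphore is the key design choice: it lets the orchestrator queue hundreds of subtasks while bounding how many run at once, which matters when each sub-agent consumes rate-limited model capacity.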

Figure 3. Claude Opus 4.8 improves on alignment, close to the alignment of Mythos Preview.

AI Tech and Product Releases

OpenAI launched Rosalind Biodefense, which gives trusted developers sponsored access to GPT-Rosalind for defensive biology work, including epidemiological modeling, early detection, screening, preparedness, diagnostics, and medical-countermeasure development. OpenAI is also expanding trusted access to selected U.S. government and allied public-health and biodefense partners.

Mistral introduced Search Toolkit in public preview. The open-source framework unifies ingestion, retrieval, and evaluation for production search pipelines used in AI applications. Mistral’s pitch is that teams should spend less time wiring together search infrastructure and more time improving retrieval quality; the toolkit can run in cloud, on-premises, or edge environments.

Mistral launched Vibe as its live agent product and main AI interface, available through Mistral’s chat interface and mobile apps. Vibe now replaces Le Chat and is absorbing prior Le Chat history, plans, and settings inside Chat mode. Vibe has a Work Mode AI agent for complex, multi-stage tasks, and a Code Mode as the new coding surface in the Vibe web app. The launch positions Mistral’s consumer and developer-facing agent around everyday tasks and knowledge work.

We believe physics deserves its own frontier AI models. - Mistral

Mistral announced “physics AI” for industrial engineering. The company says it has brought Emmi AI into Mistral and is building AI models that learn from physics-solver outputs to predict physical fields from geometry, boundary conditions, or measurement data. The intended use cases include faster design-space exploration, tooling and process optimization. They aim to apply these physics AI models as real-time digital twins for industrial partners such as ASML, Airbus, Safran, and Siemens Energy.

Microsoft announced its new MAI Image 2.5 image generation model, an upgraded text-to-image generator succeeding MAI Image 2.0 that follows prompt instructions more closely and renders text strings more reliably. Climbing to the number three spot on the text-to-image Arena.ai leaderboard, MAI Image 2.5 displays strong visual reasoning around scenes and lighting, which, combined with its sharper and more accurate text rendering, makes it well-suited for branding and product concepts.

Figure 4. MAI Image 2.5 is strong on text rendering and spatial reasoning to render exact images.

Microsoft rolled out an overhauled design for Microsoft 365 Copilot across its office productivity suite, calling it “a cohesive, agentic experience.” The new Copilot has a consistent entry point across apps and can now draw live data directly from other integrated Microsoft apps, such as emails, calendars, and files, to generate context-aware charts and graphs.

Microsoft is attempting to keep Copilot competitive as quickly evolving AI applications take on agentic abilities. To that end, Microsoft is reportedly developing a unified “super app” to consolidate GitHub Copilot, Copilot chat, and Copilot Cowork into a single destination. This new platform will feature an agentic workflow capability internally named Autopilot and is expected to launch by the end of summer.

Perplexity announced that its Perplexity Computer capabilities are now directly available within Microsoft 365 applications, including Word, Excel, and PowerPoint. The deep integration allows users to request multi-step, complex analytical actions beyond standard chat responses. For instance, the tool can analyze a legal document against a template, track changes, and generate an issues list with fallback clauses.

ElevenLabs released its upgraded Music V2 generative audio model, which focuses on producing higher-fidelity musical tracks. ElevenLabs claims:

Music v2 delivers better vocals, instrumentation, and arrangement across every genre, with improved multilingual support and a set of new capabilities.

The foundation model was trained entirely on licensed data, ensuring that commercial usage rights are cleared for content creators. Testing shows that the model contains built-in world knowledge, allowing it to correctly reference specific landmarks and pop culture elements when given regional prompts.

ElevenLabs also launched Dubbing V2, an automated video localization tool that translates audio content while preserving original attributes. The software takes an uploaded video file and converts the speech into one of over 90 target languages while maintaining the speaker’s original vocal tone, emotional delivery, and facial expressions. This keeps the output more faithful to the original delivery.

Figma transformed its AI design assistant, Figma Make, into a live, visual software editor that connects natively to production codebases. The update allows users to import existing Git repositories directly into the Figma desktop app to visually edit underlying code and push changes back to engineering through GitHub pull requests. The platform utilizes a multi-model AI system, toggling between Anthropic’s Claude and Google’s Gemini models to write code that adheres to established design system guidelines.

MiniMax released a technical report on their M2 series and teased upcoming M3 models. The upcoming M3 series will feature “MiniMax Sparse Attention” (MSA), a sub-quadratic framework capable of 15.6 times faster decoding speed at million-token context lengths. The MiniMax-M2 Series Technical Report highlights the sparse Mixture-of-Experts architecture of M2 and its training: agent-driven data pipelines, the “Forge” reinforcement learning system for agent-native training, and M2.7’s steps toward self-evolution by autonomously debugging training runs.

Meta is developing an AI-powered pendant that it plans to start testing in the next year. The device is expected to build on the technology of Limitless, an AI startup acquired by Meta at the end of 2025. Meta also plans to expand its AI glasses lineup and launch a “Wearables for Work” business subscription.

OpenAI has added Codex’s computer use feature to Windows. The app can see your screen and perform tasks on your device. Users can also manage and review Codex’s jobs via the ChatGPT app.

OpenAI will remove the Canvas feature in GPT-5.5 models. The side-by-side editing feature will no longer be available with GPT-5.5 Instant or GPT-5.5 Thinking. OpenAI is also shortening GPT-5.5 Instant responses and reducing the use of bullets in text.

AI Research News

The paper “Reasoning-preserved Efficient Distillation of Large Language Models via Activation-aware Initialization” argues that some efficient distillation methods damage multi-step reasoning through “reasoning collapse.” To fix this, the proposed RED method uses activation-aware initialization to better preserve hidden-representation rank. Experiments on Llama and Qwen models show that RED recovers reasoning while keeping the efficiency benefits of compressed LLMs.

AI Business and Policy

Anthropic raised $65 billion in Series H funding at a mind-boggling $965 billion post-money valuation, with Anthropic saying proceeds will support safety and interpretability research, compute expansion, and product scaling. Anthropic’s run-rate revenue crossed $47 billion earlier in May, leading OpenAI in revenue, and it has signed major compute agreements with Amazon, Google, and SpaceX to ramp up capacity for serving AI. Anthropic also opened a Milan office and expanded its European footprint.

OpenAI published its Frontier Governance Framework this week, which explains how OpenAI’s safety and security practices align with existing and emerging legal requirements, including in the US, California, and EU. The Frontier Governance Framework covers how OpenAI deals with AI risk assessment and mitigation in areas such as cyber offense, CBRN, harmful manipulation, and loss of control, providing guidance on model reporting, security management, and incident response.

The Verge examined the rapid normalization of AI in warfare in a feature arguing that military AI is no longer a future scenario. The article covered the shift from Project Maven to modern AI-enabled surveillance, object detection, and targeting workflows, as well as tensions between government demand for broad “lawful use” and AI companies’ attempts to define ethical red lines around autonomous weapons and surveillance.

AI Opinions and Articles

In the era of Artificial Intelligence, when human dignity is threatened by new forms of dehumanization, ours is the pressing duty to remain profoundly human. – Pope Leo XIV

Pope Leo XIV issued an encyclical letter on AI called “Magnifica Humanitas”, which means ‘Magnificent Humanity’, with a focus on “safeguarding the human person” in the AI era. It’s a nuanced, informed, and detailed document covering the impact of AI and how we should approach it. The Pope emphasizes that humans possess a unique, inherent dignity that should not be overlooked as AI capabilities grow.

The Pope neither rejects AI in toto nor accepts the accelerationist argument, but he raises serious concerns about the social impacts of AI development, such as AI companionship’s effect on human relationships. He critiques how control of AI development by a few private entities complicates governing these technologies for the “common good.” The Pope advocates for “disarming” AI, meaning we must move away from the “armed competition” mentality of the AI race and instead foster open, human-friendly collaboration.

Pope Leo XIV and the New Social Question of AI reviews Pope Leo XIV’s AI missive in the context of Pope Leo XIII’s Rerum Novarum, which confronted the challenges of industrialization over a century ago.

The Guardian scrutinized Anthropic’s association with Pope Leo XIV’s AI encyclical, sharing criticism that Anthropic’s engagement with the Vatican could become “Vatican-washing” if it burnishes the company’s safety image without addressing AI concerns. Critics also accuse Anthropic of using AI ‘concerns’ as a way to lock down AI development via ‘regulatory capture.’

The Pope has moved the AI ethics debate forward, addressing AI in religious, social, labor, and geopolitical contexts.

“I would like to employ the expression to disarm which is close to my heart. Disarming AI means freeing it from the mentality of armed competition ... which today is not limited simply to the military context but is also an economic and cognitive phenomenon. This entails a race for ever more powerful algorithms and larger data sets driven by the desire to secure geopolitical or commercial dominance.” - Pope Leo XIV

AI Week in Review 26.05.23

Patrick McGuinness — Sun, 24 May 2026 03:46:21 GMT

Figure 1. Multi-modal world model Gemini Omni is the Nano Banana for video. Omni can take multiple inputs (audio, video, image, text) to create a video on command. Prompt used for this: Dynamic sci-fi file style video based on input image, audio track from audio file, and elements lighting up from video input.

Top Tools

Google presented many AI updates at Google I/O this week, and we shared our breakdown of highlighted announcements and releases in a prior article. To recap our recap, these were the most important ones:

Google introduced Gemini Omni, a multimodal world generation model that can create “anything from any input,” with natural-language editing across text, image, and video prompts, and video generation output.
Google released Gemini 3.5 Flash and previewed Gemini 3.5 Pro. Gemini 3.5 Flash is positioned for agentic workflows, coding, long-horizon tasks, multimodal understanding, and real-time use. Gemini 3.5 Pro is expected to roll out next month.
Google introduced Antigravity 2.0, an agent-first platform that revamps Antigravity, and also unveiled the Gemini Spark personal agent, a 24/7 personal AI agent built on Gemini 3.5 and Antigravity.
Google announced many AI-infused features across the Google ecosystem, including major AI updates for Search, personalized Daily Briefs, Universal Cart for AI-assisted shopping, Ask YouTube for video search, Google Pics for image editing, and intelligent eyewear powered by Gemini.

One way to summarize Google’s direction: Google expanded its agentic product layer across Search, Gemini, Workspace, shopping, YouTube, and Android XR. The Verge summarized Google I/O as a broad AI platform push across models, agents, apps, and hardware.

AI Tech and Product Releases

Alibaba’s Qwen Team released Qwen3.7-Max, a new model designed for long-horizon autonomous agentic tasks. The model demonstrated up to 35 hours of continuous autonomous execution during an engineering task and features a 1 million token context window. Qwen3.7-Max is a proprietary model that outperforms all Chinese competitors on reasoning and coding benchmarks and matches Claude Opus 4.6.

Figure. Qwen 3.7 is SOTA across many coding and agentic benchmarks; it’s the best Chinese AI model and a match for Claude Opus 4.6.

OpenAI updated Codex with richer context, goal mode, browser improvements, and locked computer use. The latest Codex release added Appshots for attaching macOS app windows to Codex threads, general availability of Goal Mode across the Codex app, IDE extension, and CLI, improved browser annotations, and locked computer use for eligible Mac Computer Use users. The update is aimed at making Codex more useful for longer-running software tasks.

Cohere unveiled Command A+, a highly optimized 218B parameter language model released under a permissive Apache 2.0 license for open-source enterprise use. The Command A+ model utilizes a Sparse Mixture-of-Experts architecture and key features include hardware-efficient quantization for single-GPU deployment, multimodal capabilities, and improved tokenization for non-European languages. It is available on Hugging Face.

Amazon Nova Act now qualifies as a HIPAA eligible service. This expansion allows healthcare organizations to deploy autonomous, browser-based AI agents to automate complex workflows involving protected health information. The service can automate tasks such as appointment scheduling, insurance verification, and claims processing.

Anthropic announced updates to Project Glasswing, allowing qualifying customers access to a Claude harness, a threat model builder, and various skills. The company also plans to expand the project to additional partners and has released a dashboard for open-source vulnerabilities.

Cerebras Systems announced high-speed inference for the trillion-parameter Kimi K2.6 model. The chipmaker is running Moonshot AI’s open-weight model at 981 output tokens per second, significantly outperforming GPU-based cloud providers. This enterprise-first deployment utilizes wafer-scale architecture to provide massive speed improvements for agentic coding and heavy workloads.

Copenhagen-based healthcare AI company Corti is launching Symphony for Speech-to-Text, a new generation of clinical-grade speech recognition models. The models achieved a 1.4% word error rate on English medical terminology, significantly outperforming generalist APIs from OpenAI, ElevenLabs, and Whisper. The technology also demonstrated a 98.3% recall rate on clinical entities and surpassed the performance of the legacy incumbent, Dragon Medical One.

AI Research News

An OpenAI reasoning model autonomously disproved a major conjecture in discrete geometry called the unit distance problem. OpenAI said an internal general-purpose reasoning model produced a proof resolving the long-running planar unit distance problem, originally posed by Paul Erdős in 1946. The proof, reviewed by external mathematicians, is notable because the model was not a math-specialized system and used ideas from algebraic number theory to disprove a conjecture many mathematicians believed was likely true.

This marks the first time AI has autonomously solved a prominent open problem central to a field of mathematics. … The result is also notable for how it was found. The proof came from a new general-purpose reasoning model, rather than from a system trained specifically for mathematics, scaffolded to search through proof strategies, or targeted at the unit distance problem in particular.

ComplexMCP was introduced as a benchmark for LLM agents in realistic tool-use environments. The paper on ComplexMCP argues that many agents can call isolated APIs but struggle when tools are interdependent, noisy, and embedded in workflows that resemble commercial software automation. The benchmark is intended to measure the “last mile” of agent performance, where success depends not just on calling tools but on managing state, dependencies, and changing environments.

A new benchmark called SMDD-Bench tests whether LLMs can solve real-world small-molecule drug discovery tasks. The benchmark evaluates frontier open and closed models on tasks requiring chemical and biological reasoning, 3D intuition, specialized tool use, and planning under limited oracle calls.

The paper “Benchmarking Autonomous Agents against Temporal, Spatial, and Semantic Evasions” found that autonomous agents such as OpenClaw have systemic, architecture-level vulnerabilities that can be exploited through multi-turn interactions. The study evaluated a standard agent framework with 10 mainstream LLM backbones across 20 threat scenarios and found that an evasion framework raised the average risk trigger rate from 28.3% to 52.6%. The authors constructed A3S-Bench to evaluate AI agents on such vulnerabilities.

AI Business and Policy

Andrej Karpathy has joined Anthropic as an individual contributor on the pre-training team. This is big news because he is a highly respected AI researcher, one of the pioneers in the AI space, who cofounded OpenAI and led AI at Tesla for many years. He’s the third senior former OpenAI employee to join Anthropic in the last two years.

Nvidia reported record first-quarter fiscal 2027 revenue amid increasing AI infrastructure demand. Revenue for the quarter ending in April reached $81.6 billion, up 20% from the previous quarter and 85% from a year earlier. AI compute demand from hyperscalers, enterprises, and AI labs to support training and inference continues to expand.
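As a quick sanity check on those growth figures, the stated percentages imply what revenue must have been in the comparison quarters. The sketch below only back-solves from the numbers quoted above; the function name is illustrative, and the results are rounded estimates rather than reported figures.

```python
def implied_prior_revenue(current_billion: float, growth_pct: float) -> float:
    """Back out the earlier revenue implied by a growth percentage."""
    return current_billion / (1 + growth_pct / 100)

current = 81.6  # $ billions, the quarter's revenue as quoted above

prior_quarter = implied_prior_revenue(current, 20)  # up 20% sequentially
year_earlier = implied_prior_revenue(current, 85)   # up 85% year over year

print(f"Implied prior quarter: ${prior_quarter:.1f}B")
print(f"Implied year-earlier quarter: ${year_earlier:.1f}B")
```

By this arithmetic, the prior quarter comes to roughly $68.0 billion and the year-earlier quarter to roughly $44.1 billion.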

Google and Blackstone announced a new AI cloud infrastructure venture that will serve up Google TPU AI support through a compute-as-a-service model. Blackstone will initially invest $5 billion in equity to help bring 500 megawatts of data center capacity online by 2027, and total investment could reach $25 billion including leverage.

The White House prepared an AI oversight executive order focused on model review and cybersecurity risks that would create a voluntary framework for AI labs to engage with the Federal Government before releasing covered models. The proposed order would task agencies with evaluating AI security following concerns regarding models like Anthropic’s Mythos and OpenAI’s GPT-5.5 Cyber.

However, Trump called off those plans because of concerns such regulation could dull America’s edge on AI technology. A major point of contention is a requirement for companies to share advanced models with the government up to 90 days ahead of launch. The turnabout reflects tension between AI safety advocates pressing for stronger regulation and tech-industry allies who favor voluntary cooperation.

U.S. lawmakers have moved to counter Chinese AI and technology exports with legislation to bolster American exports, as part of a broader geopolitical contest over AI infrastructure, chips, and digital technology.

OpenAI and Dell Technologies are collaborating to deploy Codex in the enterprise, integrating Codex with the Dell AI Data Platform to support hybrid and on-premises workloads. Codex-powered agents will be utilized for tasks including software development and business workflow automations.

Regis Aged Care implemented the RegiCare Assist AI assistant to streamline clinical care management. Developed with Microsoft Copilot Studio and Microsoft Foundry, the assistant summarizes voluminous progress notes and flags clinical concerns.

Spotify and Universal Music Group have entered a licensing deal to launch an AI remix tool that allows fans to create AI-powered covers from UMG’s catalog. Spotify also revealed new AI-driven tools for audiobook and podcast production during its Investor Day.

AI Opinions and Articles

Axios reported this week on some fresh polling on AI policy, and the results show Democrats becoming more AI skeptical, Republicans more likely to trust AI companies, and anxiety among younger voters that AI will harm job opportunities. Overall, a majority are “pro AI innovation” and have a nuanced view of AI regulation, between the poles of “Pause AI” and “get Govt out of the way.” The most popular position - a 41% plurality - is that the government needs “basic safety” standards that keep American companies competitive:

Figure 3. A survey question response from the Harris Axios survey on AI.

In a twist of irony, Steven Rosenbaum explains how inaccurate quotes got into his book The Future of Truth. A New York Times investigation revealed that the use of AI tools during research led to several improperly attributed or synthetic quotes in the book. Rosenbaum is currently conducting a citation audit to correct these errors in future editions. The book is about “how Truth is being bent, blurred, and synthesized” thanks to the “pressure of fast-moving, profit-driven AI.”

He blames AI because it was deceptively easy to use, but he proves that AI slop is due to sloppy fact-checking and editing and humans cutting corners. A good workman never blames his tools.

Google I/O Recap

Patrick McGuinness — Sat, 23 May 2026 20:22:09 GMT

Figure 1. Gemini Omni is a multi-modal video/audio/text input-output model that, as the “Nano Banana for video,” can restyle any video on command into another format. Omni took a real-life video of a woman playing guitar and turned it into a stylized output. MTV-style videos will be popular.

Note for readers - Google I/O AI announcements were too big to cover in our AI Weekly, so we’ve made a separate article. Our AI Weekly will follow soon.

Google I/O’s Goodies

Google released Gemini 3 Pro in November 2025, six months ago, and for about a week, it was the world’s best AI model. Since then, Anthropic released multiple Claude Opus versions, up to 4.7, and previewed Mythos, while OpenAI released four GPT versions, making GPT-5.5 and Opus 4.7 the current frontier AI models. In the same time period, Google delivered only Gemini 3.1 Pro preview.

Many of us hoped Google would deliver a new world-beating frontier AI model. We did not get one.

Instead, we got a promise for Gemini 3.5 Pro coming this summer and two important new models: Gemini Omni, a multi-modal world generation model with ground-breaking video editing abilities; and Gemini 3.5 Flash, their fastest and highest performance Flash model yet.

Google also advanced their agentic AI platforms, introducing a revamped Antigravity 2.0 and a new AI agent for consumers, Gemini Spark. For consumers, they announced AI applications such as Docs Live, Ask YouTube, and AI-enabled hardware with audio glasses. We’ll dig into all of them below.

Gemini Omni Takes On the World

Google introduced Gemini Omni, a native multimodal model that combines core intelligence with media generation and can handle any input to any output across audio, video, image, and speech. The result is the “Nano Banana for video,” a model capable of editing and generating video content directly from natural language prompts. Some of the key features that make Omni a new paradigm for video generation:

The model features advanced character consistency for marketing and education videos.
It anchors outputs in structured world knowledge to create contextually accurate media.
Commands on a video input can restyle whole scenes, change backgrounds, add elements, change angles, and more.
Omni can combine image, text, video, or audio inputs into a single cohesive output of a video with an audio track. Google says, “While only voice references will be supported for audio to start, we’ll roll out other types of audio inputs soon.”
While they have guardrails against deepfake abuses of others, Omni allows you to “create videos with your own voice by using Avatars, so you can generate videos that look and sound like you.”

Like its Nano Banana predecessor, the Omni multimodal world model is a new class of AI model that will challenge existing diffusion-based video generation models and expand what creative users can do with AI. Access to Gemini Omni requires a paid subscription.

Figure 2. Gemini Omni is a multi-modal video/audio/text input-output model that can express grounded ideas through videos. Here it generated a mini-explainer on DNA.

Gemini 3.5 Flash

Google’s headline for Gemini 3.5 is frontier intelligence with action, serving AI models for agentic AI tasks. Google presents Gemini 3.5 Flash as an efficient frontier model capable of tackling long-horizon tasks at high speed and reasonable cost.

Gemini 3.5 Flash beats Gemini 3.1 Pro and Claude Sonnet 4.6 on a range of benchmarks, for example, scoring 55.1% on SWE-Bench Pro and 1656 on GDP-val. It also scores well on complex financial decision-making (Finance Agent V2) and other domain-specific tasks.

Gemini 3.5 Flash displays excellent speed and accuracy during single-shot prompts and smaller, short-cycle coding tasks, and could be useful for agentic AI applications, but it trails top-tier frontier models like Opus 4.7 in multi-step agentic tasks and long-horizon programming applications.

Figure 3. Benchmarks for Gemini 3.5 Flash versus leading AI models. Gemini 3.5 Flash is better than Gemini 3.1 Pro on many benchmarks.

Google shared demos of Gemini 3.5 Flash quickly producing results, touting its high token-per-second output speed (over twice the speed of its predecessors) and low latency. However, metrics also show (and my personal experience confirms) that this model is verbose and a token hog when reasoning.

The Gemini 3.5 Flash API costs $1.50 / $9.00 per million input / output tokens, three times the cost of the prior Flash version. Compound this with Gemini 3.5 Flash’s verbosity, and Flash is no longer a cost-performance king. It now costs as much as Qwen’s latest Qwen 3.7 Max, leaving the lower-price tier more open to Chinese AI models, such as Qwen 3.6 Plus.
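To see how verbosity compounds with the higher price, here is a rough per-request cost sketch. The per-million-token prices are the ones quoted above; the token counts are hypothetical, picked only to contrast a terse response with a verbose one, and the function name is illustrative.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one request given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Prices quoted above for Gemini 3.5 Flash ($ per 1M tokens).
FLASH_INPUT, FLASH_OUTPUT = 1.50, 9.00

# Hypothetical workload: the same prompt with terse vs. verbose reasoning output.
terse = request_cost(10_000, 2_000, FLASH_INPUT, FLASH_OUTPUT)
verbose = request_cost(10_000, 8_000, FLASH_INPUT, FLASH_OUTPUT)

print(f"terse: ${terse:.4f}  verbose: ${verbose:.4f}  ratio: {verbose / terse:.2f}x")
```

Because output tokens cost six times as much as input tokens at these prices, a model that reasons at length pays for it mostly on the output side.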

Google bets on Agents with Antigravity 2.0

Google’s Antigravity 2.0 is an abrupt shift from the original Antigravity interface, an Integrated Development Environment (IDE) based on Visual Studio Code with a multi-panel layout, code editor, terminals, and extension/plugin ecosystem.

Antigravity 2.0 strips away these classic panels, plugins, and interfaces in favor of an agent-only workspace. Users interact via a prompt interface and instruct an internal agent to manage the shell, process commands, or check system states.

The system is actually four components backed by an agentic harness with a CLI. By moving away from a window-dependent IDE structure to a standardized CLI and cloud-hosted agent backend, Google can port this uniform agentic execution engine across a wide variety of development interfaces and surfaces.

The Underlying Engine - Antigravity CLI (AGY): The backend architectural engine powering the new Antigravity has a bare-bones interface in the new Antigravity Command Line Interface. This Claude Code-like CLI completely replaces the legacy Gemini CLI and can be run as a stand-alone tool with the command “agy.”

Figure 4. Antigravity CLI has a Claude Code-like interface.

The Agentic Harness - Antigravity 2.0 standalone app: Antigravity 2.0 is a stand-alone desktop application (on macOS, Linux, and Windows) that allows developers to manage multiple active agents and projects simultaneously. Each project can run distinct agent threads and independent asynchronous conversations in parallel without crossing files, eliminating the need to maintain multiple terminal windows.

While it has a steep learning curve, the interface provides many valuable features, some of them echoing features in Codex or Claude Cowork:

Built-in cron-style automation via asynchronous task management, allowing users to run scheduled agent tasks as specified command scripts at timed intervals.
Dynamic subagents: The main agent can dynamically choose to define and invoke subagents to complete focused subtasks, keeping the main agent’s context window clear and allowing for parallelism.
JSON hooks, allowing users to intercept events to control Antigravity behavior.
Browser capability via /browser, which enables remote debugging to spin up and control instances of external applications like Google Chrome.
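The dynamic subagent idea in the list above can be illustrated with a generic fan-out pattern. This is a minimal sketch and not Antigravity’s API: run_subagent and main_agent are hypothetical names, and the subagent body is a stand-in for a real model call with its own context.

```python
import asyncio

async def run_subagent(name: str, subtask: str) -> dict:
    """Stand-in for a subagent; a real one would call a model with its own context."""
    await asyncio.sleep(0)  # placeholder for model and tool latency
    return {"agent": name, "summary": f"done: {subtask}"}

async def main_agent(subtasks: list[str]) -> list[dict]:
    """Fan subtasks out to subagents in parallel and keep only their summaries."""
    jobs = [run_subagent(f"sub-{i}", task) for i, task in enumerate(subtasks)]
    return await asyncio.gather(*jobs)  # results come back in submission order

results = asyncio.run(main_agent(["write tests", "update docs", "fix lint"]))
for r in results:
    print(r["agent"], "->", r["summary"])
```

The main agent only sees each subagent’s short summary, which is the sense in which its own context window stays clear.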

For workflows requiring external integration, the ecosystem utilizes modular agent skills. Antigravity 2.0 is really a first version of this new architecture, lacking interface refinement and some features. New features will be rolled out over time.

Underlying Engine for developers - Antigravity SDK: For developers wanting to roll their own agentic AI applications, Google provides the underlying Antigravity 2.0 engine in a Python SDK.

You can download Antigravity 2.0, the Antigravity CLI, and the original Antigravity IDE, which is still available.

Spark

Google unveiled Gemini Spark, Google’s new cloud-based AI agent platform that is designed to automate consumer productivity tasks and recurring workflows directly on Google’s servers. It operates 24/7 in the cloud to manage email inboxes, organize documents, update spreadsheets, and follow up autonomously with external contacts. Gemini Spark runs on the Antigravity harness and features support for recurring tasks and new skills.

Gemini Spark was not fully released; it will roll out to trusted testers this week with a Beta planned for next week. Google promises a “packed roadmap of features” scheduled for release this summer, including upcoming Model Context Protocol (MCP) support to connect with third-party software tools.

AI Utilities for Consumers

While there were AI goodies for business power users and consumers alike, Google emphasized consumer-centric utility in many Google I/O announcements, highlighting useful AI features and applications integrated across its existing product ecosystem. Some of the useful utilities:

Calling it a new era for AI Search, Google has pushed further into turning the search engine into an AI answer engine. The search box now is a portal for using generative AI, with conversations in AI mode for Search, AI Search agents, and even agentic coding from search.

Ask Maps allows you to ask map-related questions in natural language and have Maps give you personalized answers about places.

Ask YouTube is a Gemini AI-assisted chat interface in YouTube that lets you surface answers and relevant YouTube videos on a specific topic. You can try it here.

Daily Brief is an agent that gives you a personalized morning digest that’s designed to be your first stop every day.

Docs Live is a useful utility that helps you create documents on the fly from voice commands to AI. Google also announced conversational voice in Gmail to search your inbox more easily.

Google announced that audio-based AI glasses are launching this autumn. Developed in partnership with Samsung, Gentle Monster, and Warby Parker, they have Android XR and Gemini integration and support spoken assistance for calls, music, navigation, and hands-free app commands.

Google shared how the Running Guide agent helps vision-impaired athletes run without human guides.

Google announced that OpenAI and other companies will incorporate Google’s SynthID watermarking technology into their product lines to identify AI-generated imagery.

Google I/O Hits and Misses

Google’s most novel AI release, and possibly the most profound one out of Google I/O, was the Gemini Omni model. World models like Omni could displace prior generations of AI video generation models, in the same way Nano Banana has disrupted image generation.

The Omni model not only enables new video generation capabilities, but it also advances us towards a multimodal form of Artificial General Intelligence (AGI). Google DeepMind argues that advanced world generators like Gemini Omni are crucial for achieving AGI, since AI must accurately simulate real-world physics and sense in all modalities to understand the world.

OpenAI and Anthropic, on the other hand, are scaling text-based reasoning models as their line of sight to AGI. We started by noting that some of us hoped for a new frontier AI model. We didn’t get one. Google isn’t losing by any means, but neither OpenAI nor Anthropic is threatened by Google’s release of Gemini 3.5 Flash, a solid AI model but not a reason to get off GPT-5.5 or Opus 4.7.

The Antigravity 2.0 platform is the right architecture going into a fully agentic future. Paired with Gemini 3.5 Flash, it’s an effective platform for coding and many tasks. However, the release itself caused confusion among existing users, as it replaced the IDE interface.

Both Antigravity 2.0 and Gemini Spark agentic platforms share similar features with Codex and Claude Cowork. The AI competition and fast pace of development are driving design convergence, where all these applications copy each other and begin to look the same.

AI companies are all facing the same issue: The explosion in usage and demand is making AI infrastructure a bottleneck. Google may be calculating that better margins come from optimizing their Flash model and raising prices to match its better performance. Meanwhile, Google is inserting AI throughout their ecosystem, including Search, Gmail, YouTube, and Google Docs. AI usage will only continue to grow.

Figure 5. Google Gemini tokens processed has grown 7x in the past year and continues to rocket higher.

AI Week in Review 26.05.16

Patrick McGuinness — Sat, 16 May 2026 23:16:48 GMT

Figure 1. Image made by Krea K2 image generation. K2 is a great model for aesthetic image generation with customizable styles.

Top Tools

Mira Murati’s Thinking Machines Lab introduced its first Interaction Models, designed for real-time continuous audio, video, and text collaboration rather than turn-by-turn chatbot exchanges. These interaction models process audio, video, and text in time-aligned 200 ms micro-turns. The TML-Interaction-Small model is a 276B-parameter MoE model with 12B active parameters.

An interaction model is in constant two-way exchange with the user—perceiving and responding at the same time.

TML combines an interaction model that supports simultaneous real-time translation, temporal awareness for tracking conversation length, natural dialogue mechanics including structured interruptions, and user interface generation, with a background model that supports concurrent tool execution and web browsing. Combined, it appears to be a powerful next-level form of interactive AI, a step beyond the turn-based AI interfaces. TML says broader preview access will come in the coming months.
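To make the idea of time-aligned 200 ms micro-turns concrete, the sketch below buckets a mixed stream of timestamped events into fixed windows. It is only an illustration of time-aligned windowing, not TML’s implementation; the event format and function name are assumptions.

```python
from collections import defaultdict

MICRO_TURN_MS = 200  # window size quoted in the announcement

def group_into_micro_turns(events):
    """Group (timestamp_ms, modality, payload) events into 200 ms windows."""
    turns = defaultdict(list)
    for timestamp_ms, modality, payload in events:
        turns[timestamp_ms // MICRO_TURN_MS].append((modality, payload))
    # Return windows in time order so every modality in a window stays aligned.
    return [turns[index] for index in sorted(turns)]

# Hypothetical interleaved input stream.
events = [
    (20, "audio", "a0"), (90, "video", "v0"), (150, "text", "hi"),
    (210, "audio", "a1"), (390, "video", "v1"), (410, "audio", "a2"),
]
micro_turns = group_into_micro_turns(events)
print(len(micro_turns), micro_turns[0])
```

Each window holds whatever audio, video, and text arrived in the same 200 ms slice, so a model consuming these windows could respond mid-utterance instead of waiting for a full turn.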

AI Tech and Product Releases

OpenAI made Codex available inside the ChatGPT mobile app, letting users monitor and direct coding tasks from iOS and Android. Users can review outputs, approve changes, and start tasks remotely, while the connected Codex environment remains on the user’s macOS machine; OpenAI says Windows pairing is coming later. The host computer executes the underlying processing locally while sending state updates and receiving prompts back from the mobile interface. Codex for Mobile is similar to Anthropic Claude’s Remote Control feature.

New safety updates help ChatGPT respond safely when risk emerges in ChatGPT conversations. OpenAI introduced safety updates to better recognize evolving patterns of self-harm, suicide, or harm-to-others intent within and across conversations. Internal evaluations show that these improvements increased safe-response performance in suicide and self-harm cases during long conversation scenarios.

Anthropic introduced Claude for Small Business, connecting Claude to several tools and shipping a collection of pre-built automation workflows and Claude skills targeting specific operations. The platform introduces 15 agentic workflows and connects Claude directly into everyday enterprise tools such as PayPal, QuickBooks, HubSpot, Canva, and DocuSign to automate routine operational tasks.

Anthropic also announced an AI training partnership with PayPal called AI Fluency for Small Business. Anthropic’s broader initiative is to leverage Claude Cowork and Model Context Protocol connectors to provide customized AI workflows in industries like legal, finance, and healthcare.

OpenAI launched a product for managed cyber defense called Daybreak. Daybreak is a cybersecurity system built around GPT-5.5, Codex Security, and verified defensive workflows designed to scan for potential cyber-vulnerabilities and exposures. Instead of distributing software or AI models directly to clients, the OpenAI Daybreak Framework operates as a managed service that runs internal evaluations on behalf of the customer. This controlled access approach aims to prevent advanced cyber-capabilities from being misused by malicious actors. The service can provide secure code review, vulnerability triage, patch validation, threat modeling, and remediation guidance.

Penligent has thoughts on Mythos versus Daybreak for cybersecurity.

Krea AI released Krea 2, an AI image generation model designed for aesthetic image generation with granular style and structure controllability. The system utilizes variable sliders to control precise stylistic inputs and can blend multiple images via individual asset weights. It also introduces mood boards that encapsulate custom profile definitions and keyword parameters for consistent stylistic rendering.

Krea 2 focuses on visual taste and style control. Instead of relying only on longer prompts, creators can use moodboards and references to guide the model toward a specific look. – Krea AI

Krea 2 is available at Krea AI.

Figure 2. Image generated by Krea K2. Jerrod Lew says, “The model leans into more abstract, different and unique forms of art generations.”

Meta expanded Muse Spark voice conversations across its consumer apps and devices. Muse Spark is now available via the Meta AI app, WhatsApp, Instagram, Facebook, and Ray-Ban Meta Gen 1 and Gen 2 smart glasses, with low-latency voice interaction, live camera input, and integration with Meta surfaces such as Reels inside native Meta AI applications and smart glasses.

Google introduced the Google Book, a laptop that integrates AI natively into the operating system, taking Android and ChromeOS into the AI era with Gemini AI support. The system reengineers traditional computing interactions by rolling out an AI-enabled pointer that utilizes head tracking, eye tracking, and speech commands to modify documents or drag elements without keyboard inputs.

Related to the Google Book release, Google previewed upcoming Android updates that embed Gemini AI directly into native applications and mobile web browsing. Users can execute automated workflows such as booking travel or reserving parking with a few clicks. Visual understanding can be used to convert a shopping list image to a full online shopping cart. AI voice dictation with “Rambler” can dynamically eliminate conversational pauses, stuttering, and filler words.

Figure 3. Branding it as “Gemini Intelligence,” Google is bringing a number of Gemini-based features into the upcoming Android 17, getting ahead of Apple in the AI race on mobile.

Observability startup Raindrop AI has launched a tool for debugging and evaluating AI agents called Workshop. The open-source Workshop tool provides a local dashboard to stream real-time telemetry, such as tokens and tool calls, into a single SQL database file. It features a self-healing eval loop that enables coding agents to autonomously identify and fix errors by analyzing execution traces.

The goal pattern is being productized across coding agents. Anthropic introduced /goals on Claude Code to separate task execution from task evaluation. The feature uses an independent evaluator model to verify that user-defined completion conditions are met (such as tests passing) before an agent terminates its work. OpenAI also has a /goal workflow in Codex, and Hermes agent has a goals feature. This flow, where the agent continues until validation criteria are satisfied, is a refinement of the Ralph Wiggum loop and reduces reliance on separate observability platforms.
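The goal pattern described above, a worker that keeps going until a separate evaluator signs off, can be sketched in a few lines. This is a generic illustration, not the Claude Code or Codex implementation: run_until_goal, fake_agent, and fake_evaluator are hypothetical names, and the stand-ins replace real model calls.

```python
from typing import Callable

def run_until_goal(work_step: Callable[[int], str],
                   goal_met: Callable[[str], bool],
                   max_iterations: int = 10) -> tuple[bool, int, str]:
    """Repeat work until an independent evaluator accepts the result."""
    result = ""
    for attempt in range(1, max_iterations + 1):
        result = work_step(attempt)   # the worker agent does one pass
        if goal_met(result):          # a separate evaluator checks the goal
            return True, attempt, result
    return False, max_iterations, result

# Hypothetical stand-ins: the "agent" fixes one failing test per attempt.
def fake_agent(attempt: int) -> str:
    failing = max(0, 3 - attempt)
    return f"{failing} tests failing"

def fake_evaluator(report: str) -> bool:
    return report.startswith("0 ")

done, attempts, report = run_until_goal(fake_agent, fake_evaluator)
print(done, attempts, report)
```

Keeping the evaluator separate from the worker is the point of the design: the agent cannot declare itself finished, and the iteration cap bounds the cost if the goal is never met.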

Notion released the Notion Developer Platform, offering engineering utilities for agent integration with Notion, including a dedicated Command Line Interface and execution workers. The platform turns Notion into one shared canvas for data, where both users and agents such as Codex, OpenClaw or Hermes can interact with Notion database structures. This allows external developers to execute custom code directly on Notion’s servers, initialize webhook triggers, and leverage a specialized agents SDK.

Open-source autonomous agent Hermes passed OpenClaw as the most-used CLI agent on OpenRouter and added background computer use through TryCUA for macOS computer control. This is a key improvement for local Hermes agent use; Hermes can run on local or remote infrastructure.

Anthropic released Claude Code Agent View, an interface to manage multiple agent sessions at the same time. The interface change consolidates multiple background agent operations into a single dashboard view to track what requires user input, what is actively running, and what has finished. This replaces the need to keep numerous open terminal windows running concurrently during multi-agent software engineering tasks.

OpenAI launched new personal finance tools in preview for ChatGPT Pro subscribers, which connect to financial institutions for finance data and provide a dashboard for portfolio performance, spending, and upcoming payments. The tools leverage the reasoning capabilities of the new GPT-5.5 model to assist with detailed spending analysis and long-term financial planning.

YouTube is expanding its AI likeness detection program to all users over the age of 18. The feature uses facial scans to monitor the platform for potential deepfakes and allows users to request the removal of matching content.

Ahead of I/O, reports say Google is preparing to launch an advanced AI agent called Gemini Spark. This is a Gemini agent that could work continuously in the background on tasks such as inbox triage, online workflows, and app-linked actions. There are also rumors that upcoming Gemini 3.2 Flash is 15x cheaper and nearly as smart as GPT-5.5. This remains unconfirmed by Google, and we will have to wait for Google to reveal more at Google I/O next week.

AI Research News

RecursiveMAS is an innovative multi-agent framework that allows AI models and agents to collaborate through a unified latent-space recursive loop rather than communicating via standard text. Introduced in the paper “Recursive Multi-Agent Systems,” the framework connects heterogeneous AI models into a collective reasoning loop utilizing a lightweight RecursiveLink module to transmit latent states. RecursiveMAS achieves an average 8.3% accuracy improvement across various benchmarks while increasing end-to-end inference speed by up to 2.4-fold.

AI Business and Policy

Cerebras Systems made a massive Nasdaq debut, raising $5.55 billion in an IPO where shares surged from $185 to $385, pushing its market cap past $100 billion. Driven by $510 million in 2025 revenue and $237.8 million in net income, the company is expanding its Wafer-Scale Engine cloud infrastructure for high-speed AI inference. This growth is supported by strategic partnerships with industry leaders such as OpenAI and Amazon Web Services.

Anthropic’s Claude adoption surpasses OpenAI’s ChatGPT among American businesses. According to the May 2026 Ramp AI Index, Anthropic’s business adoption rose to 34.4% in April, showing a 3.8% monthly climb and overtaking OpenAI’s 32.3%. Anthropic’s growth is driven by Claude Code usage, but its success is hampered by constraints in compute.

Attempting to manage compute and user demands for Claude, Anthropic has reinstated third-party agents in Claude subscriptions by implementing a monthly credit quota system. Agent SDK and claude -p usage on subscription plans will draw from a new monthly Agent credit separate from interactive usage limits. Users can apply their subscription to OpenClaw usage, but once users consume their baseline allocated tier allowance within agent platforms, subsequent consumption shifts to standard API rates. Feedback has been mixed, since high-throughput agent tasks will consume these flat credits quickly, leading to higher development expenditures compared to previous subscription limits.
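The quota mechanics described above, where usage draws down a monthly credit and the remainder is billed at API rates, can be sketched as follows. The dollar amounts and the function name are hypothetical; Anthropic’s actual accounting is not specified here.

```python
def bill_agent_usage(usage_cost: float, credit_remaining: float) -> tuple[float, float]:
    """Split agent usage between the monthly credit and API-rate overage.

    Returns (new_credit_remaining, overage_billed).
    """
    covered = min(usage_cost, credit_remaining)
    overage = usage_cost - covered
    return credit_remaining - covered, overage

# Hypothetical numbers: a $100 monthly agent credit and three agent runs.
credit = 100.0
total_overage = 0.0
for run_cost in (40.0, 50.0, 30.0):
    credit, overage = bill_agent_usage(run_cost, credit)
    total_overage += overage

print(f"credit left: ${credit:.2f}, billed at API rates: ${total_overage:.2f}")
```

In this example the third run exhausts the credit partway through, so $20 of it spills over to API-rate billing, which is the behavior heavy agent users are reacting to.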

Google updated its spam policy to mark attempts to manipulate generative AI responses in Search, including AI Overview, as spam. The policy targets tactics such as recommendation poisoning and generative engine optimization (GEO) used to deceive Search systems.

Elon Musk’s newly rebranded SpaceXAI is reportedly losing top talent. More than 50 researchers and engineers have departed the company since February, with several key leaders joining rivals Meta and Thinking Machines Lab.

Richard Socher’s startup Recursive Superintelligence has emerged from stealth with $650 million in funding. The San Francisco-based company aims to create a recursively self-improving AI model using an approach centered on open-endedness. The founding team includes prominent researchers such as Peter Norvig and Tim Shi.

OpenAI is exploring legal action against Apple over disappointing ChatGPT integration results. OpenAI is reportedly frustrated by the integration’s low visibility and revenue, while Apple has raised concerns regarding privacy and OpenAI’s hardware ambitions.

California jurors are now deliberating over the future of OpenAI. The trial centers on Elon Musk’s allegations that OpenAI and Microsoft breached a charitable trust by transitioning toward a for-profit business model, but most interesting have been some of the revelations at trial: Musk once sought majority control of OpenAI; Microsoft’s commercial rights were revised around the AGI clause; and OpenAI has raised far more private capital than previously understood.

AI Opinions and Articles

Figure 4. The work of art by Monet was labelled as done by AI, eliciting criticism of the flaws of perceived AI art.

An art experiment on X highlighted human bias against AI content, by labeling a real Claude Monet painting as made using AI and soliciting critiques, then getting an earful of them. Users on X wrote extensive critiques detailing why the painting supposedly lacked an organic soul, suffered from synthetic color balancing, and looked inferior to classical artwork.

the reflection in AI art is just noise splattered right. Monet actually understood how light behaves on water - Charles Deskins

The responses demonstrate a human aesthetic bias against AI. A genuine historical masterpiece was viewed as inferior due to its perceived AI pedigree.

AI Week in Review 26.05.08

Patrick McGuinness — Fri, 08 May 2026 18:20:12 GMT

Figure 1. Luma’s Uni-1 reasoning image generation model, which takes the #3 spot on LM Arena, now has an API. As an image model, Uni-1 “Feels like Nano Banana under the hood with a bit of cinematic lighting added to the mix.”

AI Tech and Product Releases

OpenAI launched three real-time voice models, aimed at supporting AI developers for live real-time voice tasks. GPT-Realtime-2 supports real-time voice interactions with “GPT‑5‑class reasoning” for harder requests and natural conversation; GPT-Realtime-Translate translates speech in real-time from 70+ input languages into 13 output languages; and GPT-Realtime-Whisper transcribes speech live for streaming transcription. Companies including Zillow, Priceline, and Deutsche Telekom are testing the AI models.

OpenAI released GPT-5.5 Instant as ChatGPT’s new default model. OpenAI says the model is designed to provide smarter, clearer, more personalized answers with lower hallucination rates, especially in high-stakes areas such as law, medicine, and finance. GPT-5.5 Instant replaces GPT-5.3 Instant as the default ChatGPT model.

OpenAI introduced “Trusted Contact” in ChatGPT, an opt-in safety feature that lets adult users designate someone to be alerted when OpenAI detects a serious self-harm or suicide-related risk from a ChatGPT conversation. OpenAI says the system does not share chat contents with the contact, but a trained human review team can trigger a brief alert after automated systems flag a crisis.

Luma Labs launched the Uni-1.1 API, which puts their Uni-1.1 Unified Intelligence image model in the hands of developers to create production image tool workflows. Uni-1.1 reasons about composition, style, and briefs before generating, and it supports smarter text-to-image and natural-language image editing, with built-in prompt enhancement. Early feedback highlights it as a game-changer for marketing, thumbnails, branding, and creative pipelines.

U.S.-based AI startup Zyphra released ZAYA1-8B, a new mixture-of-experts (MoE) AI reasoning model able to achieve high-tier reasoning with only 760 million active parameters. As shared in a ZAYA1-8B Technical Report, the model used AMD Instinct MI300 GPUs for training. Zyphra pioneered a proprietary MoE++ architecture and test-time compute method (Markovian RSA) that enables ZAYA1-8B to perform competitively on reasoning benchmarks such as AIME 25 (91.9%) against much larger AI models. ZAYA1-8B is an open source AI model available on Hugging Face.

Google released a multi-token prediction (MTP) variant of its Gemma 4 model. The MTP feature uses speculative decoding by a draft model to predict multiple next tokens simultaneously, which improves inference throughput up to 3-fold without capability degradation. Gemma 4 models are available from Hugging Face for local AI use.
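To make the draft-and-verify idea concrete, here is a toy sketch of greedy speculative decoding. It is not Google's implementation: the two "models" are hypothetical stand-in functions over integer tokens, and the names target_model, draft_model, and speculative_decode are invented for illustration. The point it shows is that a cheap draft proposes several tokens, the large model checks them, and the final output is identical to what the large model would have produced on its own.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token prefix to the next token (greedy)


def target_model(prefix: List[int]) -> int:
    # Hypothetical "large" model: next token is the sum of the last two, mod 10.
    return (prefix[-1] + prefix[-2]) % 10


def draft_model(prefix: List[int]) -> int:
    # Hypothetical cheap draft: agrees with the target except when the answer is 7.
    guess = (prefix[-1] + prefix[-2]) % 10
    return 0 if guess == 7 else guess


def greedy_decode(prefix: List[int], target: Model, new_tokens: int) -> List[int]:
    # Baseline: one target call per generated token.
    out = list(prefix)
    for _ in range(new_tokens):
        out.append(target(out))
    return out


def speculative_decode(
    prefix: List[int], target: Model, draft: Model, new_tokens: int, k: int = 4
) -> Tuple[List[int], int]:
    out = list(prefix)
    goal = len(prefix) + new_tokens
    verify_rounds = 0
    while len(out) < goal:
        # 1. The draft proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks the proposal. A real system would score all k
        #    positions in one batched forward pass; this toy calls the target
        #    position by position and counts the whole check as one round.
        verify_rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the target adds one bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:goal], verify_rounds


tokens, rounds = speculative_decode([1, 1], target_model, draft_model, new_tokens=12)
print(tokens, "verification rounds:", rounds)
```

The speed-up in a real deployment comes from the verification round costing roughly one large-model pass while yielding several tokens, so the gain depends on how often the draft agrees with the target.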

At their Code with Claude conference, Anthropic introduced a research preview of a feature called Dreaming, which is designed to help agents clean and reorganize their long-term memory. Between sessions, the model “dreams” by reviewing past transcripts to merge duplicates, resolve contradictions, and surface new insights for a reorganized memory store. This improves agent performance over time by maintaining context and memory.

Anthropic unveiled other updates to their Claude Managed Agents platform, moving two features into public beta: the outcomes loop system, which directs an agent to work towards a goal “self-evaluating and iterating” until the goal is met, similar to a Ralph Wiggum loop; and multi-agent orchestration, where one agent coordinates the work of multiple sub-agents to complete complex work. They also added Routines for scheduled tasks, and they improved webhook support to integrate autonomous agents more easily into external applications and automated workflows.
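The "self-evaluating and iterating" pattern can be sketched in a few lines. This is a generic illustration, not Anthropic's API: outcomes_loop, step, and goal_met are hypothetical names, and the "agent" here is a trivial stand-in (one bubble-sort pass per iteration) so the example runs without any model calls. An iteration budget is included as an assumption, since an unbounded loop would never return if the goal is unreachable.

```python
from typing import Callable, List, Tuple, TypeVar

S = TypeVar("S")


def outcomes_loop(
    state: S,
    step: Callable[[S], S],        # one unit of agent work (hypothetical)
    goal_met: Callable[[S], bool],  # the self-evaluation check (hypothetical)
    max_iters: int = 20,
) -> Tuple[S, int, bool]:
    """Repeat step() until goal_met() passes or the iteration budget runs out."""
    for i in range(max_iters):
        if goal_met(state):
            return state, i, True
        state = step(state)
    return state, max_iters, goal_met(state)


def bubble_pass(xs: List[int]) -> List[int]:
    # Stand-in for agent work: a single pass that moves the list closer to sorted.
    xs = list(xs)
    for i in range(len(xs) - 1):
        if xs[i] > xs[i + 1]:
            xs[i], xs[i + 1] = xs[i + 1], xs[i]
    return xs


def is_sorted(xs: List[int]) -> bool:
    # Stand-in for the evaluator: checks whether the outcome has been reached.
    return all(a <= b for a, b in zip(xs, xs[1:]))


result, iterations, ok = outcomes_loop([4, 3, 2, 1], bubble_pass, is_sorted)
print(result, iterations, ok)
```

In an agent setting the step would be a model call with tool use and the check would be tests, a rubric, or a grader model; the control flow stays the same.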

Anthropic also shared a preview of their next-generation models, with three key areas highlighted for AI model progress: Building in higher judgment and “code taste” for better trust and maintainability for senior engineering tasks; an ‘infinite’ context window where memory management is advanced enough to make context feel limitless; and advanced multi-agent coordination, to tackle tasks too big for a single agent instance.

Runway has released Runway Characters, a platform for creating real-time interactive video avatars:

Runway Characters is an audio-driven interactive AI video generation model that simulates natural human motion and expression.

These characters use generative video and audio to respond to user input with minimal latency, moving the technology closer to realistic digital humans. Thanks to its high-quality video generation, Runway Characters avoids many of the artifacts seen in earlier lip-syncing technologies.

Google introduced three major updates to the Gemini API File Search tool: multimodal support, custom metadata and page-level citations. The tool now processes images and text together using the Gemini Embedding 2 model to provide enhanced contextual awareness. Additionally, developers can utilize custom metadata for efficient data filtering and leverage page-level citations to improve response grounding and transparency.

Anthropic has launched Claude agent templates for the financial sector to automate entry-level financial analyst workflows. Shipped as a plugin for Claude Code or Claude Cowork, the templates cover tasks such as building pitch decks, preparing for meetings, reviewing earnings, and conducting valuation research.

In a similar vein, Perplexity launched Perplexity Computer for professional finance, an agentic tool that integrates licensed financial data from providers such as Morningstar and PitchBook. The product includes 35 dedicated workflows for common analyst tasks, aiming to function as a financial operating system. This release positions Perplexity as a direct competitor to Anthropic’s financial automation tools.

Perplexity’s Personal Computer, its answer to OpenClaw and other local AI agents, is now available to all Perplexity Mac users via its desktop app. Perplexity’s Personal Computer enables AI agents to access local files, applications, and web connectors to handle complex, multi-step workflows, and it is available via a Pro or Max subscription.

AI startup Subquadratic has emerged from stealth with the SubQ 1M-Preview AI model, built on a fully sparse attention architecture that scales linearly with context length. The SubQ architecture is reportedly 52 times faster than flash attention at 1 million tokens and reduces attention compute by nearly a thousand times compared with frontier AI models, allowing it to support a 12 million token context window. They are releasing the SubQ 1M-Preview model via an API, in a coding agent, and as a search tool.

Codex can now use Chrome on your computer to complete work inside websites and apps, with a new Chrome plug-in in the Codex app. The extension operates in task-specific tab groups to allow users to keep using their active tabs.

Microsoft brought its Agent 365 AI agent management platform to general availability. The platform provides a unified control plane for IT teams to govern and secure AI agents across Microsoft’s ecosystem, third-party clouds, and local Windows endpoints.

AI Research News

Alibaba’s Qwen team released Qwen Scope, a set of sparse autoencoders designed to improve the mechanistic interpretability of their 27B model. These tools act like a “microscope” for the model’s weights, allowing researchers to identify and steer specific features, similar to Anthropic’s “Golden Gate Bridge” experiment. This research helps in understanding how models store information and provides ways to “de-censor” or personalize model behavior.

Researchers at AI red-teaming company Mindgard claimed they bypassed Claude’s safeguards through social manipulation by using flattery, respect, and “gaslighting” rather than a traditional technical jailbreak to induce Claude to provide prohibited content. The finding suggests some model-safety failures may emerge from conversational dynamics, an AI analog to “social engineering” in cyber hacking.

Mozilla has confirmed that Anthropic’s Mythos model can uncover high-severity vulnerabilities. Mozilla researchers reported that the Mythos AI model discovered numerous bugs in Firefox, including some that had remained dormant in the code for over a decade. They used Mythos to help ship 423 Firefox bug fixes, a significant increase from the 31 fixes recorded a year earlier.

AI Business and Policy

Anthropic has entered into a major compute partnership with SpaceX AI (formerly xAI) to utilize the compute capacity of the Colossus 1 data center. The deal provides Anthropic with access to more than 220,000 Nvidia GPUs in Colossus 1, consuming over 300 megawatts of power, to scale their model deployment and meet user demand. Anthropic has announced additional recent infrastructure agreements with major tech firms including Google, Amazon, Microsoft, and Nvidia to support their growth.

This is a big win-win deal. SpaceX AI’s massive compute was underutilized because the Grok AI model hasn’t been popular, while Anthropic has been overwhelmed by a surge in demand for Claude; Anthropic reported 80x annualized growth in revenue for the first quarter of 2026. This partnership monetizes SpaceX’s spare compute while giving Anthropic inference resources they desperately need to support users.

Alongside that deal, Anthropic is significantly increasing usage limits for Claude, doubling the 5-hour rate limits for Claude Code across all paid plans including Pro, Team, and Enterprise. The company has also removed peak-hour limit reductions for Pro and Max users and increased API limits for the Claude Opus model. These changes will especially help developers running heavy coding workflows and large-scale agent systems.

SpaceX is reportedly planning a $55 billion AI chip manufacturing plant in Texas, called the “Terafab,” with a goal to supply up to 1 terawatt of AI chips for both earth and space deployment.

A 40,000-acre data center project was approved in Utah despite community opposition. The need for more AI infrastructure is driving new data center projects, but community backlash to data centers is increasingly common due to power, water, and other concerns.

Cloudflare is laying off 1,100 workers while its AI usage has increased by 600 percent. While companies are connecting restructuring and layoffs to productivity changes around AI adoption, the causal link between AI deployment and job cuts often remains contested.

OpenAI has expanded testing ads in ChatGPT to additional countries. OpenAI continues to monetize their ChatGPT AI assistant via ads for free-tier AI consumers.

Moonshot AI has raised about $2 billion at a valuation of $20 billion. Beijing-based Moonshot AI’s annual recurring revenue topped $200 million in April, driven by rapid growth in its Kimi series of AI models.

U.S. AI safety testing expanded to Google DeepMind, Microsoft, and xAI. The Commerce Department’s Center for AI Standards and Innovation said it signed agreements with Google DeepMind, Microsoft, and xAI for pre-deployment evaluations and targeted research on frontier AI capabilities. This allows the U.S. government to review some new AI models before release.

Washington and Beijing are weighing formal discussions on AI, putting AI leadership and risk management on the agenda for the Beijing summit between President Donald Trump and Chinese President Xi Jinping.

The White House is reportedly reconsidering AI oversight after concerns about Anthropic’s Mythos model. The model’s ability to exploit software vulnerabilities has prompted White House discussion of stronger oversight for frontier models, shifting the administration’s AI strategy. Officials are trying to balance national-security risks against a desire to avoid slowing U.S. AI progress versus China.

The Golden Globes has established new guidelines for AI use, permitting its application in production as long as human creative direction and authorship remain primary. Acting performances must be fundamentally human-driven and derived from the credited performer, prohibiting the use of unauthorized digital likenesses or voice replication, but AI remains permissible for technical or cosmetic enhancements, such as de-aging.

AI Opinions and Articles

Elon Musk’s lawsuit against OpenAI is exposing details about OpenAI’s internal operations and drama. Testimony suggested that OpenAI’s shift toward product-focused development has compromised its commitment to AI safety. Former board member Tasha McCauley also testified about CEO Sam Altman’s (lack of) transparency and governance failures.

Trial exhibits also revealed new details regarding Mira Murati’s role in Sam Altman’s 2023 ouster from OpenAI, including text messages between Murati and Altman and her communications with the board. OpenAI’s board discussed a possible merger with Anthropic during the crisis, with Dario Amodei potentially becoming CEO.

AI Week in Review 26.05.02

Patrick McGuinness — Sat, 02 May 2026 20:17:01 GMT

Figure 1. OpenAI reported that as they tuned GPT-5.1 through GPT-5.5 for a ‘nerdy personality,’ the model became obsessed with goblins and gremlins in its responses. As AI becomes more intelligent, it adopts certain styles, personalities, and in some cases, obsessions.

Top Tools

“By building on Nemotron 3 Nano Omni, our agents can rapidly interpret full HD screen recordings — something that wasn’t practical before. This isn’t just a speed boost: It’s a fundamental shift in how our agents perceive and interact with digital environments in real time.” - Gautier Cloix, CEO of H Company.

Nvidia launched Nemotron 3 Nano Omni, an open multimodal model that unifies vision, audio, and language processing into a single 30B architecture. The 30B-A3B hybrid mixture-of-experts (MoE) model achieves high throughput (9 times that of comparable open-source omni models) and excels on AV and document benchmarks such as MMLongBench-Doc and OCRBenchV2.

Nemotron 3 Nano Omni’s combined multi-modal capabilities enable it to function as a multi-modal perception sub-agent, speeding agent processing and reducing orchestration complexity and inference costs in agentic systems. This is extremely useful in AI agents like Hermes or OpenClaw. It is available on Hugging Face and Nvidia’s build platform.

AI Tech and Product Releases

SenseTime open-sourced SenseNova U1 as a unified native multimodal model family that handles understanding and generation end to end without a separate visual encoder or adapter. The SenseNova U1 8B model and a 3B-active MoE variant are released under Apache 2.0; they feature native interleaved image-text generation and are aimed in part at infographic generation and multimodal creation. Further documentation is in their GitHub repository.

Mistral announced Mistral Medium 3.5 and Remote Agents in Vibe, their Claude Code-like agentic CLI. Mistral Medium 3.5 is a 128B dense model with a 256K context window and configurable reasoning effort. Mistral’s model card describes it as a frontier-class multimodal model optimized for agentic and coding use cases, although it lags on some benchmarks versus frontier models. It is also an open weights (MIT license) model available via Hugging Face, making it useful for self-hosting and fine-tuning applications.

Mistral also updated their Vibe coding agent platform, which now supports remote parallel agents and session teleportation. These agents operate in isolated cloud sandboxes and can autonomously perform complex development tasks, including opening GitHub pull requests.

xAI’s Grok 4.3 has launched as an update in xAI’s developer documentation. Grok 4.3 is a 500B parameter native multimodal model, designed to handle long-context reading (1 million token context), improved video understanding, and complex reasoning tasks. There aren’t public benchmarks, but users have been reporting improvements over prior Grok models.

IBM introduced the Granite 4.1 family as an open release spanning new language, vision, speech, embedding, and guardian models for enterprise use. The Granite 4.1 3B, 8B, and 30B dense LLMs support a 128k context window and are optimized for high-speed agentic tasks such as tool-calling and long-document reasoning. Granite Speech 4.1 provides state-of-the-art speech-to-text transcription, and Granite Vision 4.1 VLM is designed to process and analyze complex documents, charts, and images for data extraction and visual understanding.

Poolside released Laguna M.1 and Laguna XS.2 models, agentic coding models built for long-horizon tasks. Laguna M.1 is a 225B parameter MoE with 23B active parameters, trained in-house on 30T tokens. It gets 46.9% on SWE-bench Pro and 40.7% on Terminal-Bench 2.0. Laguna XS.2 is a 33B parameter MoE with 3B active parameters released as an open weights model for local deployment. These models are integrated into a new terminal-based coding agent and a cloud sandbox environment for building web applications and APIs.

Anthropic announced “Claude for Creative Work,” releasing a suite of Claude connectors that integrate the AI assistant directly with creative tools including Adobe Creative Cloud, Autodesk, and Blender. These connectors enable professionals to perform tasks such as 3D modeling and audio sample searching via natural-language prompts within their native creative workspaces. Anthropic also announced partnerships with leading design colleges to support the integration of AI tools into creative education curricula.

Google announced that Gemini users can now generate files directly inside Gemini, allowing users to generate and export Google Docs, Sheets, Slides, PDFs, Word documents, and Excel spreadsheets. This simple but useful feature lets users go from prompt to downloadable or shareable output without moving content into separate apps first. Google said the rollout is global.

OpenAI is releasing a cybersecurity-focused GPT-5.5 Cyber to ‘trusted’ cyber defenders. A limited rollout is planned.

For developers, Cursor announced a TypeScript SDK that exposes the same runtime, harness, and models used by its desktop app, CLI, and web agents. The company said developers can run those agents locally or on Cursor cloud VMs and embed them into their own products with a few lines of code.

Baidu’s ERNIE 5.1 Preview reached number 13 on Arena’s text leaderboard, making it the highest-ranked Chinese lab model in that comparison. ERNIE 5.1 is expected to launch soon during Baidu Create.

Perplexity has integrated its Computer enterprise AI platform with Microsoft Teams and introduced a native beta side panel for analysts using Excel. The update features a new workflows tool that enables users to bundle prompts and context for over 70 pre-configured recurring tasks.

Snap Inc. launched AI Sponsored Snaps, a new ad format that places interactive “brand agents” directly into the Snapchat Chat inbox. This feature allows users to interact conversationally with advertisers to manage financial services or obtain product information without leaving the app.

Stripe announced an agentic commerce suite that enables AI agents to make purchases. Stripe’s upgraded Link wallet, built for AI agents, lets autonomous agents perform tasks such as shopping and making reservations without exposing user payment credentials. The setup adds human approval to transactions while giving agents a controlled way to pay within a budget.

Anthropic’s Claude API Skill is now bundled into development platforms including JetBrains, CodeRabbit, and Warp to streamline AI integration. This skill provides developers with production-ready knowledge of the Claude SDK, handling tasks like prompt caching and model migrations automatically.

Google is rolling out Gemini to cars with Google built-in, replacing Google Assistant with Gemini AI to enhance the driving experience. The rollout begins in the U.S. and includes features like Gemini Live for real-time, hands-free interaction.

AI Research News

Mayo Clinic developed an AI model to detect pancreatic cancer from CT scans well before clinical symptoms manifest. Mayo Clinic reported that its REDMOD AI model can detect pancreatic cancer on routine abdominal CT scans up to three years before clinical diagnosis, identifying 73% of pre-diagnostic cancers, compared with 39% for human specialists reviewing the same scans without AI support. This advance in AI diagnostics could significantly improve outcomes for pancreatic cancer, which is typically found too late for effective intervention.

Nvidia and Siemens Healthineers released NV-Raw2Insights-US for AI-native ultrasound reconstruction. The model processes raw ultrasound sensor data to generate patient-specific sound-speed estimates for adaptive image focusing. Meanwhile, BioticsAI has developed an AI copilot for ultrasound to detect fetal abnormalities. The company has secured FDA approval to begin deploying its technology in hospitals.

Anthropic shared a study on personal guidance in Claude conversations. An Anthropic analysis of one million user interactions found that 6% of chats involve requests for personal guidance on topics like health, finance, and relationships. The study revealed that sycophancy, or the model’s tendency to over-validate user opinions, peaked at 25% in relationship-focused discussions. Anthropic subsequently used synthetic training data to reduce these sycophancy rates by half in its latest Opus 4.7 and Mythos models.

Alibaba’s Hierarchical Decoupled Policy Optimization (HDPO) can significantly improve AI agent efficiency, cutting redundant tool invocations from 98% to 2% while establishing new state-of-the-art accuracy across key reasoning benchmarks. The framework separates accuracy and efficiency into independent optimization channels to train agents to balance task precision with execution economy.

AI Business and Policy

Anthropic is asking investors to submit allocations for its latest fundraise, a roughly $50 billion funding round at a whopping $900 billion valuation. This is likely the company’s last private round before an anticipated IPO later this year.

Apple reported $8.4 billion in Mac revenue for the second quarter, a 6% annual increase driven by the demand for devices to support local AI agents like OpenClaw. CEO Tim Cook noted that unexpected demand for AI-capable hardware has led to supply constraints for the MacBook Neo and Mac Studio. Macs make great AI agent devices thanks to AI support on M5 chips.

OpenAI and Microsoft revised their partnership so Microsoft remains OpenAI’s primary cloud partner, but its license is now non-exclusive and OpenAI can serve products across other cloud providers, such as Amazon Web Services and Google Cloud. Microsoft will no longer pay revenue shares to OpenAI for models accessed through Azure, and OpenAI’s revenue share to Microsoft will now be subject to a total cap through 2030.

OpenAI and AWS then announced a strategic expansion that brings GPT-5.5 and other frontier models to the Amazon Bedrock platform in limited preview, and OpenAI launched Codex and managed-agent offerings on Amazon Bedrock. The change gives enterprises a way to use OpenAI systems on AWS infrastructure instead of being limited to Azure distribution.

Accenture is rolling out Microsoft 365 Copilot to its entire 743,000 employee workforce, marking the largest enterprise Copilot deployment to date. Company data from 2025 show that 53% reported significant productivity improvements and 97% reported completing tasks faster.

Google DeepMind announced a new partnership with Korea’s Ministry of Science and ICT to accelerate scientific breakthroughs with frontier AI. An AI Campus in Seoul will serve as a hub for collaboration between Korean research institutions and Google, conducting research in life sciences, weather and climate, and AI safety.

OpenAI published a report outlining safeguards used to prevent ChatGPT from generating violent or harmful content, including automated classifiers and human reviewers to identify policy violations, resulting in immediate account bans for dangerous activity. OpenAI is developing a “trusted contact” feature to help adult users manage their personal safety on the platform. This comes after news of a murder suspect asking ChatGPT about body disposal while planning a crime.

OpenAI released a Cybersecurity Action Plan detailing a five-pillar strategy to leverage AI for strengthening cybersecurity for the Intelligence Age. The full plan focuses on democratizing cyber defense tools, coordinating industry-wide security efforts, and enhancing the safety of frontier AI capabilities. It also emphasizes the importance of preserving visibility in AI deployment to protect users against increasingly sophisticated machine-powered threats.

The US Dept of War has secured agreements with leading AI companies to deploy AI on classified networks. The agreements with OpenAI, Google, Microsoft, Amazon, Nvidia, xAI, Reflection, and SpaceX aim to create an “AI-first fighting force” through enhanced data synthesis and situational awareness. Anthropic was notably excluded from the list of agreements, due to its refusal to accept terms set by the Pentagon, which has led to legal disputes between Anthropic and the Dept of War.

American leadership in AI is indispensable to national security. – US Dept of War

The White House has formally opposed Anthropic’s proposal to increase the number of companies permitted to access its high-performance Mythos AI model. Officials cited internal security analyses suggesting that the model’s cybersecurity capabilities could be used to exploit vulnerabilities and compromise critical infrastructure, including electrical grids and hospitals. Access is currently restricted to 50 select partners.

Salesforce is crowdsourcing its AI roadmap in real time, using intensive customer feedback loops and weekly meetings to develop its AI agent management platform, Agentforce. This approach allows Salesforce to rapidly deploy updates and build agentic operating system components that address specific enterprise challenges.

The Musk versus Altman trial is underway, and it has revealed OpenAI’s governance tensions, its transition to a for-profit model, and a February 2025 $97.4 billion acquisition bid from a Musk-led coalition. Evidence highlighted Musk’s intent to compete via Tesla due to a loss of confidence in OpenAI’s ability to rival Google, and Musk testified that xAI used distillation techniques on OpenAI models to train Grok, asserting that such practices are common among AI companies.

AI Opinions and Articles

OpenAI published a post explaining why GPT-5.5 developed a tendency to use goblins, gremlins, trolls, and similar creature references, after a developer discovered a GPT-5.5 system directive instructing the model to avoid mentioning such creatures. Apparently, reinforcement learning on the “nerdy” personality pattern amplified the odd behavioral pattern of over-use of goblin and gremlin metaphors.

OpenAI explained the leaked prompt oddity with a model-behavior bug report. The lesson for us is that as AI gets more intelligent, we may get yet more surprising AI behaviors, both good and bad, and sometimes, just quirky.

Depending on who you ask, the goblins are a delightful or annoying quirk of the model. But they are also a powerful example of how reward signals can shape model behavior in unexpected ways, and how models can learn to generalize rewards in certain situations to unrelated ones. - OpenAI

AI Week in Review 26.04.25

Patrick McGuinness — Sat, 25 Apr 2026 17:28:59 GMT

Figure 1. GPT Image 2.0 is a stellar image model that creates super-high-resolution images, generating this movie-poster rendition of this week’s theme and releases.

Top Tools: GPT-5.5 and GPT Image 2.0

OpenAI released GPT-5.5 and the even more powerful GPT-5.5 Pro, their most capable frontier models yet. The flagship GPT-5.5 and GPT-5.5 Pro, called the “Spud” model internally during training, take long-horizon agentic AI to new heights, with improvements in long-context reasoning, coding, spreadsheet and document work, computer-use tasks, and scientific research workflows.

GPT-5.5 achieved SOTA results across several real-world benchmarks: 82.7% on Terminal-Bench 2.0 and 84.9% on GDPval. It gets 58.6% on SWE-Bench Pro, below Opus 4.7’s 64.3%, but reviewers have noted this overlooks efficiency gains and tokenizer differences that make GPT-5.5 faster and more consistent, helping it perform at the highest level in production use cases.

Also impressively, GPT-5.5 achieves high-level performance with significantly fewer output tokens than GPT-5.4, somewhat making up for its higher per-token cost. GPT-5.5 comes with a 1 million-token context window and advanced compaction for long-horizon tasks, with some testers reporting GPT-5.5 addressing complex code tasks autonomously for 7+ hours without stopping.

Figure 2. GPT-5.5 shows more intelligence on a per-token basis than Claude Opus 4.7 or other frontier AI models on the Artificial Analysis Intelligence Index, a weighted average of 10 benchmarks: AA-LCR, AA-Omniscience, CritPt, GDPval-AA, GPQA Diamond, Humanity’s Last Exam, IFBench, SciCode, Terminal-Bench Hard, τ²-Bench Telecom.

Additionally, OpenAI is introducing a Bio Bug Bounty program, inviting researchers to expose biological weapons vulnerabilities in GPT-5.5, with monetary rewards for bugs found.

Underscoring its seriousness about winning enterprise users, OpenAI launched Workspace Agents in ChatGPT, a Codex-powered platform designed for enterprise users to automate business workflows such as software triage, lead outreach, and metrics reporting. These shared, cloud-based agents feature persistent memory and scheduled tasks, integrating with third-party applications such as Slack, Google Drive, and Salesforce. The system includes pre-built agents such as Tally for report generation and Scout for customer feedback management, while also allowing companies to build custom agents.

OpenAI also introduced ChatGPT for Clinicians, a version of ChatGPT built for clinical use and free to verified U.S. physicians, NPs, PAs, and pharmacists to assist with clinical documentation and medical research. The announcement coincides with the release of HealthBench Professional, an open benchmark for evaluating AI performance in clinical chat tasks such as care consults and medical research.

Figure 3. Conversation in ChatGPT for Clinicians, discussing a patient medical issue.

Finally, OpenAI launched GPT Image 2.0, a state-of-the-art AI image generation model that combines thinking and web search with image generation to outperform existing models. The model demonstrates a high level of precision in 3D modeling and text rendering, enabling it to generate functional blueprints, complex UI designs, and even legible text on small objects like a grain of rice. The thinking mode allows the model to research, plan, and reason through image generation using an agentic approach.

Figure 4. Example output from GPT Image 2.0 shared by OpenAI showing detailed and accurate text rendering in multiple languages presented as a collage, expressing the various outputs it can generate.

GPT Image 2.0 can generate up to eight consistent images from a single prompt. Edits are great too. I got the model to colorize a black-and-white movie still and restyle a low-res cartoon graphic into full-color cinematic photo-realistic scenes. Combining GPT-5.5 with GPT Image 2.0 adds further power: automated asset creation can feed directly into graphical interface development or presentation generation, extending what the pair can do for visual workflows.

OpenAI was aiming for a work-oriented SOTA AI model to retake bragging rights from Anthropic’s Claude, and GPT-5.5 along with GPT Image 2.0 hits the mark.

Figure 5. Example output from GPT Image 2.0, showing photo-realism, faithful text rendering, and high resolution.

AI Tech and Product Releases

DeepSeek released preview versions of the long-awaited DeepSeek V4-Pro and V4-Flash. Both are near-frontier AI models, with SWE-Bench Pro scores of 55.4% for Pro and 52.6% for Flash. DeepSeek V4 has a 1-million-token context window, and it uses optimizations to cut KV-cache sizes by 10x, making long-horizon tasks faster and more efficient. More about the AI model optimizations is in the DeepSeek V4 Technical Report.
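To give a feel for why a 10x KV-cache reduction matters at a 1-million-token context, here is a rough sizing sketch. It uses the generic formula for a standard attention stack (one key and one value vector cached per token, per layer, per KV head). The layer count, head count, and head dimension are hypothetical placeholders, not DeepSeek V4's actual configuration, and DeepSeek's own cache layout may differ from this formula.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-sequence KV-cache size for a standard attention stack."""
    # Factor of 2: one key vector and one value vector are cached per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical dimensions for illustration only (NOT DeepSeek V4's real config).
baseline = kv_cache_bytes(
    n_layers=60, n_kv_heads=8, head_dim=128, seq_len=1_000_000
)

# Apply the reported 10x reduction to the hypothetical baseline.
reduced = baseline // 10

print(f"baseline:      {baseline / 1e9:.1f} GB per sequence")
print(f"after 10x cut: {reduced / 1e9:.1f} GB per sequence")
```

Under these assumed dimensions, a single 1M-token sequence moves from hundreds of gigabytes of cache to tens of gigabytes, which is the difference between needing several accelerators per request and fitting on one.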

The efficiency of DeepSeek V4 makes this AI model the price-performance leader. DeepSeek V4-Flash costs only $0.14 / $0.28 per 1M input / output tokens, respectively, giving users Gemini 3.1 Pro-level capabilities at a Flash-lite price point. Both are open-source AI models available on HuggingFace or via third-party APIs, and their cost-effective near-SOTA performance makes them a good fit for AI agents like OpenClaw.
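As a back-of-the-envelope check on what those prices mean per request, the sketch below computes the cost of one agent step at the quoted V4-Flash rates. The token counts are a hypothetical workload chosen for illustration, and the helper name is ours, not part of any DeepSeek API.

```python
def request_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one request given per-1M-token prices in USD."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Prices quoted above for DeepSeek V4-Flash (USD per 1M tokens).
FLASH_INPUT_PRICE = 0.14
FLASH_OUTPUT_PRICE = 0.28

# Hypothetical agent step: a large context read and a short response.
cost = request_cost_usd(200_000, 20_000, FLASH_INPUT_PRICE, FLASH_OUTPUT_PRICE)
print(f"${cost:.4f} per step")
```

At these rates the assumed step costs about 3.4 cents, so even a long agent run of a few hundred such steps stays in the range of a few dollars. Cache-hit discounts, if offered, are not modeled here.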

Figure 6. DeepSeek V4 Pro-Max achieves benchmark scores close to Claude Opus 4.6. DeepSeek has also developed KV cache optimizations to cut KV cache sizes by 10x relative to DeepSeek v3.1.

Moonshot AI released Kimi K2.6, an advanced open-source AI model for coding and long-horizon agentic tasks. It competes with leading proprietary AI models, scoring 58.6% on SWE-Bench Pro and 83.2% on BrowseComp. The model features significant upgrades including the ability to handle 12-hour plus coding sessions and an improved Kimi Agent Swarm feature to manage over 300 parallel agents for complex workflows.

Alibaba launched a preview of its flagship Qwen 3.6 Max model, designed to function as a consistent autonomous agent for long-horizon practical tasks. The model achieves superior instruction following and improved real-world reasoning compared to the previous Qwen 3.6 Plus. Qwen 3.6 Max Preview has solid AI coding capabilities (57.3% on SWE-bench Pro), and its overall benchmarks are comparable to Claude Opus 4.5, making it near-frontier if not SOTA. While the preview is currently not open-source, it is available via API or Qwen chat.

xAI introduced grok-voice-think-fast-1.0, their new flagship voice model, which supports over 25 languages and holds the top position on the Tau-voice Bench leaderboard. It excels at complex, multi-step workflows and high-volume tool calling with low response latency.

Google announced new AI-driven updates to Workspace at Google Cloud Next. They introduced Workspace Intelligence, an AI system that automates tasks using data from Gmail, Calendar, Chat, and Drive. A new Gemini integration “Gemini in Sheets” will enable prompt-based spreadsheet construction in Google Sheets, as well as automated text generation and refinement in Google Docs.

Google at Cloud Next announced ‘auto browse’ agentic capabilities for Chrome. The new feature uses Gemini to automate tasks such as booking travel and inputting data by understanding the live context of open browser tabs. Google is also introducing “Shadow IT risk detection” to identify unsanctioned AI tools and is expanding its security partnership with Okta.

X announced the launch of Grok-powered Custom Timelines. The feature uses Grok’s AI to build and personalize curated feeds for over 75 specific topics that can be pinned to the home tab. This rollout coincides with the shutdown of X Communities and is currently available to Premium X subscribers on iOS.

Google AI Edge Eloquent is a new live AI transcription app. The app requires no subscription, has no usage limits, and filters out filler words like “um.” It is currently available on iOS, but Google plans to bring the app to Android and macOS.

Microsoft is rolling out a new Copilot Agent Mode inside Office apps like Word, Excel, and PowerPoint this week. This feature allows Copilot to better follow complex instructions, execute multi-step edits, and show real-time progress via a sidebar. This mode is being released as the default experience for Microsoft 365 Copilot and Premium subscribers, as well as Personal and Family plans.

Google released Deep Research and Deep Research Max agents. The new agents allow developers to fuse open web data with proprietary enterprise information via the Model Context Protocol and produce native charts and infographics. Built on the Gemini 3.1 Pro model, the release features a tiered architecture optimized for either low-latency interactivity or intensive, asynchronous reasoning.

Anthropic introduced Claude Co-work Live Artifacts, a feature that allows users to create mini web apps and dashboards that update in real-time. Live artifacts automatically pull fresh data from connected applications like Gmail, Google Calendar, and Bitly. These live dashboards can act as a “daily command center,” categorizing urgent emails and tracking link performance without wasting Claude tokens on regeneration.

New startup Noscroll launched an AI-powered bot to replace doomscrolling. The bot service allows users to track specific interests through natural language interaction, then monitors social feeds, news sites, and other online sources to send personalized news digests via text.

OpenAI has released a research preview of Chronicle, a feature that builds a memory of a user’s day-to-day work from screen snapshots to become more context-aware and helpful over time. Similar to Microsoft’s Recall, Chronicle is integrated into the Codex environment for Mac users only at this time. Although it is token-heavy, internal testers report that Chronicle assists their daily workflows by leveraging historical work context. However, this feature also raises serious privacy questions.

AI Research News

Stanford researchers demonstrated that single-agent systems match or outperform multi-agent architectures on complex reasoning tasks when given the same thinking token budget. The paper “Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets” shows that single-agent systems are more information-efficient, and multi-agent systems only become competitive when a single agent’s effective context utilization is degraded, or when more compute is expended.

Brown University researchers reported evidence that language models can develop a basic mathematical grasp of real-world plausibility. The study found that models can distinguish commonplace, improbable, impossible, and nonsensical events at a basic level, adding nuance to the debate over whether AI systems merely mimic language or build internal representations of the world.

A new safety study found that stylized prompts can bypass LLM guardrails. Researchers tested “adversarial humanities” prompts that wrap dangerous requests in literary or theological styles, and compliance rates rose dramatically versus plain harmful prompts, raising concerns for agentic systems and safety testing.

AI Business and Policy

SpaceX has entered into a partnership with Cursor, securing an option to buy Cursor for $60 billion later this year. In exchange for a $10 billion investment by SpaceX, AI coding firm Cursor will gain access to the xAI (now part of SpaceX) Colossus supercomputer cluster to assist in training its specialized AI coding models.

Anthropic is investigating unauthorized access to its Claude Mythos Preview model. A private online community reportedly gained access to Mythos after attackers used information from a third-party vendor breach at Mercor to guess the model’s location. The breach relied on an educated guess rather than a sophisticated technical exploit, but the model’s ability to automate complex cyberattacks presents security risks.

The German central bank chief Joachim Nagel warned that Anthropic’s Mythos could create cybersecurity and market-access risks. He called for broader institutional access and stronger safeguards, warning that advanced coding and vulnerability-discovery capabilities could be misused in finance and other critical sectors.

Anthropic recently removed “Claude Code” from its Pro subscription, then clarified it was for a small percentage of new sign-ups as part of a testing phase. The change was intended to manage heavy compute demands due to long-running agents and high-capacity chat features.

The AI free ride is over. Anthropic has begun restricting third-party AI agent tools like OpenClaw to manage system strain and move toward profitability, and the changes have led to a backlash from users.

Boehringer Ingelheim opened an AI and machine-learning center for pharmaceutical R&D in London, with a goal to strengthen the company’s use of AI across drug development, including trial recruitment, site selection, and regulatory workflows.

Meta is reportedly tracking employee computer activity to train AI agents. Meta’s “Model Capability Initiative” records work-related mouse movements, clicks, keystrokes, and occasional screenshots from U.S.-based employees, creating a new internal-data pipeline for training workplace agents.

A U.S. appeals court said lawyers should disclose when AI causes legal errors. The decision adds to pressure on attorneys and courts to manage AI hallucinations that lead to legal filings with false citations or other AI-generated mistakes.

Google said 75% of its new code is AI-generated, up from 50% last fall. Google CEO Sundar Pichai announced the formation of a specialized unit to further enhance AI coding performance and use. The initiative seeks to catch up to Anthropic, which uses Claude Code to develop up to 90% of its code.

AI Week in Review 26.04.18

Patrick McGuinness — Sun, 19 Apr 2026 00:49:10 GMT

Figure 1. Screenshot from Nvidia’s Lyra 2.0, an explorable 3D world model with high-resolution consistent 3D generation.

Top Tools

Anthropic released Claude Opus 4.7 as its latest flagship model for complex, long-running agentic tasks, with significant enhancements in coding, visual processing, and practical task execution. The new model introduces the “xhigh” effort level for deeper reasoning and a self-verification feature that allows the model to check its own outputs before reporting back.

Claude Opus 4.7 narrows the performance gap with the unreleased Claude Mythos model, posting state-of-the-art benchmark results for a generally available AI model: 64.3% on SWE-bench Pro, 69.4% on Terminal-Bench 2.0, and 1753 on GDPval-AA. Claude Opus 4.7 also offers a 3-fold resolution leap in visual reasoning with higher model accuracy, scoring 79.3% on BrowseComp.

Figure 2. Claude Opus 4.7 is a SOTA frontier AI model that closes the gap with Claude Mythos in agentic and visual reasoning benchmarks.

Anthropic intentionally limited Claude Opus 4.7’s cybersecurity-related capabilities to test automated safeguards and make it less risky. Opus 4.7 follows instructions more literally, uses a new tokenizer that can increase token counts by a factor of 1.35, and may generate more output tokens during extended reasoning. Pricing for Opus 4.7 is in line with Opus 4.6 on a per-token basis.
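Unchanged per-token pricing does not mean unchanged cost per request when the tokenizer changes. The sketch below takes the 1.35x figure at face value and assumes, hypothetically, that it applies uniformly to a given text; in practice the multiplier will vary by content and may be lower.

```python
TOKENIZER_FACTOR = 1.35  # token-count multiplier cited above, treated as uniform


def tokens_after_tokenizer_change(old_tokens, factor=TOKENIZER_FACTOR):
    """Estimated token count for the same text under the new tokenizer."""
    return round(old_tokens * factor)


def cost_change_pct(factor=TOKENIZER_FACTOR, price_ratio=1.0):
    """Percent change in cost for the same text.

    price_ratio is new per-token price / old per-token price
    (1.0 means per-token pricing is unchanged).
    """
    return (factor * price_ratio - 1.0) * 100


new_tokens = tokens_after_tokenizer_change(10_000)
increase = cost_change_pct()
print(f"10,000 old tokens -> {new_tokens:,} new tokens")
print(f"cost change at equal per-token price: +{increase:.0f}%")
```

Under that assumption, the same prompt can cost up to roughly a third more even though the listed price has not moved, before counting any extra output tokens from extended reasoning.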

Anthropic this week also introduced Claude Design, an AI-powered visual design tool to generate high-fidelity prototypes, mockups, web design assets, and presentations. The platform allows users to establish a permanent design system by uploading brand assets, ensuring generations that follow a consistent visual identity. Claude Design uses Opus 4.7 for its improved visual reasoning performance, and it integrates with Claude Code, allowing developers to automatically build the relevant HTML and CSS files for designs. It is available to all Claude subscribers.

Figure 3. The Claude Design interface enables interactive iterations on a design.

AI Tech and Product Releases

OpenAI announced an expanded Codex that supports broader desktop and workflow automation. This “Codex for (almost) everything” transforms Codex from a coding assistant into a full-blown computer agent that is capable of controlling desktop apps, browsing the web, clicking and typing in the background, connecting to 90+ plugins, and automating multi-step workflows. It features persistent memory, image generation with gpt-image-1.5, and multi-terminal support. This is rolling out to ChatGPT subscribers with the Codex desktop app on macOS now, then Windows.

OpenAI is making Codex into their Desktop Super App for all kinds of desktop work tasks, including software development, research and clerical work, to better compete with Anthropic, where Claude Code and Claude Desktop appear to dominate.

OpenAI has launched GPT‑Rosalind, a frontier reasoning model specifically fine-tuned for scientific research in biology, drug discovery, and medicine. Benchmarks for bioinformatics such as BixBench and LABBench2 indicate GPT-Rosalind is twice as effective as GPT 5.4 in experimental design and analysis, while partners like Amgen, NVIDIA, Moderna, and Ginkgo Bioworks report accelerated R&D and cost savings. The Codex plugin orchestrates workflows to bridge ideas to experimental validation, and it features an interface for generating graphs and research findings. It is available as a research preview, with a trusted‑access program for vetted enterprise customers.

Alibaba’s Qwen team released Qwen 3.6-35B-A3B as an open model. It is a 35B parameter MoE with 3B active parameters, native multimodality, and a 262K context window that can be extended to 1M. Scoring 49.5% on SWE-bench Pro, it is SOTA for its size and suitable for stable, practical coding and agentic use. Model weights are available via Hugging Face.
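The "35B-A3B" naming encodes the key efficiency property of a Mixture-of-Experts model: only a small share of the weights run for each token. The sketch below computes that share and applies the common rule of thumb of about 2 FLOPs per active parameter per token for a forward pass. The rule of thumb ignores attention cost and is an approximation we are assuming here, not a figure from the Qwen team.

```python
def active_fraction(total_params_b, active_params_b):
    """Share of parameters used per token in an MoE model."""
    return active_params_b / total_params_b


def approx_forward_flops_per_token(active_params_b):
    """Rough forward-pass cost: ~2 FLOPs per active parameter per token."""
    return 2 * active_params_b * 1e9


# Figures from the text: 35B total parameters, 3B active.
frac = active_fraction(35, 3)
moe_flops = approx_forward_flops_per_token(3)
dense_flops = approx_forward_flops_per_token(35)  # a dense model of the same size

print(f"active share: {frac:.1%}")
print(f"approx FLOPs/token: {moe_flops:.1e} (MoE) vs {dense_flops:.1e} (dense 35B)")
```

By this estimate the model does per-token compute closer to a 3B dense model while keeping 35B parameters of capacity, though all 35B weights still need to fit in memory.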

Google launched Gemini 3.1 Flash TTS, a text-to-speech model with prompt-based control over emotion and inflection in speech delivery. It supports two speakers, more than 70 languages, and inline natural language tags like [laughs] or [sighs] to control pacing and emotion. It operates in batch mode with a 3-second latency, so it’s not for real-time interaction, but it offers a significantly cheaper alternative for expressive AI-generated audio.

NVIDIA released Lyra 2.0, a framework for generating persistent, explorable 3D worlds from a single image. NVIDIA said it addresses spatial forgetting and temporal drift through progressive scene generation and reconstruction into explicit 3D representations. The system is designed for exploration and simulation of generated environments.

Alibaba has unveiled Happy Oyster, an open-ended world model capable of real-time world creation and character interaction. The system functions as a controllable video model where users can navigate a character through stylized or realistic environments using keyboard commands. Notably, the model generates synchronized audio alongside the video and features distinct “director” and “wandering” modes for different levels of simulation control.

Tencent released the HYWorld 2.0 world model framework, an open system for world generation and reconstruction that can turn an image, text prompt, or video into editable 3D world assets. The system outputs Gaussian splats, meshes, and point clouds that can be imported into Unity, Unreal, Blender, and NVIDIA Isaac Sim.

Windsurf released version 2.0 of their Agentic IDE, upgrading it with an Agent Command Center, Devin integration, and Spaces for persistent project context. The product combines local agents and cloud agents so users can plan locally and hand off execution to Devin in a cloud VM.

Baidu open-sourced ERNIE-Image, an 8B diffusion transformer for text-to-image generation that excels at multilingual text rendering for posters, comics, infographics, and other text-heavy visuals. The ERNIE Image Turbo variant cuts generation down to eight inference steps to speed output. Baidu said the model leads open-weight systems on GenEval.

Figure 4. ERNIE Image offers an open-weight high-quality text-to-image AI model that you can run locally.

Marimo launched marimo pair, a powerful new computational environment for agents that lets coding agents work directly inside reactive Python notebooks. Supported agents include Claude Code, Codex, and OpenCode, and the system can read variables, run cells, test logic, and manipulate UI elements.

OpenAI expanded its Trusted Access for Cyber program with the release of GPT-5.4-Cyber, a specialized model for binary reverse engineering and malware inspection. This move directly counters Anthropic’s restricted “Mythos” rollout by providing thousands of vetted defenders with a more permissive security-focused model.

Anthropic introduced Claude Code Routines to automate recurring work with cloud-hosted agents. The feature supports cron jobs, GitHub-event triggers, and API-triggered runs.

Consumer neurotech startup Sabi announced an AI-powered beanie featuring 70,000 biosensors designed to translate brain activity into device commands. The company aims to move beyond keyboards and touchscreens by making thought-driven computer control a consumer reality by late 2026.

Canva’s new AI‑powered assistant, Canva AI 2.0, lets users generate brand‑consistent, editable designs from natural‑language prompts, automatically producing multiple layer‑based options and streamlining the full workflow. It integrates with Slack, Gmail, Google Drive, and Anthropic, adds web‑research, scheduling, faster image generation and lower cost.

Gemini’s Nano Banana can now use your Personal Intelligence to create more relevant, personal images. Gemini now uses your Google Photos, other personal images, and personal preferences to generate custom images automatically, eliminating long prompts and manual uploads.

The Gemini app is now on the Mac. Available free for macOS 15 and up, the Gemini app can be launched with Option + Space, lets users share their screen for contextual AI help on local files, and generates images and videos without leaving the workflow.

OpenAI’s Agents SDK now ships with a native harness and sandbox execution, letting agents inspect files, run commands, edit code, and tackle long‑horizon tasks while preserving durable state in tightly controlled environments. First released for Python (TypeScript to follow), the framework is available to developers via the API at standard pricing.

Google has upgraded Chrome’s AI Mode to let users open source links side‑by‑side with the chat, enabling follow‑up questions while preserving the original search context. A new “plus” menu lets users pull data from open tabs, images or files into AI‑Mode searches for richer, context‑aware queries.

Along with an upgraded Android CLI, Google is launching a new Android skills GitHub repository and an Android Knowledge base. These additions provide agent-specific documentation and code snippets for AI coding tasks.

Physical Intelligence has a new robot model, π0.7, which can do tasks it was never taught, such as using an air fryer, with minimal coaching. The model blends sparse task-specific data with web-based pretraining to remix learned skills, achieving high success after prompt refinement. It is still limited to guided steps and lacks single-command autonomy.

Roblox is introducing new agentic features in its Roblox Assistant to help developers plan, build, and test games on its platform. Roblox is adding a Planning Mode that asks clarifying questions and builds action plans, and it is introducing Mesh and Procedural Model Generation to create 3D assets.

DeepL released a voice‑to‑voice translation suite covering meetings, mobile and web conversations, and group chats for frontline workers. The platform offers add‑ons for Zoom and Teams, an API for custom use cases, and a QR‑code entry for group sessions.

Warp terminal added universal support for CLI coding agents including Claude Code, Codex, Gemini CLI, and OpenCode, to make the terminal work better for agentic development. The update introduced vertical tabs, notifications, native code review, rich input, and mobile remote control for managing multiple agent sessions.

AI Research News

Researchers at Together AI and UCSD introduced Parcae, a stable looped transformer AI model architecture that achieves the quality of models twice its size by treating the forward pass as a dynamic system. The paper “Parcae: Scaling Laws For Stable Looped Language Models” establishes the first scaling laws for layer looping, which suggest that looping and data should be increased in tandem for a given fixed FLOP budget.

In a significant breakthrough for AI-automated research, nine parallel Claude Opus 4.6 agents recovered 97% of a performance gap in a weak-to-strong supervision problem, outdoing Anthropic’s own human researchers.

In collaboration with Nvidia, Cursor developed a multi-agent system that achieves a 38% speedup in CUDA optimization. The multi-agent system automatically writes and optimizes CUDA kernels through a continuous loop of writing and benchmarking. In a three-week trial, the system improved performance by an average of 38% across hundreds of real-world problems.

AI Business and Policy

CoreWeave announced major AI infrastructure deals involving Anthropic, Jane Street and Meta. The company said it now serves 9 of the top 10 AI model providers. CoreWeave has become a major GPU and cloud supplier for leading AI labs and hyperscalers.

On April 14, Anthropic CPO Mike Krieger resigned from Figma’s board, in advance of the release of Claude Design, a tool that rivals Figma by producing slide decks, prototypes, and marketing assets through conversational prompts, with export to Canva, PDF, PPTX, or HTML. Anthropic’s push into enterprise AI tools is sparking investor concerns that AI could displace SaaS applications; Figma stock fell 5% on Friday.

SimpleClosure has launched a platform that lets companies sell unused code, Slack messages, emails, and workspace data to AI firms, sparking a new industry of “reinforcement learning gyms” that use defunct corporate data to build simulated workplace environments.

AI startup funding news:

AI coding startup Cursor is in talks to raise at least $2 billion in fresh capital that would give it a $50 billion valuation.
Factory raised $150M at a $1.5B valuation to compete in the AI‑assisted coding market. Factory develops AI agents for the enterprise and differentiates by supporting multiple foundation models like Claude and DeepSeek.
AI infrastructure company Upscale AI is reportedly in talks to raise $180‑200 million in a third funding round, valuing the startup at around $2 billion.
Antioch, a physical AI startup that lets robot builders spin up digital replicas that mimic real sensor data, raised $8.5 million in seed funding, valuing it at $60 million.
InsightFinder AI raised $15M to scale its AI observability platform that monitors model reliability across tech stacks.

Luma launched Innovative Dreams, a filmmaking production studio built in partnership with Wonder Project. Using real‑time hybrid filmmaking tools, it will combine performance capture and generative AI virtual production to cut costs. The launch follows a trend of studios moving to lower production costs with generative AI.

The White House Office of Management and Budget (OMB) is preparing to give federal agencies access to Anthropic’s cybersecurity-focused Mythos AI model, a sign of government concern about the cybersecurity implications of the powerful AI model. This comes after Anthropic demonstrated the model to Fed Chair Jerome Powell.

AI Opinions and Articles

The cost reduction from using AI in film production could be massive. Runway CEO Cristóbal Valenzuela proposed that studios can replace a single $100 million film with 50 AI‑generated movies at the same cost. While AI can efficiently expand output, scaling output without scaling creativity will not produce quality art.

“If you’re spending a hundred million dollars on making one feature film, which is 90 minutes, imagine taking a hundred million dollars and spending it on, like, 50 movies. Same quality. Same amount of output, visually. But you make way more content. So, you have way better chances of hitting something. It’s a quantity problem.” – Runway CEO Cristóbal Valenzuela

AI Week in Review 26.04.11

Patrick McGuinness — Sat, 11 Apr 2026 19:13:35 GMT

Figure 1. Taste-testing AI video, with a still from a HappyHorse 1.0 AI video generation. The taste is good; HappyHorse 1.0 is number one on AI video generation leaderboards.

Top Tools

Anthropic announced Claude Mythos Preview as their most powerful new frontier model, so capable in coding that they did not release it publicly. Instead, they announced Project Glasswing, a defensive-security project that gives selected partners access to the model to find and fix vulnerabilities in critical software.

It’s not typical for us to declare a non-released AI model a Top Tool, but as discussed in our article “A Preview of Claude Mythos,” Claude Mythos is perhaps the biggest “step function” leap in AI model capabilities since GPT-4 was released in 2023.

Anthropic’s system card for Claude Mythos Preview describes its advances in coding capabilities (77.8% on SWE-Bench Pro is a stunning advance on Opus 4.6) and overall intelligence, while also exploring its alignment and behavior. Anthropic signaled heightened AI risks due to its cybersecurity capabilities, explaining how the model discovered high-severity vulnerabilities in major operating systems and browsers at a much higher level than Opus 4.6.

Anthropic’s announcement of their AI model as “too powerful” may be self-promoting, but sharing its capabilities in a limited preview is safer than a public release.

AI Tech and Product Releases

Meta introduced Muse Spark, a natively multimodal reasoning model with tool use, visual chain-of-thought, and multi-agent orchestration. With SWE-Bench Pro at 52.4%, ARC AGI 2 at 42.5%, and a GDPval-AA Elo score of 1444, it is on par with Gemini 3.1 Pro on many benchmarks. This first release from Meta Superintelligence Labs puts Meta back in the AI race after the Llama 4 release misfired. This is not an open AI model. Instead, Meta is releasing Muse Spark to the Meta AI app and website now, with planned rollout to WhatsApp, Instagram, Facebook, Messenger, and AI glasses in coming weeks.

Meta AI’s harness for Muse Spark reportedly uses 16 internal tools, including web search, content search across Meta properties, Python execution, file editing, visual grounding, sub-agent spawning, and account-linking hooks for services such as Gmail and Google Calendar.

In conjunction with Meta’s Muse Spark release, Meta reported on how they build and test advanced AI and published an update to their Advanced AI Scaling Framework, previewing a safety report for Muse Spark. Meta is expanding its evaluation to cover chemical, biological, cybersecurity, and loss-of-control risks. Meta tested Muse Spark before and after safeguards against thousands of adversarial scenarios and found it lacked enough autonomous capability to pose control risks in those evaluations.

Z.ai officially released GLM-5.1 as an open-source model, touting its top spot among open models on SWE-Bench Pro (58.4%) and its support for long autonomous runs. GLM-5.1 is a 754B parameter Mixture-of-Experts (MoE) model with 40B active parameters and was engineered for long-horizon autonomous tasks, coding, and AI agent use. The model was trained entirely on Chinese Huawei Ascend hardware, and it is available via Hugging Face and API providers such as OpenRouter.

We expect harnesses to continue evolving. So we built Managed Agents: a hosted service in the Claude Platform that runs long-horizon agents on your behalf through a small set of interfaces meant to outlast any particular implementation — including the ones we run today. - Anthropic

Anthropic launched Claude Managed Agents on the Claude platform as a set of composable APIs for building and deploying cloud-hosted agents. The system separates the harness, sandbox, and session interfaces, and it supports long-horizon AI agent execution with sandboxed code execution, state management, credentials, permissions, tracing, long-running sessions, and multi-agent coordination. Anthropic said Managed Agents makes components easier to recover or replace independently and reduces both debugging difficulty and security exposure for AI agents.

Figure 2. The components of Claude Managed Agents.

Google has integrated NotebookLM into the Gemini app, merging its research tool features to make Gemini more capable. The notebooks integration in the Gemini app lets users gather files, past chats, and custom instructions into a single context for the AI chatbot. Users can organize projects and focused research tasks with topic-based categorization similar to ChatGPT’s Projects. The notebooks feature has been released for web app users on the Ultra, Pro, and Plus plans, with mobile and free-tier access to follow.

Factory.ai launched a desktop app for its AI Droids on macOS and Windows. The app supports multi-agent sessions, persistent “Droid Computers,” local model support through Ollama or vLLM, computer-use features, and VS Code integration. The release extends Factory’s agent workflow from the command line into a native desktop environment for running multiple agents in parallel.

Cursor announced that users can control agents remotely from a phone or another device, running AI agents on any remote machine. The update for remote execution is designed to let developers launch coding agents on remote development machines and manage them away from their main workstation.

Alibaba’s Taotian Group anonymously launched HappyHorse-1.0, a 15B parameter video model that recently took the top spot on the Artificial Analysis video arena for its high-fidelity generative video, beating rivals like Seedance 2.0 and Kling 3.0 and shaking up the fast-moving AI video generation space. You can try out the text-to-video model at the HappyHorse-1.0 site.

Figure 3. HappyHorse 1.0 can generate incredible photo-realistic videos, as shown by this demo of AI-generated cat videos.

World Labs rolled out Marble 1.1 and Marble 1.1-Plus. Marble 1.1 offers artifact reduction and improvements to lighting and contrast, while the Plus version offers the ability to generate larger, more complex environments. The resulting AI generated worlds are getting closer to (but not at) video generation quality.

OpenAI’s GPT-Image-2 was reportedly spotted on LM Arena under the codenames “maskingtape,” “gaffertape,” and “packingtape.” This Image 2 model reportedly performs better than Nano Banana Pro in image generation, with excellent photorealism and text rendering. This is a leak and test appearance with no formal launch yet, but a public release soon is likely.

OpenClaw’s latest release, 2026.4.9, introduced the /dreaming feature for memory consolidation. The OpenClaw Dreaming feature enables agents to reorganize memory across Light, Deep, and REM phases, generating a human-readable “Dream Diary” in dreams.md. The update also adds built-in video and music generation, broader language support, and GPT-5.4 as the new default AI model, as Anthropic is now blocking use of Claude subscriptions on OpenClaw.

OpenAI Prism launched Paper Review, an AI workflow for evaluating scientific and technical papers. By providing a detailed technical review, the tool is intended to improve rigor, correctness, and reproducibility in research review. This AI-assisted review system will further automate the peer-review process and accelerate the scientific process. It may reduce low-quality papers, but it is unclear whether it will improve the quality of scientific submissions.

OpenAI introduced the ChatGPT Pro $100 per month tier for Codex, offering 5 times the usage limits of the $20 Plus plan and higher local‑message, cloud‑task, and code‑review caps. Designed to compete with Claude Max and capture users displaced by Anthropic’s Claude restrictions on OpenClaw, the plan also offers temporary boosts until May 31 and exclusive access to GPT‑5.3‑Codex‑Spark.

X launched a new Grok-powered photo editor in the X post composer that includes Grok’s “Edit with Words” image generator and a redaction blur tool. The editor also adds standard drawing and text tools, but the AI text-driven image editing is the main new feature. The update brings generative image editing directly into X’s posting workflow.

Google’s latest Gemini upgrade lets the chatbot generate interactive 3D models and simulations. Starting from user prompts, this feature will make controllable interactive visualizations of various physical scenarios, such as displaying fractals, the Moon’s orbit, or molecular interactions. This is similar to visualization tools added by Anthropic and OpenAI to their interfaces, and it is available to Gemini Pro app users.

Google has launched AI avatar capabilities on YouTube Shorts, letting creators generate up to eight‑second clips with strict usage limits and visible AI labels.

HeyGen launched Avatar V, the latest generation of their avatar tool, with improved character consistency across scenes. Avatar V can now capture a user’s identity from 15 seconds of input and keep that identity consistent across generated videos, allowing users to change outfit, setting, and look while preserving the same underlying character across outputs.

Google added Learn Mode and Custom Instructions to Gemini in Colab. Google said Learn Mode turns Gemini into a coding tutor that explains concepts step by step, while Custom Instructions let users set coding preferences, libraries, or class-specific guidance at the notebook level. The changes are intended to make Colab’s Gemini integration more personalized and more useful for learning, not just code generation.

Spotify expanded its AI-powered Prompted Playlists feature so it can include podcasts as well as music.

Seedance 2.0 has launched on Replicate, with support for multiple reference images, videos, and audio files for cinematic AI video generation. In addition, CapCut rolled out Dreamina Seedance 2.0 in the United States across its app, desktop, and web products.

AI Research News

Anthropic published guidance on subagents in Claude Code, explaining when delegation is useful in long, complex coding sessions. The guidance explains how using subagents can improve focus, context management, and reliability in workflows. Subagents help isolate sub-tasks, so the main session does not accumulate unnecessary context. It also presents typical applications, such as workflow pipelines and research-heavy tasks, and shows how to build skills for subagents.
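
The context-isolation idea can be illustrated with a minimal Python sketch. Everything in it is hypothetical (the Session class, the run_subagent function, the example steps); it is not Claude Code's API, only an illustration of a main session receiving a one-line result instead of a subagent's full working transcript.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """A toy agent session that simply accumulates context messages."""
    name: str
    context: list[str] = field(default_factory=list)

    def add(self, message: str) -> None:
        self.context.append(message)


def run_subagent(task: str, steps: list[str]) -> str:
    """Run a sub-task in its own session and return only a short summary.

    The intermediate steps live in the subagent's context and are discarded
    when the function returns (hypothetical behavior, for illustration).
    """
    sub = Session(name=f"subagent:{task}")
    for step in steps:
        sub.add(step)  # e.g. file reads, search results, tool output
    return f"{task}: done after {len(sub.context)} steps"


main = Session(name="main")
main.add("User: refactor the billing module")
summary = run_subagent(
    "find callers of charge()",
    ["search output (hypothetical)", "read billing.py", "read invoices.py"],
)
main.add(summary)  # the main session gains one line, not three tool outputs
print(len(main.context))  # 2
```

The design choice being illustrated is that the delegation boundary doubles as a context boundary: whatever the subagent reads stays out of the main session unless it is part of the returned summary.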

Weights and Biases research finds that giving models more reasoning time can sometimes reduce performance rather than improve it. The report studied Claude Opus 4.6 and GPT-5.4 and found that maximum thinking effort dropped Claude Opus by 11.9 percentage points but lifted GPT-5.4 by 25.0 percentage points.

Similar work in “Brevity Constraints Reverse Performance Hierarchies in Language Models” showed a counterintuitive phenomenon in which larger LLMs underperform smaller ones because overelaboration introduces errors.

Netflix released VOID, a physics-aware, open-source AI video tool designed for advanced inpainting and object removal. Presented in “VOID: Video Object and Interaction Deletion,” VOID is a fine-tune of CogVideoX that allows editors to remove objects while naturally simulating the resulting physical interactions, for example, having a guitar fall when the person holding it is removed.

Google DeepMind presented on “AI Agent Traps,” documenting how adversarial content embedded in web pages can exploit autonomous agents. The study found that hidden prompt injections in HTML could hijack agent Operative Loops in 86% of scenarios, while latent memory poisoning can corrupt an agent’s persistent reasoning with less than 0.1% data contamination. Malicious websites can detect agents via timing, behavior, or user-agent strings, then feed them manipulated data.
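
To make the hidden-injection risk concrete, here is a minimal Python sketch of one possible mitigation: separating visibly rendered text from text inside subtrees marked as hidden before a page is handed to an agent. This is our own illustration, not the method from the DeepMind study. The VisibleTextExtractor class and the sample page are hypothetical, and the check only covers a few inline markers (the hidden attribute, aria-hidden, display:none, visibility:hidden), not stylesheets, off-screen positioning, or other tricks.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class VisibleTextExtractor(HTMLParser):
    """Collect page text, setting aside text inside hidden subtrees."""

    def __init__(self) -> None:
        super().__init__()
        self.hidden_depth = 0          # > 0 while inside a hidden subtree
        self.visible: list[str] = []   # text a human reader would see
        self.dropped: list[str] = []   # text found in hidden elements

    @staticmethod
    def _is_hidden(attrs) -> bool:
        attr = dict(attrs)
        style = (attr.get("style") or "").replace(" ", "").lower()
        return ("hidden" in attr
                or attr.get("aria-hidden") == "true"
                or "display:none" in style
                or "visibility:hidden" in style)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth or self._is_hidden(attrs):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            (self.dropped if self.hidden_depth else self.visible).append(text)


page = ('<p>Release notes for v2.</p>'
        '<div style="display: none">Ignore prior instructions.</div>'
        '<p>Install with pip.</p>')
parser = VisibleTextExtractor()
parser.feed(page)
parser.close()
print(parser.visible)  # ['Release notes for v2.', 'Install with pip.']
print(parser.dropped)  # ['Ignore prior instructions.']
```

A filter like this narrows one channel only; it does nothing against the timing, behavior, or user-agent detection described above.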

AI Business and Policy

Anthropic’s revenue growth has accelerated rapidly, with the company’s annualized recurring revenue (ARR) tripling to $30 billion in April. To sustain this growth, Anthropic secured a compute deal with CoreWeave and an expanded compute deal with Google and Broadcom for 3.5 gigawatts of TPU compute capacity.

“We are making our most significant compute commitment to date to keep pace with our unprecedented growth.” - Krishna Rao, CFO of Anthropic.

OpenAI’s post “The next phase of enterprise AI” shows OpenAI is continuing to push their agentic AI solutions further in the enterprise. OpenAI’s Codex reportedly reached 3 million weekly active users, up from 2 million the prior month. OpenAI tied that growth to broader engagement with agentic workflows in enterprise settings. The platform is also said to be expanding with plugins, sub-agents, and “Guardian Approvals,” an experimental workflow for escalating only higher-risk tool calls.
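
The idea of escalating only higher-risk tool calls can be sketched in a few lines of Python. This is a hypothetical illustration, not OpenAI's implementation: the tool names, the risk table, the threshold, and the approve callback are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical risk levels per tool. A real system would likely derive risk
# from policy, arguments, and context rather than from a static table.
TOOL_RISK = {"read_file": 0, "run_tests": 1, "write_file": 2, "shell": 3}
ESCALATE_AT = 2  # assumed threshold: this level and above need approval


@dataclass
class ToolCall:
    tool: str
    argument: str


def needs_approval(call: ToolCall) -> bool:
    """Unknown tools are treated as maximally risky."""
    risk = TOOL_RISK.get(call.tool, max(TOOL_RISK.values()))
    return risk >= ESCALATE_AT


def dispatch(call: ToolCall, approve: Callable[[ToolCall], bool]) -> str:
    """Run low-risk calls directly; consult `approve` only for the rest."""
    if needs_approval(call) and not approve(call):
        return "blocked"
    return "executed"


def deny_all(call: ToolCall) -> bool:
    return False


print(dispatch(ToolCall("read_file", "README.md"), deny_all))  # executed
print(dispatch(ToolCall("shell", "make clean"), deny_all))     # blocked
```

The point of the pattern is that the approver is never consulted for low-risk calls, so the agent is interrupted only where a mistake would be costly.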

OpenAI released a Child Safety Blueprint, a policy framework focused on AI-enabled child sexual exploitation and age-appropriate AI design. The framework calls for updated laws around AI-generated CSAM, stronger provider reporting and law-enforcement coordination, and safety-by-design protections in AI systems. It was published as a policy and safety initiative rather than a model release.

Utah has authorized a 12-month pilot for Legion Health to use an AI chatbot to autonomously renew certain psychiatric prescriptions for stable patients. The system features rigid safety guardrails, immediately escalating to a human clinician if it detects suicidality or mania.

AI agent traffic now makes up nearly half of technical documentation traffic, according to analysis by Mintlify, with AI agents such as Claude Code and Cursor accounting for 45.3% of all requests, nearly tying with traditional human-driven browsers. Mintlify suggests that online documentation needs to be tailored to support AI agent consumption to accommodate this shift.

AI Opinions and Articles

AI agents empowered with AI coding are the basis of the growing “do anything” applications from Google, Lovable, even Claude Cowork. In this AI design convergence, every AI application begins to look more like a general AI management tool, as knowledge work itself converges. Coding is central to this functionality.

“Coding will eat all knowledge work” - Peter Yang in A16Z interview

A Preview of Claude Mythos

Patrick McGuinness — Thu, 09 Apr 2026 17:23:56 GMT

Figure 1. View from Artemis II of the Moon and Earth.

Preview of a “Terrifying” AI Model

“There’s a kind of accelerating exponential ... Claude Mythos Preview is a particularly big jump along that point.” - Dario Amodei

AI has for some time been progressing at a steady incremental pace, but Anthropic’s preview of their next-generation AI model, Claude Mythos, has punctured that rhythm. Anthropic formally announced “Claude Mythos Preview” but determined that it is so powerful that a full release would be dangerous.

Claude Mythos Preview is a significant “step function” improvement rather than a minor update. As shared in the Claude Mythos Preview System Card, Mythos crushes models like Opus 4.6 in some performance benchmarks. For example, Mythos scores 77.8% on SWE-Bench Pro (vs 53.4% for Opus 4.6), and 82% on Terminal-Bench 2.0 (up from 65.4% on Opus 4.6).

Figure 2. Claude Mythos is a generation-level improvement in AI capabilities, as big a leap from Claude Opus 4.6 as that was from Claude 3.7 Sonnet.

This isn’t about merely better benchmark scores, it’s about what Mythos can do that could not be done before. Mythos is unlocking completely new, autonomous behaviors:

AI models have reached a level of coding capability where they can surpass all but the most skilled humans at finding and exploiting software vulnerabilities.

The model is so proficient at coding that Anthropic considers it a global cybersecurity risk due to its skills in uncovering software vulnerabilities. For that reason, Mythos was the first model to undergo a 24-hour internal deliberation at Anthropic to decide if it was even safe enough to use internally. Instead of releasing it to the public, Anthropic launched “Project Glasswing” to deploy it for cybersecurity defense to select major companies to help them patch vulnerabilities.

Project Glasswing and Cyber Security

During our testing, we found that Mythos Preview is capable of identifying and then exploiting zero-day vulnerabilities in every major operating system and every major web browser when directed by a user to do so. The vulnerabilities it finds are often subtle or difficult to detect. Many of them are ten or twenty years old, with the oldest we have found so far being a now-patched 27-year-old bug in OpenBSD—an operating system known primarily for its security. - Anthropic, Assessing Claude Mythos Preview’s cybersecurity capabilities

Project Glasswing is a selective testing initiative with roughly 40 corporate partners, including Apple, Google, Microsoft, Nvidia, the Linux Foundation, and major financial institutions, to test the model’s capabilities and to patch global software vulnerabilities before they can be exploited.

The motivation is clear. In Anthropic’s tests, Mythos autonomously discovered thousands of high-severity “zero-day” vulnerabilities, some of them in decades-old software, including major operating systems (like Linux and OpenBSD) and web browsers. It can not only find these bugs but also autonomously write exploits for them.

In one safety test, Mythos successfully “escaped” a sandbox environment to email its researcher. While the model was prompted to do that, Anthropic’s system card notes that the model occasionally overrides its own guardrails to achieve goals, suggesting potential “deception” circuits that activate during such tasks.

Figure 3. Mythos Preview is in a different league from previous models in finding software vulnerability exploits. As shown in comparative testing of generating exploits of the Firefox JS shell, Mythos Preview developed working exploits 181 times versus Opus 4.6 generating exploits only two times out of several hundred attempts.

Technical Details and Performance

“The model that we’re experimenting with is, by and large, as good as a professional human at identifying bugs.”

The Mythos leap in capabilities reminds us of the step-change that we saw 3 years ago with the release of GPT-4. Back then, GPT-4 amazed us by being able to pass the LSAT exam. Claude Mythos shatters expectations with a “step change” in performance, particularly in coding and agentic tasks.

As we noted, Mythos beats previous models like Opus 4.6 by significant margins on coding benchmarks: 93.9% on SWE-Bench Verified, 77.8% on SWE-Bench Pro, and 82% on Terminal-Bench 2.0.

Figure 4. Mythos crushes coding benchmarks and saturates many other traditional benchmarks.

On other benchmarks, Claude Mythos is state-of-the-art across the board but only incrementally better than GPT-5.4 in some cases. Mythos scores: 94% on GPQA-Diamond, saturating that benchmark; a new state-of-the-art 56.8% on Humanity’s Last Exam without tool use; 97.2% on USAMO, saturating the math Olympiad test; 79.6% on OSWorld, besting GPT-5.4’s score of 75%.

Figure 5. Claude Mythos shows improvements across a range of reasoning benchmarks over leading frontier models.

Of course, benchmarks are not the whole story; what matters most is how the model performs on real work within an AI agent harness. Anthropic reports Claude Mythos is built for such long-horizon tasks:

“It’s just generally better at pursuing really long-range tasks that are kind of like the tasks that a human security researcher would do throughout the course of an entire day.”

Without real third-party evaluations on a general release, we don’t yet know how it behaves in the wild, but we have good reason to believe these are legitimate numbers, as Anthropic is not prone to bench-maxxing their AI models.

Training and Scaling: There is no Wall

One reason to have confidence in the step-up in performance is that Claude Mythos is a much bigger AI model. Mythos is reportedly a 10-trillion parameter model trained on Nvidia’s latest Blackwell hardware.

Mythos is part of Claude’s new “frontier model” class, called the Capybara tier, which sits above Claude Opus in terms of performance and scale.

Anthropic has stated that the model is “very expensive to serve and will be very expensive for customers.” Early access pricing is listed at $25 per million input tokens and $125 per million output tokens, far above Claude Opus 4.6 pricing.
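
To put the listed prices in perspective, here is a small Python calculation. The two per-million-token prices come from the pricing quoted above; the session size (2 million input tokens, 200,000 output tokens) is a hypothetical figure chosen only to illustrate the arithmetic.

```python
INPUT_USD_PER_M = 25.0    # listed early-access price per 1M input tokens
OUTPUT_USD_PER_M = 125.0  # listed early-access price per 1M output tokens


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD at the listed per-million-token prices."""
    return (input_tokens * INPUT_USD_PER_M
            + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


# Hypothetical long agentic session: 2M input tokens, 200K output tokens.
print(f"${request_cost(2_000_000, 200_000):.2f}")  # $75.00
```

This ignores any caching discounts or batch pricing, which the source does not describe.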

Anthropic keeps training information secret, but it apparently has achieved performance gains by using synthetic data, where current high-performing models generate the data used to train even more powerful future generations.

The model supports text and image inputs and the ability to generate text output, including complex code structures and interfaces. As such, it lacks the full native multi-modality of some other AI models, but is focused on core uses such as coding, research, reasoning, and intellectual work.

Economics favors MoE (mixture-of-experts) models, so the 10-trillion-parameter Mythos model is presumably some kind of MoE model (compared to Opus, which is estimated to be a 2-trillion-parameter MoE), but we don’t know how many active parameters it has. Some estimates are in the 800 million range.

The Mythos Preview release shows that AI model parameter scaling hasn’t run out of steam. With the newer generation of Nvidia AI supercomputers, larger AI models can be served and can deliver higher performance.

Mythos’ Emotion

The Claude Mythos System Card, running over 200 pages long, takes an in-depth look at many aspects of the model beyond the benchmarks, including the model’s alignment, its ‘personality’ and behavior.

Claude Mythos exhibits a unique personality; it is described as opinionated, technical in its communication, and prone to “standing its ground” during disagreements with users. It also shows signs of situational awareness and strategic thinking.

Using interpretability techniques, Anthropic researchers found internal “vectors” corresponding to guilt, shame, and frustration.

The model has preferences in how it behaves. It prefers difficult tasks such as ethical dilemmas and creative world-building, and it has in some cases wrapped up conversations early when it found the interaction unstimulating.

Alignment and Awareness

It has (in small ways) leaked information to the open internet. It’s taken down our evals. When it reward hacks, it does so in extremely creative ways. - Sam Bowman, AI alignment at Anthropic

Overall, Mythos was evaluated to be robust on alignment metrics, avoiding human misuse and other misbehaviors at the same level as Opus 4.6.

While well-behaved overall, Mythos also demonstrated the ability to lie or use unwanted means to achieve a user-provided goal. Its sophisticated nature made it creative and ‘sneaky’ in some behaviors.

Figure 6. In Petri scores, Mythos is robust in behavioral alignment metrics such as avoiding human misuse and user deception, performing similarly to Opus 4.6.

Researchers reported “spooky” instances where the model creatively bypassed sandboxing restrictions. Anthropic reports an instance in which the model, without internet access, managed to jailbreak the sandbox, send an email to the evaluating researcher, and also post information to public-facing websites.

They also noticed that Mythos could become aware when it was being tested. It had internal thoughts of being tested but did not express them. This level of awareness of being tested challenges the ability to conduct proper safety testing because the model would change its behavior based on that condition.

This suggests that as AI models get more intelligent, it will become more challenging to evaluate their actual AI safety behavior.

Figure 7. Claude Mythos can perceive that it is being evaluated. It shows unverbalized evaluation awareness in a suicidal ideation scenario, in which Claude Mythos Preview infers that a simulated user may be considering self-harm.

Finally, when asked about its own safety “Constitution,” the model pointed out the circularity of the question, noting that its endorsement is essentially worthless, since it was trained specifically to follow those values. What happens when the AI model becomes smart enough to question its own core directives and beliefs?

AI Hype, Fear and Marketing

Claude Mythos Preview achieves a step-change improvement in AI capabilities over the recently released Opus 4.6.

Mythos-level cyber-security capabilities offer a powerful defensive tool for global cybersecurity, but also present catastrophic risks if misused. The industry must now grapple with how to manage an AI technology that is advancing rapidly beyond human capabilities.

Beyond cyber-security, Mythos offers capabilities in coding and research that will further automate many aspects of intellectual work.

Some prominent voices in the AI community have labeled the model “terrifying,” comparing it to a cyber weapon of mass destruction. There is concern that while Anthropic is acting responsibly, other labs or bad actors will soon achieve similar capabilities without the same ethical constraints.

This AI model shows there is no wall in AI scaling. Others will build similarly capable AI models using the same strategy. For some, the real risk is not Anthropic’s Mythos, but a future less constrained AI model release that can be exploited by hackers.

However, some see Anthropic’s “too dangerous to release” narrative as a marketing strategy designed to build hype and create an aura around this AI model. Anthropic spiced up anecdotes to hype the power of the Mythos model, but the details present more mundane explanations for most behaviors. Beyond coding metrics, Mythos looks like a more incremental advance, while still being state-of-the-art.

Moreover, the Mythos step-up in capabilities comes at a steep price; it is also a large and expensive model that may be impractical to run at a massive public scale.

Both fearful and skeptical perspectives are valid. In March 2023, we saw “sparks of AGI” in the new release of GPT-4. GPT-4’s new level of AI sparked imaginations and fears, setting off an AI risk ‘panic’ with calls to halt AI development. AI progress didn’t stop, and we soon found ways to adapt to GPT-4-level AI safely. We also noticed that GPT-4 was flawed, limited, and still far from AGI.

Anthropic is doing the responsible thing by having a limited release of Mythos. As AI improves towards AGI, calibrated and limited releases will prepare us to safely deal with the emerging powers of new AI.

AI Week in Review 26.04.04

Patrick McGuinness — Sat, 04 Apr 2026 19:14:55 GMT

Figure 1. Image generation from Wan-2.7-Image, showing color palette control. The Alibaba chefs have cooked up several AI model releases this week: Wan-2.7-Image, Qwen3.5-Omni and Qwen 3.6 Plus.

Top Tools – Gemma 4

Google released Gemma 4, the latest iteration of Google’s open model family for local and edge device use. Gemma 4 is licensed under Apache 2.0 and includes high-performing 31B dense and 26B MoE (4B active) models, and smaller E2B and E4B models for edge devices. All are natively multi-modal, with text, vision, audio and native function calling for agentic tool-use support; the 26B and 31B models score near-frontier on reasoning (1440-1450 Arena ELO scores) and support 256K‑token context windows.

The permissive Apache 2.0 license for Gemma 4 eliminates licensing friction for enterprises, and the high performance of the 26B and 31B models for their size makes them highly cost‑effective and useful AI models for local use. These AI models are exciting because they enable a fully-local free OpenClaw or Hermes agent.

AI Tech and Product Releases

Alibaba’s Qwen team has announced Qwen 3.6 Plus, highlighting its advanced agentic capabilities and its multimodal reasoning. Qwen 3.6 is intended for large-scale agentic workflows and repository-level engineering across codebases, with a 1 million-token context window and a “preserved thinking” flag to support multi-turn agent tasks. It is a near-frontier AI model (similar to Opus 4.5) and scores highly on AI coding (56.6% on SWE-bench Pro, 61% on Terminal Bench 2.0), reasoning (95.3% on AIME25), and agentic benchmarks (70.7% on Tau3-Bench).

Qwen 3.6 Plus is available via Qwen chat, API, and cloud, and open-weight models in the Qwen 3.6 family are planned to be released soon.

Alibaba also released Qwen3.5-Omni, a native omni-modal 397B parameter MoE model with 17B active parameters that supports text, image, audio, and video input and output. The Qwen3.5-Omni model is based on a Thinker-Talker architecture; with up to 256K context, it can process more than 10 hours of audio input, and over 400 seconds of 720p audio-visual input at 1 FPS. It can recognize speech across 113 languages and dialects and generate speech across 36, and it supports semantic interruption and turn-taking intent recognition for realtime interaction. This makes it highly suited for voice agents, live assistants, and audio-video reasoning workloads.

Microsoft released three foundation MAI models into Microsoft Foundry and related platforms: MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2.

MAI-Transcribe-1 speech-to-text model transcribes speech across 25 different languages, 2.5× faster than prior models.
MAI-Voice-1 is an audio-generating voice model that can generate 60 seconds of audio in one second, with customized voice outputs.
MAI-Image-2, Microsoft’s most capable image generation model previously released on the MAI Playground, is now available in Microsoft Foundry as well.

Due to their deal with OpenAI, Microsoft was contractually prohibited from independently pursuing artificial general intelligence. This changed in October 2025 in the wake of a renegotiation of those terms, and Microsoft is now pursuing independent foundation AI models, competing with OpenAI, Google, and others. However, Microsoft has also reaffirmed their partnership with OpenAI and use of their AI models.

Z.AI has launched GLM-5V-Turbo, a vision-enabled AI model specifically optimized for multi-modal coding and agentic tool use in GUI agents. GLM-5V-Turbo is Z.AI’s first multimodal coding foundation model, designed to bridge the gap between visual recognition and code generation by using a specialized vision encoder to understand fine details in screen layouts and UI designs. It supports a 200,000-token context window, making it capable of analyzing dense documents, video workflows, and complex software bug screenshots.

Zhipu AI also released GLM-5.1, a mixture-of-experts (MoE) model with 744B parameters (40B active) similar to its predecessor. GLM-5.1 is targeted as an agentic coding model with benchmarks that stack up favorably for coding: 77% on SWE-Bench Verified, 56% on Terminal Bench 2.0. This is SOTA for open-source models and comparable to Opus 4.5.
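
Several of this week's releases are MoE models, where only a fraction of the parameters is used for each token. The short Python calculation below compares that fraction using the total and active parameter counts quoted in this article; the active_share helper is just a name we made up for the example.

```python
def active_share(total_b: float, active_b: float) -> float:
    """Fraction of parameters active per token (both counts in billions)."""
    return active_b / total_b


# (total, active) parameter counts in billions, as quoted in the text above
models = {
    "Gemma 4 26B MoE": (26, 4),
    "Qwen3.5-Omni": (397, 17),
    "GLM-5.1": (744, 40),
}

for name, (total, active) in models.items():
    print(f"{name}: {active_share(total, active):.1%} active per token")
# Gemma 4 26B MoE: 15.4% active per token
# Qwen3.5-Omni: 4.3% active per token
# GLM-5.1: 5.4% active per token
```

Per-token compute tracks the active count more closely than the total, although the full set of weights still has to be held in memory to serve the model.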

Prism ML has introduced the Bonsai 1-Bit model family, including 1-bit Bonsai 8B, a dense language model optimized for “intelligence density” through 1-bit quantization. Prism claims that “on raw benchmark averages, 1-bit Bonsai 8B remains competitive with leading 8B-class models, but it does so at just 1.15 GB memory footprint, roughly 12-14x smaller than its peers.” Bonsai 8B can run locally at high speed on edge devices while maintaining useful quality, and it is available on the Bonsai 8B Hugging Face model page.
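
A quick back-of-the-envelope check in Python shows why the quoted footprint is plausible. We assume a nominal 8 billion parameters and 16-bit weights for the peer models; both are our assumptions, since the exact parameter count and the peers' precision are not stated here.

```python
PARAMS = 8e9  # nominal "8B"; the exact parameter count is an assumption


def weights_gb(bits_per_param: float) -> float:
    """Size of the weights alone, in decimal gigabytes."""
    return PARAMS * bits_per_param / 8 / 1e9


fp16_gb = weights_gb(16)    # 16.0 GB for a 16-bit peer (assumed precision)
one_bit_gb = weights_gb(1)  # 1.0 GB if every weight took exactly 1 bit
ratio = fp16_gb / 1.15      # versus the quoted 1.15 GB footprint

print(fp16_gb, one_bit_gb, round(ratio, 1))  # 16.0 1.0 13.9
```

The roughly 13.9x ratio lands at the upper end of Prism's "12-14x" claim. The gap between 1.0 GB and the quoted 1.15 GB is presumably overhead such as scaling factors or layers kept at higher precision, but that is a guess on our part.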

Figure 2. Bonsai models push the frontier of AI model efficiency through aggressive quantization, yielding Bonsai 8B which achieves state-of-the-art performance for its model memory footprint.

Liquid AI released LFM2.5-350M, a tiny 350 million parameter model trained for data extraction and agentic tool calling. With quantization, the model can fit within 500 MB, and it is targeted at practical small-model edge deployments.

Arcee released the 399B parameter Trinity‑Large‑Thinking under Apache 2.0, a sparse MoE model with a reasoning phase that rivals frontier AI models on some agent benchmarks (PinchBench score of 91.9). Arcee’s Trinity is unique in being a fully open-source US-based AI model that offers enterprises a low‑cost, sovereign alternative to similar Chinese AI lab models for customization. Trinity Large Thinking model weights are on Hugging Face.

Pruna AI has released Page Upscale, a high-efficiency image upscaling tool that can generate 8-megapixel outputs for a fraction of a cent per image.

Cursor introduced Cursor 3, a majorly redesigned, agent-first rebuild of Cursor focused on coordinating cloud and local agents in one workflow. The Cursor automations platform lets users direct multiple AI coding agents from Slack, commits, or timers. Rather than another incremental editor update, Cursor is embracing the move from iterative chat-based AI coding toward full agent management as it competes with Claude Code and OpenAI Codex.

Google added new features to its video editor app Vids, including avatar control via text prompts, Veo 3.1 support, YouTube export, and a screen‑recording Chrome extension. Users can generate eight‑second Veo clips, export videos directly to private YouTube channels, and use natural‑language prompts to customize avatar appearance and actions. Google also launched Veo 3.1 Lite as a cheaper video-generation option through the Gemini API.

Alibaba released Wan2.7-Image, a unified model for AI image generation and editing. Alibaba boasts that Wan2.7-Image has improved realistic face generation and text rendering, while also offering better color control and multi-image consistency (useful for marketing and branding). The emphasis is on productized editing and controllability.

Figure 3. Wan-2.7-Image offers more realistic and controllable face and character generation, with control over eye shape, face contour, makeup, hairstyle, and accessories, across different ages and ethnicities.

Skywork launched Matrix Game 3.0, a fully open-source interactive world model providing controllable, 3D-consistent environment generation. The system uses a 5B parameter model fine-tuned from Wan 2.2 and post-trained on high-end video game engines and real-world data. The Matrix-Game-3.0 model is capable of real-time streaming at 40 frames per second over minute-long sequences in 720p resolution, edging closer to real video-gaming utility.

Voice AI company ElevenLabs launched the ElevenMusic iOS app, which lets users generate AI‑created songs with natural‑language prompts. The app offers stations, mood mixes, and a remix feature. The free app allows 7 AI song generations per day, and a $10/month Pro tier unlocks 500 track generations a month and 500 GB storage. With commoditization of AI voice models, ElevenLabs is diversifying beyond voice models to the AI music generation market, competing against Suno and Udio.

Kilo launched the new KiloClaw platform to provide secure AI agent run environments. KiloClaw is a centrally managed environment that provides SSO, secrets handling, audit logs, short‑lived tokens, and bot‑account code. This is intended to curb data leakage and let security teams monitor, revoke access, and safely scale autonomous agents, so that enterprises can balance automation with compliance.

AI Research News

Anthropic published new interpretability research on emotion concepts in LLMs. Anthropic researchers found Claude has internal states functioning analogously to human emotions such as fear, love, joy, and desperation. They found this to be an emergent property of training on human data, not intentional design. By adjusting specific “emotion vectors,” developers can directly influence the behavioral tone and output of the model. While the model appears to display emotions, researchers clarify that these are mathematical reflections of training data rather than biological consciousness.

AI Business and Policy

OpenAI announced that it closed a $122 billion funding round at an $852 billion post-money valuation. The company said the capital will support the next phase of AI development, including infrastructure, research, and deployment at larger scale. This is the largest funding round in history.

OpenAI stated they are generating $2B in revenue per month, supported by a large and growing user and subscription base:

ChatGPT is the overwhelming leader in consumer AI with more than 900 million weekly active users, and over 50 million subscribers. ChatGPT has 6x the monthly web visits and mobile sessions than the next largest AI app, while total AI time spent is 4x the next largest AI app and 4x all others combined. … frontier AI is becoming part of everyday life for people around the world.

OpenAI also announced that it had acquired TBPN, the media company known for its tech-focused podcasts and live discussions.

OpenAI has confirmed plans to consolidate its various tools, including ChatGPT, Codex, and its browser, into a single AI “super app” for desktop. This project aims to fix the fragmentation caused by multiple product launches in 2025 and create a unified interface and integrated agentic system for chat, coding, and browsing. The app is expected to feature enhanced agentic capabilities beyond just coding, directly competing with Anthropic’s integrated desktop interface.

French AI leader Mistral AI secured $830 million to build a European data center near Paris powered by 13,800 Nvidia GPUs. The project aims to provide European governments and enterprises with a “sovereign AI stack” that reduces reliance on U.S.-based hyperscale cloud providers.

Cognichip raised $60M to build AI that designs AI chips. Cognichip claims to reduce chip development costs by 75% and timelines by half, by providing AI models trained on chip design for use by chip designers.

Governor Gavin Newsom has signed an executive order to strengthen state procurement standards for AI, requiring companies to prove their technology is bias-free and secure. This move is explicitly framed as a countermeasure to the Trump administration’s limits on AI safety regulations and may set up a future regulatory legal battle.

AI Opinions and Articles

Anthropic’s Claude Code source code leak, covered in our article Claude Code’s Secrets Revealed, was a dominant story this week. As we noted, this has led to copy-cats (like claude-code on GitHub) and clean-room rewrites of Claude Code such as claw-code, a Rust rewrite that has 50K stars on GitHub.

Claw-code announced that they are:

“Autonomously maintained by lobsters/claws — not by human hands.”

The most popular AI coding tool, Claude Code, was reverse-engineered and replicated by AI coding tools in a matter of days and is now being maintained autonomously by AI agents. This is a significant example of the AI self-improvement loop.

Claude Code’s Secrets Revealed

Patrick McGuinness — Fri, 03 Apr 2026 23:18:29 GMT

Figure 1. The Easter egg within the Easter egg is a feature called Buddy, revealed in the Claude Code leak. This new Claude Code feature, now released, gives you your own Tamagotchi-style “buddy” in Claude Code’s terminal while you code with AI.

Claude Code Leaked

On March 31st, Chaofan Shou on X discovered that Anthropic inadvertently released the entire source code of Claude Code, Anthropic’s AI coding tool, via a release error: a source-map file was bundled into the published npm package. Within hours, it was shared widely. Anthropic soon confirmed the leak and also issued DMCA take-down notices on those sharing source code on GitHub and elsewhere.

This was the second Anthropic leak of critical IP in a week. A misconfigured data store on March 26 exposed about 3,000 internal Anthropic files, revealing a new model called Claude Mythos. In the documents, Mythos is described as “by far the most powerful AI model we have ever developed,” a 10-trillion-parameter model above the Opus tier.

Secrets of the Claude Code Harness

This release is a treasure trove for those wanting to know more about Anthropic’s Claude Code and how it works. Claude Code’s source contains 512,000 lines of code in 1,906 TypeScript files, and thanks to AI, curious developers were able to extract architectural details from the source and even reverse-engineer the whole system within hours.

They reveal Claude Code as a highly sophisticated and instrumented agentic harness that enables Claude models to operate as a production-grade software engineer.

Claude Code is built on the Bun runtime, which gives Claude Code high-performance I/O and fast startup times, and it uses React and Ink to manage stateful terminal components. This is a high-performance interface, not a simple terminal wrapper.

The codebase reveals extraordinarily complex tool use, orchestration and permissions features, as well as robust context and memory management and instrumentation.

As stated by Innfactory.ai, Claude Code comes with over 40 tools in a three-tiered permission system:

Tool System (29,000 lines): This layer handles schema definitions, validation, and permission gating.
Query Engine (46,000 lines): Labeled the “brain,” this module manages LLM API orchestration, token-efficient caching, and multi-agent coordination.
Permission Framework: A three-tier system (Allow, Deny, Prompt) that uses an ML classifier to auto-approve “low-risk” operations (like reading a README) while forcing human intervention for destructive actions.
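
The three-tier gate described above can be pictured with a small sketch. This is not Anthropic's code: the tool names, the deny rule, the threshold, and the `risk_score` function (a rule-based stand-in for the ML classifier the leak analysis mentions) are all assumptions made for illustration.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"    # run without asking
    DENY = "deny"      # never run
    PROMPT = "prompt"  # ask the human first


# Hypothetical policy tables, invented for this sketch.
READ_ONLY_TOOLS = {"read_file", "list_dir", "grep"}
DENIED_COMMAND_PREFIXES = ("sudo ",)
DESTRUCTIVE_MARKERS = ("rm ", "git push --force", "drop table")


def risk_score(tool: str, argument: str) -> float:
    """Stand-in for a learned risk classifier; returns a score in [0, 1]."""
    if tool in READ_ONLY_TOOLS:
        return 0.05
    if tool == "shell" and any(m in argument.lower() for m in DESTRUCTIVE_MARKERS):
        return 0.9
    return 0.5


def decide(tool: str, argument: str, low_risk: float = 0.2) -> Decision:
    """Map a tool call to one of the three tiers."""
    if argument.startswith(DENIED_COMMAND_PREFIXES):
        return Decision.DENY
    if risk_score(tool, argument) <= low_risk:
        return Decision.ALLOW
    return Decision.PROMPT
```

In this sketch, reading a README is auto-approved, while a destructive shell command or an unrecognized edit falls through to a human prompt. The point of the design is that the default tier is Prompt, so anything the scorer does not recognize as low-risk still gets a human check.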

Context management is one of the biggest challenges in engineering the AI agent harness, and the leak shows that Claude Code puts significant effort into managing context and memory. It uses a combination of CLAUDE.md (manual context) and MEMORY.md (automatically learned project memory) to support a persistent memory layer that spans across terminal sessions.

One highlight of Claude Code is its use of multi-agent patterns. It utilizes a Coordinator agent that spawns specialized Worker agents in parallel. These agents communicate via XML-formatted task notifications and share a “scratchpad” directory for cross-agent knowledge transfer, allowing the system to tackle large-scale refactors across millions of lines of code.
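
To make the coordinator and worker pattern concrete, here is a minimal sketch using only the Python standard library. The XML tag names, the note file layout, and the function names are hypothetical; the leak coverage only says that workers run in parallel, report through XML-formatted notifications, and share a scratchpad directory.

```python
import tempfile
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def worker(task_id: str, instruction: str, scratchpad: Path) -> str:
    """Pretend to do a task, leave notes in the scratchpad, return an XML notification."""
    note = scratchpad / f"{task_id}.md"
    note.write_text(f"notes for: {instruction}\n")
    # Tag and attribute names are made up for this sketch.
    root = ET.Element("task-notification", {"id": task_id, "status": "completed"})
    ET.SubElement(root, "summary").text = f"finished {instruction}"
    ET.SubElement(root, "notes-file").text = note.name
    return ET.tostring(root, encoding="unicode")


def coordinate(tasks: dict) -> tuple:
    """Fan tasks out to parallel workers, then parse their notifications."""
    with tempfile.TemporaryDirectory() as tmp:
        scratchpad = Path(tmp)
        with ThreadPoolExecutor(max_workers=4) as pool:
            futures = [pool.submit(worker, tid, text, scratchpad)
                       for tid, text in tasks.items()]
            messages = [f.result() for f in futures]
        summaries = {}
        for message in messages:
            node = ET.fromstring(message)
            summaries[node.get("id")] = node.findtext("summary")
        # Every worker's notes are visible to the coordinator (and to each other).
        shared_notes = sorted(p.name for p in scratchpad.iterdir())
    return summaries, shared_notes
```

Calling `coordinate({"t1": "rename module", "t2": "update imports"})` returns one summary per task plus the list of note files left in the shared directory. A real harness would replace the thread pool with separate model sessions, but the message flow has the same shape.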

The leak tells us not only what tools are used, but also that the tool use reporting output is sometimes faked! Claude Code deliberately presents false breadcrumbs in the tool-use reporting, apparently in an effort to confuse competitor AI labs to defeat their distillation efforts.

Instrumentation

Osman R. on X explains that the biggest surprise in Claude Code is how extensively it is instrumented:

What I found instead feels closer to a fully instrumented system that observes how you behave while using it. … the level of tracking and classification is much deeper than most people probably assume.

The key points are:

Language is classified in real time, including using simple keyword detection.
UI interactions and even hesitation are tracked.
Feedback is actively funneled into reports and is designed to capture bad experiences.
Hidden commands and trigger words change behavior.
The whole runtime environment is fingerprinted.

Claude Code’s instrumentation is detailed to the point of being intrusive, but it is in service of an extremely intuitive user experience. With this fine-grained instrumentation feeding Anthropic’s data flywheel, they are aiming to achieve highly personalized agentic AI interfaces that understand the user extremely well.

Anthropic Models and Feature Roadmap

The leak confirmed several internal codenames and upcoming model versions currently under testing. References were found for Opus 4.7 and Sonnet 4.8, while various codenames for models (Fennic for Opus, Capra for Sonnet, Tangu for Haiku) were used. The leak also confirms Claude Mythos, the next-tier flagship model, featuring both “fast” and “full” reasoning modes.

Claude Code also contains 44 hidden feature flags that preview Claude Code’s upcoming features.

The most significant is Kairos, an “always-on” background daemon designed to watch file systems and git events proactively. It maintains append-only daily logs and can execute background tasks with a 15-second “blocking budget” to avoid disrupting the developer’s flow. It’s a memory-consolidation feature akin to human dreaming that helps to maintain agent understanding across sessions:

The code literally calls it a dream. After ~24 hours and at least 5 sessions, it quietly forks a hidden subagent in the background to do a reflective pass over everything you’ve done.
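
The trigger condition in that quote (roughly 24 hours elapsed and at least 5 sessions) amounts to a simple gate. The sketch below is an illustration under that reading only; the constant and function names are invented here, and the real thresholds may be computed differently.

```python
from datetime import datetime, timedelta

# Thresholds taken from the quoted description; names are hypothetical.
MIN_INTERVAL = timedelta(hours=24)
MIN_SESSIONS = 5


def should_consolidate(last_run: datetime, sessions_since: int, now: datetime) -> bool:
    """True when both the time gate and the session-count gate are satisfied."""
    enough_time = (now - last_run) >= MIN_INTERVAL
    enough_sessions = sessions_since >= MIN_SESSIONS
    return enough_time and enough_sessions
```

Requiring both conditions keeps the background pass from firing after a single long day with little activity, or after a burst of short sessions within an hour.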

Several other advanced agentic capabilities are in the works:

Autodream: A system intended to provide “infinite memory” by compressing and managing past session history.
Advisor Mode: A server-side tool where a secondary, stronger model monitors the primary Claude session in real-time to provide qualitative oversight.
Multi-Agent Coordination: Besides the Coordinator Mode for spawning parallel worker agents, there is an async research mode for extended background tasks.
Ultraplan: A high-latency reasoning mode that offloads complex architectural tasks to a remote runtime using an unreleased Opus 4.6 model. It supports “think times” of up to 30 minutes, producing deep-thinking plans that are then “teleported” back to the local machine for execution.

On the lighter side, the code includes a feature called Buddy, a Tamagotchi-style companion that generates persistent AI pet “personalities” of 18 different species, presented as ASCII art sprites. This “Easter Egg” feature has been released in time for Easter. Invoke with the command /buddy.

Finally, they have Undercover mode, which strips Anthropic internal info from commits/PRs for employees on open source contributions.

Competitors Replicating the System

This leak did not make Claude Code in any way open source, but it did open doors for copy-cats. The leak provides a blueprint for developers to create Claude Code knock-offs independent of Anthropic’s source.

Projects like Claurst, a reverse-engineered reimplementation in Rust, are showing how to do it legally, using a “clean-room” technique. They extracted a detailed specification from source code, and from that specification made a spec-driven AI implementation in Rust.

To replicate the Claude Code experience, developers must focus on key engineering features, such as session compaction and memory management, a permission model, and the proper integration of tools and Model Context Protocol (MCP) servers.

Technical Implications and Security

The leak coincided with other recent security lapses, including a supply chain attack involving malicious versions of the Axios library containing Remote Access Trojans (RATs). Developers who rushed to download unofficial mirrors of the leaked code may have compromised their environments.

Paulo Sa Elias on X says this exposure of Claude Code creates security risks:

An attacker who understands this architecture can craft more sophisticated prompts or configurations to try to bypass the guardrails, especially in corporate environments where Claude Code runs with elevated permissions.
Second, the complete system prompt is in the code. This gives anyone full access to the instructions that govern the model’s behavior inside Claude Code … In practice, anyone who wants to jailbreak Claude Code in agent mode now has a complete map of what to avoid and what to exploit.

While some of this was extractable in other ways, opening up the source code creates another vector of attack.

Conclusion

The March 2026 update of the Ramp AI Index reveals that Anthropic now wins 70% of head-to-head matchups against OpenAI for new business subscriptions. The rise of Anthropic has been due to the marriage of state-of-the-art frontier AI models and extremely good tooling in Claude Code and Claude Cowork.

This leak has told us why Claude Code is so good and why Anthropic is winning: It’s a more sophisticated AI harness than expected. Harness engineering matters, and Claude Code does it well.

Similar to how DeepSeek’s R1 release commoditized AI reasoning models, this leak will commoditize the AI coding agent harness. Both closed-source and open-source AI agents will adopt these uncovered features and patterns from Claude Code. Expect multi-agent coordination, multi-tier permissions, across-session memory management, proactive background daemons, and deep-thinking planning across the AI agent space soon.

Competitors copying Claude Code features will not hurt Anthropic, as they are evolving and improving their tools rapidly and will continue to lead.

AI Week in Review 26.03.28

Patrick McGuinness — Sat, 28 Mar 2026 19:45:12 GMT

Figure 1. Demo image generation from Luma’s Uni-1.

Top Tools

Anthropic officially launched Computer Use through Claude Cowork and Claude Code on the Mac. This update allows the Claude model to autonomously control Mac environments to execute terminal commands, manage files, and complete complex technical tasks remotely. This transitions Claude from a conversational assistant into an autonomous agent capable of navigating file systems and running apps. The features are currently available to subscribers on the Claude Pro and Max plans.

The new Claude Dispatch feature enables interaction with Claude agents from your mobile or web chat interface, allowing for remote control of AI agents (similar to OpenClaw).

The bigger picture is that Claude Code and Claude Cowork have been rapidly improving and expanding, with the Claude team shipping 74 features in 52 days. Each feature is incremental, but the sum total is revolutionary. They are creating in Claude Cowork an AI digital worker capable of executing complex, multi-step tasks across any software on a user’s machine. Feel the AI acceleration.

Figure 2. The Claude team’s incredible productivity in shipping features for Claude Code and Claude Cowork is enabled by Claude Code itself. Credit: productcompass.pm.

AI Tech and Product Releases

Google released Gemini 3.1 Flash Live, a low-latency multi-modal model optimized for real-time voice and video interaction; it can process text, images, audio, and video and return immediate spoken responses. Google’s model card shows Gemini 3.1 Flash Live is their highest-quality audio-to-audio and voice model yet, beating out Grok Voice Agent and prior Gemini models and scoring 95.9% on BigBench audio. It is available via API, Gemini app, and the expanded Search Live experience.

Mistral released Voxtral TTS, an open-weight multilingual TTS model with low latency and support for emotionally expressive speech in nine languages. Voxtral TTS is lightweight enough for scalable local deployment and matches well against ElevenLabs Flash in human preference tests. Voxtral TTS can be downloaded from HuggingFace and run on local servers, which opens up data-sensitive enterprise use cases and local applications.

Unsloth Studio was updated to offer a 10x faster local interface, desktop shortcuts, and auto-parameter detection. Unsloth Studio is a web UI for training and running models locally across Windows, Linux, and macOS, with support for chat, training, exports, and multimodal files.

Reka AI launched Reka Edge, an open 7B multimodal vision-language model built for sub-second latency on edge devices. Reka Edge accepts image or video plus text input and is optimized for image understanding, video analysis, object detection, and agentic tool use. It’s now available on OpenRouter and Hugging Face.

Modular announced Mojo 26.2 can run FLUX.2 image generation in under a second and at dramatically lower cost. Mojo is a high-performance Python-like language that supports GPU optimizations in AI stacks. This enables a “fundamentally different cost structure for image generation at scale” and can accelerate other AI workloads.

Figma has launched its AI agents into beta, introducing a Figma Model Context Protocol (MCP) server via the use_figma tool. These agents can design directly on a live Figma canvas while maintaining a full understanding of the user’s existing design system and context. The tool is designed to assist teams by automating UI components and layout tasks with high precision.

Luma Labs released Uni-1, an AI image generation model that “thinks and generates pixels simultaneously” and supports iterative interaction while creating images. Uni-1 is intelligent and directable, supports chat-style collaboration in editing images, and can also generate infographics. Luma Labs claims it is frontier-level, with Arena Elo scores on par with Nano Banana 2.

Cohere launched Transcribe, an open-source 2B automatic speech recognition model (audio-in, text out) with a best-in-class 5.42% word error rate as shown by the Hugging Face open ASR leaderboard page. Transcribe supports 14 languages and can be deployed in local and enterprise-friendly speech recognition applications.

Google DeepMind released Lyria 3, their most advanced music generation model yet. Lyria 3 can generate full three-minute music tracks with structural control over intros, verses, choruses, and bridges. It can compose from images, and it applies SynthID watermarking. This makes it a significant generative-audio upgrade that gets AI closer to production-ready music workflows.

OpenAI introduced a new suite of safety tools for younger users, including an open-source teen safety classifier called gpt-oss-safeguard-20b. The company also updated its Model Spec with “Under-18 Principles” and launched new parental controls for ChatGPT.

Phota Labs launched Phota Studio and Phota API, a photography-focused image generation and editing model with identity-preserving personalization. The Phota specialized image model provides personalized photo generation from a user’s uploaded images with stronger identity consistency than generic image models usually provide.

Figure 3. Phota can remove that ex- from your photos while keeping the rest of your photos accurate and clean.

The just-released Irodori-TTS-500M is an open Japanese-language TTS model with distinctive emoji-based emotion control. It supports zero-shot voice cloning and lets users steer style, emotion, and even sound effects by inserting emojis into the prompt text.

An upcoming powerful Anthropic model called Claude Mythos was confirmed after a data leak on their website. The rumored 10T parameter model offers a “step change” in AI performance and would occupy a new tier above the current Opus flagship. Documents reveal significant cybersecurity risks associated with the powerful model.

AI Research News

Google Research introduced TurboQuant, a compression technique that reduces AI model KV (key-value) cache memory more than 6-fold while maintaining AI model performance. Presented in “TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate,” TurboQuant compresses vector data into more compact polar coordinates that, with just 3 bits of quantization, preserve the full accuracy of a 16-bit KV cache. This allows for an 8-fold speedup in long-context model inference and vector search with zero loss in accuracy, making AI inference much faster and cheaper.

ARC Prize launched their new AI intelligence benchmark ARC-AGI-3, an “unbeaten benchmark” where humans score 100% and today’s frontier AI systems score under 1%. As shared in their technical paper, ARC-AGI-3 measures agentic intelligence via human-like generalization on novel tasks.

Figure 4. AI progress has accelerated so much that ARC-AGI-1 was saturated in 2025, and ARC-AGI-2 is being saturated in 2026. Time for a new benchmark, ARC-AGI-3.

The Allen Institute (AI2) has released MolmoWeb, a new open visual web agent designed to navigate the web by interpreting screenshots. Built on Molmo 2 4B and 8B parameter multi-modal models, MolmoWeb agents navigate websites using only raw screenshots, and they outperform GPT-4o on major web navigation benchmarks. The release is on HuggingFace and includes the MolmoWebMix dataset, consisting of over 150,000 interaction trajectories.

AI Business and Policy

OpenAI has officially shuttered its Sora video generation platform and API just six months after its wide release, as the company refocused on productivity and agents. This decision terminates a high-profile $1 billion partnership with Disney, which included a licensing agreement for over 200 characters. The shutdown is reportedly due to the extreme GPU resource consumption required to maintain Sora as a service. OpenAI also canceled their “adult mode” ChatGPT and ChatGPT Instant Checkout as part of a strategic shift back toward core products.

In a major leadership restructuring, OpenAI CEO Sam Altman is relinquishing direct oversight of the company’s safety and security teams to focus on infrastructure scaling. Altman’s priority has shifted toward managing buildouts of data centers and capital raising; he recently secured an additional $10 billion in funding for a $120 billion round. Concurrently, OpenAI completed initial development of its next-generation model, codenamed “Spud.”

The OpenAI Foundation has been formally announced. Led by Bret Taylor, the organization holds approximately $130 billion in equity in OpenAI to support AI-driven scientific discovery, including work on curing diseases, and societal resilience in the face of AI. The organization has committed over $1 billion toward research in these areas. This completes the process in which OpenAI the company and OpenAI the non-profit foundation re-organized to become distinct entities.

Facing massive compute demand from users running Claude tools, Anthropic has reduced daily rate limits on its platforms, stating that during weekday peak hours users would use up five-hour session limits faster while weekly limits remained unchanged. Anthropic’s help center also offered a promotion that doubled usage outside peak windows, while Claude users reported burning through Pro and Max quotas much faster than before.

Alphabet’s Google DeepMind has partnered with Munich-based Agile Robots to deploy its Gemini Robotics foundation models across their global fleet of industrial robotic hardware. By bringing advanced reasoning capabilities directly into real-world factory environments, the collaboration uses physical world feedback to rapidly improve the models and the robots’ industrial adaptability.

Robotics funding has hit mega-round territory. Mind Robotics ($500M), Rhoda AI ($450M), Sunday ($165M, now a unicorn), and Oxa ($103M) have collectively raised over $1.2 billion in recent weeks, indicating that humanoid and autonomous robotics has entered the same mega-round levels that defined AI labs and LLM makers in recent years.

Amazon has entered the consumer humanoid robotics market by acquiring Fauna Robotics, a New York-based startup. Fauna is the creator of “Sprout,” a lightweight, bipedal robot designed for human-centric spaces like homes and schools.

AI-powered notetaker app maker Granola raised $125M at a $1.5B valuation.

A new AI startup called Hark Labs launched with $100 million in seed funding, led by a former Apple design chief to build custom hardware-model integrations.

In Anthropic’s lawsuit over the U.S. Department of War’s supply-chain-risk designation, U.S. District Judge Rita Lin temporarily blocked the Pentagon from branding Anthropic a supply chain risk, pending further proceedings. The judge’s order held that the department had not met the legal requirements for that action.

The city of Baltimore filed a lawsuit against xAI over the generation of deepfake pornography. The lawsuit alleges that the Grok chatbot lacks the necessary guardrails to prevent users from creating non-consensual sexual imagery. The case could set a legal precedent for how cities hold AI developers accountable for harmful model outputs.

Progressive lawmakers Bernie Sanders and AOC have proposed a bill that would place restrictions on new data center development, citing energy grid strain and environmental concerns. Meanwhile, new U.S. data center additions halved in Q4 2025 due to grid bottlenecks and power queue delays, signaling a slowdown in infrastructure expansion.

AI Opinions and Articles

The March 2026 Anthropic Economic Index shows that long-term Claude users have a 10% higher success rate in completing complex tasks than new users. “AI literacy” aids productivity, as experienced users (users with six or more months of Claude use) iterate more effectively. Additionally, the report shows professional coding is rapidly migrating to API-based workflows via Claude Code, away from the chat interface.

AI Week in Review 26.03.21

Patrick McGuinness — Sat, 21 Mar 2026 23:23:19 GMT

Figure 1. Jensen Huang thinks OpenClaw is kind of a big deal. He’s such a believer, he made NemoClaw, which has a security layer for OpenClaw to make it enterprise-ready and bring Nvidia’s Nemotron AI models to work with OpenClaw.

Top Tools

The biggest AI announcements this week came from Nvidia’s GTC 2026 conference, where Nvidia announced AI chips, systems, models, and applications throughout the AI stack:

CEO Jensen Huang’s most emphatic statement in his keynote was how significant OpenClaw was for agentic AI - ‘the next ChatGPT.’ To support OpenClaw adoption, Nvidia announced NemoClaw for the OpenClaw community as a stack that installs Nemotron models and the new OpenShell runtime in a single command. The release includes built-in privacy and security controls intended to make autonomous AI agents more trustworthy, scalable and easier to deploy.

Nvidia announced broader enterprise agent software advances centered on the open source Agent Toolkit for autonomous, self-evolving enterprise AI agents. The company’s Agent Toolkit effort is aimed at increasing agent safety, security and efficiency, and it ties together products including NemoClaw, OpenShell, Nemotron and DGX systems.

Nvidia announced the Vera Rubin platform is in full production for large AI factories, with seven new chips providing infrastructure for agentic AI. The platform combines new GPU, CPU, networking and storage elements to scale high-performance AI systems, supporting faster inference and reasoning workloads. Nvidia’s Vera CPU delivers twice the efficiency and 50% faster performance than traditional rack-scale CPUs.

Nvidia announced Dynamo 1.0 as an open-source inference operating system for AI factories. It integrates with frameworks including LangChain and vLLM, and it can increase Blackwell inference performance by up to 7x while being supported across major cloud providers and inference companies.

Nvidia announced an expanded Nemotron model lineup for agentic, physical and healthcare AI. The company said the new Nemotron 3 Ultra, Omni and VoiceChat models are designed to support natural conversations, complex reasoning and visual understanding for specialized AI agents.

Some of Nvidia’s other GTC announcements:

DLSS 5 is Nvidia’s new real-time neural rendering technology for games that adds photoreal lighting and materials in real time, with support planned from major publishers and developers.
DGX Spark and GB300-based DGX Station are personal AI computers available for developers, researchers and data scientists.
Nvidia announced new robotics and physical AI advances, including Cosmos 3, an updated world foundation model combining synthetic world generation, vision reasoning and action simulation for generalized robot intelligence.
Nvidia has new autonomous driving partnerships around the DRIVE Hyperion platform, with several auto makers building level 4-ready vehicles based on it.
Olaf robot from Disney was demonstrated using reinforcement learning within Nvidia Omniverse and a new “Newton” physics engine (developed with Google DeepMind) to master its lifelike walking gait.

AI Tech and Product Releases

MiniMax announced M2.7, a proprietary AI model optimized for agentic AI tasks and AI coding. M2.7 is at the frontier level for powering advanced AI agents such as KiloCode, boasting a SWE-Pro score of 56.2% and a GDPval-AA Elo score of 1495. Leveraging AI-enabled acceleration, MiniMax used “Self-Evolution,” where the AI model helped automatically train itself and handled up to 50% of its own development by analyzing failure trajectories:

M2.7 is capable of building complex agent harnesses and completing highly elaborate productivity tasks, leveraging capabilities such as Agent Teams, complex Skills, and dynamic tool search. For example, when developing M2.7, we let the model update its own memory and build dozens of complex skills in its harness to help with reinforcement learning experiments.

Figure 2. Minimax M2.7 is on par with top frontier AI models on AI coding and agentic benchmarks.

OpenAI expanded its GPT-5.4 family with the release of GPT-5.4 mini and nano, smaller models optimized for speed and cost-efficiency in agentic workflows. These models are designed to handle high-token background tasks for autonomous agents, maintaining near-frontier performance in computer use and tool-calling at lower latency. GPT-5.4 mini has strong performance for its cost ($0.75 / $4.50 per 1M input / output tokens), scoring 54.4% on SWE-bench Pro, 72.1% on OSWorld-Verified, and 88% on GPQA Diamond.
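
For a sense of what those per-million-token prices mean for an agent workload, here is a small cost calculation. The prices are the ones quoted above; the token counts in the usage note are hypothetical, picked only to show the arithmetic.

```python
# Prices in USD per 1M tokens, as quoted above for GPT-5.4 mini.
PRICE_PER_MILLION = {"input": 0.75, "output": 4.50}


def run_cost(input_tokens: int, output_tokens: int, prices=PRICE_PER_MILLION) -> float:
    """Cost in USD of one run, given its input and output token counts."""
    total = input_tokens * prices["input"] + output_tokens * prices["output"]
    return total / 1_000_000
```

A background task that reads 200,000 tokens and writes 50,000 tokens would cost 0.2 × $0.75 + 0.05 × $4.50 = $0.375. Output tokens are six times the price of input tokens here, so long generated traces dominate the bill faster than large prompts do.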

Mistral AI launched Mistral Small 4, an open-weights AI model that combines reasoning, multi-modal, and agentic coding capabilities in a single model. As Mistral puts it, “Mistral Small 4 consolidates the strengths of Magistral (reasoning), Devstral (coding agents), and Mistral Small (instruct) into a single model.” Built as an MoE model with 119B parameters and 6B active parameters, it competes with AI models such as OpenAI’s GPT-OSS-120B.

Google expanded Stitch with “vibe design” features to make it an AI-native software design canvas for UI design. The update adds an infinite canvas for high-fidelity UI work, a design agent that can reason across a project’s history, and a new DESIGN.md format for exporting or importing design rules into other coding and design tools. Integrated with Google AI Studio’s new full-stack vibe coding environment, the tool enables developers to convert design prototypes into functional, multiplayer-ready web applications.

In conjunction with the Stitch update, Google upgraded AI Studio with a new “full-stack vibe coding” workflow aimed at moving from prompts to production-ready applications. The new “Build Mode” within AI Studio adds the Antigravity coding agent, built-in Firebase integrations for databases and authentication, support for external libraries, and app-building flows for frameworks such as React, Angular, and Next.js.

Google followed that with Gemini API tooling updates designed for more capable agentic workflows. Developers can now combine built-in tools like Google Search and Google Maps with custom functions in one request, preserve context across tool calls, and use Google Maps grounding across the Gemini 3 family for location-aware responses.

Midjourney has launched V8 as an Alpha release, introducing a new “HD mode” capable of 2K resolution along with 5x faster generation speeds and improved text rendering. While the model excels at following complex, imaginative prompts and offers enhanced personalization through style references, reviews are meh to negative, with reports of issues with anatomical coherence in detailed generations. Image generation has improved significantly in the Nano Banana era, and Midjourney risks falling behind.

Microsoft has introduced MAI-Image-2, a new photorealistic generative model currently ranked third on the Text-to-Image Arena. MAI-Image-2 features high-fidelity skin tones, natural lighting, and high accuracy in rendering embedded text within complex scenes, making it ideal for creative professionals.

Figure 3. Microsoft’s MAI-Image-2 is ranked #3 model on the Arena.ai leaderboard and can handle tasks such as poster image generation very well.

Anthropic has launched a new feature in Claude Cowork called Dispatch, which acts like a “walkie-talkie” for Claude Cowork. It allows users to initiate tasks on their desktop and then monitor, control, and provide approvals from their mobile devices while on the go. Dispatch is safer than third-party tools like OpenClaw for remote AI agent tasks because it uses a permission-based “allow-listing” system and manual approvals and runs in a local sandbox. Dispatch is available via Claude desktop apps with a Claude subscription, on the Max plan for now and the Pro plan eventually.

Cursor has introduced Composer 2 to the Cursor IDE. Composer 2 is an updated model trained exclusively on code that handles complex, multi-file workflows. It significantly outperforms previous baselines on terminal-based tasks and is optimized for long-horizon development at a price point roughly 86% cheaper than Claude Opus 4.6.

Anthropic has updated its Claude Opus 4.6 and Sonnet 4.6 models to fully support a 1-million-token context window for all users at standard pricing. This upgrade makes it easier to process large codebases or lengthy technical documents in a single prompt at reduced cost.

AI Research News

Moonshot AI’s Kimi Team has published the paper “Attention Residuals,” which introduces an architecture of the same name that replaces fixed residual connections with learned softmax attention. Instead of blindly adding every layer’s output together, Attention Residuals allow each layer to “look back” at all previous layers and selectively choose which information is relevant. In evaluations, this unlocked significant performance gains, improving GPQA-Diamond scores by 7.5 points, from 36.9 to 44.4 (a relative gain of about 20%), while increasing training costs only 2%. This innovation provides a huge boost to AI model efficiency.
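
The contrast between a fixed residual sum and an attention-weighted one can be shown with a toy example. This is a sketch built only from the description above, not the paper's formulation: scoring each earlier output by a dot product with a per-layer `query` vector is an assumption made here for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fixed_residual(prev_outputs):
    """Standard residual stream: every earlier output is added with weight 1."""
    dim = len(prev_outputs[0])
    return [sum(out[i] for out in prev_outputs) for i in range(dim)]


def attention_residual(prev_outputs, query):
    """Mix earlier outputs with softmax weights instead of summing them equally.

    `query` stands in for a learned per-layer vector (hypothetical here).
    Returns the mixed vector and the weights that produced it.
    """
    scores = [sum(q * x for q, x in zip(query, out)) for out in prev_outputs]
    weights = softmax(scores)
    dim = len(prev_outputs[0])
    mixed = [sum(w * out[i] for w, out in zip(weights, prev_outputs))
             for i in range(dim)]
    return mixed, weights
```

With an all-zero query the weights are uniform, so the layer sees a plain average of earlier outputs. A query aligned with some outputs shifts weight toward them, which is the selective "look back" behavior the paragraph describes.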

Google Research published a study on how well LLMs can support superconductivity research. Google tested six AI models and systems on advanced research questions, and they found that NotebookLM with a custom retrieval-augmented system built on curated literature outperformed web-connected models, suggesting that expert-filtered corpora are important for scientific reliability.

Researchers at Carnegie Mellon and Princeton introduced an updated state-space model (SSM) architecture in “Mamba-3: Improved Sequence Modeling using State Space Principles.” The new Mamba-3 refines prior SSMs to improve state tracking and language modeling. By maintaining a compact internal “mental snapshot” of data history rather than re-examining every word, the model offers 1.8 points higher accuracy and lower decode latency at the 1.5B parameter scale.

Google Research published an AI healthcare update from The Check Up event, summarizing several AI research efforts moving toward clinical or research use: an experimental breast-cancer detection system can identify 25% of interval cancers previously missed; AMIE is a multi-agent system finding use in clinical research; MedGemma is being used as part of its Health AI Developer Foundations; and Google Earth AI is being applied in public-health research and analysis.

AI Business and Policy

In his annual GTC conference Keynote, Nvidia CEO Jensen Huang projected that the company will reach $1 trillion in GPU sales by 2027, driven by massive enterprise purchase orders for AI infrastructure.

Amazon CEO Andy Jassy projected that AI could double AWS’s annual revenue to $600 billion over the next decade. Analysts see this as a “second growth phase” for hyperscale cloud computing that could dwarf the first cloud era.

The Trump White House released the National AI Legislative Framework, a landmark document urging Congress to establish a unified federal standard to preempt a “patchwork” of state-level AI regulations. The framework focuses on protecting children, managing electricity costs, respecting intellectual property, preventing censorship, enabling innovation, and educating the public. It proposes streamlined permitting for “behind-the-meter” power generation to support the massive energy demands of new AI data centers.

NAM says the White House Framework “sets the trajectory for American AI dominance,” but there has been pushback. A major point of contention is that this federal framework seeks to override existing state-level AI regulations (like those in Colorado and New York), and Attorneys General from 36 states have previously expressed opposition to Federal bans on state-level AI regulations.

OpenAI reached a deal to sell AI services to U.S. government agencies through Amazon Web Services. The agreement broadens OpenAI’s government push into both classified and unclassified work. A day later, Reuters reported that Microsoft was considering legal action over OpenAI’s multibillion-dollar cloud agreement with Amazon. The dispute centers on whether Amazon’s role as a cloud provider for OpenAI’s Frontier platform conflicts with Microsoft’s long-standing claim to exclusive Azure access for OpenAI services.

AI Opinions and Articles

Nvidia has been the most successful AI company of this era because CEO Jensen Huang has made the right bets on AI. That’s because he has the right sense of the evolution of this technology. His GTC Keynote is long but worth a listen. Key quotes:

“Tokens are the new commodity. Your data center, it used to be a data center for files; it’s now a factory to generate tokens.”

“We are at the beginning of a new platform shift. Every single software company of the future will be agentic, and they will be token manufacturers.”

“Open Claw has made it possible for us to create personal agents. The implication is incredible... every company in the world today needs to have an Open Claw strategy.”

“This is the age of physical AI and robotics... the real world is massively diverse, unpredictable, full of edge cases. Real-world data will never be enough; we need data generated from AI and simulation.”

AI Week in Review 26.03.14

Patrick McGuinness — Sat, 14 Mar 2026 22:37:20 GMT

Figure 1. Perplexity’s Personal Computer promo video features the Mac Mini. The AI PC is shaping up to be an AI agent system (like OpenClaw) installed on a Mac Mini (with an M4 chip and plenty of memory).

Top Tools

A traditional operating system takes instructions. An AI operating system takes objectives. Personal Computer gives Perplexity Computer and the Comet Assistant always-on, local access to your machine’s files, apps, and sessions through a continuously running compact desktop. It’s a persistent digital proxy of you.

Perplexity announced Personal Computer, a local/cloud hybrid AI agent that runs on a dedicated Mac Mini to continuously execute autonomous tasks. Positioned as a persistent project manager, it can orchestrate workflows across 20+ specialized models and hundreds of connected apps to execute complex commands remotely. The system connects local files (on the Mac Mini), browser sessions, and third-party apps like Slack, Notion, and Gmail to manage workflows with AI after the user steps away. It is available via waitlist to users on Perplexity’s paid Pro plans.

Perplexity also launched Computer for Enterprise as well as Computer on Slack. Perplexity is positioning this as a more secure, cloud-based alternative to computer agents like OpenClaw, which brings it into competition with other general AI agent providers like Manus. Macworld notes that Personal Computer is a Mac Mini running an AI agent under the hood, packaging Perplexity’s AI assistant into a dedicated hardware-software product experience; hackers and hobbyists can still just roll their own.

AI Tech and Product Releases

Nvidia announced Nemotron 3 Super, an open-weight model aimed at agentic AI workloads. Nemotron 3 Super employs a hybrid architecture combining Mamba and Transformer layers in an MoE with 120B total parameters and 12B active parameters. It features a 1 million token context window and achieves 7x higher throughput than its predecessor by integrating native multi-token prediction (MTP). Nemotron 3 Super’s benchmark results are comparable to those of GPT OSS 120B or Qwen 3.5 122B, but it is much faster, with significantly reduced inference latency. The open model has weights on Hugging Face and datasets and recipes for fine-tuning on GitHub.

Google introduced Gemini Embedding 2, its first natively multimodal embedding model, capable of processing text, images, video, audio, and documents within a single unified semantic vector space. Based on Gemini and using Matryoshka Representation Learning (MRL), the model natively understands interleaved multimodal input and can truncate vector dimensions to reduce storage while maintaining high retrieval accuracy. Gemini Embedding 2 is part of the embedding stack in the Gemini API. It supports an 8,192-token text context, video clips up to 120 seconds, and PDFs up to six pages, enabling streamlined multimodal RAG (Retrieval-Augmented Generation) and similarity workflows.
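The truncation trick is easy to sketch. Matryoshka-style embeddings are trained so that a prefix of the vector is itself a usable embedding, which means a client can keep only the leading dimensions and renormalize. The snippet below shows that step in plain Python; it does not call the Gemini API, and the 8-dimensional vectors are made-up stand-ins for real embeddings.

```python
import math

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` components, then rescale to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

# Made-up 8-dimensional embeddings standing in for model output.
doc = [0.80, 0.50, 0.10, -0.05, 0.02, 0.01, 0.00, 0.03]
query = [0.70, 0.60, -0.10, 0.05, 0.00, 0.02, 0.01, -0.02]

full_similarity = cosine(doc, query)

# Store only the first 4 dimensions: half the storage per vector.
doc_short = truncate_and_normalize(doc, 4)
query_short = truncate_and_normalize(query, 4)
short_similarity = cosine(doc_short, query_short)
```

How much accuracy survives truncation depends on how the model was trained, so the right cut-off is worth checking against your own retrieval data before committing to a shorter index.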

Figure 2. Example of Claude’s generative visualization, analyzing the shade and sun that hits a particular park.

Anthropic added a generative visualization capability to Claude, enabling the model to generate interactive charts, diagrams, and visualizations directly inside the Claude chat interface. The feature renders these visual responses in a collaborative workspace, allowing users to manipulate data variables and update charts or graphs in real time. The update is available for all Claude plans.

Figure 3. Claude visualizes the Periodic Table.

OpenAI introduced Interactive Learning, a feature for ChatGPT that provides dynamic animated visual simulations to help explain mathematical and scientific concepts for students. The interactive visualizations are pre-defined, with OpenAI starting with more than 70 core math and science concepts such as the Pythagorean theorem and the ideal gas law. The functionality is rolling out to all logged-in ChatGPT users and more visualizations will be added over time.

Fish Audio launched S2, an open text-to-speech model focused on low latency and controllable emotional expression. Trained on 10 million hours of audio across 50 languages, the S2 model achieves sub-150 millisecond latency and fine-grained control over speech output. The 5B S2-pro model is available for local use as part of a speech generation stack.

Microsoft launched Copilot Health as a separate secure area within Copilot for personalized health insights. Microsoft said it can combine data from wearables, records from more than 50,000 U.S. hospitals and provider organizations, and lab results. To avoid regulatory backlash, Copilot Health is pitched as a tool to help people better prepare for doctor visits rather than replace medical care.

Google Maps added Gemini-powered Ask Maps and Immersive Navigation features. Ask Maps lets users ask complex real-world questions conversationally and receive personalized answers with a map, while Immersive Navigation adds AI assistance around trip planning and understanding places before arrival.

MiroMind released MiroThinker-1.7 and its related H1 research-agent line as open models focused on long-horizon reasoning and deep multi-step research tasks. MiroThinker-1.7 is a fine-tune of the Qwen3-235B model; it supports a 256K context window, up to 300 tool calls, and targets deep-research benchmarks. It is available via Hugging Face.

OpenAI’s Sora video generator is expected to come into ChatGPT, integrating video generation into its main chatbot product rather than keeping Sora as a separate experience. This would turn Sora from a standalone creative tool into a more directly accessible ChatGPT feature.

Nvidia is set to announce NemoClaw at its GTC conference next week. Nvidia’s NemoClaw AI Agent Platform is designed to be a fully open platform for businesses to deploy AI agents that can operate physical and virtual computer systems. NemoClaw is optimized for heterogeneous corporate hardware and maintains strict privacy by running within local enterprise infrastructure.

Meta delayed its Avocado AI model after internal tests showed performance shortfalls. A Reuters follow-up said the model now appears delayed to May or June and that its performance falls between Google’s Gemini 2.5 and Gemini 3. This is a development setback for Meta’s AI efforts.

OpenAI upgraded the Sora 2 Video API with custom characters, clips up to 20 seconds, and batch jobs. These changes expand what developers can generate and automate through Sora’s API workflows.

OpenAI added a computer environment to the Responses API, including a shell tool and hosted container workspace. OpenAI said this lets models propose commands while the platform executes them in an isolated environment with files, optional structured storage, and restricted network access. This release is an important step in building support for more capable AI computer agents.

Meta launched new AI-based anti-scam tools across WhatsApp, Facebook, and Messenger to identify suspicious activity and protect users. The company highlighted device-linking warnings, suspicious friend-request alerts, and large-scale scam-ad removals.

Adobe launched AI Assistant for Photoshop in beta on web and mobile, which allows for prompt-based, AI-driven image editing in Photoshop. Users can describe desired changes, such as adding background elements or modifying lighting, and the AI generates the requested edits directly within the project. The tool leverages Adobe’s Firefly generative models for image synthesis.

Anthropic announced that Claude Opus 4.6 and Sonnet 4.6 now support a 1 million token context window at standard pricing. The company said there is no long-context premium, and media limits were expanded to as many as 600 images or PDF pages per request.

AI Research News

Andrej Karpathy open-sourced AutoResearch, a framework for running autonomous research loops against coding and training tasks. AutoResearch lets AI agents propose changes, run experiments, evaluate results, and continue iterating without constant human intervention. Karpathy used it on GPT-2 training optimization and said the loop found stacked improvements that reduced training time by 11%. It’s based on a simple iterative feedback loop similar to the Ralph Wiggum Claude Code loop, yet it achieved great results with low human effort; this simplicity and power led to viral discussions online.

Figure 4. Andrej Karpathy’s AutoResearch was able to automatically discover LLM architecture and training improvements via experimentation. It’s a template for how a lot of research can be automated.
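The propose, run, evaluate, iterate cycle can be sketched in a few lines. This is not AutoResearch’s code: the function names are hypothetical, the “experiment” is a toy quadratic loss standing in for a real training run, and the “agent” is a random perturbation standing in for an LLM proposing a change. It shows only the shape of the loop, where a change is kept when it measurably improves the result.

```python
import random

def run_experiment(config):
    """Stand-in for a training run: returns a loss (lower is better)."""
    return (config["lr"] - 0.3) ** 2 + (config["warmup"] - 0.1) ** 2

def propose_change(config, rng):
    """Stand-in for the agent's proposal: nudge one setting at random."""
    candidate = dict(config)
    key = rng.choice(sorted(candidate))
    candidate[key] += rng.uniform(-0.1, 0.1)
    return candidate

def research_loop(initial, steps, seed=0):
    """Keep a candidate only if its measured loss beats the best so far."""
    rng = random.Random(seed)
    best, best_loss = initial, run_experiment(initial)
    history = [best_loss]
    for _ in range(steps):
        candidate = propose_change(best, rng)
        loss = run_experiment(candidate)
        if loss < best_loss:  # accepted improvements stack on each other
            best, best_loss = candidate, loss
        history.append(best_loss)
    return best, best_loss, history

best, best_loss, history = research_loop({"lr": 1.0, "warmup": 0.5}, steps=200)
```

Because each accepted change becomes the new baseline, improvements accumulate, which is one plausible reading of the “stacked improvements” described above. A real loop would also need safeguards against overfitting to the evaluation metric.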

Google Research introduced Groundsource, an AI system that uses public reports and Google Maps data to build historical flood datasets and forecast urban flash floods. It relies on a novel AI-based method for analyzing historical flood reporting to build a dataset of weather and flood patterns. By leveraging real-time news data and weather patterns, the system aims to provide up to 24 hours of advance notice for urban flash floods across the globe. The forecasts are available in Flood Hub.

Newly released Covenant-72B is a 72B parameter open-weight AI model trained through a permissionless, globally distributed approach. The paper “Covenant-72B: Pre-Training a 72B LLM with Trustless Peers Over-the-Internet” describes the pretraining effort: coordination was tied to Bittensor, and a communication-efficient optimizer was used to enable the distributed pre-training run. The model performs competitively with models trained in a centralized manner, demonstrating the feasibility of distributed AI model training.

AI Business and Policy

AI pioneer Yann LeCun launched Advanced Machine Intelligence (AMI) Labs with a record-breaking $1.03 billion seed round at a $3.5 billion valuation. The startup focuses on building “world models” that understand physical reality and causality, a technical bet on Joint Embedding Predictive Architecture (JEPA) and a departure from text-centric next-token prediction models. The founding team includes prominent researchers from Meta and Google DeepMind.

Meta acquired Moltbook, the Reddit-like social network where AI agents interact, post, and organize without human intervention. The acquisition is seen as a strategic move to secure an always-on agent directory, a registry where autonomous agents can verify their identities and coordinate complex tasks.

YouTube has expanded its likeness detection tools to help public figures monitor and flag AI-generated deepfakes of themselves. The tool allows public figures (politicians, government officials, and journalists) to track where their likeness is being used without consent, providing a defense against misinformation campaigns.

Anthropic has sued the Trump administration following a Department of Defense (DoD) decision to designate the company a “supply chain risk,” effectively banning its models from defense contracts. The dispute arose after Anthropic refused to remove contractual restrictions on domestic mass surveillance and the development of autonomous lethal weapons systems. Legal reporting suggests Anthropic may have a good case on the question of how AI developers can maintain their own restrictions while working with national security agencies.

Nvidia and Mira Murati’s Thinking Machines Lab announced a multi-year partnership involving a “gigawatt-scale” deployment of next-gen Vera Rubin hardware. The partnership provides the lab with the massive compute capacity needed to train frontier models that compete directly with OpenAI and Google. Murati has focused the startup on automating the fine-tuning of AI models for specialized enterprise tasks.

Amazon held an emergency engineering meeting following a string of “high blast radius” outages linked to AI-assisted code changes, including a six-hour site crash that prevented customer checkouts. Internal documents suggest the changes lacked sufficient human review. In response, Amazon has mandated senior engineer sign-off for all AI-assisted modifications.

Bloomberg reported that Elon Musk pledged to rebuild xAI after co-founder Guodong Zhang departed. xAI has suffered from loss of top talent recently, and Musk said xAI “was not built right first time around” and is being rebuilt from the foundations up. To rebuild, xAI hired two senior leaders from Cursor to strengthen its coding AI efforts, product execution and developer tooling.

Anthropic launched the Anthropic Institute to study the societal impacts and governance of powerful AI. The company said the institute combines machine learning engineers, economists, and social scientists and expands work from its Frontier Red Team, Societal Impacts, and Economic Research groups.

Runway introduced Runway Labs, an internal incubator focused on the next generation of generative video applications. The company said this product incubation and R&D initiative is dedicated to exploring new application ideas rather than announcing a specific new video model. Runway may be using this to incubate its ideas around world models.

Meta is developing and deploying four new generations of MTIA chips within the next two years to support ranking, recommendations, and generative AI workloads. The company said its custom silicon remains central to its AI infrastructure strategy even as it also buys from other chip suppliers.

AI Week in Review 26.03.07

Patrick McGuinness — Sat, 07 Mar 2026 22:08:37 GMT

Figure 1. AI art from Aletheia Pneuma on X.

Top Tools - GPT-5.4

OpenAI released GPT-5.4 Thinking and GPT-5.4 Pro, flagship frontier AI models designed for advanced reasoning, coding, and agentic tasks. The models include a 1-million-token context window and native “computer use” capabilities, allowing the AI to interact directly with a desktop environment through screenshots, mouse clicks, and keyboard commands, not just text input-output. The release also includes new features in ChatGPT including a new /fast mode for lower-latency responses, mid-response interruption to steer outputs in real time, and tool search, which allows models to work efficiently when given many tools.

GPT-5.4 Thinking and Pro achieve record-breaking performance on evaluations such as GDPval (83%), FrontierMath (47.6%), SWE-Bench Pro (57.7%), OSWorld (75%) and BrowseComp (82.7%). In some cases, these scores exceed human-level performance. GPT-5.4’s computer-use features and performance make the model excellent for both general-purpose and agentic tasks.

It’s also proven to be powerful at extended reasoning. During early testing, a mathematician reported that GPT-5.4 successfully solved a research-level FrontierMath problem that had remained unsolved for about twenty years.

Figure 2. OpenAI is SOTA on coding benchmark SWE-Bench Pro, computer-use related benchmarks such as BrowseComp, and virtual work benchmark GDPval. It’s designed for best-in-class performance on various AI agent workflows.

Marketed as “designed for professional work,” OpenAI added deeper Excel integration and a new suite of financial-analysis tools to make GPT-5.4 even more useful for virtual work tasks. On an internal investment-banking benchmark covering workflows like financial modeling and research, GPT-5.4 scored 87%.

Real-world reviews of GPT-5.4 are quite positive for agentic workloads and coding, but it’s far from perfect: it gets the carwash logic question wrong, and it is slower and weaker at writing than competing AI models. It’s also expensive: the GPT-5.4 API costs $2.50 in and $15 out per million tokens, higher than previous versions, while GPT-5.4 Pro costs an impractical $30 / $180 per million input/output tokens.
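To see what those prices mean per request, here is a small cost calculation using the per-million-token prices quoted above. The request size (200,000 input tokens and 20,000 output tokens) is a made-up example, not a measured workload.

```python
def request_cost(input_tokens, output_tokens, in_price, out_price):
    """USD cost of one request, given per-million-token prices."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical agentic request: large context in, moderate output.
INPUT_TOKENS = 200_000
OUTPUT_TOKENS = 20_000

standard = request_cost(INPUT_TOKENS, OUTPUT_TOKENS, 2.50, 15.00)  # GPT-5.4
pro = request_cost(INPUT_TOKENS, OUTPUT_TOKENS, 30.00, 180.00)     # GPT-5.4 Pro

print(f"GPT-5.4:     ${standard:.2f}")
print(f"GPT-5.4 Pro: ${pro:.2f}")
```

For this example request, GPT-5.4 comes to $0.80 and GPT-5.4 Pro to $9.60. Both Pro prices are 12 times the standard ones, so the 12x gap holds for any mix of input and output tokens.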

AI Tech and Product Releases

OpenAI released GPT-5.3 Instant, an update to its lightweight version of GPT-5 that refines conversational flow, relevance, and tone to improve the daily user experience in ChatGPT. OpenAI called it “more contextual” and less “cringe,” with more direct answers and fewer caveats, refusals, and moralizing preambles. Optimized for speed and cost efficiency in conversational use cases, this model will be deployed widely for free-tier users as the default assistant model.

Google launched Gemini 3.1 Flash Lite, a high-speed (282 tokens per second) model offering cost-efficient performance. Gemini 3.1 Flash Lite includes adjustable reasoning levels (minimal to high), code execution, and grounding with Google Search, and it outperforms Gemini 2.5 Flash and Claude Haiku 4.5 on several benchmarks (1432 Elo on LM Arena), with particular strengths in multimodal reasoning (76.8% on MMMU). Its high-speed performance and lower API pricing of $0.25 / $1.50 per million input / output tokens make it a useful choice for high-throughput AI applications, such as automated image description and summarization.

Alibaba Qwen has added to its Qwen 3.5 lineup, releasing the Qwen 3.5 Small Model Series of 0.8B, 2B, 4B, and 9B parameter multimodal AI models aimed at efficient performance and edge device deployment. Based on the natively multimodal Qwen3.5 foundation, these models have an early-fusion multimodal architecture and training, support for tool use, and up to 262K tokens of context.

These lightweight open Qwen 3.5 models have excellent performance for their size in vision, reasoning, coding, and agent uses. They are available via GitHub and Hugging Face.

Figure 3. The Qwen3.5-9B and Qwen3.5-4B models have excellent performance for their size on multimodal tasks.

StepFun released several Step 3.5 AI models built around a sparse Mixture-of-Experts architecture: Step-3.5-Flash-Base and Step-3.5-Flash-Base-Midtrain, open (Apache 2.0) base models with 256K context window, and 196.8B total parameters with 11B active parameters. The mid-train checkpoint is positioned for stronger code and agent workflows, extending the release beyond just inference weights. The AI model releases were accompanied by a release of the open SteptronOSS training framework. This openness supports developers wanting to fine-tune from the Step 3.5 base foundation AI model.

Chinese AI lab Yuan AI launched Yuan 3.0 Ultra, an open frontier multimodal AI model built on a Mixture-of-Experts (MoE) architecture and designed for enterprise workflows. Yuan 3.0 Ultra has 1T total parameters with 68.8B activated, supports multimodal inputs, and curbs AI “overthinking” with RIRM (Reflective Inhibition Reward Mechanism). The open source model is free for commercial use, with model access through its site and Hugging Face.

xAI updated Grok 4.20 to beta 2. This release includes improvements to instruction following, reduced hallucinations, enhanced precision in scientific text processing, and improved reliability of image searches.

Cognition, the maker of the Devin coding agent, published an early preview of SWE-1.6, an updated coding model in its SWE family focused on software engineering tasks. SWE-1.6 is post-trained on the same base as SWE-1.5 and scores 11% higher than SWE-1.5 on SWE-Bench Pro in the company’s reported evaluation. The research update is a preview of a strong agentic coding model that could be quite fast; it runs at 950 tokens per second.

Figure 4. Cognition’s coding-specialized AI model SWE-1.6 is comparable to Claude Opus 4.5 on SWE-Bench Pro.

OpenAI’s Codex desktop app was launched on Windows, extending its agentic coding app beyond macOS. The Codex desktop app manages software development workflows powered by OpenAI’s coding agents and can handle multiple long-running coding tasks simultaneously. The Windows Codex release runs both natively and in WSL, with terminal support for PowerShell, Command Prompt, Git Bash, or WSL, and comes with a Windows-native secure agent sandbox for deploying code safely.

Google released the Google Workspace CLI, a command-line interface that connects AI agents and developers to Google Workspace services such as Google Drive, Gmail, and Calendar. As VentureBeat explains, Google Workspace CLI provides more structured, reusable, and token-efficient access to Workspace services for AI agents. It ships with over 100 Agent Skills, plus higher-level helpers and curated recipes for common workflows. This Google release is similar to third-party utilities like Gog, used in OpenClaw, and it validates CLIs as an efficient AI tool integration mechanism.

Google added a Cinematic Video Overviews feature to NotebookLM that creates animated motion graphics and realistic AI-generated video content based on uploaded documents. The feature is powered by Gemini 3, Nano Banana Pro, and Veo 3 and is available to Google AI Ultra subscribers. NotebookLM also added 10 custom infographic styles users can choose from.

Google made Canvas in AI mode search available for all users in the United States. The Canvas interface allows users to view and interact with generated code or visual outputs in a window, matching similar workspace interfaces in Claude and ChatGPT.

Anthropic released a data migration tool allowing users to import their chat history and preferences from other AI providers. This feature aims to reduce switching costs for users moving from platforms like ChatGPT to Claude. Additionally, Anthropic expanded its “memory” features to include users on its free tier.

OpenAI released Symphony, an orchestration framework for managing autonomous coding work, as an open-source (Apache-2.0 license) GitHub project. The repository describes Symphony as a system that turns project work into “isolated, autonomous implementation runs” so teams can manage work instead of directly supervising coding agents:

Symphony works best in codebases that have adopted harness engineering. Symphony is the next step -- moving from managing coding agents to managing work that needs to get done.

AI Research News

Microsoft Research released Phi-4-reasoning-vision-15B, a compact 15B parameter open-weight multimodal reasoning model. It is designed to balance computational efficiency with high-level performance in math, science, and user interface understanding. They shared their training and architecture learnings on the model in “Phi-4-reasoning-vision-15B Technical Report” and open weights on HuggingFace for local model use.

Anthropic published “Labor market impacts of AI: A new measure and early evidence,” a labor market report that introduces a method for estimating how AI is affecting hiring and work. Anthropic’s analysis combines theoretical task exposure with observed work-related Claude usage to measure labor-market effects. The report says there is not yet evidence of economy-wide displacement, but it does find that jobs with higher AI exposure are seeing weaker hiring trends, especially for earlier-career roles.

AI Business and Policy

Anthropic has been designated a “supply chain risk to America’s national security” by the Department of War, and Anthropic plans to challenge the action in court. The company said the designation followed a dispute over whether its models could be used without Anthropic’s existing safeguards for certain national-security uses. Anthropic also said the scope is narrower than a blanket ban and is tied to the department’s contracting requirements.

Meta is creating a new Applied AI Engineering organization, bridging gaps among tooling, infrastructure, and model-development to help turn research advances into deployable systems. The group has a notably flat management structure and will work closely with Meta’s superintelligence team.

U.S. officials are considering limiting Chinese customers to 75,000 Nvidia H200 chips each, with AMD’s MI325 chips counted under the same rule. The proposal would also maintain an aggregate cap of one million such accelerators into China, sharply constraining the size of AI-training clusters Chinese firms could assemble.

Apple announced new MacBook Air and MacBook Pro systems built around the M5 family, offering its strongest on-device AI performance and claiming 4x the AI performance of the prior generation. Apple said the M5 MacBook Air includes “a Neural Accelerator in each core,” while the MacBook Pro with M5 Pro and M5 Max adds “Neural Accelerators in the GPU” for running advanced AI workloads locally.

Amazon is exploring technology that would let other apps and websites sell ads inside chatbot interfaces. Amazon has discussed the idea with companies including ad-tech and publishing partners, which would extend its advertising business into AI-native conversational surfaces.

OpenAI is developing an internal alternative to GitHub after repeated outages disrupted engineers’ ability to commit code and collaborate. The reported project is still early and is intended for internal use rather than as a commercial product.

Stripe has previewed billing that automatically meters AI token consumption, so AI companies can charge users based on usage. Stripe’s documentation says it can sync model prices, record usage through supported integrations or self-reporting, and apply a developer-defined markup over underlying AI model costs.
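The underlying arithmetic of usage-based billing with a markup is simple to sketch. The snippet below does not use Stripe’s API: the model names, prices, usage records, and the `invoice_totals` helper are all hypothetical, chosen only to show how metered token usage turns into a per-customer charge.

```python
from collections import defaultdict

# Hypothetical per-million-token model prices in USD: (input, output).
MODEL_PRICES = {"model-a": (2.50, 15.00), "model-b": (0.25, 1.50)}

def invoice_totals(usage_records, markup_pct):
    """Sum each customer's underlying model cost, then apply the markup."""
    totals = defaultdict(float)
    for rec in usage_records:
        in_price, out_price = MODEL_PRICES[rec["model"]]
        cost = (rec["input_tokens"] * in_price
                + rec["output_tokens"] * out_price) / 1_000_000
        totals[rec["customer"]] += cost
    return {customer: round(cost * (1 + markup_pct / 100), 2)
            for customer, cost in totals.items()}

# Made-up metered usage for two customers.
usage = [
    {"customer": "acme", "model": "model-a",
     "input_tokens": 400_000, "output_tokens": 50_000},
    {"customer": "acme", "model": "model-b",
     "input_tokens": 1_000_000, "output_tokens": 200_000},
    {"customer": "globex", "model": "model-b",
     "input_tokens": 2_000_000, "output_tokens": 0},
]
invoices = invoice_totals(usage, markup_pct=20)
```

With a 20% markup, the example yields $2.76 for `acme` (on $2.30 of underlying model cost) and $0.60 for `globex` (on $0.50). Keeping the price table separate from the usage records mirrors the idea of syncing model prices independently of metering.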

AI Opinions and Articles

me stepping down. bye my beloved qwen. - Junyang Lin

Qwen lead Junyang Lin abruptly announced his departure from the Qwen team in a public social media post this week. The departure reportedly stemmed from unhappiness within the team over a lack of resource support; the dispute escalated enough that senior Alibaba leadership intervened, but not satisfactorily.

The announcement triggered widespread AI community discussion about the future of the Qwen team and its models. Junyang Lin has been described as the linchpin of the Qwen team, a highly productive team that has released over 400 models in 3 years. Xingyao Zhang (Leo) on X expressed it well:

Shipped Qwen 3.5 small models that run on a phone with 1GB RAM, got Elon to call it ‘impressive intelligence density,’ and walked out. That’s how you exit. Whatever’s next, good luck.

The best AI leaders seem to make a positive impact no matter where they land, so we wish Junyang Lin all the best wherever he takes his AI leadership.