TL;DR - We dive into two demos and one paper showing solid progress on autonomous AI agents and robots:
Cognition AI shared demos of Devin, their AI Agent for software engineering.
Google DeepMind published research on an AI agent called SIMA.
Figure demoed its humanoid robot conversing and acting, with GPT-4 as its high-level ‘brain’ via its OpenAI partnership.
The State of AI, One Year On
We just passed, on Pi Day (3-14), the one-year anniversary of the release of GPT-4. Coincidentally, “AI Changes Everything” is also one year old this month. Since March 2023, we have been writing about AI technology, and it’s been incredible tracking the advances in AI this past year.
This makes for a good opportunity to review and refresh the “Fundamental Thoughts on AI” I expressed a year ago. Much has been unexpected in AI since then, but these ‘fundamental thoughts’ held up pretty well:
Scale is effective in AI - unreasonably and amazingly effective: Still true. Scaling matters and scaling works, but there has been more progress on the efficiency of AI models than on advancing beyond GPT-4 capabilities. AI keeps getting cheaper and more effective.
We are in the era of Foundational Models: AI model releases confirm that foundation AI models under the hood determine the capabilities of all AI applications and AI-based systems. The AI era is driven by foundation AI models.
The prompt is the interface: True. We discussed the profound advantages and challenges of natural language inputs in “AI UX and the Magic Prompt.”
Current state-of-the-art AI will change the world: AI will keep getting better beyond GPT-4-level state-of-the-art AI, but today’s AI is enough to disrupt many activities.
2023 is the inflection point for AI adoption: True, it was, and 2024 is bringing yet more progress, innovation, and adoption.
Anyone who says “AI cannot do X” will eventually be wrong: AI continues to surprise us with what it can do. Timelines are uncertain but progress is inevitable. Go ahead, challenge me: What things do you believe AI can never do? Other than solving the Riemann Hypothesis (which has stymied human mathematicians for over a century), I draw a blank.
We will have super-human AI by 2029: So far, on track. That was a bold marker to put down, but I’m not alone in my prediction, as Nvidia CEO Jensen Huang and OpenAI CEO Sam Altman have similarly predicted AGI by the end of this decade.
Why AI Agents are Hard
AI is progressing rapidly on all fronts, in some areas accelerating timelines and exceeding expectations, but autonomous AI remains a difficult challenge. For example, self-driving cars were first promised many years ago, but autonomous vehicle AI has been far harder and taken longer to achieve than imagined.
Why are autonomous AI agents hard? Reliability.
Imagine a fantastic AI model that does 80% of tasks well. As a co-pilot, it gives you useful help 80% of the time. But if a fully autonomous AI agent uses this model to complete a complex project chaining 10 such tasks, what’s its success rate? With a 20% failure rate at each step, the agent completes the whole project only about 10% of the time.
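The arithmetic behind that number, as a quick sanity check (a minimal sketch; the 80% per-step success rate is just the illustrative figure from above):

```python
# Per-step reliability compounds: 10 chained tasks at 80% success each.
per_step_success = 0.80
num_steps = 10

chain_success = per_step_success ** num_steps
print(f"{chain_success:.1%}")  # ~10.7% -- roughly 1 project in 10 finishes cleanly
```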
So long as reliability remains a stumbling block to full automation, AI co-pilots will be more useful than full AI automation.
Meet Devin, Your Software Engineering AI Agent
Despite the challenge, many are working hard to create useful AI agents, and this news suggests real progress on that front: Cognition Labs announced its software engineering AI agent, “Devin,” sharing a blog post and compelling video demos of the agent navigating new codebases, hunting down bugs, and even fine-tuning new AI models.
Like other AI agents, Devin plans and then executes tasks. The magic under the hood is that Devin has its own command-line terminal, browser, and code editor; it also appears to use several techniques to recover from errors, which the CEO credits in the announcement to advances in “reasoning and long-term planning.”
Devin is the new state-of-the-art on the SWE-Bench coding benchmark, has successfully passed practical engineering interviews from leading AI companies, and has even completed real jobs on Upwork.
For context, SWE-bench is a benchmark of real-world software engineering problems. Created only last year, it goes beyond code generation into complex software development tasks:
Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process extremely long contexts and perform complex reasoning that goes far beyond traditional code generation
Devin’s ‘state of the art’ SWE-bench result is only 13.86%, but that dwarfs other AI models, including GPT-4, which scored under 2%.
Demos of Devin completing real, useful software engineering tasks, including an Upwork job, led to amazed reactions like this hyperbolic YouTube reaction video, full of visions of productivity boosts, money printed by AI agents, and jobs lost.
The most impressive aspect of the demos is how Devin recovers from errors and fixes bugs, such as inserting and then deleting print statements to debug code. Since reliability is the Achilles’ heel of full automation, recovering from failures and errors is a vital capability.
Andrej Karpathy shared his thoughts on software engineering automation:
In my mind, automating software engineering will look similar to automating driving. … progression of increasing autonomy and higher abstraction … AI doing more and the human doing less, but still providing oversight. In Software engineering, the progression is shaping up similar:
1. first the human writes the code manually
2. then GitHub Copilot autocompletes a few lines
3. then ChatGPT writes chunks of code
4. then you move to larger and larger code diffs
5. ... Devin is an impressive demo of what perhaps follows next: coordinating a number of tools that a developer needs to string together to write code (a Terminal, a Browser, a Code editor, etc.), with human oversight that moves to increasingly higher levels of abstraction.
A truly useful AI application or AI agent needs to leverage both the underlying AI model and capabilities beyond the model: tool use and function-calling, code execution, and RAG (retrieval-augmented generation). An agent that can control tools takes on those tools’ capabilities, becoming much more powerful.
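As a rough illustration, here is a minimal agent-loop sketch with tool use; the tool names, the `llm_step` callable, and the dispatch logic are assumptions for illustration, not Devin’s actual implementation:

```python
# Minimal sketch of an agent loop with tool use (illustrative only).
from typing import Callable

def run_shell(cmd: str) -> str:
    """Hypothetical tool: run a terminal command and return its output."""
    import subprocess
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout

def read_file(path: str) -> str:
    """Hypothetical tool: read a file from the agent's workspace."""
    with open(path) as f:
        return f.read()

TOOLS: dict[str, Callable[[str], str]] = {"shell": run_shell, "read_file": read_file}

def agent_loop(goal: str, llm_step: Callable[[str], dict], max_steps: int = 20) -> str:
    """Ask the model for one action at a time until it declares the task done."""
    transcript = f"Goal: {goal}\n"
    for _ in range(max_steps):
        # llm_step is any LLM call that returns a structured action,
        # e.g. {"tool": "shell", "input": "pytest"} or {"done": True, "answer": "..."}
        action = llm_step(transcript)
        if action.get("done"):
            return action.get("answer", "")
        observation = TOOLS[action["tool"]](action["input"])
        transcript += f"\n> {action['tool']}({action['input']})\n{observation}\n"
    return "Stopped after max_steps without finishing."
```

The point of the pattern is that the model only decides what to do next; the tools do the doing, so the agent inherits whatever its terminal, browser, or editor can accomplish.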
That Cognition AI has reliably integrated multiple capabilities into its AI agent is real progress.
SIMA, Simulating Your Way to Multi-World AI Agents
Connecting LLMs, and their world of words, to environments and actions should unlock the power of AI models to control agents. That has motivated many AI research efforts where AI agents are trained by interacting in virtual worlds.
Last year, work from Nvidia, “Voyager: An Open-Ended Embodied Agent with Large Language Models,” showed how an AI agent could learn (virtual) skills in the game Minecraft.
The latest training-agents-in-virtual-games work from DeepMind is called SIMA (Scalable Instructable Multiworld Agent), shared in the blog post “A generalist AI agent for 3D virtual environments” and the paper “Scaling Instructable Agents Across Many Simulated Worlds.”
In contrast to the single Minecraft environment, these AI agents are trained on 9 different 3D video games and 4 additional virtual environments. The researchers collected gameplay into a dataset and used that diverse dataset to train SIMA agents on 600 different skills.
SIMA is an AI agent that can perceive and understand a variety of environments, then take actions to achieve an instructed goal. It comprises a model designed for precise image-language mapping and a video model that predicts what will happen next on-screen. We finetuned these models on training data specific to the 3D settings in the SIMA portfolio.
The SIMA work showed transfer learning across environments. “An agent trained in all but one game performed nearly as well on that unseen game as an agent trained specifically on it, on average.”
Another interesting aspect is that the SIMA agent interacts at the pixel level: “This dataset is used to train agents to follow open-ended language instructions via pixel inputs and keyboard-and-mouse action outputs.” This is similar to the approach of some other AI agents, including the Rabbit R1 (recall “The Rabbit and the LAM”) and the Multi-On Agent.
If you can train a general-purpose AI agent well enough to interact at the pixel level, you can give it access to a web browser and the sky’s the limit. This may be a future solution for web-enabled AI agents.
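To make the pixel-level idea concrete, here is a minimal sketch of the interface such an agent would expose; the class names, the Action fields, and the environment API are assumptions, not SIMA’s actual code:

```python
# Minimal sketch of a pixel-in, keyboard/mouse-out agent interface
# (illustrative; not SIMA's actual code or API).
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Action:
    """One low-level action, mirroring what a human player could do."""
    keys: list[str] = field(default_factory=list)  # e.g. ["w"] to move forward
    mouse_dx: float = 0.0                          # relative horizontal mouse movement
    mouse_dy: float = 0.0                          # relative vertical mouse movement
    click: bool = False

class PixelAgent:
    """An agent that only sees the screen and a language instruction."""

    def act(self, frame: np.ndarray, instruction: str) -> Action:
        # frame: an (H, W, 3) RGB screenshot of the current screen.
        # A real agent would run a vision-language model here;
        # this placeholder simply idles.
        return Action()

def run_episode(agent: PixelAgent, env, instruction: str, max_steps: int = 1000) -> None:
    """Drive any environment that exposes screenshots and accepts key/mouse input.

    `env` is a hypothetical wrapper with reset() -> frame and
    step(action) -> (frame, done); a game, a browser, or a desktop could all fit.
    """
    frame = env.reset()
    for _ in range(max_steps):
        action = agent.act(frame, instruction)
        frame, done = env.step(action)
        if done:
            break
```

Because the interface is just screenshots in and keystrokes out, the same agent could in principle be pointed at any application a human can operate.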
Lastly, they showed that language is vital for training SIMA to map instructions into actions.
Learning to play even one video game is a technical feat for an AI system, but learning to follow instructions in a variety of game settings could unlock more helpful AI agents for any environment. Our research shows how we can translate the capabilities of advanced AI models into useful, real-world actions through a language interface.
SIMA represents a step towards creating more general-purpose, language-driven AI agents that could potentially be useful in the real world and on the Internet, beyond just gaming applications.
Figure 1 Robot Shows off GPT-Enabled Skills
If you can get LLMs to instruct an AI Agent to take an action in a virtual world, what’s the next step? It’s a continuum, with the AI model (multimodal LLM) as the headwaters:
LLMs → Coding & multi-modal LLMs → Coding AI Agents → Virtual World AI Agents → Robots
That takes us to a demo in which Figure’s humanoid robot hands a guy an apple. Figure has partnered with OpenAI, which is also an investor, and the Figure 1 robot uses GPT-4 as its high-level LLM ‘brain’ to converse and understand its world:
OpenAI models provide high-level visual and language intelligence.
Figure neural networks deliver fast, low-level, dexterous robot actions.
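A rough sketch of that division of labor (purely illustrative; the class names, loop rates, and interfaces are assumptions, not Figure’s or OpenAI’s implementation):

```python
# Illustrative two-tier control sketch: a slow high-level planner chooses
# what to do; a fast low-level policy decides how to move.
# Names, rates, and interfaces are assumptions, not Figure's implementation.

class HighLevelPlanner:
    """Slow loop (order of seconds): turns camera images plus conversation
    into a named behavior to execute."""

    def plan(self, image_bytes: bytes, transcript: str) -> str:
        # In a real system this would call a multimodal LLM (e.g. GPT-4 with
        # vision via an API) and return something like "hand_apple_to_person".
        return "hand_apple_to_person"

class LowLevelPolicy:
    """Fast loop (hundreds of Hz): maps the chosen behavior and current
    joint state to joint targets."""

    def act(self, behavior: str, joint_state: list[float]) -> list[float]:
        # A learned visuomotor policy would run here; this placeholder
        # just holds the current pose.
        return joint_state

def control_step(planner: HighLevelPlanner, policy: LowLevelPolicy,
                 image: bytes, transcript: str, joint_state: list[float]) -> list[float]:
    behavior = planner.plan(image, transcript)   # deliberate, language-level decision
    return policy.act(behavior, joint_state)     # reactive, dexterous execution
```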
It’s easy to see this both through a hyped-up lens (“OMG, dishwasher robot today, Terminator bot tomorrow”) and through a more skeptical lens.
The skeptical take: First, it’s just a demo, and robot demos have historically hidden the bloopers, construction dust, edge-case errors, and so on. Further, they slapped GPT-4 onto their robot as its high-level intelligence; is an OpenAI API call that big of a deal? They then leveraged the learnings and data from Google’s RT-1, RT-2-X, and other work to connect their agentic LLM-driven robot to actions, and added a shell, the body, to physically implement them.
As a reminder, RT-2-X builds on a dataset of datasets that, like SIMA’s data, captures actions in environments on video: 60 datasets totaling over 1 million episodes, covering 527 skills across 22 embodiments.
A more appreciative lens: This is an end-to-end neural network system. The Figure 1 robot rests on the intelligence of an embedded LLM and on converting that intelligence into actions; the LLM and the lower-level action model do the heavy lifting. Putting all of it into an embodied form is an impressive integration and coordination achievement.
Left open are the questions of how well integrated it is, how well it generalizes actions, and what its real capabilities are.
Summary
“It’s a really hard problem and we’ve only just started.” - Scott Wu, CEO of Cognition AI
Devin and Figure 1 both are impressive demos, but they are still just demos, not releases and not open source. For now, we can only guess at their technical capabilities.
The good news is that there are many AI agent efforts, both open source and proprietary, making progress, and open source AI agent frameworks like AutoGen Studio advance AI agent capabilities in ways we can assess.
There’s a lot at stake here, including the future of work, so awe and fear are understandable reactions to progress. However, some reactions are jumping to conclusions, like witnessing a baby’s first steps and expecting it to run a marathon next week.
To bring us back to where I started, I’ll add one more fundamental point about AI to my list:
The hardest problem in AI is fully autonomous AI systems
In AI self-driving terms, the Devin demo is like a self-driving AI that managed to get out of the driveway and onto some local streets. The Cognition AI CEO admitted as much in saying “we’ve only just started.” Progress is solid, but there is a long journey ahead to make these agents fully useful.
I’ll leave you with some words from Yann LeCun suggesting it will take some time. Perhaps AI agents and their embodied AI robot cousins will someday replace all human labor, but not yet. Not yet.
We'll be closer to human-level AI when we have systems that can:
- understand the physical world
- remember and retrieve appropriately
- reason
- set sub-goals and plan hierarchically
But even once we have systems with such capabilities, it will take a while before we bring them up to human or superhuman level. - Yann LeCun