AI Research Roundup 24.04.05
SWE-Agent, Octopus v2, Many-shot Jailbreaking, dynamic inference with Mixture-of-Depths, Apple's ReALM, GPT-4's bar exam marks questioned, CodeEditorBench.
Introduction
AI research highlights for this past week:
Many-shot Jailbreaking
SWE-Agent gets 12.3% on SWE-Bench
Octopus v2: An on-device function-calling AI model for better agents
Dynamic Inference Compute with Mixture-of-Depths
Study: OpenAI's GPT-4 Bar Exam Score Lower Than Claimed
CodeEditorBench: A Code Editing Benchmark
Apple's ReALM-3B Beats GPT-4 on Reference Resolution
Many-shot Jailbreaking
Anthropic AI researchers uncovered a new jailbreaking technique: pure repetition. As TechCrunch reported, Anthropic researchers found that repeated questions can wear down an AI's ethics guardrails. They reported their results in a blog post and a (not yet peer-reviewed) paper: “Many-shot Jailbreaking.”
The gist of the jailbreak is simple:
The basis of many-shot jailbreaking is to include a faux dialogue between a human and an AI assistant within a single prompt for the LLM. That faux dialogue portrays the AI Assistant readily answering potentially harmful queries from a User. At the end of the dialogue, one adds a final target query to which one wants the answer.
So, if you give the model enough examples (many shots) of bad behavior, making it look as though the LLM has already broken its own rules, it becomes confused enough to breach its own guardrails. How many examples? It takes a few hundred to break the LLM.
Anthropic shared this result to alert others and to help fix the jailbreak. How to defeat it? Some guardrails, including prompt modification, can be effective:
We had more success with methods that involve classification and modification of the prompt before it is passed to the model ... One such technique substantially reduced the effectiveness of many-shot jailbreaking — in one case dropping the attack success rate from 61% to 2%.
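Anthropic doesn't detail the classifier it used, but the general classify-then-modify pattern is straightforward to sketch. Everything in the snippet below - the function names, the classifier, the threshold - is a hypothetical stand-in, not Anthropic's implementation:

```python
# Hypothetical sketch of a classify-then-modify guardrail. The classifier,
# threshold, and helper names are stand-ins, not Anthropic's implementation.

def guarded_generate(prompt, classify_jailbreak_risk, sanitize_prompt, generate,
                     risk_threshold=0.5):
    """Screen and optionally rewrite a prompt before it reaches the model."""
    risk = classify_jailbreak_risk(prompt)   # e.g., a small fine-tuned classifier
    if risk > risk_threshold:
        # Strip or summarize the suspicious many-shot preamble before generation.
        prompt = sanitize_prompt(prompt)
    return generate(prompt)
```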
SWE-Agent gets 12.3% on SWE-Bench
The stunning demos of Cognition’s AI Agent Devin accomplishing difficult software development tasks have inspired competitors, copycats, and multiple open-source efforts to create AI Agent software engineers. Examples of open-source alternatives include OpenDevin and the Devika project (explainer video here).
The Princeton NLP group has created and published another one: SWE-agent. They released SWE-agent as open-source software in a GitHub repo, with a paper coming April 10th. The big news is their shared results: SWE-agent resolved 12.3% of SWE-Bench issues, close to the results claimed for Devin, with an average resolution time of 93 seconds per task.
How did SWE-agent get such good results? Using GPT-4 as the base model for editing and running code, the team innovated in how the LLM interacts with the computer:
We accomplish these results by designing simple LM-centric commands and feedback formats to make it easier for the LM to browse the repository, view, edit and execute code files. We call this an Agent-Computer Interface (ACI) and build the SWE-agent repository to make it easy to iterate on ACI design for repository-level coding agents.
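The paper isn't out yet, so the exact command set isn't public. As a rough illustration of the ACI idea, an agent-facing interface might expose a few compact commands that return short, structured feedback; the command names and output format below are hypothetical, not SWE-agent's actual interface:

```python
# Rough illustration of the Agent-Computer Interface (ACI) idea: a small set of
# LM-friendly commands that return compact feedback. Command names and output
# formats here are hypothetical, not SWE-agent's actual interface.
import subprocess

class SimpleACI:
    def __init__(self, repo_dir):
        self.repo_dir = repo_dir

    def search(self, term, max_hits=20):
        """Grep the repo and return a short list of file:line matches."""
        result = subprocess.run(["grep", "-rn", term, self.repo_dir],
                                capture_output=True, text=True)
        hits = result.stdout.splitlines()[:max_hits]  # keep feedback short for the LM
        return "\n".join(hits) or "No matches found."

    def view(self, path, start, end):
        """Show a numbered window of a file rather than dumping the whole file."""
        with open(path) as f:
            window = f.readlines()[start - 1:end]
        return "".join(f"{start + i}: {line}" for i, line in enumerate(window))
```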
One final point might call into question how much these AI Agents are doing beyond their underlying base LLMs. As reported on X, Claude Opus alone scored only about 2 percentage points worse than Devin:
Just ran SWE benchmark on opus-20240229, out of 1129 questions it gets 11.07% correct
This doesn’t mean AI Agents aren’t doing useful tasks beyond what even the best LLMs can do, but it reminds us that sometimes a good base AI model can do the job for you directly.
Octopus v2: An on-device function-calling AI model for better agents
Function-calling has become an essential capability for AI agents to perform useful actions, e.g., to update a calendar or book a hotel room online. The paper “Octopus v2: On-device language model for super agent” looks into improving agents’ capabilities to perform function-calling.
Cloud-based LLMs perform function-calling well, but they raise cost and privacy concerns, while smaller local models struggle with accuracy and latency. Invoking a costly LLM for every function call adds up in flows with hundreds of steps. A RAG-based lookup (as Octopus v1 used) can save cost, but it can have high latency.
Octopus v2 creates a more efficient 2B model to perform function calls that can “surpass the performance of GPT-4 in both accuracy and latency, and decrease the context length by 95%.” It also reduces the latency “by 35-fold” relative to a flow based on Llama-7B and RAG.
The implementation (as summarized on X) defines functions as special tokens and fine-tunes a small model to output the appropriate function for a given input (a code sketch follows the list below):
Define supported functions as special tokens, e.g., <func_1>, and add them to the tokenizer
Collect or generate 100-1,000 data points for each function set, e.g., “take a photo” → <func_2>
Fine-tune the LLM (here Gemma 2B) with the updated tokens on the dataset, so it learns when to use which function and with what parameters
Deploy the fine-tuned model on edge devices, or build agents on top of the functions-as-special-tokens scheme
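A minimal sketch of the setup step, using the Hugging Face transformers API (the function tokens and the training example below are illustrative; the actual Octopus v2 data and fine-tuning recipe are described in the paper):

```python
# Minimal sketch of the special-token setup using the Hugging Face transformers API.
# The function tokens and training example are illustrative, not the paper's data.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "google/gemma-2b"                     # Octopus v2 fine-tunes a Gemma 2B base
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# 1. Register each supported function as a special token.
function_tokens = ["<func_1>", "<func_2>", "<func_3>"]  # e.g. <func_2> = take a photo
tokenizer.add_special_tokens({"additional_special_tokens": function_tokens})
model.resize_token_embeddings(len(tokenizer))

# 2. Build (instruction -> function token + arguments) pairs for fine-tuning, e.g.:
example = {
    "prompt": "Take a selfie with the front camera",
    "completion": "<func_2>(camera='front')",
}
# 3. Fine-tune on a few hundred such pairs per function, then deploy on-device.
```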
The result is a small special-purpose function-calling LLM that is fast, accurate and can work on-device.
Dynamic Inference Compute with Mixture-of-Depths
Currently, LLMs spend the same amount of compute on every token and every question. Are there ways to allocate more or less inference effort depending on the situation?
In “Mixture-of-Depths: Dynamically allocating compute in transformer-based language models,” Google DeepMind researchers present an approach to dynamically allocating compute across tokens.
The Mixture-of-Depths architecture introduces a router at each layer that lets individual tokens skip that layer's computation. Layers are dynamically skipped for some tokens by enforcing a budget that caps the number of tokens that can use a given layer's self-attention and MLP computations. The result:
Not only do models trained in this way learn to dynamically allocate compute, they do so efficiently. These models match baseline performance for equivalent FLOPS and wall-clock times to train, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling.
While this shows great promise for more efficient inference serving, there are practical issues limiting its application.
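To make the mechanism concrete, here is a toy sketch of the per-layer routing idea; it is illustrative only, not the paper's implementation:

```python
# Toy sketch of Mixture-of-Depths-style routing (illustrative, not the paper's code):
# a per-layer router scores each token, only the top-k tokens within the capacity
# budget pass through the block, and the rest ride the residual stream unchanged.
import torch
import torch.nn as nn

class RoutedBlock(nn.Module):
    def __init__(self, d_model, block, capacity_fraction=0.125):
        super().__init__()
        self.router = nn.Linear(d_model, 1)   # scalar routing score per token
        self.block = block                    # e.g. an attention + MLP sub-block
        self.capacity_fraction = capacity_fraction

    def forward(self, x):                     # x: (batch, seq, d_model)
        batch, seq, _ = x.shape
        k = max(1, int(seq * self.capacity_fraction))  # token budget for this layer
        scores = self.router(x).squeeze(-1)            # (batch, seq)
        top_idx = scores.topk(k, dim=-1).indices       # tokens that get full compute
        out = x.clone()
        for b in range(batch):                         # simple loop for clarity
            chosen = x[b, top_idx[b]]                  # (k, d_model)
            gate = torch.sigmoid(scores[b, top_idx[b]]).unsqueeze(-1)
            # Gating by the router score keeps the routing decision trainable.
            out[b, top_idx[b]] = chosen + gate * self.block(chosen.unsqueeze(0)).squeeze(0)
        return out
```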
Study: OpenAI's GPT-4 Bar Exam Score Lower Than Claimed
Did GPT-4 really pass the bar exam? OpenAI claimed GPT-4 scored in the 90th percentile on the bar exam, but the recent study “Re-evaluating GPT-4’s bar exam performance” suggests it actually scored much lower.
The study, by MIT PhD candidate Eric Martínez, showed that the methodology used to score GPT-4 and compare it with other test-takers was flawed. For example, GPT-4's score was compared against a test-taking sample “skewed towards repeat test-takers who failed the July administration and score significantly lower.” Correcting for that, he found:
GPT-4’s overall UBE percentile was below the 69th percentile, and ∼48th percentile on essays. … GPT-4’s performance against first-time test takers is estimated to be ∼62nd percentile, including ∼42nd percentile on essays.
The challenge of meaningful benchmarks is only going to get more difficult as AI models get more capable and complex.
CodeEditorBench: A Code Editing Benchmark
Seeking to address the benchmarking challenge, the authors of “CodeEditorBench: Evaluating Code Editing Capability of Large Language Models” give us a valuable benchmark to help evaluate and advance AI coding models. CodeEditorBench assesses performance of LLMs using prompts and datasets across several code editing categories - debugging, translating, polishing, and requirement switching:
CodeEditorBench emphasizes real-world scenarios and practical aspects of software development. We curate diverse coding challenges and scenarios from five sources, covering various programming languages, complexity levels, and editing tasks.
On CodeEditorBench, the top-scoring LLM was GPT-4, with Gemini Ultra also performing well. The authors are releasing all prompts and datasets to the community.
Apple's ReALM-3B Beats GPT-4 on Reference Resolution
Apple has been quietly working on its own LLMs (project Ajax) as well as producing interesting research results to share. Its latest paper - "ReALM: Reference Resolution As Language Modeling" - focuses on using LLMs to effectively resolve references that occur in multi-turn conversation contexts:
This paper demonstrates how LLMs can be used to create an extremely effective system to resolve references of various types, by showing how reference resolution can be converted into a language modeling problem, despite involving forms of entities like those on screen that are not traditionally conducive to being reduced to a text-only modality.
As a special-purpose small LLM, ReALM outperforms GPT-4 for tasks that involve on-screen references without needing screenshots as inputs—unlike GPT-4, which shows improved performance when provided with images. ReALM performs effectively in scenarios where traditional LLMs not specialized in reference resolution might struggle.
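The core trick is textual: candidate entities, including those on screen, are serialized as an indexed list in the prompt so a plain language model can name the referent. A rough sketch of that framing (the prompt format below is illustrative, not Apple's exact encoding):

```python
# Rough sketch of the ReALM framing: serialize candidate entities (including
# on-screen ones) as an indexed text list so a language model can name the
# referent. The prompt format is illustrative, not Apple's exact encoding.

def build_reference_prompt(utterance, entities):
    """entities: list of dicts like {"type": "phone_number", "text": "415-555-0100"}."""
    lines = [f"{i}. ({e['type']}) {e['text']}" for i, e in enumerate(entities, 1)]
    return ("Entities currently on screen:\n" + "\n".join(lines) +
            f"\n\nUser: {utterance}\nWhich entity number is the user referring to?")

prompt = build_reference_prompt(
    "Call the one at the bottom of the page",
    [{"type": "business", "text": "Joe's Pizza"},
     {"type": "phone_number", "text": "415-555-0100"}],
)
# A fine-tuned model like ReALM (or a general LLM) then answers with the index.
```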