AI Research Roundup 24.10.11
Archon framework for inference techniques, Diff Transformer to reduce attention noise, RAG and beyond with 4 levels of data-query alignment, Astute RAG, OP-RAG, and StructuredRAG.
Introduction
This week, our Research Roundup covers LLM inference, memory, context, and Retrieval-Augmented Generation (RAG). These papers present various methods to improve how LLMs access and use relevant context to generate higher-quality output:
Archon: An Architecture Search Framework for Inference-Time Techniques
Differential Transformer
Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely
Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models
In Defense of RAG in the Era of Long-Context Language Models
StructuredRAG: JSON Response Formatting with Large Language Models
Archon: An Architecture Search Framework for Inference-Time Techniques
Researchers from Stanford University’s Scaling Intelligence Lab have developed a new inference framework called Archon, which uses an Inference-Time Architecture Search (ITAS) algorithm to improve LLM performance without additional training.
Described in the paper “Archon: An Architecture Search Framework for Inference-Time Techniques,” Archon is designed to be model agnostic and open source, helping reduce the costs of running AI models by optimizing inference.
The Archon framework uses ITAS to build and evaluate inference-time architectures that combine techniques such as ensemble generation, repeated sampling, ranking, fusion, critiquing, verification, and unit testing. ITAS searches for the architecture that maximizes generation quality on a given set of tasks, recasting the selection of an optimal inference approach as a hyperparameter optimization problem.
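To make the search concrete, here is a minimal sketch of treating inference-pipeline design as a hyperparameter search. The technique names mirror those listed above, but the search space, the random-search loop, and the placeholder scorer are illustrative assumptions, not Archon's actual implementation, which uses its own configuration space and search algorithms.

```python
import itertools
import random

# Hypothetical search space of inference-time techniques (not Archon's own).
SEARCH_SPACE = {
    "num_samples":   [1, 5, 10],            # repeated sampling
    "ensemble_size": [1, 3],                # parallel generator models
    "use_fusion":    [False, True],         # fuse candidate answers
    "use_critic":    [False, True],         # critique-and-revise pass
    "ranker":        ["none", "llm_judge"], # final candidate ranking
}

def evaluate(config, dev_tasks):
    """Placeholder scorer: a real system would run the configured pipeline on
    the dev tasks and return a quality metric (e.g., pass rate). A random
    score stands in here so the sketch runs end to end."""
    return random.random()

def itas_style_search(dev_tasks, budget=20):
    """Treat pipeline design as hyperparameter optimization: enumerate
    configurations, score a sampled subset on held-out tasks, keep the best."""
    keys = list(SEARCH_SPACE)
    all_configs = [dict(zip(keys, values))
                   for values in itertools.product(*SEARCH_SPACE.values())]
    best_config, best_score = None, float("-inf")
    for config in random.sample(all_configs, min(budget, len(all_configs))):
        score = evaluate(config, dev_tasks)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

best_config, best_score = itas_style_search(dev_tasks=[], budget=10)
```

The design choice this sketch illustrates is the same one the paper makes: once each inference-time technique is a tunable knob, finding a good pipeline becomes an optimization problem rather than hand engineering.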
In evaluations, Archon architectures outperformed existing models such as GPT-4 and Claude 3.5 Sonnet across a wide range of tasks, improving on benchmark tests over baseline LLMs by an average of 15%. The framework was most effective with larger LLMs of 70B parameters or more.
This work reinforces the value of applying inference-time compute to improve result quality. Many inference-time methods have been used to improve LLM performance, and Archon puts those methods within an optimization framework. It is also an interesting example of AI acceleration: AI systems design is converted into an AI optimization task.
Differential Transformer
The Differential Transformer paper from Microsoft and Tsinghua University is a groundbreaking result that improves how LLMs use information from context. Existing transformer-based LLMs over-allocate attention to irrelevant context, leading to unwanted noise in outputs, hallucinations, and performance variance. This paper introduces a new Differential Transformer architecture that reduces that noise, dramatically improving how LLMs glean information from their own context.
The Differential Transformer modifies the transformer architecture with a Differential Attention module, which replaces standard attention with a mechanism that computes two softmax attention maps per input and subtracts one from the other. Attention becomes differential because each attention head takes the difference between the two maps, cancelling out common attention noise.
The authors compare this method to noise-cancelling headphones and say:
The idea is analogous to differential amplifiers proposed in electrical engineering, where the difference between two signals is used as output, so that we can null out the common-mode noise of the input.
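In code, a rough single-head version of differential attention looks like the following (a PyTorch sketch; the paper's learnable λ re-parameterization, multi-head structure, and per-head normalization are omitted for brevity):

```python
import torch
import torch.nn.functional as F

def differential_attention(x, w_q, w_k, w_v, lam=0.5):
    """Single-head sketch: compute two softmax attention maps and weight the
    values by their difference, cancelling common-mode attention noise."""
    q1, q2 = (x @ w_q).chunk(2, dim=-1)   # two query projections
    k1, k2 = (x @ w_k).chunk(2, dim=-1)   # two key projections
    v = x @ w_v
    scale = (w_q.shape[-1] // 2) ** 0.5   # sqrt of per-map head dimension
    a1 = F.softmax(q1 @ k1.transpose(-1, -2) / scale, dim=-1)
    a2 = F.softmax(q2 @ k2.transpose(-1, -2) / scale, dim=-1)
    return (a1 - lam * a2) @ v            # differential attention applied to values

# Toy usage: a sequence of 4 tokens with model dimension 8
x = torch.randn(4, 8)
out = differential_attention(x, torch.randn(8, 8), torch.randn(8, 8), torch.randn(8, 8))
```

Subtracting the second map removes the attention mass the two maps assign in common, which is how the architecture suppresses attention paid to irrelevant context.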
Cancelling attention noise sharpens the model’s focus on critical information, which leads to remarkably robust improvements: parameter savings, better long-context modeling, and reduced hallucinations. Experimental results showed the following:
DIFF Transformer requires only about 65% of model size or training tokens to match Transformer’s performance. … 7.8B-size DIFF Transformer matches the performance of 13.1B-size Transformer, requiring only 59.5% of parameters.
[on TREC in-context learning benchmark] DIFF Transformer consistently outperforms Transformer across datasets and varying numbers of demonstration samples. Moreover, the improvement in average accuracy is substantial, ranging from 5.2% to 21.6%.
This architectural modification is compatible with FlashAttention and other elements of the transformer, so it can be implemented as an upgrade to existing LLM architectures. The robust results of Diff Transformer and its compatibility make it “a highly effective and promising architecture to advance LLMs.”
Retrieval Augmented Generation and LLM External Data Use
We believe that there is no one-size-fits-all solution for data-augmented LLM applications. In practice, underperformance often arises from a failure to correctly identify the core focus of a task or because the task inherently requires a blend of multiple capabilities that must be disentangled for better resolution. - From “Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely”
Researchers at Microsoft propose a framework categorizing different types of RAG tasks based on the complexity of external data and reasoning required, addressing challenges in enhancing LLMs with domain-specific knowledge. The framework includes four levels: explicit facts, implicit facts, interpretable rationales, and hidden rationales, each presenting unique technical hurdles.
It is presented in the paper “Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely.” The authors point out that effective use of data-augmented LLMs presents challenges beyond data retrieval: complex tasks also require combining the reasoning capabilities of LLMs with external knowledge.
For each of the four levels they define, the authors explain its scope and use, provide relevant datasets, and summarize the most effective techniques for addressing its challenges.
For example, explicit and implicit fact queries (levels 1 and 2) can be managed with RAG techniques, including iterative RAG for collating implicit facts. For complex hidden-rationale queries that combine knowledge with reasoning chains, however, domain expertise acquired via fine-tuning or offline learning is more helpful.
There have been many advances in integrating external knowledge into LLMs recently, including many variations on RAG, Graph-RAG, iterative RAG, fine-tuning, in-context learning, and prompting and token strategies for fact-verification. This paper provides a guide for where to use which ideas, and the authors conclude by noting there is a place for all such ideas:
Data augmented LLM applications typically involves a combination of diverse query types, necessitating developers to engineer a routing pipeline that integrates multiple methodologies to effectively tackle these multifaceted challenges.
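To illustrate the kind of routing pipeline the authors describe, here is a hypothetical sketch that classifies an incoming query into the paper's four levels and dispatches it to a matching method. The level names come from the survey; the classifier prompt, the generic `llm` callable, and the handler names are assumptions for illustration, not the authors' implementation.

```python
# The four query levels from the survey, with the method each is best served by.
LEVELS = [
    "explicit_facts",            # level 1: answerable by direct retrieval
    "implicit_facts",            # level 2: needs several retrieved facts collated
    "interpretable_rationales",  # level 3: domain rules can be supplied in-context
    "hidden_rationales",         # level 4: domain expertise learned via fine-tuning
]

HANDLERS = {
    "explicit_facts": "basic_rag",
    "implicit_facts": "iterative_rag",
    "interpretable_rationales": "prompted_reasoning",
    "hidden_rationales": "fine_tuned_domain_model",
}

def classify_query(query: str, llm) -> str:
    """Ask an LLM (any text-in/text-out callable) to label the query's level."""
    prompt = ("Classify the query into one of: " + ", ".join(LEVELS)
              + f"\nQuery: {query}\nLevel:")
    label = llm(prompt).strip()
    return label if label in LEVELS else "explicit_facts"  # conservative fallback

def route(query: str, llm) -> str:
    """Return the name of the pipeline that should handle this query."""
    return HANDLERS[classify_query(query, llm)]
```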
Astute RAG: Overcoming Imperfect Retrieval and Knowledge Conflicts for LLMs
The paper Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models addresses the problem of conflicts and errors in retrieved information in RAG.
The authors examine conflicts between an LLM’s internal knowledge and the external knowledge returned by retrieval, and propose Astute RAG, a novel Retrieval-Augmented Generation (RAG) approach designed to make LLMs robust to imperfect retrieval. Astute RAG employs three major steps (sketched in code after the list):
Adaptive generation of internal knowledge to explicitly complement the retrieved passages.
Source-aware knowledge consolidation of information from various internal and external sources.
Answer finalization based on information reliability.
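The following sketch shows how those three steps could be wired together around a generic `llm(prompt)` callable. The prompts, data structures, and helper names are illustrative assumptions, not the paper's released code.

```python
def generate_internal_knowledge(llm, question, max_passages=2):
    """Step 1: adaptively elicit the model's own knowledge as extra passages."""
    prompt = (f"Write up to {max_passages} short passages stating what you "
              f"already know that could answer:\n{question}\n"
              "If you are unsure, write nothing.")
    text = llm(prompt)
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def consolidate(llm, question, retrieved, internal):
    """Step 2: source-aware consolidation - group consistent information and
    flag conflicts, tagging each passage as internal or external."""
    tagged = ([f"[external] {p}" for p in retrieved]
              + [f"[internal] {p}" for p in internal])
    prompt = ("Consolidate the passages below into consistent groups, noting "
              "conflicts and each group's sources.\n"
              f"Question: {question}\nPassages:\n" + "\n".join(tagged))
    return llm(prompt)

def finalize_answer(llm, question, consolidated):
    """Step 3: answer from the most reliable, best-supported group."""
    prompt = ("Using the consolidated information, pick the most reliable group "
              f"and answer the question.\nQuestion: {question}\n{consolidated}")
    return llm(prompt)

def astute_rag(llm, question, retrieved_passages):
    internal = generate_internal_knowledge(llm, question)
    consolidated = consolidate(llm, question, retrieved_passages, internal)
    return finalize_answer(llm, question, consolidated)
```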
Experiments on various datasets show that Astute RAG significantly outperforms earlier robustness-enhanced RAG methods:
Our experiments using Gemini and Claude demonstrate that Astute RAG significantly outperforms previous robustness-enhanced RAG methods. Notably, Astute RAG is the only approach that matches or exceeds the performance of LLMs without RAG under worst-case scenarios. Further analysis reveals that Astute RAG effectively resolves knowledge conflicts, improving the reliability and trustworthiness of RAG systems.
In Defense of RAG in the Era of Long-Context Language Models
The paper In Defense of RAG in the Era of Long-Context Language Models re-examines the effectiveness of RAG for long-context question answering in light of the trend towards longer context windows in LLMs, and makes the case for RAG even with long-context LLMs:
Unlike works favoring the long-context LLM over RAG, we argue that the extremely long context in LLMs suffers from a diminished focus on relevant information and leads to potential degradation in answer quality.
The authors propose order-preserve retrieval-augmented generation (OP-RAG), which keeps retrieved chunks in the order they appear in the original text, as opposed to traditional RAG, which places chunks in relevance-descending order. The paper shows that the order of retrieved chunks in the context of the language model (LM) is vital for answer quality and that OP-RAG significantly improves the answer quality of RAG.
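A minimal sketch of the order-preserving step is below; `score` stands in for any relevance function (for example, embedding cosine similarity), and the chunking and prompt construction are simplified rather than the paper's exact setup.

```python
def op_rag_context(query, chunks, score, k=8):
    """chunks: list of text chunks in their original document order."""
    # Rank chunk indices by relevance to the query, as in vanilla RAG.
    ranked = sorted(range(len(chunks)),
                    key=lambda i: score(query, chunks[i]), reverse=True)
    # OP-RAG's key step: take the top-k, then restore original document order.
    top_k_in_doc_order = sorted(ranked[:k])
    return "\n\n".join(chunks[i] for i in top_k_in_doc_order)

# Vanilla RAG would instead join chunks[i] for i in ranked[:k],
# i.e., in relevance-descending order.
```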
The paper also finds a trade-off between the number of retrieved chunks and answer quality:
With OP-RAG, as the number of retrieved chunks increases, the answer quality initially rises, and then declines, forming an inverted U-shaped curve. There exist sweet points where OP-RAG could achieve higher answer quality with much less tokens than long-context LLM taking the whole context as input.
Experiments on public benchmarks demonstrate OP-RAG’s superiority over vanilla RAG, showing it achieves higher answer quality than counterparts that rely solely on long-context LMs.
StructuredRAG: JSON Response Formatting with LLMs
The ability of LLMs to generate structured outputs, such as JSON, is crucial for their use in many AI systems. Researchers at Weaviate (the open-source vector database) present the paper StructuredRAG: JSON Response Formatting with Large Language Models to address the challenge of evaluating and improving structured LLM outputs.
StructuredRAG is a benchmark of six tasks designed to assess LLMs' proficiency in following response format instructions. The authors evaluate two state-of-the-art LLMs, Gemini 1.5 Pro and Llama 3 8B-instruct with 4-bit quantization, using two distinct prompting strategies:
Across 24 experiments, we find an average success rate of 82.5%. We further find a high variance in performance across tasks, models, and prompting strategies with success rates ranging from 0 to 100%. We find that Llama 3 8B-instruct often performs competitively with Gemini 1.5 Pro.
The results show that task complexity significantly influences performance, with tasks involving lists or composite object outputs proving more challenging; they also highlight the varying capabilities of LLMs in generating complex structured outputs.
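As a concrete illustration of what these tasks measure, here is a small sketch of a JSON response-format instruction and the corresponding parse-and-type check; the task specification and prompt wording are illustrative, not the benchmark's exact prompts.

```python
import json

def format_instruction(spec):
    """Build a response-format instruction like those used in format-following tasks."""
    fields = ", ".join(f'"{key}": <{typ}>' for key, typ in spec.items())
    return f"Respond ONLY with a JSON object of the form {{{fields}}}."

def check_response(raw_response, spec):
    """Return True if the output parses as JSON with the expected keys and types."""
    try:
        obj = json.loads(raw_response)
    except json.JSONDecodeError:
        return False
    type_map = {"string": str, "int": int, "list of strings": list, "bool": bool}
    return all(key in obj and isinstance(obj[key], type_map[typ])
               for key, typ in spec.items())

# Example: a single-field task asking for an integer rating
spec = {"answer_rating": "int"}
print(format_instruction(spec))
print(check_response('{"answer_rating": 4}', spec))  # True
print(check_response('rating: 4', spec))             # False
```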
Conclusion
Collective takeaways from these papers:
Optimization strategies (like the Archon framework) can be used to choose the best inference techniques. Different techniques are best for different knowledge retrieval use-cases, as shown in the “RAG and Beyond” paper. Methods like Astute RAG can recover from data conflicts.
The most powerful improvement to managing context shared above may be the Diff Transformer, a novel differential attention mechanism that can tune out noise and focus attention on the important context information. Research progress on knowledge retrieval (RAG) and inference techniques to improve LLM results continues at a rapid clip; this will keep driving LLMs to be faster, cheaper, more factual, and smarter.