Fine-Tuning LLMs with Direct Preference Optimization
DPO - Direct Preference Optimization - is the new fine-tuning kid on the block
The Fine-Tuning Process
Fine-tuning LLMs has proven to be the essential added ingredient that evolved LLMs from mere text-completion models into aligned, instruction-following AI models that answer questions, solve specific problems, and more.
We have previously described and discussed the key steps of fine-tuning an LLM. The fine-tuning process used to make InstructGPT, ChatGPT, and GPT-4, as well as other leading LLMs, follows three steps:
Supervised Fine-Tuning (SFT): Fine-tune the model on curated examples of preferred outputs. This can tune a model to follow instructions or to act as a conversational chat model.
Train a reward model: Using preference data collected over alternative outputs from the LLM, train a reward model for use in RL (a minimal sketch of the loss involved follows this list).
Optimize with Reinforcement Learning: Use the reward model as the training signal in RL to update the LM policy, steering the LLM’s outputs toward the preferred behavior.
This RLHF process helps align an LLM with user intent.
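As a rough illustration of step 2, the reward model is commonly trained with a pairwise, Bradley-Terry-style loss that pushes the reward of the preferred response above the rejected one. The PyTorch snippet below is a minimal sketch of that loss on dummy reward scores; the names and numbers are illustrative, not taken from any particular implementation.

```python
import torch
import torch.nn.functional as F

def reward_pairwise_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry style) reward-model loss:
    -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Dummy scalar rewards for a batch of 4 preference pairs (illustrative only).
r_chosen = torch.tensor([1.2, 0.3, 0.9, 2.0])
r_rejected = torch.tensor([0.1, 0.5, -0.4, 1.1])
print(reward_pairwise_loss(r_chosen, r_rejected))
```

Minimizing this loss teaches the reward model to score preferred responses higher, and that scalar reward is what the RL step then optimizes against.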
Regarding step 2, how do we decide which of two outputs to prefer? This is where the human feedback in Reinforcement Learning from Human Feedback (RLHF) comes in. RLHF requires human labelers to create the preference data that feeds the reward model. When OpenAI was building GPT-4, this required a vast army of gig workers (OpenAI hired people in Kenya to do this work).
So while RLHF proved effective, it is also expensive, time-consuming, and bottlenecked by the need for human input, a sharp contrast to LLM pre-training, which can be largely automated and scaled with massive compute.
An alternative to RLHF was presented in “RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback,” which uses off-the-shelf LLMs as annotators of preference data over AI model responses. Experiments showed the LLM annotations were at least as good as human annotations overall, demonstrating that RLAIF can achieve performance comparable to RLHF. Using an AI model to define preferences over responses may seem circular, but, given an appropriate prompt, a model can rate a response more easily than it can generate one.
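To make the LLM-as-annotator idea concrete, here is a minimal sketch of a judge prompt for collecting AI preference labels. The wording and parsing are illustrative rather than the RLAIF paper’s actual prompt, and call_llm stands in for whatever chat-completion client you use.

```python
# Sketch of RLAIF-style preference annotation with an off-the-shelf LLM judge.
JUDGE_TEMPLATE = """You are comparing two candidate answers to the same prompt.

Prompt:
{prompt}

Answer A:
{answer_a}

Answer B:
{answer_b}

Which answer is more helpful, harmless, and honest? Reply with exactly "A" or "B"."""

def annotate_preference(prompt: str, answer_a: str, answer_b: str, call_llm) -> str:
    """Ask an LLM which of two responses it prefers; returns "A" or "B"."""
    verdict = call_llm(JUDGE_TEMPLATE.format(
        prompt=prompt, answer_a=answer_a, answer_b=answer_b))
    return "A" if verdict.strip().upper().startswith("A") else "B"
```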
DPO - Direct Preference Optimization
Another issue with RLHF is its complexity: it first fits a reward model to preference examples and then runs RL against that model. Shouldn’t there be a more direct way?
Stanford researchers answered this question affirmatively with an algorithm called Direct Preference Optimization (DPO). Their work was published in Direct Preference Optimization: Your Language Model is Secretly a Reward Model and presented at the NeurIPS 2023 conference held this week, where it won a 2023 NeurIPS paper award. The DPO algorithm derives the optimal policy directly, in closed form, which avoids the need for an RL training loop:
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for fitting a reward model, sampling from the LM during fine-tuning, or performing significant hyperparameter tuning.
They show that, at least for models up to 6B parameters, DPO matched or exceeded RLHF’s ability to align LLMs with given preferences, with a simpler implementation.
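The heart of DPO is a single classification-style loss over preference pairs, computed against a frozen reference (SFT) model rather than a learned reward model. Below is a minimal PyTorch sketch of that loss; the variable names and the dummy batch are mine, not the paper’s code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """DPO loss on per-response log-probabilities.

    Each argument is log pi(y|x) summed over the response tokens, under the
    trainable policy and the frozen reference (SFT) model respectively.
    """
    # Implicit rewards are the beta-scaled log-ratios against the reference model.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # -log sigmoid(beta * margin): push the chosen log-ratio above the rejected one.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Dummy summed log-probs for a batch of 3 preference pairs (illustrative only).
loss = dpo_loss(torch.tensor([-12.0, -8.5, -20.1]), torch.tensor([-14.2, -9.0, -19.8]),
                torch.tensor([-13.0, -8.7, -20.5]), torch.tensor([-13.5, -8.9, -20.0]))
print(loss)
```

Because the gradient flows only through ordinary log-probabilities, this amounts to supervised training on preference pairs: no reward model, no sampling loop, no PPO machinery.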
Zephyr and distilled DPO
DPO is making its way into AI model releases. In October, Hugging Face released Zephyr 7B Beta, a small but powerful aligned LLM fine-tuned on top of Mistral AI’s 7B model. Zephyr 7B Beta showed the effectiveness of fine-tuning with distilled DPO.
Zephyr followed a procedure of distilled Supervised Fine-Tuning (dSFT) followed by distilled Direct Preference Optimization (dDPO): distilled DPO first ranks a dataset of outputs using a teacher model, then applies DPO to learn the improved intent alignment. Zephyr’s technical report explains the three-step fine-tuning process (a sketch of the preference binarization in step 2 follows the list):
(1) Large scale, self-instruct-style dataset construction (UltraChat), followed by distilled supervised fine-tuning (dSFT).
(2) AI Feedback (AIF) collection via an ensemble of chat model completions, followed by scoring by GPT-4 (UltraFeedback) and binarization into preferences.
(3) Distilled direct preference optimization (dDPO) of the dSFT model utilizing the feedback data.
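Step (2) ends with “binarization into preferences.” Roughly, each prompt has several model completions scored by GPT-4; the highest-scoring one becomes the “chosen” response and one of the others becomes the “rejected” response. The sketch below captures that idea with simplified, illustrative field names, not the exact UltraFeedback schema.

```python
import random

def binarize_preferences(prompt, completions, scores, seed=0):
    """Turn an ensemble of scored completions into one (chosen, rejected) pair:
    keep the highest-scoring completion and sample a lower-scoring one to reject."""
    rng = random.Random(seed)
    ranked = sorted(zip(scores, completions), key=lambda pair: pair[0], reverse=True)
    chosen = ranked[0][1]
    rejected = rng.choice([completion for _, completion in ranked[1:]])
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Illustrative GPT-4-style scores over four completions of one prompt.
pair = binarize_preferences(
    "Explain DPO in one sentence.",
    ["resp_a", "resp_b", "resp_c", "resp_d"],
    [8.5, 6.0, 9.2, 7.1],
)
print(pair["chosen"], pair["rejected"])  # "resp_c" plus one of the lower-scored responses
```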
The result was the Zephyr 7B model, which does remarkably well on chat-related benchmarks (MT-Bench score of 7.31), outperforming other open-source models, including the much larger Llama 2 70B chat model. Zephyr 7B Beta showed that “a good teacher is all you need,” at least when you combine a teacher model with good distilled DPO.
Notus uses data curation and distilled DPO
Notus is another 7B model built on the work in Zephyr 7B, using a variation on the same technique. Argilla, the team behind Notus, describes the Notus models on GitHub as “a collection of fine-tuned models using SFT, DPO, SFT+DPO, and/or any other RLAIF/RLHF techniques.” This mouthful of acronyms points to DPO as one of their techniques.
Meet Notus, an enhanced LLM with Data-Driven Fine-Tuning describes in more detail how Notus v1 7B was made. It followed the process used for Zephyr, with some adjustments to how evaluations of responses were scored and used.
Notably, while Zephyr used the overall critique score to determine chosen responses, Notus opted to use the average of the per-aspect preference ratings.
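To illustrate the difference, the sketch below selects a “chosen” response both ways, assuming UltraFeedback-style records that carry an overall critique score plus several per-aspect ratings. The field names are illustrative, not the exact dataset schema.

```python
def pick_chosen(responses):
    """Pick the 'chosen' response two ways: by overall critique score
    (Zephyr-style) and by the average of per-aspect ratings (Notus-style)."""
    by_overall = max(responses, key=lambda r: r["overall_score"])
    by_avg_rating = max(responses, key=lambda r: sum(r["ratings"]) / len(r["ratings"]))
    return by_overall, by_avg_rating

responses = [
    {"text": "resp_a", "overall_score": 9.0, "ratings": [3, 4, 3, 3]},
    {"text": "resp_b", "overall_score": 7.5, "ratings": [5, 4, 5, 4]},
]
zephyr_pick, notus_pick = pick_chosen(responses)
print(zephyr_pick["text"], notus_pick["text"])  # resp_a vs resp_b: the two criteria can disagree
```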
Overall, the MT-Bench benchmark results for Notus were similar to Zephyr’s.
Mixtral 8x7B Uses DPO
A final example of an LLM using DPO is Mistral’s latest release, the Mixtral 8x7B LLM. As they mention in their Mixtral of experts post:
This model has been optimized through supervised fine-tuning and direct preference optimization (DPO) for careful instruction following. On MT-Bench, it reaches a score of 8.30, making it the best open-source model, with a performance comparable to GPT3.5.
The remarkably good benchmarks for the Mixtral 8x7B model are a validation that DPO can scale up and be an effective alternative to RLHF for larger and more capable AI models.
Conclusion
The techniques and architectures used to build LLMs and multi-modal foundation AI models are continually improving and evolving. There will be a continued effort to find more efficient and more automated processes.
Although RLHF has been the standard method for fine-tuning for AI Alignment, there is strong motivation to replace it with scalable, cheaper, automated methods. DPO, combined with using LLMs to evaluate responses for preference scoring, is a strong contender.
So long as it continues to work, DPO will likely become the go-to process for resource-constrained open-source LLM builders, except in cases where direct human feedback or human red-teaming is essential. Expect DPO to become more prevalent in the AI model builders’ stack.
Postscript: Alignment without Fine-Tuning
Another way of looking at alignment is to go back to the prompt. The project page for the Re-Align project from the Allen Institute shares a link to the paper The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning and summarizes their findings:
Alignment affects only a very small fraction of tokens. The base and aligned LLMs behave the same in decoding on most positions, where they share the same top-ranked tokens.
Alignment mainly concerns stylistic tokens, such as discourse markers, transitional words, and safety disclaimers, which only take about 5-8% of the positions.
Alignment is more critical for earlier tokens. For most positions, the aligned model's top-ranked token is within the top 5 tokens ranked by the base model.
Base LLMs have already acquired adequate knowledge to follow instructions. They behave very similarly to aligned LLMs when given an appropriate context as a prefix.
Based on these findings, we rethink the alignment of LLMs by posing the research question: how effectively can we align base LLMs without SFT or RLHF? To address this, we introduce a simple, tuning-free alignment method, URIAL. URIAL achieves effective alignment purely through in-context learning (ICL) with base LLMs, requiring as few as three constant stylistic examples and a system prompt. … Results demonstrate that base LLMs with URIAL can match or even surpass the performance of LLMs aligned with SFT or SFT+RLHF.
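To make the URIAL idea concrete, here is a rough sketch of such a tuning-free prompt: a system-style preamble plus a handful of constant stylistic examples prepended to the user’s query as in-context learning. The wording, markers, and examples here are illustrative; the paper’s actual templates differ.

```python
# Sketch of a URIAL-style prompt built from a fixed preamble and a few
# constant stylistic examples (illustrative text, not the paper's templates).
SYSTEM = ("You are a helpful, honest assistant. Answer clearly, acknowledge "
          "uncertainty, and decline unsafe requests.")

STYLISTIC_EXAMPLES = [
    ("How do I boil an egg?",
     "Place the egg in boiling water for 7 to 9 minutes, then cool it in cold water."),
    ("What is the capital of France?",
     "The capital of France is Paris."),
    ("Can you help me pick a lock?",
     "I can't help with that, but a licensed locksmith can assist you."),
]

def urial_style_prompt(user_query: str) -> str:
    """Prepend the preamble and stylistic examples to the user's query."""
    parts = [SYSTEM, ""]
    for question, answer in STYLISTIC_EXAMPLES:
        parts += [f"# Query:\n{question}", f"# Answer:\n{answer}", ""]
    parts += [f"# Query:\n{user_query}", "# Answer:"]
    return "\n".join(parts)

print(urial_style_prompt("Explain why the sky is blue."))
```

A base LLM completing this kind of prompt tends to continue in the same aligned, assistant-like style, which is the effect the URIAL results quantify.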
This result sounds surprising, but it is logical. A prompt is a way to steer output, so why can’t the right prompt do the heavy lifting of a fine-tune?
The challenge with this result is that the prompt is not a guardrail against abuse by users trying to jailbreak an AI model, so it doesn’t obviate the need for AI Alignment via fine-tuning. It is a good reminder of the power of good prompt engineering.