Over the past two years, AI has seen ideas move from research to industry implementation at a record-breaking pace. This phenomenon continues to motivate our ongoing event series, New Ideas in AI, which brings together today’s most relevant AI authors, founders, and practitioners. This month, we had the honor of featuring Archit Sharma and Rafael Rafailov, two of the authors of the 2023 NeurIPS Outstanding Paper “Direct Preference Optimization: Your Language Model is Secretly a Reward Model” (aka the DPO paper).
DPO presents a closed-form, efficient, yet mathematically equivalent alternative to RLHF. After being credited as the algorithm responsible for many of the performance gains seen in the jaw-dropping release of ChatGPT in 2022, RLHF became the gold standard for aligning language models to human preferences and a prerequisite for achieving state-of-the-art performance. However, the complexity, instability, and computational demands of the RLHF algorithm made it largely inaccessible and difficult to reproduce. DPO breaks down these barriers. With DPO already used in production by AI industry leaders like Meta and Mistral, it stands as yet another testament to the immediate applicability and pertinence of the research happening in AI today.
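For readers curious about what that closed-form objective looks like in practice, here is a minimal sketch of the DPO loss described in the paper. It is an illustration, not the authors’ reference implementation: the function and argument names (e.g. `policy_chosen_logps`) are our own, and the inputs are assumed to be summed token log-probabilities of the chosen and rejected responses under the current policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO objective: binary cross-entropy over the
    difference in log-probability ratios between the policy and a
    frozen reference model, scaled by beta."""
    # Log-ratios of the policy vs. the reference model for each response
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # The preferred response should receive a higher implicit reward
    logits = beta * (chosen_logratios - rejected_logratios)

    # -log sigmoid(logits), averaged over the batch of preference pairs
    return -F.logsigmoid(logits).mean()
```

Note that nothing here requires training a separate reward model or running an RL loop: the loss is computed directly from preference pairs, which is where the efficiency gains come from.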
You can find the full talk with Archit and Rafael above.
Here are some of the highlights from the discussion that we found most exciting:
The practical implications of DPO:
- We live in a GPU-constrained world. DPO makes alignment accessible to those who previously lacked compute resources, and can free up the resources of those previously dependent on RLHF.
- Bringing down the barrier to aligning models means bringing down the barrier to enterprise adoption. Alignment is not only core to making AI better at the topics you care about, but also crucial for avoiding the topics that are out of bounds. These guarantees are non-negotiable for businesses looking to adopt AI.
- Lower computational complexity also means less time to value. DPO shortens the time it takes to integrate alignment data into models, allowing them to adapt faster and remain fresh.
Are there any reasons you wouldn’t want to use DPO?
In the words of the authors, “just try DPO first.” With such dramatic efficiency gains and, in almost every case, no or negligible performance loss, the opportunity cost of trying DPO as an alignment method pales in comparison to that of RLHF.
Thank you again to Archit and Rafael. You can learn more about our New Ideas in AI Series here.