ORPO: NEW DPO Alignment and SFT Method for LLM

Published: 24 March 2024
on channel: Discover AI

Instead of the classical SFT-then-DPO alignment pipeline for training our LLMs, there is a new method available: ORPO, an innovative "reference model-free", monolithic odds ratio preference optimization algorithm that eliminates the need for a separate preference alignment phase.

A New Preference-aligned SFT method.

We explore this idea from a theoretical physics perspective and notice a similarity to regularization-term methodologies. We further explore the conceptual similarity of a Lagrange multiplier to the new correction term added to the classical SFT loss functional.
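To make the "correction term added to the SFT loss" concrete, here is a minimal sketch of the ORPO objective from the paper: the usual SFT negative log-likelihood plus a weighted odds-ratio term that rewards the chosen response over the rejected one. The function and variable names, the scalar (per-sequence average log-probability) inputs, and the weight `lam` are illustrative assumptions, not the authors' code.

```python
import math

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def log_odds(logp):
    # log( p / (1 - p) ) computed from a log-probability logp = log p.
    return logp - math.log1p(-math.exp(logp))

def orpo_loss(logp_chosen, logp_rejected, nll_chosen, lam=0.1):
    """Sketch of L_ORPO = L_SFT + lam * L_OR.

    logp_chosen / logp_rejected: average token log-probabilities the model
    assigns to the preferred and dispreferred responses (assumed inputs).
    nll_chosen: the standard SFT negative log-likelihood of the chosen response.
    lam: the weighting of the odds-ratio correction term (hypothetical value).
    """
    # L_OR = -log sigmoid( log odds(chosen) - log odds(rejected) )
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    return nll_chosen + lam * (-log_sigmoid(ratio))
```

Note how the odds-ratio term acts much like a regularizer or Lagrange-multiplier term: it is simply added to the SFT loss with a weight, penalizing models that assign similar odds to chosen and rejected responses.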

The performance figures of ORPO are given in comparison to Llama 2 7B and Mistral 7B models.

ORPO: Monolithic Preference Optimization without Reference Model
https://arxiv.org/pdf/2403.07691v2.pdf
