IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization

Abstract

In the realm of large language models (LLMs), the ability to follow instructions accurately is paramount: more agents and applications are being built on LLMs, and the complexity of their instructions is increasing rapidly. However, only a limited amount of data exists for evaluating complex instruction following, and there are no dedicated algorithms for improving this ability. To this end, this paper introduces TRACE, a benchmark for improving and evaluating complex instruction-following ability, which consists of 120K training examples and 1K evaluation examples. Furthermore, we propose IOPO (Input-Output Preference Optimization), an alignment method that takes both input and output preference pairs into consideration, so that LLMs not only rapidly align with response preferences but also meticulously explore instruction preferences. Extensive experiments on both in-domain and out-of-domain datasets confirm the effectiveness of IOPO, showing improvements of 8.15% and 2.18% on in-domain data and 6.29% and 3.13% on out-of-domain data compared to SFT and DPO, respectively.

