IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization

Abstract

In the realm of large language models (LLMs), the ability of models to follow instructions accurately is paramount: more and more agents and applications are built on LLMs, and the complexity of their instructions is increasing rapidly. However, on the one hand, only a limited amount of data exists for evaluating complex instruction following; on the other hand, there are no dedicated algorithms for improving the ability to follow complex instructions. To this end, this paper introduces TRACE, a benchmark for improving and evaluating complex instruction-following ability, which consists of 120K training examples and 1K evaluation examples. Furthermore, we propose the IOPO (Input-Output Preference Optimization) alignment method, which takes both input and output preference pairs into consideration, so that LLMs not only align rapidly with response preferences but also meticulously explore instruction preferences. Extensive experiments on both in-domain and out-of-domain datasets confirm the effectiveness of IOPO, showing improvements of 8.15% and 2.18% on in-domain data, and 6.29% and 3.13% on out-of-domain data, compared to SFT and DPO respectively.
