IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization

Abstract

In the realm of large language models (LLMs), the ability to follow instructions accurately is paramount: a growing number of agents and applications are built on LLMs, and the complexity of the instructions they issue is increasing rapidly. However, only a limited amount of data exists for evaluating complex instruction following, and there are no dedicated algorithms for improving this ability. To this end, this paper introduces TRACE, a benchmark for improving and evaluating complex instruction-following ability, which consists of 120K training instances and 1K evaluation instances. Furthermore, we propose the IOPO (Input-Output Preference Optimization) alignment method, which takes both input and output preference pairs into consideration, so that LLMs not only align rapidly with response preferences but also meticulously explore instruction preferences. Extensive experiments on both in-domain and out-of-domain datasets confirm the effectiveness of IOPO, showing improvements of 8.15% and 2.18% on in-domain data and of 6.29% and 3.13% on out-of-domain data compared to SFT and DPO, respectively.
