IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization

November 9, 2024
Authors: Xinghua Zhang, Haiyang Yu, Cheng Fu, Fei Huang, Yongbin Li
cs.AI

Abstract

In the realm of large language models (LLMs), the ability of models to accurately follow instructions is paramount, as more and more agents and applications are built on LLMs and the complexity of their instructions is increasing rapidly. However, on the one hand, complex instruction evaluation data is limited; on the other hand, there are no dedicated algorithms for improving the ability to follow complex instructions. To this end, this paper introduces TRACE, a benchmark for improving and evaluating complex instruction-following ability, which consists of 120K training examples and 1K evaluation examples. Furthermore, we propose IOPO (Input-Output Preference Optimization), an alignment method that takes both input and output preference pairs into consideration, so that LLMs not only rapidly align with response preferences but also carefully explore instruction preferences. Extensive experiments on both in-domain and out-of-domain datasets confirm the effectiveness of IOPO, showing improvements of 8.15% and 2.18% on in-domain data and 6.29% and 3.13% on out-of-domain data over SFT and DPO, respectively.
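
For context, standard DPO optimizes only over output preference pairs for a fixed instruction. The sketch below restates that well-known objective and adds a hypothetical input-preference term of the kind the abstract alludes to (preferences over instructions for a fixed response); it is an illustration of the idea only, not the paper's exact IOPO objective or pair-construction procedure.

```latex
% Standard DPO: for a fixed instruction x, the chosen output y_w is preferred
% over the rejected output y_l.
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \Bigl[\log \sigma\Bigl(
      \beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \Bigr)\Bigr]

% Hypothetical input-preference term (illustrative only): for a fixed response y,
% an instruction x_w that y actually satisfies should make y more likely than a
% perturbed instruction x_l that y violates.
\mathcal{L}_{\mathrm{input}}
  = -\,\mathbb{E}_{(x_w,\,x_l,\,y)\sim\mathcal{D}}
    \Bigl[\log \sigma\Bigl(
      \beta \log \tfrac{\pi_\theta(y \mid x_w)}{\pi_{\mathrm{ref}}(y \mid x_w)}
      - \beta \log \tfrac{\pi_\theta(y \mid x_l)}{\pi_{\mathrm{ref}}(y \mid x_l)}
    \Bigr)\Bigr]
```

In this reading, the output term drives rapid alignment with response preferences, while the input term forces the model to discriminate between fine-grained instruction variants; how IOPO actually combines the two terms is specified in the paper itself.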
