

IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization

November 9, 2024
作者: Xinghua Zhang, Haiyang Yu, Cheng Fu, Fei Huang, Yongbin Li
cs.AI

Abstract

In the realm of large language models (LLMs), the ability of models to accurately follow instructions is paramount as more agents and applications leverage LLMs for construction, where the complexity of instructions is rapidly increasing. However, on the one hand, there is only a limited amount of complex instruction evaluation data; on the other hand, there are no dedicated algorithms to improve the ability to follow complex instructions. To this end, this paper introduces TRACE, a benchmark for improving and evaluating the complex instruction-following ability, which consists of 120K training data and 1K evaluation data. Furthermore, we propose IOPO (Input-Output Preference Optimization), an alignment method that takes both input and output preference pairs into consideration, where LLMs not only rapidly align with response preferences but also meticulously explore instruction preferences. Extensive experiments on both in-domain and out-of-domain datasets confirm the effectiveness of IOPO, showing improvements of 8.15% and 2.18% on in-domain data and 6.29% and 3.13% on out-of-domain data compared to SFT and DPO, respectively.
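
The abstract describes IOPO only at a high level, so the following is a minimal, hypothetical sketch of how an objective could use both input (instruction) and output (response) preference pairs, assuming a DPO-style logistic loss. The function names, the x1/x2/y1/y2 pairing convention, and the way the two terms are combined are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch: a DPO-style pairwise loss extended with an
# input-preference term, to illustrate the idea of optimizing over both
# input and output preference pairs. Not the paper's exact objective.
import torch
import torch.nn.functional as F


def pairwise_logistic_loss(chosen_logps, rejected_logps,
                           ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO-style logistic loss on one (chosen, rejected) pair of
    sequence log-probabilities under the policy and a frozen reference."""
    chosen_reward = beta * (chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_reward - rejected_reward)


def iopo_style_loss(logps, ref_logps, beta=0.1):
    """
    `logps` / `ref_logps`: dicts of policy / reference sequence log-probs for
    the (instruction, response) combinations built from an input pair
    (x1 vs. x2) and an output pair (y1 vs. y2), where y1 is the response
    preferred under instruction x1. Keys assumed: 'x1_y1', 'x1_y2', 'x2_y1'.
    """
    # Output-preference term: given instruction x1, prefer response y1 over y2
    # (this part mirrors standard DPO on response pairs).
    loss_out = pairwise_logistic_loss(
        logps["x1_y1"], logps["x1_y2"],
        ref_logps["x1_y1"], ref_logps["x1_y2"], beta)
    # Input-preference term: for response y1, prefer instruction x1 over the
    # perturbed instruction x2, pushing the model to discriminate fine-grained
    # differences between complex instructions.
    loss_in = pairwise_logistic_loss(
        logps["x1_y1"], logps["x2_y1"],
        ref_logps["x1_y1"], ref_logps["x2_y1"], beta)
    return (loss_out + loss_in).mean()
```

In this sketch the two terms are simply summed with equal weight; how the actual method weights or constructs the instruction and response pairs is detailed in the paper itself.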

