IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization

Abstract

In the realm of large language models (LLMs), the ability of models to accurately follow instructions is paramount as more agents and applications are built on LLMs and the complexity of instructions rapidly increases. However, on the one hand, only a limited amount of complex instruction evaluation data exists; on the other hand, there are no dedicated algorithms for improving the ability to follow complex instructions. To this end, this paper introduces TRACE, a benchmark for improving and evaluating complex instruction-following ability, which consists of 120K training examples and 1K evaluation examples. Furthermore, we propose the IOPO (Input-Output Preference Optimization) alignment method, which takes both input and output preference pairs into consideration, so that LLMs not only rapidly align with response preferences but also meticulously explore instruction preferences. Extensive experiments on both in-domain and out-of-domain datasets confirm the effectiveness of IOPO, showing improvements of 8.15% and 2.18% on in-domain data and 6.29% and 3.13% on out-of-domain data compared to SFT and DPO, respectively.
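The abstract does not state the IOPO objective, so the following is only an illustrative sketch of what "input and output preference pairs" could look like in code. The output term is the standard DPO pairwise loss (same instruction, preferred versus rejected response). The input term, which reuses the same form over two instructions for a fixed response, is an assumption made here for illustration and is not the paper's formulation. All function names and the `beta` value are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """DPO-style implicit reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def pairwise_preference_loss(preferred, dispreferred, beta: float = 0.1) -> float:
    """-log sigmoid(reward(preferred) - reward(dispreferred)).

    Each argument is a (policy_logprob, reference_logprob) tuple for one
    (instruction, response) pair.
    """
    margin = implicit_reward(*preferred, beta) - implicit_reward(*dispreferred, beta)
    return -log_sigmoid(margin)


def input_output_loss(x1_y1, x1_y2, x2_y1, beta: float = 0.1) -> float:
    """Illustrative combination (an assumption, not the paper's objective).

    x1_y1: response y1 under the instruction x1 it satisfies
    x1_y2: a worse response y2 under the same instruction x1  (output pair)
    x2_y1: the same response y1 under a perturbed instruction x2 (input pair)
    """
    output_term = pairwise_preference_loss(x1_y1, x1_y2, beta)
    input_term = pairwise_preference_loss(x1_y1, x2_y1, beta)
    return output_term + input_term
```

With a policy identical to the reference model, every margin is zero and each term equals log 2; the loss drops only when the policy raises the likelihood of the matched instruction-response pair relative to the mismatched ones.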

