LongReward: Improving Long-context Large Language Models with AI Feedback

Abstract

Although significant progress has been made in developing long-context large language models (LLMs), the compromised quality of LLM-synthesized data for supervised fine-tuning (SFT) often degrades the long-context performance of SFT models and leads to inherent limitations. In principle, reinforcement learning (RL) with appropriate reward signals can further enhance models' capacities. However, how to obtain reliable rewards in long-context scenarios remains unexplored. To this end, we propose LongReward, a novel method that uses an off-the-shelf LLM to provide rewards for long-context model responses along four human-valued dimensions: helpfulness, logicality, faithfulness, and completeness, each with a carefully designed assessment pipeline. By combining LongReward with the offline RL algorithm DPO, we are able to effectively improve long-context SFT models. Our experiments indicate that LongReward not only significantly improves models' long-context performance but also enhances their ability to follow short instructions. We also find that long-context DPO with LongReward and conventional short-context DPO can be used together without hurting either one's performance.
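As a rough illustration of how per-dimension rewards could feed an offline preference method such as DPO, the following is a minimal sketch, not the authors' implementation. The abstract does not say how the four scores are aggregated or how preference pairs are selected; the unweighted average, the highest-versus-lowest pair selection, the 0 to 10 score range, and all names (`ScoredResponse`, `overall_reward`, `build_preference_pair`) are assumptions made here for illustration. In practice the scores would come from the off-the-shelf LLM judge; here they are supplied by hand.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional

# The four dimensions named in the abstract.
DIMENSIONS = ("helpfulness", "logicality", "faithfulness", "completeness")


@dataclass
class ScoredResponse:
    """A candidate response with one score per dimension (hypothetical type)."""
    text: str
    scores: Dict[str, float]  # assumed range: 0 to 10


def overall_reward(scores: Dict[str, float]) -> float:
    """Aggregate the four scores. ASSUMPTION: a plain unweighted average."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    return mean(scores[d] for d in DIMENSIONS)


def build_preference_pair(
    prompt: str, candidates: List[ScoredResponse]
) -> Optional[Dict[str, str]]:
    """Build one (prompt, chosen, rejected) record for DPO-style training.

    ASSUMPTION: the highest-reward candidate is 'chosen' and the
    lowest-reward candidate is 'rejected'. Returns None when the two
    rewards tie, since a tie carries no preference signal.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidate responses")
    ranked = sorted(candidates, key=lambda c: overall_reward(c.scores))
    rejected, chosen = ranked[0], ranked[-1]
    if overall_reward(chosen.scores) == overall_reward(rejected.scores):
        return None
    return {"prompt": prompt, "chosen": chosen.text, "rejected": rejected.text}


if __name__ == "__main__":
    # Hand-written scores standing in for the LLM judge's output.
    candidates = [
        ScoredResponse("A", {"helpfulness": 4, "logicality": 6,
                             "faithfulness": 5, "completeness": 3}),
        ScoredResponse("B", {"helpfulness": 8, "logicality": 9,
                             "faithfulness": 7, "completeness": 8}),
    ]
    print(build_preference_pair("Summarize the report.", candidates))
```

The resulting record uses the prompt/chosen/rejected layout that is common for DPO preference datasets, so pairs built this way could be passed to a standard DPO training loop.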
