SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models

Abstract

Instruction-following is a fundamental capability of language models: the model must recognize even the most subtle requirements in an instruction and reflect them accurately in its output. This ability is well suited to, and often optimized by, preference learning. However, existing methods often build preference pairs by directly sampling multiple independent responses from the model. This practice can introduce content variations that are irrelevant to whether the instruction is precisely followed (e.g., different phrasings of the same meaning), which interferes with teaching the model to recognize the key differences that lead to improved instruction following. In light of this, we introduce SPaR, a self-play framework that integrates tree-search self-refinement to yield valid and comparable preference pairs free from distractions. By playing against itself, an LLM employs a tree-search strategy to refine its previous responses with respect to the instruction while minimizing unnecessary variations. Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities. Furthermore, SPaR demonstrates promising scalability and transferability, greatly enhancing models such as GLM-4-9B and LLaMA3-70B. We also identify how inference scaling in tree search impacts model performance. Our code and data are publicly available at https://github.com/thu-coai/SPaR.
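The idea of refining a failed response through tree search, rather than sampling an unrelated second response, can be illustrated with a minimal sketch. This is not the released SPaR implementation: the function `tree_search_refine`, the breadth-first strategy, the node budget `max_nodes`, and the rule-based `toy_judge` and `toy_refine` stand-ins are all assumptions made for illustration. In an actual self-play setup, both the judging and the refining would presumably be done by the LLM itself.

```python
from collections import deque
from typing import Callable, List, Optional, Tuple


def tree_search_refine(
    instruction: str,
    response: str,
    refine: Callable[[str, str], List[str]],
    judge: Callable[[str, str], bool],
    max_nodes: int = 20,
) -> Optional[Tuple[str, str]]:
    """Breadth-first search over refinements of a failed response.

    Returns (chosen, rejected), where `rejected` is the original response
    and `chosen` is the first refinement the judge accepts. Returns None
    if the original already passes or the node budget is exhausted.
    """
    if judge(instruction, response):
        return None  # nothing to fix, so no preference pair
    queue = deque([response])
    seen = {response}
    expanded = 0
    while queue and expanded < max_nodes:
        node = queue.popleft()
        expanded += 1
        for child in refine(instruction, node):
            if child in seen:
                continue  # skip candidates that were already explored
            if judge(instruction, child):
                return child, response
            seen.add(child)
            queue.append(child)
    return None


# Hypothetical stand-ins for the model-based judge and refiner.
def toy_judge(instruction: str, response: str) -> bool:
    # Toy constraint: all lowercase and at most three words.
    return response == response.lower() and len(response.split()) <= 3


def toy_refine(instruction: str, response: str) -> List[str]:
    # Each candidate applies one small edit, so the eventual pair differs
    # only in what matters for the constraint.
    words = response.split()
    candidates = [response.lower()]
    if len(words) > 1:
        candidates.append(" ".join(words[:-1]))
    return candidates


if __name__ == "__main__":
    pair = tree_search_refine(
        "Answer in lowercase using at most three words.",
        "The Sky Is Blue Today",
        toy_refine,
        toy_judge,
    )
    print(pair)
```

Because every child is a small edit of its parent, the accepted response stays close to the rejected one. The resulting pair therefore isolates the difference that decides whether the instruction is followed, which is the property the abstract attributes to refinement-based pairs.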
