SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
December 16, 2024
Authors: Jiale Cheng, Xiao Liu, Cunxiang Wang, Xiaotao Gu, Yida Lu, Dan Zhang, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang
cs.AI
Abstract
Instruction-following is a fundamental capability of language models,
requiring the model to recognize even the most subtle requirements in the
instructions and accurately reflect them in its output. Such an ability is
well-suited for and often optimized by preference learning. However, existing
methods often directly sample multiple independent responses from the model
when creating preference pairs. Such practice can introduce content variations
irrelevant to whether the instruction is precisely followed (e.g., different
phrasings of the same semantics), interfering with the goal of teaching
models to recognize the key differences that lead to improved instruction
following. In light of this, we introduce SPaR, a self-play framework
integrating tree-search self-refinement to yield valid and comparable
preference pairs free from distractions. By playing against itself, an LLM
employs a tree-search strategy to refine its previous responses with respect to
the instruction while minimizing unnecessary variations. Our experiments show
that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses
GPT-4-Turbo on the IFEval benchmark without losing general capabilities.
Furthermore, SPaR demonstrates promising scalability and transferability,
greatly enhancing models like GLM-4-9B and LLaMA3-70B. We also analyze how
inference-time scaling in tree search impacts model performance. Our code and
data are publicly available at https://github.com/thu-coai/SPaR.
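The tree-search self-refinement described in the abstract can be illustrated as a best-first search over candidate edits to a response, scored by a judge for instruction compliance. The sketch below is a toy approximation only: `propose_refinements` and `judge_score` are hypothetical stand-ins for the paper's refiner and judge models, not the authors' actual implementation.

```python
import heapq

def propose_refinements(response):
    # Hypothetical refiner: propose small edits to the response.
    # Here we simply append any instruction keyword still missing.
    for kw in ("exactly", "three", "bullet"):
        if kw not in response:
            yield response + " " + kw

def judge_score(instruction, response):
    # Hypothetical judge: fraction of instruction keywords the
    # response satisfies (a real judge would be an LLM critique).
    kws = instruction.split()
    return sum(kw in response for kw in kws) / len(kws)

def tree_search_refine(instruction, response, max_nodes=20):
    """Best-first search over refinements of an initial response.

    Because each node is a small edit of its parent, the best leaf and
    the original response differ mainly in instruction compliance,
    which is what makes them a clean preference pair.
    """
    best_score = judge_score(instruction, response)
    best = response
    frontier = [(-best_score, response)]  # max-heap via negated scores
    visited = {response}
    expanded = 0
    while frontier and expanded < max_nodes:
        _, current = heapq.heappop(frontier)
        expanded += 1
        for cand in propose_refinements(current):
            if cand in visited:
                continue
            visited.add(cand)
            s = judge_score(instruction, cand)
            if s > best_score:
                best_score, best = s, cand
            heapq.heappush(frontier, (-s, cand))
    return best
```

With the toy judge, `tree_search_refine("exactly three bullet", "a list")` walks the edit tree until the refined response satisfies every keyword, while all intermediate variants stay close to the original wording.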