SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
December 16, 2024
Authors: Jiale Cheng, Xiao Liu, Cunxiang Wang, Xiaotao Gu, Yida Lu, Dan Zhang, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang
cs.AI
Abstract
Instruction-following is a fundamental capability of language models,
requiring the model to recognize even the most subtle requirements in the
instructions and accurately reflect them in its output. Such an ability is
well-suited for and often optimized by preference learning. However, existing
methods often directly sample multiple independent responses from the model
when creating preference pairs. Such practice can introduce content variations
irrelevant to whether the instruction is precisely followed (e.g., different
phrasings of the same meaning), interfering with the goal of teaching
models to recognize the key differences that lead to improved instruction
following. In light of this, we introduce SPaR, a self-play framework
integrating tree-search self-refinement to yield valid and comparable
preference pairs free from distractions. By playing against itself, an LLM
employs a tree-search strategy to refine its previous responses with respect to
the instruction while minimizing unnecessary variations. Our experiments show
that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses
GPT-4-Turbo on the IFEval benchmark without losing general capabilities.
Furthermore, SPaR demonstrates promising scalability and transferability,
greatly enhancing models like GLM-4-9B and LLaMA3-70B. We also identify how
inference scaling in tree search impacts model performance. Our code and
data are publicly available at https://github.com/thu-coai/SPaR.
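
To make the mechanism concrete, below is a minimal, hypothetical sketch of how a tree-search refinement loop could turn a failing response into a (rejected, chosen) preference pair. This is not the authors' implementation (see https://github.com/thu-coai/SPaR for the official code); `judge` and `refine` are assumed stand-ins for the self-play judge (does the response follow the instruction, and why not?) and the actor's refinement call, and the best-first expansion strategy is one plausible instantiation of the paper's tree search.

```python
import heapq
import itertools
from typing import Callable, Optional

def tree_search_refine(
    instruction: str,
    response: str,
    judge: Callable[[str, str], tuple[bool, str, float]],  # -> (follows?, critique, score)
    refine: Callable[[str, str, str], list[str]],          # -> candidate minimal edits
    branching: int = 3,
    max_expansions: int = 16,
) -> Optional[tuple[str, str]]:
    """Best-first search for a minimal edit of `response` that satisfies
    `instruction`. Returns a (rejected, chosen) preference pair, or None
    if the response already passes or no fix is found within budget."""
    ok, critique, score = judge(instruction, response)
    if ok:
        return None  # already instruction-following; nothing to mine

    # Max-heap via negated score; a counter breaks ties so the heap
    # never has to compare the string payloads.
    counter = itertools.count()
    frontier = [(-score, next(counter), response, critique)]

    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, current, current_critique = heapq.heappop(frontier)
        # Expand: ask the actor for a few targeted edits addressing the critique,
        # keeping changes minimal so the pair differs only in the key requirement.
        for child in refine(instruction, current, current_critique)[:branching]:
            child_ok, child_critique, child_score = judge(instruction, child)
            if child_ok:
                # Pair the original failure with its refined fix: content is
                # near-identical, so the contrast isolates the instruction gap.
                return response, child
            heapq.heappush(
                frontier, (-child_score, next(counter), child, child_critique)
            )
    return None
```

Because the chosen response is an edit of the rejected one rather than an independent sample, the resulting pair differs (ideally) only in whether the instruction's requirement is met, which is the paper's stated motivation for refinement over resampling.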