Thinking Preference Optimization
February 17, 2025
Authors: Wang Yang, Hongye Jin, Jingfeng Yang, Vipin Chaudhary, Xiaotian Han
cs.AI
Abstract
Supervised Fine-Tuning (SFT) has been a go-to and effective method for
enhancing long chain-of-thought (CoT) reasoning in relatively small LLMs by
fine-tuning them with long CoT responses from larger LLMs. To continually
improve reasoning abilities, we can either collect new high-quality long CoT
reasoning SFT data or repeatedly train on existing SFT datasets. However,
acquiring new long CoT SFT data is costly and limited, while repeated training
often results in a performance plateau or decline. To further boost the
performance with the SFT data, we propose Thinking Preference Optimization
(ThinkPO), a simple yet effective post-SFT method that enhances long CoT
reasoning without requiring new long CoT responses. Instead, ThinkPO utilizes
readily available or easily obtainable short CoT reasoning responses as
rejected answers and long CoT responses as chosen answers for the same
question. It then applies direct preference optimization to encourage the model
to favor longer reasoning outputs. Experiments show that ThinkPO further
improves the reasoning performance of SFT-ed models, e.g., it increases the
math reasoning accuracy of SFT-ed models by 8.6% and output length by 25.9%.
Notably, ThinkPO is capable of continually boosting the performance of the
publicly distilled SFT model, e.g., increasing the official
DeepSeek-R1-Distill-Qwen-7B's performance on MATH500 from 87.4% to 91.2%.
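The recipe described in the abstract — pair each question's long CoT response (chosen) with a short CoT response (rejected) and apply direct preference optimization — can be illustrated with a minimal sketch. The sketch below assumes the standard DPO objective; the helper names (build_preference_pairs, dpo_loss), the dataset fields, and the toy values are illustrative and not taken from the paper.

```python
# Minimal sketch of the ThinkPO idea: long-CoT responses are "chosen",
# short-CoT responses are "rejected", and the standard DPO loss pushes the
# SFT-ed policy toward longer reasoning. Helper names and values are
# illustrative assumptions, not the paper's implementation.

import torch
import torch.nn.functional as F


def build_preference_pairs(questions, long_cot, short_cot):
    """Turn (question, long CoT, short CoT) triples into DPO-style records."""
    return [
        {"prompt": q, "chosen": long_r, "rejected": short_r}
        for q, long_r, short_r in zip(questions, long_cot, short_cot)
    ]


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Each argument is a tensor of summed token log-probabilities of the chosen
    or rejected response under the policy (SFT-ed) model or the frozen
    reference model, shape (batch,).
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()


# Toy usage with made-up log-probs; in practice these come from forward
# passes of the SFT-ed policy and its frozen copy over the paired responses.
pairs = build_preference_pairs(
    ["What is 12 * 7?"],
    ["Let's reason step by step... 12 * 7 = 84. The answer is 84."],
    ["84."],
)
loss = dpo_loss(torch.tensor([-40.0]), torch.tensor([-12.0]),
                torch.tensor([-42.0]), torch.tensor([-11.0]))
print(len(pairs), loss.item())
```

Because the rejected side is a readily available short CoT answer rather than newly collected data, this post-SFT step needs no additional long CoT responses, which is the key practical point of the method.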