Thinking Preference Optimization
February 17, 2025
Authors: Wang Yang, Hongye Jin, Jingfeng Yang, Vipin Chaudhary, Xiaotian Han
cs.AI
Abstract
Supervised Fine-Tuning (SFT) has been a go-to and effective method for
enhancing long chain-of-thought (CoT) reasoning in relatively small LLMs by
fine-tuning them with long CoT responses from larger LLMs. To continually
improve reasoning abilities, we can either collect new high-quality long CoT
reasoning SFT data or repeatedly train on existing SFT datasets. However,
acquiring new long CoT SFT data is costly and limited, while repeated training
often results in a performance plateau or decline. To further boost the
performance with the SFT data, we propose Thinking Preference Optimization
(ThinkPO), a simple yet effective post-SFT method that enhances long CoT
reasoning without requiring new long CoT responses. Instead, ThinkPO utilizes
readily available or easily obtainable short CoT reasoning responses as
rejected answers and long CoT responses as chosen answers for the same
question. It then applies direct preference optimization to encourage the model
to favor longer reasoning outputs. Experiments show that ThinkPO further
improves the reasoning performance of SFT-ed models, e.g., it increases the
math reasoning accuracy of SFT-ed models by 8.6% and output length by 25.9%.
Notably, ThinkPO is capable of continually boosting the performance of the
publicly distilled SFT model, e.g., increasing the official
DeepSeek-R1-Distill-Qwen-7B's performance on MATH500 from 87.4% to 91.2%.
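The recipe described in the abstract — pair each question's long CoT response (chosen) with a short CoT response (rejected) and apply direct preference optimization — can be illustrated with a minimal sketch. The sketch below assumes the standard DPO objective; the helper names (build_preference_pairs, dpo_loss), the dataset fields, and the toy values are illustrative and not taken from the paper.

```python
# Minimal sketch of the ThinkPO idea: long-CoT responses are "chosen",
# short-CoT responses are "rejected", and the standard DPO loss pushes the
# SFT-ed policy toward longer reasoning. Helper names and values are
# illustrative assumptions, not the paper's implementation.

import torch
import torch.nn.functional as F


def build_preference_pairs(questions, long_cot, short_cot):
    """Turn (question, long CoT, short CoT) triples into DPO-style records."""
    return [
        {"prompt": q, "chosen": long_r, "rejected": short_r}
        for q, long_r, short_r in zip(questions, long_cot, short_cot)
    ]


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Each argument is a tensor of summed token log-probabilities of the chosen
    or rejected response under the policy (SFT-ed) model or the frozen
    reference model, shape (batch,).
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()


# Toy usage with made-up log-probs; in practice these come from forward
# passes of the SFT-ed policy and its frozen copy over the paired responses.
pairs = build_preference_pairs(
    ["What is 12 * 7?"],
    ["Let's reason step by step... 12 * 7 = 84. The answer is 84."],
    ["84."],
)
loss = dpo_loss(torch.tensor([-40.0]), torch.tensor([-12.0]),
                torch.tensor([-42.0]), torch.tensor([-11.0]))
print(len(pairs), loss.item())
```

Because the rejected side is a readily available short CoT answer rather than newly collected data, this post-SFT step needs no additional long CoT responses, which is the key practical point of the method.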