

LongDPO: Unlock Better Long-form Generation Abilities for LLMs via Critique-augmented Stepwise Information

February 4, 2025
作者: Bowen Ping, Jiali Zeng, Fandong Meng, Shuo Wang, Jie Zhou, Shanghang Zhang
cs.AI

Abstract

Long-form generation is crucial for academic paper writing and repository-level code generation. Despite this, current models, including GPT-4o, still exhibit unsatisfactory performance. Existing methods that apply preference learning with outcome supervision often fail to provide detailed feedback for extended contexts. This shortcoming can lead to content that does not fully satisfy query requirements, resulting in issues such as length deviations and diminished quality. In this paper, we propose enhancing long-form generation by incorporating process supervision. We employ Monte Carlo Tree Search to gather stepwise preference pairs, using a global memory pool to maintain consistency. To address the issue of suboptimal candidate selection, we integrate external critiques to refine and improve the quality of the preference pairs. Finally, we apply step-level DPO using the collected stepwise preference pairs. Experimental results show that our method improves length and quality on long-form generation benchmarks, with almost lossless performance on general benchmarks across various model backbones.
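As a rough illustration of the final training stage described in the abstract, the sketch below shows what a step-level DPO objective could look like once the stepwise preference pairs have been reduced to per-step log-probabilities. The function name, the `beta` value, and the tensor interface are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of a step-level DPO loss, assuming each preference pair
# consists of a chosen and a rejected step continuation sharing the same
# prefix (e.g. two candidates expanded from one MCTS node).
import torch
import torch.nn.functional as F


def step_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(chosen step | prefix), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log p_theta(rejected step | prefix)
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # illustrative temperature, not from the paper
) -> torch.Tensor:
    """DPO loss applied to one batch of step-level preference pairs.

    Structurally identical to vanilla DPO; the difference is that the
    (chosen, rejected) sequences are single-step continuations rather
    than full responses.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Toy usage with random log-probabilities.
if __name__ == "__main__":
    b = 4
    loss = step_dpo_loss(
        policy_chosen_logps=torch.randn(b),
        policy_rejected_logps=torch.randn(b),
        ref_chosen_logps=torch.randn(b),
        ref_rejected_logps=torch.randn(b),
    )
    print(loss.item())
```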

