

Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning

October 29, 2024
Authors: Yihe Deng, Paul Mineiro
cs.AI

Abstract

Mathematical reasoning is a crucial capability for Large Language Models (LLMs), yet generating detailed and accurate reasoning traces remains a significant challenge. This paper introduces a novel approach to produce high-quality reasoning traces for LLM fine-tuning using online learning Flows. Our method employs an incremental output production Flow, where component LLMs collaboratively construct solutions through iterative communication. We train the Flow using online Direct Preference Optimization (DPO) learning with rollouts, generating DPO pairs for each training example and updating models in real time. We directly compare the quality of reasoning traces generated by our method with those produced through direct model inference, demonstrating the effectiveness of our approach in improving LLM performance on mathematical reasoning tasks.
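As a rough illustration of the online-DPO-with-rollouts recipe the abstract describes, here is a minimal sketch, assuming each training problem has a verifiable final answer. It is one reading of the abstract, not the paper's implementation: `generate` and `verify` are hypothetical stand-ins for the Flow's answer-generating LLMs and an answer checker, while `dpo_loss` is the standard DPO objective.

```python
# Illustrative sketch only: `generate` and `verify` are hypothetical
# stand-ins, not APIs from the Flow-DPO paper.
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective:
    -log sigmoid(beta * [(log pi(y_w) - log pi(y_l))
                         - (log ref(y_w) - log ref(y_l))]).
    Each argument is the summed log-probability of a full response under
    the trainable policy (pi_*) or the frozen reference model (ref_*).
    """
    margin = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

def collect_dpo_pair(generate, verify, prompt, n_rollouts=8):
    """Roll out several candidate completions for one training example and
    pair a correct completion ("chosen") with an incorrect one ("rejected")."""
    completions = [generate(prompt) for _ in range(n_rollouts)]
    correct = [c for c in completions if verify(prompt, c)]
    incorrect = [c for c in completions if not verify(prompt, c)]
    if correct and incorrect:
        return correct[0], incorrect[0]  # (chosen, rejected)
    return None  # all rollouts agree: no informative preference pair
```

In the online setting the abstract describes, each pair produced this way would trigger an immediate gradient step on `dpo_loss` before the Flow moves to the next training example, rather than being accumulated into an offline preference dataset.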

