Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning
October 29, 2024
Authors: Yihe Deng, Paul Mineiro
cs.AI
Abstract
Mathematical reasoning is a crucial capability for Large Language Models
(LLMs), yet generating detailed and accurate reasoning traces remains a
significant challenge. This paper introduces a novel approach to produce
high-quality reasoning traces for LLM fine-tuning using online learning
Flows. Our method employs an incremental output production Flow, where
component LLMs collaboratively construct solutions through iterative
communication. We train the Flow using online Direct Preference Optimization
(DPO) learning with rollouts, generating DPO pairs for each training example
and updating models in real-time. We directly compare the quality of reasoning
traces generated by our method with those produced through direct model
inference, demonstrating the effectiveness of our approach in improving LLM
performance in mathematical reasoning tasks.
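To make the training loop described in the abstract concrete, the sketch below shows one plausible way DPO pairs could be formed from rollouts in an incremental Flow: two candidate next chunks are sampled, each is rolled out to a full solution several times, and the chunk whose rollouts reach the correct answer more often becomes the preferred response. This is a minimal illustration under stated assumptions, not the authors' implementation; the LLM interface (`next_chunk`, `complete`, `is_finished`), the answer checker, and the chunk-level pairing scheme are all assumed.

```python
# Illustrative sketch only: chunk-level DPO pair collection via rollouts,
# loosely following the Flow-DPO description above. The LLM interface
# (next_chunk, complete, is_finished) and the answer checker are assumptions.

from dataclasses import dataclass

@dataclass
class DPOPair:
    prompt: str    # problem text plus the partial reasoning trace so far
    chosen: str    # candidate chunk whose rollouts succeed more often
    rejected: str  # candidate chunk whose rollouts succeed less often

def is_correct(completion: str, answer: str) -> bool:
    # Stand-in answer check; a real pipeline would parse and compare final answers.
    return answer.strip() in completion

def rollout_success_rate(llm, prompt: str, chunk: str, answer: str, n: int = 4) -> float:
    """Complete the solution n times after appending `chunk`; return accuracy."""
    wins = sum(int(is_correct(llm.complete(prompt + chunk), answer)) for _ in range(n))
    return wins / n

def collect_pairs(llm, problem: str, answer: str, max_steps: int = 8) -> list[DPOPair]:
    """Incrementally build one reasoning trace, emitting a DPO pair whenever
    two sampled chunks lead to different rollout success rates."""
    pairs, partial = [], ""
    for _ in range(max_steps):
        prompt = problem + partial
        a, b = llm.next_chunk(prompt), llm.next_chunk(prompt)  # two candidate chunks
        sa, sb = (rollout_success_rate(llm, prompt, c, answer) for c in (a, b))
        if sa != sb:
            chosen, rejected = (a, b) if sa > sb else (b, a)
            pairs.append(DPOPair(prompt, chosen, rejected))
        partial += a if sa >= sb else b  # continue the trace with the better chunk
        if llm.is_finished(partial):
            break
    return pairs
```

In such a scheme, each collected pair would drive an online DPO update of the chunk-generating model before the next training example is processed, matching the real-time model updating described in the abstract.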