Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning

Abstract

Mathematical reasoning is a crucial capability for Large Language Models (LLMs), yet generating detailed and accurate reasoning traces remains a significant challenge. This paper introduces a novel approach that uses online learning Flows to produce high-quality reasoning traces for LLM fine-tuning. Our method employs an incremental output production Flow, in which component LLMs collaboratively construct solutions through iterative communication. We train the Flow using online Direct Preference Optimization (DPO) learning with rollouts, generating DPO pairs for each training example and updating the models in real time. We directly compare the quality of reasoning traces generated by our method with those produced through direct model inference, demonstrating the effectiveness of our approach in improving LLM performance on mathematical reasoning tasks.
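
For readers unfamiliar with DPO, the following is a minimal sketch of the standard DPO loss for a single preference pair, not the training code used in this work. It assumes summed token log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses are already available from the policy and from a frozen reference model. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All inputs are sequence-level log-probabilities (floats).
    beta is a hypothetical default; the abstract does not state a value.
    """
    # Log-ratio of policy to reference for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In an online setting such as the one described above, a loss of this form would be evaluated on each newly generated pair and followed by a model update. How the pairs are built from rollouts is specific to the method and is not reproduced here.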
