
MALT: Improving Reasoning with Multi-Agent LLM Training

December 2, 2024
Authors: Sumeet Ramesh Motwani, Chandler Smith, Rocktim Jyoti Das, Markian Rybchuk, Philip H. S. Torr, Ivan Laptev, Fabio Pizzati, Ronald Clark, Christian Schroeder de Witt
cs.AI

Abstract

Enabling effective collaboration among LLMs is a crucial step toward developing autonomous systems capable of solving complex problems. While LLMs are typically used as single-model generators, where humans critique and refine their outputs, the potential for jointly trained collaborative models remains largely unexplored. Despite promising results in multi-agent communication and debate settings, little progress has been made in training models to work together on tasks. In this paper, we present a first step toward "Multi-agent LLM training" (MALT) on reasoning problems. Our approach employs a sequential multi-agent setup with heterogeneous LLMs assigned specialized roles: a generator, a verifier, and a refinement model that iteratively solve problems. We propose a trajectory-expansion-based synthetic data generation process and a credit assignment strategy driven by joint outcome-based rewards. This enables our post-training setup to utilize both positive and negative trajectories to autonomously improve each model's specialized capabilities as part of a joint sequential system. We evaluate our approach on MATH, GSM8k, and CQA, where MALT on Llama 3.1 8B models achieves relative improvements of 14.14%, 7.12%, and 9.40%, respectively, over the same baseline model. This demonstrates an early advance in multi-agent cooperative capabilities for performance on mathematical and commonsense reasoning questions. More generally, our work provides a concrete direction for research on multi-agent LLM training approaches.
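The sequential generator → verifier → refiner setup described in the abstract can be sketched at a high level as follows. This is a minimal illustrative sketch only: the function names, prompts, and the single-pass control flow are assumptions for exposition, not the authors' actual implementation (which trains each role model and iterates).

```python
# Illustrative sketch of a sequential generator -> verifier -> refiner
# pipeline in the spirit of MALT. All names and prompt formats here are
# hypothetical; each role would in practice be a separately trained LLM.

def generator(question: str) -> str:
    # Role 1: draft an initial answer to the question.
    return f"draft answer to: {question}"

def verifier(question: str, draft: str) -> str:
    # Role 2: critique the draft, flagging potential errors.
    return f"critique of [{draft}]"

def refiner(question: str, draft: str, critique: str) -> str:
    # Role 3: produce a final answer conditioned on the draft
    # and the verifier's critique.
    return f"refined({draft} | {critique})"

def malt_pipeline(question: str) -> str:
    draft = generator(question)
    critique = verifier(question, draft)
    return refiner(question, draft, critique)

print(malt_pipeline("What is 2 + 2?"))
```

In the paper's post-training setup, trajectories through this pipeline are expanded into synthetic data, and joint outcome-based rewards assign credit back to each role model; the sketch above only shows the inference-time sequence of roles.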

