MALT: Improving Reasoning with Multi-Agent LLM Training
December 2, 2024
Authors: Sumeet Ramesh Motwani, Chandler Smith, Rocktim Jyoti Das, Markian Rybchuk, Philip H. S. Torr, Ivan Laptev, Fabio Pizzati, Ronald Clark, Christian Schroeder de Witt
cs.AI
Abstract
Enabling effective collaboration among LLMs is a crucial step toward
developing autonomous systems capable of solving complex problems. While LLMs
are typically used as single-model generators, where humans critique and refine
their outputs, the potential for jointly-trained collaborative models remains
largely unexplored. Despite promising results in multi-agent communication and
debate settings, little progress has been made in training models to work
together on tasks. In this paper, we present a first step toward "Multi-agent
LLM training" (MALT) on reasoning problems. Our approach employs a sequential
multi-agent setup with heterogeneous LLMs assigned specialized roles: a
generator, verifier, and refinement model iteratively solving problems. We
propose a trajectory-expansion-based synthetic data generation process and a
credit assignment strategy driven by joint outcome-based rewards. This enables
our post-training setup to utilize both positive and negative trajectories to
autonomously improve each model's specialized capabilities as part of a joint
sequential system. We evaluate our approach across MATH, GSM8k, and CQA, where
MALT on Llama 3.1 8B models achieves relative improvements of 14.14%, 7.12%,
and 9.40% respectively over the same baseline model. This demonstrates an early
advance in multi-agent cooperative capabilities for performance on mathematical
and common sense reasoning questions. More generally, our work provides a
concrete direction for research around multi-agent LLM training approaches.
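To make the described pipeline concrete, below is a minimal Python sketch of the sequential generator, verifier, and refinement roles, together with outcome-based labeling of sampled trajectories into positive and negative sets. All function names, the Trajectory container, and the stub model calls are hypothetical illustrations under assumptions inferred from the abstract; they are not the authors' implementation, and the actual method additionally post-trains each role-specialized model on the resulting data.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    question: str
    draft: str
    critique: str
    final: str
    correct: bool


def call_generator(question: str) -> str:
    # Placeholder: the generator LLM drafts an initial solution.
    return f"draft solution for: {question}"


def call_verifier(question: str, draft: str) -> str:
    # Placeholder: the verifier LLM critiques the draft.
    return f"critique of: {draft}"


def call_refiner(question: str, draft: str, critique: str) -> str:
    # Placeholder: the refinement LLM revises the draft using the critique.
    return f"refined solution addressing: {critique}"


def is_correct(final: str, gold: str) -> bool:
    # Placeholder outcome check (e.g., exact match on the extracted answer).
    return gold in final


def run_pipeline(question: str, gold: str) -> Trajectory:
    # One pass through the sequential generator -> verifier -> refiner chain.
    draft = call_generator(question)
    critique = call_verifier(question, draft)
    final = call_refiner(question, draft, critique)
    return Trajectory(question, draft, critique, final, is_correct(final, gold))


def collect_trajectories(dataset, samples_per_question=4):
    # Expand each question into several sampled trajectories and split them
    # by the joint outcome, yielding positive/negative data that could be
    # used to post-train each role-specialized model.
    positive, negative = [], []
    for question, gold in dataset:
        for _ in range(samples_per_question):
            traj = run_pipeline(question, gold)
            (positive if traj.correct else negative).append(traj)
    return positive, negative


if __name__ == "__main__":
    demo = [("What is 2 + 2?", "4")]
    pos, neg = collect_trajectories(demo)
    print(f"{len(pos)} positive / {len(neg)} negative trajectories")
```

In this sketch the stub models are deterministic, so repeated samples are identical; with real LLMs, sampling temperature would produce diverse trajectories, which is what makes the outcome-based split informative for credit assignment.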