MALT: Improving Reasoning with Multi-Agent LLM Training

Abstract

Enabling effective collaboration among LLMs is a crucial step toward developing autonomous systems capable of solving complex problems. LLMs are typically used as single-model generators whose outputs humans critique and refine, and the potential of jointly trained collaborative models remains largely unexplored. Despite promising results in multi-agent communication and debate settings, little progress has been made in training models to work together on tasks. In this paper, we present a first step toward "Multi-agent LLM training" (MALT) on reasoning problems. Our approach employs a sequential multi-agent setup with heterogeneous LLMs assigned specialized roles: a generator, a verifier, and a refinement model that iteratively solve problems. We propose a trajectory-expansion-based synthetic data generation process and a credit assignment strategy driven by joint-outcome-based rewards. This enables our post-training setup to use both positive and negative trajectories to autonomously improve each model's specialized capabilities as part of a joint sequential system. We evaluate our approach on MATH, GSM8k, and CQA, where MALT on Llama 3.1 8B models achieves relative improvements of 14.14%, 7.12%, and 9.40%, respectively, over the same baseline model. This demonstrates an early advance in multi-agent cooperative capabilities on mathematical and common sense reasoning questions. More generally, our work provides a concrete direction for research on multi-agent LLM training approaches.
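The abstract does not spell out how trajectory expansion and outcome-based credit assignment work, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each role samples `n` outputs per input (giving `n**3` generator, verifier, refiner paths per question), that each final answer receives a binary reward, and that an intermediate output is labeled positive when the mean reward of the paths passing through it exceeds a threshold. The names `expand`, `assign_credit`, the `toy_*` stand-ins for the three models, and the threshold value 0.5 are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def expand(question, generate, verify, refine, n):
    """Sample n outputs at each stage and return every (g, v, r) path."""
    return [
        (g, v, r)
        for g in generate(question, n)
        for v in verify(question, g, n)
        for r in refine(question, g, v, n)
    ]


def assign_credit(paths, reward, threshold=0.5):
    """Label each role's outputs from the rewards of the final answers.

    Assumption: an intermediate output is positive when the mean reward
    of all paths that pass through it is strictly above `threshold`.
    """
    gen_scores, ver_scores = defaultdict(list), defaultdict(list)
    labels = {"generator": {}, "verifier": {}, "refiner": {}}
    for g, v, r in paths:
        score = reward(r)
        gen_scores[g].append(score)
        ver_scores[(g, v)].append(score)
        labels["refiner"][(g, v, r)] = score > threshold
    for g, scores in gen_scores.items():
        labels["generator"][g] = mean(scores) > threshold
    for key, scores in ver_scores.items():
        labels["verifier"][key] = mean(scores) > threshold
    return labels


# Hypothetical deterministic stand-ins for the three LLM roles.
def toy_generate(question, n):
    return [f"g{i}" for i in range(n)]


def toy_verify(question, g, n):
    return [f"{g}/v{i}" for i in range(n)]


def toy_refine(question, g, v, n):
    return [f"{v}/r{i}" for i in range(n)]


def toy_reward(answer):
    # Pretend every answer built on g0 is correct, plus one path under g1.
    return 1.0 if answer.startswith("g0") or answer == "g1/v1/r1" else 0.0


if __name__ == "__main__":
    paths = expand("toy question", toy_generate, toy_verify, toy_refine, n=2)
    for role, role_labels in assign_credit(paths, toy_reward).items():
        print(role, role_labels)
```

Under these assumptions, the positive and negative labels for each role could then serve as role-specific post-training data, which is one way to read the abstract's statement that both positive and negative trajectories are used to improve each model.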
