Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning

Abstract

Communicating in natural language is a powerful tool in multi-agent settings: it enables independent agents to share information in partially observable settings and allows zero-shot coordination with humans. However, most prior work is limited because it either relies on training with large amounts of human demonstrations or lacks the ability to generate natural and useful communication strategies. In this work, we train language models to have productive discussions about their environment in natural language without any human demonstrations. We decompose the communication problem into listening and speaking. Our key idea is to leverage the agent's goal to predict useful information about the world as a dense reward signal that guides communication. Specifically, we improve a model's listening skills by training it to predict information about the environment based on discussions, and we simultaneously improve its speaking skills with multi-agent reinforcement learning by rewarding messages based on their influence on other agents. To investigate the role and necessity of communication in complex social settings, we study an embodied social deduction game based on Among Us, where the key question to answer is the identity of an adversarial imposter. We analyze emergent behaviors due to our technique, such as accusing suspects and providing evidence, and find that it enables strong discussions, doubling the win rates compared to standard RL. We release our code and models at https://socialdeductionllm.github.io/
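
The listening and speaking signals described in the abstract can be illustrated with a minimal Python sketch. This is one possible reading, not the authors' implementation: it assumes each listener holds a belief (a probability for each player being the imposter), that listening is scored by the negative log-likelihood of the true imposter, and that a crewmate's message is rewarded by the average increase in the other agents' probability on the true imposter. The names `Belief`, `listening_loss`, and `speaking_reward` are hypothetical and introduced here only for illustration.

```python
import math
from typing import Dict, List

# Hypothetical type: maps a player name to the probability that the player
# is the imposter, as estimated by one listening agent.
Belief = Dict[str, float]


def listening_loss(belief: Belief, imposter: str) -> float:
    """Negative log-likelihood of the true imposter under one listener's belief.

    Assumption: the "useful information about the world" is the imposter's identity.
    """
    return -math.log(max(belief[imposter], 1e-12))


def speaking_reward(before: List[Belief], after: List[Belief], imposter: str) -> float:
    """Average change in the other agents' probability on the true imposter.

    `before` and `after` hold each listener's belief just before and just after
    the speaker's message. Assumption: the speaker is a crewmate, so moving
    listeners toward the true imposter counts as positive influence.
    """
    if not before or len(before) != len(after):
        raise ValueError("before and after must be non-empty and the same length")
    deltas = [a[imposter] - b[imposter] for b, a in zip(before, after)]
    return sum(deltas) / len(deltas)


if __name__ == "__main__":
    before = [{"red": 0.25, "blue": 0.75}, {"red": 0.5, "blue": 0.5}]
    after = [{"red": 0.5, "blue": 0.5}, {"red": 0.75, "blue": 0.25}]
    print(speaking_reward(before, after, imposter="red"))
    print(listening_loss(after[1], imposter="red"))
```

Because both quantities are available after every message rather than only at the end of a game, a signal of this shape is dense in the sense the abstract describes. How the actual method defines beliefs and influence is not stated in the abstract.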
