SDPO: Segment-Level Direct Preference Optimization for Social Agents

Abstract

Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex goal-oriented social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across a variety of agent tasks. Existing DPO-based approaches for multi-turn interactions are divided into turn-level and session-level methods. Turn-level methods are overly fine-grained, focusing exclusively on individual turns, while session-level methods are too coarse-grained and often introduce training noise. To address these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which focuses on specific key segments within interactions to optimize multi-turn agent behavior while minimizing training noise. Evaluations on the SOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform both existing DPO-based methods and proprietary LLMs such as GPT-4o, underscoring SDPO's potential to advance the social intelligence of LLM-based agents. We release our code and data at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/SDPO.
