SDPO: Segment-Level Direct Preference Optimization for Social Agents
January 3, 2025
Authors: Aobo Kong, Wentao Ma, Shiwan Zhao, Yongbin Li, Yuchuan Wu, Ke Wang, Xiaoqian Liu, Qicheng Li, Yong Qin, Fei Huang
cs.AI
Abstract
Social agents powered by large language models (LLMs) can simulate human
social behaviors but fall short in handling complex goal-oriented social
dialogues. Direct Preference Optimization (DPO) has proven effective in
aligning LLM behavior with human preferences across a variety of agent tasks.
Existing DPO-based approaches for multi-turn interactions are divided into
turn-level and session-level methods. Turn-level methods are overly
fine-grained, focusing exclusively on individual turns, while session-level
methods are too coarse-grained, often introducing training noise. To address
these limitations, we propose Segment-Level Direct Preference Optimization
(SDPO), which focuses on specific key segments within interactions to optimize
multi-turn agent behavior while minimizing training noise. Evaluations on the
SOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform
both existing DPO-based methods and proprietary LLMs like GPT-4o, underscoring
SDPO's potential to advance the social intelligence of LLM-based agents. We
release our code and data at
https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/SDPO.
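The abstract describes SDPO only at a high level: the preference loss is computed over a key segment of the interaction rather than a single turn or the whole session. The following is a minimal sketch of that idea, not the authors' released implementation; the function name sdpo_style_loss, the per-token log-probability and segment-mask inputs, and the beta value are illustrative assumptions.

import torch.nn.functional as F

def sdpo_style_loss(policy_logps, ref_logps, segment_mask_w, segment_mask_l, beta=0.1):
    # policy_logps / ref_logps: dicts with per-token log-probabilities of shape
    # (batch, seq_len) for the chosen ("w") and rejected ("l") dialogue sessions.
    # segment_mask_w / segment_mask_l: float tensors of the same shape, 1.0 on
    # tokens belonging to the key segment and 0.0 elsewhere (assumed inputs).

    # Sum log-probabilities over the key segment only, rather than over a single
    # turn (turn-level) or the entire session (session-level).
    pi_w = (policy_logps["w"] * segment_mask_w).sum(-1)
    pi_l = (policy_logps["l"] * segment_mask_l).sum(-1)
    ref_w = (ref_logps["w"] * segment_mask_w).sum(-1)
    ref_l = (ref_logps["l"] * segment_mask_l).sum(-1)

    # Standard DPO preference logits, computed on the segment-restricted log-ratios.
    logits = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    return -F.logsigmoid(logits).mean()

Restricting the summed log-ratios to the key segment is what distinguishes this sketch from turn-level DPO (one turn) and session-level DPO (all agent turns); for the exact formulation and segment-selection procedure, see the code released at the URL above.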