SDPO: Segment-Level Direct Preference Optimization for Social Agents

January 3, 2025
Authors: Aobo Kong, Wentao Ma, Shiwan Zhao, Yongbin Li, Yuchuan Wu, Ke Wang, Xiaoqian Liu, Qicheng Li, Yong Qin, Fei Huang
cs.AI

Abstract

Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex goal-oriented social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across a variety of agent tasks. Existing DPO-based approaches for multi-turn interactions are divided into turn-level and session-level methods. The turn-level method is overly fine-grained, focusing exclusively on individual turns, while session-level methods are too coarse-grained, often introducing training noise. To address these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which focuses on specific key segments within interactions to optimize multi-turn agent behavior while minimizing training noise. Evaluations on the SOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform both existing DPO-based methods and proprietary LLMs like GPT-4o, underscoring SDPO's potential to advance the social intelligence of LLM-based agents. We release our code and data at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/SDPO.

January 6, 2025