SDPO: Segment-Level Direct Preference Optimization for Social Agents
January 3, 2025
Authors: Aobo Kong, Wentao Ma, Shiwan Zhao, Yongbin Li, Yuchuan Wu, Ke Wang, Xiaoqian Liu, Qicheng Li, Yong Qin, Fei Huang
cs.AI
Abstract
Social agents powered by large language models (LLMs) can simulate human
social behaviors but fall short in handling complex goal-oriented social
dialogues. Direct Preference Optimization (DPO) has proven effective in
aligning LLM behavior with human preferences across a variety of agent tasks.
Existing DPO-based approaches for multi-turn interactions are divided into
turn-level and session-level methods. The turn-level method is overly
fine-grained, focusing exclusively on individual turns, while session-level
methods are too coarse-grained, often introducing training noise. To address
these limitations, we propose Segment-Level Direct Preference Optimization
(SDPO), which focuses on specific key segments within interactions to optimize
multi-turn agent behavior while minimizing training noise. Evaluations on the
SOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform
both existing DPO-based methods and proprietary LLMs like GPT-4o, underscoring
SDPO's potential to advance the social intelligence of LLM-based agents. We
release our code and data at
https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/SDPO.
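The abstract describes SDPO only at a high level. As a rough illustration of what a segment-level preference objective could look like, here is a minimal PyTorch sketch, assuming (hypothetically) that the policy/reference log-ratio is summed over the agent turns inside a selected key segment and then passed through the standard DPO logistic loss. The function name `segment_dpo_loss`, its arguments, and this aggregation scheme are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def segment_dpo_loss(chosen_turn_logratios, rejected_turn_logratios, beta=0.1):
    """Illustrative segment-level DPO loss (assumed formulation, not the paper's).

    Each argument is a 1-D tensor holding, for every agent turn inside the
    selected key segment, log pi_theta(turn) - log pi_ref(turn).
    """
    # Aggregate the log-ratio over the key segment only, instead of a single
    # turn (turn-level) or the entire session (session-level).
    chosen = chosen_turn_logratios.sum()
    rejected = rejected_turn_logratios.sum()
    # Standard DPO logistic loss applied to the segment-level log-ratios.
    return -F.logsigmoid(beta * (chosen - rejected))

# Toy usage with made-up per-turn log-ratios for a three-turn key segment.
loss = segment_dpo_loss(torch.tensor([-0.2, 0.1, 0.3]),
                        torch.tensor([-0.5, -0.4, -0.1]))
```

The sketch is only meant to contrast the aggregation granularity: turn-level DPO would score one turn, session-level DPO the whole dialogue, and a segment-level variant only the turns inside the chosen key segment.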