DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models
April 2, 2025
作者: Sunghee Jung, Donghun Lee, Shinbok Lee, Gaeun Seo, Daniel Lee, Byeongil Ko, Junrae Cho, Kihyun Kim, Eunggyun Kim, Myeongcheol Shin
cs.AI
Abstract
Tool-Augmented Large Language Models (TA-LLMs) have shown promise in real-world applications, but face challenges in handling incomplete queries and out-of-scope requests. While existing approaches rely mainly on Supervised Fine-Tuning with expert trajectories, we propose DiaTool-DPO, a novel method that enhances TA-LLM's dialogue capabilities through Direct Preference Optimization. We model TA-LLM interactions as a Markov Decision Process with 5 distinct dialogue states and categorize user queries into 3 types based on their state transition trajectories. We automatically construct paired trajectory datasets of correct and incorrect dialogue flows and introduce a specialized objective loss for dialogue control. Our comprehensive evaluation demonstrates that DiaTool-DPO approaches GPT-4o's performance (94.8% in information gathering, 91% in tool call rejection) with substantial improvements over the baseline (44% and 9.6%, respectively) while maintaining core functionality. Our approach opens new possibilities for developing TA-LLMs that can handle diverse real-world scenarios without requiring additional expert demonstrations or human labeling.
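For context, the standard DPO objective that this approach builds on contrasts a preferred trajectory with a rejected one under a frozen reference policy. A minimal sketch is given below, assuming x is the dialogue context, y_w the correct dialogue flow, y_l the incorrect one, \pi_{\mathrm{ref}} the SFT reference model, and \beta a temperature hyperparameter; the paper's specialized dialogue-control loss modifies this form, and its exact formulation is not stated in the abstract:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$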