

DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models

April 2, 2025
Authors: Sunghee Jung, Donghun Lee, Shinbok Lee, Gaeun Seo, Daniel Lee, Byeongil Ko, Junrae Cho, Kihyun Kim, Eunggyun Kim, Myeongcheol Shin
cs.AI

Abstract

Tool-Augmented Large Language Models (TA-LLMs) have shown promise in real-world applications, but they face challenges in handling incomplete queries and out-of-scope requests. While existing approaches rely mainly on Supervised Fine-Tuning with expert trajectories, we propose DiaTool-DPO, a novel method that enhances the dialogue capabilities of TA-LLMs through Direct Preference Optimization. We model TA-LLM interactions as a Markov Decision Process with 5 distinct dialogue states and categorize user queries into 3 types based on their state transition trajectories. We automatically construct paired trajectory datasets of correct and incorrect dialogue flows and introduce a specialized objective loss for dialogue control. Our comprehensive evaluation demonstrates that DiaTool-DPO approaches GPT-4o's performance (94.8% in information gathering, 91% in tool call rejection), with substantial improvements over the baseline (44% and 9.6%, respectively), while maintaining core functionality. Our approach opens new possibilities for developing TA-LLMs that can handle diverse real-world scenarios without requiring additional expert demonstrations or human labeling.
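The abstract describes learning from paired correct/incorrect dialogue trajectories with a DPO-style objective. The minimal sketch below shows how a standard DPO loss can be applied at the trajectory level, summing log-probabilities over the assistant's turns of each dialogue flow; the function names, the masking scheme, and the beta value are illustrative assumptions and do not reproduce the paper's specialized dialogue-control loss.

```python
import torch
import torch.nn.functional as F


def sum_assistant_logps(logits, input_ids, assistant_mask):
    """Sum token log-probabilities over assistant-generated tokens only.

    logits:         [batch, seq_len, vocab] from the (policy or reference) model
    input_ids:      [batch, seq_len] token ids of the full dialogue
    assistant_mask: [batch, seq_len] 1.0 for tokens produced by the assistant
                    (responses, clarification questions, tool calls), else 0.0
    """
    logps = torch.log_softmax(logits[:, :-1], dim=-1)
    token_logps = logps.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return (token_logps * assistant_mask[:, 1:]).sum(dim=-1)


def dpo_trajectory_loss(policy_logps_chosen, policy_logps_rejected,
                        ref_logps_chosen, ref_logps_rejected, beta=0.1):
    """Standard DPO objective over paired dialogue trajectories.

    'chosen' is the correct dialogue flow (e.g. asking for a missing slot
    before calling the tool, or rejecting an out-of-scope request);
    'rejected' is the incorrect flow (e.g. calling the tool prematurely).
    Each argument is a [batch] tensor of summed assistant-turn log-probs.
    """
    chosen_rewards = beta * (policy_logps_chosen - ref_logps_chosen)
    rejected_rewards = beta * (policy_logps_rejected - ref_logps_rejected)
    # Increase the margin between correct and incorrect dialogue flows.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In this sketch the reference model stays frozen and only the policy receives gradients, as in standard DPO; how the paper weights or modifies this objective for dialogue control is not specified in the abstract.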
