Can a Single Model Master Both Multi-turn Conversations and Tool Use? CALM: A Unified Conversational Agentic Language Model

Abstract

Large Language Models (LLMs) with API-calling capabilities have enabled the construction of effective Language Agents (LAs) and have also revolutionized the conventional task-oriented dialogue (TOD) paradigm. However, current approaches face a critical dilemma: TOD systems are often trained on a limited set of target APIs and need new data to maintain their quality when interfacing with new services, while LAs are not trained to maintain user intent over multi-turn conversations. Because both robust multi-turn dialogue management and advanced function calling are crucial for effective conversational agents, we evaluate these skills on three popular benchmarks: MultiWOZ 2.4 (TOD), BFCL V3 (LA), and API-Bank (LA). Our analyses reveal that specialized approaches excel in one domain but underperform in the other. To bridge this gap, we introduce CALM (Conversational Agentic Language Model), a unified approach that integrates both conversational and agentic capabilities. We created CALM-IT, a carefully constructed multi-task dataset that interleaves multi-turn ReAct reasoning with complex API usage. Using CALM-IT, we train three models, CALM 8B, CALM 70B, and CALM 405B, which outperform top domain-specific models, including GPT-4o, across all three benchmarks.
