IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems

Abstract

Large Language Models (LLMs) are transforming artificial intelligence, evolving into task-oriented systems capable of autonomous planning and execution. One of the primary applications of LLMs is conversational AI systems, which must navigate multi-turn dialogues, integrate domain-specific APIs, and adhere to strict policy constraints. However, evaluating these agents remains a significant challenge, as traditional methods fail to capture the complexity and variability of real-world interactions. We introduce IntellAgent, a scalable, open-source multi-agent framework designed to evaluate conversational AI systems comprehensively. IntellAgent automates the creation of diverse, synthetic benchmarks by combining policy-driven graph modeling, realistic event generation, and interactive user-agent simulations. This approach provides fine-grained diagnostics, addressing the limitations of static, manually curated benchmarks that report only coarse-grained metrics. IntellAgent represents a paradigm shift in evaluating conversational AI. By simulating realistic, multi-policy scenarios across varying levels of complexity, IntellAgent captures the nuanced interplay of agent capabilities and policy constraints. Unlike traditional methods, it employs a graph-based policy model to represent the relationships, likelihoods, and complexities of policy interactions, enabling highly detailed diagnostics. IntellAgent also identifies critical performance gaps, offering actionable insights for targeted optimization. Its modular, open-source design supports seamless integration of new domains, policies, and APIs, fostering reproducibility and community collaboration. Our findings demonstrate that IntellAgent serves as an effective framework for advancing conversational AI by addressing the challenges of bridging research and deployment. The framework is available at https://github.com/plurai-ai/intellagent.
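To make the idea of a graph-based policy model more concrete, the following is a minimal sketch and not IntellAgent's actual implementation. It assumes that each policy is a node carrying a complexity score, that edge weights stand for how likely two policies are to appear in the same conversation, and that a scenario is drawn by a weighted random walk until a target complexity is reached. The class, the function, the policy names, and all numbers are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class PolicyGraph:
    """Hypothetical policy graph: nodes are policies, edges are co-occurrence weights."""
    complexity: dict[str, int]                      # policy name -> complexity score
    weight: dict[tuple[str, str], float] = field(default_factory=dict)

    def add_edge(self, a: str, b: str, w: float) -> None:
        # Store the undirected edge in both directions for easy lookup.
        self.weight[(a, b)] = w
        self.weight[(b, a)] = w

    def neighbors(self, node: str, exclude: set[str]) -> dict[str, float]:
        # Neighbors of `node` with positive weight that were not picked yet.
        return {b: w for (a, b), w in self.weight.items()
                if a == node and b not in exclude and w > 0}


def sample_policies(graph: PolicyGraph, target_complexity: int, rng: random.Random):
    """Weighted random walk that collects policies until the target complexity is met."""
    current = rng.choice(sorted(graph.complexity))  # assumed: uniform start node
    chosen = [current]
    total = graph.complexity[current]
    while total < target_complexity:
        candidates = graph.neighbors(current, set(chosen))
        if not candidates:
            break  # no unvisited neighbor left; stop early
        names = list(candidates)
        current = rng.choices(names, weights=[candidates[n] for n in names], k=1)[0]
        chosen.append(current)
        total += graph.complexity[current]
    return chosen, total


# Illustrative, made-up policies for an airline-style domain.
graph = PolicyGraph(complexity={"verify_identity": 1,
                                "change_booking": 2,
                                "refund_eligibility": 3})
graph.add_edge("verify_identity", "change_booking", 0.6)
graph.add_edge("verify_identity", "refund_eligibility", 0.3)
graph.add_edge("change_booking", "refund_eligibility", 0.8)

chosen, total = sample_policies(graph, target_complexity=4, rng=random.Random(0))
print(chosen, total)
```

Under these assumptions, raising `target_complexity` yields scenarios that combine more policies, which is one way a benchmark could span the "varying levels of complexity" mentioned in the abstract.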
