StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following
February 20, 2025
Authors: Jinnan Li, Jinzhe Li, Yue Wang, Yi Chang, Yuan Wu
cs.AI
Abstract
Multi-turn instruction following capability constitutes a core competency of
large language models (LLMs) in real-world applications. Existing evaluation
benchmarks predominantly focus on fine-grained constraint satisfaction and
domain-specific capability assessment, yet overlook the crucial structural
dependency between dialogue turns that distinguishes multi-turn from
single-turn interactions. This structural dependency not only reflects user
intent but also establishes a second dimension for instruction following
evaluation beyond constraint satisfaction. To address this gap, we propose
StructFlowBench, a multi-turn instruction following benchmark with structural
flow modeling. The benchmark innovatively defines a structural flow framework
comprising six fundamental inter-turn relationships, which not only introduces
novel structural constraints for model evaluation but also serves as generation
parameters for creating customized dialogue flows tailored to specific
scenarios. Adopting established LLM-based automatic evaluation methodologies,
we conduct systematic evaluations of 13 leading open-source and closed-source
LLMs. Experimental results reveal significant deficiencies in current models'
comprehension of multi-turn dialogue structures. The code is available at
https://github.com/MLGroupJLU/StructFlowBench.