WildIFEval: Instruction Following in the Wild
March 9, 2025
Authors: Gili Lior, Asaf Yehudai, Ariel Gera, Liat Ein-Dor
cs.AI
Abstract
Recent LLMs have shown remarkable success in following user instructions, yet
handling instructions with multiple constraints remains a significant
challenge. In this work, we introduce WildIFEval - a large-scale dataset of 12K
real user instructions with diverse, multi-constraint conditions. Unlike prior
datasets, our collection spans a broad lexical and topical spectrum of
constraints, in natural user prompts. We categorize these constraints into
eight high-level classes to capture their distribution and dynamics in
real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments
to benchmark the instruction-following capabilities of leading LLMs. Our
findings reveal that all evaluated models experience performance degradation
with an increasing number of constraints, indicating that all models have
substantial room for improvement on such tasks. Moreover, we observe that the
specific type of constraint plays a critical role in model performance. We
release our dataset to promote further research on instruction-following under
complex, realistic conditions.