
IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval

March 6, 2025
Authors: Tingyu Song, Guo Gan, Mingsheng Shang, Yilun Zhao
cs.AI

Abstract

We introduce IFIR, the first comprehensive benchmark designed to evaluate instruction-following information retrieval (IR) in expert domains. IFIR includes 2,426 high-quality examples and covers eight subsets across four specialized domains: finance, law, healthcare, and science literature. Each subset addresses one or more domain-specific retrieval tasks, replicating real-world scenarios where customized instructions are critical. IFIR enables a detailed analysis of instruction-following retrieval capabilities by incorporating instructions at different levels of complexity. We also propose a novel LLM-based evaluation method to provide a more precise and reliable assessment of model performance in following instructions. Through extensive experiments on 15 frontier retrieval models, including those based on LLMs, our results reveal that current models face significant challenges in effectively following complex, domain-specific instructions. We further provide in-depth analyses to highlight these limitations, offering valuable insights to guide future advancements in retriever development.
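
As a rough illustration of the kind of evaluation the benchmark targets, the sketch below contrasts retrieval quality with and without the instruction appended to the query. It is a minimal sketch under assumptions: the dataset fields (`query`, `instruction`, `relevant_ids`), the `retrieve` callable, and the precision-at-k proxy are hypothetical and are not IFIR's release format or its LLM-based metric, which are defined in the paper.

```python
# Hypothetical sketch of instruction-following retrieval evaluation.
# The dataset schema, retriever interface, and metric are illustrative
# assumptions, not the official IFIR pipeline.

from typing import Callable, Dict, List


def precision_at_k(retrieved: List[str], relevant: set, k: int = 10) -> float:
    """Fraction of the top-k retrieved passage ids judged relevant."""
    top = retrieved[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / max(len(top), 1)


def evaluate_instruction_following(
    examples: List[Dict],                      # each: {"query", "instruction", "relevant_ids"}
    retrieve: Callable[[str], List[str]],      # returns ranked passage ids for a text query
    k: int = 10,
) -> Dict[str, float]:
    """Compare retrieval quality with and without the appended instruction."""
    base_scores, instr_scores = [], []
    for ex in examples:
        relevant = set(ex["relevant_ids"])
        base_scores.append(precision_at_k(retrieve(ex["query"]), relevant, k))
        instr_query = f'{ex["query"]} {ex["instruction"]}'
        instr_scores.append(precision_at_k(retrieve(instr_query), relevant, k))
    n = len(examples)
    return {
        "p_at_k_query_only": sum(base_scores) / n,
        "p_at_k_with_instruction": sum(instr_scores) / n,
        # A positive gap suggests the retriever actually uses the instruction.
        "instruction_gain": (sum(instr_scores) - sum(base_scores)) / n,
    }
```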
