Judging the Judges: A Collection of LLM-Generated Relevance Judgements
February 19, 2025
Authors: Hossein A. Rahmani, Clemencia Siro, Mohammad Aliannejadi, Nick Craswell, Charles L. A. Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, Emine Yilmaz
cs.AI
Abstract
Using Large Language Models (LLMs) for relevance assessments offers promising opportunities to improve Information Retrieval (IR), Natural Language Processing (NLP), and related fields. Indeed, LLMs hold the promise of allowing IR experimenters to build evaluation collections with a fraction of the manual human labor currently required. This could help with fresh topics on which there is still limited knowledge and could mitigate the challenges of evaluating ranking systems in low-resource scenarios, where it is challenging to find human annotators. Given the fast-paced recent developments in the domain, many questions concerning LLMs as assessors are yet to be answered. Among the aspects that require further investigation, we can list the impact of various components in a relevance judgment generation pipeline, such as the prompt used or the LLM chosen.
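As a concrete illustration, the sketch below shows what a single step of such a judgment pipeline might look like. It is a minimal sketch assuming a 0-3 graded relevance scale, as used in the TREC Deep Learning track; the prompt wording, the `call_llm` helper, and the answer-parsing logic are illustrative assumptions, not the prompts actually used by the LLMJudge teams.

```python
# Minimal sketch of one step in an LLM relevance-judgment pipeline.
# The prompt template, the `call_llm` helper, and the 0-3 parsing rule
# are illustrative assumptions, not the LLMJudge teams' actual prompts.
import re
from typing import Callable

PROMPT_TEMPLATE = """You are a relevance assessor.
Query: {query}
Passage: {passage}
On a scale from 0 (irrelevant) to 3 (perfectly relevant),
how relevant is the passage to the query?
Answer with a single digit."""

def judge_relevance(query: str, passage: str,
                    call_llm: Callable[[str], str]) -> int:
    """Ask an LLM (via a user-supplied `call_llm` function) for a 0-3 label."""
    prompt = PROMPT_TEMPLATE.format(query=query, passage=passage)
    answer = call_llm(prompt)
    match = re.search(r"[0-3]", answer)
    # Fall back to 0 (irrelevant) if the model's answer cannot be parsed.
    return int(match.group()) if match else 0
```

Changing the prompt template or swapping the model behind `call_llm` corresponds to varying exactly the pipeline components whose impact the paper highlights.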
This paper benchmarks and reports on the results of a large-scale automatic relevance judgment evaluation, the LLMJudge challenge at SIGIR 2024, where different relevance assessment approaches were proposed. In detail, we release and benchmark 42 LLM-generated labels of the TREC 2023 Deep Learning track relevance judgments produced by eight international teams who participated in the challenge. Given their diverse nature, these automatically generated relevance judgments can help the community not only investigate systematic biases caused by LLMs but also explore the effectiveness of ensemble models, analyze the trade-offs between different models and human assessors, and advance methodologies for improving automated evaluation techniques. The released resource is available at the following link: https://llm4eval.github.io/LLMJudge-benchmark/
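For readers who want to compare the released LLM labels against the human judgments, the following is a minimal sketch of one such analysis. It assumes standard four-column TREC qrels files ("qid Q0 docid label"); the file names are placeholders, and Cohen's kappa (via scikit-learn) is only one of many agreement measures one could use. Consult the released resource for its exact file layout.

```python
# Sketch: compare LLM-generated labels against human qrels with Cohen's kappa.
# Assumes standard 4-column TREC qrels files ("qid Q0 docid label"); the
# file names below are placeholders, not files shipped with the resource.
from sklearn.metrics import cohen_kappa_score

def load_qrels(path: str) -> dict:
    """Read a TREC-style qrels file into {(qid, docid): label}."""
    qrels = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            qid, _, docid, label = line.split()
            qrels[(qid, docid)] = int(label)
    return qrels

human = load_qrels("human.qrels")       # placeholder path
llm = load_qrels("llm_team_run.qrels")  # placeholder path

# Evaluate agreement only on query-document pairs judged by both sources.
shared = sorted(set(human) & set(llm))
kappa = cohen_kappa_score([human[k] for k in shared],
                          [llm[k] for k in shared])
print(f"Cohen's kappa on {len(shared)} shared pairs: {kappa:.3f}")
```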