Judging the Judges: A Collection of LLM-Generated Relevance Judgements
February 19, 2025
Authors: Hossein A. Rahmani, Clemencia Siro, Mohammad Aliannejadi, Nick Craswell, Charles L. A. Clarke, Guglielmo Faggioli, Bhaskar Mitra, Paul Thomas, Emine Yilmaz
cs.AI
Abstract
Using Large Language Models (LLMs) for relevance assessments offers promising opportunities to improve Information Retrieval (IR), Natural Language Processing (NLP), and related fields. Indeed, LLMs hold the promise of allowing IR experimenters to build evaluation collections with a fraction of the manual human labor currently required. This could help with fresh topics on which there is still limited knowledge and could mitigate the challenges of evaluating ranking systems in low-resource scenarios, where it is challenging to find human annotators. Given the fast-paced recent developments in the domain, many questions concerning LLMs as assessors are yet to be answered. Among the aspects that require further investigation, we can list the impact of various components in a relevance judgment generation pipeline, such as the prompt used or the LLM chosen.
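As a concrete illustration, the sketch below shows what a single step of such a judgment pipeline might look like. It is a minimal sketch assuming a 0-3 graded relevance scale, as used in the TREC Deep Learning track; the prompt wording, the `call_llm` helper, and the answer-parsing logic are illustrative assumptions, not the prompts actually used by the LLMJudge teams.

```python
# Minimal sketch of one step in an LLM relevance-judgment pipeline.
# The prompt template, the `call_llm` helper, and the 0-3 parsing rule
# are illustrative assumptions, not the LLMJudge teams' actual prompts.
import re
from typing import Callable

PROMPT_TEMPLATE = """You are a relevance assessor.
Query: {query}
Passage: {passage}
On a scale from 0 (irrelevant) to 3 (perfectly relevant),
how relevant is the passage to the query?
Answer with a single digit."""

def judge_relevance(query: str, passage: str,
                    call_llm: Callable[[str], str]) -> int:
    """Ask an LLM (via a user-supplied `call_llm` function) for a 0-3 label."""
    prompt = PROMPT_TEMPLATE.format(query=query, passage=passage)
    answer = call_llm(prompt)
    match = re.search(r"[0-3]", answer)
    # Fall back to 0 (irrelevant) if the model's answer cannot be parsed.
    return int(match.group()) if match else 0
```

Changing the prompt template or swapping the model behind `call_llm` corresponds to varying exactly the pipeline components whose impact the paper highlights.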
This paper benchmarks and reports on the results of a large-scale automatic relevance judgment evaluation, the LLMJudge challenge at SIGIR 2024, where different relevance assessment approaches were proposed. In detail, we release and benchmark 42 LLM-generated labels of the TREC 2023 Deep Learning track relevance judgments produced by eight international teams who participated in the challenge. Given their diverse nature, these automatically generated relevance judgments can help the community not only investigate systematic biases caused by LLMs but also explore the effectiveness of ensemble models, analyze the trade-offs between different models and human assessors, and advance methodologies for improving automated evaluation techniques. The released resource is available at the following link: https://llm4eval.github.io/LLMJudge-benchmark/
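For readers who want to compare the released LLM labels against the human judgments, the following is a minimal sketch of one such analysis. It assumes standard four-column TREC qrels files ("qid Q0 docid label"); the file names are placeholders, and Cohen's kappa (via scikit-learn) is only one of many agreement measures one could use. Consult the released resource for its exact file layout.

```python
# Sketch: compare LLM-generated labels against human qrels with Cohen's kappa.
# Assumes standard 4-column TREC qrels files ("qid Q0 docid label"); the
# file names below are placeholders, not files shipped with the resource.
from sklearn.metrics import cohen_kappa_score

def load_qrels(path: str) -> dict:
    """Read a TREC-style qrels file into {(qid, docid): label}."""
    qrels = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            qid, _, docid, label = line.split()
            qrels[(qid, docid)] = int(label)
    return qrels

human = load_qrels("human.qrels")       # placeholder path
llm = load_qrels("llm_team_run.qrels")  # placeholder path

# Evaluate agreement only on query-document pairs judged by both sources.
shared = sorted(set(human) & set(llm))
kappa = cohen_kappa_score([human[k] for k in shared],
                          [llm[k] for k in shared])
print(f"Cohen's kappa on {len(shared)} shared pairs: {kappa:.3f}")
```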