Judging the Judges: A Collection of LLM-Generated Relevance Judgements

Abstract

Using Large Language Models (LLMs) for relevance assessment offers promising opportunities to improve Information Retrieval (IR), Natural Language Processing (NLP), and related fields. Indeed, LLMs hold the promise of allowing IR experimenters to build evaluation collections with a fraction of the manual human labor currently required. This could help with fresh topics on which there is still limited knowledge, and could ease the evaluation of ranking systems in low-resource scenarios, where human annotators are hard to find. Given the fast pace of recent developments in this area, many questions about LLMs as assessors remain unanswered. Among the aspects that require further investigation is the impact of the various components of a relevance judgment generation pipeline, such as the prompt used or the LLM chosen. This paper benchmarks and reports on the results of a large-scale automatic relevance judgment evaluation, the LLMJudge challenge at SIGIR 2024, where different relevance assessment approaches were proposed. In detail, we release and benchmark 42 LLM-generated labels of the TREC 2023 Deep Learning track relevance judgments produced by eight international teams who participated in the challenge. Given their diverse nature, these automatically generated relevance judgments can help the community not only investigate systematic biases caused by LLMs but also explore the effectiveness of ensemble models, analyze the trade-offs between different models and human assessors, and advance methodologies for improving automated evaluation techniques. The released resource is available at the following link: https://llm4eval.github.io/LLMJudge-benchmark/
