From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
November 25, 2024
Authors: Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, Huan Liu
cs.AI
Abstract
Assessment and evaluation have long been critical challenges in artificial
intelligence (AI) and natural language processing (NLP). However, traditional
methods, whether matching-based or embedding-based, often fall short of judging
subtle attributes and delivering satisfactory results. Recent advancements in
Large Language Models (LLMs) inspire the "LLM-as-a-judge" paradigm, where LLMs
are leveraged to perform scoring, ranking, or selection across various tasks
and applications. This paper provides a comprehensive survey of LLM-based
judgment and assessment, offering an in-depth overview to advance this emerging
field. We begin by giving detailed definitions from both input and output
perspectives. Then we introduce a comprehensive taxonomy to explore
LLM-as-a-judge from three dimensions: what to judge, how to judge and where to
judge. Finally, we compile benchmarks for evaluating LLM-as-a-judge and
highlight key challenges and promising directions, aiming to provide valuable
insights and inspire future research in this promising research area. Paper
list and more resources about LLM-as-a-judge can be found at
https://github.com/llm-as-a-judge/Awesome-LLM-as-a-judge and
https://llm-as-a-judge.github.io.