From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge

November 25, 2024
Authors: Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, Zhen Tan, Amrita Bhattacharjee, Yuxuan Jiang, Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, Huan Liu
cs.AI

Abstract

Assessment and evaluation have long been critical challenges in artificial intelligence (AI) and natural language processing (NLP). However, traditional methods, whether matching-based or embedding-based, often fall short in judging subtle attributes and delivering satisfactory results. Recent advancements in Large Language Models (LLMs) have inspired the "LLM-as-a-judge" paradigm, where LLMs are leveraged to perform scoring, ranking, or selection across various tasks and applications. This paper provides a comprehensive survey of LLM-based judgment and assessment, offering an in-depth overview to advance this emerging field. We begin by giving detailed definitions from both input and output perspectives. Then we introduce a comprehensive taxonomy to explore LLM-as-a-judge from three dimensions: what to judge, how to judge, and where to judge. Finally, we compile benchmarks for evaluating LLM-as-a-judge and highlight key challenges and promising directions, aiming to provide valuable insights and inspire future research in this area. A paper list and more resources about LLM-as-a-judge can be found at https://github.com/llm-as-a-judge/Awesome-LLM-as-a-judge and https://llm-as-a-judge.github.io.
