CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution
October 21, 2024
Authors: Maosong Cao, Alexander Lam, Haodong Duan, Hongwei Liu, Songyang Zhang, Kai Chen
cs.AI
Abstract
Efficient and accurate evaluation is crucial for the continuous improvement
of large language models (LLMs). Among various assessment methods, subjective
evaluation has garnered significant attention due to its superior alignment
with real-world usage scenarios and human preferences. However, human-based
evaluations are costly and lack reproducibility, making precise automated
evaluators (judgers) vital in this process. In this report, we introduce
CompassJudger-1, the first open-source all-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable
versatility. It is capable of: 1. Performing unitary scoring and two-model
comparisons as a reward model; 2. Conducting evaluations according to specified
formats; 3. Generating critiques; 4. Executing diverse tasks like a general
LLM. To assess the evaluation capabilities of different judge models under a
unified setting, we have also established JudgerBench, a new benchmark
that encompasses various subjective evaluation tasks and covers a wide range of
topics. CompassJudger-1 offers a comprehensive solution for various evaluation
tasks while maintaining the flexibility to adapt to diverse requirements. Both
CompassJudger and JudgerBench are released and available to the research
community at https://github.com/open-compass/CompassJudger. We believe that by
open-sourcing these tools, we can foster collaboration and accelerate progress
in LLM evaluation methodologies.
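
As a rough illustration of the two-model comparison and format-constrained evaluation described above, the following Python sketch builds a pairwise judging prompt and extracts the verdict from a judge's reply. The template wording, the `[[A]]`/`[[B]]` verdict convention, and the function names `build_pairwise_prompt` and `parse_verdict` are assumptions made for this example; they are not taken from the CompassJudger-1 report or repository, and no model is actually called.

```python
import re
from typing import Optional

# Hypothetical prompt template for illustration only; this is NOT the
# prompt format used by CompassJudger-1.
PAIRWISE_TEMPLATE = (
    "You are a judge. Compare the two answers to the question.\n"
    "Question: {question}\n"
    "Answer A: {answer_a}\n"
    "Answer B: {answer_b}\n"
    "Write a short critique, then end with 'Verdict: [[A]]' or 'Verdict: [[B]]'."
)


def build_pairwise_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Fill the (assumed) template with a question and two candidate answers."""
    return PAIRWISE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )


def parse_verdict(judge_output: str) -> Optional[str]:
    """Return 'A' or 'B' from the last [[A]]/[[B]] marker, or None if absent."""
    matches = re.findall(r"\[\[(A|B)\]\]", judge_output)
    return matches[-1] if matches else None


# Usage with a hand-written stand-in for a judge model's reply.
prompt = build_pairwise_prompt("What is 2 + 2?", "4", "5")
fake_output = "Answer A is correct; Answer B is not.\nVerdict: [[A]]"
print(parse_verdict(fake_output))  # prints: A
```

Taking the last marker rather than the first is a design choice for this sketch: a critique may mention both labels before the final verdict line.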