
CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution

October 21, 2024
Authors: Maosong Cao, Alexander Lam, Haodong Duan, Hongwei Liu, Songyang Zhang, Kai Chen
cs.AI

Abstract

Efficient and accurate evaluation is crucial for the continuous improvement of large language models (LLMs). Among the available assessment methods, subjective evaluation has attracted significant attention because it aligns more closely with real-world usage scenarios and human preferences. However, human evaluation is costly and hard to reproduce, which makes precise automated evaluators (judgers) vital to the process. In this report, we introduce CompassJudger-1, the first open-source all-in-one judge LLM. CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility. It is capable of: (1) performing unitary scoring and two-model comparisons as a reward model; (2) conducting evaluations according to specified formats; (3) generating critiques; and (4) executing diverse tasks like a general LLM. To assess the evaluation capabilities of different judge models under a unified setting, we have also established JudgerBench, a new benchmark that encompasses various subjective evaluation tasks and covers a wide range of topics. CompassJudger-1 offers a comprehensive solution for various evaluation tasks while remaining flexible enough to adapt to diverse requirements. Both CompassJudger and JudgerBench are released and available to the research community at https://github.com/open-compass/CompassJudger. We believe that by open-sourcing these tools, we can foster collaboration and accelerate progress in LLM evaluation methodologies.
