CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution

Abstract

Efficient and accurate evaluation is crucial for the continuous improvement of large language models (LLMs). Among various assessment methods, subjective evaluation has garnered significant attention because it aligns closely with real-world usage scenarios and human preferences. However, human-based evaluations are costly and lack reproducibility, making precise automated evaluators (judgers) vital in this process. In this report, we introduce CompassJudger-1, the first open-source all-in-one judge LLM. CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility. It is capable of: 1. Performing unitary scoring and two-model comparisons as a reward model; 2. Conducting evaluations according to specified formats; 3. Generating critiques; 4. Executing diverse tasks like a general LLM. To assess the evaluation capabilities of different judge models under a unified setting, we have also established JudgerBench, a new benchmark that encompasses various subjective evaluation tasks and covers a wide range of topics. CompassJudger-1 offers a comprehensive solution for various evaluation tasks while maintaining the flexibility to adapt to diverse requirements. Both CompassJudger and JudgerBench are released and available to the research community at https://github.com/open-compass/CompassJudger. We believe that by open-sourcing these tools, we can foster collaboration and accelerate progress in LLM evaluation methodologies.
