Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs

Abstract

Evaluating machine-generated text remains a significant challenge in NLP, especially for non-English languages. Current methodologies, including automated metrics, human assessments, and LLM-based evaluations, focus predominantly on English, leaving a significant gap in multilingual evaluation frameworks. We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an extensible framework that includes evaluator LLMs (Hercule) and a novel test set (Recon) specifically designed for multilingual evaluation. Our test set features 500 human-annotated instructions spanning various task capabilities, along with human judgment scores across six languages. This enables benchmarking of general-purpose multilingual LLMs and facilitates meta-evaluation of evaluator LLMs. The proposed model, Hercule, is a cross-lingual evaluation model that addresses the scarcity of reference answers in the target language by learning to assign scores to responses based on easily available reference answers in English. Our experiments show that Hercule aligns more closely with human judgments than proprietary models do, demonstrating the effectiveness of such cross-lingual evaluation in low-resource scenarios. It is also effective in zero-shot evaluation on unseen languages. This study is the first comprehensive examination of cross-lingual evaluation using LLMs, presenting a scalable and effective approach for multilingual assessment. All code, datasets, and models will be publicly available to enable further research in this important area.
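To make the notion of meta-evaluation concrete, the sketch below shows one way agreement between an evaluator LLM's scores and human judgment scores could be quantified. It is an illustration only: the abstract does not name an agreement metric here, so the choice of linear weighted Cohen's kappa, the 1 to 5 score scale, and the function name `linear_weighted_kappa` are all assumptions rather than the authors' method.

```python
from collections import Counter


def linear_weighted_kappa(human, evaluator, min_score=1, max_score=5):
    """Linear weighted Cohen's kappa between two lists of integer scores.

    Assumption: both raters score on the same integer scale
    [min_score, max_score]; the 1-5 default is hypothetical.
    """
    if len(human) != len(evaluator) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(human)
    span = max_score - min_score  # largest possible disagreement

    # Observed mean disagreement, normalised to [0, 1].
    observed = sum(abs(h - e) for h, e in zip(human, evaluator)) / (n * span)

    # Disagreement expected if the two raters scored independently,
    # each following its own marginal score distribution.
    h_counts, e_counts = Counter(human), Counter(evaluator)
    expected = sum(
        h_counts[i] * e_counts[j] * abs(i - j)
        for i in h_counts
        for j in e_counts
    ) / (n * n * span)

    if expected == 0:
        # Both raters gave the same constant score; kappa is undefined,
        # so returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Made-up scores for six responses, purely for illustration.
    human_scores = [5, 4, 2, 1, 3, 4]
    evaluator_scores = [5, 3, 2, 2, 3, 4]
    print(linear_weighted_kappa(human_scores, evaluator_scores))
```

A value of 1 means the evaluator reproduces the human scores exactly, 0 means agreement no better than chance, and negative values mean systematic disagreement. Rank correlations such as Kendall's tau would be a reasonable alternative under the same setup.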
