Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs
October 17, 2024
Authors: Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Dilip Venkatesh, Raj Dabre, Anoop Kunchukuttan, Mitesh M. Khapra
cs.AI
Abstract
Evaluating machine-generated text remains a significant challenge in NLP,
especially for non-English languages. Current methodologies, including
automated metrics, human assessments, and LLM-based evaluations, predominantly
focus on English, revealing a significant gap in multilingual evaluation
frameworks. We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an
extensible framework that includes evaluator LLMs (Hercule) and a novel test
set (Recon) specifically designed for multilingual evaluation. Our test set
features 500 human-annotated instructions spanning various task capabilities
along with human judgment scores across six languages. This enables
benchmarking of general-purpose multilingual LLMs and facilitates
meta-evaluation of evaluator LLMs. The proposed model, Hercule, is a
cross-lingual evaluation model that addresses the scarcity of reference answers
in the target language by learning to assign scores to responses based on
easily available reference answers in English. Our experiments show that
Hercule aligns more closely with human judgments than proprietary models,
demonstrating the effectiveness of cross-lingual evaluation in low-resource
scenarios. It is also effective in zero-shot evaluation on unseen
languages. This study is the first comprehensive examination of
cross-lingual evaluation using LLMs, presenting a scalable and effective
approach for multilingual assessment. All code, datasets, and models will be
publicly available to enable further research in this important area.
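The core mechanism described above, an evaluator LLM scoring a target-language response against an easily available English reference answer, can be sketched as follows. This is an illustrative mock under stated assumptions, not the actual Hercule pipeline: the prompt template, the "Score: <n>" output format, and `stub_judge` are hypothetical stand-ins for the fine-tuned evaluator model.

```python
import re

def build_prompt(instruction: str, response: str, reference_en: str) -> str:
    # Hypothetical prompt format: the evaluator sees the instruction and the
    # target-language response, plus an English reference answer, and is asked
    # to end its feedback with an integer score from 1 to 5.
    return (
        f"Instruction: {instruction}\n"
        f"Response: {response}\n"
        f"Reference answer (English): {reference_en}\n"
        "Rate the response on a 1-5 scale. End with 'Score: <n>'."
    )

def parse_score(judge_output: str):
    # Extract the integer score; return None if the output is malformed.
    match = re.search(r"Score:\s*([1-5])", judge_output)
    return int(match.group(1)) if match else None

def stub_judge(prompt: str) -> str:
    # Stand-in for the fine-tuned evaluator LLM (assumption, not Hercule).
    return "The response states the same fact as the reference. Score: 5"

prompt = build_prompt(
    "भारत की राजधानी क्या है?",            # instruction in the target language
    "भारत की राजधानी नई दिल्ली है।",       # model response in the target language
    "The capital of India is New Delhi.",  # English reference answer
)
print(parse_score(stub_judge(prompt)))  # → 5
```

Because the reference lives in English while the response is in the target language, no gold answers are needed in the evaluated language, which is the low-resource advantage the abstract describes.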