Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs
October 17, 2024
Authors: Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Dilip Venkatesh, Raj Dabre, Anoop Kunchukuttan, Mitesh M. Khapra
cs.AI
Abstract
Evaluating machine-generated text remains a significant challenge in NLP,
especially for non-English languages. Current methodologies, including
automated metrics, human assessments, and LLM-based evaluations, predominantly
focus on English, revealing a significant gap in multilingual evaluation
frameworks. We introduce the Cross Lingual Auto Evaluation (CIA) Suite, an
extensible framework that includes evaluator LLMs (Hercule) and a novel test
set (Recon) specifically designed for multilingual evaluation. Our test set
features 500 human-annotated instructions spanning various task capabilities
along with human judgment scores across six languages. This enables
benchmarking of general-purpose multilingual LLMs and facilitates
meta-evaluation of evaluator LLMs. The proposed model, Hercule, is a
cross-lingual evaluation model that addresses the scarcity of reference answers
in the target language by learning to assign scores to responses based on
easily available reference answers in English. Our experiments show that
Hercule aligns more closely with human judgments than proprietary models,
demonstrating the effectiveness of cross-lingual evaluation in low-resource
scenarios. It is also effective in zero-shot evaluation on unseen
languages. This study is the first comprehensive examination of
cross-lingual evaluation using LLMs, presenting a scalable and effective
approach for multilingual assessment. All code, datasets, and models will be
publicly available to enable further research in this important area.
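The core mechanism described above, an evaluator LLM scoring a target-language response against an easily available English reference answer, can be sketched as follows. This is an illustrative mock under stated assumptions, not the actual Hercule pipeline: the prompt template, the "Score: <n>" output format, and `stub_judge` are hypothetical stand-ins for the fine-tuned evaluator model.

```python
import re

def build_prompt(instruction: str, response: str, reference_en: str) -> str:
    # Hypothetical prompt format: the evaluator sees the instruction and the
    # target-language response, plus an English reference answer, and is asked
    # to end its feedback with an integer score from 1 to 5.
    return (
        f"Instruction: {instruction}\n"
        f"Response: {response}\n"
        f"Reference answer (English): {reference_en}\n"
        "Rate the response on a 1-5 scale. End with 'Score: <n>'."
    )

def parse_score(judge_output: str):
    # Extract the integer score; return None if the output is malformed.
    match = re.search(r"Score:\s*([1-5])", judge_output)
    return int(match.group(1)) if match else None

def stub_judge(prompt: str) -> str:
    # Stand-in for the fine-tuned evaluator LLM (assumption, not Hercule).
    return "The response states the same fact as the reference. Score: 5"

prompt = build_prompt(
    "भारत की राजधानी क्या है?",            # instruction in the target language
    "भारत की राजधानी नई दिल्ली है।",       # model response in the target language
    "The capital of India is New Delhi.",  # English reference answer
)
print(parse_score(stub_judge(prompt)))  # → 5
```

Because the reference lives in English while the response is in the target language, no gold answers are needed in the evaluated language, which is the low-resource advantage the abstract describes.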