M-RewardBench: Evaluating Reward Models in Multilingual Settings
October 20, 2024
Authors: Srishti Gureja, Lester James V. Miranda, Shayekh Bin Islam, Rishabh Maheshwary, Drishti Sharma, Gusti Winata, Nathan Lambert, Sebastian Ruder, Sara Hooker, Marzieh Fadaee
cs.AI
Abstract
Reward models (RMs) have driven the state-of-the-art performance of LLMs
today by enabling the integration of human feedback into the language modeling
process. However, RMs are primarily trained and evaluated in English, and their
capabilities in multilingual settings remain largely understudied. In this
work, we conduct a systematic evaluation of several reward models in
multilingual settings. We first construct the first-of-its-kind multilingual RM
evaluation benchmark, M-RewardBench, consisting of 2.87k preference instances
across 23 typologically diverse languages, testing the chat, safety, reasoning,
and translation capabilities of RMs. We then rigorously evaluate a wide range
of reward models on M-RewardBench, offering fresh insights into their
performance across diverse languages. We identify a significant gap in RM
performance between English and non-English languages and show that RM
preferences can change substantially from one language to another. We also
present several findings on how different multilingual aspects impact RM
performance. Specifically, we show that RM performance improves with better
translation quality. Similarly, we demonstrate that the models exhibit
better performance for high-resource languages. We release the M-RewardBench
dataset and codebase with this study to facilitate a better understanding of
RM evaluation in multilingual settings.
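
As a rough illustration of how a preference benchmark like M-RewardBench is typically scored, the sketch below ranks a single preference instance with an off-the-shelf sequence-classification reward model from Hugging Face. The model name, data fields, and example instance are illustrative assumptions, not the paper's actual setup or schema.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed example RM; any scalar-output sequence-classification reward model would do.
model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def score(prompt: str, response: str) -> float:
    """Return the scalar reward the model assigns to a (prompt, response) pair."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits[0].item()

# Hypothetical preference instance (fields are assumptions, not M-RewardBench's schema).
instance = {
    "prompt": "Explique la différence entre une liste et un tuple en Python.",
    "chosen": "Une liste est mutable ; un tuple est immuable.",
    "rejected": "Les deux types sont identiques.",
}

# The RM ranks the instance correctly if the chosen response outscores the rejected one.
correct = score(instance["prompt"], instance["chosen"]) > score(instance["prompt"], instance["rejected"])
print("ranked correctly:", correct)
```

Under this kind of setup, benchmark accuracy is simply the fraction of preference instances in which the chosen response receives a higher reward than the rejected one.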