MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions

September 19, 2024
作者: Abdullatif Köksal, Marion Thaler, Ayyoob Imani, Ahmet Üstün, Anna Korhonen, Hinrich Schütze
cs.AI

Abstract

Instruction tuning enhances large language models (LLMs) by aligning them with human preferences across diverse tasks. Traditional approaches to create instruction tuning datasets face serious challenges for low-resource languages due to their dependence on data annotation. This work introduces a novel method, Multilingual Reverse Instructions (MURI), which generates high-quality instruction tuning datasets for low-resource languages without requiring human annotators or pre-existing multilingual models. Utilizing reverse instructions and a translation pipeline, MURI produces instruction-output pairs from existing human-written texts in low-resource languages. This method ensures cultural relevance and diversity by sourcing texts from different native domains and applying filters to eliminate inappropriate content. Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages. Evaluation by native speakers and fine-tuning experiments with mT5 models demonstrate the approach's effectiveness for both NLU and open-ended generation. We publicly release datasets and models at https://github.com/akoksal/muri.
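The core idea is to invert the usual annotation direction: an existing human-written document becomes the *output*, and an instruction that would plausibly elicit it is generated automatically, with translation bridging the low-resource language and an English LLM. Below is a minimal sketch of that reverse-instructions pipeline, assuming hypothetical helpers (`translate`, `generate_instruction`, `is_appropriate`) as stand-ins for an MT system, an instruction-generating LLM, and a content filter; it is an illustration of the idea, not the released implementation at https://github.com/akoksal/muri.

```python
# Sketch of a reverse-instructions pipeline (hypothetical helper names).
from typing import Callable, List, Tuple


def reverse_instructions_pipeline(
    documents: List[str],                          # human-written texts in a low-resource language
    lang: str,                                     # ISO code of that language
    translate: Callable[[str, str, str], str],     # MT system: (text, src, tgt) -> text
    generate_instruction: Callable[[str], str],    # English LLM: document -> instruction
    is_appropriate: Callable[[str], bool],         # content filter
) -> List[Tuple[str, str]]:
    """Turn raw documents into (instruction, output) pairs.

    Each document serves as the output; its instruction is produced by an
    English LLM from a translation of the document, then translated back
    into the source language.
    """
    pairs = []
    for doc in documents:
        if not is_appropriate(doc):
            continue                                        # drop inappropriate content
        english_doc = translate(doc, lang, "en")            # low-resource -> English
        english_instr = generate_instruction(english_doc)   # reverse instruction
        instr = translate(english_instr, "en", lang)        # English -> low-resource
        pairs.append((instr, doc))                          # output stays the original text
    return pairs


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; a real run would plug in
    # an MT model and an instruction-generating LLM here.
    demo = reverse_instructions_pipeline(
        documents=["<some human-written text>"],
        lang="xx",
        translate=lambda text, src, tgt: text,              # identity "translation"
        generate_instruction=lambda doc: f"Summarize: {doc[:40]}",
        is_appropriate=lambda doc: True,
    )
    print(demo)
```

Because the output side is always authentic native text rather than model-generated text, this design preserves the cultural relevance and diversity the abstract emphasizes, while the LLM is only ever used on the English side of the translation bridge.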
