MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions
September 19, 2024
Authors: Abdullatif Köksal, Marion Thaler, Ayyoob Imani, Ahmet Üstün, Anna Korhonen, Hinrich Schütze
cs.AI
Abstract
Instruction tuning enhances large language models (LLMs) by aligning them with human preferences across diverse tasks. Traditional approaches to creating instruction tuning datasets face serious challenges for low-resource languages due to their dependence on data annotation. This work introduces a novel method, Multilingual Reverse Instructions (MURI), which generates high-quality instruction tuning datasets for low-resource languages without requiring human annotators or pre-existing multilingual models. Utilizing reverse instructions and a translation pipeline, MURI produces instruction-output pairs from existing human-written texts in low-resource languages. This method ensures cultural relevance and diversity by sourcing texts from different native domains and applying filters to eliminate inappropriate content. Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages. Evaluation by native speakers and fine-tuning experiments with mT5 models demonstrate the approach's effectiveness for both NLU and open-ended generation. We publicly release datasets and models at https://github.com/akoksal/muri.
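The abstract describes the pipeline only at a high level: take an existing human-written document, derive an instruction for which that document is a plausible answer (reverse instructions), and route through translation so the instruction ends up in the source language. The sketch below is a minimal, hypothetical illustration of that flow, not the authors' implementation; the helpers `translate`, `generate_instruction`, and `passes_filters` are assumed placeholders for an MT model, an instruction-generating LLM, and the content filters the abstract mentions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InstructionPair:
    instruction: str  # instruction in the source (low-resource) language
    output: str       # the original human-written text, kept verbatim

def translate(text: str, src: str, tgt: str) -> str:
    # Hypothetical machine-translation helper; any off-the-shelf MT
    # system could stand in here.
    raise NotImplementedError

def generate_instruction(english_text: str) -> str:
    # Hypothetical LLM call implementing "reverse instructions": given a
    # text, produce an instruction for which that text is a good answer.
    raise NotImplementedError

def passes_filters(text: str) -> bool:
    # Placeholder for the filters the abstract mentions, which remove
    # inappropriate content before a pair is kept.
    return True

def reverse_instruction_pair(doc: str, lang: str) -> Optional[InstructionPair]:
    """Turn one human-written document into an instruction-output pair."""
    if not passes_filters(doc):
        return None
    english = translate(doc, src=lang, tgt="en")           # 1. to English
    instruction_en = generate_instruction(english)         # 2. reverse instruction
    instruction = translate(instruction_en, src="en", tgt=lang)  # 3. back to source
    return InstructionPair(instruction=instruction, output=doc)
```

Note the design point the abstract implies: the output side is the untouched native text, which is what preserves cultural relevance; only the instruction passes through translation.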