
Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages

December 12, 2024
Authors: Advait Joglekar, Srinivasan Umesh
cs.AI

Abstract

Neural Machine Translation (NMT) models are typically trained on datasets with limited exposure to Scientific, Technical and Educational domains. Translation models thus generally struggle with tasks that involve scientific understanding or technical jargon. Their performance is found to be even worse for low-resource Indian languages. Finding a translation dataset that caters to these domains in particular poses a difficult challenge. In this paper, we address this by creating a multilingual parallel corpus containing more than 2.8 million rows of English-to-Indic and Indic-to-Indic high-quality translation pairs across 8 Indian languages. We achieve this by bitext mining human-translated transcriptions of NPTEL video lectures. We also finetune and evaluate NMT models using this corpus and surpass all other publicly available models at in-domain tasks. We also demonstrate the potential for generalizing to out-of-domain translation tasks by improving the baseline by over 2 BLEU on average for these Indian languages on the Flores+ benchmark. We are pleased to release our model and dataset via this link: https://huggingface.co/SPRINGLab.
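
The abstract describes fine-tuning NMT models on the mined corpus and reporting BLEU on the Flores+ benchmark. The sketch below shows how a released checkpoint might be loaded from the Hugging Face Hub and scored with sacreBLEU; it is a minimal illustration, not the authors' pipeline. The repo id "SPRINGLab/shiksha-mt", the example sentence, and the placeholder reference are assumptions made here for illustration; the abstract only gives the organization page https://huggingface.co/SPRINGLab, which should be checked for the actual model and dataset names.

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    import sacrebleu

    # Placeholder repo id -- not confirmed by the paper; see
    # https://huggingface.co/SPRINGLab for the released artifacts.
    model_id = "SPRINGLab/shiksha-mt"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    # Translate one technical English sentence (example sentence is ours).
    # Depending on the checkpoint, a target-language token or
    # forced_bos_token_id may also be required for Indic targets.
    source = "The eigenvalues of a symmetric matrix are always real."
    inputs = tokenizer(source, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128)
    hypothesis = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    print(hypothesis)

    # Corpus-level BLEU with sacreBLEU, the metric used for the Flores+ comparison.
    # Each inner list is one reference stream aligned with the hypotheses.
    hypotheses = [hypothesis]
    references = [["<gold reference translation goes here>"]]
    print(sacrebleu.corpus_bleu(hypotheses, references).score)

In practice one would decode the full Flores+ devtest set for each language pair and compare corpus BLEU against the baseline model, which is how the reported 2+ BLEU average improvement is framed.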
