Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages
December 12, 2024
Authors: Advait Joglekar, Srinivasan Umesh
cs.AI
Abstract
Neural Machine Translation (NMT) models are typically trained on datasets
with limited exposure to Scientific, Technical and Educational domains.
Translation models therefore tend to struggle with tasks that involve
scientific understanding or technical jargon. Their performance is even worse
for low-resource Indian languages. Finding a translation dataset that caters
to these domains in particular poses a difficult challenge. In this
paper, we address this by creating a multilingual parallel corpus containing
more than 2.8 million rows of English-to-Indic and Indic-to-Indic high-quality
translation pairs across 8 Indian languages. We achieve this by bitext mining
human-translated transcriptions of NPTEL video lectures. We also fine-tune and
evaluate NMT models using this corpus, surpassing all other publicly available
models at in-domain tasks. We also demonstrate the potential for generalizing
to out-of-domain translation tasks by improving the baseline by over 2 BLEU on
average for these Indian languages on the Flores+ benchmark. We are pleased to
release our model and dataset via this link: https://huggingface.co/SPRINGLab.
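As a rough illustration of how the released artifacts could be used, the sketch below loads a translation model from the linked SPRINGLab Hugging Face organization with the `transformers` library and translates one technical English sentence. The repository id and the prompting details are assumptions for illustration, not taken from the paper; see https://huggingface.co/SPRINGLab for the actual model and dataset names.

```python
# A minimal sketch, assuming a seq2seq translation model released under the
# SPRINGLab organization. The repo id "SPRINGLab/shiksha-mt" is a placeholder
# assumption; replace it with the model actually published at the link above.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "SPRINGLab/shiksha-mt"  # hypothetical id, not confirmed by the paper

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# An in-domain (technical/educational) English source sentence.
text = "The eigenvalues of a symmetric matrix are always real."

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

If the released model follows a multilingual architecture that expects explicit source and target language tags, those would need to be supplied to the tokenizer or generation call; corpus-level scores such as the Flores+ BLEU comparison reported above can then be computed over the generated translations with a standard tool like sacrebleu.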