Shiksha: A Technical Domain focused Translation Dataset and Model for Indian Languages

December 12, 2024
Authors: Advait Joglekar, Srinivasan Umesh
cs.AI

Abstract

Neural Machine Translation (NMT) models are typically trained on datasets with limited exposure to the scientific, technical, and educational domains. As a result, translation models generally struggle with tasks that involve scientific understanding or technical jargon, and their performance is even worse for low-resource Indian languages. Finding a translation dataset that attends to these domains in particular is a difficult challenge. In this paper, we address this by creating a multilingual parallel corpus containing more than 2.8 million rows of high-quality English-to-Indic and Indic-to-Indic translation pairs across 8 Indian languages. We achieve this by bitext mining human-translated transcriptions of NPTEL video lectures. We also finetune and evaluate NMT models using this corpus and surpass all other publicly available models at in-domain tasks. We further demonstrate the potential for generalizing to out-of-domain translation tasks by improving the baseline by over 2 BLEU on average for these Indian languages on the Flores+ benchmark. We are pleased to release our model and dataset via this link: https://huggingface.co/SPRINGLab.
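The abstract names bitext mining as the way the corpus was built but does not describe the pipeline. As a rough illustration of the general idea only, the sketch below pairs source and target sentences by cosine similarity over sentence embeddings and keeps mutual best matches above a threshold. The function names, the threshold value, and the toy vectors are all assumptions made for illustration; real embeddings would come from some multilingual sentence encoder, and nothing here is taken from the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mine_bitext(src_vecs, tgt_vecs, threshold=0.8):
    """Return (src_index, tgt_index, score) for mutual best matches.

    A pair is kept only if each sentence is the other's nearest neighbour
    and their similarity reaches `threshold` (a hypothetical cut-off).
    """
    if not src_vecs or not tgt_vecs:
        return []
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Best target for every source sentence.
    best_tgt = [max(range(len(tgt_vecs)), key=lambda j: row[j]) for row in sims]
    # Best source for every target sentence.
    best_src = [
        max(range(len(src_vecs)), key=lambda i: sims[i][j])
        for j in range(len(tgt_vecs))
    ]
    pairs = []
    for i, j in enumerate(best_tgt):
        if best_src[j] == i and sims[i][j] >= threshold:
            pairs.append((i, j, sims[i][j]))
    return pairs


# Toy 3-dimensional "embeddings" standing in for real encoder output.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.7]]
tgt = [[0.0, 0.9, 0.1], [0.95, 0.05, 0.0], [0.0, 0.0, -1.0]]
for i, j, score in mine_bitext(src, tgt):
    print(f"source {i} <-> target {j} (similarity {score:.3f})")
```

Requiring the match to be mutual discards sentences whose nearest neighbour is only a loose fit, such as the third source sentence in the toy data. At the scale of a real corpus, an approximate nearest-neighbour index would normally replace the full similarity matrix.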
