

Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages

March 30, 2025
作者: Xabier de Zuazo, Eva Navas, Ibon Saratxaga, Inma Hernáez Rioja
cs.AI

Abstract

Automatic speech recognition systems have undoubtedly advanced with the integration of multilingual and multitask models such as Whisper, which have shown a promising ability to understand and process speech across a wide range of languages. Despite their robustness, these models often fall short in handling the linguistic distinctions of minority languages. This study addresses this gap by integrating traditional and novel language models with fine-tuned Whisper models to raise their performance in less commonly studied languages. Through rigorous fine-tuning and evaluation across multiple datasets, we demonstrate substantial improvements in word error rate, particularly in low-resource scenarios. Our approach not only takes advantage of the extensive data Whisper was pre-trained on, but also complements its linguistic adaptability by incorporating language models. We obtained improvements of up to 51% for in-distribution datasets and up to 34% for out-of-distribution sentences using statistical language models, while large language models provided moderate but consistently robust improvements across diverse linguistic contexts. The findings reveal that, while the integration reliably benefits all model sizes, the extent of improvement varies, highlighting the importance of optimized language model parameters. Finally, we emphasize the importance of selecting appropriate evaluation parameters when reporting results with transformer-based ASR models. In summary, this research clears the way for more inclusive ASR technologies that perform better across languages by enriching their linguistic knowledge. For further implementation details of this study, the technical documentation and source code are available at http://www.github.com/hitz-zentroa/whisper-lm.
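The kind of language-model integration described above can be illustrated with a minimal shallow-fusion sketch: an ASR decoder's n-best hypotheses are rescored by combining each acoustic log-probability with a weighted language-model score and a word-insertion bonus. The toy bigram table, the `alpha`/`beta` weights, and the n-best scores below are illustrative assumptions, not the paper's actual implementation; see its repository for the real integration with Whisper and statistical or large language models.

```python
def lm_log_prob(sentence, bigram_logprobs, unk=-10.0):
    """Toy bigram LM: sum log P(w_i | w_{i-1}) over the sentence.

    Unseen bigrams get a flat penalty `unk` (a stand-in for smoothing).
    """
    words = ["<s>"] + sentence.split()
    return sum(
        bigram_logprobs.get((prev, cur), unk)
        for prev, cur in zip(words, words[1:])
    )

def rescore(nbest, bigram_logprobs, alpha=0.5, beta=0.0):
    """Shallow fusion: pick the hypothesis maximizing
    acoustic_score + alpha * LM_score + beta * word_count."""
    best_hyp, _ = max(
        nbest,
        key=lambda h: h[1]
        + alpha * lm_log_prob(h[0], bigram_logprobs)
        + beta * len(h[0].split()),
    )
    return best_hyp

# Toy n-best list: (hypothesis, acoustic log-prob from the ASR decoder).
# The acoustic model slightly prefers the implausible transcription.
nbest = [
    ("recognise speech", -4.0),
    ("wreck a nice beach", -3.8),
]
# Toy bigram table favouring the linguistically plausible hypothesis.
bigrams = {("<s>", "recognise"): -1.0, ("recognise", "speech"): -0.5}

print(rescore(nbest, bigrams, alpha=0.5))  # -> recognise speech
```

With `alpha=0`, the raw acoustic scores win and the implausible hypothesis is chosen; tuning `alpha` (and `beta`) is exactly the kind of language-model parameter optimization the abstract flags as important.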
