Scaling Analysis of Interleaved Speech-Text Language Models
April 3, 2025
Authors: Gallil Maimon, Michael Hassid, Amit Roth, Yossi Adi
cs.AI
Abstract
Existing Speech Language Model (SLM) scaling analysis paints a bleak picture.
It predicts that SLMs require much more compute and data than text models,
leading some to question the feasibility of training high-quality SLMs.
However, modern SLMs are often initialised from pre-trained text language
models (TextLMs) using speech-text interleaving to allow knowledge transfer.
This raises the question: do interleaved SLMs scale more efficiently than
textless SLMs? In this paper we answer with a resounding yes! We conduct a
scaling analysis of interleaved SLMs by training several dozen models and
analysing the scaling trends. We find that under this setup SLMs scale more
efficiently with compute. Additionally, our results indicate that the scaling
dynamics differ significantly from those of textless SLMs, suggesting that
notably more of the compute budget should be allocated to increasing model
size rather than training tokens. We also study the role of synthetic data
and TextLM model families in unlocking this potential. Our scaled-up model
achieves performance comparable to leading models on speech semantic metrics
while using less compute and data than other approaches. We open-source
models, samples, and data: https://pages.cs.huji.ac.il/adiyoss-lab/sims.
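
To make the notion of compute scaling concrete, below is a minimal,
illustrative sketch of fitting a saturating power law to (compute, loss)
measurements from a set of training runs. The data points, functional form,
and initial guesses are assumptions for demonstration only, not the paper's
actual fitting procedure or results.

    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical training runs: total training compute (FLOPs) and final loss.
    compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
    loss = np.array([3.10, 2.95, 2.81, 2.70, 2.61])

    def scaling_law(log_c, a, alpha, l_inf):
        # Saturating power law in compute: loss = a * C^(-alpha) + irreducible loss,
        # parameterised via log10(C) for better numerical conditioning.
        return a * 10.0 ** (-alpha * log_c) + l_inf

    params, _ = curve_fit(scaling_law, np.log10(compute), loss,
                          p0=[10.0, 0.05, 2.0], maxfev=20000)
    a, alpha, l_inf = params
    print(f"fitted exponent alpha={alpha:.3f}, irreducible loss={l_inf:.3f}")

    # Extrapolate to a larger compute budget; comparing fitted curves of two
    # model families (e.g. interleaved vs. textless) is one way to compare
    # their compute-scaling efficiency.
    print(f"predicted loss at 1e21 FLOPs: {scaling_law(21.0, *params):.3f}")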