Soundwave: Less is More for Speech-Text Alignment in LLMs

February 18, 2025
Authors: Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li
cs.AI

Abstract

Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training remains underexplored. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which uses an efficient training strategy and a novel architecture to address these issues. Experimental results show that Soundwave outperforms the advanced Qwen2-Audio on speech translation and AIR-Bench speech tasks while using only one-fiftieth of the training data. Further analysis shows that Soundwave retains its general intelligence during conversation. The project is available at https://github.com/FreedomIntelligence/Soundwave.
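
To make the two problems concrete, below is a minimal, hypothetical sketch of an alignment module that could sit between a speech encoder and an LLM: a strided convolution shrinks the speech sequence toward text-token granularity (length inconsistency), and a linear projection maps speech features into the LLM's embedding space (representation gap). The module name, layer choices, and dimensions (e.g., a Whisper-style 1280-dim encoder and a 4096-dim LLM) are illustrative assumptions, not Soundwave's actual architecture.

```python
import torch
import torch.nn as nn


class SpeechTextAligner(nn.Module):
    """Illustrative sketch (not the paper's method):
    (1) a strided 1-D convolution downsamples the speech sequence to
        reduce the length mismatch with text tokens;
    (2) a linear adapter projects speech-encoder features into the
        LLM embedding space to narrow the representation gap.
    """

    def __init__(self, speech_dim: int = 1280, llm_dim: int = 4096, shrink: int = 4):
        super().__init__()
        # Shrink the time axis by a factor of `shrink` (hypothetical choice).
        self.shrink = nn.Conv1d(speech_dim, speech_dim, kernel_size=shrink, stride=shrink)
        # Map speech features into the LLM's embedding space.
        self.proj = nn.Linear(speech_dim, llm_dim)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, time, speech_dim) from a frozen speech encoder
        x = self.shrink(speech_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)  # (batch, time // shrink, llm_dim)


if __name__ == "__main__":
    aligner = SpeechTextAligner()
    feats = torch.randn(2, 100, 1280)  # e.g., a Whisper-style encoder output
    print(aligner(feats).shape)        # torch.Size([2, 25, 4096])
```

The aligned sequence could then be concatenated with text embeddings as input to the LLM; the actual training strategy and architecture are detailed in the paper and repository linked above.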
