Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations
March 8, 2025
Authors: Jeong Hun Yeo, Minsu Kim, Chae Won Kim, Stavros Petridis, Yong Man Ro
cs.AI
Abstract
We explore a novel zero-shot Audio-Visual Speech Recognition (AVSR)
framework, dubbed Zero-AVSR, which enables speech recognition in target
languages without requiring any audio-visual speech data in those languages.
Specifically, we introduce the Audio-Visual Speech Romanizer (AV-Romanizer),
which learns language-agnostic speech representations by predicting Roman text.
Then, by leveraging the strong multilingual modeling capabilities of Large
Language Models (LLMs), we propose converting the predicted Roman text into
language-specific graphemes, forming the proposed Cascaded Zero-AVSR. Taking it
a step further, we explore a unified Zero-AVSR approach by directly integrating
the audio-visual speech representations encoded by the AV-Romanizer into the
LLM. This is achieved through finetuning the adapter and the LLM using our
proposed multi-task learning scheme. To capture the wide spectrum of phonetic
and linguistic diversity, we also introduce a Multilingual Audio-Visual
Romanized Corpus (MARC) consisting of 2,916 hours of audio-visual speech data
across 82 languages, along with transcriptions in both language-specific
graphemes and Roman text. Extensive analysis and experiments confirm that the
proposed Zero-AVSR framework has the potential to expand language support
beyond the languages seen during the training of the AV-Romanizer.
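
The cascaded variant described in the abstract can be pictured as two stages: a romanizer that maps audio-visual input to Roman text, followed by an LLM that converts that Roman text into language-specific graphemes. The following is a minimal sketch of that data flow only, not the authors' implementation. The function names, the prompt wording, and the stub romanizer and LLM are all hypothetical stand-ins introduced for illustration; a real system would substitute trained models for both stubs.

```python
from typing import Any, Callable


def build_deromanization_prompt(roman_text: str, language: str) -> str:
    """Build an instruction asking an LLM to rewrite Roman text in native script.

    The prompt wording is an assumption made for this sketch.
    """
    return (
        f"Convert the following romanized {language} text into "
        f"{language} written in its native script.\n"
        f"Romanized: {roman_text}\n"
        f"Native script:"
    )


def cascaded_zero_avsr(
    av_input: Any,
    language: str,
    romanizer: Callable[[Any], str],
    llm: Callable[[str], str],
) -> str:
    """Two-stage pipeline: speech -> Roman text -> language-specific graphemes."""
    roman_text = romanizer(av_input)  # stage 1: language-agnostic Roman text
    prompt = build_deromanization_prompt(roman_text, language)
    return llm(prompt).strip()  # stage 2: LLM converts to native script


# Hypothetical stubs so the sketch runs without any trained model.
def stub_romanizer(av_input: Any) -> str:
    # Pretends the audio-visual input was recognized as this Roman text.
    return "annyeonghaseyo"


def stub_llm(prompt: str) -> str:
    # Looks up a canned answer when the known Roman text appears in the prompt.
    canned = {"annyeonghaseyo": "안녕하세요"}
    for roman, native in canned.items():
        if f"Romanized: {roman}\n" in prompt:
            return f" {native} "
    return ""


result = cascaded_zero_avsr(None, "Korean", stub_romanizer, stub_llm)
print(result)
```

Because the first stage emits Roman text regardless of the spoken language, only the second stage needs knowledge of the target language's script, which is the property the abstract relies on for zero-shot language expansion.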