OMCAT: Omni Context Aware Transformer
October 15, 2024
Authors: Arushi Goel, Karan Sapra, Matthieu Le, Rafael Valle, Andrew Tao, Bryan Catanzaro
cs.AI
Abstract
Large Language Models (LLMs) have made significant strides in text generation
and comprehension, with recent advancements extending into multimodal LLMs that
integrate visual and audio inputs. However, these models continue to struggle
with fine-grained, cross-modal temporal understanding, particularly when
correlating events across audio and video streams. We address these challenges
with two key contributions: a new dataset, OCTAV, and a new model, OMCAT.
First, OCTAV (Omni Context and Temporal Audio Video) is a novel dataset
designed to capture event transitions across audio and video. Second, OMCAT
(Omni Context Aware Transformer) is a powerful model that leverages RoTE
(Rotary Time Embeddings), an innovative extension of RoPE, to enhance temporal
grounding and computational efficiency in time-anchored tasks. Through a robust
three-stage training pipeline (feature alignment, instruction tuning, and
OCTAV-specific training), OMCAT excels in cross-modal temporal understanding. Our
model demonstrates state-of-the-art performance on Audio-Visual Question
Answering (AVQA) tasks and the OCTAV benchmark, showcasing significant gains in
temporal reasoning and cross-modal alignment, as validated through
comprehensive experiments and ablation studies. Our dataset and code will be
made publicly available. The link to our demo page is https://om-cat.github.io.