OMCAT: Omni Context Aware Transformer

Abstract

Large Language Models (LLMs) have made significant strides in text generation and comprehension, and recent work extends them to multimodal LLMs that integrate visual and audio inputs. However, these models still struggle with fine-grained, cross-modal temporal understanding, particularly when correlating events across audio and video streams. We address these challenges with two key contributions: a new dataset, OCTAV, and a new model, OMCAT. First, OCTAV (Omni Context and Temporal Audio Video) is a novel dataset designed to capture event transitions across audio and video. Second, OMCAT (Omni Context Aware Transformer) is a model that leverages RoTE (Rotary Time Embeddings), an extension of RoPE, to enhance temporal grounding and computational efficiency in time-anchored tasks. Through a three-stage training pipeline (feature alignment, instruction tuning, and OCTAV-specific training), OMCAT excels in cross-modal temporal understanding. Our model demonstrates state-of-the-art performance on Audio-Visual Question Answering (AVQA) tasks and the OCTAV benchmark, with significant gains in temporal reasoning and cross-modal alignment, as validated through comprehensive experiments and ablation studies. Our dataset and code will be made publicly available. Our demo page is available at https://om-cat.github.io.
