

Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary Resolution

September 19, 2024
作者: Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao
cs.AI

Abstract

Visual data comes in various forms, ranging from small icons of just a few pixels to long videos spanning hours. Existing multi-modal LLMs usually standardize these diverse visual inputs to a fixed resolution for visual encoders and yield similar numbers of tokens for LLMs. This approach is non-optimal for multimodal understanding and inefficient for processing inputs with long and short visual contents. To solve the problem, we propose Oryx, a unified multimodal architecture for the spatial-temporal understanding of images, videos, and multi-view 3D scenes. Oryx offers an on-demand solution to seamlessly and efficiently process visual inputs with arbitrary spatial sizes and temporal lengths through two core innovations: 1) a pre-trained OryxViT model that can encode images at any resolution into LLM-friendly visual representations; 2) a dynamic compressor module that supports 1x to 16x compression on visual tokens by request. These design features enable Oryx to accommodate extremely long visual contexts, such as videos, with lower resolution and high compression while maintaining high recognition precision for tasks like document understanding with native resolution and no compression. Beyond the architectural improvements, enhanced data curation and specialized training on long-context retrieval and spatial-aware data help Oryx achieve strong capabilities in image, video, and 3D multimodal understanding simultaneously. Our work is open-sourced at https://github.com/Oryx-mllm/Oryx.
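The abstract's second innovation, a dynamic compressor that reduces visual tokens by a requested ratio (1x to 16x), can be illustrated with a minimal sketch. The snippet below uses simple average pooling over groups of tokens; this is an assumption for illustration only, not the paper's actual compressor module, and the function name `compress_tokens` is hypothetical.

```python
import numpy as np

def compress_tokens(tokens: np.ndarray, ratio: int) -> np.ndarray:
    """Illustrative on-demand token compression via average pooling.

    tokens: (num_tokens, dim) array of visual tokens.
    ratio:  requested compression factor (1 = no compression).
    NOTE: a hypothetical stand-in for Oryx's dynamic compressor,
    which may use a different (learned) pooling mechanism.
    """
    n, d = tokens.shape
    assert n % ratio == 0, "token count must divide evenly by the ratio"
    # Group consecutive tokens and average each group.
    return tokens.reshape(n // ratio, ratio, d).mean(axis=1)

# Long video context: compress aggressively (16x).
video_tokens = np.random.randn(1024, 64)
print(compress_tokens(video_tokens, 16).shape)  # (64, 64)

# Document understanding: keep every token (1x, native resolution).
doc_tokens = np.random.randn(256, 64)
print(compress_tokens(doc_tokens, 1).shape)  # (256, 64)
```

The point of the sketch is the per-input choice of ratio: the same model can trade token budget for fidelity depending on whether the input is an hours-long video or a high-resolution document.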

