Look Every Frame All at Once: Video-Ma^2mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing

November 29, 2024
Authors: Hosu Lee, Junho Kim, Hyunjun Kim, Yong Man Ro
cs.AI

Abstract

With the growing scale and complexity of video data, efficiently processing long video sequences poses significant challenges due to the quadratic growth in memory and computational demands of existing transformer-based Large Multi-modal Models (LMMs). To address these issues, we introduce Video-Ma^2mba, a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework, replacing the attention mechanisms. This allows LMMs to scale linearly in time and memory requirements, making it feasible to handle long-duration video content. Furthermore, we enhance memory efficiency by introducing the Multi-Axis Gradient Checkpointing (MA-GC) method, which strategically manages memory by retaining only essential activations across multiple computational axes. Our approach significantly reduces the memory footprint compared to standard gradient checkpointing. Empirical analyses show that Video-Ma^2mba can process extensive video sequences (equivalent to millions of tokens, or over two hours of continuous video at 1 FPS) on a single GPU. By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks, demonstrating substantial advantages over existing frameworks.
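
The abstract does not spell out how checkpointing across "multiple computational axes" differs from the standard scheme, so the following is a minimal PyTorch sketch of the idea: checkpoint along the layer (depth) axis as usual, and additionally split the sequence axis into chunks so that only chunk-boundary activations and recurrent states stay resident between forward and backward. The `Mamba2Block` interface, the `chunk_len` value, and the `(chunk, state) -> (out, new_state)` convention are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedSSMStack(nn.Module):
    """Sketch of gradient checkpointing along two axes: layers and sequence chunks."""

    def __init__(self, blocks: nn.ModuleList, chunk_len: int = 4096):
        super().__init__()
        self.blocks = blocks        # axis 1: the layer (depth) axis
        self.chunk_len = chunk_len  # axis 2: the sequence (time) axis

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        for block in self.blocks:
            state = None  # hypothetical recurrent SSM state carried across chunks
            outs = []
            for chunk in x.split(self.chunk_len, dim=1):
                # Each (layer, chunk) cell is recomputed during backward,
                # so at any moment only one chunk's activations per layer
                # plus the boundary states need to be held in memory.
                out, state = checkpoint(block, chunk, state, use_reentrant=False)
                outs.append(out)
            x = torch.cat(outs, dim=1)
        return x
```

Standard gradient checkpointing saves activations only at layer boundaries, so peak memory still grows with the full sequence length at each boundary. Chunking the sequence axis as well bounds the resident activations by a single chunk per layer, which is consistent with the abstract's claim that multi-hour 1 FPS inputs fit on a single GPU.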
