LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness

September 26, 2024
作者: Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, Xihui Liu
cs.AI

Abstract

Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains 2D image understanding and vision-language conversation capabilities comparable to those of LLaVA.
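
To make the 3D Patch idea described above more concrete, the sketch below shows one way 2D CLIP patch features could be lifted into 3D-aware tokens by adding an embedding of their back-projected 3D positions before the tokens are passed to the LMM. This is a minimal illustrative PyTorch sketch, not the authors' implementation: the module name `Patch3DEmbedding`, the two-layer MLP position encoder, and all tensor shapes are assumptions for exposition only.

```python
import torch
import torch.nn as nn


class Patch3DEmbedding(nn.Module):
    """Hypothetical sketch of the "3D Patch" idea: 2D CLIP patch features
    are augmented with an embedding of their 3D positions (e.g. patch centers
    back-projected with depth and camera pose). Names and the MLP design are
    illustrative assumptions, not taken from the paper's code."""

    def __init__(self, feat_dim: int = 1024):
        super().__init__()
        # Small MLP mapping (x, y, z) world coordinates to the feature dimension.
        self.pos_mlp = nn.Sequential(
            nn.Linear(3, feat_dim),
            nn.GELU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, clip_patch_feats: torch.Tensor, patch_xyz: torch.Tensor) -> torch.Tensor:
        # clip_patch_feats: (num_views, num_patches, feat_dim) 2D CLIP patch features
        # patch_xyz:        (num_views, num_patches, 3) 3D position of each patch
        pos_emb = self.pos_mlp(patch_xyz)          # (num_views, num_patches, feat_dim)
        patches_3d = clip_patch_feats + pos_emb    # position-aware "3D patches"
        # Flatten across views so the tokens can be consumed by the 2D LMM as usual.
        return patches_3d.flatten(0, 1)            # (num_views * num_patches, feat_dim)


if __name__ == "__main__":
    feats = torch.randn(4, 576, 1024)  # 4 posed views, 24x24 CLIP patches each (assumed shapes)
    xyz = torch.rand(4, 576, 3)        # placeholder back-projected patch centers
    tokens = Patch3DEmbedding(1024)(feats, xyz)
    print(tokens.shape)                # torch.Size([2304, 1024])
```

Under this reading, the 2D LMM's token interface is unchanged; only a position embedding is added to the existing CLIP patch features, which is consistent with the abstract's claim that 2D understanding is preserved while 3D awareness is gained.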
