
Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning

September 30, 2024
Authors: Weitai Kang, Haifeng Huang, Yuzhang Shang, Mubarak Shah, Yan Yan
cs.AI

Abstract

Recent advancements in 3D Large Language Models (3DLLMs) have highlighted their potential for building general-purpose agents in the 3D real world, yet challenges remain due to the lack of high-quality robust instruction-following data, which limits the discriminative power and generalization of 3DLLMs. In this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale instruction-following data generated by our novel data engine, Robust Instruction Generation (RIG). RIG generates two key types of instruction data: 1) Adversarial Instruction-following data, which features mixed negative and positive samples to enhance the model's discriminative understanding; 2) Diverse Instruction-following data, which contains various instruction styles to enhance the model's generalization. As a result, we construct 1 million instruction-following samples, consisting of 344K Adversarial samples, 508K Diverse samples, and 165K benchmark training set samples. To better handle these complex instructions, Robin3D first incorporates a Relation-Augmented Projector to enhance spatial understanding, and then strengthens object referring and grounding ability through ID-Feature Bonding. Robin3D consistently outperforms previous methods across five widely used 3D multimodal learning benchmarks, without the need for task-specific fine-tuning. Notably, we achieve a 7.8% improvement in the grounding task (Multi3DRefer) and a 6.9% improvement in the captioning task (Scan2Cap).
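The abstract does not specify how the three data sources are assembled into the 1M-sample training set. As a rough illustration only, the sketch below shows one plausible way such a mixed instruction-following corpus could be combined and shuffled; all function names and the record format are hypothetical and do not come from the paper or its code.

```python
import random

# Hypothetical loaders: each returns a list of instruction-response dicts.
# Example contents mirror the sample types the abstract describes.

def load_adversarial_samples():
    # Mixed negative and positive samples (344K in the paper), e.g. queries
    # about objects absent from the scene, meant to sharpen discrimination.
    return [{"instruction": "Is there a red chair in the room?",
             "response": "No, there is no red chair in this scene."}]

def load_diverse_samples():
    # Varied instruction styles (508K in the paper) to aid generalization.
    return [{"instruction": "Briefly describe the object by the window.",
             "response": "A wooden desk with a lamp on it."}]

def load_benchmark_samples():
    # Original benchmark training sets (165K in the paper).
    return [{"instruction": "Locate the sofa.",
             "response": "<obj_12>"}]

def build_training_mix(seed: int = 0):
    """Concatenate the three sources and shuffle them into one training set."""
    data = (load_adversarial_samples()
            + load_diverse_samples()
            + load_benchmark_samples())
    random.Random(seed).shuffle(data)
    return data

if __name__ == "__main__":
    mix = build_training_mix()
    print(f"{len(mix)} instruction-following samples")
```

A simple concatenate-and-shuffle mix is only one option; curriculum ordering or per-source sampling weights would be equally consistent with the abstract.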
