
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models

April 7, 2025
作者: Sai Kumar Dwivedi, Dimitrije Antić, Shashank Tripathi, Omid Taheri, Cordelia Schmid, Michael J. Black, Dimitrios Tzionas
cs.AI

Abstract

We introduce InteractVLM, a novel method to estimate 3D contact points on human bodies and objects from single in-the-wild images, enabling accurate human-object joint reconstruction in 3D. This is challenging due to occlusions, depth ambiguities, and widely varying object shapes. Existing methods rely on 3D contact annotations collected via expensive motion-capture systems or tedious manual labeling, limiting scalability and generalization. To overcome this, InteractVLM harnesses the broad visual knowledge of large Vision-Language Models (VLMs), fine-tuned with limited 3D contact data. However, directly applying these models is non-trivial, as they reason only in 2D, while human-object contact is inherently 3D. Thus, we introduce a novel Render-Localize-Lift module that: (1) embeds 3D body and object surfaces in 2D space via multi-view rendering, (2) trains a novel multi-view localization model (MV-Loc) to infer contacts in 2D, and (3) lifts these to 3D. Additionally, we propose a new task called Semantic Human Contact estimation, where human contact predictions are conditioned explicitly on object semantics, enabling richer interaction modeling. InteractVLM outperforms existing work on contact estimation and also facilitates 3D reconstruction from an in-the-wild image. Code and models are available at https://interactvlm.is.tue.mpg.de.
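The three steps of the Render-Localize-Lift module described above can be illustrated with a minimal numpy sketch. This is a toy stand-in, not the paper's implementation: `render_views`, `localize_2d`, and `lift_to_3d` are hypothetical functions, the "localization model" is replaced by a simple distance test in place of the trained MV-Loc network, and orthographic projection stands in for the paper's rendering.

```python
import numpy as np

def render_views(verts, n_views=4):
    """Step 1: embed a 3D surface in 2D by rendering it from multiple
    viewpoints (here: orthographic projection after rotating the
    vertices about the vertical axis)."""
    views = []
    for k in range(n_views):
        theta = 2 * np.pi * k / n_views
        rot = np.array([[np.cos(theta), 0, np.sin(theta)],
                        [0,             1, 0],
                        [-np.sin(theta), 0, np.cos(theta)]])
        rotated = verts @ rot.T
        views.append(rotated[:, :2])  # drop depth: orthographic image plane
    return views

def localize_2d(view_xy, center, radius):
    """Step 2: toy stand-in for MV-Loc -- label projected vertices
    near a 2D query point as 'in contact'."""
    return np.linalg.norm(view_xy - center, axis=1) < radius

def lift_to_3d(masks, min_votes=2):
    """Step 3: lift per-view 2D contact masks back onto the 3D surface
    by majority voting across views."""
    votes = np.sum(np.stack(masks), axis=0)
    return votes >= min_votes

# Toy example: vertices sampled on a unit sphere, with a 2D contact
# query at the center of each rendered view.
rng = np.random.default_rng(0)
verts = rng.normal(size=(500, 3))
verts /= np.linalg.norm(verts, axis=1, keepdims=True)

views = render_views(verts, n_views=4)
masks = [localize_2d(v, center=np.zeros(2), radius=0.3) for v in views]
contact = lift_to_3d(masks, min_votes=2)
print(contact.sum(), "of", len(verts), "vertices flagged as contact")
```

The voting step in `lift_to_3d` reflects the intuition that a genuine 3D contact point should appear as a 2D contact in several of the rendered views, whereas a spurious single-view detection is discarded.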

