InteractVLM: 3D Interaction Reasoning from 2D Foundational Models

Abstract

We introduce InteractVLM, a novel method to estimate 3D contact points on human bodies and objects from single in-the-wild images, enabling accurate joint human-object reconstruction in 3D. This is challenging due to occlusions, depth ambiguities, and widely varying object shapes. Existing methods rely on 3D contact annotations collected via expensive motion-capture systems or tedious manual labeling, which limits their scalability and generalization. To overcome this, InteractVLM harnesses the broad visual knowledge of large Vision-Language Models (VLMs), fine-tuned with limited 3D contact data. However, directly applying these models is non-trivial, as they reason only in 2D, while human-object contact is inherently 3D. Thus, we introduce a novel Render-Localize-Lift module that: (1) embeds 3D body and object surfaces in 2D space via multi-view rendering, (2) trains a novel multi-view localization model (MV-Loc) to infer contacts in 2D, and (3) lifts these to 3D. Additionally, we propose a new task called Semantic Human Contact estimation, where human contact predictions are conditioned explicitly on object semantics, enabling richer interaction modeling. InteractVLM outperforms existing work on contact estimation and also facilitates 3D reconstruction from an in-the-wild image. Code and models are available at https://interactvlm.is.tue.mpg.de.
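
To make the "lift" step of Render-Localize-Lift more concrete, the following is a minimal sketch and not the authors' implementation. It assumes an orthographic camera per view, vertices normalized to [-1, 1], and per-view 2D contact probability maps that are already available (in the paper these would come from MV-Loc). The function names `rotation_y`, `project_orthographic`, and `lift_contacts` are hypothetical, and visibility is ignored for brevity, so occluded vertices also receive votes.

```python
import numpy as np

def rotation_y(angle_rad):
    """Rotation matrix about the vertical (y) axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project_orthographic(vertices, rotation, image_size):
    """Rotate vertices into a view and map x, y in [-1, 1] to pixel indices."""
    cam = vertices @ rotation.T
    # x maps to columns; y maps to rows, with row 0 at the top of the image.
    cols = np.round((cam[:, 0] + 1.0) * 0.5 * (image_size - 1)).astype(int)
    rows = np.round((1.0 - (cam[:, 1] + 1.0) * 0.5) * (image_size - 1)).astype(int)
    cols = np.clip(cols, 0, image_size - 1)
    rows = np.clip(rows, 0, image_size - 1)
    return rows, cols

def lift_contacts(vertices, masks, rotations, threshold=0.5):
    """Average per-view 2D contact probabilities at each projected vertex.

    Assumption: no visibility test is applied (a simplification for this sketch).
    """
    image_size = masks[0].shape[0]
    scores = np.zeros(len(vertices))
    for mask, rot in zip(masks, rotations):
        rows, cols = project_orthographic(vertices, rot, image_size)
        scores += mask[rows, cols]  # sample the 2D map at each vertex's pixel
    scores /= len(masks)
    return scores, scores >= threshold

# Toy example: two vertices, a front view and a back view (180 degrees about y).
vertices = np.array([[1.0, 1.0, 0.0], [-1.0, -1.0, 0.0]])
front = np.zeros((5, 5))
front[0, 4] = 1.0  # the first vertex projects to the top-right pixel in the front view
back = np.zeros((5, 5))
back[0, 0] = 1.0   # and to the top-left pixel in the back view
rotations = [rotation_y(0.0), rotation_y(np.pi)]
scores, contact = lift_contacts(vertices, [front, back], rotations)
```

Averaging across views is only one possible aggregation rule; a real pipeline would more likely weight each view by vertex visibility, which needs a depth or rasterization pass that is omitted here.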

