4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models
March 13, 2025
Authors: Wanhua Li, Renping Zhou, Jiawei Zhou, Yingwei Song, Johannes Herter, Minghan Qin, Gao Huang, Hanspeter Pfister
cs.AI
Abstract
Learning 4D language fields to enable time-sensitive, open-ended language
queries in dynamic scenes is essential for many real-world applications. While
LangSplat successfully grounds CLIP features into 3D Gaussian representations,
achieving precision and efficiency in static 3D scenes, it cannot handle
dynamic 4D fields because CLIP, designed for static image-text tasks, fails to
capture temporal dynamics in videos. Real-world environments are inherently
dynamic, with object semantics evolving over time. Building a precise 4D
language field necessitates obtaining pixel-aligned, object-wise video
features, which current vision models struggle to achieve. To address these
challenges, we propose 4D LangSplat, which learns 4D language fields to handle
time-agnostic or time-sensitive open-vocabulary queries in dynamic scenes
efficiently. 4D LangSplat bypasses learning the language field from vision
features and instead learns directly from text generated from object-wise video
captions via Multimodal Large Language Models (MLLMs). Specifically, we propose
a multimodal object-wise video prompting method, consisting of visual and text
prompts that guide MLLMs to generate detailed, temporally consistent,
high-quality captions for objects throughout a video. These captions are
encoded using a Large Language Model into high-quality sentence embeddings,
which then serve as pixel-aligned, object-specific feature supervision,
facilitating open-vocabulary text queries through shared embedding spaces.
Recognizing that objects in 4D scenes exhibit smooth transitions across states,
we further propose a status deformable network to model these continuous
changes over time effectively. Our results across multiple benchmarks
demonstrate that 4D LangSplat attains precise and efficient results for both
time-sensitive and time-agnostic open-vocabulary queries.
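To make the abstract's pipeline concrete, the following is a minimal PyTorch sketch, not the authors' code, of two of the ideas described above: supervising per-object features with LLM sentence embeddings of MLLM-generated captions, and a status deformable network that models an object's semantics at time t as a smooth, time-conditioned mixture of a few learned state embeddings. The class names, feature dimensions, and the softmax-mixture parameterization are illustrative assumptions.

# A minimal sketch of (1) caption-embedding supervision and (2) a
# time-conditioned state mixture, under the assumptions stated above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StatusDeformableNetwork(nn.Module):
    """Predicts an object's semantic feature at time t as a convex
    combination of K learnable state embeddings (assumption: mixture
    weights come from a small MLP conditioned on normalized time)."""
    def __init__(self, num_states: int = 4, feat_dim: int = 768, hidden: int = 64):
        super().__init__()
        # K prototype "state" embeddings per object (e.g., door open vs. closed).
        self.states = nn.Parameter(torch.randn(num_states, feat_dim))
        # Tiny MLP mapping time t in [0, 1] to mixture logits over the K states.
        self.weight_mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, num_states),
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (B, 1) timestamps; softmax weights vary smoothly with t,
        # yielding smooth transitions between discrete semantic states.
        w = F.softmax(self.weight_mlp(t), dim=-1)   # (B, K)
        return w @ self.states                       # (B, feat_dim)

def caption_supervision_loss(pred_feat, caption_emb):
    """Cosine loss against the LLM sentence embedding of the object's
    MLLM-generated caption at this frame (embeddings assumed precomputed)."""
    return 1.0 - F.cosine_similarity(pred_feat, caption_emb, dim=-1).mean()

# Usage: one optimization step for a single object track.
model = StatusDeformableNetwork()
t = torch.rand(8, 1)               # 8 sampled timestamps
caption_emb = torch.randn(8, 768)  # stand-in for real sentence embeddings
loss = caption_supervision_loss(model(t), caption_emb)
loss.backward()

In the full method these per-object embeddings would supervise features rendered from the 4D Gaussians at the pixels of each object's mask, so that open-vocabulary text queries can be answered in the shared sentence-embedding space.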