3D Scene Understanding Through Local Random Access Sequence Modeling

April 4, 2025
Authors: Wanhee Lee, Klemen Kotar, Rahul Mysore Venkatesh, Jared Watrous, Honglin Chen, Khai Loong Aw, Daniel L. K. Yamins
cs.AI

Abstract

3D scene understanding from single images is a pivotal problem in computer vision with numerous downstream applications in graphics, augmented reality, and robotics. While diffusion-based modeling approaches have shown promise, they often struggle to maintain object and scene consistency, especially in complex real-world scenarios. To address these limitations, we propose an autoregressive generative approach called Local Random Access Sequence (LRAS) modeling, which uses local patch quantization and randomly ordered sequence generation. By utilizing optical flow as an intermediate representation for 3D scene editing, our experiments demonstrate that LRAS achieves state-of-the-art novel view synthesis and 3D object manipulation capabilities. Furthermore, we show that our framework naturally extends to self-supervised depth estimation through a simple modification of the sequence design. By achieving strong performance on multiple 3D scene understanding tasks, LRAS provides a unified and effective framework for building the next generation of 3D vision models.
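The abstract names two ingredients, local patch quantization and randomly ordered sequence generation, without giving details. The toy sketch below only illustrates the general idea and is not the authors' implementation. It assumes a grayscale image stored as nested lists with values in [0, 1], a stand-in quantizer that codes each patch by its mean intensity (a real system would presumably use a learned tokenizer), and a hypothetical sequence layout in which every content token is preceded by a position token so that patches can appear in any order.

```python
import random

def quantize_patches(image, patch, levels):
    """Map each non-overlapping patch to a discrete code using only its own pixels.

    Assumes image height and width are multiples of `patch`.
    The mean-intensity code is a placeholder for a learned local tokenizer.
    """
    h, w = len(image), len(image[0])
    cols = w // patch
    tokens = {}
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            vals = [image[y][x]
                    for y in range(r, r + patch)
                    for x in range(c, c + patch)]
            mean = sum(vals) / len(vals)
            code = min(int(mean * levels), levels - 1)
            tokens[(r // patch) * cols + (c // patch)] = code
    return tokens

def to_random_order_sequence(tokens, rng):
    """Emit (position, value) token pairs in a shuffled patch order."""
    order = list(tokens)
    rng.shuffle(order)
    seq = []
    for idx in order:
        seq.append(("pos", idx))          # which patch comes next
        seq.append(("val", tokens[idx]))  # that patch's quantized content
    return seq

def decode_sequence(seq):
    """Recover the patch-index -> code map from any ordering of pairs."""
    return {seq[i][1]: seq[i + 1][1] for i in range(0, len(seq), 2)}

image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.5, 0.5, 0.25, 0.25],
    [0.5, 0.5, 0.25, 0.25],
]
tokens = quantize_patches(image, patch=2, levels=4)
seq = to_random_order_sequence(tokens, random.Random(0))
assert decode_sequence(seq) == tokens  # order does not affect the decoded content
```

Because each content token carries its own position, a sequence model trained on such data could in principle be asked to fill in patches at arbitrary locations in arbitrary order, which is the property the phrase "random access" suggests.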
