Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach

Abstract

We present a novel video generation framework that integrates three-dimensional geometry and dynamic awareness. To achieve this, we augment 2D videos with 3D point trajectories and align them in pixel space. The resulting 3D-aware video dataset, PointVid, is then used to fine-tune a latent diffusion model, enabling it to track 2D objects with 3D Cartesian coordinates. Building on this, we regularize the shape and motion of objects in the video to eliminate undesired artifacts, e.g., nonphysical deformation. Consequently, we enhance the quality of generated RGB videos and alleviate common issues such as object morphing, which are prevalent in current video models due to a lack of shape awareness. With our 3D augmentation and regularization, our model is capable of handling contact-rich scenarios such as task-oriented videos. These videos involve complex interactions of solids, where 3D information is essential for perceiving deformation and contact. Furthermore, our model improves the overall quality of video generation by promoting the 3D consistency of moving objects and reducing abrupt changes in shape and motion.
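To make the idea of regularizing shape and motion on 3D point trajectories concrete, the following is a minimal illustrative sketch, not the method used in this work. It assumes tracked points are given as per-frame lists of (x, y, z) coordinates. The function names `rigidity_penalty` and `smoothness_penalty` are hypothetical. The two penalties (change in pairwise distance between consecutive frames, and squared second-order temporal difference) are generic choices standing in for whatever regularizers the actual model uses.

```python
import math
from itertools import combinations
from typing import List, Tuple

# Assumption: tracks[t][i] is the 3D position of tracked point i at frame t.
Point = Tuple[float, float, float]
Tracks = List[List[Point]]


def rigidity_penalty(tracks: Tracks) -> float:
    """Mean squared change in pairwise point distance between consecutive frames.

    Zero for rigid motion; grows when the tracked shape stretches or morphs.
    """
    if len(tracks) < 2:
        return 0.0
    pairs = list(combinations(range(len(tracks[0])), 2))
    if not pairs:
        return 0.0
    total = 0.0
    for prev, curr in zip(tracks, tracks[1:]):
        for i, j in pairs:
            d_prev = math.dist(prev[i], prev[j])
            d_curr = math.dist(curr[i], curr[j])
            total += (d_curr - d_prev) ** 2
    return total / (len(pairs) * (len(tracks) - 1))


def smoothness_penalty(tracks: Tracks) -> float:
    """Mean squared second-order finite difference of each trajectory.

    Zero for constant-velocity motion; grows with abrupt changes in motion.
    """
    if len(tracks) < 3 or not tracks[0]:
        return 0.0
    n_points = len(tracks[0])
    total = 0.0
    for a, b, c in zip(tracks, tracks[1:], tracks[2:]):
        for i in range(n_points):
            total += sum((a[i][k] - 2.0 * b[i][k] + c[i][k]) ** 2 for k in range(3))
    return total / (n_points * (len(tracks) - 2))


if __name__ == "__main__":
    # Two points translating together at constant velocity: both penalties vanish.
    rigid = [
        [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
        [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
        [(2.0, 0.0, 0.0), (3.0, 0.0, 0.0)],
    ]
    print(rigidity_penalty(rigid), smoothness_penalty(rigid))
```

In a training setting, penalties of this kind would typically be computed on predicted point coordinates in a differentiable framework and added to the diffusion loss with a weighting factor; that integration is outside the scope of this sketch.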
