LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias

Summary

We propose the Large View Synthesis Model (LVSM), a novel transformer-based approach for scalable and generalizable novel view synthesis from sparse-view inputs. We introduce two architectures: (1) an encoder-decoder LVSM, which encodes input image tokens into a fixed number of 1D latent tokens, functioning as a fully learned scene representation, and decodes novel-view images from them; and (2) a decoder-only LVSM, which directly maps input images to novel-view outputs, completely eliminating intermediate scene representations. Both models bypass the 3D inductive biases used in previous methods -- from 3D representations (e.g., NeRF, 3DGS) to network designs (e.g., epipolar projections, plane sweeps) -- addressing novel view synthesis with a fully data-driven approach. While the encoder-decoder model offers faster inference due to its independent latent representation, the decoder-only LVSM achieves superior quality, scalability, and zero-shot generalization, outperforming previous state-of-the-art methods by 1.5 to 3.5 dB PSNR. Comprehensive evaluations across multiple datasets demonstrate that both LVSM variants achieve state-of-the-art novel view synthesis quality. Notably, our models surpass all previous methods even with reduced computational resources (1-2 GPUs). Please see our website for more details: https://haian-jin.github.io/projects/LVSM/
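
To give a sense of scale for the reported 1.5 to 3.5 dB PSNR gains, the sketch below computes PSNR from its standard definition and converts a dB gain into the corresponding reduction in mean squared error. It is only an illustration: the function names `psnr` and `mse_ratio_for_gain`, and the assumption that pixel values lie in [0, 1], are hypothetical and are not taken from the LVSM evaluation code.

```python
import math


def psnr(reference, rendered, max_value=1.0):
    """PSNR in dB between two equal-length sequences of pixel values.

    Assumes pixel values lie in [0, max_value] (an assumption of this sketch).
    """
    if len(reference) != len(rendered):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db decibels."""
    return 10.0 ** (gain_db / 10.0)


if __name__ == "__main__":
    for gain in (1.5, 3.5):
        print(f"+{gain} dB PSNR -> MSE reduced by a factor of {mse_ratio_for_gain(gain):.2f}")
```

Because PSNR is logarithmic in the mean squared error, a gain of 1.5 dB corresponds to roughly 1.4 times lower error and 3.5 dB to roughly 2.2 times lower error at a fixed peak value.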
