Related Work

Instance-specific 3D representations

Driven by the recent emergence of neural fields, a growing number of methods seek to accurately capture the details of a specific object or scene given multiple images. Leveraging volumetric, implicit (NeuS, UNISURF), mesh-based (NeRS), or hybrid (TensoRF, Plenoxels) representations, these methods learn instance-specific representations capable of synthesizing novel views.

However, as these methods do not learn generic data-driven priors, they typically require densely sampled views to infer a geometrically consistent underlying representation, and they cannot predict beyond what they directly observe.

Projection-guided generalizable view synthesis

Several methods have aimed to learn models capable of view synthesis across instances. While initial attempts (Scene Representation Networks) used neural fields conditioned on a global latent variable, subsequent approaches (pixelNeRF, MVSNeRF, GRF) obtained significant improvements by instead using features extracted via projection onto the context views. NerFormer further demonstrated the benefits of learning the aggregation mechanism across the features along a query ray, but projection-guided features remained the fundamental building block.
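To make the projection-guided mechanism concrete, the following is a minimal sketch (in PyTorch, with assumed tensor shapes, camera conventions, and function names; a simplification rather than the exact pipeline of any cited method): each 3D query point is projected into every context view, per-view CNN features are sampled at the resulting pixel locations, and the samples are aggregated across views.

```python
import torch
import torch.nn.functional as F

def sample_projected_features(points, feat_maps, cams_K, cams_world_to_cam):
    """Sketch of projection-guided feature extraction (assumed shapes).

    points: (P, 3) world-space query points.
    feat_maps: (V, C, H, W) per-view CNN feature maps.
    cams_K: (V, 3, 3) intrinsics; cams_world_to_cam: (V, 3, 4) extrinsics.
    Returns (P, C) features aggregated over the V context views.
    """
    V, _, H, W = feat_maps.shape
    homog = torch.cat([points, torch.ones_like(points[:, :1])], dim=-1)  # (P, 4)
    feats = []
    for v in range(V):
        cam_pts = (cams_world_to_cam[v] @ homog.T).T            # (P, 3) camera coords
        pix = (cams_K[v] @ cam_pts.T).T                          # (P, 3) homogeneous pixels
        uv = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)            # perspective divide
        # Normalize pixel coordinates to [-1, 1] for grid_sample.
        grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,
                            2 * uv[:, 1] / (H - 1) - 1], dim=-1).view(1, -1, 1, 2)
        f = F.grid_sample(feat_maps[v:v + 1], grid, align_corners=True)  # (1, C, P, 1)
        feats.append(f.view(-1, points.shape[0]).T)              # (P, C)
    # Aggregate across views; a mean is used here, whereas e.g. NerFormer
    # learns this aggregation step along the query ray.
    return torch.stack(feats, dim=0).mean(dim=0)                 # (P, C)
```

The key property illustrated here is that every feature is tied to a pixel location obtained by projection, which is precisely why such methods cannot synthesize content that no context view observes.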

While these projection-based methods are effective at generating novel views by transforming visible structures, they struggle with large viewpoint changes (as the underlying geometry may be uncertain), and they are fundamentally unable to generate plausible visual content not directly observed in the context views. Arguably, this is because they lack mechanisms to learn and exploit global context when generating query views.

Geometry-free view synthesis

To allow using global context for view synthesis, an alternate class of methods uses 'geometry-free' encodings to infer novel views. Initial learning-based methods typically focused on novel-view prediction from a single image via global conditioning. Subsequent approaches (GFVS, ViewFormer) improved performance using different architectures such as Transformers, while also enabling probabilistic view synthesis via VQ-VAEs and VQ-GANs. While these generative approaches produce detailed and realistic outputs, the renderings are not 3D-consistent due to stochastic sampling.

Our work is inspired by the recently proposed Scene Representation Transformer (SRT), which uses a set-latent representation that encodes both patch-level and global scene context. This design enables a fast, deterministic rendering pipeline that, unlike projection-based methods, can produce plausible hallucinations in unobserved regions.
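For illustration, below is a minimal sketch of the geometry-free decoding step (assumed module names, dimensions, and shapes; a simplification of, not a reproduction of, the actual SRT architecture): each query ray is embedded and cross-attends into the set-latent scene representation, with no explicit projection onto the context views.

```python
import torch
import torch.nn as nn

class SetLatentDecoder(nn.Module):
    """Sketch of an SRT-style decoder: rays attend into a set-latent representation."""

    def __init__(self, dim=256, n_heads=8):
        super().__init__()
        self.ray_embed = nn.Linear(6, dim)   # ray origin (3) + direction (3)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.to_rgb = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, rays, set_latent):
        """rays: (B, R, 6) query rays; set_latent: (B, N, dim) patch tokens."""
        q = self.ray_embed(rays)                          # (B, R, dim)
        z, _ = self.cross_attn(q, set_latent, set_latent) # attend over the whole scene
        return self.to_rgb(z)                             # (B, R, 3) predicted colors
```

Because every query ray attends over all patch tokens, the decoder can draw on global scene context, but nothing constrains its attention to geometrically relevant image regions.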

However, these benefits come at the cost of detail: unlike projection-based methods, this geometry-free approach is unable to capture precise details in the visible regions. Motivated by the need to improve detail, we propose mechanisms to inject geometric biases into this framework, and find that this significantly improves performance while preserving global reasoning and efficiency.