Our model generates plausible novel views conditioned on only a single input view, and handles both in-domain images (top) and out-of-domain images (bottom).

Application: Single Image to 3DGS

Our model can be applied to various downstream tasks. For example, given a single image, our model generates 3-4 novel views, which are then fed into a fast 3D Gaussian Splatting (3DGS) reconstructor such as InstantSplat. This yields a 3DGS scene in 30 seconds.

Warping-and-Inpainting vs. Ours

We introduce a novel approach in which a diffusion model learns to perform geometric warping implicitly, conditioned on correspondences derived from monocular depth estimation (MDE), instead of warping pixels or features directly. We design the model to interactively compensate for ill-warped regions during its generation process, thereby preventing the artifacts typically caused by explicit warping.

GenWarp Knows Where to Warp and Where to Refine

In our augmented self-attention, the original self-attention part is more attentive to regions requiring generative priors, such as occluded or ill-warped areas (top), while the cross-view attention part focuses on regions that can be reliably warped from the input view (bottom). By aggregating both attentions at once, the model naturally determines which regions to generate and which to warp.
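The aggregation described above can be illustrated with a minimal sketch. It assumes that "aggregating both attentions at once" means concatenating the novel view's own keys and values with the cross-view keys and values from the input view, then applying a single softmax over the combined set. The function name `augmented_attention`, the toy tensors, and the pure-Python list representation are all hypothetical choices for illustration, not the released implementation.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def augmented_attention(queries, self_keys, self_values, cross_keys, cross_values):
    """Attend jointly over self tokens and cross-view tokens.

    Assumption: both token sets share one softmax, so each query's
    attention weight is split between "generate" (self) and "warp" (cross).
    Returns (outputs, self_mass), where self_mass[i] is the share of
    query i's attention weight that falls on the self tokens.
    """
    keys = self_keys + cross_keys        # concatenate along the token axis
    values = self_values + cross_values
    n_self = len(self_keys)
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs, self_mass = [], []
    for q in queries:
        weights = softmax([scale * dot(q, k) for k in keys])
        out = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(len(values[0]))]
        outputs.append(out)
        self_mass.append(sum(weights[:n_self]))
    return outputs, self_mass

# Toy example: query 0 matches the self key, query 1 matches the cross-view key.
queries = [[1.0, 0.0], [0.0, 1.0]]
self_keys, self_values = [[4.0, 0.0]], [[1.0, 0.0]]
cross_keys, cross_values = [[0.0, 4.0]], [[0.0, 1.0]]
outputs, self_mass = augmented_attention(
    queries, self_keys, self_values, cross_keys, cross_values)
```

In this toy setup, the first query places most of its weight on the self token and the second on the cross-view token. Because the softmax is shared, the two sources compete for each query's attention, which is one way a model could decide per region whether to generate or to warp.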

Qualitative Results on In-The-Wild Images

Overall Framework

Given an input view and a desired camera viewpoint, we obtain a pair of embeddings: a 2D coordinate embedding for the input view, and a warped coordinate embedding for the novel view. With these embeddings, a semantic preserver network produces a semantic feature of the input view, and a diffusion model conditioned on them learns to conduct geometric warping to generate novel views.
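To make the pair of coordinate maps concrete, here is a minimal sketch of the geometric step only. It assumes a pinhole camera with hypothetical intrinsics `fx, fy, cx, cy`, a per-pixel depth map from an MDE model, and a relative pose given as a 3x3 `rotation` and a 3-vector `translation`. It forward-splats each source pixel's normalized 2D coordinate to the nearest pixel in the novel view with a z-buffer; any further encoding of the coordinates into embeddings is omitted. The function name `warp_coordinates` is illustrative, not taken from the released code.

```python
def warp_coordinates(depth, fx, fy, cx, cy, rotation, translation):
    """Warp the input view's normalized 2D coordinate map into the novel view.

    Returns an h x w grid; each cell holds the (x, y) coordinate of the
    source pixel that lands there, or None where no source pixel lands.
    """
    h, w = len(depth), len(depth[0])
    warped = [[None] * w for _ in range(h)]
    zbuf = [[float("inf")] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            z = depth[v][u]
            # Unproject the pixel to a 3D point in the source camera frame.
            p = ((u - cx) / fx * z, (v - cy) / fy * z, z)
            # Move the point into the novel camera frame.
            q = [sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
                 for i in range(3)]
            if q[2] <= 0:
                continue  # behind the novel camera
            # Project to the nearest pixel of the novel view.
            u2 = round(fx * q[0] / q[2] + cx)
            v2 = round(fy * q[1] / q[2] + cy)
            if 0 <= u2 < w and 0 <= v2 < h and q[2] < zbuf[v2][u2]:
                zbuf[v2][u2] = q[2]  # keep the closest surface
                warped[v2][u2] = (u / max(w - 1, 1), v / max(h - 1, 1))
    return warped

# Toy 2x3 view at constant depth; the translation shifts content one pixel right.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
depth = [[2.0] * 3 for _ in range(2)]
same = warp_coordinates(depth, 2.0, 2.0, 1.0, 0.5, identity, [0.0, 0.0, 0.0])
shifted = warp_coordinates(depth, 2.0, 2.0, 1.0, 0.5, identity, [1.0, 0.0, 0.0])
```

With the identity pose the warped map equals the original coordinate map. With the shifted pose the leftmost column is left empty (`None`), a small example of a region with no correspondence in the input view that the generative model would have to fill in.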


Generating novel views from a single image remains a challenging task due to the complexity of 3D scenes and the limited diversity of existing multi-view datasets available for training. Recent research combining large-scale text-to-image (T2I) models with monocular depth estimation (MDE) has shown promise in handling in-the-wild images. In these methods, an input view is geometrically warped to novel views using estimated depth maps, and the warped image is then inpainted by T2I models. However, they struggle with noisy depth maps and with the loss of semantic details when warping an input view to novel viewpoints. In this paper, we propose a novel approach for single-shot novel view synthesis: a semantic-preserving generative warping framework that enables T2I generative models to learn where to warp and where to generate by augmenting cross-view attention with self-attention. Our approach addresses the limitations of existing methods by conditioning the generative model on source view images and incorporating geometric warping signals. Qualitative and quantitative evaluations demonstrate that our model outperforms existing methods in both in-domain and out-of-domain scenarios.



We thank Jisang Han for helping with the 3DGS application on this project page. The website template was borrowed from Michaël Gharbi.