GenWildSplat
Generalizable Sparse-View 3D Reconstruction from Unconstrained Images

CVPR 2026
TL;DR: Feed-forward in-the-wild scene reconstruction from sparse views in 3 s on an A6000 GPU.

Note: All video results are presented on real-world scenes that were not seen during training!

Video Comparison Against State-of-the-Art Methods

Each row shows the input view, reconstructions from WildGaussians and NexusSplats, and our GenWildSplat result. Under sparse inputs and varied lighting, our method yields more consistent novel views.

Input
WildGaussians
NexusSplats
GenWildSplat (Ours)
Input
WildGaussians
NexusSplats
GenWildSplat (Ours)
Input
WildGaussians
NexusSplats
GenWildSplat (Ours)

Side-by-Side Comparison of Appearance Modeling

Each left-right pair compares a baseline method with GenWildSplat under identical target views and lighting conditions.

WildGaussians | GenWildSplat
NexusSplats | GenWildSplat

Our GenWildSplat Framework

Given sparse, unposed images with varying appearance and transient objects, our approach first extracts multi-view features capturing both semantic and geometric information. Dedicated prediction heads estimate depth, camera parameters, and 3D Gaussian attributes, which are then mapped into a canonical 3D representation. A light encoder captures per-image illumination, allowing the model to modulate the Gaussians' colors consistently across views through the appearance adapter. A pre-trained segmentation network masks out transient objects, so the reconstruction loss focuses on static scene content. This enables photorealistic, view-consistent reconstructions from sparse, in-the-wild images.
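The pipeline above can be summarized as a single feed-forward pass. The sketch below is purely illustrative: every "head" is a random linear map standing in for the learned networks, and all names, dimensions, and the 14-value Gaussian parameterization are assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, out_dim):
    """Toy stand-in for a learned prediction head (random weights)."""
    w = rng.standard_normal((x.shape[-1], out_dim)) * 0.1
    return x @ w

def reconstruct(images):
    """Sketch of the feed-forward pass: V posed-free images -> Gaussians.

    `images` is (V, H, W, 3); each head below is a placeholder for the
    networks described in the text (feature backbone, depth head,
    camera head, Gaussian head, light encoder, appearance adapter).
    """
    V, H, W, _ = images.shape
    feats = linear(images.reshape(V, -1), 64)         # multi-view features
    depth = linear(feats, H * W).reshape(V, H, W)     # per-pixel depth
    poses = linear(feats, 12).reshape(V, 3, 4)        # camera [R | t]
    gauss = linear(feats, H * W * 14)                 # per-pixel Gaussian attrs
    gauss = gauss.reshape(V * H * W, 14)              # (xyz, scale, rot, opacity, rgb)
    light = linear(feats, 16)                         # per-image light code
    # Appearance adapter: the light code modulates Gaussian colors per view.
    rgb = gauss[:, 11:14] * (1 + np.repeat(light[:, :3], H * W, axis=0))
    return depth, poses, gauss[:, :11], rgb, light
```

Because the whole pass is feed-forward, no per-scene optimization is needed, which is what makes the few-second runtime quoted above possible.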

Constant Appearance Renderings

Our method enables rendering a scene under a constant appearance code while preserving full 3D consistency across views, an ability that 2D appearance-transfer or 2D relighting methods typically struggle to maintain.

Input
Appearance #1
Appearance #2
Appearance #3

Flexible Viewpoint and Appearance Control

Our method supports flexible appearance change and free-viewpoint rendering from a 3D scene representation. The results demonstrate the model's ability to preserve geometric consistency while generating novel combinations of viewpoints and illumination conditions.

Input View
Same View Different Lighting
Different View Same Lighting
Different View Different Lighting

Cross-Scene Illumination Transfer

GenWildSplat can transfer lighting or appearance from one scene to another, enabling controlled appearance changes while preserving geometry. Such appearance transfer is not feasible with prior methods like WildGaussians and NexusSplats, which couple appearance and geometry optimization.
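Because geometry and appearance are decoupled, transferring illumination amounts to re-rendering one scene's Gaussians with another scene's light code. The sketch below illustrates this with a toy affine adapter; the gain/bias parameterization and all values are assumptions for the example, not the model's learned adapter.

```python
import numpy as np

def appearance_adapter(base_rgb, light_code):
    """Toy affine adapter: a light code scales and shifts Gaussian colors.

    In the real model this is a learned network; here gain and bias are
    just slices of the code so the idea stays self-contained.
    """
    gain, bias = light_code[:3], light_code[3:6]
    return np.clip(base_rgb * (1 + gain) + bias, 0.0, 1.0)

# Gaussians of scene A (positions stay fixed), light codes of scenes A and B.
positions = np.array([[0.0, 0.0, 1.0], [0.5, 0.2, 2.0]])
base_rgb = np.array([[0.6, 0.5, 0.4], [0.3, 0.3, 0.3]])
code_a = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0])    # identity lighting
code_b = np.array([0.4, 0.1, -0.2, 0.05, 0.0, 0.0])  # different lighting

rgb_a = appearance_adapter(base_rgb, code_a)  # scene A's own appearance
rgb_b = appearance_adapter(base_rgb, code_b)  # scene B's lighting on scene A
# Geometry (positions) is identical in both renderings; only color changes.
```

Per-scene optimization methods cannot do this swap, since their appearance embeddings are entangled with the geometry they were optimized against.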

Effect of Input View Count on Scene Reconstruction

The results show that 3D reconstruction quality scales directly with the number of input views: increasing the context views from 2 to 6 significantly improves novel-view synthesis. Few input views (1-3) lead to geometric holes and artifacts, while more views (5-6) yield a robust, hole-free reconstruction.

1 View
2 Views
3 Views
4 Views
5 Views
6 Views

Appearance Interpolation

A slider enables interactive blending between different appearances of the same scene by moving the blue dot along the axis. The resulting smooth transitions demonstrate that our light encoder learns semantically meaningful lighting codes and effectively captures diverse scene appearances.

Depth Prediction

For reference, we show the depth prediction rendered by rasterizing the Gaussians' centers.

RGB | Depth
RGB | Depth
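Rasterizing only the Gaussian centers reduces depth rendering to projecting points and keeping the nearest z per pixel. The sketch below is a crude stand-in (one pixel per center, no footprint or opacity weighting); the intrinsics and points are made up for the example.

```python
import numpy as np

def depth_from_centers(centers, K, H, W):
    """Point-splat depth: project Gaussian centers, keep nearest z per pixel.

    Real splatting would weight by each Gaussian's footprint and opacity;
    here every center lands on exactly one pixel.
    """
    depth = np.full((H, W), np.inf)
    for x, y, z in centers:
        if z <= 0:
            continue  # behind the camera
        u = int(round(K[0, 0] * x / z + K[0, 2]))  # pinhole projection
        v = int(round(K[1, 1] * y / z + K[1, 2]))
        if 0 <= u < W and 0 <= v < H:
            depth[v, u] = min(depth[v, u], z)      # z-buffer: nearest wins
    return depth

K = np.array([[10.0, 0.0, 2.0],
              [0.0, 10.0, 2.0],
              [0.0,  0.0, 1.0]])
centers = np.array([[0.0, 0.0, 1.0],   # projects to pixel (2, 2), z = 1
                    [0.0, 0.0, 3.0]])  # same pixel, farther: occluded
d = depth_from_centers(centers, K, 4, 4)
```

Pixels hit by no center remain at infinity, which is why holes in the Gaussian cloud show up directly in the depth maps.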

Curated Scenes from MegaScenes Dataset

We showcase a few scenes from our curated set from the MegaScenes dataset used in our evaluations. These outdoor scenes include sparse viewpoints, significant illumination variations, and transient occluders, providing a challenging test of GenWildSplat's generalization capability.

Upper Bound Comparison with Prior Methods

Prior methods require significantly more views to approach accurate geometry. NexusSplats matches our quality only at around 216 input views, whereas GenWildSplat achieves high-quality reconstructions with just 6 views.

Input
6 Views
36 Views
216 Views
Ours (6 Views)
Input view example 1
6 views example 1
36 views example 1
216 views example 1
Ours (6 views) example 1
Input view example 2
6 views example 2
36 views example 2
216 views example 2
Ours (6 views) example 2

Synthetic Data Generation Pipeline for Curriculum Training

We train our model in a three-stage fashion with progressively increasing complexity. In Stage I, using only a single scene from the DL3DV dataset, we employ DiffusionRenderer to relight images with diverse, random lighting, generating synthetic data to help the model resolve geometry and appearance ambiguities. Stage II extends this approach to multiple scenes, allowing the model to generalize across scenes and varied appearances. In Stage III, we introduce synthetic occlusions and task the model to remove them, building on the representations learned in the earlier stages.
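The three stages can be expressed as a simple curriculum schedule handed to the training loop. The sketch below is illustrative only: the scene counts and config keys are invented, and only the progression (one relit scene, then many scenes, then added occlusions) mirrors the description above.

```python
# Illustrative three-stage curriculum; the scene counts are made up,
# only the progression of difficulty follows the text.
STAGES = [
    {"name": "I",   "scenes": 1,   "relight": True, "occlusions": False},
    {"name": "II",  "scenes": 100, "relight": True, "occlusions": False},
    {"name": "III", "scenes": 100, "relight": True, "occlusions": True},
]

def run_curriculum(train_stage):
    """Run the stages in order, passing each config to a training routine."""
    for stage in STAGES:
        train_stage(stage)

log = []
run_curriculum(log.append)  # stand-in training routine: just record configs
```

Keeping relighting on throughout while only the final stage adds occluders lets the model settle geometry/appearance disambiguation before it must also learn removal.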

Synthetic Data Generation

Gallery of Results (100+ Scenes)

Browse through hundreds of MegaScenes test scenes (never seen by our model during training) reconstructed by GenWildSplat, showcasing its performance across a wide range of outdoor settings.

BibTeX

@inproceedings{gupta2026genwildsplat,
  title     = {Generalizable Sparse-View 3D Reconstruction from Unconstrained Images},
  author    = {Gupta, Vinayak and Lin, Chih-Hao and Wang, Shenlong and Bhattad, Anand and Huang, Jia-Bin},
  booktitle = {CVPR},
  year      = {2026}
}