Note: All video results are presented on real-world scenes that were not seen during training!
Each row shows the input view, reconstructions from WildGaussians and NexusSplats, and our GenWildSplat result. Under sparse inputs and varied lighting, our method yields more consistent novel views.
Each left-right pair compares a baseline method with GenWildSplat under identical target views and lighting conditions.
Our method enables rendering a scene under a constant appearance code while preserving full 3D consistency across views, an ability that 2D appearance-transfer or 2D relighting methods typically struggle to maintain.
Our method supports flexible appearance changes and free-viewpoint rendering from a single 3D scene representation. The results demonstrate that the model preserves geometric consistency while generating novel combinations of viewpoint and illumination.
GenWildSplat can transfer lighting or appearance from one scene to another, enabling controlled appearance changes while preserving geometry. Such appearance transfer is not feasible with prior methods like WildGaussians and NexusSplats, which couple appearance and geometry optimization.
Example reconstructions from varying numbers of input views. Use the controls below to browse results for 2-6 input images. This illustrates that reconstruction quality improves with more views, while GenWildSplat remains robust even with only two inputs. Experiments are limited to six views due to computational constraints, though the method can handle more in practice.
The results show that reconstruction quality correlates directly with the number of input views: increasing the context from 2 to 6 views significantly improves novel-view synthesis. Sparse input (1-3 views) leads to geometric holes and artifacts, whereas denser input (5-6 views) yields a more robust, hole-free reconstruction.
A slider enables interactive blending between different appearances of the same scene by moving the blue dot along the axis. The resulting smooth transitions demonstrate that our light encoder learns semantically meaningful lighting codes and effectively captures diverse scene appearances.
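The blending behind the slider can be pictured as linear interpolation in the lighting-code space; below is a minimal sketch, where the function name and the 32-dimensional code size are illustrative assumptions rather than our actual implementation:

```python
import numpy as np

def blend_codes(code_a: np.ndarray, code_b: np.ndarray, t: float) -> np.ndarray:
    """Linearly interpolate between two lighting codes, t in [0, 1]."""
    return (1.0 - t) * code_a + t * code_b

# Hypothetical 32-dim lighting codes for two appearances of the same scene.
rng = np.random.default_rng(0)
code_day = rng.standard_normal(32)
code_night = rng.standard_normal(32)

# Moving the slider corresponds to sweeping t from 0 (day) to 1 (night).
mid_code = blend_codes(code_day, code_night, 0.5)
```

Because the learned code space is semantically smooth, intermediate codes render as plausible in-between appearances rather than ghosted blends.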
For reference, we show the depth prediction rendered by rasterizing the Gaussians' centers.
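Concretely, such a depth map can be formed by projecting each Gaussian center with the camera intrinsics and keeping the nearest depth per pixel. A minimal sketch, assuming camera-space centers and a standard pinhole intrinsic matrix `K` (all names here are illustrative, not our renderer's API):

```python
import numpy as np

def depth_from_centers(centers_cam: np.ndarray, K: np.ndarray, hw: tuple) -> np.ndarray:
    """Rasterize 3D Gaussian centers (camera space, Nx3) into an HxW depth map.

    Each center is projected through the pinhole intrinsics K; where several
    centers land on the same pixel, the nearest depth wins (z-buffering).
    Pixels hit by no center stay at +inf.
    """
    H, W = hw
    depth = np.full((H, W), np.inf)
    z = centers_cam[:, 2]
    pts = centers_cam[z > 0]          # keep centers in front of the camera
    uvw = (K @ pts.T).T               # homogeneous image coordinates
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    zz = pts[:, 2]
    inb = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    for ui, vi, zi in zip(u[inb], v[inb], zz[inb]):
        if zi < depth[vi, ui]:
            depth[vi, ui] = zi
    return depth
```

This point-splat depth is sparse by construction; it serves as a geometric reference rather than a full volumetric depth render.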
We showcase a few scenes from our curated set from the MegaScenes dataset used in our evaluations. These outdoor scenes include sparse viewpoints, significant illumination variations, and transient occluders, providing a challenging test of GenWildSplat's generalization capability.
Prior methods require significantly more views to reach comparable geometry. NexusSplats matches our performance only at around 216 input views, whereas GenWildSplat achieves high-quality reconstructions from just 6.
We train our model in three stages of progressively increasing complexity. In Stage I, using only a single scene from the DL3DV dataset, we employ DiffusionRenderer to relight images with diverse, random lighting, generating synthetic data that helps the model disentangle geometry from appearance. Stage II extends this approach to multiple scenes, allowing the model to generalize across scenes and varied appearances. In Stage III, we introduce synthetic occlusions and task the model with removing them, building on the representations learned in the earlier stages.
Browse through hundreds of MegaScenes test scenes (never seen by our model during training) reconstructed by GenWildSplat, showcasing its performance across a wide range of outdoor settings.
These examples illustrate the current limitations of our framework. Use the controls below to navigate through the four discussed failure scenarios.