Note: All video results are presented on real-world scenes that were not seen during training!
Each row shows the input view, reconstructions from WildGaussians and NexusSplats, and our GenWildSplat result. Under sparse inputs and varied lighting, our method yields more consistent novel views.
Each left-right pair compares a baseline method with GenWildSplat under identical target views and lighting conditions.
Our method enables rendering a scene under a constant appearance code while preserving full 3D consistency across views, an ability that 2D appearance-transfer or 2D relighting methods typically struggle to maintain.
Our method supports flexible appearance changes and free-viewpoint rendering from a single 3D scene representation. The results demonstrate that the model preserves geometric consistency while generating novel combinations of viewpoint and illumination.
GenWildSplat can transfer lighting or appearance from one scene to another, enabling controlled appearance changes while preserving geometry. Such appearance transfer is not feasible with prior methods like WildGaussians and NexusSplats, which couple appearance and geometry optimization.
Example reconstructions from varying numbers of input views. Use the controls below to browse results for 2-6 input images. This illustrates that reconstruction quality improves with more views, while GenWildSplat remains robust even with only two inputs. Experiments are limited to six views due to computational constraints, though the method can handle more in practice.
The results show that reconstruction quality correlates directly with the number of input views: increasing the context from 2 to 6 views significantly improves novel-view synthesis. Sparse input (1-3 views) leads to geometric holes and artifacts, whereas denser input (5-6 views) yields a more robust, hole-free reconstruction.
A slider enables interactive blending between different appearances of the same scene by moving the blue dot along the axis. The resulting smooth transitions demonstrate that our light encoder learns semantically meaningful lighting codes and effectively captures diverse scene appearances.
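The blending behind the slider can be pictured as linear interpolation in the lighting-code space; below is a minimal sketch, where the function name and the 32-dimensional code size are illustrative assumptions rather than our actual implementation:

```python
import numpy as np

def blend_codes(code_a: np.ndarray, code_b: np.ndarray, t: float) -> np.ndarray:
    """Linearly interpolate between two lighting codes, t in [0, 1]."""
    return (1.0 - t) * code_a + t * code_b

# Hypothetical 32-dim lighting codes for two appearances of the same scene.
rng = np.random.default_rng(0)
code_day = rng.standard_normal(32)
code_night = rng.standard_normal(32)

# Moving the slider corresponds to sweeping t from 0 (day) to 1 (night).
mid_code = blend_codes(code_day, code_night, 0.5)
```

Because the learned code space is semantically smooth, intermediate codes render as plausible in-between appearances rather than ghosted blends.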
For reference, we show the depth prediction rendered by rasterizing the Gaussians' centers.
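Concretely, such a depth map can be formed by projecting each Gaussian center with the camera intrinsics and keeping the nearest depth per pixel. A minimal sketch, assuming camera-space centers and a standard pinhole intrinsic matrix `K` (all names here are illustrative, not our renderer's API):

```python
import numpy as np

def depth_from_centers(centers_cam: np.ndarray, K: np.ndarray, hw: tuple) -> np.ndarray:
    """Rasterize 3D Gaussian centers (camera space, Nx3) into an HxW depth map.

    Each center is projected through the pinhole intrinsics K; where several
    centers land on the same pixel, the nearest depth wins (z-buffering).
    Pixels hit by no center stay at +inf.
    """
    H, W = hw
    depth = np.full((H, W), np.inf)
    z = centers_cam[:, 2]
    pts = centers_cam[z > 0]          # keep centers in front of the camera
    uvw = (K @ pts.T).T               # homogeneous image coordinates
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    zz = pts[:, 2]
    inb = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    for ui, vi, zi in zip(u[inb], v[inb], zz[inb]):
        if zi < depth[vi, ui]:
            depth[vi, ui] = zi
    return depth
```

This point-splat depth is sparse by construction; it serves as a geometric reference rather than a full volumetric depth render.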
We showcase a few scenes from our curated set from the MegaScenes dataset used in our evaluations. These outdoor scenes include sparse viewpoints, significant illumination variations, and transient occluders, providing a challenging test of GenWildSplat's generalization capability.
Prior methods require significantly more views to reach comparable geometry. NexusSplats matches our performance only at around 216 input views, whereas GenWildSplat achieves high-quality reconstructions from just 6.
We train our model in three stages of progressively increasing complexity. In Stage I, using only a single scene from the DL3DV dataset, we employ DiffusionRenderer to relight images with diverse, random lighting, generating synthetic data that helps the model disentangle geometry from appearance. Stage II extends this approach to multiple scenes, allowing the model to generalize across scenes and varied appearances. In Stage III, we introduce synthetic occlusions and task the model with removing them, building on the representations learned in the earlier stages.
Browse through hundreds of MegaScenes test scenes (never seen by our model during training) reconstructed by GenWildSplat, showcasing its performance across a wide range of outdoor settings.
These examples illustrate the current limitations of our framework. Use the controls below to navigate through the four discussed failure scenarios.