NeRF On-the-go

Exploiting Uncertainty for Distractor-free NeRFs in the Wild


CVPR 2024

Weining Ren1*     Zihan Zhu1*     Boyang Sun1      Jiaqi Chen1      Marc Pollefeys1,2     Songyou Peng1,3

1ETH Zurich     2Microsoft     3MPI for Intelligent Systems, Tübingen    

* equal contribution

Abstract

Neural Radiance Fields (NeRFs) have shown remarkable success in synthesizing photorealistic views from multi-view images of static scenes, but face challenges in dynamic, real-world environments with distractors like moving objects, shadows, and lighting changes. Existing methods manage controlled environments and low occlusion ratios but fall short in render quality, especially under high occlusion scenarios. In this paper, we introduce NeRF On-the-go, a simple yet effective approach that enables the robust synthesis of novel views in complex, in-the-wild scenes from only casually captured image sequences. Delving into uncertainty, our method not only efficiently eliminates distractors, even when they are predominant in captures, but also achieves a notably faster convergence speed. Through comprehensive experiments on various scenes, our method demonstrates a significant improvement over state-of-the-art techniques. This advancement opens new avenues for NeRF in diverse and dynamic real-world applications.


Method

A pre-trained DINOv2 network extracts feature maps from posed images, followed by a dilated patch sampler that selects rays. The uncertainty MLP \(G\) then takes the DINOv2 features of these rays as input to generate the uncertainties \(\beta(\mathbf{r})\). Both MLPs are trained jointly; during the forward pass, the uncertainty and NeRF MLPs run in parallel. Three losses (on the right) are used to optimize \(G\) and the NeRF model. Note that training is facilitated by detaching the gradient flows, as indicated by the colored dashed lines.
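The gradient-detaching pattern can be summarized in a few lines of code. Below is a minimal sketch, assuming PyTorch and a DINOv2 ViT-S backbone (384-dim patch features); the MLP sizes and the two simplified loss terms are illustrative assumptions, not the paper's exact three losses. The point it shows is the detaching: the color loss detaches \(\beta\) so it only trains the NeRF, while the uncertainty term detaches the rendered colors so it only trains \(G\).

import torch
import torch.nn as nn

class UncertaintyMLP(nn.Module):
    # Sketch of the uncertainty MLP G: per-ray DINOv2 features -> beta(r) > 0.
    def __init__(self, feat_dim=384, hidden=64):  # 384 = DINOv2 ViT-S feature dim
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),  # Softplus keeps beta positive
        )

    def forward(self, feats):            # feats: (num_rays, feat_dim)
        return self.net(feats) + 1e-3    # small floor avoids division by zero

def training_losses(rendered, gt, beta):
    # Color loss: beta is detached, so high-uncertainty (distractor) rays are
    # merely down-weighted and this term updates only the NeRF MLP.
    l_color = ((rendered - gt) ** 2 / (2 * beta.detach() ** 2)).mean()
    # Uncertainty loss: rendered colors are detached, so this term updates only
    # G; the log(beta) regularizer stops beta from growing without bound.
    l_uncer = ((rendered.detach() - gt) ** 2 / (2 * beta ** 2)
               + torch.log(beta)).mean()
    return l_color, l_uncer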

On-the-go Dataset

To rigorously evaluate our approach in real-world settings, we collected a dataset of 12 casually captured sequences, comprising 10 outdoor and 2 indoor scenes, which we name the On-the-go dataset. It features a wide range of dynamic objects, including pedestrians, cyclists, strollers, toys, cars, robots, and trams, with occlusion ratios ranging from 5% to 30%.

Additional Results

Comparison with RobustNeRF

RobustNeRF employs hard thresholding to eliminate distractors, which makes it sensitive to the threshold value and may not generalize well to complex scenes. Our method is more robust to distractors and can handle more complicated scenes.
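To make this contrast concrete, the schematic below compares hard-threshold masking with soft uncertainty weighting of per-ray losses. It is not either method's actual implementation; residuals, tau, and beta are hypothetical placeholders.

import torch

def hard_threshold_weights(residuals, tau=0.5):
    # Hard thresholding (RobustNeRF-style, schematic): rays whose residual
    # exceeds tau are discarded outright, so results hinge on choosing tau well.
    return (residuals < tau).float()

def uncertainty_weights(beta):
    # Soft weighting (schematic of our approach): each ray's loss is scaled by
    # its predicted uncertainty, down-weighting distractors continuously
    # without any hand-tuned cutoff.
    return 1.0 / (2.0 * beta ** 2)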

Comparison with NeRF-W

Compared with NeRF-W, our method can handle more complicated scenes with higher occlusion ratios. Furthermore, it does not depend on transient embeddings, which add extra complexity and can lead to the loss of high-frequency details.

Here, we show more comparisons with NeRF-W and RobustNeRF.




BibTeX

@InProceedings{Ren2024NeRF,
    title={NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild},
    author={Ren, Weining and Zhu, Zihan and Sun, Boyang and Chen, Jiaqi and Pollefeys, Marc and Peng, Songyou},
    booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2024},
}

Acknowledgements

We thank the Max Planck ETH Center for Learning Systems (CLS) for supporting Songyou Peng. We also thank Yiming Zhao and Clément Jambon for helpful discussions.
