Neural Pixel Composition for
3D-4D View Synthesis from Multi-Views


Aayush Bansal
Michael Zollhoefer
Reality Labs Research, Pittsburgh
{aayushb4, zollhoefer}@meta.com




Our approach allows us to capture small details better than existing methods. We show novel views (top row) synthesized using our approach and zoom in on the details for each view (bottom row).



Abstract

We present Neural Pixel Composition (NPC), a novel approach for continuous 3D-4D view synthesis given only a discrete set of multi-view observations as input. Existing state-of-the-art approaches require dense multi-view supervision and an extensive computational budget. The proposed formulation reliably operates on sparse and wide-baseline multi-view imagery and can be trained efficiently within a few seconds to 10 minutes for hi-res (12MP) content, i.e., 200-400x faster convergence than existing methods. Crucial to our approach are two core novelties: 1) a representation of a pixel that contains color and depth information accumulated from multiple views for a particular location and time along a line of sight, and 2) a multi-layer perceptron (MLP) that composes this rich per-pixel information into the final color output. We experiment with a large variety of multi-view sequences, compare to existing approaches, and achieve better results in diverse and challenging settings.
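To make the two components above concrete, here is a minimal, illustrative PyTorch sketch of the composition step. It assumes that, for each target pixel, color and depth samples have already been gathered from the source views along the pixel's line of sight, and it uses a small MLP to map those samples to the final color. The class name, network sizes, sample count, and input format below are our own assumptions for illustration, not the paper's exact architecture.

import torch
import torch.nn as nn

class PixelCompositionMLP(nn.Module):
    """Composes per-pixel multi-view (color, depth) samples into one RGB value.

    num_samples, the hidden width, and the depth of the MLP are illustrative
    choices; the actual NPC hyper-parameters may differ.
    """
    def __init__(self, num_samples: int = 8, hidden: int = 128):
        super().__init__()
        in_dim = num_samples * 4  # each sample contributes RGB (3) + depth (1)
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # final RGB in [0, 1]
        )

    def forward(self, colors: torch.Tensor, depths: torch.Tensor) -> torch.Tensor:
        # colors: (B, num_samples, 3); depths: (B, num_samples, 1)
        x = torch.cat([colors, depths], dim=-1).flatten(start_dim=1)
        return self.net(x)

# Usage on a batch of 1024 pixels, each with 8 multi-view samples.
model = PixelCompositionMLP(num_samples=8)
colors = torch.rand(1024, 8, 3)   # colors gathered from the source views
depths = torch.rand(1024, 8, 1)   # corresponding depths along the line of sight
rgb = model(colors, depths)       # (1024, 3) predicted pixel colors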


A. Bansal and M. Zollhoefer
Neural Pixel Composition for
3D-4D View Synthesis from Multi-Views
[Paper] / [Bibtex]
(27 pages, 23 figures, 20 tables)



A short video explaining our work.



Real-World Unconstrained Multi-Views

Our approach allows us to operate on sparse and unconstrained multi-views of unbounded scenes. For these unbounded scenes, we learn a model for a fixed time instant. The minimum distance between adjacent cameras in these sequences is more than 50 cm. Prior approaches such as NeRF and DS-NeRF produce degenerate output on these sequences.



Hi-Resolution (12MP) View Synthesis

We capture hi-res details in a scene.

Our approach allows us to capture small details better than prior approaches. We contrast with NeRF and NeX. For the first four results, we train the NeRF model for 2M iterations, which takes 64 hours on a single NVIDIA V100 GPU. We use the results provided by NeX for the last three sequences from the Shiny dataset.



Learning from a Single Subject and a Single Time Instant

Our approach allows us to capture hi-res facial details such as hair, eyes, teeth, and skin. The model is trained on a single time instant and generalizes to unseen expressions and unseen subjects.



4D Visualization

We learn one model for a given temporal sequence. This is in contrast to the above results, where a model was trained for a single time instant.



3D Reconstruction

Nine views capturing two people.