PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction

Abstract

We propose a Pose-Free Large Reconstruction Model (PF-LRM) for reconstructing a 3D object from a few unposed images, even with little visual overlap, while simultaneously estimating the camera poses in ~1.3 seconds on a single A100 GPU. PF-LRM is a highly scalable method that uses self-attention blocks to exchange information between 3D object tokens and 2D image tokens; we predict a coarse point cloud for each view, and then use a differentiable Perspective-n-Point (PnP) solver to obtain the camera poses. When trained on a large amount of multi-view data, PF-LRM shows strong cross-dataset generalization and outperforms baseline methods by a large margin in pose prediction accuracy and 3D reconstruction quality on various evaluation datasets. We also demonstrate our model's robustness to variable numbers of input views and to segmentation mask errors.
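
To make the pose-recovery step concrete, below is a minimal sketch of fitting a camera pose from a predicted per-view point cloud by minimizing a confidence-weighted reprojection error. This is not the paper's implementation: the names (solve_pose, points3d, pixels, weights, K) are illustrative assumptions, and the Adam-based iterative fit stands in for the dedicated differentiable PnP solver used in PF-LRM.

import torch


def hat(k: torch.Tensor) -> torch.Tensor:
    # Skew-symmetric matrix of a 3-vector, built with stack to keep autograd intact.
    zero = k.new_zeros(())
    return torch.stack([
        torch.stack([zero, -k[2], k[1]]),
        torch.stack([k[2], zero, -k[0]]),
        torch.stack([-k[1], k[0], zero]),
    ])


def axis_angle_to_matrix(aa: torch.Tensor) -> torch.Tensor:
    # Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3).
    theta = aa.norm().clamp(min=1e-8)
    K_hat = hat(aa / theta)
    eye = torch.eye(3, dtype=aa.dtype)
    return eye + torch.sin(theta) * K_hat + (1.0 - torch.cos(theta)) * (K_hat @ K_hat)


def solve_pose(points3d, pixels, weights, K, iters=300, lr=1e-2):
    # points3d: (N, 3) predicted 3D points in the shared object frame
    # pixels:   (N, 2) pixel coordinates where those points were predicted
    # weights:  (N,)   per-point confidences
    # K:        (3, 3) camera intrinsics
    aa = torch.full((3,), 1e-3, requires_grad=True)        # rotation (axis-angle)
    t = torch.tensor([0.0, 0.0, 2.0], requires_grad=True)  # translation, init in front of camera
    opt = torch.optim.Adam([aa, t], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        R = axis_angle_to_matrix(aa)
        cam = points3d @ R.T + t                           # object frame -> camera frame
        uv = cam @ K.T
        uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)        # perspective division
        loss = (weights * ((uv - pixels) ** 2).sum(dim=-1)).mean()
        loss.backward()
        opt.step()
    return axis_angle_to_matrix(aa).detach(), t.detach()

A toy sanity check, using points observed by a camera at R = I, t = (0, 0, 3):

N = 200
K = torch.tensor([[500.0, 0.0, 128.0], [0.0, 500.0, 128.0], [0.0, 0.0, 1.0]])
pts = torch.randn(N, 3)
cam = pts + torch.tensor([0.0, 0.0, 3.0])
uv = cam @ K.T
uv = uv[:, :2] / uv[:, 2:3]
R, t = solve_pose(pts, uv, torch.ones(N), K)  # should recover R ~ I, t ~ (0, 0, 3)

Because the pose estimate is produced by a differentiable solve over the predicted points and confidences, gradients from a pose loss can flow back into the network that predicts the point cloud, which is the property the paper's PnP solver provides.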



Novel dataset tests

[Qualitative galleries on Amazon Berkeley Objects, OmniObject3D, NeRF Synthetic, DTU, and in-the-wild captures: input images shown alongside our NeRF renderings produced with the predicted poses.]