SING3R-SLAM: Submap-based Indoor Monocular Gaussian SLAM with 3D Reconstruction Priors

1Technical University of Munich, 2Google,
3Munich Center for Machine Learning, 4Visualais

Abstract

Recent advances in dense 3D reconstruction have demonstrated strong capability in accurately capturing local geometry. However, extending these methods to incremental global reconstruction, as required in SLAM systems, remains challenging. Without explicit modeling of global geometric consistency, existing approaches often suffer from accumulated drift, scale inconsistency, and suboptimal local geometry. To address these issues, we propose SING3R-SLAM, a globally consistent Gaussian-based monocular indoor SLAM framework. Our approach represents the scene with a Global Gaussian Map that serves as a persistent, differentiable memory, incorporates local geometric reconstruction via submap-level global alignment, and leverages the global map's consistency to further refine local geometry. This design enables efficient and versatile 3D mapping for multiple downstream applications. Extensive experiments show that SING3R-SLAM achieves state-of-the-art performance in pose estimation, 3D reconstruction, and novel view rendering. On real-world datasets, it improves pose accuracy by over 10%, produces finer and more detailed geometry, and maintains a compact, memory-efficient global representation.
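To make the scale inconsistency mentioned above concrete, the following is a minimal sketch, not the SING3R-SLAM implementation. It assumes that matched 3D points between a submap and the global map are already available and that the two are related by a similarity transform (scale, rotation, translation). Under that assumption, the relative scale can be read off from the ratio of point spreads around the centroids, independent of rotation and translation. The function names (`centroid`, `spread_sq`, `relative_scale`) and the toy points are hypothetical and chosen only for illustration.

```python
import math


def centroid(points):
    """Mean of a list of 3D points given as (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def spread_sq(points):
    """Sum of squared distances of the points to their centroid."""
    c = centroid(points)
    return sum((p[k] - c[k]) ** 2 for p in points for k in range(3))


def relative_scale(submap_pts, global_pts):
    """Estimate s such that global ~= s * R * submap + t for matched points.

    Rotation R and translation t do not change distances to the centroid,
    so the ratio of spreads depends only on the scale s.
    """
    if len(submap_pts) != len(global_pts) or len(submap_pts) < 2:
        raise ValueError("need at least two matched point pairs")
    denom = spread_sq(submap_pts)
    if denom == 0.0:
        raise ValueError("submap points are degenerate (all identical)")
    return math.sqrt(spread_sq(global_pts) / denom)


# Toy example (hypothetical data): the global points are the submap points
# scaled by 2, rotated 90 degrees about the z-axis, and shifted by (5, -1, 3).
submap_pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
global_pts = [(2.0 * -y + 5.0, 2.0 * x - 1.0, 2.0 * z + 3.0) for x, y, z in submap_pts]

scale = relative_scale(submap_pts, global_pts)
print(f"estimated relative scale: {scale:.3f}")
```

In this toy case the estimate is close to 2.0. A value far from 1.0 between consecutive submaps would indicate the kind of scale drift that a monocular system has to correct when aligning submaps to a global map.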


Reconstructed Point Clouds on 7-Scenes (office)

We present three video sequences of the reconstructed point cloud on 7-Scenes (office) from different viewpoints. The first sequence shows a top-down view, where our method produces a notably cleaner and more coherent point cloud. The other two sequences zoom in on the left and middle walls. These close-ups show that our method reconstructs each wall as a single, well-aligned surface, whereas MASt3R-SLAM and VGGT-SLAM exhibit misalignment across views, resulting in ghosting artifacts and incorrect geometry.

MASt3R-SLAM

VGGT-SLAM

SING3R-SLAM (Ours)

MASt3R-SLAM reconstructs a single unified wall, but its output contains many floating artifacts on the table.

VGGT-SLAM produces multiple walls and fails to align them properly.

Our method generates a single unified wall without any floating artifacts.

MASt3R-SLAM fails to accurately align the wall and produces numerous floating artifacts.

VGGT-SLAM generates a noisy reconstruction with many floating artifacts.

Our method achieves a clean and coherent reconstruction.

Reconstructed Meshes on ScanNet-v2 (scene0000)

We show the reconstructed meshes from different viewpoints, compared with the ground truth and HI-SLAM2. Our method successfully reconstructs the bicycle in scene0000, which HI-SLAM2 fails to recover. The bicycle geometry is incomplete even in the ground truth, which highlights that our approach can recover fine object details. We provide two video sequences: a top-down view of the full scene and a zoom-in view of the bicycle, which show the improved reconstruction fidelity of our method.

Ground Truth

HI-SLAM2

SING3R-SLAM (Ours)

HI-SLAM2 produces a clean mesh but misses fine details such as the bicycle.

Our method reconstructs the scene with high fidelity, capturing fine details such as the bicycle, which is typically challenging to recover.