CVPR 2026

Grounded 3D-Aware
Spatial Vision-Language Modeling

An-Chieh Cheng¹,²* Yang Fu² Yatai Ji¹ Ligeng Zhu¹ Guanqi Zhan¹ Zhuoyang Zhang¹ Zhaojing Yang² Song Han¹ Yao Lu¹ Pavlo Molchanov¹ Vidya Nariyambut Murali¹ Jan Kautz¹ Xiaolong Wang² Hongxu Yin¹ Sifei Liu¹

* Work done during internship at NVIDIA

Paper | Code (coming soon) | Weights (coming soon)

Unified Spatial Reasoning & Grounding. GR3D bridges the gap between 2D pixel space and 3D metric space by integrating multiple grounding capabilities into a visual chain-of-thought (CoT).

Abstract

GR3D is a spatial vision-language model that combines three complementary grounding capabilities (explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding) within a single framework. GR3D introduces an implicit grounding mechanism that identifies entity mentions during generation and inserts the corresponding region tokens into the text stream, allowing the model to reference visual evidence on the fly while producing spatial chain-of-thought responses. Together, these grounding capabilities enable GR3D to decompose complex spatial understanding problems into grounded 2D perception followed by 3D inference.

Method

GR3D extends a foundational Spatial VLM (SR3D) with a streaming region-insertion loop. At every step of the chain-of-thought, the model:

1. Predicts a 2D region for the entity it is about to discuss.

2. Extracts a region embedding and re-injects it as a token.

3. Continues reasoning, conditioned on the refreshed visual cue, toward the next step or the final task output.
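
The three steps above can be sketched as a simple control loop. This is a minimal illustration and not the released implementation: the names `RegionToken`, `grounded_cot`, and the four callables are hypothetical, and the toy stubs in `demo` stand in for the model's actual region predictor, region encoder, and language decoder.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple, Union

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


@dataclass
class RegionToken:
    """Hypothetical stand-in for a region embedding placed in the token stream."""
    entity: str
    box: Box
    embedding: List[float]


Stream = List[Union[str, RegionToken]]


def grounded_cot(
    propose_entity: Callable[[Stream], Optional[str]],
    predict_box: Callable[[str], Box],
    embed_region: Callable[[Box], List[float]],
    continue_reasoning: Callable[[Stream], str],
    max_steps: int = 8,
) -> Stream:
    """Interleave region tokens with reasoning text, one CoT step at a time."""
    stream: Stream = []
    for _ in range(max_steps):
        entity = propose_entity(stream)       # entity the next step will discuss
        if entity is None:                    # nothing left to ground: stop
            break
        box = predict_box(entity)             # step 1: predict a 2D region
        emb = embed_region(box)               # step 2: extract a region embedding
        stream.append(RegionToken(entity, box, emb))  # ...and re-inject it
        stream.append(continue_reasoning(stream))     # step 3: keep reasoning
    return stream


def demo() -> Stream:
    """Run the loop with toy stubs (all values are made up for illustration)."""
    boxes = {"mug": (0.1, 0.5, 0.3, 0.8), "laptop": (0.4, 0.3, 0.9, 0.7)}
    plan = ["mug", "laptop"]

    def propose_entity(stream: Stream) -> Optional[str]:
        done = sum(isinstance(t, RegionToken) for t in stream)
        return plan[done] if done < len(plan) else None

    def embed_region(box: Box) -> List[float]:
        x0, y0, x1, y1 = box
        return [(x0 + x1) / 2, (y0 + y1) / 2, x1 - x0, y1 - y0]

    def continue_reasoning(stream: Stream) -> str:
        last = stream[-1]                     # the region token just inserted
        return f"Reasoning step about the {last.entity}."

    return grounded_cot(propose_entity, boxes.__getitem__, embed_region,
                        continue_reasoning)
```

Calling `demo()` returns a stream that alternates region tokens and text, which mirrors the idea that each reasoning step is conditioned on a freshly inserted visual cue.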

Experimental Results

Results on the BLINK-Depth benchmark for point-level spatial understanding. GR3D's grounded CoT serves as a structural anchor for its reasoning, allowing it to outperform previous methods.

GR3D generalizes across diverse domains, including novel object categories 🦦, OOD outdoor scenes 🏠, warehouse environments 🏭, and potential robotics applications 🤖.

Key Insights

Detect then Lift vs Direct 3D Prediction

Grounding the target in 2D before predicting its 3D bounding box leads to clear performance gains, as shown in the table below. This two-step design encourages the model to first learn object-specific visual features and to leverage abundant 2D supervision, which helps build stronger spatial priors and improves downstream 3D detection. We also compare against Qwen3-VL, which predicts 3D boxes directly, and observe that it misses the target more often.
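
One geometric intuition for why a 2D box helps: under a pinhole camera, the box center fixes a viewing ray, so only the distance along that ray remains to be estimated. The sketch below illustrates this with standard back-projection. It is not how GR3D predicts 3D boxes (the page does not specify that); the function name `lift_box_center`, the known depth, and the intrinsics `fx, fy, cx, cy` are all assumptions made for illustration.

```python
from typing import Tuple

Box2D = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def lift_box_center(
    box: Box2D, depth_m: float, fx: float, fy: float, cx: float, cy: float
) -> Tuple[float, float, float]:
    """Back-project the center of a 2D box to a 3D point in camera coordinates.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal point
    (cx, cy), and that depth_m is the distance along the optical axis in meters.
    """
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    u = (box[0] + box[2]) / 2.0          # box center, horizontal pixel coordinate
    v = (box[1] + box[3]) / 2.0          # box center, vertical pixel coordinate
    x = (u - cx) * depth_m / fx          # pinhole model: X = (u - cx) * Z / fx
    y = (v - cy) * depth_m / fy          # pinhole model: Y = (v - cy) * Z / fy
    return (x, y, depth_m)


# Toy example: a box centered 100 px right of the principal point, 2 m away.
center_3d = lift_box_center((400.0, 200.0, 440.0, 280.0), depth_m=2.0,
                            fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

In this toy setup the lifted point lies 0.4 m to the right of the optical axis at a depth of 2 m; a wrong 2D box would shift the ray and therefore the 3D estimate.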

Does spatial pretraining help 3D detection?

Yes. Spatial pretraining noticeably improves performance, especially in outdoor domains. Omni3D is strongly imbalanced, with far fewer outdoor samples than indoor ones, so models trained from scratch struggle to generalize. Spatial pretraining injects generic 2D spatial and grounding knowledge, allowing the model to transfer stronger spatial priors to 3D detection. As shown in the results, leveraging 2D supervision is particularly effective when 3D training data is limited or unevenly distributed.

Scaling Effect of 3D Pointmap Data

Pointmap reconstruction serves as an effective auxiliary task for 3D detection by improving the alignment between region-level visual features and their underlying 3D geometry. We show that increasing pointmap supervision leads to consistent performance gains on SUN-RGBD, suggesting that dense geometric reconstruction provides strong structural priors that enhance downstream 3D box prediction.
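
To make the auxiliary-task idea concrete, the sketch below treats a pointmap as an H x W grid of per-pixel (X, Y, Z) coordinates and computes a mean absolute error restricted to an object's region, then adds it to a detection loss. The loss form, the weight `aux_weight`, and the function names are assumptions chosen for illustration; the page does not state the actual training objective.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]
PointMap = List[List[Point]]              # H x W grid of (X, Y, Z) per pixel
PixelBox = Tuple[int, int, int, int]      # (x0, y0, x1, y1), end-exclusive


def region_pointmap_l1(pred: PointMap, target: PointMap, box: PixelBox) -> float:
    """Mean absolute coordinate error over the pixels inside `box`."""
    x0, y0, x1, y1 = box
    total, count = 0.0, 0
    for v in range(y0, y1):               # rows inside the region
        for u in range(x0, x1):           # columns inside the region
            total += sum(abs(p - t) for p, t in zip(pred[v][u], target[v][u]))
            count += 3                    # three coordinates per pixel
    if count == 0:
        raise ValueError("region is empty")
    return total / count


def total_loss(det_loss: float, pointmap_loss: float, aux_weight: float = 0.5) -> float:
    """Detection loss plus a weighted auxiliary pointmap term (weight is assumed)."""
    return det_loss + aux_weight * pointmap_loss


# Toy 2x2 pointmaps: the prediction is 1 m too shallow at every pixel.
pred = [[(0.0, 0.0, 1.0)] * 2 for _ in range(2)]
target = [[(0.0, 0.0, 2.0)] * 2 for _ in range(2)]
aux = region_pointmap_l1(pred, target, (0, 0, 2, 2))
loss = total_loss(det_loss=1.0, pointmap_loss=aux)
```

Restricting the error to the object's region reflects the stated goal of aligning region-level features with their underlying 3D geometry, though a real objective may weight or normalize the terms differently.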

BibTeX

@inproceedings{gr3d,
  title={Grounded 3D-Aware Spatial Vision-Language Modeling},
  author={Cheng, An-Chieh and Fu, Yang and Ji, Yatai and Zhu, Ligeng and Zhan, Guanqi and Zhang, Zhuoyang and Yang, Zhaojing and Han, Song and Lu, Yao and Molchanov, Pavlo and Murali, Vidya Nariyambut and Kautz, Jan and Wang, Xiaolong and Yin, Hongxu and Liu, Sifei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}