CVPR 2026

Grounded 3D-Aware
Spatial Vision-Language Modeling

1 NVIDIA
2 UCSD

* Work done during internship at NVIDIA

Unified Spatial Reasoning & Grounding. GR3D bridges the gap between 2D pixel space and 3D metric space by integrating multiple grounding capabilities into visual CoT.
GR3D is a spatial vision language model equipped with three complementary grounding capabilities—explicit 2D grounding, implicit 2D grounding, and monocular 3D grounding—within a single framework. GR3D introduces an implicit grounding mechanism that identifies entity mentions during generation and inserts the corresponding region tokens into the text stream, allowing the model to reference visual evidence on the fly when producing spatial chain-of-thought responses. Together, these grounding capabilities enable GR3D to decompose complex spatial understanding problems into grounded 2D perception followed by 3D inference.
GR3D extends a foundational Spatial VLM (SR3D) with a streaming region-insertion loop. At every step of the chain-of-thought, the model:
01 Predicts a 2D region for the entity it is about to discuss.
02 Extracts a region embedding and re-injects it as a token.
03 Conditioned on the refreshed visual cue, continues reasoning toward the next step or final task output.
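The three steps above can be read as a simple control loop. The following is a minimal sketch of that loop, not GR3D's implementation: the four callables (`next_entity`, `predict_region`, `embed_region`, `continue_reasoning`) and the toy stubs that drive them are hypothetical stand-ins for model components, and the "embedding" here is just a box descriptor.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
Item = tuple  # ("text", str) or ("region", entity, box, embedding)


def grounded_cot(
    next_entity: Callable[[List[Item]], Optional[str]],
    predict_region: Callable[[str, List[Item]], Box],
    embed_region: Callable[[Box], List[float]],
    continue_reasoning: Callable[[List[Item]], Tuple[str, bool]],
    max_steps: int = 16,
) -> List[Item]:
    """Interleave region tokens and text, one chain-of-thought step at a time."""
    stream: List[Item] = []
    for _ in range(max_steps):
        entity = next_entity(stream)
        if entity is not None:
            box = predict_region(entity, stream)      # step 01: 2D region
            embedding = embed_region(box)             # step 02: region embedding
            stream.append(("region", entity, box, embedding))
        text, done = continue_reasoning(stream)       # step 03: keep reasoning
        stream.append(("text", text))
        if done:
            break
    return stream


# Toy stubs (all hypothetical) so the loop can be run end to end.
BOXES = {"mug": (10.0, 20.0, 50.0, 80.0), "laptop": (100.0, 40.0, 300.0, 200.0)}
ENTITIES = ["mug", "laptop", None]


def count_text(stream: List[Item]) -> int:
    return sum(1 for item in stream if item[0] == "text")


def toy_next_entity(stream: List[Item]) -> Optional[str]:
    n = count_text(stream)
    return ENTITIES[n] if n < len(ENTITIES) else None


def toy_predict_region(entity: str, stream: List[Item]) -> Box:
    return BOXES[entity]


def toy_embed_region(box: Box) -> List[float]:
    x1, y1, x2, y2 = box
    return [(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1]


def toy_continue(stream: List[Item]) -> Tuple[str, bool]:
    n = count_text(stream)
    return f"step {n + 1}", n == 2


demo = grounded_cot(toy_next_entity, toy_predict_region, toy_embed_region, toy_continue)
```

In this toy run the stream alternates region and text items for the two grounded entities and ends with a final text-only step, which mirrors the "ground, re-inject, continue" order described above.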
GR3D Methodology
Chart: AP₁₅ (axis 0 to 80) on SUN-RGBD, ARKit, Objectron, Hypersim, KITTI, and nuScenes. Methods: DetAny3D (specialist), Qwen3-VL-4B, Qwen3-VL-8B, GR3D (Ours). Labeled values: 43.5, 67.5, 71.7, 16.4, 22.2, 23.0.
Chart: AP (axis 0 to 60) on SUN-RGBD, ARKit, Objectron, Hypersim, KITTI, and nuScenes. Methods: Cube R-CNN (specialist), Qwen3-VL-8B, GR3D (Ours). Labeled values: 38.9, 46.2, 51.7, 28.5, 20.5, 22.2.
Chart: Acc (%) (axis 60 to 95) on RefCOCO (val), RefCOCO+ (val), RefCOCOg (val), and RefCOCOg (test). Methods: Grounding-DINO-L (specialist), Qwen2-VL-7B, InternVL3.5-8B, GR3D (Ours). Labeled values: 91.8, 87.5, 89.5, 89.7.
Chart: Acc (axis 0 to 80) for AccA (answer accuracy), AccG (grounding accuracy), and Consistency. Methods: LLaVA-7B (AF), Qwen2.5-VL-7B (AF), LLaVA-GCoT-7B, GR3D (Ours). Labeled values: 78.3, 74.2, 67.7.
Results on the BLINK-Depth benchmark for point-level spatial understanding. GR3D's grounding CoT capabilities serve as a structural anchor, allowing it to outperform previous methods.
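For readers unfamiliar with how referring-expression accuracy on RefCOCO, RefCOCO+ and RefCOCOg is usually scored: a prediction is conventionally counted as correct when its IoU with the ground-truth box is at least 0.5. The sketch below illustrates that convention only. Whether GR3D's evaluation uses exactly this threshold is an assumption, and the function names are illustrative.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 2D boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(
    preds: Sequence[Box], targets: Sequence[Box], iou_threshold: float = 0.5
) -> float:
    """Percentage of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(box_iou(p, t) >= iou_threshold for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)
```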
Spatial Reasoning Visualization
GR3D generalizes across diverse domains, including novel object categories 🦦, OOD outdoor scenes 🏠, warehouse environments 🏭, and potential robotics applications 🤖.
GR3D Qualitative Results Grid

Detect then Lift vs Direct 3D Prediction

Grounding the target in 2D before predicting its 3D bounding box leads to clear performance gains, as shown in the results below. This two-step design encourages the model to first learn object-specific visual features and leverage abundant 2D supervision, which helps build stronger spatial priors and improves downstream 3D detection. We also compare against Qwen3-VL, a direct 3D prediction approach, and observe that it misses the target more often.

Visualizing Implicit Grounding
Chart: AP (axis 0 to 50) for Direct 3D Prediction vs. 2D→3D (Ours), on AP15 · SUN-RGBD, AP3D · SUN-RGBD, AP15 · KITTI, and AP3D · KITTI. Labeled values: 42.3, 29.9, 15.6, 10.0.
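One way to see why a 2D box helps is geometric: under a pinhole camera, the box center fixes a viewing ray, so only the depth along that ray remains to be resolved. The sketch below shows standard pinhole back-projection of a box center. It is an illustration of that geometry, not GR3D's lifting step (the model predicts the 3D box itself), and `Intrinsics` and `lift_box_center` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera parameters, in pixels."""
    fx: float
    fy: float
    cx: float
    cy: float


def lift_box_center(box: Box, depth_m: float, k: Intrinsics) -> Tuple[float, float, float]:
    """Back-project the 2D box center to camera coordinates at a given depth (meters)."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    u = 0.5 * (box[0] + box[2])
    v = 0.5 * (box[1] + box[3])
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


# Example with assumed intrinsics: a box centered 100 px right of the principal point.
camera = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
center_3d = lift_box_center((400.0, 200.0, 440.0, 280.0), depth_m=2.0, k=camera)
```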

Does spatial pretraining help 3D detection?

Yes. Spatial pretraining noticeably improves performance, especially in outdoor domains. Omni3D is strongly imbalanced, with far fewer outdoor samples than indoor ones, so models trained from scratch struggle to generalize. Spatial pretraining injects generic 2D spatial and grounding knowledge, allowing the model to transfer stronger spatial priors to 3D detection. As shown in the results, leveraging 2D supervision is particularly effective when 3D training data is limited or unevenly distributed.

Chart: AP (axis 0 to 50) for w/o spatial pretraining vs. w/ spatial pretraining, on AP15 · SUN-RGBD, AP3D · SUN-RGBD, AP15 · KITTI, and AP3D · KITTI. Labeled values: 41.2, 31.0, 21.6, 14.4.

Scaling Effect of 3D Point-Map Data

Pointmap reconstruction serves as an effective auxiliary task for 3D detection by improving the alignment between region-level visual features and their underlying 3D geometry. We show that increasing pointmap supervision leads to consistent performance gains on SUN-RGBD, suggesting that dense geometric reconstruction provides strong structural priors that enhance downstream 3D box prediction.

Chart: Score (axis 35 to 60) for AP3D15, AP3D, and AR3D as a function of Region→PointMap Data (%), from 0 to 100.
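To make the auxiliary task concrete: a pointmap stores one 3D point per pixel, so a region-level reconstruction target is the crop of that map under the region's box. The sketch below assumes a masked mean L1 error as the reconstruction objective. The loss form, the `None` convention for invalid pixels, and the function names are all assumptions for illustration, not details confirmed by the source.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]          # (X, Y, Z) in camera coordinates
PointMap = List[List[Optional[Point]]]      # H x W grid; None marks an invalid pixel
PixelBox = Tuple[int, int, int, int]        # (x1, y1, x2, y2), x2/y2 exclusive


def crop_pointmap(pointmap: PointMap, box: PixelBox) -> PointMap:
    """Return the sub-grid of the pointmap covered by a pixel-aligned box."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in pointmap[y1:y2]]


def masked_l1(pred: PointMap, target: PointMap) -> float:
    """Mean per-pixel L1 distance, skipping pixels invalid in either map."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p is None or t is None:
                continue
            total += sum(abs(a - b) for a, b in zip(p, t))
            count += 1
    return total / count if count else 0.0


# Tiny 2x2 example: one pixel is off by 0.5 m in depth, one target pixel is invalid.
target_map: PointMap = [[(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)],
                        [(0.0, 1.0, 1.0), None]]
pred_map: PointMap = [[(0.0, 0.0, 1.5), (1.0, 0.0, 1.0)],
                      [(0.0, 1.0, 1.0), (9.0, 9.0, 9.0)]]
region_error = masked_l1(pred_map, target_map)
```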
@inproceedings{gr3d,
  title={Grounded 3D-Aware Spatial Vision-Language Modeling},
  author={Cheng, An-Chieh and Fu, Yang and Ji, Yatai and Zhu, Ligeng and Zhan, Guanqi and Zhang, Zhuoyang and Yang, Zhaojing and Han, Song and Lu, Yao and Molchanov, Pavlo and Murali, Vidya Nariyambut and Kautz, Jan and Wang, Xiaolong and Yin, Hongxu and Liu, Sifei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}