
3D Aware Region Prompted Vision Language Model

Region-level spatial reasoning for in-the-wild scenes without sensory 3D inputs.

3D‑Aware Region Prompting · Multi‑View Spatial Reasoning

Architecture

A key idea of SR-3D is the introduction of a canonical positional representation shared across single-view and multi-view inputs. This unified representation enables large-scale single-view pretraining and supports the transfer of learned spatial priors to multi-view settings.
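To make the idea concrete, here is a minimal sketch of a shared 3D positional encoding. All names and the sinusoidal scheme are illustrative assumptions, not SR-3D's actual implementation; the point is only that the same canonical encoder is applied whether the 3D points come from a single view or are aggregated from many views, which is what lets single-view pretraining transfer to multi-view inputs.

```python
import numpy as np

def canonical_3d_encoding(points, num_freqs=4):
    """Hypothetical sinusoidal encoder for 3D points (N, 3) expressed
    in a shared canonical frame. The encoding depends only on the 3D
    coordinates, not on how many views produced the points.
    Output shape: (N, 3 * 2 * num_freqs).
    """
    freqs = 2.0 ** np.arange(num_freqs)                 # (F,)
    scaled = points[:, :, None] * freqs[None, None, :]  # (N, 3, F)
    enc = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return enc.reshape(points.shape[0], -1)

# Single-view regime: points back-projected from one depth map.
single_view_pts = np.random.rand(16, 3)
# Multi-view regime: points from several views, fused into the same frame.
multi_view_pts = np.random.rand(64, 3)

# The identical encoder serves both regimes, so spatial priors learned
# on large-scale single-view data can transfer to multi-view settings.
pe_single = canonical_3d_encoding(single_view_pts)
pe_multi = canonical_3d_encoding(multi_view_pts)
assert pe_single.shape == (16, 24) and pe_multi.shape == (64, 24)
```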

SR-3D model architecture diagram
❄️ Frozen and 🔥 Trainable parameters.

Performance on 2D Spatial Benchmarks

Incorporating 3D positional information improves spatial understanding in single-view models: compared to its base model, NVILA-Lite-8B, SR-3D achieves higher accuracy on 2D spatial benchmarks.

3D Region-Level Spatial Understanding

Qualitative examples of 3D region-level spatial understanding
We show challenging cases in which the same region prompts are reused across samples but refer to different target objects. SR-3D answers all queries correctly, indicating that it genuinely reasons about 3D spatial relationships rather than relying on 2D cues.

3D Scene Benchmarks

VSI-Bench results table
Results on VSI-Bench. SR-3D answers spatial questions correctly even without region prompts.

Citation

@article{cheng2025sr3d,
  title={3D Aware Region Prompted Vision Language Model},
  author={An-Chieh Cheng and Yang Fu and Yukang Chen and Zhijian Liu and Xiaolong Li and Subhashree Radhakrishnan and Song Han and Yao Lu and Jan Kautz and Pavlo Molchanov and Hongxu Yin and Xiaolong Wang and Sifei Liu},
  journal={arXiv preprint arXiv:2509.13317},
  year={2025},
}

Acknowledgement

Teaser videos are sourced from publicly available YouTube channels ([1], [2], and [3]) for academic purposes only.