
SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models

UC San Diego, The University of Hong Kong, NVIDIA


Integrating Relative Depth into VLMs

SpatialRGPT uses relative depth maps alongside RGB images to enhance geometric reasoning. Incorporating depth information is challenging, however, because VLM visual encoders are typically trained only on text and 2D images, and naively combining RGB and depth features can harm performance. To address this, we introduce an add-on module that processes depth maps with the same image encoder and a dedicated depth-to-language connector. The connector's weights are initialized from the RGB connector and trained on spatial QAs, enabling the 2D encoder to flexibly utilize depth information without requiring extensive training data.

An architecture overview of SpatialRGPT. ❄/🔥 denote frozen/trainable parameters.
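Below is a minimal PyTorch sketch of this add-on design, assuming a generic frozen `image_encoder`, a two-layer MLP connector, and single-channel relative depth maps. Module and argument names are illustrative, not the released implementation:

```python
import copy
import torch
import torch.nn as nn

class SpatialConnectors(nn.Module):
    """Sketch of the depth add-on: one frozen 2D image encoder embeds both
    modalities, and a depth-to-language connector initialized from the RGB
    connector maps depth features into the LLM token space."""

    def __init__(self, image_encoder: nn.Module, vision_dim: int, llm_dim: int):
        super().__init__()
        self.image_encoder = image_encoder
        for p in self.image_encoder.parameters():   # ❄ encoder stays frozen
            p.requires_grad_(False)
        # RGB-to-language connector (🔥 trainable), a simple 2-layer MLP here
        self.rgb_connector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        # Depth connector starts from a copy of the RGB connector's weights
        self.depth_connector = copy.deepcopy(self.rgb_connector)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        rgb_feat = self.image_encoder(rgb)                    # (B, N, vision_dim)
        # Single-channel relative depth, tiled to 3 channels for the 2D encoder
        depth_feat = self.image_encoder(depth.repeat(1, 3, 1, 1))
        # Concatenate projected RGB and depth tokens for the LLM
        return torch.cat(
            [self.rgb_connector(rgb_feat), self.depth_connector(depth_feat)], dim=1
        )
```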

Open Spatial Dataset (OSD)

Having an effective training dataset is crucial. Our data pipeline generates 3D region-aware annotations from 2D images at scale by constructing a 3D scene graph for each image. This process involves three components: (i) open-vocabulary detection and segmentation, (ii) metric depth estimation, and (iii) camera calibration. The scene graphs are then transformed into region-aware spatial QAs using both template-based and LLM-based approaches.

The figure shows our automatic data curation pipeline, which operates on single images.
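A simplified sketch of the scene-graph construction step: the masks, metric depth map, and intrinsics `K` are assumed to come from the three upstream components, and edges here carry only centroid distances as a stand-in for the full set of spatial relations:

```python
import numpy as np

def build_scene_graph(masks, labels, depth, K):
    """Assemble a 3D scene graph from one image's outputs: instance masks and
    labels from (i) open-vocabulary detection/segmentation, a metric depth
    map in meters from (ii), and 3x3 intrinsics K from (iii) calibration."""
    nodes = []
    for mask, label in zip(masks, labels):
        v, u = np.nonzero(mask)                  # pixel rows/cols in the region
        z = depth[v, u]
        # Back-project masked pixels into camera-frame 3D points
        x = (u - K[0, 2]) * z / K[0, 0]
        y = (v - K[1, 2]) * z / K[1, 1]
        pts = np.stack([x, y, z], axis=1)
        nodes.append({"label": label, "points": pts, "centroid": pts.mean(axis=0)})
    # Edges hold pairwise spatial relations; centroid distance shown here
    edges = [
        (i, j, float(np.linalg.norm(nodes[i]["centroid"] - nodes[j]["centroid"])))
        for i in range(len(nodes)) for j in range(i + 1, len(nodes))
    ]
    return {"nodes": nodes, "edges": edges}
```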

Our pipeline is fully automated and requires only RGB images, which lets us utilize any available open-source data. We curate our dataset from OpenImages, yielding 8.7M spatial concepts grounded in 5M unique regions across 1M images. Our results show that combining template-based QAs with LLM-based reasoning QAs produces a model capable of handling more complex spatial reasoning questions; a toy example of template instantiation follows the dataset samples below.

Samples from our Open Spatial Dataset.
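As an illustration of the template-based approach, here is a toy instantiation of a distance QA from one scene-graph edge. The `<region k>` tag format mirrors the region-aware prompts, but the templates themselves are our own:

```python
import random

# Illustrative templates; wording is ours, not the dataset's exact phrasing.
DISTANCE_TEMPLATES = [
    "What is the distance between <region {a}> and <region {b}>?",
    "How far apart are <region {a}> and <region {b}>?",
]

def make_distance_qa(a, b, meters):
    """Turn one (region a, region b, distance) edge into a QA pair."""
    question = random.choice(DISTANCE_TEMPLATES).format(a=a, b=b)
    answer = f"<region {a}> and <region {b}> are {meters:.2f} meters apart."
    return {"question": question, "answer": answer}

print(make_distance_qa(0, 1, 1.83))
```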

SpatialRGPT-Bench

Currently, no vision-language benchmarks focus on VLMs' understanding of 3D spatial concepts such as metric distances or size differences between objects. To fill this gap, we developed SpatialRGPT-Bench, a spatial reasoning VQA benchmark built from urban (nuScenes, KITTI), indoor (SUNRGBD, ARKitScenes), and simulated (Hypersim) environments. We use preprocessed ground-truth 3D cuboids from Omni3D, positioned within a unified 3D camera coordinate system and categorized by object class.

Our SpatialRGPT-Bench includes 657 qualitative and 749 quantitative VQA pairs, covering 88 distinct classes.
Samples from our SpatialRGPT-Bench.
SpatialRGPT-Bench Qualitative Results: Numbers represent success rates in percentage (↑).
Method               Below/Above  Left/Right  Big/Small  Tall/Short  Wide/Thin  Behind/Front  Avg.
GPT-4                64.1         42.8        42.8       61.6        61.6       49.0          57.8
GPT-4V               63.3         46.6        64.1       60.7        68.2       45.4          58.1
LLaVA-v1.6-34B       44.1         45.7        36.7       53.5        37.5       45.4          43.9
GPT-4V+SoM           75.0         55.2        42.4       54.4        49.0       47.2          54.3
LLaVA-v1.6-34B+SoM   44.1         40.0        33.9       47.3        41.3       46.3          42.3
Kosmos-2             28.3         15.2        4.71       26.7        12.5       12.7          17.0
RegionVILA           30.8         47.6        35.8       44.6        35.5       49.0          40.4
SpatialRGPT          99.1         99.0        79.2       89.2        83.6       87.2          89.8
SpatialRGPT-Depth    99.1         99.0        80.1       91.9        87.5       91.8          91.7
SpatialRGPT-Bench Quantitative Results: each column reports the success rate within ±25% of the ground truth in percentage (Acc, ↑) and the absolute relative error in metric scale (Err, ↓); for Direction, the error is the absolute angular error in degrees.
Method               Direct Dist.   Horiz. Dist.   Vert. Dist.    Width          Height         Direction
                     Acc↑   Err↓    Acc↑   Err↓    Acc↑   Err↓    Acc↑   Err↓    Acc↑   Err↓    Acc↑   Err↓
GPT-4                21.6   1.29    11.5   2.08    33.0   0.65    52.3   0.52    48.1   1.40    34.6   83.7°
GPT-4V               29.7   0.92    25.4   2.75    33.0   0.48    51.1   0.37    68.4   1.57    43.9   69.9°
LLaVA-v1.6-34B       24.3   0.76    24.5   1.59    30.1   0.62    30.8   0.40    42.8   1.96    33.6   78.2°
GPT-4V+SoM           25.7   1.02    22.1   2.36    33.9   0.64    45.8   0.70    62.4   1.08    54.2   55.5°
LLaVA-v1.6-34B+SoM   12.8   1.15    20.4   1.79    11.3   0.95    9.02   0.91    7.52   3.11    12.8   33.3°
Kosmos-2             4.05   >10     4.91   >10     1.89   2.26    3.01   5.42    1.50   3.82    1.86   104°
RegionVILA           22.3   1.30    24.6   3.26    17.9   >10     36.8   >10     49.6   1.61    35.5   79.8°
SpatialRGPT          35.1   0.35    59.0   0.27    53.8   0.27    51.9   0.31    54.9   0.63    95.3   17.1°
SpatialRGPT-Depth    41.2   0.33    65.6   0.25    51.9   0.27    49.6   0.31    57.9   0.61    95.3   15.4°
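For clarity, a minimal sketch of how the two quantitative metrics above can be computed, assuming the success criterion is a relative error of at most 25% (function and example values are ours):

```python
import numpy as np

def quantitative_metrics(pred, gt):
    """Success rate (%) within ±25% of the ground truth (higher is better)
    and mean absolute relative error (lower is better)."""
    rel_err = np.abs(pred - gt) / gt
    success = 100.0 * np.mean(rel_err <= 0.25)
    return success, float(np.mean(rel_err))

# e.g., predicted vs. ground-truth distances in meters
print(quantitative_metrics(np.array([1.0, 2.4, 3.0]), np.array([1.1, 2.0, 3.1])))
```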

Applications

Complex Spatial Reasoning

SpatialRGPT can function as a complex spatial reasoner on its own. Unlike SpatialVLM, which relies on GPT-4 for reasoning and employs its VLM only for basic spatial queries, SpatialRGPT integrates both capabilities directly. In the sample below, SpatialRGPT performs complex spatial reasoning, handling cases that current leading vision-language models, such as GPT-4V, struggle with.

Region-aware Dense Reward Annotator

Recent research has shown that VLMs can annotate rewards for robotics tasks from natural language. However, language ambiguity poses challenges. SpatialRGPT addresses this by allowing direct specification of the regions of interest. We conducted a real-robot experiment in which SpatialRGPT uses bounding boxes for the fingertip and a green cube to annotate rewards based on the distance between the two regions. The annotated distance decreased as the fingertip approached the cube, with the depth variant performing slightly better than the RGB variant. This demonstrates SpatialRGPT's effectiveness as a precise and efficient region-aware dense reward annotator.
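A sketch of how such a dense reward might be queried per frame; `model.generate` and its arguments are hypothetical stand-ins for the actual inference API, not a documented interface:

```python
import re

# Region tags refer to the two boxes passed alongside the frame.
PROMPT = "What is the distance between <region 0> and <region 1>?"

def annotate_reward(model, frame, fingertip_box, cube_box):
    """Query SpatialRGPT for the fingertip-to-cube distance and return its
    negation as a dense reward (closer fingertip -> higher reward)."""
    answer = model.generate(frame, regions=[fingertip_box, cube_box], prompt=PROMPT)
    # Parse the first number (meters) out of the free-form answer
    match = re.search(r"(\d+(?:\.\d+)?)", answer)
    distance = float(match.group(1)) if match else float("nan")
    return -distance
```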

Citation


@article{cheng2024spatialrgpt,
  title={SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models},
  author={Cheng, An-Chieh and Yin, Hongxu and Fu, Yang and Guo, Qiushan and Yang, Ruihan and Kautz, Jan and Wang, Xiaolong and Liu, Sifei},
  journal={arXiv preprint arXiv:2406.01584},
  year={2024}
}
  

Acknowledgement

This website is adapted from Nerfies and GLaMM, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.