SpatialRGPT: Grounded Spatial Reasoning in
Vision Language Models

NeurIPS 2024

An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang,

Jan Kautz, Xiaolong Wang, Sifei Liu

UC San Diego, NVIDIA


SpatialRGPT is a powerful vision-language model adept at understanding both 2D and 3D spatial arrangements. It can process any region proposal, such as boxes or masks, and provide answers to complex spatial reasoning questions.


Integrating Relative Depth into VLMs

SpatialRGPT uses relative depth maps alongside RGB images to enhance geometric reasoning. However, incorporating depth information is challenging because VLM visual encoders are typically trained only on text and 2D images, and simply combining RGB and depth features can harm performance. To address this, we introduce an add-on module that processes depth maps using the same image encoder and a depth-to-language connector. The weights of this connector are initialized from the RGB connector and trained on spatial-related QAs, which lets the 2D encoder flexibly utilize depth information without requiring extensive training data.
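The initialization step described above can be illustrated with a toy sketch. This is not the authors' implementation: `LinearConnector` is a hypothetical stand-in for the real connector module, the weights are made-up numbers, and the "training step" is a single manual update. It only shows that the depth connector starts as a copy of the RGB connector and is then updated independently.

```python
import copy
from dataclasses import dataclass


@dataclass
class LinearConnector:
    """Hypothetical stand-in for a feature-to-language connector (one linear layer)."""
    weight: list  # out_dim rows, each with in_dim entries
    bias: list    # out_dim entries

    def __call__(self, feature):
        # Plain matrix-vector product plus bias.
        return [
            sum(w * x for w, x in zip(row, feature)) + b
            for row, b in zip(self.weight, self.bias)
        ]


# Made-up weights standing in for a trained RGB connector.
rgb_connector = LinearConnector(weight=[[0.5, -0.2], [0.1, 0.3]], bias=[0.0, 0.1])

# The depth connector starts as an independent copy of the RGB connector.
depth_connector = copy.deepcopy(rgb_connector)

# Both connectors give identical outputs before any depth-specific training.
feature = [1.0, 2.0]
same_at_init = depth_connector(feature) == rgb_connector(feature)

# Simulated training step on spatial QAs: only the depth connector changes.
depth_connector.weight[0][0] += 0.01
```

Because the copy is deep, the update to `depth_connector` leaves `rgb_connector` untouched, which mirrors the idea of adding depth as a separate module rather than retraining the RGB path.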

An architecture overview of SpatialRGPT. ❄ and 🔥 denote frozen and trainable parameters, respectively.

Open Spatial Dataset (OSD)

Having an effective training dataset is crucial. Our data pipeline generates 3D region-aware annotations from 2D images at scale by constructing a 3D scene graph for each image. This process involves three components: (i) open-vocabulary detection and segmentation, (ii) metric depth estimation, and (iii) camera calibration. The scene graphs are then transformed into region-aware spatial QAs using both template-based and LLM-based approaches.
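To make the role of the three components concrete, here is a minimal sketch of how a region mask, a metric depth map, and camera intrinsics could be combined into one scene-graph edge. It assumes a standard pinhole camera model with x pointing right, y down, and z forward; the function names, the tiny depth map, and the intrinsics are all illustrative and are not taken from the SpatialRGPT code.

```python
import math
from statistics import fmean


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def region_centroid(pixels, depth_map, intrinsics):
    """Average the back-projected 3D points of all pixels in a region mask."""
    fx, fy, cx, cy = intrinsics
    points = [backproject(u, v, depth_map[v][u], fx, fy, cx, cy) for u, v in pixels]
    return tuple(fmean(p[i] for p in points) for i in range(3))


# Illustrative inputs: a 2x4 depth map (meters) and made-up intrinsics.
depth_map = [
    [2.0, 2.0, 4.0, 4.0],
    [2.0, 2.0, 4.0, 4.0],
]
intrinsics = (2.0, 2.0, 1.5, 0.5)  # fx, fy, cx, cy

# Two regions, given as lists of (u, v) pixel coordinates from segmentation.
region_a = [(0, 0), (1, 0), (0, 1), (1, 1)]
region_b = [(2, 0), (3, 0), (2, 1), (3, 1)]

centroid_a = region_centroid(region_a, depth_map, intrinsics)
centroid_b = region_centroid(region_b, depth_map, intrinsics)

# One scene-graph edge holding the spatial facts that QAs can be built from.
edge = {
    "subject": "region_a",
    "object": "region_b",
    "distance_m": math.dist(centroid_a, centroid_b),
    "left_of": centroid_a[0] < centroid_b[0],
    "in_front_of": centroid_a[2] < centroid_b[2],
}
```

A real pipeline would also need to handle noisy depth, outlier points, and object orientation; the sketch only shows why all three inputs are required to obtain metric 3D relations from a single image.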

The figure shows our automatic data curation pipeline, which operates on single images.

Our pipeline is completely automated and only needs RGB images. This enables us to utilize any available open-source data. We curate our dataset using OpenImages, resulting in 8.7M spatial concepts grounded in 5M unique regions from 1M images. Our results show that combining template-based QAs and LLM-based reasoning QAs helps develop a model capable of handling more complex spatial reasoning questions.
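As an illustration of the template-based half of this process, the sketch below turns one scene-graph edge into question-answer pairs. The template wording, the `<region0>`/`<region1>` placeholders, and the edge fields are assumptions made for this example; the actual templates used to build the dataset may differ.

```python
# Hypothetical templates; the real dataset likely uses many paraphrases per relation.
DISTANCE_QUESTION = "How far apart are {a} and {b}?"
LEFT_QUESTION = "Is {a} to the left of {b}?"


def format_distance(meters):
    """Render a metric distance in a human-friendly unit."""
    if meters < 1.0:
        return f"{meters * 100:.0f} centimeters"
    return f"{meters:.2f} meters"


def template_qas(edge):
    """Build one quantitative and one qualitative QA pair from a scene-graph edge."""
    a, b = edge["subject"], edge["object"]
    distance_answer = f"{a} and {b} are {format_distance(edge['distance_m'])} apart."
    if edge["left_of"]:
        left_answer = f"Yes, {a} is to the left of {b}."
    else:
        left_answer = f"No, {a} is to the right of {b}."
    return [
        (DISTANCE_QUESTION.format(a=a, b=b), distance_answer),
        (LEFT_QUESTION.format(a=a, b=b), left_answer),
    ]


# Illustrative edge with made-up values.
example_edge = {
    "subject": "<region0>",
    "object": "<region1>",
    "distance_m": 3.61,
    "left_of": True,
}
qas = template_qas(example_edge)
```

Template QAs of this kind are cheap to generate at scale but rigid in form, which is one reason to complement them with LLM-generated reasoning QAs.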

SpatialRGPT-Bench

Currently, no vision-language benchmarks focus on VLMs’ understanding of 3D spatial concepts like metric distance or size differences between objects. To fill this gap, we developed SpatialRGPT-Bench, a spatial reasoning VQA benchmark using data from urban (nuScenes, KITTI), indoor (SUNRGBD, ARKitScenes), and simulated (Hypersim) environments. We use preprocessed ground-truth 3D cuboids from Omni3D, positioned within a unified 3D camera coordinate system and categorized by object classes.
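The sketch below shows one plausible way to derive ground-truth qualitative relations from two cuboids in a shared camera frame. The `Cuboid` class, its field layout, the axis convention (x right, y down, z forward), and the comparison rules are assumptions for illustration; the benchmark's actual labeling rules are not specified on this page.

```python
from dataclasses import dataclass


@dataclass
class Cuboid:
    """Hypothetical 3D box: center and size in meters, camera frame."""
    label: str
    center: tuple  # (x, y, z); assumed x right, y down, z forward
    size: tuple    # (width, height, length)


def volume(cuboid):
    width, height, length = cuboid.size
    return width * height * length


def qualitative_labels(a, b):
    """Describe cuboid a relative to cuboid b. Ties are not handled in this sketch."""
    return {
        "below_above": "above" if a.center[1] < b.center[1] else "below",
        "left_right": "left" if a.center[0] < b.center[0] else "right",
        "big_small": "bigger" if volume(a) > volume(b) else "smaller",
        "tall_short": "taller" if a.size[1] > b.size[1] else "shorter",
        "wide_thin": "wider" if a.size[0] > b.size[0] else "thinner",
        "behind_front": "front" if a.center[2] < b.center[2] else "behind",
    }


# Made-up example objects.
car = Cuboid("car", center=(-2.0, 0.5, 8.0), size=(1.8, 1.5, 4.5))
pedestrian = Cuboid("pedestrian", center=(1.0, 0.3, 5.0), size=(0.6, 1.7, 0.5))

labels = qualitative_labels(car, pedestrian)
```

The six keys correspond to the six qualitative categories reported in the results table.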

**SpatialRGPT-Bench Qualitative Results**: Numbers represent success rates in percentage (↑).
Method	Below/ Above	Left/ Right	Big/ Small	Tall/ Short	Wide/ Thin	Behind/ Front	Avg.
GPT-4	64.1	42.8	42.8	61.6	61.6	49.0	57.8
GPT-4V	63.3	46.6	64.1	60.7	68.2	45.4	58.1
LLaVA-v1.6-34B	44.1	45.7	36.7	53.5	37.5	45.4	43.9
GPT-4V+SoM	75.0	55.2	42.4	54.4	49.0	47.2	54.3
LLaVA-v1.6-34B+SoM	44.1	40.0	33.9	47.3	41.3	46.3	42.3
Kosmos-2	28.3	15.2	4.71	26.7	12.5	12.7	17.0
RegionVILA	30.8	47.6	35.8	44.6	35.5	49.0	40.4
SpatialRGPT	99.1	99.0	79.2	89.2	83.6	87.2	89.8
SpatialRGPT-Depth	99.1	99.0	80.1	91.9	87.5	91.8	91.7

**SpatialRGPT-Bench Quantitative Results**: Numbers represent success rates within ±25% of the ground-truth in percentage (↑) and absolute relative error in metric scale (↓).
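Method	Direct Distance		Horizontal Distance		Vertical Distance		Width		Height		Direction	
	Success (↑)	Error (↓)	Success (↑)	Error (↓)	Success (↑)	Error (↓)	Success (↑)	Error (↓)	Success (↑)	Error (↓)	Success (↑)	Error (↓)
GPT-4	21.6	1.29	11.5	2.08	33.0	0.65	52.3	0.52	48.1	1.40	34.6	83.7°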
GPT-4V	29.7	0.92	25.4	2.75	33.0	0.48	51.1	0.37	68.4	1.57	43.9	69.9°
LLaVA-v1.6-34B	24.3	0.76	24.5	1.59	30.1	0.62	30.8	0.40	42.8	1.96	33.6	78.2°
GPT-4V+SoM	25.7	1.02	22.1	2.36	33.9	0.64	45.8	0.70	62.4	1.08	54.2	55.5°
LLaVA-v1.6-34B+SoM	12.8	1.15	20.4	1.79	11.3	0.95	9.02	0.91	7.52	3.11	12.8	33.3°
Kosmos-2	4.05	>10	4.91	>10	1.89	2.26	3.01	5.42	1.50	3.82	1.86	104°
RegionVILA	22.3	1.30	24.6	3.26	17.9	>10	36.8	>10	49.6	1.61	35.5	79.8°
SpatialRGPT	35.1	0.35	59.0	0.27	53.8	0.27	51.9	0.31	54.9	0.63	95.3	17.1°
SpatialRGPT-Depth	41.2	0.33	65.6	0.25	51.9	0.27	49.6	0.31	57.9	0.61	95.3	15.4°
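For readers who want to reproduce the two metric-scale numbers per column, the following sketch computes a success rate within ±25% of the ground truth and a mean absolute relative error. It follows the caption's definitions as literally as possible, but the exact evaluation code (for example, how free-form answers are parsed into numbers) is not shown on this page, so treat the details as assumptions. The angular error reported for the Direction column is a different quantity and is not covered here.

```python
def success_rate(predictions, ground_truths, tolerance=0.25):
    """Percentage of predictions within +/- tolerance (relative) of the ground truth."""
    hits = sum(
        abs(pred - gt) <= tolerance * gt
        for pred, gt in zip(predictions, ground_truths)
    )
    return 100.0 * hits / len(ground_truths)


def abs_relative_error(predictions, ground_truths):
    """Mean of |prediction - ground truth| / ground truth."""
    errors = [abs(pred - gt) / gt for pred, gt in zip(predictions, ground_truths)]
    return sum(errors) / len(errors)


# Made-up distances in meters, for illustration only.
predicted = [1.0, 2.6, 0.5, 4.0]
actual = [1.2, 2.0, 1.0, 4.0]

rate = success_rate(predicted, actual)
error = abs_relative_error(predicted, actual)
```

In this toy example two of the four predictions fall inside the ±25% band, so the success rate is 50%, while the relative error averages the per-sample deviations regardless of the band.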

Applications

Complex Spatial Reasoning

SpatialRGPT can function as a complex spatial reasoner on its own. Unlike SpatialVLM, which uses GPT-4 for reasoning tasks and employs a VLM only for basic spatial queries, SpatialRGPT directly integrates these two capabilities. In the sample below, we show that SpatialRGPT is capable of complex spatial reasoning, addressing gaps that current leading vision language models, such as GPT-4V, struggle with.

Region-aware Dense Reward Annotator

Recent research has shown that VLMs can annotate rewards for robotics tasks using natural language. However, challenges arise due to language ambiguity. SpatialRGPT addresses this by allowing direct specification of regions of interest. We conducted a real robot experiment in which SpatialRGPT used bounding boxes for the fingertip and a green cube to annotate rewards based on the distance between these regions. The results indicated a decreasing distance as the fingertip approached the cube, with the depth variant performing slightly better than the RGB variant. This demonstrates SpatialRGPT’s effectiveness as a precise and efficient region-aware dense reward annotator.
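A minimal sketch of how such distance answers could be turned into a dense reward signal is given below. The answer strings, the parsing regex, and the choice of negative distance as the reward are assumptions for illustration; the page does not specify the model's output format or the reward shaping used in the experiment.

```python
import re

# Conversion factors for the units this sketch recognizes.
_UNIT_TO_METERS = {
    "meter": 1.0, "meters": 1.0, "m": 1.0,
    "centimeter": 0.01, "centimeters": 0.01, "cm": 0.01,
}

_DISTANCE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(centimeters?|meters?|cm|m)\b")


def parse_distance_m(answer):
    """Extract the first distance mentioned in a free-form answer, in meters."""
    match = _DISTANCE_PATTERN.search(answer)
    if match is None:
        return None
    value, unit = match.groups()
    return float(value) * _UNIT_TO_METERS[unit]


def dense_reward(distance_m):
    """Closer is better: use the negative distance as the per-frame reward."""
    return -distance_m


# Hypothetical per-frame answers as the fingertip approaches the cube.
answers = [
    "The fingertip is 0.42 meters from the cube.",
    "The fingertip is 0.30 meters from the cube.",
    "The fingertip is 12 centimeters from the cube.",
]
distances = [parse_distance_m(answer) for answer in answers]
rewards = [dense_reward(d) for d in distances]
```

With these inputs the reward increases frame by frame as the estimated distance shrinks, which is the behavior a dense reward annotator should exhibit.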

Citation


  @inproceedings{cheng2024spatialrgpt,
          title={SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models},
          author={Cheng, An-Chieh and Yin, Hongxu and Fu, Yang and Guo, Qiushan and Yang, Ruihan and Kautz, Jan and Wang, Xiaolong and Liu, Sifei},
          booktitle={NeurIPS},
          year={2024}
  }

Acknowledgement

This website is adapted from Nerfies and GLaMM, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
