My research focuses on 3D Vision. I’m interested in how we can enable machines to better understand
the physical world through self-supervised methods.
Vision Language Models (VLMs) have
demonstrated remarkable performance in 2D vision and language tasks. However, their ability to
reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT
(SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities. SpatialRGPT
advances VLMs' spatial understanding through two key innovations: (1) a data curation pipeline
that enables effective learning of regional representation from 3D scene graphs, and (2) a
flexible plugin module for integrating depth information into the visual encoder of existing
VLMs. During inference, when provided with user-specified region proposals, SpatialRGPT can
accurately perceive their relative directions and distances. Additionally, we propose
SpatialRGPT-Bench, a benchmark with ground-truth 3D annotations encompassing indoor, outdoor,
and simulated environments, for evaluating 3D spatial cognition in VLMs. Our results demonstrate
that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and
without local region prompts. The model also exhibits strong generalization capabilities,
effectively reasoning about complex spatial relations and functioning as a region-aware dense
reward annotator for robotic tasks.
@inproceedings{cheng2024spatialrgpt,
title = {SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models},
author = {Cheng, An-Chieh and Yin, Hongxu and Fu, Yang and Guo, Qiushan and Yang, Ruihan and Kautz, Jan and Wang, Xiaolong and Liu, Sifei},
booktitle = {arXiv preprint arXiv:2406.01584},
year = {2024}
}
Textures are a vital aspect of creating visually appealing and realistic 3D models. In this paper,
we study the problem of generating high-fidelity textures given the shapes of 3D assets, which has
been less explored than generic 3D shape modeling. Our goal is to facilitate a controllable
texture generation process, such that one texture code can correspond to a particular appearance
style independent of any input shape from a category. We introduce Texture UV Radiance Fields
(TUVF) that generate textures in a learnable UV sphere space rather than directly on the 3D shape.
This allows the texture to be disentangled from the underlying shape and transferred to other
shapes that share the same UV space, i.e., shapes from the same category. We integrate the UV
sphere space with the radiance field, which provides a more efficient and accurate representation
of textures than traditional texture maps. We perform our experiments on real-world object
datasets, where we achieve not only realistic synthesis but also substantial improvements over
state-of-the-art methods on texture control and editing.
@inproceedings{cheng2024tuvf,
title = {TUVF: Learning Generalizable Texture UV Radiance Fields},
author = {Cheng, An-Chieh and Li, Xueting and Liu, Sifei and Wang, Xiaolong},
booktitle = {International Conference on Learning Representations},
year = {2024}
}
Autoregressive 3D Shape Generation via Canonical Mapping
An-Chieh Cheng*, Xueting Li*, Sifei Liu, Min Sun, Ming-Hsuan Yang
ECCV, 2022
We decompose the point cloud into meaningful shape sequences, then we encode these sequences
through a transformer for generation.
With their capacity to model long-range dependencies in sequential data, transformers have shown
remarkable performance in a variety of generative tasks such as image, audio, and text
generation. Yet taming them to generate less structured and voluminous data formats such as
high-resolution point clouds has seldom been explored, due to ambiguous sequentialization
processes and an infeasible computational burden. In this paper, we aim to further exploit the
power of transformers and employ them for the task of 3D point cloud generation. The key idea is
to decompose point clouds of one category into semantically aligned sequences of shape
compositions via a learned canonical space. These shape compositions can then be quantized and
used to learn a context-rich composition codebook for point cloud generation. Experimental
results on point cloud reconstruction and unconditional generation show that our model performs
favorably against state-of-the-art approaches. Furthermore, our model can be easily extended to
multi-modal shape completion as an application of conditional shape generation.
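The quantization step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a plain nearest-neighbor lookup against a fixed codebook; the function name `quantize`, the 2D toy features, and the three-entry codebook are all hypothetical and are not taken from the paper, whose codebook is learned jointly with the model.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry.

    Hypothetical helper for illustration only: `features` has shape (N, D),
    `codebook` has shape (K, D). Returns the chosen indices and the
    corresponding codebook vectors.
    """
    # Squared Euclidean distance between every feature and every code, shape (N, K).
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)
    return indices, codebook[indices]

# Toy data (assumed values, not real shape-composition features).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
features = np.array([[0.9, 1.2], [0.1, -0.2], [-0.8, 0.7]])

indices, quantized = quantize(features, codebook)
print(indices)    # one discrete token per feature vector
print(quantized)  # the codebook vectors standing in for the features
```

The resulting index sequence is the kind of discrete token sequence an autoregressive transformer could be trained on.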
@inproceedings{cheng2022autoregressive,
title = {Autoregressive 3D Shape Generation via Canonical Mapping},
author = {Cheng, An-Chieh and Li, Xueting and Liu, Sifei and Sun, Min and Yang, Ming-Hsuan},
booktitle = {ECCV},
year = {2022}
}
We propose a canonical point autoencoder (CPAE) that predicts dense correspondences between 3D
shapes of the same category. The autoencoder performs two key functions: (a) encoding an
arbitrarily ordered point cloud to a canonical primitive, e.g., a sphere, and (b) decoding the
primitive back to the original input instance shape. Placed at the bottleneck, this primitive
plays a key role in mapping all the unordered point clouds onto the canonical surface so that
they can be reconstructed in an ordered fashion. Once the model is trained, points from different
shape instances that are mapped to the same location on the primitive surface are treated as a
corresponding pair. Our method does not require any form of annotation or self-supervised part
segmentation network and can handle unaligned input point clouds. Experimental results on 3D
semantic keypoint transfer and part segmentation transfer show that our model performs favorably
against state-of-the-art correspondence learning methods.
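As a rough illustration of how correspondences could be read off the canonical primitive, the sketch below matches points by nearest neighbor in canonical coordinates. This is an assumption about how "same location" might be put into practice, not the paper's implementation: the function name `canonical_correspondence` is hypothetical, and the canonical coordinates are hand-written toy values rather than the output of a trained encoder.

```python
import numpy as np

def canonical_correspondence(canon_a, canon_b):
    """For each point of shape A, return the index of the point of shape B
    whose canonical coordinate is closest.

    Hypothetical helper: `canon_a` (Na, 3) and `canon_b` (Nb, 3) stand in for
    the positions an encoder would assign on the primitive surface.
    """
    # Pairwise squared distances between canonical coordinates, shape (Na, Nb).
    diff = canon_a[:, None, :] - canon_b[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    return np.argmin(dist2, axis=1)

# Toy canonical coordinates near the unit sphere (assumed values).
canon_a = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
canon_b = np.array([[0.0, 0.1, 0.99], [0.99, 0.1, 0.0], [0.1, 0.99, 0.0]])

matches = canonical_correspondence(canon_a, canon_b)
print(matches)  # matches[i] is the index in B paired with point i of A
```

With such indices, a label attached to a point of one shape (a keypoint or part label, for example) can be copied to its matched point on the other shape.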
@inproceedings{cheng2021learning,
title = {Learning 3D Dense Correspondence via Canonical Point Autoencoder},
author = {Cheng, An-Chieh and Li, Xueting and Sun, Min and Yang, Ming-Hsuan and Liu, Sifei},
booktitle = {Advances in Neural Information Processing Systems},
year = {2021}
}