My research focuses on 3D Vision. I’m interested in how we can enable machines to better understand
the physical world through self-supervised methods.
NaVILA: Legged Robot Vision-Language-Action Model for Navigation
An-Chieh Cheng*, Yandong Ji*, Zhaojing Yang*, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin♠, Sifei Liu♠, Xiaolong Wang♠
preprint, 2024
A two-level framework that combines VLAs with locomotion skills for navigation. The VLA is adapted
from a VLM and learns from human touring videos.
This paper proposes to solve the problem of Vision-and-Language Navigation with legged robots, which
not only provides a flexible way for humans to issue commands but also allows the robot to navigate
through more challenging and cluttered scenes. However, it is non-trivial to translate human language
instructions all the way to low-level leg joint actions. We propose NaVILA, a 2-level framework that
unifies a Vision-Language-Action model (VLA) with locomotion skills. Instead of directly predicting
low-level actions from the VLA, NaVILA first generates mid-level actions with spatial information in
the form of language (e.g., “moving forward 75cm”), which serve as input to a visual locomotion RL
policy for execution. NaVILA substantially improves on previous approaches on existing benchmarks. The
same advantages are demonstrated in our newly developed benchmarks with IsaacLab, featuring more
realistic scenes, low-level controls, and real-world robot experiments.
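As a rough illustration of the mid-level interface described above, the sketch below parses a language-form action into a structured command that a downstream locomotion policy could consume. Only the phrase “moving forward 75cm” comes from the abstract; the turning phrases, the `parse_mid_level_action` name, and the output format are assumptions made for this example and are not taken from the NaVILA code.

```python
import re

# Hypothetical phrase pattern. Only "moving forward <N>cm" appears in the
# abstract; the turning verbs and the "degrees" unit are assumptions.
_ACTION = re.compile(
    r"(?P<verb>moving forward|turning left|turning right)\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>cm|degrees)",
    re.IGNORECASE,
)


def parse_mid_level_action(text):
    """Turn a language-form action into a (kind, magnitude) pair.

    Distances are returned in metres, angles in degrees (positive = left).
    Returns None when the text matches no known pattern.
    """
    match = _ACTION.search(text)
    if match is None:
        return None
    verb = match.group("verb").lower()
    value = float(match.group("value"))
    unit = match.group("unit").lower()
    if verb == "moving forward" and unit == "cm":
        return ("forward", value / 100.0)
    if verb == "turning left" and unit == "degrees":
        return ("turn", value)
    if verb == "turning right" and unit == "degrees":
        return ("turn", -value)
    # Verb and unit do not agree (e.g. "moving forward 30 degrees").
    return None


if __name__ == "__main__":
    print(parse_mid_level_action("moving forward 75cm"))
```

Keeping the interface textual means the high-level model needs no knowledge of joint-level control; a parser of this kind is one possible place where the two levels could meet.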
@article{cheng2024navila,
title = {NaVILA: Legged Robot Vision-Language-Action Model for Navigation},
author = {Cheng, An-Chieh and Ji, Yandong and Yang, Zhaojing and Zou, Xueyan
and Kautz, Jan and Biyik, Erdem and Yin, Hongxu and Liu, Sifei and Wang, Xiaolong},
journal = {arXiv preprint arXiv:2412.04453},
year = {2024}
}
Vision Language Models (VLMs) have
demonstrated remarkable performance in 2D vision and language tasks. However, their ability to
reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT
(SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities. SpatialRGPT
advances VLMs' spatial understanding through two key innovations: (1) a data curation pipeline
that enables effective learning of regional representation from 3D scene graphs, and (2) a
flexible plugin module for integrating depth information into the visual encoder of existing
VLMs. During inference, when provided with user-specified region proposals, SpatialRGPT can
accurately perceive their relative directions and distances. Additionally, we propose
SpatialRGPT-Bench, a benchmark with ground-truth 3D annotations encompassing indoor, outdoor,
and simulated environments, for evaluating 3D spatial cognition in VLMs. Our results demonstrate
that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and
without local region prompts. The model also exhibits strong generalization capabilities,
effectively reasoning about complex spatial relations and functioning as a region-aware dense
reward annotator for robotic tasks.
@inproceedings{cheng2024spatialrgpt,
title = {SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models},
author = {Cheng, An-Chieh and Yin, Hongxu and Fu, Yang and Guo, Qiushan and Yang, Ruihan and Kautz, Jan and Wang, Xiaolong and Liu, Sifei},
booktitle = {NeurIPS},
year = {2024}
}
Textures are a vital aspect of creating visually appealing and realistic 3D models. In this paper, we
study the problem of generating high-fidelity textures given the shapes of 3D assets, which has been
relatively less explored compared with generic 3D shape modeling. Our goal is to facilitate a
controllable texture generation process, such that one texture code can correspond to a particular
appearance style independent of any input shapes from a category. We introduce Texture UV Radiance
Fields (TUVF) that generate textures in a learnable UV sphere space rather than directly on the 3D
shape. This allows the texture to be disentangled from the underlying shape and transferable to other
shapes that share the same UV space, i.e., from the same category. We integrate the UV sphere space
with the radiance field, which provides a more efficient and accurate representation of textures than
traditional texture maps. We perform our experiments on real-world object datasets where we achieve
not only realistic synthesis, but also substantial improvements over state-of-the-art methods on
texture control and editing.
@inproceedings{cheng2024tuvf,
title = {TUVF: Learning Generalizable Texture UV Radiance Fields},
author = {Cheng, An-Chieh and Li, Xueting and Liu, Sifei and Wang, Xiaolong},
booktitle = {International Conference on Learning Representations},
year = {2024}
}
Autoregressive 3D Shape Generation via Canonical Mapping
An-Chieh Cheng*, Xueting Li*, Sifei Liu, Min Sun, Ming-Hsuan Yang
ECCV, 2022
We decompose the point cloud into meaningful shape sequences, then we encode these sequences through
a transformer for generation.
With the capacity of modeling long-range dependencies in sequential data, transformers have shown
remarkable performance in a variety of generative tasks such as image, audio, and text generation.
Yet, taming them to generate less structured and voluminous data formats such as high-resolution
point clouds has seldom been explored, due to ambiguous sequentialization processes and an infeasible
computational burden. In this paper, we aim to further exploit the power of transformers and employ
them for the task of 3D point cloud generation. The key idea is to decompose point clouds of one
category into semantically aligned sequences of shape compositions, via a learned canonical space.
These shape compositions can then be quantized and used to learn a context-rich composition codebook
for point cloud generation. Experimental results on point cloud reconstruction and unconditional
generation show that our model performs favorably against state-of-the-art approaches. Furthermore,
our model can be easily extended to multi-modal shape completion as an application for conditional
shape generation.
@inproceedings{cheng2022autoregressive,
title={Autoregressive {3D} Shape Generation via Canonical Mapping},
author={Cheng, An-Chieh and Li, Xueting and Liu, Sifei and Sun, Min and Yang, Ming-Hsuan},
booktitle={European Conference on Computer Vision},
year={2022},
}
We propose a canonical point autoencoder (CPAE) that predicts dense correspondences between 3D shapes
of the same category. The autoencoder performs two key functions: (a) encoding an arbitrarily ordered
point cloud to a canonical primitive, e.g., a sphere, and (b) decoding the primitive back to the
original input instance shape. Placed at the bottleneck, this primitive plays a key role in mapping
all the unordered point clouds onto the canonical surface and in reconstructing them in an ordered
fashion. Once trained, points from different shape instances that are mapped to the same location on
the primitive surface are treated as a corresponding pair. Our method does not require any form of
annotation or self-supervised part segmentation network and can handle unaligned input point clouds.
Experimental results on 3D semantic keypoint transfer and part segmentation transfer show that our
model performs favorably against state-of-the-art correspondence learning methods.
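The correspondence rule described above can be sketched as a nearest-neighbor lookup in the canonical space. This is a minimal illustration under stated assumptions, not the CPAE implementation: the canonical coordinates are assumed to be already produced by some trained encoder (not shown), the function names are hypothetical, and a nearest-neighbor match stands in for "mapped to the same location", since two predicted coordinates rarely coincide exactly.

```python
import math


def nearest_on_primitive(query, candidates):
    """Index of the candidate canonical coordinate closest to `query`."""
    return min(
        range(len(candidates)),
        key=lambda i: math.dist(query, candidates[i]),
    )


def match_points(canon_a, canon_b):
    """For each point of shape A, find its counterpart in shape B.

    `canon_a` and `canon_b` hold one canonical coordinate per input point
    (e.g. a position on the unit sphere). The i-th entry of the result is
    the index of the point in B that lands nearest to A's i-th point.
    """
    return [nearest_on_primitive(p, canon_b) for p in canon_a]


if __name__ == "__main__":
    # Toy canonical coordinates for two shapes with two points each.
    shape_a = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    shape_b = [(0.0, 0.9, 0.1), (0.9, 0.1, 0.0)]
    print(match_points(shape_a, shape_b))
```

The brute-force search costs O(N·M) distance evaluations; for dense point clouds a spatial index such as a k-d tree would be the usual replacement.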
@inproceedings{cheng2021learning,
title = {Learning 3D Dense Correspondence via Canonical Point Autoencoder},
author = {Cheng, An-Chieh and Li, Xueting and Sun, Min and Yang, Ming-Hsuan and Liu, Sifei},
booktitle = {Advances in Neural Information Processing Systems},
year = {2021}
}