An-Chieh Cheng

An-Chieh Cheng 鄭安傑
a8cheng at ucsd dot edu

I'm a PhD student at the University of California San Diego, advised by Prof. Xiaolong Wang. I received my Master's and Bachelor's degrees in computer science from National Tsing Hua University. I was recently honored with the Qualcomm Innovation Fellowship.

I'm interested in building multimodal foundation models capable of general spatial understanding and actionable intelligence.

Google Scholar / Curriculum Vitæ / GitHub / LinkedIn / Twitter

UC San Diego
PhD
Sep. '22 - Present

NVIDIA
Research Intern
2025 Summer

Adobe
Research Intern
2023 Summer

UC Merced
Remote Visiting
Jul. '20 - Mar. '22

National Tsing Hua University
M.Sc./B.S. in CS

News

Jul 2025: We’ve open-sourced the NaVILA framework, including the VLA, the locomotion policy, and the navigation benchmark.
Mar 2025: SpatialRGPT was demoed at GTC 2025 as a part of Agentic AI for Physical Operations!

Selected Publications [Full List]

NaVILA: Legged Robot Vision-Language-Action Model for Navigation
An-Chieh Cheng*, Yandong Ji*, Zhaojing Yang*, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin^✝︎, Sifei Liu^✝︎, Xiaolong Wang^✝︎
RSS, 2025
A two-level framework that combines VLAs with locomotion skills for navigation. The VLA is adapted from a VLM and learns from human touring videos.

pdf | website | video | code | abstract

This paper proposes to solve the problem of Vision-and-Language Navigation with legged robots, which not only provides a flexible way for humans to issue commands but also allows the robot to navigate through more challenging and cluttered scenes. However, it is non-trivial to translate human language instructions all the way to low-level leg joint actions. We propose NaVILA, a 2-level framework that unifies a Vision-Language-Action model (VLA) with locomotion skills. Instead of directly predicting low-level actions from the VLA, NaVILA first generates mid-level actions with spatial information in the form of language (e.g., "moving forward 75cm"), which serve as input for a visual locomotion RL policy for execution. NaVILA substantially improves on previous approaches on existing benchmarks. The same advantages are demonstrated in our newly developed benchmarks with IsaacLab, featuring more realistic scenes, low-level controls, and real-world robot experiments.

NVILA: Efficient Frontier Visual Language Models
NVILA Team
CVPR, 2025
Frontier VLMs that are efficient in both training and inference.

pdf | website | demo | code | abstract

Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 4.5X, fine-tuning memory usage by 3.4X, pre-filling latency by 1.6-2.2X, and decoding latency by 1.2-2.8X. We will soon make our code and models available to facilitate reproducibility.

SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models
An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, Sifei Liu
NeurIPS, 2024
A powerful region-level VLM adept at 3D spatial reasoning.
✨ Demoed at GTC 2025 as a part of Agentic AI for Physical Operations!

pdf | website | video | code | abstract

Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities. SpatialRGPT advances VLMs' spatial understanding through two key innovations: (1) a data curation pipeline that enables effective learning of regional representation from 3D scene graphs, and (2) a flexible plugin module for integrating depth information into the visual encoder of existing VLMs. During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances. Additionally, we propose SpatialRGBT-Bench, a benchmark with ground-truth 3D annotations encompassing indoor, outdoor, and simulated environments, for evaluating 3D spatial cognition in VLMs. Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts. The model also exhibits strong generalization capabilities, effectively reasoning about complex spatial relations and functioning as a region-aware dense reward annotator for robotic tasks.

TUVF: Learning Generalizable Texture UV Radiance Fields
An-Chieh Cheng, Xueting Li, Sifei Liu^✝︎, Xiaolong Wang^✝︎
ICLR, 2024
Learning generalizable texture UV radiance fields for shapes.

pdf | website | video | code | abstract

Textures are a vital aspect of creating visually appealing and realistic 3D models. In this paper, we study the problem of generating high-fidelity textures given shapes of 3D assets, which has been less explored than generic 3D shape modeling. Our goal is to facilitate a controllable texture generation process, such that one texture code can correspond to a particular appearance style independent of any input shapes from a category. We introduce Texture UV Radiance Fields (TUVF) that generate textures in a learnable UV sphere space rather than directly on the 3D shape. This allows the texture to be disentangled from the underlying shape and transferable to other shapes that share the same UV space, i.e., from the same category. We integrate the UV sphere space with the radiance field, which provides a more efficient and accurate representation of textures than traditional texture maps. We perform our experiments on real-world object datasets where we achieve not only realistic synthesis, but also substantial improvements over state-of-the-art methods on texture controlling and editing.

Autoregressive 3D Shape Generation via Canonical Mapping
An-Chieh Cheng*, Xueting Li*, Sifei Liu, Min Sun, Ming-Hsuan Yang
ECCV, 2022
We decompose the point cloud into meaningful shape sequences, then we encode these sequences through a transformer for generation.

pdf | code | abstract

With the capacity to model long-range dependencies in sequential data, transformers have shown remarkable performance in a variety of generative tasks such as image, audio, and text generation. Yet taming them to generate less structured and voluminous data formats, such as high-resolution point clouds, has seldom been explored due to ambiguous sequentialization processes and an infeasible computational burden. In this paper, we aim to further exploit the power of transformers and employ them for the task of 3D point cloud generation. The key idea is to decompose point clouds of one category into semantically aligned sequences of shape compositions, via a learned canonical space. These shape compositions can then be quantized and used to learn a context-rich composition codebook for point cloud generation. Experimental results on point cloud reconstruction and unconditional generation show that our model performs favorably against state-of-the-art approaches. Furthermore, our model can be easily extended to multi-modal shape completion as an application for conditional shape generation.

Learning 3D Dense Correspondence via Canonical Point Autoencoder
An-Chieh Cheng, Xueting Li, Min Sun, Ming-Hsuan Yang, Sifei Liu
NeurIPS, 2021
Unsupervised learning of dense 3D correspondence.

pdf | website | code | abstract

We propose a canonical point autoencoder (CPAE) that predicts dense correspondences between 3D shapes of the same category. The autoencoder performs two key functions: (a) encoding an arbitrarily ordered point cloud to a canonical primitive, e.g., a sphere, and (b) decoding the primitive back to the original input instance shape. Placed at the bottleneck, this primitive plays a key role in mapping all the unordered point clouds onto the canonical surface so that they are reconstructed in an ordered fashion. Once trained, points from different shape instances that are mapped to the same location on the primitive surface are treated as a corresponding pair. Our method does not require any form of annotation or self-supervised part segmentation network and can handle unaligned input point clouds. Experimental results on 3D semantic keypoint transfer and part segmentation transfer show that our model performs favorably against state-of-the-art correspondence learning methods.
