Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni by +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens, a 6-fold reduction from Qwen2.5-Omni's 1.2T. Finally, we demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factories.
3D Aware Region Prompted Vision Language Model
An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, Hongxu Yin✝︎, Xiaolong Wang✝︎, Sifei Liu✝︎
ICLR, 2026 Region-level spatial reasoning for both single-view and multi-view inputs.
We present a Spatial Region 3D (SR-3D) aware vision-language model that connects single-view 2D images and multi-view 3D data through a shared visual token space. SR-3D supports flexible region prompting, allowing users to annotate regions with bounding boxes, segmentation masks on any frame, or directly in 3D, without the need for exhaustive multi-frame labeling. We achieve this by enriching 2D visual features with 3D positional embeddings, which allows the 3D model to draw upon strong 2D priors for more accurate spatial reasoning across frames, even when objects of interest do not co-occur within the same view. Extensive experiments on both general 2D vision-language and specialized 3D spatial benchmarks demonstrate that SR-3D achieves state-of-the-art performance, underscoring its effectiveness in unifying the 2D and 3D representation spaces for scene understanding. Moreover, we observe applicability to in-the-wild videos without sensory 3D inputs or ground-truth 3D annotations, where SR-3D accurately infers spatial relationships and metric measurements.
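The idea of enriching 2D visual features with 3D positional embeddings can be illustrated with a minimal sketch. The abstract does not say which embedding SR-3D uses, so the sinusoidal encoding, the additive fusion, and the function names below are all assumptions made for illustration, not the authors' implementation.

```python
import math

def sinusoidal_3d_embedding(xyz, dim):
    """Encode one 3D point as a vector of length `dim` (assumed scheme, not SR-3D's)."""
    if dim % 6 != 0:
        raise ValueError("dim must be divisible by 6 (sin and cos for each of 3 axes)")
    bands = dim // 6  # number of frequency bands per axis
    emb = []
    for coord in xyz:
        for k in range(bands):
            freq = 1.0 / (10000 ** (k / bands))
            emb.append(math.sin(coord * freq))
            emb.append(math.cos(coord * freq))
    return emb

def enrich_features(features, points):
    """Add a 3D positional embedding to each 2D patch feature (hypothetical fusion)."""
    if len(features) != len(points):
        raise ValueError("need one 3D point per feature vector")
    enriched = []
    for feat, xyz in zip(features, points):
        pos = sinusoidal_3d_embedding(xyz, len(feat))
        enriched.append([f + p for f, p in zip(feat, pos)])
    return enriched

# Two patch features of width 6, each paired with the 3D location of its patch.
features = [[0.0] * 6, [0.0] * 6]
points = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0)]
enriched = enrich_features(features, points)
```

Because every token carries its own 3D location, tokens from different frames can in principle be related spatially even when the objects they depict never share a view.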
EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
Ruihan Yang*, Qinxi Yu*, Yecheng Wu, Rui Yan, Borui Li, An-Chieh Cheng, Xueyan Zou, Yunhao Fang, Xuxin Cheng, Ri-Zhao Qiu, Hongxu Yin, Sifei Liu, Song Han, Yao Lu, Xiaolong Wang
Preprint, 2025 Robust dexterous manipulation generalist model utilizing diverse egocentric human manipulation videos.
Real robot data collection for imitation learning has led to significant advancements in robotic manipulation. However, the requirement for robot hardware in the process fundamentally constrains the scale of the data. In this paper, we explore training Vision-Language-Action (VLA) models using egocentric human videos. The benefit of human videos lies not only in their scale but, more importantly, in the richness of their scenes and tasks. With a VLA trained on human video that predicts human wrist and hand actions, we can perform inverse kinematics and retargeting to convert the human actions to robot actions. We fine-tune the model using a few robot manipulation demonstrations to obtain the robot policy, namely EgoVLA. We propose a simulation benchmark called Ego Humanoid Manipulation Benchmark, where we design diverse bimanual manipulation tasks with demonstrations. We fine-tune and evaluate EgoVLA with Ego Humanoid Manipulation Benchmark, show significant improvements over baselines, and ablate the importance of human data.
NaVILA: Legged Robot Vision-Language-Action Model for Navigation
An-Chieh Cheng*, Yandong Ji*, Zhaojing Yang*, Zaitian Gongye, Xueyan Zou, Jan Kautz, Erdem Bıyık, Hongxu Yin✝︎, Sifei Liu✝︎, Xiaolong Wang✝︎
RSS, 2025 A two-level framework that combines VLAs with locomotion skills for navigation. The VLA is adapted from a VLM and learns from human touring videos.
This paper proposes to solve the problem of Vision-and-Language Navigation with legged robots, which not only provides a flexible way for humans to give commands but also allows the robot to navigate through more challenging and cluttered scenes. However, it is non-trivial to translate human language instructions all the way to low-level leg joint actions. We propose NaVILA, a 2-level framework that unifies a Vision-Language-Action model (VLA) with locomotion skills. Instead of directly predicting low-level actions from the VLA, NaVILA first generates mid-level actions with spatial information in the form of language (e.g., "moving forward 75cm"), which serve as input to a visual locomotion RL policy for execution. NaVILA substantially improves on previous approaches on existing benchmarks. The same advantages are demonstrated in our newly developed benchmarks with IsaacLab, featuring more realistic scenes, low-level controls, and real-world robot experiments.
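To make the interface between the two levels concrete, the sketch below parses a mid-level language action into a structured command that a low-level policy could consume. Only the phrase "moving forward 75cm" comes from the abstract; the turn and stop phrasings, the `MidLevelCommand` type, and `parse_action` are hypothetical names chosen for illustration.

```python
import math
import re
from dataclasses import dataclass

@dataclass
class MidLevelCommand:
    kind: str     # "forward", "turn_left", "turn_right", or "stop"
    value: float  # metres for "forward", radians for turns, 0.0 for "stop"

# "moving forward 75cm" is the example given in the abstract;
# the turn and stop phrasings are assumptions for this sketch.
_FORWARD = re.compile(r"mov(?:e|ing) forward (\d+(?:\.\d+)?)\s*(cm|m)\b")
_TURN = re.compile(r"turn(?:ing)? (left|right) (\d+(?:\.\d+)?)\s*degrees?")

def parse_action(text):
    """Convert a language action into a command for the locomotion policy."""
    text = text.lower().strip()
    match = _FORWARD.search(text)
    if match:
        distance = float(match.group(1))
        if match.group(2) == "cm":
            distance /= 100.0  # normalise to metres
        return MidLevelCommand("forward", distance)
    match = _TURN.search(text)
    if match:
        angle = math.radians(float(match.group(2)))
        return MidLevelCommand("turn_" + match.group(1), angle)
    if "stop" in text:
        return MidLevelCommand("stop", 0.0)
    raise ValueError(f"unrecognised action: {text!r}")

command = parse_action("moving forward 75cm")
```

Keeping the hand-off in plain language means the VLA does not need to know the robot's joint layout; only the parser and the locomotion policy are robot-specific.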
NVILA: Efficient Frontier Visual Language Models
Zhijian Liu*, Ligeng Zhu*, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin✝︎, Song Han✝︎, Yao Lu✝︎
CVPR, 2025 Efficient frontier VLM models with efficient training and inference.
Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 4.5X, fine-tuning memory usage by 3.4X, pre-filling latency by 1.6-2.2X, and decoding latency by 1.2-2.8X. We will soon make our code and models available to facilitate reproducibility.
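The "compress" half of scale-then-compress can be sketched as merging neighbouring visual tokens. The abstract does not name the compression operator, so the block averaging below, the 2x2 factor, and the function name `compress_tokens` are assumptions for illustration only.

```python
def compress_tokens(grid, factor=2):
    """Merge each factor x factor block of tokens into one token by averaging.

    `grid` is a rows x cols nested list of equal-length token vectors.
    Averaging is an assumed stand-in; the source does not specify the operator.
    """
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid size must be divisible by factor")
    dim = len(grid[0][0])
    compressed = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [grid[r + dr][c + dc]
                     for dr in range(factor) for dc in range(factor)]
            out_row.append([sum(tok[d] for tok in block) / len(block)
                            for d in range(dim)])
        compressed.append(out_row)
    return compressed

# A 4x4 grid of one-dimensional tokens, as if produced at a scaled-up resolution.
grid = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
compressed = compress_tokens(grid, factor=2)
tokens_before = sum(len(row) for row in grid)
tokens_after = sum(len(row) for row in compressed)
```

With a factor of 2 the token count drops by 4x, which is the kind of saving that makes higher input resolutions affordable for the language model that consumes the tokens.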
Vision Language Models (VLMs) have demonstrated remarkable performance in 2D vision and language tasks. However, their ability to reason about spatial arrangements remains limited. In this work, we introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities. SpatialRGPT advances VLMs' spatial understanding through two key innovations: (1) a data curation pipeline that enables effective learning of regional representation from 3D scene graphs, and (2) a flexible plugin module for integrating depth information into the visual encoder of existing VLMs. During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances. Additionally, we propose SpatialRGPT-Bench, a benchmark with ground-truth 3D annotations encompassing indoor, outdoor, and simulated environments, for evaluating 3D spatial cognition in VLMs. Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts. The model also exhibits strong generalization capabilities, effectively reasoning about complex spatial relations and functioning as a region-aware dense reward annotator for robotic tasks.
Textures are a vital aspect of creating visually appealing and realistic 3D models. In this paper, we study the problem of generating high-fidelity textures given the shapes of 3D assets, which has been less explored than generic 3D shape modeling. Our goal is to facilitate a controllable texture generation process, such that one texture code can correspond to a particular appearance style independent of any input shapes from a category. We introduce Texture UV Radiance Fields (TUVF) that generate textures in a learnable UV sphere space rather than directly on the 3D shape. This allows the texture to be disentangled from the underlying shape and transferable to other shapes that share the same UV space, i.e., from the same category. We integrate the UV sphere space with the radiance field, which provides a more efficient and accurate representation of textures than traditional texture maps. We perform our experiments on real-world object datasets where we achieve not only realistic synthesis, but also substantial improvements over state-of-the-art methods on texture controlling and editing.
Autoregressive 3D Shape Generation via Canonical Mapping
An-Chieh Cheng*, Xueting Li*, Sifei Liu, Min Sun, Ming-Hsuan Yang
ECCV, 2022 We decompose the point cloud into meaningful shape sequences, then we encode these sequences through a transformer for generation.
With the capacity of modeling long-range dependencies in sequential data, transformers have shown remarkable performance in a variety of generative tasks such as image, audio, and text generation. Yet, taming them to generate less structured and voluminous data formats such as high-resolution point clouds has seldom been explored due to ambiguous sequentialization processes and an infeasible computation burden. In this paper, we aim to further exploit the power of transformers and employ them for the task of 3D point cloud generation. The key idea is to decompose point clouds of one category into semantically aligned sequences of shape compositions, via a learned canonical space. These shape compositions can then be quantized and used to learn a context-rich composition codebook for point cloud generation. Experimental results on point cloud reconstruction and unconditional generation show that our model performs favorably against state-of-the-art approaches. Furthermore, our model can be easily extended to multi-modal shape completion as an application for conditional shape generation.
We propose a canonical point autoencoder (CPAE) that predicts dense correspondences between 3D shapes of the same category. The autoencoder performs two key functions: (a) encoding an arbitrarily ordered point cloud to a canonical primitive, e.g., a sphere, and (b) decoding the primitive back to the original input instance shape. Placed at the bottleneck, this primitive plays a key role: it maps all unordered point clouds onto the canonical surface so that they can be reconstructed in an ordered fashion. Once trained, points from different shape instances that are mapped to the same locations on the primitive surface are treated as a corresponding pair. Our method does not require any form of annotation or self-supervised part segmentation network and can handle unaligned input point clouds. Experimental results on 3D semantic keypoint transfer and part segmentation transfer show that our model performs favorably against state-of-the-art correspondence learning methods.
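The correspondence rule described above (points that land on the same location of the canonical primitive form a pair) can be sketched as a lookup in canonical space. The encoder itself is not shown; the canonical coordinates below are made-up inputs, and using nearest-neighbour search as a practical stand-in for "same location" is an assumption, as is the function name.

```python
def match_by_canonical_position(canon_a, canon_b):
    """For each point of shape A, return the index of the point of shape B
    whose canonical (primitive-surface) coordinate is closest.

    Nearest-neighbour matching is an assumed approximation of "mapped to
    the same location"; the abstract does not describe the lookup.
    """
    def squared_distance(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    matches = []
    for i, p in enumerate(canon_a):
        j = min(range(len(canon_b)),
                key=lambda k: squared_distance(p, canon_b[k]))
        matches.append((i, j))
    return matches

# Hypothetical canonical coordinates on a unit sphere for two shape instances.
canon_a = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
canon_b = [(0.0, 0.1, 0.99), (0.99, 0.1, 0.0), (0.1, 0.99, 0.0)]
matches = match_by_canonical_position(canon_a, canon_b)
```

A label attached to point `i` of shape A, such as a semantic keypoint or part id, could then be transferred to point `j` of shape B, which is how the keypoint and part segmentation transfer tasks mentioned above would use such pairs.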