Ego-Vision World Model for Humanoid Contact Planning
UC Berkeley, UM Ann Arbor, CUHK
1 University of California, Berkeley
2 University of Michigan, Ann Arbor
3 Chinese University of Hong Kong
Video
Abstract
Enabling humanoid robots to exploit physical contact, rather than simply avoid collisions, is crucial for autonomy in unstructured environments. Traditional optimization-based planners struggle with contact complexity, while on-policy reinforcement learning (RL) is sample-inefficient and has limited multi-task ability. We propose a framework combining a learned world model with sampling-based Model Predictive Control (MPC), trained on a demonstration-free offline dataset to predict future outcomes in a compressed latent space. To address sparse contact rewards and sensor noise, the MPC uses a learned surrogate value function for dense, robust planning. Our single, scalable model supports contact-aware tasks, including wall support after perturbation, blocking incoming objects, and traversing height-limited arches, with improved data efficiency and multi-task capability over on-policy RL. Deployed on a physical humanoid, our system achieves robust, real-time contact planning from proprioception and ego-centric depth images.
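The planning loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear latent dynamics, the quadratic surrogate value, the random-shooting sampler, and all names and hyperparameters (`latent_step`, `surrogate_value`, `plan_action`, `horizon`, `num_samples`) are hypothetical stand-ins for the learned networks and the actual sampling scheme.

```python
import numpy as np

LATENT_DIM, ACTION_DIM = 4, 2

# Hypothetical stand-ins for the learned world model (assumption, not from the source).
_init_rng = np.random.default_rng(0)
A = 0.9 * np.eye(LATENT_DIM)                                   # toy latent transition
B = _init_rng.normal(scale=0.1, size=(LATENT_DIM, ACTION_DIM))  # toy action effect
GOAL = np.ones(LATENT_DIM)                                      # toy "desirable" latent


def latent_step(z, a):
    """Toy latent dynamics z' = A z + B a, batched over candidate samples."""
    return z @ A.T + a @ B.T


def surrogate_value(z):
    """Toy dense value: negative squared distance to GOAL, one score per sample."""
    return -np.sum((z - GOAL) ** 2, axis=-1)


def plan_action(z0, horizon=5, num_samples=256, seed=0):
    """Random-shooting MPC: sample action sequences, roll them out in latent
    space, score them with the surrogate value, and return the first action
    of the best-scoring sequence together with its score."""
    rng = np.random.default_rng(seed)
    actions = rng.uniform(-1.0, 1.0, size=(num_samples, horizon, ACTION_DIM))
    z = np.repeat(z0[None, :], num_samples, axis=0)
    scores = np.zeros(num_samples)
    for t in range(horizon):
        z = latent_step(z, actions[:, t])   # predict next latent for every sample
        scores += surrogate_value(z)        # accumulate dense value along the rollout
    best = int(np.argmax(scores))
    return actions[best, 0], float(scores[best])


first_action, best_score = plan_action(np.zeros(LATENT_DIM))
```

In a receding-horizon setup, only `first_action` would be executed before replanning from the next encoded observation.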
Highlights
Our world model and sampling-based MPC enable real-time visual contact planning for diverse object interactions in real-world scenarios, using only an ego-centric depth camera and proprioception.
Methods
Multi-Task
Multi-task performance and latent space visualization. (a) A joint model matches single-task performance. (b-c) t-SNE shows clear task separation: latent h_t captures evolving dynamics, while latent z_t encodes compact observations.
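To make the roles of the two latents concrete, here is a minimal sketch of one way such a pair could be updated over time. It assumes a recurrent state-space layout in which h_t is a recurrent state and z_t is an encoding of the current observation conditioned on h_t; that layout, the random weights, the dimensions, and the function names (`recurrent_update`, `encode_observation`, `rollout`) are all illustrative assumptions, not details taken from the source.

```python
import numpy as np

H_DIM, Z_DIM, A_DIM, O_DIM = 8, 4, 2, 16

# Random weights stand in for trained parameters (assumption for illustration).
_rng = np.random.default_rng(0)
W_h = _rng.normal(scale=0.1, size=(H_DIM, H_DIM + Z_DIM + A_DIM))
W_z = _rng.normal(scale=0.1, size=(Z_DIM, H_DIM + O_DIM))


def recurrent_update(h_prev, z_prev, a_prev):
    """h_t: carries history forward, so it can track evolving dynamics."""
    return np.tanh(W_h @ np.concatenate([h_prev, z_prev, a_prev]))


def encode_observation(h_t, o_t):
    """z_t: compact code for the current observation, conditioned on h_t."""
    return np.tanh(W_z @ np.concatenate([h_t, o_t]))


def rollout(observations, actions):
    """Run the two updates over a sequence and collect both latents."""
    h, z = np.zeros(H_DIM), np.zeros(Z_DIM)
    hs, zs = [], []
    for o_t, a_prev in zip(observations, actions):
        h = recurrent_update(h, z, a_prev)
        z = encode_observation(h, o_t)
        hs.append(h)
        zs.append(z)
    return np.stack(hs), np.stack(zs)


hs, zs = rollout(np.ones((3, O_DIM)), np.zeros((3, A_DIM)))
```

Stacked latents like `hs` and `zs`, collected across tasks, are the kind of arrays one would pass to a t-SNE projection to inspect task separation.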

Acknowledgments
We would like to thank Jiaze Cai and Yen-Jen Wang for their help in experiments. We are also grateful to Bike Zhang, Fangchen Liu, Chaoyi Pan, Junfeng Long, and Yiyang Shao for their valuable discussions.
This project website is built with Next.js, adapted from the AIRIO website, and incorporates trajectory visualization methods inspired by DIAL-MPC.