
I lead machine learning infrastructure, data, and delivery efforts for Project Starline at Google.

Before that, I founded the simulation team for the Everyday Robots project at X (formerly Google [x]). I built and managed geographically distributed teams of software engineers, researchers, and technical artists. I led the team's collaboration with Google Brain and DeepMind on a dozen research projects, including Sim2Real and PaLM-SayCan.


I received my Ph.D. in Computer Science from the Georgia Institute of Technology in 2015, advised by Dr. C. Karen Liu. My thesis focused on designing algorithms for synthesizing human motion in object manipulation. I was a member of the Computer Graphics Lab at Georgia Tech.


I received my B.E. from Tsinghua University in 2010.


The success of deep reinforcement learning (RL) and imitation learning (IL) in vision-based robotic manipulation typically hinges on costly large-scale data collection. We introduce RetinaGAN, a generative adversarial network (GAN) approach that adapts simulated images to realistic ones while preserving object-detection consistency. We show that our method bridges the visual gap for three real-world robot tasks: grasping, pushing, and door opening.

General contact-rich manipulation problems are long-standing challenges in robotics due to the difficulty of understanding complicated contact physics. We propose Contact-aware Online COntext Inference (COCOI), a deep RL method that encodes a context embedding of dynamics properties online using contact-rich interactions.

We introduce Meta Strategy Optimization, a meta-learning algorithm for training policies with latent variable inputs that can quickly adapt to new scenarios with a handful of trials in the target environment. We evaluate our method on a real quadruped robot and demonstrate successful adaptation in various scenarios, including sim-to-real transfer.

Learning a complex vision-based task requires an impractical number of demonstrations. We propose a method that learns to learn from both demonstrations and trial-and-error experience with sparse reward feedback.

We propose a self-supervised approach for learning representations of objects from monocular videos and demonstrate that it is particularly useful in situated settings such as robotics.

Training a deep network policy for robot manipulation is notoriously costly and time-consuming, as it depends on collecting a significant amount of real-world data. We propose a method that learns to perform table-top instance grasping of a wide variety of objects without any real-world grasping data, using a learned 3D point cloud of the object as input.
