This&That: Language-Gesture Controlled Video Generation for Robot Planning



Abstract

We propose a robot learning method for communicating, planning, and executing a wide range of tasks, dubbed This&That. We achieve robot planning for general tasks by leveraging the power of video generative models trained on internet-scale data containing rich physical and semantic context. In this work, we tackle three fundamental challenges in video-based planning: 1) unambiguous task communication with simple human instructions, 2) controllable video generation that respects user intents, and 3) translating visual planning into robot actions. We propose language-gesture conditioning to generate videos, which is both simpler and clearer than existing language-only methods, especially in complex and uncertain environments. We then suggest a behavioral cloning design that seamlessly incorporates the video plans. This&That demonstrates state-of-the-art effectiveness in addressing the above three challenges, and justifies the use of video generation as an intermediate representation for generalizable task planning and execution.
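As a rough illustration of what a language-gesture condition might look like in practice, the sketch below pairs a deictic prompt ("Put this inside that") with two pointed-at pixel locations on the first frame and renders them as extra input channels. Everything here is an assumption made for illustration: the `LanguageGestureCondition` container, the two-point "this"/"that" gesture format, the Gaussian heatmap encoding, and the channel-wise concatenation are hypothetical and are not taken from the This&That implementation.

```python
from dataclasses import dataclass
from typing import Tuple

import numpy as np


@dataclass
class LanguageGestureCondition:
    """Hypothetical container for one language-gesture instruction."""
    first_frame: np.ndarray      # (H, W, 3) uint8 image of the initial scene
    prompt: str                  # deictic instruction, e.g. "Put this inside that"
    this_xy: Tuple[int, int]     # pixel (x, y) pointed at for "this"
    that_xy: Tuple[int, int]     # pixel (x, y) pointed at for "that"


def gesture_map(cond: LanguageGestureCondition, sigma: float = 4.0) -> np.ndarray:
    """Render the two gesture points as a 2-channel Gaussian heatmap.

    The heatmap encoding is an assumption; other encodings (e.g. sparse dots)
    would serve the same illustrative purpose.
    """
    h, w = cond.first_frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]  # per-pixel row and column indices
    channels = []
    for x, y in (cond.this_xy, cond.that_xy):
        d2 = (xs - x) ** 2 + (ys - y) ** 2
        channels.append(np.exp(-d2 / (2.0 * sigma ** 2)))
    return np.stack(channels, axis=-1).astype(np.float32)


def build_model_input(cond: LanguageGestureCondition) -> np.ndarray:
    """Concatenate the normalized RGB frame with the gesture heatmap (H, W, 5)."""
    rgb = cond.first_frame.astype(np.float32) / 255.0
    return np.concatenate([rgb, gesture_map(cond)], axis=-1)


# Usage: a blank 64x96 frame with "this" at (10, 20) and "that" at (80, 40).
frame = np.zeros((64, 96, 3), dtype=np.uint8)
cond = LanguageGestureCondition(frame, "Put this inside that", (10, 20), (80, 40))
x = build_model_input(cond)
print(x.shape)
```

The point of the sketch is that the prompt can stay short and generic because the gesture channels carry the object and target identity, which a language-only instruction would otherwise have to spell out.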


This&That Video Generation Demo

First frame | Gesture | Our Video Generation
[image] | [image] | Put this inside that
[image] | [image] | Close this
[image] | [image] | Put this inside that
[image] | [image] | Put this near there

Comparison vs. Previous Language-Conditioned Method

Condition | AVDC (Language-Only) | Our Video Generation
[image] | Put carrot in pot or pan | Put this to there
[image] | Put the yellow cube on top of the blue cube | Put this to there
[image] | Close the drawer | Close this to there
[image] | Fold the cloth from the bottom to top | Fold this to there
[image] | Put the ball to the cup | Put this to there

Simulation Rollout Comparison

Ground Truth | Language-Only | Language-Gesture (Ours)
[video] | Stack right green cube on top of left green cube | Stack this to there
[video] | Move cyan cylinder to the right of left gray cube | Move this to there
[video] | Stack rightmost red cube on top of second leftmost red cube | Stack this to there
[video] | Move leftmost cyan cylinder behind second rightmost cyan cylinder | Move this to there

Limitation of Gesture-Only Conditioning

Condition | Gesture-Only | Language-Gesture (Ours)
[image] | (no language prompt) | Fold this to there

Limitation of Language-Only Conditioning

Condition | Language-Only | Language-Gesture (Ours)
[image] | Take the blue rectangular box and put in the top left of the table | Take this to there