Learning Universal Policies via Text-Guided Video Generation
Link to paper

The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract

- A goal of artificial intelligence is to construct an agent that can solve a variety of tasks.
- Recent progress in text-guided image synthesis has yielded models with the ability to generate complex novel images.
- The paper investigates whether such tools can be used to construct more general-purpose agents.
- The sequential decision-making problem is cast as a text-conditioned video generation problem.
- A text-encoded specification of the desired goal is used to synthesize a set of future frames.
- Control actions are extracted from the generated video.
- Leveraging text as the underlying goal specification enables combinatorial generalization to novel goals.
- The policy-as-video formulation can represent environments with different state and action spaces in a unified space of images.
- Leveraging pretrained language embeddings and widely available videos enables knowledge transfer.

Paper Content

Introduction

- Building models that solve a diverse set of tasks is a dominant paradigm in vision and language.
- Large pretrained models have demonstrated zero-shot learning of new language tasks.
- Models have also shown zero-shot classification and object recognition capabilities.
- Training agents faces the challenge of environmental diversity.
- Universal tokens are used to encode different environments.
- Video is used as a universal interface for conveying action and observation behavior.
- Text is used as a universal interface for expressing task descriptions.
- The model enables combinatorial generalization, multi-task learning, action planning, and internet-scale knowledge transfer.

Problem formulation

- Introduces a new abstraction, the Unified Predictive Decision Process (UPDP), as an alternative to the Markov Decision Process (MDP).
- Presents an instantiation of UPDP with diffusion models.

Markov decision process

- The Markov Decision Process (MDP) is a broad abstraction used to formulate many sequential decision-making problems.
- Many RL algorithms have been derived from MDPs, with empirical successes.
- Existing algorithms are typically unable to generalize combinatorially across different environments, for three reasons:
  - There is no universal state interface across different control environments.
  - An MDP explicitly requires a real-valued reward function.
  - The dynamics model in an MDP depends on both the environment and the agent.
- The Unified Predictive Decision Process (UPDP) exploits images as a universal interface across environments, text as a task specifier to avoid reward design, and a task-agnostic planning module.
- UPDP bypasses reward design, state extraction, and explicit planning, and allows for non-Markovian modeling of an image-based state space.
- UPDP isolates the video-based planner from deferred action selection.
- UPDP leverages existing large text-video models that have been pretrained on massive, web-scale datasets.
- UPDP uses a continuous-time diffusion model to define a forward process and a generative process that reverses the forward process.

Decision making with videos

- The proposed approach, UniPi, is an instantiation of the diffusion UPDP.
- UniPi incorporates two main components: a diffusion model and a task-specific action generator.

Universal video-based planner

- Text-to-video models have been successful.
- The aim is to construct a video diffusion module that acts as a trajectory planner.
- This is more challenging than typical text-to-video generation.
- UniPi uses a constrained video synthesis model.
- It uses tiling to ensure environment consistency.
- It uses hierarchical planning.
- It uses flexible behavioral modulation.

Task specific action adaptation

- Train a small model to estimate actions given input images.
- Generate an action sequence given x_0 and c by synthesizing H image frames and applying the learned inverse-dynamics model.
- Inferred actions can be executed via closed-loop or open-loop control.
- Open-loop control is used for computational efficiency.

Experimental evaluation

Combinatorial policy synthesis

- Measured the ability of UniPi to generalize to different language tasks.
- Used combinatorial robot planning tasks.
- The robot must manipulate blocks to satisfy language instructions.
- Split the language instructions into two sets, one seen during training and one seen only during testing.
- Compared UniPi to three separate representative approaches.
- Measured final task completion accuracy.
- UniPi generalizes well to both seen and novel combinations of language prompts.
- Ablated UniPi on seen language instructions and in-relation-to tasks.
- All components of UniPi are crucial for good performance.
- Assessed the ability of UniPi to adapt at test time to new constraints.

Multi-environment transfer

- Evaluated the ability of UniPi to learn across different tasks and generalize to unseen environments.
- Used language guided manipulation tasks from Shridhar et al....
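
The action-extraction step summarized under "Task specific action adaptation" can be illustrated with a small sketch. It is not the paper's implementation: `toy_video_planner` and `toy_inverse_dynamics` are hypothetical stand-ins (a linear interpolation and a frame difference) for the video diffusion model and the learned inverse-dynamics model, frames are two-element lists rather than images, and the two task strings are made up. The sketch only shows how synthesizing H frames from x_0 and c and then labeling consecutive frame pairs yields an action sequence, and how open-loop and closed-loop execution differ.

```python
from typing import Callable, List

Frame = List[float]   # toy stand-in for an image
Action = List[float]  # toy stand-in for a control command


def toy_video_planner(x0: Frame, c: str, horizon: int) -> List[Frame]:
    """Hypothetical planner: `horizon` frames moving linearly from x0 to a goal.

    The goal frame is looked up from the text c. A real planner would be a
    text-conditioned video diffusion model; this is an assumption for the demo.
    """
    goals = {"move right": [1.0, 0.0], "move up": [0.0, 1.0]}  # assumed tasks
    goal = goals[c]
    return [
        [a + (g - a) * (t / horizon) for a, g in zip(x0, goal)]
        for t in range(1, horizon + 1)
    ]


def toy_inverse_dynamics(frame: Frame, next_frame: Frame) -> Action:
    """Hypothetical inverse-dynamics model: the action is the frame difference."""
    return [n - f for f, n in zip(frame, next_frame)]


def extract_actions(
    x0: Frame,
    c: str,
    horizon: int,
    planner: Callable[[Frame, str, int], List[Frame]] = toy_video_planner,
    inverse_dynamics: Callable[[Frame, Frame], Action] = toy_inverse_dynamics,
) -> List[Action]:
    """Synthesize H frames from (x0, c), then label each consecutive frame pair."""
    frames = [list(x0)] + planner(x0, c, horizon)
    return [inverse_dynamics(a, b) for a, b in zip(frames, frames[1:])]


def step(state: Frame, action: Action) -> Frame:
    """Toy deterministic environment: the action is added to the state."""
    return [s + a for s, a in zip(state, action)]


def run_open_loop(x0: Frame, c: str, horizon: int) -> Frame:
    """Plan once, then execute every inferred action without replanning."""
    state = list(x0)
    for action in extract_actions(x0, c, horizon):
        state = step(state, action)
    return state


def run_closed_loop(x0: Frame, c: str, horizon: int) -> Frame:
    """Replan from the current observation before every single action."""
    state = list(x0)
    for remaining in range(horizon, 0, -1):
        action = extract_actions(state, c, remaining)[0]
        state = step(state, action)
    return state
```

In this toy setting, open-loop execution calls the planner once while closed-loop execution calls it once per step, which is the computational trade-off the summary refers to when it says open-loop control is used for efficiency.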