Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Large language models are good at complex tasks
- Enabling general inference in the real world is challenging
- Embodied language models incorporate real-world continuous sensor modalities
- Input to the model is multi-modal sentences
- Model is trained end-to-end for multiple embodied tasks
- Evaluations show model can address a variety of embodied reasoning tasks
- Model benefits from diverse joint training
- Largest model is a visual-language generalist with state-of-the-art performance

Paper Content

Introduction
- LLMs demonstrate strong reasoning capabilities across various domains
- Limitation of LLMs is the issue of grounding
- Previous work interfaces LLMs with learned robotic policies and affordance functions
- LLMs are only provided with textual input, which is insufficient for many tasks
- Current state-of-the-art visual-language models cannot directly solve robotic reasoning tasks
- Proposed embodied language models incorporate continuous inputs from sensor modalities
- Embodied language models are trained end-to-end to output sequential decisions
- Evaluated in a variety of settings
- Multi-task training improves performance
- Transfer across tasks leads to high data-efficiency for robotics tasks
- PaLM-E-562B achieves state-of-the-art performance on the OK-VQA benchmark
- PaLM-E-562B exhibits a wide array of capabilities including zero-shot multimodal chain-of-thought reasoning

Related work
- Recent years have seen a growing interest in large vision-language models (VLMs)
- VLMs can be applied to tasks such as visual question answering, captioning, optical character recognition, and object detection
- Different methods exist for integrating images into VLMs
- Prior works focus on combining vision and language inputs in an embodied setting with the goal of direct action prediction
- LLMs contain vast amounts of internalized knowledge about the world
- LLMs have been employed in embodied domains to understand natural language goals and to represent planning
- PaLM-E is trained to generate plans directly without relying on auxiliary models for grounding
- PaLM-E investigates a generalist, multi-embodiment model, across multiple modalities

PaLM-E: an embodied multimodal language model
- PaLM-E injects continuous observations into the language embedding space of a pre-trained language model
- PaLM-E is a decoder-only LLM that generates textual completions autoregressively
- Inputs to PaLM-E consist of text and continuous observations
- Output of PaLM-E is text generated autoregressively
- Decoder-only LLMs predict the probability of a piece of text
- Prefix-decoder-only LLMs can be conditioned on a prefix
- Tokens are embedded into a word token embedding space
- Continuous observations are injected into the LLM by mapping them into the language embedding space
- PaLM-E can be used to generate text or decisions/plans
- Low-level policies translate decisions into low-level actions
- PaLM-E is integrated into a control-loop to sequence and control low-level policies

Input & scene representations for different sensor modalities
- Incorporate different modalities into PaLM-E
- Set up encoders to map modalities into language embedding space
- Investigate state estimation vectors, Vision Transformers (ViTs) and Object Scene Representation Transformer (OSRT)
- Consider object-centric representations to factor observations into tokens
- Investigate ViT token learner architecture
- Label multi-modal tokens in input prompt to enable PaLM-E to reference objects

Training recipes
- PaLM-E is a decoder-only model that uses multi-modal sentences and text tokens to make predictions....
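
The injection step summarized above (continuous observations mapped into the language embedding space and interleaved with word token embeddings to form a multi-modal sentence) can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the vocabulary, the `<obs>` placeholder, the dimensions, the use of a plain linear projection as the encoder, and all function names are hypothetical and chosen only for the example.

```python
import random

EMBED_DIM = 4  # hypothetical; real language models use far larger embeddings

# Hypothetical toy vocabulary. "<obs>" marks a slot where an observation is injected.
VOCAB = {"Q:": 0, "What": 1, "is": 2, "in": 3, "<obs>": 4, "?": 5}

_rng = random.Random(0)
# Word token embedding table: one EMBED_DIM vector per vocabulary entry.
TOKEN_EMBEDDINGS = [
    [_rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in VOCAB
]


def encode_observation(feature_vectors, projection):
    """Project each continuous feature vector into the language embedding space.

    `projection` is an EMBED_DIM x FEATURE_DIM matrix (list of rows). A learned
    encoder would stand here; a fixed linear map is an assumption of this sketch.
    """
    return [
        [sum(w * x for w, x in zip(row, features)) for row in projection]
        for features in feature_vectors
    ]


def build_multimodal_sentence(tokens, observations, projection):
    """Interleave text token embeddings with projected observation vectors.

    Each "<obs>" token is replaced by the (possibly several) embedding vectors
    produced for the next observation; every other token is looked up in the
    word token embedding table.
    """
    remaining = iter(observations)
    sequence = []
    for token in tokens:
        if token == "<obs>":
            sequence.extend(encode_observation(next(remaining), projection))
        else:
            sequence.append(TOKEN_EMBEDDINGS[VOCAB[token]])
    return sequence


# Usage: one observation made of two 3-dimensional feature vectors.
tokens = ["Q:", "What", "is", "in", "<obs>", "?"]
projection = [[0.5, 0.5, 0.5] for _ in range(EMBED_DIM)]
observations = [[[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]]
sentence = build_multimodal_sentence(tokens, observations, projection)
print(len(sentence), "vectors of dimension", len(sentence[0]))
```

The resulting list of vectors is what a decoder-only model would consume in place of a purely textual embedding sequence. In this sketch the observation contributes two vectors, so the six-token prompt becomes a seven-vector sequence.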