arxiv-summary: AI-summarized AI papers

EnvPool: A Highly Parallel Reinforcement Learning Environment Execution Engine

Abstract: Significant progress has been made in developing reinforcement learning training systems. Parallel environment execution is often the slowest part of the system but receives little attention. EnvPool improves RL environment simulation speed across different hardware setups. EnvPool is compatible with existing RL training libraries. EnvPool allows researchers to iterate on their ideas quickly....

Global Context Vision Transformers

Abstract:
- Proposed a novel architecture to enhance parameter and compute utilization for computer vision tasks
- Model uses global and local self-attention modules to model long- and short-range spatial interactions
- Addresses lack of inductive bias and improves modeling of inter-channel dependencies
- Achieves new state-of-the-art performance across image classification, object detection and semantic segmentation tasks

Paper Content

Introduction:
- Transformers have achieved SOTA performance in NLP benchmarks....

MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge

Abstract:
- Autonomous agents have made progress in specialist domains
- Humans learn and adapt in the open world
- Three ingredients for building generalist agents: environment, knowledge base, and agent architecture
- MineDojo is a new framework built on Minecraft with thousands of tasks and an internet-scale knowledge base
- Novel agent learning algorithm uses pre-trained video-language models as reward function
- Agent is able to solve open-ended tasks without manually designed reward

Paper Content

Introduction:
- Developing autonomous embodied agents to attain human-level performance is a long-standing goal for AI research
- Progress has been made in games and robotics
- Agents are typically trained tabula rasa in isolated worlds with limited complexity and diversity
- Humans inhabit an infinitely rich reality and can leverage large amounts of prior knowledge
- MineDojo is a framework to develop open-ended, generally capable agents
- MineDojo features a benchmarking suite with thousands of diverse open-ended tasks
- MineDojo also provides an internet-scale, multimodal knowledge base
- MineDojo’s simulator provides unified observation and action spaces
- MineDojo’s tasks are divided into programmatic and creative tasks
- Programmatic tasks can be automatically assessed
- Creative tasks do not have well-defined success criteria
- MineDojo uses a learned model to evaluate creative tasks

Task suite I: programmatic tasks
- Formalize each programmatic task as a 5-tuple
- Leverage OpenAI’s GPT-3-davinci API to generate detailed guidance
- Initial conditions of the agent and the world
- Success criterion is a deterministic function
- Optional dense reward function
- 4 categories of programmatic tasks with 1,581 template-generated natural language goals

Task suite II: creative tasks
- Creative tasks are defined as a 3-tuple
- A novel task evaluation metric is designed based on a pre-trained contrastive video-language model
- Human evaluations show high agreement with the learned metric
- 216 creative tasks are manually authored
- 1,560 creative tasks are generated through two systematic approaches
- Approach 1 mines tasks from YouTube tutorial videos
- Approach 2 uses GPT-3 to generate new task ideas

Internet-scale knowledge base
- Two approaches to train embodied agents: RL with reward functions, or human demonstrations
- Crafting reward functions is challenging for the task suite
- Turn to the open web as an ever-growing source of learning material
- Harvest domain knowledge by web scraping and filtering
- Collect 33 years of YouTube videos, 6K+ Wiki pages, and millions of Reddit comment threads
- Language is a key component of the database
- Take special measures to filter out low-quality and toxic content
- 730K+ narrated Minecraft videos, 2....
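
The programmatic-task 5-tuple in the MineDojo summary above could be represented roughly as follows. This is only an illustrative sketch: reading the five elements as goal, guidance, initial conditions, success criterion, and optional dense reward is an inference from the bullets, and the class name, field names, the `State` type, and the sparse-reward fallback are all hypothetical. None of it is MineDojo's actual API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

# Hypothetical stand-in for whatever state the simulator exposes.
State = Dict[str, Any]


@dataclass(frozen=True)
class ProgrammaticTask:
    """Illustrative 5-tuple; field names are assumptions, not MineDojo's."""
    goal: str                                   # natural language goal
    guidance: str                               # e.g. text generated by a language model
    initial_conditions: State                   # agent and world setup
    success_criterion: Callable[[State], bool]  # deterministic check
    dense_reward: Optional[Callable[[State], float]] = None  # optional shaping

    def reward(self, state: State) -> float:
        # Assumption: without a dense reward, fall back to a sparse 0/1 signal.
        if self.dense_reward is not None:
            return self.dense_reward(state)
        return 1.0 if self.success_criterion(state) else 0.0


# Made-up example task, purely for illustration.
shear_sheep = ProgrammaticTask(
    goal="obtain 2 wool",
    guidance="Find a sheep and use shears on it.",
    initial_conditions={"inventory": {"shears": 1}},
    success_criterion=lambda s: s["inventory"].get("wool", 0) >= 2,
)
print(shear_sheep.reward({"inventory": {"wool": 2}}))  # prints 1.0
```

Because the success criterion is a plain deterministic function of state, such a task can be assessed automatically, which is the property the summary attributes to programmatic tasks.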

VPIT: Real-time Embedded Single Object 3D Tracking Using Voxel Pseudo Images

Abstract: Proposed a novel voxel-based 3D single object tracking (3D SOT) method called Voxel Pseudo Image Tracking (VPIT). VPIT uses voxel pseudo images as an input to a 2D-like Siamese SOT method. VPIT uses Bird’s-eye View (BEV) coordinates, so only object rotation can change in the new coordinate system. VPIT is the fastest 3D SOT method and maintains competitive Success and Precision values....

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Abstract: Transformers are slow and memory-hungry on long sequences. Approximate attention methods have attempted to reduce compute complexity, but often do not achieve wall-clock speedup. FlashAttention is an IO-aware exact attention algorithm that uses tiling to reduce memory reads/writes. FlashAttention trains Transformers faster than existing baselines. FlashAttention enables longer context in Transformers, yielding higher quality models....
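
To make the tiling idea in the FlashAttention summary concrete, here is a minimal pure-Python sketch of exact attention computed one key/value block at a time with a running softmax. It is not the paper's GPU kernel: the function names and `block_size` parameter are illustrative assumptions, the usual 1/sqrt(d) scaling is omitted for brevity, and no actual memory traffic is modelled. It only shows that processing blocks sequentially can reproduce the same result as materializing all scores at once.

```python
import math


def naive_attention(q, k, v):
    """Reference version: compute every score for a query, then softmax."""
    d_v = len(v[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) for kj in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) / total
                    for c in range(d_v)])
    return out


def tiled_attention(q, k, v, block_size=2):
    """Visit keys/values block by block, keeping only running statistics."""
    d_v = len(v[0])
    out = []
    for qi in q:
        running_max = float("-inf")  # largest score seen so far
        running_sum = 0.0            # softmax denominator, relative to running_max
        acc = [0.0] * d_v            # unnormalized weighted sum of values
        for start in range(0, len(k), block_size):
            k_blk = k[start:start + block_size]
            v_blk = v[start:start + block_size]
            scores = [sum(a * b for a, b in zip(qi, kj)) for kj in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new maximum.
            # On the first block this factor is exp(-inf) == 0.0.
            scale = math.exp(running_max - new_max)
            running_sum *= scale
            acc = [a * scale for a in acc]
            for s, vj in zip(scores, v_blk):
                w = math.exp(s - new_max)
                running_sum += w
                for c in range(d_v):
                    acc[c] += w * vj[c]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

The tiled version never holds more than `block_size` scores per query at a time, yet its output matches the reference up to floating-point rounding, which is why the method counts as exact rather than approximate attention.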

GIT: A Generative Image-to-text Transformer for Vision and Language

Abstract:
- Designed and trained a Generative Image-to-text Transformer (GIT) to unify vision-language tasks
- Simplified architecture with one image encoder and one text decoder
- Scaled up pre-training data and model size to boost performance
- Established new state of the art on 12 challenging benchmarks
- Surpassed human performance on TextCaps
- Presented a new scheme of generation-based image classification and scene text recognition

Paper Content

Introduction:
- Pre-training on large-scale image-text pairs
- Masked Language Modeling (MLM) and Image-Text Matching (ITM) tasks used
- Task-specific adaptation needed
- Unified generative models for pre-training
- Multi-modal encoder and text decoder with careful design
- Generative Image-to-text Transformer (GIT) proposed
- GIT achieves new state of the art across numerous challenging benchmarks
- Image encoder is a Swin-like vision transformer
- Text decoder is a transformer network
- Language modeling task used for pre-training
- New generation-based scheme for ImageNet classification proposed

Network architecture:
- Image encoder based on contrastive pre-trained model
- Input is raw image, output is 2D feature map
- Extra linear layer and layernorm layer to project image features into D dimensions
- Text decoder is transformer module with self-attention layer and feed-forward layer
- Text tokenized and embedded into D dimensions
- Image features concatenated with text embeddings as input to transformer module
- Text decoder randomly initialized
- Alternative architecture is cross-attention-based decoder
- Self-attention-based decoder better with large-scale pre-training

Pre-training:
- Model is trained using language modeling (LM) loss
- Alternative choice is MLM, which predicts 15% of input tokens
- LM can predict all tokens, which is more efficient for large-scale pre-training data
- Number of epochs is limited to 2 due to limited computational resources
- Model is architecturally similar to GPT-3

Fine-tuning:
- Applied same LM task to fine-tune GIT for image captioning
- For VQA, question and answer concatenated as new caption during fine-tuning
- Generative approach chosen over the discriminative approach of existing work
- No OCR engine used, model learns to read scene text with pre-training
- Simple architecture change for video domain
- Generation model applied to image classification task

Experiments

Setting:
- 0....

KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation

Abstract: Relative positional embeddings (RPE) have been studied for their ability to model the relative distance between tokens and enable length extrapolation. KERPLE is a framework that uses conditionally positive definite (CPD) kernels to generalize RPEs for length extrapolation. CPD kernels can be transformed into PD kernels by adding a constant offset, which is absorbed in the Softmax normalization during self-attention....
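
The claim in the KERPLE summary that the constant offset is absorbed by the Softmax follows from the shift invariance of softmax. A short derivation is sketched below; the symbols are assumptions introduced here for illustration and do not come from the summary.

```latex
% Assumed notation (not from the summary): a_{mn} is the content logit between
% query position m and key position n, \tilde{k} is a CPD kernel, and c is a
% constant offset chosen so that c + \tilde{k} is positive definite.
\begin{align*}
\operatorname{softmax}_n\bigl(a_{mn} + \tilde{k}(m,n) + c\bigr)
  &= \frac{\exp\bigl(a_{mn} + \tilde{k}(m,n) + c\bigr)}
          {\sum_{j} \exp\bigl(a_{mj} + \tilde{k}(m,j) + c\bigr)} \\
  &= \frac{e^{c}\,\exp\bigl(a_{mn} + \tilde{k}(m,n)\bigr)}
          {e^{c}\sum_{j} \exp\bigl(a_{mj} + \tilde{k}(m,j)\bigr)} \\
  &= \operatorname{softmax}_n\bigl(a_{mn} + \tilde{k}(m,n)\bigr).
\end{align*}
```

Since the factor e^{c} cancels between numerator and denominator, the attention weights are the same whether or not the offset is added, so the offset never needs to be computed explicitly.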

Vectorized and performance-portable Quicksort

Abstract: Recent works have shown that Quicksort implementations using vector CPU instructions can outperform non-vectorized algorithms. The proposed ‘vqsort’ algorithm integrates into the state-of-the-art parallel sorter ‘ips4o’, with a geometric mean speedup of 1.59. It works on seven instruction sets across four platforms, and supports floating-point and 16-128 bit integer keys. It is the fastest sort for non-tuple keys on CPUs, up to 20 times as fast as the sorting algorithms implemented in standard libraries....

Fast Sampling of Diffusion Models with Exponential Integrator

Abstract: Diffusion models (DMs) have been successful in generative modeling tasks. The sampling procedure of DMs is slow and requires hundreds to thousands of time discretization steps. The goal is to develop a fast sampling method for DMs that uses fewer steps while retaining high sample quality. The discretization method is the most crucial factor affecting sample quality....

Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks

Abstract:
- Introduction of Super-NaturalInstructions, a benchmark of 1,616 diverse NLP tasks and their expert-written instructions
- 76 distinct task types, including classification, extraction, infilling, sequence tagging, text rewriting, and text composition
- Tk-Instruct, a transformer model trained to follow a variety of in-context instructions
- Tk-Instruct outperforms existing instruction-following models
- Analysis of generalization as a function of various scaling parameters

Paper Content

Introduction:
- NLP community has made progress in building models for generalization to unseen tasks
- Models like InstructGPT are successful, but the contribution of design choices is opaque
- Need for large-scale public benchmarks of NLP tasks and instructions to facilitate research
- Constructed meta-dataset of 1,616 NLP tasks and instructions
- Model Tk-Instruct outperforms InstructGPT by 9....