arxiv-summary: AI-summarized AI papers

RT-1: Robotics Transformer for Real-World Control at Scale

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.
Abstract: Modern machine learning models can solve specific tasks with small datasets. Generalization capabilities of models are especially important in robotics because collecting data is difficult. The Robotics Transformer model class exhibits promising scalable model properties. The paper studies model classes and their ability to generalize as a function of data size, model size, and data diversity....

Multi-Concept Customization of Text-to-Image Diffusion

Abstract: Generative models produce high-quality images from a large-scale database. The paper proposes Custom Diffusion to augment existing text-to-image models. Optimizing a few parameters in the text-to-image conditioning mechanism is sufficient to represent new concepts. The method can jointly train for multiple concepts or combine multiple fine-tuned models. It generates variations of multiple new concepts and composes them with existing concepts. It outperforms several baselines and concurrent works.
Paper Content:
Introduction: Text-to-image models can generate images of unprecedented quality. Users often want to generate images of personal concepts, such as family, friends, pets, or personal objects and places. Describing these concepts through text is difficult and does not reproduce the personal concept with sufficient fidelity. Model customization is needed to generate these personal concepts. Challenges include model forgetting, overfitting, and concept mixing and omission. Custom Diffusion is a fine-tuning technique designed to overcome these challenges. It is computationally and memory efficient. It can compose multiple new concepts efficiently.
Related work: Generative models aim to synthesize samples from a data distribution. Types of generative models include GANs, VAEs, autoregressive, flow-based, and diffusion models. Models can be conditioned on a class, image, or text prompt. Text-to-image models have demonstrated remarkable generalization ability. The aim here is to adapt such models to become specialists in new concepts. Prior work leverages generative models for image and model editing. Transfer learning is used to produce a whole distribution of images. The aim is to acquire multiple new concepts without catastrophic forgetting.
Method: The method updates a small subset of weights in the cross-attention layers of the model. It uses a regularization set of real images to prevent overfitting.
Single-concept fine-tuning: The Stable Diffusion model is built on the Latent Diffusion Model. Images are encoded into a latent representation by an autoencoder trained with VAE, PatchGAN, and LPIPS objectives. Diffusion models aim to approximate the original data distribution. The model is trained to learn the reverse process of a fixed-length Markov chain. The model is conditioned on the timestep and can be further conditioned on any other modality, e....
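As a rough illustration of "updating a small subset of weights in the cross-attention layers", the sketch below freezes everything except parameters whose name contains a cross-attention tag. The parameter names and sizes are hypothetical and do not come from any real checkpoint, and selecting every cross-attention weight is a simplification of whatever subset the paper actually trains.

```python
# Hypothetical parameter names and sizes for a tiny stand-in model.
# These are illustrative only and are not taken from a real checkpoint.
PARAM_SIZES = {
    "down.0.resnet.conv.weight": 9216,
    "down.0.self_attn.to_q.weight": 4096,
    "down.0.cross_attn.to_q.weight": 4096,
    "down.0.cross_attn.to_k.weight": 3072,
    "down.0.cross_attn.to_v.weight": 3072,
    "up.0.resnet.conv.weight": 9216,
}


def select_trainable(param_sizes, layer_tag="cross_attn"):
    """Return the names of parameters that live in layers tagged `layer_tag`.

    Every other parameter is treated as frozen during fine-tuning.
    """
    return [name for name in param_sizes if layer_tag in name.split(".")]


def trainable_fraction(param_sizes, trainable):
    """Fraction of all weights that would receive gradient updates."""
    total = sum(param_sizes.values())
    return sum(param_sizes[name] for name in trainable) / total


trainable = select_trainable(PARAM_SIZES)
print(trainable)
print(trainable_fraction(PARAM_SIZES, trainable))
```

In a real model the frozen fraction would be far larger than in this toy dictionary; the point is only that the optimizer sees a filtered parameter list.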

ViTPose+: Vision Transformer Foundation Model for Generic Body Pose Estimation

Abstract: ViTPose is a simple baseline model that uses a plain and non-hierarchical vision transformer for body pose estimation. ViTPose can be scaled up from 20M to 1B parameters, providing a new Pareto front for throughput and performance. ViTPose is flexible regarding attention type, input resolution, and pre-training and fine-tuning strategy. ViTPose+ is a novel model that deals with heterogeneous body keypoint categories in different types of body pose estimation tasks....

Robust Speech Recognition via Large-Scale Weak Supervision

Abstract: The paper studies the capabilities of speech processing systems trained to predict audio transcripts. The models generalize well to standard benchmarks and are competitive with prior results without fine-tuning. The models approach the accuracy and robustness of humans.
Paper Content:
Introduction: Progress in speech recognition has been accelerated by unsupervised pre-training techniques. These methods can use large datasets of unlabeled speech. Pre-trained audio encoders learn high-quality representations of speech, but lack an equivalently performant decoder. Fine-tuning can be complex and can lead to models exploiting dataset-specific quirks. Recent efforts have created larger datasets for speech recognition using automated pipelines. Moving beyond gold-standard datasets to larger weakly supervised datasets improves robustness and generalization. This work scales weakly supervised speech recognition to 680,000 hours of labeled audio data. This dataset includes 96 languages and 125,000 hours of X→en translation data. Joint multilingual and multitask training is beneficial. Inference code and models are available online.
Approach, data processing: The approach to data pre-processing is minimalist. Models are trained to predict the raw text of transcripts without standardization. The dataset is constructed from audio paired with transcripts on the internet. Automated filtering methods improve transcript quality. Heuristics detect and remove machine-generated transcripts.
Model: An off-the-shelf architecture is used to avoid confounding the findings. Audio is re-sampled to 16,000 Hz, and an 80-channel log-magnitude Mel spectrogram representation is computed on 25-millisecond windows with a stride of 10 milliseconds.
Multitask format: Speech recognition involves many components, which makes it complex. The goal is a single model that performs the entire speech processing pipeline. A simple format specifies tasks and conditioning information. A sequence-to-sequence Transformer model is used for many different speech processing tasks. Tasks are represented as a sequence of tokens to be predicted by the decoder. Special tokens are used as task specifiers or classification targets.
Training details: A suite of models of various sizes was trained. Training used data parallelism, FP16, dynamic loss scaling, and activation checkpointing. Models were trained with AdamW and gradient norm clipping. The batch size was 256 segments, and models were trained for 2-3 passes over the dataset. No data augmentation or regularization was used. Models were fine-tuned on the subset of transcripts without speaker annotations. Additional Large model trained for 2....
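The framing numbers in the Model section (16,000 Hz audio, 25-millisecond windows, 10-millisecond stride, 80 Mel channels) can be turned into array sizes with a little arithmetic. The sketch below assumes no padding at the clip edges, which is a simplification; real feature extractors often pad, so their frame counts can differ slightly. The function names are illustrative and not taken from the released code.

```python
SAMPLE_RATE_HZ = 16_000  # from the summary
WINDOW_MS = 25           # analysis window length, from the summary
STRIDE_MS = 10           # hop between windows, from the summary
N_MELS = 80              # Mel channels, from the summary


def samples_per(ms, sample_rate=SAMPLE_RATE_HZ):
    """Convert a duration in milliseconds to a whole number of samples."""
    return sample_rate * ms // 1000


def num_frames(num_samples, window=None, hop=None):
    """Count full windows that fit in the signal (assumes no edge padding)."""
    window = samples_per(WINDOW_MS) if window is None else window
    hop = samples_per(STRIDE_MS) if hop is None else hop
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


def spectrogram_shape(duration_s):
    """Shape (channels, frames) of the log-Mel features for a clip."""
    n = int(duration_s * SAMPLE_RATE_HZ)
    return (N_MELS, num_frames(n))


print(samples_per(WINDOW_MS), samples_per(STRIDE_MS))  # 400 160
print(spectrogram_shape(1.0))                          # (80, 98)
```

So each window covers 400 samples, consecutive windows start 160 samples apart, and one second of audio yields roughly 100 feature frames.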

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

Abstract: Foundation models have shown good performance on computer vision tasks. Existing models focus on image-level pretraining and adaptation, which are limited for complex video-level understanding tasks. InternVideo explores masked video modeling and video-language contrastive learning as pretraining objectives. InternVideo achieves state-of-the-art performance on 39 video datasets from various tasks. InternVideo obtains 91....

Box2Mask: Box-supervised Instance Segmentation via Level-set Evolution

Abstract: Box-supervised instance segmentation uses simple box annotations instead of pixel-wise mask labels. Box2Mask is a novel single-shot instance segmentation approach that integrates level-set evolution into deep neural network learning. Box2Mask consists of two types of single-stage frameworks, CNN-based and transformer-based. Box2Mask achieves outstanding performance on five challenging testbeds.
Paper Content:
Introduction: Instance segmentation aims to obtain pixel-wise mask labels of objects. It is used in applications such as autonomous driving, robotic manipulation, image editing, and cell segmentation. Recent advances in CNN and transformer architectures have improved instance segmentation. Existing methods require pixel-wise instance mask annotations, which are expensive and tedious to collect. Box-supervised instance segmentation uses simple and label-efficient box annotations. Methods have been developed to enable pixel-wise supervision with box annotation. Recent approaches use pairwise affinity modeling to enable end-to-end training. Box2Mask is proposed to use the classical level-set model to model pixel affinities. Box2Mask uses low-level and high-level features to robustly evolve level-set curves. Box2Mask consists of level-set evolution, an instance-aware decoder, and box-level matching assignment. Box2Mask improves previous state-of-the-art 38....

Scaling Language-Image Pre-training via Masking

Abstract: FLIP is a method for training CLIP that is more efficient. FLIP masks and removes image patches during training. FLIP improves accuracy and speed compared to the no-masking baseline. FLIP outperforms CLIP counterparts on downstream tasks. FLIP encourages research on scaling vision-language learning.
Paper Content:
Introduction: Language-supervised visual pre-training is a powerful methodology for learning representations. Pre-trained models have strong zero-shot transferability. Pre-trained encoders can improve multimodal and unimodal visual tasks. Natural language provides richer forms of supervision. Large-scale training is essential for language-supervised models. FLIP is a method for efficient CLIP training. FLIP trains faster and is more accurate than its CLIP counterpart. FLIP reduces computation by 2-4x and allows larger batches. FLIP outperforms its CLIP counterparts on downstream tasks. FLIP can improve accuracy with data scaling at no extra training cost.
Related work: Denoising Autoencoders were proposed as an unsupervised representation learning method. BERT is an application of Denoising Autoencoders. The Masked Autoencoder (MAE) reduces training time and memory. MAE has been applied to videos, point clouds, graphs, audio, visual control, vision-language, and other modalities. This work is related to MAE and its vision-language extensions. The focus is on the scaling aspect enabled by sparse computation. CLIP performs contrastive learning on pairs of image and text samples. The method randomly masks out image patches with a high masking ratio. Language-supervised learning was popularized by CLIP and related works. Generative learning methods have been explored, optionally combined with contrastive losses.
Method: Masking reduces computation in CLIP training. Masking trades off encoding density against the number of samples compared. Image masking randomly masks out a portion of patches. Text masking optionally masks out a portion of tokens. A contrastive loss is used to train the encoders. No reconstruction loss is used. An unmasked tuning strategy closes the distribution gap.
Implementation: The implementation follows CLIP and OpenCLIP with modifications. The image encoder follows the ViT paper, with no extra LayerNorm. Global average pooling is applied at the end of the image encoder. The text encoder is a non-autoregressive Transformer. A WordPiece tokenizer is used, with sequences padded or cut to a length of 32. Outputs of the image and text encoders are projected to the same-dimensional embedding space. Cosine similarities of the embeddings are used as input to the InfoNCE loss. Prompt engineering is used for zero-shot transfer. The implementation is based on JAX with the t5x library. Training runs on TPU v3 infrastructure.
Experiments: The FLIP design is ablated. The image encoder is ViT-L/16. The text encoder has a smaller size. Models are trained on LAION-400M. Zero-shot accuracy is evaluated on ImageNet-1K. Image masking ratios studied. Batch size studied. Text masking studied. Inference unmasking studied. Unmasked tuning studied. Reconstruction studied. Accuracy vs....
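The two mechanical pieces described above, random patch masking and an InfoNCE loss over cosine similarities, can be sketched in plain Python. This is a toy under stated assumptions, not the authors' JAX implementation: the 196-patch count (a 14x14 grid), the 50% mask ratio, the fixed temperature of 0.07, and the symmetric image-to-text plus text-to-image form are illustrative choices that the summary does not specify.

```python
import math
import random


def random_keep_indices(num_patches, mask_ratio, rng):
    """Pick which patch indices survive masking; the rest are dropped."""
    num_keep = int(round(num_patches * (1 - mask_ratio)))
    return sorted(rng.sample(range(num_patches), num_keep))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def _mean_cross_entropy(logits):
    """Mean cross-entropy where row k's correct class is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[k]
    return total / len(logits)


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    logits = [[cosine(i, t) / temperature for t in text_embs]
              for i in image_embs]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_mean_cross_entropy(logits) + _mean_cross_entropy(transposed))


rng = random.Random(0)
kept = random_keep_indices(num_patches=196, mask_ratio=0.5, rng=rng)
print(len(kept))  # 98 patches are passed to the image encoder

aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True: matched pairs give the lower loss
```

Because only the kept patches are encoded, the image encoder processes a shorter sequence per sample, which is where the compute saving described in the summary comes from.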

Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation

Abstract: A diffusion model is used to predict a vector field of gradients. Chain rule is applied to the learned gradients. The score of the diffusion model is back-propagated through a differentiable renderer. 2D scores are aggregated into a 3D score. A pretrained 2D model is repurposed for 3D data generation. A novel estimation mechanism is proposed to address a distribution mismatch....
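The chain-rule idea can be illustrated with a toy in which everything is replaced by something tiny and analytic. The "renderer" is a linear map per camera view, so its Jacobian is just the view matrix, and the "2D score" is the gradient of an isotropic Gaussian log-density rather than a diffusion model's output. Both are stand-ins chosen for this sketch and are not the paper's components.

```python
def render(view, theta):
    """Toy linear renderer: image = view @ theta (view is a list of rows)."""
    return [sum(w * t for w, t in zip(row, theta)) for row in view]


def gaussian_score(x, mean, sigma):
    """Gradient of log N(x; mean, sigma^2 I) with respect to x.

    Stands in for the score predicted by a 2D diffusion model.
    """
    return [-(xi - mi) / sigma ** 2 for xi, mi in zip(x, mean)]


def lifted_score(views, theta, mean, sigma):
    """Average over views of J^T * score(render(view, theta)).

    For a linear renderer the Jacobian J equals the view matrix, so the
    chain rule reduces to a transposed matrix-vector product per view.
    """
    grad = [0.0] * len(theta)
    for view in views:
        score_2d = gaussian_score(render(view, theta), mean, sigma)
        for j in range(len(theta)):
            grad[j] += sum(view[i][j] * score_2d[i]
                           for i in range(len(score_2d))) / len(views)
    return grad


views = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # view 1 sees components 0 and 1
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],  # view 2 sees components 1 and 2
]
theta = [0.5, -1.0, 2.0]
print(lifted_score(views, theta, mean=[0.0, 0.0], sigma=1.0))
# [-0.25, 1.0, -1.0]
```

Each view contributes a gradient only for the 3D parameters it observes, and averaging across views aggregates the 2D scores into a single update direction for the 3D parameters.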

Paint by Example: Exemplar-based Image Editing with Diffusion Models

Abstract: Language-guided image editing has been successful. The paper investigates exemplar-guided image editing for more precise control. It leverages self-supervised training to disentangle and re-organize the source image and the exemplar. It proposes an information bottleneck and strong augmentations to avoid simply copying and pasting the exemplar. It designs an arbitrary-shape mask for the exemplar image and leverages classifier-free guidance. Editing involves a single forward pass of the diffusion model without iterative optimization. The method shows impressive performance and controllable editing on in-the-wild images with high fidelity.
Paper Content:
Introduction: Creative editing for photos is becoming more common due to advances in social media platforms. AI-based techniques can make image editing easier. Deep neural networks can produce results for various low-level image editing tasks. Semantic image editing is more challenging and requires manipulation of high-level semantics. Language-image models have enabled various image manipulation tasks. Exemplar-based image editing allows accurate semantic manipulation with an exemplar image. An image-conditioned diffusion model is trained in a self-supervised manner. Techniques are proposed to tackle the challenge of degenerate solutions. The approach performs favorably over prior art for in-the-wild image editing.
Related work: Cutting and pasting one image onto another to create a realistic composite is a common photo editing operation. Many methods have been proposed to make the composite look more realistic. Traditional methods extract handcrafted features to match the color distribution. Recent works leverage deep semantic features to improve robustness. Existing methods are limited to specific image genres, such as faces, cars, birds, cats, etc....

DAMO-YOLO : A Report on Real-Time Object Detection Design

Abstract: The report presents a fast and accurate object detection method called DAMO-YOLO. It uses Neural Architecture Search, Reparameterized Generalized-FPN, AlignedOTA label assignment, and distillation enhancement. It searches for a detection backbone with low latency and high performance. It follows the rule of “large neck, small head”. It investigates how detector head size affects detection performance. Achieves 43.0/46.8/50.0 mAPs on COCO with the latency of 2....