arxiv-summary: AI-summarized AI papers

DiffusionDet: Diffusion Model for Object Detection

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.
Abstract
- Proposes DiffusionDet, a new framework for object detection.
- DiffusionDet formulates object detection as a denoising diffusion process from noisy boxes to object boxes.
- During training, object boxes diffuse from ground-truth boxes to a random distribution, and the model learns to reverse this noising process.
- During inference, the model applies the learned reversal to refine random boxes into object boxes.
- Evaluations on MS-COCO and LVIS show favorable performance compared to previous detectors.
- Random boxes are effective object candidates.
- Object detection can be solved in a generative way.
Paper Content
Introduction
- We propose DiffusionDet, a novel noise-to-box object detection framework, which decouples the training and evaluation and enables progressive refinement....
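
The training-time noising of boxes described in the abstract can be sketched as a standard Gaussian forward diffusion step. This is a minimal illustration, not the paper's implementation: the cosine schedule, the normalized (cx, cy, w, h) box format, and the names `cosine_alpha_bar` and `diffuse_boxes` are all assumptions made here.

```python
import math
import random

def cosine_alpha_bar(t, num_steps, s=0.008):
    """Fraction of signal kept at step t under a cosine schedule (assumed here)."""
    def f(u):
        return math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def diffuse_boxes(gt_boxes, t, num_steps=1000, rng=None):
    """Blend ground-truth boxes with Gaussian noise at step t.

    Boxes are hypothetical (cx, cy, w, h) tuples normalized to [0, 1].
    At t = 0 the boxes are unchanged; at t = num_steps they are almost pure noise.
    """
    rng = rng or random.Random(0)
    keep = cosine_alpha_bar(t, num_steps)
    return [
        tuple(math.sqrt(keep) * v + math.sqrt(1.0 - keep) * rng.gauss(0.0, 1.0)
              for v in box)
        for box in gt_boxes
    ]

boxes = [(0.5, 0.5, 0.2, 0.3)]
slightly_noisy = diffuse_boxes(boxes, t=100)
mostly_noise = diffuse_boxes(boxes, t=1000)
```

A detector trained in this style would take the noisy boxes and the image as input and predict the clean boxes; that network is not shown here.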

VeLO: Training Versatile Learned Optimizers by Scaling Up

Abstract
- Deep learning models are trained with hand-designed optimizers.
- This work leverages the same scaling approach behind the success of deep learning to learn versatile optimizers.
- The learned optimizer is a small neural network that ingests gradients and outputs parameter updates.
- The optimizer is meta-trained with approximately four thousand TPU-months of compute on a wide variety of optimization tasks....
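
The idea of an optimizer that is itself a small network mapping gradients to parameter updates can be illustrated with a toy sketch. Everything below is an assumption for illustration: the two input features (gradient and momentum), the one-hidden-layer network, and the hand-set weights. In VeLO the weights are meta-trained, and the architecture is more elaborate than this.

```python
import math

def tiny_optimizer_net(features, w_hidden, w_out):
    """One-hidden-layer MLP that maps per-parameter features to an update."""
    hidden = [math.tanh(sum(w * f for w, f in zip(row, features)))
              for row in w_hidden]
    return sum(w * h for w, h in zip(w_out, hidden))

def learned_step(params, grads, momentum, w_hidden, w_out, beta=0.9):
    """Apply the network independently to every parameter."""
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        m = beta * m + (1.0 - beta) * g          # running average of gradients
        update = tiny_optimizer_net([g, m], w_hidden, w_out)
        new_params.append(p + update)
        new_momentum.append(m)
    return new_params, new_momentum

# Hand-set weights standing in for meta-trained ones: update = -0.1 * tanh(g).
W_HIDDEN = [[1.0, 0.0]]
W_OUT = [-0.1]

def minimize_quadratic(x0=1.0, steps=50):
    """Minimize f(x) = x**2 with the toy learned optimizer."""
    params, momentum = [x0], [0.0]
    for _ in range(steps):
        grads = [2.0 * params[0]]                # gradient of x**2
        params, momentum = learned_step(params, grads, momentum, W_HIDDEN, W_OUT)
    return params[0]
```

Meta-training would adjust `W_HIDDEN` and `W_OUT` so that the resulting updates work well across many tasks, rather than fixing them by hand as done here.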

DeepPrivacy2: Towards Realistic Full-Body Anonymization

Abstract
- GANs are used for anonymization of human figures.
- DeepPrivacy2 is a novel anonymization framework for realistic anonymization of human figures and faces.
- A new large and diverse dataset for human figure synthesis is introduced.
- A style-based GAN is proposed to produce high-quality, diverse and editable anonymizations.
- The full-body anonymization framework provides stronger privacy guarantees than previously proposed methods.
Paper Content
Introduction
- Collecting and storing images is common.
- Collecting privacy-sensitive data without anonymization or consent is troublesome.
- Traditional image anonymization distorts data.
- Realistic anonymization has been introduced as an alternative.
- Current methods focus on face anonymization.
- SG-GAN proposes full-body anonymization.
- SG-GAN has limited visual quality and insufficient anonymization.
- This work extends SG-GAN to address these issues.
- Images are annotated with keypoints, pixel-to-vertex correspondences and a segmentation mask.
- A novel anonymization framework is proposed.
- GAN generates high-quality and diverse identities.
- GAN is extended for face anonymization.
- DeepPrivacy2 surpasses all previous state-of-the-art realistic anonymization methods.
Related work
- Naive image anonymization methods are widely used but degrade image quality.
- Early work focused on the K-same family of algorithms, which provide better privacy and data usability.
- Recent work on deep generative models can realistically anonymize data while retaining usability.
- Inpainting-based methods provide stronger privacy guarantees than transformative methods.
- Prior work focuses on face anonymization, leaving primary and secondary identifiers untouched.
- A limited amount of work focuses on full-body anonymization, with low-resolution images or visual artifacts.
- Recent work on full-body synthesis focuses on limited tasks such as transferring source appearances.
- Ma et al....

Versatile Diffusion: Text, Images and Variations All in One Diffusion Model

Abstract
- Recent advances in diffusion models have set a milestone in many generation tasks.
- New approaches focus on extensions and performance rather than capacity.
- Versatile Diffusion (VD) is a multi-flow network that handles text-to-image, image-to-text, image-variation, and text-variation in one unified model.
- VD has competitive quality, novel extensions and applications, and provides more semantic insights into the generated outputs.
- Code and models are open-sourced.
Paper Content
Introduction
- Multi-modality is a challenge for computer vision and machine learning.
- Deep learning has improved accuracy of traditional tasks.
- Multi-modal research has focused on discriminative tasks.
- Generative tasks of a large scope are challenging.
- GAN research has focused on specific domains and tasks.
- Diffusion models have been successful across many domains and tasks.
- Diffusion models have robust training objectives.
- Diffusion models have competitive performance.
- Diffusion models have disadvantages such as data hunger and high inference costs.
- Diffusion models have achieved smooth translation in cross-modal latent spaces.
- Versatile Diffusion (VD) is introduced to solve text, images, and variations in one unified model.
Related works
- Multi-modalities are unions of information with different forms, including vision, text, audio, etc....

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Abstract
- EVA is a vision-centric foundation model that uses publicly accessible data.
- EVA is pre-trained to reconstruct masked out image-text aligned vision features.
- EVA can be scaled up to one billion parameters and sets new records on a range of vision tasks.
- EVA can serve as a vision-centric, multi-modal pivot to connect images and text....

MedSegDiff: Medical Image Segmentation with Diffusion Probabilistic Model

Abstract
- Diffusion probabilistic models (DPMs) are a hot topic in computer vision.
- They are used for image generation, deblurring, super-resolution and anomaly detection.
- MedSegDiff is the first DPM-based model for medical image segmentation.
- Dynamic conditional encoding and FF-Parser are proposed to enhance the step-wise regional attention.
- MedSegDiff outperforms SOTA methods on three medical segmentation tasks.
Paper Content
I....

NeRF-SLAM: Real-Time Dense Monocular SLAM with Neural Radiance Fields

Abstract
- Proposed a novel 3D mapping pipeline for scene reconstruction from monocular images.
- Leveraged advances in dense monocular SLAM and real-time hierarchical volumetric neural radiance fields.
- Proposed uncertainty-based depth loss for good photometric and geometric accuracy.
- Better accuracy than competing approaches (up to 179% better PSNR and 86% better L1 depth).
- Real-time performance using only monocular images.
Paper Content
Introduction
- 3D reconstruction from monocular images is a difficult computer vision problem.
- Applications in robotics, surveying, and gaming can be enabled by achieving 3D reconstructions in real-time from images alone.
- RGB-D and Lidar sensors are used for 3D reconstruction, but monocular cameras are simpler and cheaper.
- Monocular 3D reconstruction is challenging due to lack of explicit measurements of depth.
- Deep-learning approaches have been used to try to solve the problem.
- Neural Radiance Fields (NeRFs) can provide photometrically accurate 3D representations, but are difficult to infer.
- Recent work has shown that NeRFs can be fit in real-time without ground-truth poses.
- Adding depth supervision can improve removal of ghost geometry and lead to faster convergence.
- Combining dense monocular SLAM with hierarchical volumetric neural radiance fields can build accurate radiance fields from a stream of images in real-time.
Related work
- Dense monocular SLAM
- Neural radiance fields
- Dense SLAM
- Computational complexity of dense SLAM
- Historically, decoupling pose and depth estimation
- RGB-D or Lidar sensors provide explicit depth measurements.
- Recent research on dense SLAM has achieved impressive results.
- CodeSLAM optimizes latent variables of auto-encoder to reduce depth variables.
- Tandem reconstructs 3D scenes with monocular images.
- Droid-SLAM adapts state-of-the-art dense optical flow estimation.
- Rosinol et al....

MetaFormer Baselines for Vision

Abstract
- MetaFormer plays a significant role in achieving competitive performance.
- MetaFormer ensures a solid lower bound of performance.
- MetaFormer works well with arbitrary token mixers.
- MetaFormer effortlessly offers state-of-the-art results.
- ConvFormer outperforms ConvNeXt.
- CAFormer sets a new record on ImageNet-1K.
- StarReLU reduces 71% FLOPs of activation compared with GELU.
Paper Content
Recap the concept of MetaFormer
- MetaFormer is a general architecture abstracted from Transformer....
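
To make the StarReLU versus GELU comparison concrete, here is a scalar sketch of both activations. StarReLU has the form s * ReLU(x)^2 + b with a learnable scale s and bias b; the default values 1.0 and 0.0 below are placeholders chosen for illustration, not the paper's initialization. GELU is written in its exact form, which needs the error function.

```python
import math

def star_relu(x, scale=1.0, bias=0.0):
    """StarReLU on one scalar: scale * relu(x)**2 + bias."""
    r = x if x > 0.0 else 0.0
    return scale * r * r + bias

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

activations = [(x, star_relu(x), gelu(x)) for x in (-2.0, -0.5, 0.0, 0.5, 2.0)]
```

StarReLU needs only a comparison, a few multiplications and an addition per element, while GELU evaluates `erf` (or an approximation of it), which is consistent with the FLOP reduction the summary reports.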

High Fidelity Neural Audio Compression

Abstract
- Introduce a real-time, high-fidelity audio codec leveraging neural networks.
- Use a streaming encoder-decoder architecture with quantized latent space.
- Simplify and speed up training with a single multiscale spectrogram adversary.
- Introduce a novel loss balancer mechanism to stabilize training.
- Use lightweight Transformer models to compress representation by up to 40%.
Paper Content
Introduction
- In 2021, streaming audio and video accounted for 82% of internet traffic.
- Audio compression is an important problem.
- Audio codecs use an encoder and decoder to remove redundancies in audio content.
- Neural networks have been used to compress audio.
- Challenges for lossy neural compression include overfitting and compressing efficiently.
- Human perception is used to evaluate audio codecs.
Related work
- Recent advancements in neural audio generation enabled computers to generate natural-sounding audio.
- Autoregressive models such as WaveNet produce convincing results but are slow.
- Generative Adversarial Networks (GANs) can match the quality of autoregressive models.
- Low-bitrate parametric speech and audio codecs have been studied for a long time.
- Neural-based audio codecs have been proposed and show promising results.
- Audio and speech can be represented using discrete values for various tasks.
Model
- An audio signal of duration d can be represented by a sequence of audio channels and samples.
- The EnCodec model is composed of 3 components: encoder, quantization layer, decoder.
- The system is trained end-to-end to minimize reconstruction and perceptual loss.
- Visual description of method in Figure 1.
Encoder & decoder architecture
- The EnCodec model is a streaming, convolutional-based encoder-decoder architecture.
- Used for various audio-related tasks, e....
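
The quantization layer between encoder and decoder can be illustrated with residual vector quantization, the scheme EnCodec uses for its latent space. The sketch below is a toy version under stated assumptions: the codebooks are tiny and hand-written rather than learned, and the function names `nearest`, `rvq_encode` and `rvq_decode` are made up for this example.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def rvq_encode(vec, codebooks):
    """Quantize vec in stages; each stage encodes what the previous ones missed."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct the vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Two toy codebooks: a coarse stage followed by a finer correction stage.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25]],
]
codes = rvq_encode([1.2, 0.8], CODEBOOKS)
approx = rvq_decode(codes, CODEBOOKS)
```

Only the integer codes need to be transmitted, and using fewer stages trades reconstruction quality for a lower bitrate.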

FaceDancer: Pose- and Occlusion-Aware High Fidelity Face Swapping

Abstract
- Proposed a new single-stage method for subject-agnostic face swapping and identity transfer called FaceDancer.
- Its two main contributions are Adaptive Feature Fusion Attention (AFFA) and Interpreted Feature Similarity Regularization (IFSR).
- AFFA, embedded in the decoder, adaptively learns to fuse attribute features and features conditioned on identity information without requiring any additional facial segmentation process.
- IFSR leverages intermediate features in an identity encoder to preserve important attributes while transferring the identity of the source face.
- Experiments show FaceDancer outperforms other state-of-the-art networks in terms of identity transfer and has better pose preservation than most previous methods.
Paper Content
Introduction
- Face swapping is a task that shifts the identity of a source face into a target face while preserving the target face’s attributes....