arxiv-summary: AI-summarized AI papers

Universal Instance Perception as Object Discovery and Retrieval

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.
Abstract: UNINEXT is a universal instance perception model for object discovery and retrieval. Its benefits include exploiting data from different tasks and label vocabularies for joint training of general instance-level representations, and being parameter-efficient when handling multiple tasks. UNINEXT has shown superior performance on 20 challenging benchmarks from 10 instance-level tasks.
Paper Content
Introduction: Object-centric understanding is a challenging problem in computer vision. Ten sub-tasks are discussed, distributed on the vertices of a cube. Object detection and instance segmentation require finding objects of specific categories. Multiple Object Tracking (MOT), Multi-Object Tracking and Segmentation (MOTS), and Video Instance Segmentation (VIS) require finding object trajectories of specific categories in videos. Referring Expression Comprehension (REC), Referring Expression Segmentation (RES), and Referring Video Object Segmentation (R-VOS) aim to find objects matched with language expressions. Single Object Tracking (SOT) and Video Object Segmentation (VOS) take the target annotations given in the first frame as the reference. Fragmented task definitions split the field into pieces, causing redundant parameters and overlooking the possibility of mutual collaboration. UNINEXT is proposed as a universal instance perception model of the next generation. UNINEXT can flexibly perceive different instances by changing the input prompts. UNINEXT achieves superior performance on 20 challenging benchmarks.
Related work: Retrieval by Category Names: object detection and instance segmentation. Retrieval by Language Expressions: REC, RES, and R-VOS. Retrieval by Reference Annotations: SOT and VOS. Unified Vision Models: unified learning paradigms and unified model architectures. Object detection and instance segmentation are foundations for other instance perception tasks. REC methods are divided into two-stage, one-stage, and Transformer-based methods. RES approaches focus on designing diverse attention mechanisms. R-VOS is an extension of RES from images to videos. SOT and VOS extract target features and fuse target information with representations of the current frame. Unified vision models attempt to solve multiple vision or multi-modal tasks with a single model. Unified learning paradigms cover many tasks and modalities. Unified model architectures are designed for a group of closely related tasks.
Approach: Existing instance perception tasks are categorized into three classes. Object detection, instance segmentation, MOT, MOTS, and VIS use category names as prompts. REC, RES, and R-VOS use an expression as the prompt. SOT and VOS use the annotation given in the first frame as the prompt. All instance perception tasks are reformulated as a prompt-guided object discovery and retrieval problem. UNINEXT consists of three components: prompt generation, image-prompt feature fusion, and object discovery and retrieval.
Prompt generation: A prompt generation module is used to transform the original diverse prompt inputs into a unified form....
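
The three prompt classes listed in the Approach notes can be written down as a small lookup table. This is only an illustrative sketch: the dictionary name, the helper function, and the string labels are hypothetical and are not taken from the UNINEXT code base.

```python
# Hypothetical lookup table: task abbreviation -> kind of prompt it uses,
# following the three classes listed in the summary above.
PROMPT_TYPE_BY_TASK = {
    "object detection": "category names",
    "instance segmentation": "category names",
    "MOT": "category names",
    "MOTS": "category names",
    "VIS": "category names",
    "REC": "language expression",
    "RES": "language expression",
    "R-VOS": "language expression",
    "SOT": "first-frame annotation",
    "VOS": "first-frame annotation",
}


def tasks_by_prompt_type(table):
    """Group task names by the prompt type they share."""
    groups = {}
    for task, prompt_type in table.items():
        groups.setdefault(prompt_type, []).append(task)
    return groups


groups = tasks_by_prompt_type(PROMPT_TYPE_BY_TASK)
for prompt_type, tasks in groups.items():
    print(f"{prompt_type}: {', '.join(tasks)}")
```

Grouping by prompt type makes the summary's point concrete: ten tasks collapse into three prompt formats, which is what lets one model serve all of them.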

One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale

Abstract: Proposes a unified diffusion framework to fit multi-modal data in one model. Learns diffusion models for marginal, conditional, and joint distributions by predicting the noise in the perturbed data. Perturbation levels can be different for different modalities. Learns all distributions simultaneously with a minimal modification to the original diffusion model. Implemented on large-scale paired image-text data. Able to perform image, text, text-to-image, image-to-text, and image-text pair generation. Produces perceptually realistic samples in all tasks. Quantitative results are superior to existing general-purpose models and comparable to bespoke models.
Paper Content
Introduction: A content-creation revolution is being driven by advances in generative modeling. Diffusion models create high-fidelity and diverse data. Humans can generate multi-modal content simultaneously. A unified training framework is needed to cover all types of multi-modal generative tasks. Probabilistic modeling is used to fit the relevant distributions. The UniDiffuser framework is proposed to fit all distributions in one model. UniDiffuser uses a transformer-based backbone. UniDiffuser is able to perform image, text, text-to-image, image-to-text, and image-text pair generation. UniDiffuser produces perceptually realistic samples in all tasks.
Background: Diffusion models perturb data by injecting noise. The noise injection is formalized as a Markov chain. Data can be generated by reversing the process. The optimal mean is estimated by a noise prediction network. Classifier-free guidance improves the sample quality of a conditional diffusion model.
Method: UniDiffuser is a single diffusion model that captures the marginal, conditional, and joint distributions determined by multi-modal data. UniDiffuser can be extended to more modalities. UniDiffuser is able to capture all relevant distributions determined by two modalities of data. Fitting these distributions is equivalent to estimating a conditional expectation over the noise. UniDiffuser employs a joint noise prediction network to predict the noise injected into the two modalities. UniDiffuser uses a transformer-based network. UniDiffuser can perform unconditional, conditional, and joint sampling. UniDiffuser is more efficient than learning a single joint distribution.
Classifier-free guidance for free: CFG combines a conditional and an unconditional model linearly during sampling....
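
The linear combination used by classifier-free guidance can be sketched in a few lines. This is a minimal illustration, not UniDiffuser's implementation: the function name is hypothetical, the noise predictions are plain Python lists standing in for tensors, and the `(1 + scale)` weighting is one common convention that is assumed here.

```python
def cfg_combine(eps_cond, eps_uncond, scale):
    """Linearly combine conditional and unconditional noise predictions.

    Assumed convention: (1 + scale) * eps_cond - scale * eps_uncond,
    so scale = 0 returns the conditional prediction unchanged.
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("noise predictions must have the same length")
    return [(1.0 + scale) * c - scale * u for c, u in zip(eps_cond, eps_uncond)]


# Toy noise predictions for a 2-dimensional sample.
eps_cond = [1.0, 2.0]
eps_uncond = [0.5, 1.0]
print(cfg_combine(eps_cond, eps_uncond, 0.0))  # [1.0, 2.0]
print(cfg_combine(eps_cond, eps_uncond, 1.0))  # [1.5, 3.0]
```

A larger scale pushes the combined prediction further along the direction from the unconditional to the conditional output.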

Prefix-tree Decoding for Predicting Mass Spectra from Molecules

Abstract: Computational predictions of mass spectra from molecules have enabled the discovery of clinically relevant metabolites. Predictive tools are limited, either operating with overly rigid constraints or decoding lossy and nonphysical discretized spectra vectors. A new intermediate strategy is introduced for predicting mass spectra from molecules by treating the spectra as sets of chemical formulae....

Probing neural representations of scene perception in a hippocampally dependent task using artificial neural networks

Abstract: DNNs can accurately capture the hierarchy of neural responses in the mammalian visual system. DNNs have been less successful in explaining representations in higher cortical areas. A novel scene perception benchmark has been designed to probe the ability of DNNs to transform scenes viewed from different perspectives. A network architecture inspired by the connectivity between temporal lobe structures and the hippocampus has been used to demonstrate that DNNs can learn this task....

Resurrecting Recurrent Neural Networks for Long Sequences

Abstract: RNNs offer fast inference on long sequences but are hard to optimize and slow to train. Deep state-space models (SSMs) have recently been shown to perform well on long sequence modeling tasks and have the added benefits of fast parallelizable training and RNN-like fast inference. This paper shows that careful design of deep RNNs can recover the impressive performance of deep SSMs on long-range reasoning tasks, while also matching their training speed....

Stabilizing Transformer Training by Preventing Attention Entropy Collapse

Abstract: Training stability is important for Transformers. Examined the evolution of attention layers. Tracked attention entropy for each attention head during training. Low attention entropy is accompanied by high training instability. Proposed σReparam to prevent entropy collapse. Experiments show σReparam provides stability and robustness.
Paper Content
Introduction: Transformers are state-of-the-art models in many application domains. The original Transformer paper uses residual connections and layer normalization. Various works have attempted to promote better training stability and robustness. Attention entropy is tightly correlated with the model’s stability and convergence. Small attention entropy often leads to slow convergence, fluctuations in training loss, and divergence. Modifying the temperature of the Transformer gives direct control over the attention entropy. Sharpness of the Hessian is related to training stability. Entropy collapse can be prevented by controlling the spectral norms of the query and key projections. σReparam reparameterizes all weight matrices so that they update smoothly and in a controlled way. Entropy collapse is commonly observed in baseline models of various benchmarks.
Related works: Transformers use layer normalization (LN) to achieve training stability. Entropy collapse happens even with extensive use of normalization layers. σReparam does not rely on specific normalization layers and can work without them. Weight reparameterization has been adopted in deep learning. σReparam is the first simple reparameterization technique that provides competitive performance. SpectralNorm explicitly constrains the model’s capacity. Rank collapse of Transformer training was first identified by Dong et al....
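
Attention entropy, and its dependence on temperature, can be illustrated with a short standard-library sketch. The function names and the toy logits are assumptions for illustration; this computes the Shannon entropy of softmax attention rows and does not implement σReparam itself.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over one row of attention logits, with a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def attention_entropy(logit_rows, temperature=1.0):
    """Mean entropy over the rows (queries) of one attention head."""
    rows = [entropy(softmax(row, temperature)) for row in logit_rows]
    return sum(rows) / len(rows)


# Toy logits for one head with two queries and four keys.
head_logits = [[2.0, 1.0, 0.0, -1.0], [0.5, 0.5, 3.0, 0.0]]
for temperature in (0.1, 1.0, 10.0):
    value = attention_entropy(head_logits, temperature)
    print(f"temperature={temperature}: entropy={value:.4f}")
```

Lower temperatures sharpen the attention distribution and drive the entropy toward zero, while higher temperatures push it toward the maximum of log(number of keys). This is the sense in which temperature gives direct control over attention entropy.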

StyleGANEX: StyleGAN-Based Manipulation Beyond Cropped Aligned Faces

Abstract: StyleGAN is limited to cropped aligned faces at a fixed image resolution. Dilated convolutions can be used to rescale the receptive fields of shallow layers in StyleGAN. This allows fixed-size small features at shallow layers to be extended into larger ones that can accommodate variable resolutions. An encoder is introduced to enable real face inversion and manipulation....
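
How dilation rescales a receptive field can be checked with a little arithmetic. The sketch below uses the standard formulas for stride-1 convolutions; the function names and the two-layer stacks are hypothetical and do not reflect StyleGANEX's actual layer configuration.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions.

    `layers` is a list of (kernel_size, dilation) pairs.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += (kernel_size - 1) * dilation
    return field


# Two 3x3 layers, first without dilation, then with dilation 2.
plain = [(3, 1), (3, 1)]
dilated = [(3, 2), (3, 2)]
print(effective_kernel_size(3, 2))  # 5
print(receptive_field(plain))       # 5
print(receptive_field(dilated))     # 9
```

The dilated stack covers a wider area with the same number of weights, which is why dilation can enlarge the receptive field of shallow layers without retraining their kernels at a different size.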

Rewarding Chatbots for Real-World Engagement with Millions of Users

Abstract: Pretrained large language models have been used to create social chatbots. These chatbots need to be engaging to retain users. This work proposes using human feedback to develop highly engaging chatbots. Evaluation metrics are used to measure the level of engagement. A/B testing shows an increase in user retention of up to 30%....

MVImgNet: A Large-scale Dataset of Multi-view Images

Abstract: Deep learning algorithms are data-driven. ImageNet has driven a trend of “learning from large-scale data” in computer vision. Pretraining on ImageNet has been used to benefit 2D visual tasks. There is no generic dataset for 3D vision like ImageNet. MVImgNet is a large-scale dataset of multi-view images. MVImgNet has been used to benefit 3D and 2D visual tasks. MVPNet is a 3D object point cloud dataset derived from MVImgNet.
Paper Content
Introduction: Deep learning algorithms are data-driven. Training on large-scale datasets allows deep neural networks to extract rich representations. ImageNet is the pioneer of large-scale real-world image datasets. Pretraining on ImageNet boosts model performance. Various 3D datasets have been produced to facilitate 3D visual applications. Existing 3D datasets are either synthetic or not comparable to ImageNet. The goal of the paper is to build a primary dataset and explore its effect. The dataset is created from multi-view images. Dataset contains 6....

Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning

Abstract: Contrastive loss is used to learn representations from multiple modalities. Exact modality alignment is not optimal for downstream prediction tasks. Three approaches are proposed to construct latent modality structures. Experiments are conducted on two multi-modal representation learning frameworks. The method achieves consistent improvements over existing methods.
Paper Content
Introduction: The aim is to learn generic representations from images and texts. One line of work unifies the representations of the two modalities in one encoder. Another represents the image and text modalities separately with modality-specific encoders. Contrastive learning is used to align the modalities. The modality gap is defined as the distance between the feature distributions of the two modalities. Contrastive learning does not always reduce the modality gap. The modality gap problem is studied theoretically. Regularizations are proposed to construct better latent structures: intra-modality, inter-modality, and intra-inter-modality regularizations.
Related work: Unified models process both images and texts. A second category uses separate encoders for images and texts. Contrastive loss is used to align multiple modalities. A third category uses separate encoders and a late-fusion multi-modal encoder.
Understanding the impact of modality gap on downstream performance: Whether modality alignment in feature space through contrastive learning helps downstream performance is an open question. Notation: $X_T$ and $X_V$ denote input texts and images, and $Y$ denotes the target variable. The modality gap problem is formally formulated. The relationship between modality gap and downstream performance is presented. An information-theoretic analysis is provided. Conditional entropy and cross-entropy loss are related.
Empirical analysis on modality gap: Contrastive pretraining is used to align paired multimodal data in the feature space. Positive pairs are pulled closer together, while negative pairs are pushed farther apart. Experiments are conducted to explore the effect of reducing the modality gap on image/text retrieval. An alignment loss is optimized during training to reduce the gap between modalities. Retrieval performance barely changes when the gap between the two modalities changes.
An information-theoretic analysis on modality gap: Consistent with the empirical observation, reducing the modality gap in feature space does not always lead to better downstream task performance. An information gap is defined to characterize the gap in utility that the two modalities provide for predicting the target variable. The information gap depends only on the joint distribution and is independent of the modality encoders. The information gap serves as a lower bound on the downstream prediction error when seeking features with zero modality gap. Theorem 3....
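
The contrastive alignment and the modality gap discussed above can be illustrated with a small standard-library sketch. Everything here is an assumption for illustration: the function names, the one-directional InfoNCE-style loss, the temperature of 0.07, and the use of centroid distance between normalized features as a simple proxy for the gap. The paper's own definitions may differ.

```python
import math


def normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(image_feats, text_feats, temperature=0.07):
    """Image-to-text contrastive loss; pair i of each list is the positive."""
    images = [normalize(v) for v in image_feats]
    texts = [normalize(v) for v in text_feats]
    loss = 0.0
    for i, image in enumerate(images):
        logits = [dot(image, text) / temperature for text in texts]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[i]  # negative log-softmax of the positive
    return loss / len(images)


def centroid_gap(image_feats, text_feats):
    """Euclidean distance between the mean normalized features (gap proxy)."""
    images = [normalize(v) for v in image_feats]
    texts = [normalize(v) for v in text_feats]
    dim = len(images[0])
    image_mean = [sum(v[d] for v in images) / len(images) for d in range(dim)]
    text_mean = [sum(v[d] for v in texts) / len(texts) for d in range(dim)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(image_mean, text_mean)))


image_feats = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(image_feats, matched_text) < info_nce(image_feats, swapped_text))
print(centroid_gap(image_feats, swapped_text))
```

In this toy case the swapped pairing has a much higher contrastive loss even though the centroid gap is zero for both pairings, which loosely mirrors the summary's observation that the size of the gap alone does not determine retrieval quality.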