arxiv-summary: AI-summarized AI papers

Self Supervision Does Not Help Natural Language Supervision at Scale

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract: Self supervision and natural language supervision are two ways to train general-purpose image encoders. M3AE and SLIP have suggested that these approaches can be combined, but their results use small pre-training datasets (<50M samples). This work investigates whether a similar approach can be effective with larger datasets. A combination of two state-of-the-art approaches (MAE and CLIP) provides a benefit when trained on 11....

Understanding and Detecting Hallucinations in Neural Machine Translation via Model Introspection

Abstract: Neural sequence generation models can produce outputs that are unrelated to the source text. It is unclear what conditions cause these hallucinations and how to prevent them. This work identifies internal model symptoms of hallucinations and uses them to design a detector. The detector outperforms model-free baselines and strong classifiers on English-Chinese and German-English translation test beds....

Learning-Rate-Free Learning by D-Adaptation

Abstract: The speed of gradient descent depends on the learning rate. A single-loop method can achieve the optimal convergence rate without knowledge of the distance to the solution set, and without additional multiplicative log factors. Experiments show that the method matches hand-tuned learning rates. The method is practical and efficient, and requires no additional evaluations. An open-source implementation is available.

Paper Content

Introduction: The problem is unconstrained convex minimization, for which the standard approach is the subgradient method. The step size (learning rate) affects convergence, and the optimal step size requires knowledge of the distance to the solution. The Dual Averaging with D-Adaptation algorithm achieves the optimal rate of convergence, with no need for hyper-parameter grid searches.

Algorithm: Algorithm 1 is a modification of the AdaGrad step size applied to weighted dual averaging. It uses a lower bound $d_k$ on $D$ and has two key differences from the classical bound. Theorem 1 states that Algorithm 1 returns a point $\hat{x}_n$ such that $f(\hat{x}_n) - f_* = O(DG/\sqrt{n+1})$ as $n \to \infty$, where $G$ is the Lipschitz constant and $D = \|x_0 - x_*\|$. Theorem 2 states that Algorithm 1 run for $n \ge \log_2(D/d_0)$ steps has a guarantee that is significantly better than that of the subgradient method.

D-adapted AdaGrad: The D-Adaptation technique can be applied to the coordinate-wise scaling variant of AdaGrad; Algorithm 2 presents this method. It estimates the distance to the solution in the ∞-norm instead of the Euclidean norm, and has the same adaptive convergence rate as AdaGrad up to constant factors. Theorem 3 gives the guarantee for a convex p-dimensional function, with $G_\infty$ entering the initialization of $a_0$.

Discussion: D-Adaptation is illustrated on a toy problem, minimizing an absolute value function. The algorithm starts with a value of $d_0$ that is lower than the known value of $D$. The value of $d_k$ typically does not asymptotically approach $D$. The algorithm uses a hyper-gradient quantity to estimate the magnitude of the optimal learning rate. It is applicable to convex Lipschitz functions and can be extended to the stochastic setting.

Related work: The major classes of approaches to optimizing Lipschitz functions are reviewed.

Polyak step size: The Polyak step size can replace the requirement of knowledge of $D$, and gives the optimal rate of convergence without additional log factors. Estimates or approximations of $f_*$ can lead to unstable convergence. A restarting scheme with lower bounds on $f_*$ can converge within a log factor of the optimal rate.

Exact line searches: This method gives the optimal rate without requiring knowledge of problem parameters. Relaxing the exact line search to an approximate line search introduces additional dependencies on problem constants. A bisection algorithm is used to replace the log dependence on $d_{\max}$ with a log log dependence. Estimating $d_{\max}$ allows $d_{\max}$ to be replaced with $D$ in the bound.

Coin-betting: Coin-betting approaches can be used when knowledge of $G$ is assumed but not $D$....
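
The D-Adaptation summary above describes weighted dual averaging with an AdaGrad-style step size and a running lower bound d_k on D. A minimal Python sketch of that idea follows. The update formulas are written from a reading of the method and are assumptions rather than a copy of the paper's Algorithm 1; the function name, argument names, and the fallback for an unknown G are hypothetical.

```python
import numpy as np

def d_adapted_dual_averaging(subgrad, x0, d0=1e-6, n_steps=1000, G=None):
    """Sketch of weighted dual averaging with a D-Adaptation-style estimate.

    subgrad: callable returning a subgradient at x (hypothetical interface).
    d0: initial lower bound on D = ||x0 - x*||; should not exceed D.
    """
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    if G is None:
        # Assumption: use the first subgradient norm when G is not supplied.
        G = max(float(np.linalg.norm(subgrad(x))), 1e-12)
    d = d0
    gamma = 1.0 / G            # gamma_0
    s = np.zeros_like(x0)      # running sum of d_k * g_k
    sum_g2 = 0.0               # running sum of ||g_k||^2
    corr = 0.0                 # running sum of gamma_k * d_k^2 * ||g_k||^2
    x_weighted = np.zeros_like(x0)
    d_total = 0.0
    for _ in range(n_steps):
        g = np.asarray(subgrad(x), dtype=float)
        g2 = float(np.dot(g, g))
        # Accumulate the d_k-weighted average of the iterates.
        x_weighted += d * x
        d_total += d
        s = s + d * g
        corr += gamma * d * d * g2      # uses gamma_k, before it is updated
        sum_g2 += g2
        gamma = 1.0 / np.sqrt(G * G + sum_g2)   # AdaGrad-norm style gamma_{k+1}
        s_norm = float(np.linalg.norm(s))
        if s_norm > 0.0:
            # Candidate lower bound on D; d is never allowed to decrease.
            d_hat = (gamma * s_norm ** 2 - corr) / (2.0 * s_norm)
            d = max(d, d_hat)
        x = x0 - gamma * s              # dual averaging step
    return x_weighted / d_total, d

# Toy problem from the discussion: minimize |x| starting from x0 = 1 (so D = 1).
x_hat, d_final = d_adapted_dual_averaging(np.sign, np.array([1.0]), n_steps=5000)
```

In this sketch no learning rate is passed in: the only scale parameter is d0, which is deliberately tiny and is grown by the algorithm while staying below D.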

Discrete Latent Structure in Neural Networks

Abstract: Data from many fields can be represented by discrete, compositional structures. Latent structure models are useful for learning to extract such representations. Three strategies for learning with discrete latent structure are explored.

Paper Content

Motivation: Machine learning is used to analyze data such as images, text, and sound. Natural language sentences can be analyzed in terms of their dependency structure....

Towards Models that Can See and Read

Abstract: Visual Question Answering (VQA) and Image Captioning (CAP) are popular vision-language tasks. Scene-text versions of these tasks require reasoning from the text in the image. Task-specific methods can either see or read, but not both. UniTNT is a Unified Text-Non-Text approach that grants existing multimodal architectures scene-text understanding capabilities. UniTNT leads to the first single model that successfully handles both task types....

Hierarchical Bayesian inference for community detection and connectivity of functional brain networks

Abstract: Many fMRI studies rely on estimates of brain networks. Existing methods for estimating the community structure of networks lack validation. A new multilayer community detection method based on Bayesian latent block modelling is developed. The method can detect the group-level community structure of weighted functional networks. A new community structure-based multivariate Gaussian generative model is proposed for validation....

Data thinning for convolution-closed distributions

Abstract: The paper proposes data thinning, a new approach for splitting an observation into two or more parts. Data thinning can be applied to any observation drawn from a “convolution closed” distribution. It has applications to model selection, evaluation, and inference. Cross-validation via data thinning provides an alternative to sample splitting. Data thinning can also be used to validate the results of unsupervised learning approaches.

Paper Content

Introduction: Data sets are growing in size and complexity, and there is a need for methods to validate the outputs of complex models. Sample splitting is a common method used to validate models. Data fission is an alternative to sample splitting proposed by Leiner et al....
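
The splitting step in the data thinning summary above can be illustrated with the Poisson family, a standard example of a convolution-closed distribution. The sketch below assumes the well-known Poisson thinning identity: if X ~ Poisson(λ) and X1 | X ~ Binomial(X, ε), then X1 and X2 = X − X1 are independent Poisson variables with means ελ and (1 − ε)λ. The function name `thin_poisson` and the parameter name `eps` are hypothetical, and the paper's general recipe may differ in detail.

```python
import numpy as np

def thin_poisson(x, eps, rng):
    """Split Poisson counts x into two parts that sum back to x.

    eps is the (assumed) fraction of information assigned to the first part.
    """
    x = np.asarray(x)
    x1 = rng.binomial(x, eps)   # binomial draw conditional on each count
    x2 = x - x1                 # remainder; x1 + x2 == x by construction
    return x1, x2

rng = np.random.default_rng(0)
x = rng.poisson(10.0, size=100_000)
x_train, x_test = thin_poisson(x, 0.3, rng)
```

Unlike sample splitting, every observation contributes to both parts, so one part can be used to fit a model and the other to evaluate it.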

EPiC-GAN: Equivariant Point Cloud Generation for Particle Jets

Abstract: Generative machine learning models enable fast event generation. EPiC-GAN is a flexible framework based on deep sets for simulating sprays of particles. EPiC layers do not rely on pairwise information sharing between particles. EPiC-GAN scales well to large particle multiplicities and achieves high generation fidelity.

Paper Content

Introduction: Particle interactions are simulated for fundamental physics research....

GLIGEN: Open-Set Grounded Text-to-Image Generation

Abstract: Large-scale text-to-image diffusion models have made advances, but existing models use text input alone, which can impede controllability. GLIGEN is a novel approach that builds upon and extends existing models. It preserves the concept knowledge of the pre-trained model and injects grounding information into new trainable layers. GLIGEN achieves open-world grounded text2img generation with caption and bounding box condition inputs, and outperforms existing supervised layout-to-image baselines by a large margin.

Paper Content

Introduction: Image generation research has seen advances in recent years, using GANs and text-conditional autoregressive and diffusion models. These models have practical use cases and can generate high-quality images, but existing models cannot be conditioned on input modalities other than text. The paper proposes a method to provide new grounding conditional inputs to pretrained text-to-image diffusion models. The model can generalize to unseen objects, and its zero-shot performance on layout2img tasks outperforms the prior state of the art. More broadly, the paper proposes a way to build upon large pretrained generative models for downstream tasks.

Related work: Autoregressive and diffusion models are state-of-the-art for text-to-image generation. DALL-E and Parti demonstrate zero-shot and scaling-up abilities. Diffusion models have shown promising results, and masked modeling can achieve SoTA-level generation performance. Make-A-Scene incorporates semantic maps into text-to-image generation. Layout2Im generates images from bounding boxes, but existing layout2image methods are closed-set. GANs and diffusion models have been explored for various conditioning information. This work investigates how to build upon existing models to enable open-set grounded image generation.

Preliminaries on latent diffusion models: Diffusion-based methods are effective for text2image tasks. Latent Diffusion Model (LDM) and Stable Diffusion are powerful models. LDM has two stages: a mapping network that obtains a latent representation, and a diffusion model on the latent. The training objective is to denoise latent representations of the image. LDM can generate impressive language-to-image results with pretraining on internet-scale data.

Open-set grounded image generation

Grounding instruction input: Entities are described through text or an example image, and the spatial configuration is described with a bounding box or keypoints. The caption and grounding entities are processed as input tokens to the diffusion model. Existing layout2img works only deal with the closed-set setting. Training data requires both text and grounding entity, and three types of data are used: grounding, detection, and detection and caption. With an image prompt, the entity is described using an image instead of language. Keypoints give richer spatial configurations than bounding boxes.

Continual learning for grounded generation: The goal is to endow existing large language-to-image generation models with new spatial grounding capabilities. These models are pre-trained on web-scale image-text data to gain the knowledge needed to synthesize realistic images. The original model weights are retained while the new capability is added. A new gated self-attention layer is added to enable spatial grounding, with attention performed over the concatenation of visual and grounding tokens. The original denoising objective is used for continual learning. The model learns to use the additional localization information while retaining pre-trained concept knowledge. A versatile interface allows the user to ground entities that exist in the caption input or to add objects freely. A scheduled sampling scheme is used at inference to improve visual quality and extend the model to other domains.

Experiments: The model's grounded text2img generation was evaluated in closed-set and open-set settings, components of the model were ablated, and extensions to image prompt and keypoint grounded generation were shown. Quantitative experiments used an LDM pretrained on LAION.

Closed-set grounded text2img generation: Generation quality and grounding accuracy were evaluated in the closed-set setting. The model was trained and evaluated on the COCO2014 dataset with 3 types of grounding instructions, and compared to baseline models using FID and YOLO score. The model trained with detection annotation instructions had the best performance, and combining data from all grounding instructions can lead to complementary benefits. Gated self-attention was used to absorb the grounding instruction, with ablations on null caption and gated cross-attention. The model achieved state-of-the-art performance for image quality and grounding accuracy. The model was also pretrained on a larger dataset and evaluated zero-shot and after fine-tuning.

Open-set grounded text2img generation: GLIGEN can generate grounded entities beyond the COCO categories. It learns to re-position the visual features corresponding to the grounding entities. The model is evaluated on LVIS and outperforms the supervised baseline, and performance increases as training data is scaled up. The model gains grounding ability compared to vanilla Stable Diffusion.

Inpainting comparison: GLIGEN can be used for inpainting tasks. An experiment was conducted on the COCO dataset to inpaint randomly masked objects of different sizes. The results show that objects inpainted by GLIGEN occupy the missing region more tightly than those of the baselines.

Keypoints grounding: The model uses bounding boxes and human keypoints as grounding conditions for generation. It was compared to pix2pixHD, trained with and without captions, and generates better image quality than pix2pixHD. It can be used to specify the scene and the person's gender for image creation.

Image grounding: Image grounded generation uses a reference image to represent a grounded entity....
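
The gated self-attention layer mentioned in the GLIGEN summary above (attention over the concatenation of visual and grounding tokens, added on top of retained original weights) can be sketched in Python as follows. This is a single-head NumPy illustration, not the authors' implementation. The tanh gate with a scalar that starts at zero, the `beta` scale, and all function and argument names are assumptions made for this example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gated_self_attention(visual, grounding, Wq, Wk, Wv, gate, beta=1.0):
    """Residual self-attention over [visual; grounding], gated by tanh(gate).

    visual: (Nv, d) visual tokens; grounding: (Ng, d) grounding tokens.
    Wq, Wk, Wv: (d, d) projections for the new layer (hypothetical names).
    gate: scalar; gate = 0 leaves the visual tokens unchanged.
    """
    tokens = np.concatenate([visual, grounding], axis=0)   # (Nv + Ng, d)
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))         # (Nv + Ng, Nv + Ng)
    out = attn @ v
    out_visual = out[: visual.shape[0]]    # keep outputs for visual tokens only
    return visual + beta * np.tanh(gate) * out_visual

rng = np.random.default_rng(0)
d = 8
visual = rng.normal(size=(4, d))
grounding = rng.normal(size=(2, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
unchanged = gated_self_attention(visual, grounding, Wq, Wk, Wv, gate=0.0)
grounded = gated_self_attention(visual, grounding, Wq, Wk, Wv, gate=1.0)
```

With the gate at zero the layer is an identity on the visual tokens, which is one way a new layer can be added without disturbing the pretrained model at the start of training.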

Transformers as Algorithms: Generalization and Implicit Model Selection in In-context Learning

Abstract: In-context learning (ICL) is a type of prompting where a transformer model operates on a sequence of (input, output) examples and performs inference on-the-fly. The paper formalizes ICL as an algorithm learning problem, in which the transformer model is treated as a learning algorithm that can be specialized via training. Multitask learning is used to obtain generalization bounds for ICL....