arxiv-summary: AI-summarized AI papers

Towards a Complete Analysis of Langevin Monte Carlo: Beyond Poincaré Inequality

Link to paper The full paper is available here. You can also find the paper on PapersWithCode here. Abstract: Langevin diffusions are rapidly convergent under certain conditions. Vempala and Wibisono (2019) and Chewi et al. (2022) established results for log-Sobolev and Poincaré inequalities respectively. This paper goes beyond Poincaré inequalities and establishes upper and lower bounds for Langevin diffusions and LMC under weak Poincaré inequalities. Results explicitly quantify the effect of the initializer on the performance of the LMC algorithm....
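
The excerpt does not spell out the LMC update itself, so the following is only a minimal sketch of the standard unadjusted Langevin step, x_{k+1} = x_k - h * grad V(x_k) + sqrt(2h) * xi_k, for a target density proportional to exp(-V). The function name lmc_chain, the one-dimensional standard Gaussian target, the step size, the chain length, and the starting point are all illustrative assumptions, not taken from the paper.

```python
import math
import random


def lmc_chain(grad_potential, x0, step_size, n_steps, seed=0):
    """Run unadjusted Langevin Monte Carlo in one dimension.

    Targets a density proportional to exp(-V(x)), where grad_potential
    returns V'(x). All names and defaults here are hypothetical.
    """
    rng = random.Random(seed)
    x = float(x0)
    chain = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Gradient step on the potential plus injected Gaussian noise.
        x = x - step_size * grad_potential(x) + math.sqrt(2.0 * step_size) * noise
        chain.append(x)
    return chain


# Assumed toy target: standard Gaussian, V(x) = x^2 / 2, so V'(x) = x.
# The chain starts far from the mode to make the role of the initializer visible.
chain = lmc_chain(lambda x: x, x0=5.0, step_size=0.05, n_steps=20000)
kept = chain[5000:]  # discard the early, initializer-dominated iterates
mean = sum(kept) / len(kept)
var = sum((v - mean) ** 2 for v in kept) / len(kept)
```

With a fixed step size the chain is biased: for this toy target the stationary variance is slightly above 1, which is one reason non-asymptotic bounds on LMC track both the step size and the initializer.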

Structured Kernel Estimation for Photon-Limited Deconvolution

Abstract: Images taken in low light conditions have motion blur and photon shot noise. State-of-the-art image restoration networks are limited to well-illuminated scenes and don’t perform well with photon shot noise. A new blur estimation technique is proposed to address photon-limited conditions. The proposed method uses a gradient-based backpropagation method to estimate the blur kernel....

PaLM-E: An Embodied Multimodal Language Model

Abstract: Large language models are good at complex tasks. Enabling general inference in the real world is challenging. Embodied language models incorporate real-world continuous sensor modalities. Input to the model is multi-modal sentences. Model is trained end-to-end for multiple embodied tasks. Evaluations show model can address a variety of embodied reasoning tasks. Model benefits from diverse joint training. Largest model is a visual-language generalist with state-of-the-art performance.

Paper Content

Introduction: LLMs demonstrate strong reasoning capabilities across various domains. Limitation of LLMs is the issue of grounding. Previous work interfaces LLMs with learned robotic policies and affordance functions. LLMs are only provided with textual input, which is insufficient for many tasks. Current state-of-the-art visual-language models cannot directly solve robotic reasoning tasks. Proposed embodied language models incorporate continuous inputs from sensor modalities. Embodied language models are trained end-to-end to output sequential decisions. Evaluated in a variety of settings. Multi-task training improves performance. Transfer across tasks leads to high data-efficiency for robotics tasks. PaLM-E-562B achieves state-of-the-art performance on the OK-VQA benchmark. PaLM-E-562B exhibits a wide array of capabilities including zero-shot multimodal chain-of-thought reasoning.

Related work: Recent years have seen a growing interest in large vision-language models (VLMs). VLMs can be applied to tasks such as visual question answering, captioning, optical character recognition, and object detection. Different methods exist for integrating images into VLMs. Prior works focus on combining vision and language inputs in an embodied setting with the goal of direct action prediction. LLMs contain vast amounts of internalized knowledge about the world. LLMs have been employed in embodied domains to understand natural language goals and to represent planning. PaLM-E is trained to generate plans directly without relying on auxiliary models for grounding. PaLM-E investigates a generalist, multi-embodiment model, across multiple modalities.

PaLM-E: an embodied multimodal language model: PaLM-E injects continuous observations into the language embedding space of a pre-trained language model. PaLM-E is a decoder-only LLM that generates textual completions autoregressively. Inputs to PaLM-E consist of text and continuous observations. Output of PaLM-E is text generated autoregressively. Decoder-only LLMs predict the probability of a piece of text. Prefix-decoder-only LLMs can be conditioned on a prefix. Tokens are embedded into a word token embedding space. Continuous observations are injected into the LLM by mapping them into the language embedding space. PaLM-E can be used to generate text or decisions/plans. Low-level policies translate decisions into low-level actions. PaLM-E is integrated into a control loop to sequence and control low-level policies.

Input & scene representations for different sensor modalities: Incorporate different modalities into PaLM-E. Set up encoders to map modalities into language embedding space. Investigate state estimation vectors, Vision Transformers (ViTs) and Object Scene Representation Transformer (OSRT). Consider object-centric representations to factor observations into tokens. Investigate ViT token learner architecture. Label multi-modal tokens in input prompt to enable PaLM-E to reference objects.

Training recipes: PaLM-E is a decoder-only model that uses multi-modal sentences and text tokens to make predictions....
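
The summary says continuous observations are mapped into the language embedding space and interleaved with text tokens as a multi-modal sentence. The following is a minimal sketch of that idea only, not PaLM-E's actual encoder: the embedding width, the single linear projection, the random weights, and every function name are hypothetical, and each observation is reduced to one vector for brevity.

```python
import random

EMBED_DIM = 4   # hypothetical embedding width
SENSOR_DIM = 3  # hypothetical size of one continuous observation


def make_token_table(vocab, dim, rng):
    """Stand-in for a word token embedding table (random values)."""
    return {tok: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for tok in vocab}


def project_observation(observation, weights):
    """Linear map from sensor space into the embedding space."""
    return [sum(w * o for w, o in zip(row, observation)) for row in weights]


def build_multimodal_sentence(prompt_tokens, token_table, observations, weights):
    """Embed text tokens by lookup and placeholders by projecting their observation."""
    sequence = []
    for tok in prompt_tokens:
        if tok in observations:
            sequence.append(project_observation(observations[tok], weights))
        else:
            sequence.append(token_table[tok])
    return sequence


rng = random.Random(0)
prompt_tokens = ["what", "is", "in", "<obs_0>", "?"]
token_table = make_token_table(["what", "is", "in", "?"], EMBED_DIM, rng)
weights = [[rng.uniform(-1.0, 1.0) for _ in range(SENSOR_DIM)] for _ in range(EMBED_DIM)]
observations = {"<obs_0>": [0.2, 0.5, 0.1]}
sequence = build_multimodal_sentence(prompt_tokens, token_table, observations, weights)
```

The resulting sequence has one vector per prompt position, all of the same width, so a decoder-only model could consume it exactly as it would a purely textual prefix.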

Enhancing Activity Prediction Models in Drug Discovery with the Ability to Understand Human Language

Abstract: Activity and property prediction models are used in drug discovery and materials sciences. Currently, these models have to be trained or fine-tuned for new tasks. Scientific language models can be used for low-data tasks without training or fine-tuning. Predictive quality of language models is lacking. A novel type of activity prediction model is proposed that can adapt to new tasks at inference time....

Faithfulness-Aware Decoding Strategies for Abstractive Summarization

Abstract: Study of how decoding strategies affect faithfulness in abstractive summarization. Beam search with large beam sizes produces most faithful summaries, nucleus sampling produces least faithful. Two faithfulness-aware generation methods proposed to improve faithfulness. Distillation approach allows model to generate faithful summaries with greedy decoding.

Paper Content

Introduction: Recent developments in large pre-trained language models have achieved remarkable performance on abstractive summarization. Problem of hallucinations, where generated summary contains facts not present in original document. Prior research has analyzed and defined potential error types and typology. Effect of decoding strategies on faithfulness of abstractive summarization is less understood. Analysis of popular decoding strategies (greedy, beam, nucleus sampling) on two datasets. Beam search provides most faithful summaries. Randomness introduced by sampling hurts faithfulness. Two faithfulness-aware decoding methods proposed. Distillation approach to generate faithful summaries with greedy decoding.

Faithfulness behavior of popular decoding strategies: Investigated effect of decoding strategies on faithfulness. Investigated whether better exploration of search space can improve faithfulness. Investigated how randomness impacts faithfulness. Explored three common decoding strategies: greedy, beam search, and nucleus sampling.

Faithfulness-aware decoding strategies: Hypothesize that current decoding methods may not explore paths that focus on faithfulness effectively. Propose two faithfulness-aware methods to modify how the space is explored. Method 1: Ranking makes use of beam search and picks the most faithful path. Method 2: Lookahead guides the search process by adding faithfulness heuristics.

Ranking with faithfulness metrics: Beam search explores many suitable candidates during the decoding process. We propose to rerank the generated candidates according to faithfulness metrics Falke et al....
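
The ranking idea described above (take the candidates that beam search already produced and pick the most faithful one) can be sketched as follows. This is only an illustration: the token-overlap score is a crude stand-in chosen for this example, not one of the faithfulness metrics used in the paper, and the function names and sample texts are hypothetical.

```python
def token_overlap_faithfulness(summary, document):
    """Toy faithfulness proxy: fraction of summary tokens that occur in the document."""
    doc_tokens = set(document.lower().split())
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    supported = sum(1 for tok in summary_tokens if tok in doc_tokens)
    return supported / len(summary_tokens)


def rerank_by_faithfulness(candidates, document, metric=token_overlap_faithfulness):
    """Order beam candidates from most to least faithful under the given metric."""
    return sorted(candidates, key=lambda cand: metric(cand, document), reverse=True)


# Hypothetical source document and beam candidates.
document = "the company reported a profit of 3 million dollars in 2022"
candidates = [
    "the company reported a loss in 2021",
    "the company reported a profit in 2022",
]
ranked = rerank_by_faithfulness(candidates, document)
best = ranked[0]
```

Because the metric is passed in as a parameter, a stronger scorer can replace the toy overlap score without touching the reranking step.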

Convergence Rates for Non-Log-Concave Sampling and Log-Partition Estimation

Abstract: Sampling from Gibbs distributions and computing their log-partition function are important tasks in computer science. Algorithms for non-convex potentials suffer from the curse of dimensionality. Smooth functions allow faster convergence rates for optimization. It is possible to achieve similar rates for sampling and log-partition computation. Polynomial-time algorithms sometimes exhibit interesting behavior but no near-optimal rates....

OpenICL: An Open-Source Framework for In-context Learning

Abstract: In-context Learning (ICL) is a new paradigm for large language model (LLM) evaluation. ICL adapts pre-trained models to unseen tasks without parameter updates. OpenICL is an open-source toolkit for ICL and LLM evaluation. OpenICL is research-friendly and provides various state-of-the-art retrieval and inference methods. OpenICL has been validated on a wide range of NLP tasks.

Paper Content

Introduction: Rise of large language models (LLMs) has shown impressive emergent In-Context Learning (ICL) ability. ICL can perform inference with model parameters frozen. ICL yields comparable results to fine-tuned models in specific tasks. OpenICL is an easy-to-use and extensible ICL framework for zero-/few-shot evaluation of language models. OpenICL provides a wide range of ICL methods, LLMs, and tasks.

Related work: In-context Learning is a new paradigm that uses pre-trained language models to perform new tasks without any gradient-based training. Chain-of-thought is a method that surpasses the previous state-of-the-art methods on many reasoning tasks. ICL has been criticized for being sensitive to the choice and ordering of in-context examples. Prompt Learning is a special case of ICL without any in-context examples.

OpenICL: OpenICL has two design principles....
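
To make the retrieval-then-inference pattern concrete, here is a minimal sketch of how in-context examples might be retrieved and assembled into a prompt for a frozen model. It does not use or imitate the OpenICL API: the Jaccard word-overlap retriever, the prompt template, the function names, and the toy sentiment pool are all assumptions made for this illustration, and no model call is included.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def retrieve(query, pool, k):
    """Pick the k labelled examples whose input is most similar to the query."""
    return sorted(pool, key=lambda ex: jaccard(query, ex[0]), reverse=True)[:k]


def build_prompt(query, examples):
    """Concatenate the demonstrations, then the unanswered query."""
    blocks = [f"Input: {text}\nLabel: {label}" for text, label in examples]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


# Hypothetical pool of labelled demonstrations.
pool = [
    ("the movie was wonderful", "positive"),
    ("the food was terrible", "negative"),
    ("the movie was boring", "negative"),
]
query = "the movie was great"
prompt = build_prompt(query, retrieve(query, pool, k=2))
```

The prompt would then be passed to a language model whose parameters stay fixed. Since the retriever decides both which demonstrations appear and in what order, it is also where the sensitivity to example choice and ordering noted above comes in.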

Prismer: A Vision-Language Model with An Ensemble of Experts

Abstract: Recent vision-language models have shown impressive multi-modal generation capabilities. Prismer is a data- and parameter-efficient vision-language model that leverages an ensemble of domain experts. Prismer requires training of a small number of components, with the majority of network weights inherited from pre-trained domain experts. Prismer can efficiently pool expert knowledge and adapt it to various vision-language reasoning tasks....

Unleashing Text-to-Image Diffusion Models for Visual Perception

Abstract: Diffusion models are a type of generative model that can be used to create images from text. VPD is a new framework that uses a pre-trained text-to-image diffusion model for visual perception tasks. VPD uses an adapter to refine text features and cross-attention maps to provide guidance. VPD achieves good results on semantic segmentation, referring image segmentation and depth estimation....

Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners

Abstract: Visual recognition in low-data regimes requires deep neural networks to learn generalized representations from limited training samples. Recently, CLIP-based methods have shown promising few-shot performance. A Cascade of Foundation models (CaFo) is proposed to incorporate diverse prior knowledge of various pre-training paradigms for better few-shot learning. CaFo works by ‘Prompt, Generate, then Cache’ to blend the predictions from CLIP and DINO....