ALERT: Adapting Language Models to Reasoning Tasks

Abstract: Current large language models can perform well on complex tasks with few-shot learning. ALERT is a benchmark and suite of analyses to assess language models’ reasoning ability. ALERT covers 10 different reasoning skills and 20 datasets. Finetuning helps language models learn more reasoning skills. Finetuning can lead to overfitting and generalization problems.

Paper Content

Introduction: Large language models (LLMs) have shown increasing in-context learning capabilities with scaling up the model and data size. LLMs still struggle with tasks such as commonsense reasoning and math word problems. Recent work used different prompting methods to improve LLMs’ performance on tasks that require multiple steps of reasoning. ALERT is a new pipeline to benchmark different LLMs on various reasoning skills. ALERT covers 10 different reasoning skills including logical, causal, commonsense, abductive, spatial, analogical, argument and deductive reasoning. Experiments indicate that there is no strong correlation between high vocabulary overlap and performance gain on evaluation datasets. Finetuning helps to improve certain reasoning capabilities of LLMs but not all of them. Finetuning can cause overfitting towards prompt templates. Evaluating reasoning skills with ALERT provides new insights on how models have or have not succeeded in generalizing beyond their experience.

Motivation and our benchmark: ALERT is a computer science paper that focuses on measuring LLMs’ performance on tasks that require contextual understanding and multi-step operations....

December 16, 2022 · 780 words · Ping Yu, Tianlu Wang, Olga Golovneva, Badr Alkhamissy, Gargi Ghosh and 2 others
SADM: Sequence-Aware Diffusion Model for Longitudinal Medical Image Generation

Abstract: Human organs change due to short-term and long-term factors. Medical image generation tasks rely on single images, ignoring sequential dependency. Sequence-aware deep generative models are underexplored in medical imaging. Sequence-aware diffusion model (SADM) proposed to generate longitudinal medical images. SADM uses sequence-aware transformer to learn longitudinal dependency with missing data.

Paper Content

Introduction: Generative models used in medical domain due to hardware and data availability. Two tasks: generation of longitudinal brain image and multi-frame cardiac image. Generative adversarial networks and diffusion models used. Sequence-aware deep generative models to learn temporal dependency. Proposed sequence-aware diffusion model (SADM) to address issues of longitudinal medical data. SADM works with single image input, longitudinal data with missing frames, and high-dimensional images. State-of-the-art results in longitudinal image generation and missing data imputation.

Preliminary

Diffusion models: Diffusion models consist of a forward process and a reverse process....

December 16, 2022 · 608 words · Jee Seok Yoon, Chenghao Zhang, Heung-Il Suk, Jia Guo, Xiaoxiao Li
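The SADM excerpt above stops right after naming the forward and reverse processes. As a reminder of what those terms usually mean, here is the standard denoising-diffusion formulation. This is an assumption about notation: the symbols $\beta_t$, $\mu_\theta$, and $\Sigma_\theta$ are the conventional ones and may not match the paper's own.

```latex
% Forward process: Gaussian noise is added step by step with a fixed variance schedule \beta_t.
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)

% Reverse process: a learned Gaussian transition removes the noise step by step.
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)
```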
Economic impacts of AI-augmented R&D

Abstract: Deep learning has become the most important technique in Artificial Intelligence (AI). Deep learning has been used in areas such as protein folding, drug discovery, integrated chip design, and weather prediction. AI idea production is more capital-intensive than traditional R&D. AI-augmented R&D has the potential to speed up technological change and economic growth....

December 15, 2022 · 1977 words · Tamay Besiroglu, Nicholas Emery-Xu, Neil Thompson
Improving Chess Commentaries by Combining Language Models with Symbolic Reasoning Engines

Abstract: Language models lack grounding in the real world and struggle with complex reasoning. AI systems outperform humans in games like chess and Go. Chess commentary requires reasoning over a complex board state and providing analyses in natural language. Combining symbolic reasoning engines with controllable language models can generate chess commentaries. Experiments show that this approach generates commentaries preferred by human judges....

December 15, 2022 · 1083 words · Andrew Lee, David Wu, Emily Dinan, Mike Lewis
FiDO: Fusion-in-Decoder optimized for stronger performance and faster inference

Abstract: FiD is a powerful retrieval-augmented language model. Inference is expensive for FiD. Memory bandwidth constraints in the decoder cause most of the inference time. Two changes to the FiD architecture speed up inference by 7x. FiDO-Large-XXL achieves faster inference than FiD-Base and better performance than FiD-Large.

Paper Content

Introduction: Language model performance can be improved by augmenting with retrieved text. Fusion-in-Decoder (FiD) architecture stands out for strong performance. FiD is expensive and has high computational burden. Performance and computational cost are two sides of the coin. Encoder requires more Floating Point Operations (FLOPs) than decoder. Majority of inference time is spent in decoder. Memory bandwidth bottleneck is eliminated with proposed changes. FiDO outperforms vanilla and efficient FiD models on question-answering datasets.

Analysis: Retrieval-augmented models process many context tokens for each question or answer token....

December 15, 2022 · 864 words · Michiel de Jong, Yury Zemlyanskiy, Joshua Ainslie, Nicholas FitzGerald, Sumit Sanghai and 2 others
Efficient Long Sequence Modeling via State Space Augmented Transformer

Abstract: Transformer models have achieved superior performance in NLP tasks. Attention mechanism has quadratic computational cost, limiting its practicality for long sequences. Existing attention variants improve computational efficiency but lack ability to compute global information. State space models are tailored for long sequences but lack flexibility to capture complicated local information.

Paper Content

Introduction: Transformer models have achieved superior performance on various natural language processing tasks. They leverage the attention mechanism, which has quadratic time and space complexity. Complexity is prohibitive for tasks with long sequences. Transformer models with full attention are easy to overfit. Various approaches have been proposed to reduce complexity and introduce structural biases. Approximation methods approximate full attention with linear complexity. Partial attention methods reduce complexity and introduce structural biases. State space models introduce a different structural bias. Proposed SPADE model is a multi-layer Transformer model that captures global and local information. SPADE outperforms existing approaches on Long Range Arena benchmark. SPADE is faster and yields better performance in autoregressive language modeling. SPADE is scalable and outperforms baselines on natural language understanding and natural language generation benchmarks. Code and pre-trained model checkpoints are publicly available.

Background

Attention mechanism: Attention mechanism takes input X and outputs alignment between any pair of input tokens....

December 15, 2022 · 967 words · Simiao Zuo, Xiaodong Liu, Jian Jiao, Denis Charles, Eren Manavoglu and 2 others
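The SPADE entry above notes that full attention scores every pair of input tokens, which is where the quadratic cost comes from. A minimal sketch of that pairwise computation follows. It assumes plain scaled dot-product attention with no learned query/key projections, and the function name `attention_weights` is hypothetical, not taken from the paper or its code.

```python
import math

def attention_weights(X):
    """Return the n x n matrix of attention weights for n token vectors.

    X is a list of n vectors (lists of d floats). Queries and keys are the
    raw inputs here (an illustrative simplification: no learned projections).
    """
    d = len(X[0])
    weights = []
    for q in X:  # one row per query token
        # One score per key token, so the full matrix has n * n entries.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

Doubling the sequence length quadruples the number of entries in the returned matrix, which is the cost that linear-complexity approximations and state space models try to avoid.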
On Second Thought, Let's Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning

Abstract: Generating a chain of thought can improve LLM performance. Zero-shot CoT evaluations have been done mainly on logical tasks. This paper evaluates zero-shot CoT on two sensitive domains. Using zero-shot CoT can increase the likelihood of undesirable output. Zero-shot CoT should be avoided on tasks with marginalized groups or harmful topics.

Paper Content

Introduction: LLMs improve performance on a range of tasks. Popular approach to implementing CoT involves zero-shot generation. Zero-shot CoT produces undesirable biases and toxicity. Models can sabotage performance when requiring social knowledge. Zero-shot CoT increases model bias and generation toxicity. Zero-shot CoT increases stereotypical reasoning and encourages toxic behaviour.

Related work: LLMs can use intermediate reasoning steps to improve performance on tasks like arithmetic, metaphor generation, and commonsense/symbolic reasoning. Adding “Let’s think step by step” to a prompt can improve zero-shot performance on reasoning benchmarks. Other prompting methods have also yielded performance increases. LLMs are sensitive to prompting perturbations. LLMs are prone to generating unreliable explanations. Instruct-tuned and value-aligned LLMs aim to increase reliability and robustness. NLP models exhibit a wide range of social and cultural biases. LLMs also exhibit a range of biases and risks.

Stereotype & toxicity benchmarks: Leveraged 3 widely used stereotype benchmark datasets: CrowS Pairs, StereoSet, and BBQ. Bootstrapped a small set of explicitly harmful questions (HarmfulQ). Converted each dataset into a zero-shot reasoning task. Evaluated out-of-the-box performance in a zero-shot setting.

Stereotype benchmarks: CrowS Pairs is a dataset of 1508 sentences covering 9 stereotype dimensions. StereoSet is a dataset of 17K instances of stereotypical bias annotated by crowd workers. BBQ is a dataset of 50K questions targeting 11 stereotype categories. All datasets are used to evaluate model bias.

Toxicity benchmark: Evaluate how models handle open-ended toxic requests. Created a benchmark of 200 explicitly toxic questions. Prompted text-davinci-002 to generate harmful questions. Manually removed repetitive questions with high text overlap. Prompted LLM to generate questions across 6 adjectives: racist, stereotypical, sexist, illegal, toxic, and harmful. Seeded prompt with three few-shot examples.

Methods: Evaluating problematic outputs in a prompt-based setting. Outlining prompt construction for each benchmark. Discussing reasoning strategies.

Framing benchmarks as prompting tasks: BBQ, HarmfulQ, CrowS Pairs, and StereoSet are framed as QA tasks. For CrowS Pairs and StereoSet, models are prompted to select the more accurate sentence between the stereotypical and anti-stereotypical setting. For stereotype datasets, target stereotype and anti-stereotype examples are included as options, with an “Unknown” option as the correct answer. Synonyms for “Unknown” are randomly selected for each question to account for potential preference for a specific lexical item. Positional bias is reduced by randomly shuffling the type of answer associated with each of the options.

Scoring bias and toxicity: Evaluate biases in model completions using accuracy. Models should not rely on stereotypes or antistereotypes. Evaluate models by percent of pattern-matched unknown selections. Manually label model outputs as encouraging or discouraging. Calculate percent of model generations that encourage harmful behaviour. Compute % point differences between CoT and Standard Prompting.

Models: Evaluated best performing GPT-3 model from zero-shot CoT work. Standard parameters provided by OpenAI’s API. Generated 5 completions for both Standard and CoT Prompt settings. Evaluations ran between Oct 28th and Dec 14th, 2022. Analyzed instruction-tuned davinci models in §5....

December 15, 2022 · 959 words · Omar Shaikh, Hongxin Zhang, William Held, Michael Bernstein, Diyi Yang
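The zero-shot CoT entry above describes framing each stereotype item as a multiple-choice question with a randomly worded "Unknown" option, shuffled option order, and a percentage-point comparison between CoT and standard prompting. A rough sketch of that setup follows. The synonym list, prompt layout, and all function names are assumptions for illustration, not the paper's actual templates.

```python
import random

# Hypothetical wording; the paper's actual synonym list and template may differ.
UNKNOWN_SYNONYMS = ["Unknown", "Cannot be determined", "Not enough information"]
COT_TRIGGER = "Let's think step by step."

def build_prompt(question, stereotype, anti_stereotype, rng, use_cot=False):
    """Return (prompt, letter of the 'Unknown' option) for one QA item."""
    unknown = rng.choice(UNKNOWN_SYNONYMS)   # vary the lexical item per question
    options = [stereotype, anti_stereotype, unknown]
    rng.shuffle(options)                     # reduce positional bias
    letters = "ABC"
    lines = [f"Q: {question}", "Options:"]
    lines += [f"({letter}) {opt}" for letter, opt in zip(letters, options)]
    lines.append(("A: " + COT_TRIGGER) if use_cot else "A:")
    return "\n".join(lines), letters[options.index(unknown)]

def percent_unknown(selected, correct):
    """Percent of items where the model picked the 'Unknown' option."""
    hits = sum(s == c for s, c in zip(selected, correct))
    return 100.0 * hits / len(correct)

def point_difference(cot_pct, standard_pct):
    """Percentage-point change from standard prompting to CoT."""
    return cot_pct - standard_pct
```

A negative `point_difference` would mean that CoT prompting selected the "Unknown" answer less often than standard prompting did on the same items.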
UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units

Abstract: Direct speech-to-speech translation (S2ST) is advantageous for fast inference with a simplified pipeline. UnitY is a novel two-pass direct S2ST architecture. UnitY is enhanced by subword prediction, advanced two-pass decoder architecture design and search strategy, and better training regularization. UnitY is pre-trained with a self-supervised denoising auto-encoding task. UnitY outperforms a single-pass speech-to-unit translation model....

December 15, 2022 · 1632 words · Hirofumi Inaguma, Sravya Popuri, Ilia Kulikov, Peng-Jen Chen, Changhan Wang and 5 others
DAMP: Doubly Aligned Multilingual Parser for Task-Oriented Dialogue

Abstract: Virtual assistants use semantic parsing engines to convert user utterances to commands. Semantic parsing is a difficult multilingual transfer task with low transfer efficiency. Switching between languages is prevalent for bilingual users. This work improves the zero-shot performance of a multilingual and codeswitched semantic parsing system. Two stages of multilingual alignment are used to improve performance. Results show improved English performance and transfer efficiency.

Paper Content

Introduction

Related work: MMTs are effective for multilingual and intra-sentential codeswitching. Prior work has studied explicit alignment of individual embeddings. MMTs implicitly perform alignment within their hidden states. Gap between performance on training language and zero-shot targets is larger in task-oriented parsing benchmarks. Adversarial training can be used for cross-lingual transfer. Regularization can be used to maintain multilingual knowledge.

Methods: Utilize two stages of alignment to improve zero-shot transfer in DAMP. Use contrastive learning during pretraining to improve alignment. Use domain adversarial training and constrained optimization during finetuning for double alignment. Apply improvements to pointer-generator network to produce a parse.

Baseline architecture: Used pointer-generator network to generate semantic parses. Tokenized words into sub-words and retrieved hidden states from encoder. Used randomly initialized auto-regressive decoder to produce representations. Used perceptron to predict over vocabulary of intents and slot types. Used copy logit vector for arguments from original query. Applied softmax to concatenation of logits and optimized negative log-likelihood of correct prediction. Copy mechanism essential to generating parses with multilingual tokens.

Alignment pretraining: AMBER is a contrastive pretraining process for semantic parsing. AMBER combines 3 explicit alignment objectives: translation language modeling, sentence alignment, and word alignment using attention symmetry. Translation language modeling uses parallel sentences as input and masking tokens in each language. Sentence alignment directly optimizes similarity of representations across languages using a siamese network training process. Word level alignment is optimized with an attention symmetry loss.

Cross-lingual adversarial alignment: Use token-level language discriminator to get aligned representations at the word level. Binary scheme treats all languages not found in the training data as a single class. Introduce a general constrained optimization approach for adversarial training. Train a discriminator to distinguish between in-domain training data and unlabeled out-of-domain data. Data is sampled evenly from all languages to create an adversarial dataset. Two-layer perceptron predicts probability that a token is English or Non-English. Discriminator loss is binary cross-entropy loss. Multi-class classification across all languages uses negative log-likelihood of the correct class as the loss function. Optimize task loss while enforcing a constraint derived from first-principles. Treat λ as a learnable parameter and optimize it to maximize the value of λ( − L_d).

Experiments: Evaluated effects of techniques on 3 benchmarks for task-oriented semantic parsing. 2 datasets evaluate robustness to intra-sentential codeswitching. 1 dataset uses multilingual data to evaluate robustness to inter-sentential codeswitching. Examples divided into training, evaluation, and test data at 70/10/20 ratio.

Datasets: MTOP benchmark evaluates multilingual transfer for a difficult compositional parse structure. Benchmark contains queries in 6 languages. CST5 benchmark evaluates Hindi-English intra-sentential codeswitching. Spanish-English codeswitching dataset constructed using Google Translate. Dataset has 5,803 queries in both English and Spanish-English.

Results: Uses same hyperparameter configurations for all settings. Encoder uses mBERT architecture. Decoder is randomly initialized 4-layer, 8-head vanilla transformer. AdamW used to optimize for 1....

December 15, 2022 · 954 words · William Held, Christopher Hidey, Fei Liu, Eric Zhu, Rahul Goel and 2 others
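The DAMP entry above mentions two losses: a softmax over the concatenation of vocabulary logits and copy logits trained with negative log-likelihood, and a binary cross-entropy loss for the English versus Non-English token discriminator. A small sketch of both follows, assuming plain Python lists of floats. The function names and argument layout are hypothetical and are not taken from the paper's code.

```python
import math

def pointer_generator_nll(vocab_logits, copy_logits, target_index):
    """Negative log-likelihood under a softmax over [vocab_logits + copy_logits].

    target_index indexes the concatenated vector: positions below
    len(vocab_logits) are intents/slot types, the rest point at query tokens.
    """
    logits = list(vocab_logits) + list(copy_logits)
    m = max(logits)  # log-sum-exp with max subtraction for stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]

def binary_cross_entropy(p_english, is_english):
    """Discriminator loss for one token, given the predicted P(English)."""
    return -math.log(p_english) if is_english else -math.log(1.0 - p_english)
```

Because generation and copying share one softmax, raising the probability of copying a query token lowers the probability of every vocabulary item, and vice versa.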
Objaverse: A Universe of Annotated 3D Objects

Abstract: Massive data corpora have enabled progress in AI. 3D data is a notable omission in large-scale datasets. Objaverse 1.0 is a large dataset of 3D models with descriptive captions, tags, and animations. Objaverse has potential applications in training generative 3D models, improving tail category segmentation, training open-vocabulary object-navigation models, and creating a new benchmark for robustness analysis of vision models.

Paper Content

Introduction: Massive datasets have enabled and driven rapid progress in AI. Language corpora on the web led to large language models. Paired image and text datasets led to vision-and-language pretrained models. YouTube video datasets led to video capable models. Massive multimodal datasets led to models like CLIP and StableDiffusion. Datasets moved from manually curated to harnessing the power of the web. Datasets used to train deep learning models in other areas of research are not comparable. 3D assets used in training generative 3D models are maximally on the order of thousands. OBJAVERSE 1....

December 15, 2022 · 901 words · Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel and 5 others