arxiv-summary: AI-summarized AI papers

Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
Generative models have been used to solve extractive tasks. Tokenization inconsistency is commonly neglected in training these models. This issue can lead to performance drop and hallucination. A simple fix is proposed and a case study is conducted on extractive QA. With consistent tokenization, the model performs better and converges faster.

Paper Content

Introduction
Pretrained seq2seq models have achieved success in a range of tasks. Tokenization is an important component of the models. Tokenization inconsistency can affect the performance of seq2seq models. Extractive QA is used as a case study to identify when tokenization inconsistency happens. An approach is proposed to mitigate the issue of tokenization inconsistency.

Related work
Byte-Pair-Encoding (BPE) and language-model-based segmentation are commonly used tokenizers for NLP models. Poor model generalization can be caused by these tokenization approaches. BPE-Dropout and Dynamic Programming Encoding (DPE) have been proposed to improve model robustness and generalization. Domain-specific approaches have been proposed to improve segmentation on specific texts. Character- and byte-level modeling is another line of research that is free of tokens.

Consistent tokenization: what it is and how to achieve it
Seq2seq tasks take text as input and output a sequence of tokens....
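The inconsistency summarized above can be illustrated with a toy example. This is a minimal sketch, not the paper's code: `toy_tokenize` is a hypothetical whitespace-sensitive tokenizer (one token per word, with `_` marking a preceding space) that stands in for subword tokenizers whose output depends on whether a space precedes a word. `consistent_target` is likewise an illustrative helper; it tokenizes the answer the way it appears inside the context.

```python
def toy_tokenize(text):
    """Hypothetical tokenizer: one token per word; '_' marks a preceding space."""
    tokens = []
    for i, word in enumerate(text.split(" ")):
        if word:  # skip empty strings produced by leading or repeated spaces
            tokens.append(word if i == 0 else "_" + word)
    return tokens


def contains(haystack, needle):
    """True if `needle` occurs as a contiguous run of tokens in `haystack`."""
    n = len(needle)
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def consistent_target(context, answer):
    """Tokenize the answer the way it is tokenized inside the context (illustrative)."""
    start = context.find(answer)
    if start == -1:
        raise ValueError("answer is not a span of the context")
    preceded_by_space = start > 0 and context[start - 1] == " "
    return toy_tokenize((" " if preceded_by_space else "") + answer)


context = "The capital of France is Paris"
answer = "Paris"

context_tokens = toy_tokenize(context)
naive_target = toy_tokenize(answer)                 # answer tokenized in isolation
fixed_target = consistent_target(context, answer)   # answer tokenized as in the context

print(contains(context_tokens, naive_target))  # False: target tokens never appear in the input
print(contains(context_tokens, fixed_target))  # True: target tokens can be copied from the input
```

With the naive target, the model is trained to emit a token that never occurs in its input, so the task is no longer purely extractive at the token level. With the consistent target, the output is a contiguous run of input tokens.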

Dexterous Manipulation from Images: Autonomous Real-World RL via Substep Guidance

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
Reinforcement learning can enable robots to learn complex manipulation tasks. System provides a “programming-free” approach for users to define new tasks. System includes a framework for users to define tasks with image examples. Reinforcement learning procedure learns the task autonomously without interventions. Experimental results with a four-finger robotic hand learning tasks in the real world.

Paper Content
I....

Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
Studies offline policy learning, which aims to learn an optimal decision rule for a given population. Existing methods rely on a uniform overlap assumption, which can be unrealistic in many situations. Proposed algorithm optimizes lower confidence bounds of policy values. Data-dependent upper bound for suboptimality of algorithm only depends on overlap for optimal policy and complexity of policy class. New self-normalized type concentration inequality developed for inverse-propensity-weighting estimators.

Paper Content

Introduction
Policy learning aims to create an optimal decision rule for a given population. Policy learning is used in healthcare, advertising, and recommendation systems. Data is collected by executing a fixed and potentially suboptimal policy. Data is also collected by running an adaptive learning algorithm. Policy learning involves two components: policy evaluation and policy optimization. Good policy learning performance relies on accurate evaluation of all policies. Uniform overlap assumption is often violated in practice. Fundamental challenge of policy learning is counterfactual reasoning. Estimation of policy value can have a huge variance if evaluated at out-of-distribution actions. Quality of assessment of policy value varies across policies. Need to quantify statistical confidence in policy value and incorporate it into the learning process.

A pessimism-based framework
Proposed novel solution to offline policy learning with two ingredients: pessimism principle and generalized empirical Bernstein inequality. Algorithm optimizes lower confidence bounds (LCBs) instead of point estimates of policy values. Generic data-dependent upper bound for performance of algorithm. Complexity measure of policy class and estimation error of optimal policy in upper bound. Oracle property of approach: adapts to overlap of optimal policy in data. Developed uniform concentration inequality for offline policy evaluation. Standard O(1/√T) rate for learning performance when e(X, π*(X)) is lower bounded by a constant. Polynomial learning rate when propensity for optimal policy is decaying at a polynomial rate.

Background
X and Y denote the space of contexts and rewards respectively. A contextual bandit model C is specified by an action set A, a context distribution PX and a set of reward distributions. Policy learning is done from an offline dataset collected from C.

Data collecting process
Context Xt is drawn from PX at each time t. Action At is taken according to behavior policy. Reward Yt is µ(Xt, At) + εt. Behavior policy is potentially adaptive. At is treatment option taken for Xt. Yt(a) follows population joint distribution. Conditional independence condition: Yt(a) ⊥ At | Xt, Ht. Bounded reward: Yt ∈ [0, 1]. Known behavior policy: et(x, a | Ht) for all (x, a) ∈ X × A are known to learner.

Policy learning and performance metric
Π is a collection of policies, each of which is a mapping from X to A. The average policy value of a policy π in Π is defined using the contextual bandit model C. The optimal policy π* maximizes the policy value. The goal is to learn a policy π and measure the learning performance by its suboptimality.

Policy class complexity
Natarajan dimension is used to quantify the complexity of a policy class. Natarajan dimension is a generalization of the VC dimension. Finite upper bounds for the Natarajan dimension have been established for many policy classes.

Related work
Policy learning from a fixed class is an area at the intersection of economics, causal inference and statistical machine learning....
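The pessimism principle summarized above (optimize a lower confidence bound rather than a point estimate) can be sketched on a toy logged dataset. This is a minimal illustration under stated assumptions, not the paper's algorithm: the penalty used in `lower_confidence_bound` is a plain empirical standard error scaled by a hypothetical constant `beta`, standing in for the generalized empirical Bernstein bound, and the dataset, policy names, and function names are all made up for the example. Each logged record is `(context, action, reward, propensity)` with a known behavior-policy propensity and a reward in [0, 1].

```python
import math


def ipw_terms(data, policy):
    """Per-sample inverse-propensity-weighted rewards for `policy`."""
    return [y / e if a == policy(x) else 0.0 for (x, a, y, e) in data]


def ipw_value(data, policy):
    """Point estimate of the policy value (IPW estimator)."""
    terms = ipw_terms(data, policy)
    return sum(terms) / len(terms)


def lower_confidence_bound(data, policy, beta=1.0):
    """IPW estimate minus beta times its empirical standard error.

    Assumption: this penalty is an illustrative stand-in, not the
    paper's generalized empirical Bernstein inequality.
    """
    terms = ipw_terms(data, policy)
    n = len(terms)
    mean = sum(terms) / n
    var = sum((t - mean) ** 2 for t in terms) / n
    return mean - beta * math.sqrt(var / n)


# Hypothetical logged data: action 0 is well covered (propensity 0.95),
# action 1 was taken once with propensity 0.05 and happened to pay off.
logged = [(None, 0, 0.8, 0.95)] * 19 + [(None, 1, 1.0, 0.05)]

policies = {
    "always_0": lambda x: 0,
    "always_1": lambda x: 1,
}

greedy = max(policies, key=lambda name: ipw_value(logged, policies[name]))
pessimistic = max(
    policies, key=lambda name: lower_confidence_bound(logged, policies[name])
)

print(greedy)       # policy with the highest point estimate
print(pessimistic)  # policy with the highest lower confidence bound
```

In this toy dataset the point estimate favors the poorly covered action because a single large IPW term inflates it, while the lower confidence bound favors the action whose value is estimated with low variance.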

Inducing Character-level Structure in Subword-based Language Models with Type-level Interchange Intervention Training

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
Character-level tasks are challenging for models based on subword tokenization. The method of Geiger et al. (2021) is adapted to operate on type-level variables over characters. Character-level tasks are tested that vary in their dependence on meaning and sequence-level context. Simple character-level tokenization approaches perform best on purely form-based tasks. Our method is superior for more complex tasks that blend form, meaning, and context....

Training Trajectories of Language Models Across Scales

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
Scaling up language models has led to performance gains. Little is understood about how training dynamics change with model size. Analyzed intermediate training checkpoints of differently sized models. At a given perplexity, a similar subset of training tokens sees the most significant reduction in loss. Early in training, all models learn to reduce the perplexity of grammatical sequences that contain hallucinations. Perplexity is a strong predictor of in-context learning performance.

Paper Content

Introduction
Scaling up language models improves language modeling perplexity and zero- or few-shot end task accuracies. Little is understood about why or how this happens. Study training trajectories of differently-sized OPT models. Analyze three aspects of model performance: next-token prediction, sequence-level generation, and downstream task performance. Find that language modeling perplexity correlates well with few-shot in-context learning performance.

Experimental settings
OPT models used in experiments. Validation perplexity used to measure autoregressive language modeling. Validation set covers a wide range of domains. Trajectory of validation perplexity follows a power-law pattern.

Next-token prediction
Autoregressive language models are used to predict the next token given a context....
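The power-law pattern mentioned above can be checked with a straight-line fit in log-log space. The sketch below is illustrative only: the independent variable (called `steps` here) and the functional form `ppl = a * steps ** (-b)` are assumptions, and the data points are synthetic values generated from that formula, not measurements from the paper.

```python
import math


def fit_power_law(steps, ppl):
    """Least-squares fit of ppl = a * steps ** (-b) via log-log linear regression."""
    xs = [math.log(s) for s in steps]
    ys = [math.log(p) for p in ppl]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


# Synthetic trajectory generated from a known power law (a=200, b=0.25).
steps = [1000, 2000, 4000, 8000, 16000]
ppl = [200.0 * s ** -0.25 for s in steps]

a, b = fit_power_law(steps, ppl)
print(round(a, 3), round(b, 3))
```

On real checkpoints the fit would not be exact; the residuals in log space give a quick sense of how well a power law describes the trajectory.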

Scalable Diffusion Models with Transformers

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
Explored a new class of diffusion models based on the transformer architecture. Replaced U-Net backbone with a transformer that operates on latent patches. Analyzed scalability of Diffusion Transformers (DiTs) through Gflops. Higher Gflops leads to lower FID. DiT-XL/2 models outperform prior diffusion models on ImageNet 512x512 and 256x256 benchmarks.

Paper Content

Introduction
Machine learning is being powered by transformers. Transformers are used in natural language processing, vision, and other domains. Image-level generative models have not adopted transformers as much. U-Net architecture is the de-facto choice for generative models. This paper aims to replace U-Net with transformers. Diffusion Transformers (DiTs) are based on Vision Transformers (ViTs). DiTs are scalable architectures for diffusion models. DiTs can achieve state-of-the-art results on ImageNet generation benchmark.

Related work
Transformers have replaced domain-specific architectures in language, vision, reinforcement learning and meta-learning....
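The idea of a transformer operating on latent patches can be made concrete with a small patchify routine. This is a sketch in plain Python, not the DiT implementation: the function name `patchify`, the channel-first nested-list layout, and the example latent shape (4 channels, 32x32) are assumptions chosen for illustration. It shows how the patch size controls the sequence length the transformer sees, which is one way compute (Gflops) changes for a fixed input.

```python
def patchify(latent, p):
    """Split a latent of shape [C][H][W] into non-overlapping p x p patches.

    Returns a list of (H/p)*(W/p) tokens, each a flat list of length C*p*p.
    """
    c = len(latent)
    h = len(latent[0])
    w = len(latent[0][0])
    if h % p != 0 or w % p != 0:
        raise ValueError("height and width must be divisible by the patch size")
    tokens = []
    for i in range(0, h, p):          # top row of the patch
        for j in range(0, w, p):      # left column of the patch
            tokens.append(
                [latent[ch][i + di][j + dj]
                 for ch in range(c)
                 for di in range(p)
                 for dj in range(p)]
            )
    return tokens


# Assumed example shape: a 4-channel 32x32 latent filled with zeros.
latent = [[[0.0] * 32 for _ in range(32)] for _ in range(4)]

for p in (8, 4, 2):
    tokens = patchify(latent, p)
    print(p, len(tokens), len(tokens[0]))  # patch size, sequence length, token dimension
```

Halving the patch size quadruples the number of tokens, so smaller patches mean a longer sequence and more compute per forward pass.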

Evaluating Human-Language Model Interaction

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
Real-world applications of language models involve human-LM interaction. HALIE is a new framework to evaluate human-LM interaction. HALIE captures interactive process, subjective experience, and preference. Five tasks are designed to capture different forms of interaction. Better non-interactive performance does not always result in better human-LM interaction.

Paper Content

Introduction
Language models have advanced and can be used for a wide range of tasks. Evaluation of language models is currently non-interactive. Most benchmarks focus on non-interactive evaluation. HALIE framework expands on non-interactive evaluation by considering the full process, first-person experience, and preference beyond quality. LMs are already used interactively to brainstorm, paraphrase, reformulate, autocomplete, and write code. Goal is to augment human capabilities rather than automate them.

Social dialogue
Dialogue is a popular mode of interaction for language models. We evaluate human-LM interaction in the context of open-ended dialogue about social situations. Task: given a social scenario, users converse with a system until they choose to finish. System logic: user input, possible actions, updating dialogue history. User study: 189 crowd workers, 10 scenarios, survey questions. Results: instruction tuning improves performance on most quality metrics, but not specificity. Users may prefer to interact with a more specific LM.

Question answering
Question answering is a task in NLP. Users can query a system multiple times to answer a question. System consists of multiple-choice question, user input, and system output. 342 crowd workers recruited to answer questions with and without assistance from an LM. Users with LM assistance generally outperformed an LM alone. Count number of queries needed to answer each question as a proxy measurement for efficiency. TextDavinci achieved highest accuracy while requiring least effort. TextBabbage performed better than Davinci on most metrics. Instruction tuned models were perceived most favorably in survey evaluation.

Crossword puzzles
Crossword puzzles have been studied as a challenging task for AI systems. Solving a crossword puzzle is a generative task requiring open-ended responses to clues. Crossword puzzle task provides additional structure, whereby a user can check whether a candidate answer satisfies the lexical constraints of the puzzle. Clues are often not straightforward and a user might need to reformulate the query to find the desired information. System logic includes a state of a crossword puzzle, selected clue, user letters entered in the puzzle, dialogue history, and user input. User study recruited 350 workers on Amazon Mechanical Turk, split across four language models and five puzzles. Survey questions asked users to rate different qualities of the AI assistant on a 5-point Likert scale. Results show that users significantly preferred TextDavinci over other models with respect to helpfulness. Misinformation was particularly pernicious using TextBabbage. Short prompts exacerbate misinformation and toxicity. Users demonstrate diverse engagement behavior.

Text summarization
Text summarization is a long-standing problem in NLP. We focus on human-LM interaction for single-document summarization. Previous human-edited summaries are provided to the system as examples to improve future summaries. Task is to edit model-generated summary to be consistent, relevant, and coherent. 964 documents randomly selected from XSum dataset. 39 crowd workers recruited on Amazon Mechanical Turk. Summary-level questions ask about the consistency, relevance, and coherence of the original and edited summaries. Session-level questions evaluate users’ overall perceptions of the summarization system. 100 documents randomly sampled and assessed by 3 different evaluators.

Metaphor generation
Metaphors are used to communicate complex or abstract ideas. Creating metaphors requires divergent, lateral thinking. Prior work designed metaphor generation tools to help with ideation. Task is to write metaphorical sentences that evoke a given metaphor. System logic consists of seed metaphor, user sentence history, user input, and system suggestions. 32 workers recruited on Amazon Mechanical Turk to come up with metaphorical sentences using the system. 10 minutes given to each user to come up with as many sentences as possible. Evaluation criteria from Gero and Chilton (2019) used.

Framework
Introduce HALIE framework for evaluating human-LM interaction. Describe tasks and system construction for studying human-LM interaction. Use interaction traces to represent interaction process. Propose dimensions and metrics for evaluating human-LM interaction.

Solving tasks interactively
Studying human-LM interaction in the context of tasks. Five tasks studied: social dialogue, question answering, crossword puzzles, text summarization, and metaphor generation. Tasks span from goal-oriented to open-ended. High coverage on common usages of LMs reported by Ouyang et al....

DSI++: Updating Transformer Memory with New Documents

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
DSIs encode documents in a model and use the same model to map queries to relevant documents. Reindexing the corpus is computationally expensive. DSI++ is a continual learning challenge to incrementally index new documents while being able to answer queries related to both previously and newly indexed documents. Model experiences forgetting events during training, leading to unstable learning....

One Embedder, Any Task: Instruction-Finetuned Text Embeddings

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
Introduce INSTRUCTOR, a new method for computing text embeddings given task instructions. INSTRUCTOR is a single embedder that can generate text embeddings tailored to different downstream tasks and domains. Annotate instructions for 330 diverse tasks and train INSTRUCTOR on this multitask mixture with a contrastive loss. Evaluate INSTRUCTOR on 70 embedding evaluation tasks, ranging from classification to text generation. INSTRUCTOR achieves state-of-the-art performance, with an average improvement of 3....
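For readers unfamiliar with the contrastive loss mentioned above, here is a generic sketch. The excerpt does not describe INSTRUCTOR's exact objective, so everything specific here is an assumption: the function names, the use of cosine similarity, the `temperature` value, and the hand-written 2-dimensional vectors that stand in for embeddings of instruction-plus-text inputs. The loss shown is a common softmax-based form in which the query should score higher against its positive than against the negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("zero-norm vector has no direction")
    return dot / (norm_u * norm_v)


def contrastive_loss(query, positive, negatives, temperature=0.1):
    """Negative log-probability of the positive under a softmax over similarities.

    Assumption: generic formulation for illustration, not INSTRUCTOR's exact loss.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    top = max(logits)  # subtract the max for numerical stability
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[0]


# Hypothetical embeddings: the query is aligned with one candidate and orthogonal to the other.
query = [1.0, 0.0]
aligned = [1.0, 0.0]
orthogonal = [0.0, 1.0]

good = contrastive_loss(query, aligned, [orthogonal])   # positive is the aligned vector
bad = contrastive_loss(query, orthogonal, [aligned])    # positive is the orthogonal vector

print(good < bad)  # True
```

Minimizing this quantity pulls the query embedding toward its positive and away from the negatives; a lower temperature sharpens the contrast between them.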

Speaking Style Conversion With Discrete Self-Supervised Units

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
Voice Conversion (VC) is the task of making a spoken utterance by one speaker sound like it was uttered by a different speaker. Current VC methods focus on spectral features like timbre, but ignore speaking style. This study introduces a method for converting not only timbre, but also prosodic information. The proposed approach is based on a pretrained, self-supervised model....