arxiv-summary: AI-summarized AI papers

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

Abstract: We studied the design decisions of publicly available instruction tuning methods. We found that task balancing and enrichment techniques are important for effective instruction tuning. We showed that Flan-T5 requires less finetuning than T5 to converge higher and faster on single downstream tasks. We made the Flan 2022 collection of datasets, templates, and methods publicly available....

Learning Data Representations with Joint Diffusion Models

Abstract: The paper introduces a joint diffusion model that learns meaningful internal representations for both generative and predictive tasks. Joint machine learning models often offer uneven performance or are unstable to train. Contemporary deep diffusion-based generative models can be used in both generative and predictive settings. Extending the vanilla diffusion model with a classifier allows for stable joint training with shared parametrization....

Robust Linear Regression: Gradient-descent, Early-stopping, and Beyond

Abstract: Early-stopping strategies for gradient-descent (GD) methods in linear regression are studied for robustness to adversarial attacks. Early-stopped GD is optimally robust against Euclidean-norm adversarial attacks. In the case of classification, GD can converge to non-robust models. A GD scheme on a transformation of the data adapted to the attack is proposed to handle any Mahalanobis attack....
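To make the idea of early stopping concrete, here is a minimal sketch of plain gradient descent on a least-squares objective, started from zero and halted after a fixed number of steps. It is not the paper's analysis or its proposed scheme: the toy data, the step size, and the function names (gd_linear_regression, l2_norm) are assumptions made for illustration. The sketch only shows that stopping earlier yields a smaller weight norm, and by the Cauchy-Schwarz inequality a smaller Euclidean norm bounds how much a Euclidean-bounded input perturbation can change the prediction.

```python
import math


def gd_linear_regression(X, y, step, num_steps):
    """Run num_steps of gradient descent on mean squared error, starting at w = 0."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(num_steps):
        # Residuals of the current linear predictor on every sample.
        residuals = [
            sum(wj * xj for wj, xj in zip(w, row)) - yi
            for row, yi in zip(X, y)
        ]
        # Gradient of (1 / 2n) * sum of squared residuals.
        grad = [
            sum(r * row[j] for r, row in zip(residuals, X)) / n
            for j in range(d)
        ]
        w = [wj - step * gj for wj, gj in zip(w, grad)]
    return w


def l2_norm(w):
    return math.sqrt(sum(v * v for v in w))


# Toy data (illustrative only); the exact least-squares solution is w = [1, 2].
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]

w_early = gd_linear_regression(X, y, step=0.5, num_steps=5)
w_late = gd_linear_regression(X, y, step=0.5, num_steps=500)
print(l2_norm(w_early), l2_norm(w_late))
```

In this sketch the number of steps plays the role of a regularization knob: the early iterate fits the data less closely but has a smaller norm than the nearly converged one.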

Scaling laws for single-agent reinforcement learning

Abstract: Cross-entropy loss improves with model size and training compute following a power law. Intrinsic performance is a monotonic function of the return, defined as the minimum compute required to achieve the given return. Intrinsic performance scales as a power law in model size and environment interactions. Optimal model size scales as a power law in the training compute budget. Varying the “horizon length” of the task mostly changes the coefficient but not the exponent of this relationship.

Paper Content

Introduction: Recent studies have found relationships between neural network performance and model size/training compute to be governed by smooth power laws. These studies have focused on generative modeling with cross-entropy loss. This paper seeks to extend these results to reinforcement learning, which generally has no cross-entropy loss. It introduces intrinsic performance, which is defined to be equal to training compute on the compute-efficient frontier. It studies relationships between performance, model size and environment interactions across a range of environments.

Intrinsic performance: Cross-entropy test loss scales smoothly with training compute in generative modeling. Mean episode return in reinforcement learning does not necessarily scale smoothly. Intrinsic performance is a metric that behaves like test loss and scales as a power law with compute. Intrinsic performance is the minimum compute required to train a model of any size to reach the same return.

The power law for intrinsic performance: Intrinsic performance I scales as a power law with model parameters N and environment interactions E....
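Power-law relationships like the ones summarized above are commonly estimated with a straight-line fit in log-log space. The following is a minimal sketch for a single-variable law y = a * x^b, where a is the coefficient and b is the exponent. It does not reproduce the paper's joint law in N and E, which is cut off in this excerpt; the function name fit_power_law and the synthetic data are assumptions made for illustration.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # The slope of the log-log line is the exponent b.
    slope = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    slope /= sum((lx - mean_x) ** 2 for lx in log_x)
    # The intercept of the log-log line is log(a).
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope


# Synthetic, noise-free data generated from y = 2 * x**0.5 (not from the paper).
compute = [1e3, 1e4, 1e5, 1e6]
size = [2.0 * c ** 0.5 for c in compute]

coefficient, exponent = fit_power_law(compute, size)
print(coefficient, exponent)
```

In this parameterization, the statement that horizon length mostly changes the coefficient but not the exponent corresponds to a shift of the log-log line up or down with its slope left unchanged.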

Faithful Chain-of-Thought Reasoning

Abstract: Chain-of-Thought (CoT) prompting improves the performance of Language Models (LMs) on complex reasoning tasks. Faithful CoT is a framework that breaks down a reasoning task into two stages: Translation and Problem Solving. Faithful CoT outperforms traditional CoT prompting on 9 out of 10 datasets. Faithful CoT achieves new state-of-the-art few-shot performance on 7 out of 10 datasets....

A Bias-Variance-Privacy Trilemma for Statistical Estimation

Abstract: A standard algorithm for differentially private mean estimation clips the samples and adds noise to their empirical mean. Clipping controls the sensitivity, and therefore the variance of the noise added for privacy, but it introduces statistical bias. The tradeoff between low bias, low variance, and low privacy loss is inherent. Unbiased mean estimation is possible under approximate differential privacy if the distribution is symmetric....
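The clip-then-add-noise recipe described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes Laplace noise for pure epsilon-differential privacy under the replace-one-sample notion of neighboring datasets, and the names private_mean, clip_bound, and epsilon are hypothetical. It returns both the noisy estimate and the clipped mean so the bias introduced by clipping is visible.

```python
import random


def clip(x, bound):
    """Clamp x to the interval [-bound, bound]."""
    return max(-bound, min(bound, x))


def private_mean(samples, clip_bound, epsilon, rng):
    """Return (noisy_mean, clipped_mean) for a clipped, Laplace-noised mean."""
    n = len(samples)
    clipped_mean = sum(clip(x, clip_bound) for x in samples) / n
    # Replacing one sample moves the clipped mean by at most 2 * clip_bound / n.
    sensitivity = 2.0 * clip_bound / n
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return clipped_mean + noise, clipped_mean


# Skewed toy data (illustrative only): the true mean is 2.5.
data = [0.0, 0.0, 0.0, 10.0]
noisy, clipped = private_mean(data, clip_bound=1.0, epsilon=1.0, rng=random.Random(0))
print(clipped)  # 0.25: clipping the outlier to 1.0 biases the mean downward
print(noisy)    # clipped mean plus Laplace noise
```

A tighter clip_bound shrinks the noise scale, which lowers variance, but it moves the clipped mean further from the true mean on skewed data. That is the bias-variance side of the tradeoff the abstract describes, with epsilon supplying the privacy side.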

LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization

Abstract: 73% of papers do not perform any human evaluation on model-generated summaries. LongEval is a set of guidelines for human evaluation of faithfulness in long-form summaries. LongEval addresses the challenges of achieving high inter-annotator agreement, minimizing annotator workload, and automated alignment between summary and source snippets. Switching to a finer granularity of judgment reduces inter-annotator variance in faithfulness scores. Partial annotation of fine-grained units highly correlates with scores from a full annotation workload.

Paper Content

Introduction: Human evaluation is labor-intensive, expensive and difficult to design. A large number of judged examples is needed to draw statistically significant conclusions. Human evaluation is especially challenging for long sequences of generated text. The paper surveys 162 publications and preprints on long-form summarization. 73% of papers do not perform human evaluation on long-form summaries. Lack of standardization in design decisions can impact inter-annotator agreement. Human evaluation is expensive, difficult and time-consuming. The paper contributes the LONGEVAL guidelines for human evaluation of faithfulness in long-form summarization, an empirical evaluation of LONGEVAL on two long-form summarization datasets, and a dataset with 3-way fine-grained human faithfulness judgments for 120 summaries.

Survey of human evaluation practices: Human evaluation of long-form summaries is rarely done. Most studies do not follow reproducible practices. Human evaluation setups lack standardization. Human evaluation is challenging and expensive. The LONGEVAL guidelines are proposed to improve efficiency and standardization. Experiments are conducted on two long-form summarization datasets. COARSE annotations have lower inter-annotator agreement than FINE annotations. FINE annotations should be preferred for long-form summaries. Partial annotation is proposed to reduce annotator workload. Partial annotation has high correlation to full annotation.

Related work: Recent work has focused on automatic evaluation methods for summarization. Human evaluation is the gold standard for developing automatic metrics. The Pyramid method is a notable effort in this space. Efficient Pyramid-like protocols have been used to collect large-scale datasets. LONGEVAL focuses on faithfulness, operates in a reference-free setting, and targets long-form summarization tasks like SQuALITY and PubMed. Faithfulness in summarization differs from fact verification in three ways.

Conclusion: We present the LONGEVAL guidelines for standardized human evaluation of long-form summarization....

Looped Transformers as Programmable Computers

Abstract: A framework is presented for using transformer networks as universal computers. An input sequence acts as a punchcard, containing instructions and memory. Encoder layers can emulate basic computing blocks. These building blocks can emulate a small instruction-set computer. The transformer can emulate a basic calculator, linear algebra library, and in-context learning algorithms....

Quantifying Context Mixing in Transformers

Abstract: Self-attention weights are used to analyze token-to-token interactions in Transformer-based models. Other components in the encoder layer can also affect information mixing in the output representations. Value Zeroing is a novel context mixing score customized for Transformers that provides a deeper understanding of how information is mixed. Evaluations are done from different viewpoints based on linguistically informed rationales, probing, and faithfulness analysis....

Guiding Online Reinforcement Learning with Action-Free Offline Pretraining

Abstract: Offline RL methods reduce the need for environment interaction by training agents on episodes collected offline. Action information needs to be logged during data collection, which can be difficult or impossible in some cases. AFP-RL investigates the potential of using action-free offline datasets to improve online reinforcement learning. AF-Guide consists of AFDT and Guided SAC, which learn from the offline dataset and guide online training....