arxiv-summary: AI-summarized AI papers

Guiding Pretraining in Reinforcement Learning with Large Language Models

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
Reinforcement learning algorithms struggle without a dense, well-shaped reward function. Intrinsically motivated exploration methods reward agents for visiting novel states or transitions, but are limited in large environments. ELLM (Exploring with LLMs) uses background knowledge from text corpora to shape exploration. ELLM rewards agents for achieving goals suggested by a language model. ELLM guides agents toward human-meaningful and useful behaviors without requiring a human in the loop....

Symbolic Discovery of Optimization Algorithms

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
The paper presents a method to formulate algorithm discovery as program search. It leverages efficient search techniques to explore an infinite and sparse program space, and introduces program selection and simplification strategies. The search discovered a simple and effective optimization algorithm, Lion. The paper compares Lion with widely used optimizers. Lion boosts accuracy and saves compute. Lion outperforms Adam in diffusion models. Lion exhibits similar or better performance compared to Adam. The performance gain grows with training batch size. Lion requires a smaller learning rate than Adam. The paper also examines the limitations of Lion.

Paper Content

Introduction
Optimization algorithms are important for training neural networks. There are many handcrafted optimizers. AdamW and Adafactor are the standard optimizers for deep neural networks. Lion offers improved accuracy, efficiency, and performance on language modeling. Lion requires a smaller learning rate and larger decoupled weight decay.

Symbolic discovery of algorithms
Algorithm discovery is formulated as program search. A symbolic representation (programs) is used for its advantages in implementation, analysis, comprehension and transferability. Program length is used to estimate complexity and to select simpler, more generalizable programs. The method is applicable to deep neural network training and other tasks.

Program search space
The search space should be flexible enough to enable discovery of novel algorithms. Programs should be easy to analyze and to incorporate into a machine learning workflow. The focus is on high-level algorithmic design rather than low-level implementation details. Programs contain functions operating over n-dimensional arrays. The train function takes the model weight, the gradient, and the learning rate schedule value as inputs, and outputs the update to the weight. Extra variables are initialized as zeros to collect historical information. 45 common math functions are used. Mutations include inserting, deleting, and modifying statements. The search space is infinite and sparse. A random search of 2M programs on a low-cost proxy task is still inferior to AdamW.

Efficient search techniques
Regularized evolution is used to address the challenges posed by the infinite and sparse search space. Tournament selection is used to pick the best performer as the parent. The parent is then copied and mutated to produce a child algorithm. The evolutionary search is warm-started with AdamW to accelerate the search. Multiple restarts from the initial program are used to reduce variance in the search fitness. Redundancies in the program space are pruned from three sources. Abstract execution is used to detect programs with errors and to identify redundant statements. Low-cost proxies are used to reduce search cost. Search experiments utilize 100 TPU V2 chips and run for about 72 hours. Five repeats of search experiments are used, followed by another round of search initialized from the best algorithm found thus far.

Generalization: program selection and simplification
Search experiments can discover promising programs on proxy tasks....
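The search loop summarized under "Efficient search techniques" can be sketched in a few lines. The following is a minimal illustration of regularized (aging) evolution with tournament selection, not the paper's implementation: the program representation (a tuple of integers), the `mutate` and `fitness` functions, and all hyperparameter values are hypothetical stand-ins chosen so the example runs on its own. In the paper the programs are optimizer update rules and the fitness comes from low-cost proxy training tasks.

```python
import collections
import random


def fitness(program):
    # Hypothetical toy fitness: the "statements" should sum to 10,
    # with a small penalty on program length (shorter is preferred).
    return -abs(sum(program) - 10) - 0.01 * len(program)


def mutate(program, rng):
    """Return a mutated copy: insert, delete, or modify one statement."""
    child = list(program)
    op = rng.choice(["insert", "delete", "modify"])
    if op == "insert":
        child.insert(rng.randrange(len(child) + 1), rng.randint(-3, 3))
    elif op == "delete":
        # Never delete the last remaining statement; the copy is then unchanged.
        if len(child) > 1:
            del child[rng.randrange(len(child))]
    else:
        child[rng.randrange(len(child))] = rng.randint(-3, 3)
    return tuple(child)


def regularized_evolution(init_program, cycles=500, population_size=20,
                          tournament_size=5, seed=0):
    rng = random.Random(seed)
    # Warm start: the whole population begins as copies of the initial program.
    population = collections.deque(
        (init_program, fitness(init_program)) for _ in range(population_size)
    )
    best = population[0]
    for _ in range(cycles):
        # Tournament selection: sample a few members, the fittest is the parent.
        tournament = rng.sample(list(population), tournament_size)
        parent, _ = max(tournament, key=lambda pair: pair[1])
        # Copy and mutate the parent to produce a child.
        child = mutate(parent, rng)
        child_fitness = fitness(child)
        population.append((child, child_fitness))
        # Aging: remove the oldest member, not the worst one.
        population.popleft()
        if child_fitness > best[1]:
            best = (child, child_fitness)
    return best


if __name__ == "__main__":
    program, score = regularized_evolution((1,))
    print(program, score)
```

Removing the oldest member rather than the worst is what distinguishes regularized evolution from plain tournament-based evolution: a program only survives if its descendants keep winning tournaments.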

Event-based Backpropagation for Analog Neuromorphic Hardware

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
Neuromorphic computing uses lessons from biology to design computer architectures. Event-based scalable learning has been an elusive goal in large-scale systems. EventProp algorithm is used with BrainScaleS-2 analog neuromorphic hardware. Gradient estimation is improved by one order of magnitude. Results verify correctness of estimation and are used in a low-dimensional classification task....

Stitchable Neural Networks

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
The public model zoo contains powerful pretrained models. The question is how to assemble these models for accuracy-efficiency trade-offs. SN-Net is a framework for model deployment that produces networks with different complexity and performance. SN-Net splits pretrained networks (anchors) and stitches them together. SN-Net can adapt to dynamic resource constraints. SN-Net can challenge hundreds of models with a single network.

Paper Content

Introduction
Computational resources and data have enabled researchers to build powerful deep neural networks. There are thousands of models available to download and execute. Existing scalable deep learning frameworks are limited to a single model design space. Stitchable Neural Network (SN-Net) is a novel scalable deep learning framework for efficient model design and deployment. SN-Net stitches an off-the-shelf pretrained model family with much less training effort. SN-Net covers a fine-grained level of model complexity/performance for a wide range of deployment scenarios. SN-Net breaks the limit of a single pretrained model or supernet design. Training SN-Net is as easy as training individual models. SN-Net performance is almost predictable. SN-Net is a new universal paradigm with a “many-to-many” pipeline. SN-Net is a general approach for utilising the pretrained model families in the large-scale model zoo.

Method
The section first introduces model stitching and then describes the proposed stitchable neural networks.

Preliminaries of model stitching
The model parameters of a pretrained neural network are denoted by θ. A feed-forward neural network can be defined as a composition of functions. Model stitching involves splitting a neural network into two portions of functions at a layer index l. A stitching layer is used to implement a transformation between the activation spaces of two different networks. Model stitching can produce a sequence of stitched networks. Different architectures can be stitched together without a significant performance drop.

Stitchable neural networks
SN-Net is a new “many-to-many” elastic model paradigm. It is motivated by the increasing number of pretrained models in the publicly available model zoo. SN-Net inserts a few stitching layers to connect a family of pretrained models. Anchors should be consistent in terms of the pretrained domain. Stitching layers are 1x1 convolutional layers. The least-squares solution is used as the default initialization approach. Fast-to-Slow is the default stitching direction. The nearest stitching strategy is used. Stitching is done as sliding windows. The training strategy uses knowledge distillation. Experiments are conducted on ImageNet-1K. The models studied are DeiT, Swin Transformer, ResNet and CNN with ViT.

Main results
The stitching configuration set is generated by assembling ImageNet-1K pretrained DeiT-Ti/S/B. The stitches in the DeiT-based SN-Net are trained jointly on ImageNet for 50 epochs. The performance of all 71 stitches, including the 3 anchors, is visualized. Performance increases when stitching more blocks from the larger anchor, giving a model-level interpolation between two anchors. SN-Net achieves better performance than individual models trained from scratch. SN-Net reduces training cost and disk storage compared to training and saving all individual networks.

Ablation study
SN-Net is ablated with the default training strategy of 50 epochs on ImageNet. LS Init serves as a good starting point for learning stitching layers compared to Kaiming Init. Fast-to-Slow helps to ensure better performance for most stitches. The nearest stitching strategy limits a stitch to connect a pair of anchors with the nearest model complexity/performance. Tuning only the stitching layers is promising for only some stitches. Tuning the full model improves the performance of stitches.

Conclusion
The paper introduced Stitchable Neural Networks (SN-Net), a framework for developing elastic neural networks. SN-Nets inherit knowledge from pretrained model families and deliver fast and flexible accuracy-efficiency trade-offs, at low cost for massive deployment of deep models. The approach is extendable to natural language processing, dense prediction and transfer learning. Limitations: a large stitching space requires more training epochs. The nearest stitching strategy limits stitches to two types. Different settings of sliding windows can produce different numbers of stitches. The default setting is 50 training epochs, and 15 epochs still produce good performance. The simple training strategy randomly samples a stitch. The sandwich sampling rule and inplace distillation are explored. Pretrained weights of anchors are necessary for convergence. By default, 100 training images are used to initialize stitching layers, and more samples do not bring more performance gain. SN-Net is able to switch network topology at runtime.
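The least-squares initialization of a stitching layer mentioned in the summary can be illustrated with a small NumPy sketch. This is an assumption-laden illustration, not the authors' code: a 1x1 convolution is treated as a per-position linear map (weight plus bias), the activations are synthetic, and the function names `ls_init_stitching_layer` and `apply_stitching_layer` are hypothetical.

```python
import numpy as np


def ls_init_stitching_layer(acts_src, acts_dst):
    """Least-squares fit of a linear stitching layer.

    acts_src: (n, d_src) activations of the source anchor at the split layer,
              one row per spatial position or token.
    acts_dst: (n, d_dst) activations of the target anchor at the layer whose
              input the stitching layer must imitate.
    Returns (weight, bias) with shapes (d_src, d_dst) and (d_dst,).
    """
    n = acts_src.shape[0]
    # Append a column of ones so the bias is fitted jointly with the weight.
    design = np.hstack([acts_src, np.ones((n, 1))])
    solution, *_ = np.linalg.lstsq(design, acts_dst, rcond=None)
    return solution[:-1], solution[-1]


def apply_stitching_layer(acts_src, weight, bias):
    # Equivalent to a 1x1 convolution applied independently at each position.
    return acts_src @ weight + bias


def demo(seed=0):
    # Synthetic activations (hypothetical sizes): a smaller, faster source
    # anchor with 8 channels is stitched into a larger target with 12 channels.
    rng = np.random.default_rng(seed)
    acts_src = rng.standard_normal((400, 8))
    true_weight = rng.standard_normal((8, 12))
    true_bias = rng.standard_normal(12)
    acts_dst = acts_src @ true_weight + true_bias
    weight, bias = ls_init_stitching_layer(acts_src, acts_dst)
    mapped = apply_stitching_layer(acts_src, weight, bias)
    return float(np.max(np.abs(mapped - acts_dst)))


if __name__ == "__main__":
    print("max reconstruction error:", demo())
```

In practice the two activation matrices would come from running both anchors on the same small batch of images. The relation between them is then only approximately linear, so the fit serves as a starting point for training rather than an exact mapping.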

Sources of Richness and Ineffability for Phenomenally Conscious States

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
Conscious states are both rich and ineffable. Explanatory gap: consciousness cannot be reduced to physical processes. Information theoretic dynamical systems perspective on richness and ineffability of consciousness. Richness of conscious experience corresponds to amount of information in conscious state. Ineffability corresponds to amount of information lost at different stages of processing. Attractor dynamics in working memory induce impoverished recollections of original experiences....

Stabilized In-Context Learning with Pre-trained Language Models for Few Shot Dialogue State Tracking

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
Pre-trained language models (PLMs) have shown impressive unaided performance across many NLP tasks. Adding a few labeled in-context exemplars can improve PLMs. Designing prompts for complex tasks like dialogue state tracking is difficult. Building in-context exemplars for dialogue tasks is difficult due to short model input lengths. A meta-learning scheme and novel training method are used to stabilize the model and find ideal in-context examples....

From high-dimensional & mean-field dynamics to dimensionless ODEs: A unifying approach to SGD in two-layers networks

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
The paper investigates the one-pass stochastic gradient descent (SGD) dynamics of a two-layer neural network. Data and labels are generated by a similar target function. The limiting dynamics are analysed via a deterministic and low-dimensional description. The analysis bridges different regimes of interest, such as gradient-flow, high-dimensional, and overparameterised. In the high-dimensional limit, the dynamics remains close to a low-dimensional subspace spanned by the target principal directions.

Paper Content

Introduction
Stochastic gradient descent was introduced as a method for stochastic approximation. It was applied to population risk minimization. Its properties have been studied for finite learning rate and input dimension. High-dimensional limits of SGD were studied for non-convex, single-index models. The SGD dynamics of two-layer neural networks was studied for synthetic data with simple target functions. The mean-field limit of the SGD dynamics of two-layer neural networks was studied, and global convergence was proved. Dimension-free limits of the mean-field equations were derived for low-dimensional target functions. Global convergence of gradient flow dynamics at finite width was proven for orthogonal input data. A unifying low-dimensional description of one-pass SGD dynamics is discussed as the learning rate and hidden layer width scale with diverging input dimension.

Setting
The task is supervised learning regression with n independent samples. The model is a fully-connected two-layer neural network with trainable parameters Θ. Mean-field normalization is adopted. The square loss is used to penalize deviations from the true labels. Training is performed via empirical risk minimization. Stochastic gradient descent is used to optimize the training parameters. The inputs $x^\nu$ are Gaussian, and the outputs $y^\nu$ are drawn from a generative model. The teacher-student scenario provides a data model for studying generalization. The SGD dynamics remain in a bounded subset of $\mathbb{R}^{p \times d}$. The student and teacher activation functions are twice differentiable and upper bounded by K.

The three limit regimes and their dimensionless description
Non-convex optimisation problem is introduced in Sec....
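The teacher-student setting described under "Setting" can be simulated directly. The sketch below is a minimal illustration under stated assumptions, not the paper's experiment: both networks use tanh (which is bounded and twice differentiable), the teacher has fixed unit second-layer weights, there is no label noise, and the sizes `d`, `p`, `k`, the learning rate, and the step count are arbitrary hypothetical values. "One-pass" means every SGD step draws a fresh Gaussian sample that is never reused.

```python
import numpy as np


def one_pass_sgd(d=50, p=8, k=2, steps=2000, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    w_teacher = rng.standard_normal((k, d))   # fixed target directions
    w_student = rng.standard_normal((p, d))   # trainable first layer
    a_student = np.ones(p)                    # trainable second layer

    # Held-out Gaussian inputs for a Monte Carlo estimate of the population risk.
    x_eval = rng.standard_normal((1000, d))
    y_eval = np.tanh(x_eval @ w_teacher.T / np.sqrt(d)).mean(axis=1)

    def risk():
        hidden = np.tanh(x_eval @ w_student.T / np.sqrt(d))
        pred = (hidden * a_student).mean(axis=1)  # mean-field 1/p normalization
        return 0.5 * float(np.mean((pred - y_eval) ** 2))

    risk_before = risk()
    for _ in range(steps):
        # Fresh sample at every step: each data point is seen exactly once.
        x = rng.standard_normal(d)
        y = np.tanh(w_teacher @ x / np.sqrt(d)).mean()
        act = np.tanh(w_student @ x / np.sqrt(d))
        err = (a_student * act).mean() - y
        # Gradients of the square loss 0.5 * err**2.
        grad_w = err * (a_student * (1.0 - act ** 2) / p)[:, None] * x[None, :] / np.sqrt(d)
        grad_a = err * act / p
        w_student -= lr * grad_w
        a_student -= lr * grad_a
    return risk_before, risk(), w_student


if __name__ == "__main__":
    before, after, _ = one_pass_sgd()
    print("estimated risk before:", before, "after:", after)
```

Because of the 1/p normalization each gradient is of order 1/p, which is one reason the scaling of the learning rate with the width p and the dimension d matters for which limiting regime the dynamics falls into.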

Compositional Exemplars for In-context Learning

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
Large pretrained language models (LMs) have shown impressive In-Context Learning (ICL) ability. Previous selection methods for ICL are based on simple heuristics. CEIL (Compositional Exemplars for In-context Learning) is proposed to model the interaction between the given input and in-context examples. CEIL is validated on 12 classification and generation datasets from 7 distinct NLP tasks....

Evaluating the Robustness of Discrete Prompts

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
Discrete prompts have been used to fine-tune Pre-trained Language Models. Automatic methods can generate discrete prompts from a small set of training instances. Automatically learnt discrete prompts contain noisy and counter-intuitive lexical constructs. The robustness of discrete prompts is studied by applying perturbations to an application that uses AutoPrompt. Discrete prompt-based method remains relatively robust against NLI input perturbations....

Zero-Knowledge Mechanisms

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
Mechanism design allows for commitment to the rules of a mechanism. Commitment is achieved through public declaration. Public declaration can reveal information the mechanism designer would prefer to keep private. A new approach to commitment is proposed that does not require a mediator. The framework is based on zero-knowledge proofs. Applications include non-mediated bargaining with hidden offers.

Paper Content

Introduction
The mechanism-design paradigm relies on the designer’s ability to commit to a mechanism. The mechanism must be public to allow players to inspect it and verify its properties. Players can infer information from the mechanism that the designer may not want revealed. A trusted mediator can guarantee incentive properties and keep the mechanism secret. Goethe’s lawyer leaked the reserve price to the publisher. The paper gives a protocol with three messages to commit to a mechanism and run it. This seemingly too-good-to-be-true protocol takes a cryptographic approach: a cryptographic commitment to the mechanism and a zero-knowledge proof of its incentive properties. A player can verify that the outcome is consistent with the mechanism. The mechanism is never disclosed to anyone. The result is a self-policing commitment to a never-observed mechanism. Hiding and committing guarantees hold under certain conditions. Examples demonstrate the protocol and what can be deduced from the outcome. Novel, simple zero-knowledge proof techniques are tailored to the examples. The protocols are feasibly computable and can be implemented in practice.

Illustrative examples
The seller commits to a hidden price s. The seller sends a commitment $(C_1, \ldots, C_{\log_2 H})$ to the buyer. The buyer discloses a value v. If s ≤ v, the seller discloses s and trade takes place. If s > v, the seller must prove to the buyer that this is the case without disclosing any additional information about s. The seller must prove knowledge of the discrete logarithm base h of at least one of the $C_j$ ($j = i \lor (j < i \land v_j = 0)$). The seller must do this without revealing which of the numbers she knows the discrete logarithm of. Seller must do this with negligible probability of getting discovered. The seller must do this with modest computational power and communication. The procedure for proving knowledge of a discrete logarithm is given in Appendix A....
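Since Appendix A is not part of this excerpt, the following is not the paper's procedure. It is a minimal sketch of the standard Schnorr interactive proof of knowledge of a discrete logarithm, a common building block for statements of this kind. The group parameters (`P = 23`, `Q = 11`, base `H = 2`) are toy values chosen for readability and offer no security, and all function names are hypothetical. Proving knowledge of the discrete logarithm of "at least one of" several values without revealing which one requires an additional OR-composition that is not shown here.

```python
import random

# Toy parameters (assumption, insecure): H = 2 generates the subgroup of
# prime order Q = 11 in the multiplicative group modulo P = 23.
P = 23
Q = 11
H = 2


def keygen(rng):
    """Pick a secret exponent x and publish the commitment C = H^x mod P."""
    x = rng.randrange(1, Q)
    return x, pow(H, x, P)


def prover_commit(rng):
    """First message: a fresh random nonce r and its image t = H^r mod P."""
    r = rng.randrange(Q)
    return r, pow(H, r, P)


def prover_respond(r, x, challenge):
    """Third message: z = r + challenge * x mod Q."""
    return (r + challenge * x) % Q


def verify(commitment, t, challenge, z):
    """Accept iff H^z == t * C^challenge (mod P)."""
    return pow(H, z, P) == (t * pow(commitment, challenge, P)) % P


def run_protocol(seed=0):
    # random.Random keeps the demo reproducible; a real implementation would
    # draw from a cryptographically secure source such as the secrets module.
    rng = random.Random(seed)
    x, commitment = keygen(rng)
    r, t = prover_commit(rng)
    challenge = rng.randrange(Q)          # second message, sent by the verifier
    z = prover_respond(r, x, challenge)
    return verify(commitment, t, challenge, z)


if __name__ == "__main__":
    print("proof accepted:", run_protocol())
```

The check works because H^z = H^(r + c·x) = H^r · (H^x)^c = t · C^c. The verifier sees only t, c and z, and for a fixed challenge the response z is uniformly distributed because r is, which is the intuition behind the proof revealing nothing about x.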