Link to paper The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract Method to formulate algorithm discovery as program search Leverage efficient search techniques to explore an infinite and sparse program space Introduce program selection and simplification strategies Discovered a simple and effective optimization algorithm, $\textbf{Lion}$ Compares Lion with widely used optimizers Lion boosts accuracy and saves compute Lion outperforms Adam in diffusion models Lion exhibits similar or better performance compared to Adam Performance gain grows with training batch size Requires smaller learning rate than Adam Examines limitations of Lion Paper Content Introduction Optimization algorithms are important for training neural networks There are many handcrafted optimizers AdamW and Adafactor are the standard optimizers for deep neural networks Lion offers improved accuracy, efficiency, and performance on language modeling Lion requires a smaller learning rate and larger decoupled weight decay Symbolic discovery of algorithms Algorithm discovery is formulated as program search Symbolic representation (programs) used for advantages such as implementation, analysis, comprehension and transferability Program length used to estimate complexity and select simpler, more generalizable programs Method applicable to deep neural network training and other tasks Program search space Search space should be flexible to enable discovery of novel algorithms Programs should be easy to analyze and incorporate into machine learning workflow Focus on high-level algorithmic design rather than low-level implementation details Programs contain functions operating over n-dimensional arrays Train function has inputs of model weight, gradient, and learning rate schedule value Train function has outputs of update to weight Extra variables initialized as zeros to collect historical information 45 common math functions used Mutations include inserting, deleting, and modifying statement Search space is infinite and sparse Random search of 2M programs on low-cost proxy task still inferior to AdamW Efficient search techniques Regularized evolution is used to address the challenges posed by the infinite and sparse search space Tournament selection is used to pick the best performer as the parent The parent is then copied and mutated to produce a child algorithm Evolutionary search is warm-started with AdamW to accelerate the search Multiple restarts from the initial program are used to reduce variance in the search fitness Redundancies in the program space are pruned from three sources Abstract execution is used to detect programs with errors and identify redundant statements Low-cost proxies are used to reduce search cost Search experiments utilize 100 TPU V2 chips and run for ∼72h Five repeats of search experiments are used, followed by another round of search initializing from the best algorithm found thus far Generalization: program selection and simplification Search experiments can discover promising programs on proxy tasks....