Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Bayesian Learning Rule provides a framework for generic algorithm design
  • Difficult to use due to parameterization, gradients, and updates
  • Extension based on Lie-groups simplifies difficulties
  • New algorithm for deep learning with desirable attributes
  • Exploits Lie-group structures for new algorithm design

Paper Content

Introduction

  • Bayesian Learning Rule (BLR) provides a general framework to derive algorithms from optimization, deep learning, and graphical models
  • BLR uses natural-gradient descent to find approximations of the generalized posterior distribution
  • BLR has been used to design new algorithms for uncertainty estimation in deep learning
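  • BLR can be difficult to use for three reasons: it depends on a specific parameterization, gradients can be hard to compute, and updates may leave the set of valid parameters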
  • Extension of BLR based on Lie-groups proposed to address difficulties
  • Lie-group BLR uses group’s exponential map to update candidate distributions
  • Gradient computations simplified by reparameterization trick
  • Update naturally stays within the manifold
  • Use cases for algorithm design in deep learning using additive, multiplicative, and affine groups
  • New algorithm with multiplicative group gives rise to networks with nodes that are forced to be either excitatory or inhibitory

The Bayesian learning rule

  • BLR aims to find a posterior candidate in a space of candidate distributions by minimizing the expected loss minus the entropy of the candidate
  • Balancing the two terms (expected loss and entropy) amounts to an exploration-exploitation tradeoff
  • Problem can be rewritten as an inference problem
  • When the loss corresponds to the log-joint distribution of a Bayesian model, the solution coincides with the posterior distribution
  • BLR is a natural-gradient descent algorithm
  • BLR can recover many existing algorithms from a variety of fields
  • Design of new algorithms is possible
  • BLR can be difficult to use in many cases
  • Computing the gradient with respect to the expectation parameter µ is not always straightforward
  • The updated λ obtained by BLR may not always be a valid natural parameter
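
For reference, the objective and update summarized above can be written out explicitly. This is a sketch in the notation of Khan and Rue (2021), assuming an exponential-family candidate q with natural parameter λ, expectation parameter µ, and a constant base measure. The symbols ρ (learning rate), ℓ (loss), and H (entropy) are assumptions of this sketch, since the summary does not name them.

```latex
\min_{q \in \mathcal{Q}} \; \mathbb{E}_{q}[\ell(\theta)] - \mathcal{H}(q),
\qquad
\lambda \leftarrow (1 - \rho)\,\lambda - \rho\,\nabla_{\mu}\,\mathbb{E}_{q}[\ell(\theta)]
```

The second expression shows where the difficulties arise: the gradient is taken with respect to µ, and nothing in the additive update guarantees that the new λ is still a valid natural parameter.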

The Lie-group Bayesian learning rule

  • Proposing a Lie-group based extension of the BLR
  • Describing Lie groups and their actions
  • Parameterization and exponential map
  • Deriving the new learning rule

Lie groups and their actions

  • A Lie-group is a group (a set with an associative binary operation, an identity element, and inverses) that is also a smooth manifold, with smooth multiplication and inversion
  • A smooth manifold is locally diffeomorphic to Euclidean space
  • Examples of Lie-groups include (R, +) and (R_{>0}, ×)
  • Cartesian product of two Lie-groups is also a Lie-group
  • Action of Lie-group on parameter-space is a smooth map
  • Example of action is (A, b) • θ = Aθ + b
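
The affine action above can be sketched in a few lines. This is a minimal illustration that assumes A is diagonal and stored as a list of its nonzero diagonal entries. The function names are hypothetical and not taken from the paper.

```python
# Diagonal affine group acting on a parameter vector: (A, b) . theta = A*theta + b.
# A is stored as the list of its nonzero diagonal entries (an assumption of this sketch).

def act(g, theta):
    """Apply the group element g = (a, b) to theta, elementwise."""
    a, b = g
    return [ai * ti + bi for ai, bi, ti in zip(a, b, theta)]

def compose(g1, g2):
    """Group product g1 * g2, chosen so that act(g1 * g2, t) == act(g1, act(g2, t))."""
    a1, b1 = g1
    a2, b2 = g2
    a = [x * y for x, y in zip(a1, a2)]
    b = [x * y + z for x, y, z in zip(a1, b2, b1)]  # a1 * b2 + b1
    return (a, b)

def inverse(g):
    """Inverse element: theta -> (theta - b) / a."""
    a, b = g
    return ([1.0 / x for x in a], [-y / x for x, y in zip(a, b)])
```

Composing two elements and then acting gives the same result as acting twice, which is the defining property of a group action.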

Lie group parametrization

  • G acts on the parameter space through a smooth map
  • Pushforwards are used to define another action on the space of measures
  • A base distribution q0 is given with positive density
  • The space of candidate distributions Q is the orbit of q0 under the action of G
  • Every q in Q can be parametrized by group elements g
  • Examples of EFs that can be parameterized this way include Gaussian and Bernoulli distributions
  • This parameterization is useful for using non-EF distributions such as the Laplace distribution
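
Sampling from a candidate q_g only requires sampling from q0 and applying the group action. The sketch below assumes a diagonal affine action and a standard Laplace base distribution as an example of a non-EF choice. Both are illustrative choices, and the helper names are hypothetical.

```python
import random

def sample_base(rng):
    """Standard Laplace draw (location 0, scale 1) as a difference of two Exp(1) draws."""
    return rng.expovariate(1.0) - rng.expovariate(1.0)

def sample_candidate(g, rng):
    """Draw theta ~ q_g by pushing a base sample through the action: theta = a*eps + b."""
    a, b = g
    return [ai * sample_base(rng) + bi for ai, bi in zip(a, b)]
```

With g = (a, b), the draws are centered at b with scale a, so the group element fully describes the candidate distribution.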

The exponential map and Lie group updates

  • Goal is to find a group element g* that minimizes a given energy function
  • Exponential map is used to move in the direction of fastest descent
  • Exponential map is a smooth function that maps the tangent space at the identity (the Lie algebra) to the group
  • For matrix groups, the exponential map is given by the Taylor series of the matrix exponential; for diagonal matrices it reduces to the elementwise exponential
  • Update of the form g ← g exp(−αX) moves g along the direction −X with a step-size of α
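
The update g ← g exp(−αX) takes a concrete form for the two simplest groups mentioned earlier. This is a minimal sketch in which X is just a given direction vector. How X is obtained from the objective is not modeled here, and the function names are hypothetical.

```python
import math

def step_additive(g, X, alpha):
    """(R, +): the exponential map is the identity and the group operation is addition."""
    return [gi + (-alpha * xi) for gi, xi in zip(g, X)]

def step_multiplicative(g, X, alpha):
    """(R_{>0}, x): the exponential map is exp and the group operation is multiplication."""
    return [gi * math.exp(-alpha * xi) for gi, xi in zip(g, X)]
```

The additive step is an ordinary descent step and can produce negative values. The multiplicative step rescales each entry by a positive factor, so positive entries stay positive regardless of the step-size.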

Simplifying gradients through reparametrization

  • We will use the group’s exponential map to derive a new learning rule.
  • We start by showing the simplification of the gradient computation by using a change of variable.
  • We parametrize T_{q_g} Q by the vectors in the Lie algebra and then compute the differential of E on vectors expressed in this way.
  • We use linear maps dL_g(e) : T_e G → T_g G between the vector spaces.
  • We compute the perturbations of E at q_g by the tangent vectors.
  • We use a reparameterization trick to calculate the differential of E at q_g.
  • We update the g using the exponential map.
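
The reparameterization step can be illustrated for a scalar multiplicative group. The sketch below covers only the expected-loss part of the energy and omits the entropy term. The quadratic loss, the exponential base samples, and all names are assumptions made for illustration. Perturbing g to g·exp(t) and differentiating at t = 0 gives the average of ℓ'(g·ε)·g·ε over base samples ε.

```python
import math
import random

def loss(theta):
    """Toy loss, chosen only for illustration."""
    return 0.5 * (theta - 1.0) ** 2

def loss_grad(theta):
    return theta - 1.0

def mc_energy(g, eps_samples):
    """Monte Carlo estimate of E_{eps ~ q0}[loss(g * eps)] (entropy term omitted)."""
    return sum(loss(g * e) for e in eps_samples) / len(eps_samples)

def lie_algebra_grad(g, eps_samples):
    """d/dt mc_energy(g * exp(t)) at t = 0, computed by differentiating through the samples."""
    return sum(loss_grad(g * e) * g * e for e in eps_samples) / len(eps_samples)
```

Only gradients of the loss with respect to θ are needed. A finite-difference check along t, using the same samples, should agree with `lie_algebra_grad`.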

The new learning rule

  • Lie-Group BLR updates g through the exponential map, so iterates stay within the manifold
  • Operator is the Riemannian manifold analogue of the transpose
  • Metric is a positive-definite, non-degenerate, symmetric, bilinear form on the tangent spaces of Q
  • Direction of fastest descent can be obtained using ∇_θ
  • Fisher metric is used because it arises as the second-order differential of the objective function
  • Momentum can be included in Lie-Group BLR

New algorithms for deep learning

  • Lie-group BLR can be used to design new algorithms for deep learning
  • Lie-group BLR is extended from BLR
  • Lie-group BLR uses additive, multiplicative, and affine groups
  • Linear approximation of the map coincides with BLR
  • Lie-group BLR generalizes the update obtained by Khan and Rue (2021)
  • Lie-group BLR can be used to train networks with desirable biologically-plausible attributes
  • Linearization of Lie-group BLR recovers BLR update for Rayleigh distributions
  • Lie-group BLR can be written in terms of λ by squaring and reciprocating
  • BLR update for Rayleigh distributions can be simplified to coincide with Lie-group BLR
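
The relation between an exponential-map update and its linear approximation can be checked numerically for a scalar multiplicative group. This sketch treats X as a fixed number and only illustrates the first-order expansion exp(−αX) ≈ 1 − αX. It does not reproduce the paper's derivation for specific distributions, and the names are hypothetical.

```python
import math

def exact_step(g, X, alpha):
    """Multiplicative update through the exponential map."""
    return g * math.exp(-alpha * X)

def linearized_step(g, X, alpha):
    """First-order approximation in alpha of the same update."""
    return g * (1.0 - alpha * X)
```

The gap between the two steps shrinks quadratically in α. For a large step-size the linearized step can leave the positive reals, while the exact step cannot.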

The diagonal affine group

  • The affine group combines translations and scaling.
  • The exponential map for this group is more complicated.
  • Choosing q0 as a Dirac delta measure gives us gradient descent.
  • If q0 is chosen as the normal distribution N(0, I) then Q is an exponential family.
  • The linear approximation in α to this update is the BLR from Khan and Rue (2021).
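
For one coordinate of the diagonal affine group, the exponential map has a closed form that follows from the matrix exponential of [[X, Y], [0, 0]]. The sketch below writes a group element as (a, b), acting as θ ↦ aθ + b, and a Lie-algebra element as (X, Y). These names and the function layout are assumptions of the sketch, not the paper's notation.

```python
import math

def affine_exp(X, Y):
    """Exponential map of the 1-D affine group.

    exp([[X, Y], [0, 0]]) = [[e^X, Y * (e^X - 1) / X], [0, 1]].
    """
    a = math.exp(X)
    # (e^X - 1) / X tends to 1 as X -> 0; expm1 keeps accuracy for small X.
    b = Y * (math.expm1(X) / X if X != 0.0 else 1.0)
    return (a, b)

def affine_mul(g1, g2):
    """Group product: (a1, b1) * (a2, b2) = (a1 * a2, a1 * b2 + b1)."""
    a1, b1 = g1
    a2, b2 = g2
    return (a1 * a2, a1 * b2 + b1)

def affine_step(g, X, Y, alpha):
    """Update g <- g * exp(-alpha * (X, Y))."""
    return affine_mul(g, affine_exp(-alpha * X, -alpha * Y))
```

The scale component is always multiplied by a positive number, so it never crosses zero. When X = 0 the step reduces to a pure translation of b by −α·a·Y.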

Numerical experiments

  • Compare Lie-group BLR to existing methods
  • Report performance for predictive marginal probability
  • Compute probability from optimal group element
  • Approximate integral using 32 samples from q0
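
The sample-based approximation of the predictive probability can be sketched as follows. The one-weight logistic model, the standard normal base distribution, and the affine element g = (a, b) are hypothetical stand-ins for the networks used in the paper. Only the averaging over 32 base samples follows the description above.

```python
import math
import random

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predictive_prob(x, g, rng, num_samples=32):
    """Approximate p(y=1 | x) by averaging the model output over theta = a*eps + b, eps ~ q0."""
    a, b = g
    total = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # base sample from q0 (assumed standard normal here)
        w = a * eps + b             # push the sample through the group action
        total += sigmoid(w * x)     # likelihood of y = 1 under this weight
    return total / num_samples
```

The probabilities are averaged rather than the weights, so the prediction reflects the spread of the candidate distribution and not only its center.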

Additive vs. multiplicative learning

  • Additive and multiplicative group updates are compared when applied to neural network training
  • Algorithms are used to train an MLP and a CNN
  • Results are summarized in Table 1
  • Multiplicative learning leads to sparse, localized and compositional traits
  • Multiplicative learning improves NLL and ECE
  • Sparse nature of multiplicative filters is explained by entropy and mean of a distribution being tied together
  • Sharpness of filters is interpreted as neuronal task specialization
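
For readers unfamiliar with the two reported metrics, the sketch below computes NLL and ECE in a simplified binary setting. The equal-width binning with 10 bins and the function names are assumptions. The paper's exact evaluation protocol is not described in this summary.

```python
import math

def nll(probs, labels):
    """Average negative log-likelihood; probs are P(y=1) and labels are 0 or 1."""
    total = 0.0
    for p, y in zip(probs, labels):
        total -= math.log(p if y == 1 else 1.0 - p)
    return total / len(probs)

def ece(confidences, correct, num_bins=10):
    """Expected calibration error: bin-weighted average of |accuracy - mean confidence|."""
    n = len(confidences)
    total = 0.0
    for k in range(num_bins):
        lo, hi = k / num_bins, (k + 1) / num_bins
        idx = [i for i, c in enumerate(confidences)
               if (lo < c <= hi) or (k == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        total += len(idx) / n * abs(acc - conf)
    return total
```

Lower is better for both: NLL penalizes confident wrong predictions, and ECE measures how far stated confidence is from observed accuracy.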

The affine learning rule

  • Affine learning update from Sec. 4.3 compared to state-of-the-art natural-gradient variational inference methods
  • ResNet-20 architecture reaches around 91% accuracy when trained with SGD
  • Making the base distribution q0 more heavy-tailed improves calibration measures
  • Affine update works for any base distribution q0
  • No additional damping term required, easier to tune

Discussion

  • Proposes Lie-group BLR, an extension of BLR
  • Does not rely on specific parameterization of EFs
  • Enables gradient computations via reparametrization trick
  • Automatically keeps updates on manifold
  • Shows 3 use cases for algorithm design in deep learning

A. Mathematical details

  • Fisher metric is the second order differential of the objective function
  • It is obtained by perturbing the energy functional by a mean zero function
  • The linear term is the first differential and the quadratic term is the second differential
  • The second differential can be written as a bilinear form
  • The differential of the entropy term in dE(q) is calculated by integration by parts
  • The Lie-Group Bayesian Learning Rule is given by a scalar matrix
  • Algorithm 1 is for the additive group
  • Algorithm 2 is for the multiplicative group
  • Tangent vectors to Q are given as mean-zero functions
  • The Fisher information metric is calculated as a scalar matrix
  • The update rule is given by a step size
  • The group parameter g and the natural parameter λ are componentwise reciprocals of each other
  • Log-normal distributions with a fixed variance parameter fit into this family scheme
  • The Lie algebra of G is given by vectors X ∈ R^P
  • The Fisher bilinear form is independent of g and only depends on q0