Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

The Bayesian Learning Rule (BLR) provides a framework for generic algorithm design
It can be difficult to use because of its parameterization, gradient computations, and updates
An extension based on Lie groups addresses these difficulties
The extension yields a new algorithm for deep learning with desirable attributes
It exploits Lie-group structure to design new algorithms

Paper Content

Introduction

Bayesian Learning Rule (BLR) provides a general framework to derive algorithms from optimization, deep learning, and graphical models
BLR uses natural-gradient descent to find approximations of the generalized posterior distribution
BLR has been used to design new algorithms for uncertainty estimation in deep learning
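BLR can be difficult to use for three reasons: it relies on a specific parameterization, its gradients can be hard to compute, and its updates may leave the set of valid parameters
An extension of BLR based on Lie groups is proposed to address these difficulties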
Lie-group BLR uses group’s exponential map to update candidate distributions
Gradient computations simplified by reparameterization trick
Update naturally stays within the manifold
Use cases for algorithm design in deep learning using additive, multiplicative, and affine groups
New algorithm with multiplicative group gives rise to networks with nodes that are forced to be either excitatory or inhibitory

The Bayesian learning rule

BLR aims to find a posterior candidate in a space of candidate distributions by minimizing an expected loss minus an entropy term
Balancing the two terms (low expected loss versus high entropy) requires an exploration-exploitation tradeoff
Problem can be rewritten as an inference problem
When the loss corresponds to the log-joint distribution of a Bayesian model, the solution coincides with the posterior distribution
BLR is a natural-gradient descent algorithm
BLR can recover many existing algorithms from a variety of fields
Design of new algorithms is possible
BLR can be difficult to use in many cases
Computing the gradient with respect to the expectation parameter µ is not always straightforward
The natural parameter λ obtained by a BLR update may not always be valid
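
The last point can be illustrated with a small sketch. It assumes the commonly stated BLR update for the precision of a 1-D Gaussian candidate, precision ← (1 − α) · precision + α · E[ℓ''], which is not spelled out in this summary; the function name and all numbers are hypothetical. With a non-convex loss the expected curvature can be negative, and a single step can push the precision outside the valid (positive) range.

```python
def blr_precision_step(precision, expected_hessian, step_size):
    """One BLR-style update of a 1-D Gaussian precision (assumed form)."""
    return (1.0 - step_size) * precision + step_size * expected_hessian


precision = 1.0
# Hypothetical value: a non-convex loss can have negative expected curvature.
expected_hessian = -3.0
new_precision = blr_precision_step(precision, expected_hessian, step_size=0.5)

# A Gaussian needs a strictly positive precision.
is_valid = new_precision > 0.0
print(new_precision, is_valid)
```

Here the updated precision is negative, so it no longer describes a Gaussian. This is the kind of failure the Lie-group update is meant to avoid.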

The Lie-group Bayesian learning rule

Proposing a Lie-group based extension of the BLR
Describing Lie groups and their actions
Parameterization and exponential map
Deriving the new learning rule

Lie groups and their actions

A Lie group is a group (a set with an associative binary operation, an identity element, and inverses) that is also a smooth manifold, with smooth multiplication and inversion
A smooth manifold is locally diffeomorphic to Euclidean space
Examples of Lie groups include (R, +) and (R_{>0}, ×)
The Cartesian product of two Lie groups is also a Lie group
An action of a Lie group on the parameter space is a smooth map that is compatible with the group operation
An example is the affine action (A, b) • θ = Aθ + b
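
As a minimal sketch of the affine action above, the following restricts A to a diagonal matrix stored as a vector `a`. The helper names `act` and `compose` and all numbers are illustrative assumptions. It checks the defining property of an action: acting with a product of group elements equals acting with each element in turn.

```python
def act(g, theta):
    """Apply the diagonal affine element g = (a, b) to theta: a * theta + b."""
    a, b = g
    return [ai * ti + bi for ai, ti, bi in zip(a, theta, b)]


def compose(g1, g2):
    """Group product chosen so that act(compose(g1, g2), t) == act(g1, act(g2, t))."""
    (a1, b1), (a2, b2) = g1, g2
    a = [x * y for x, y in zip(a1, a2)]
    b = [x * y + z for x, y, z in zip(a1, b2, b1)]
    return (a, b)


identity = ([1.0, 1.0], [0.0, 0.0])
g1 = ([2.0, 0.5], [1.0, -1.0])
g2 = ([3.0, 4.0], [0.5, 2.0])
theta = [1.0, -2.0]

lhs = act(compose(g1, g2), theta)   # act with the product
rhs = act(g1, act(g2, theta))       # act with g2, then with g1
print(lhs, rhs)
```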

Lie-group parametrization

G acts on the parameter space
Pushforwards are used to turn this into an action of G on the space of measures
A base distribution q0 is given with positive density
The space of candidate distributions Q is the orbit of q0 under the action of G
Every q in Q can be parametrized by group elements g
Examples of exponential families (EFs) that can be parameterized this way include Gaussian and Bernoulli distributions
The parameterization also covers non-EF distributions such as the Laplace distribution
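
The orbit construction can be sketched by sampling: draw from the base distribution q0 and push each sample through the group action. The sketch below assumes a standard Laplace q0 and a 1-D affine element g = (a, b); both choices and the numeric values are illustrative, not taken from the paper.

```python
import random
import statistics


def sample_q0(rng):
    """Draw from the base distribution q0, here a standard Laplace (assumption)."""
    # The difference of two unit exponentials is Laplace(0, 1).
    return rng.expovariate(1.0) - rng.expovariate(1.0)


def sample_qg(g, rng):
    """Draw from q_g by pushing a q0 sample through the affine action."""
    a, b = g
    return a * sample_q0(rng) + b


rng = random.Random(0)
g = (2.0, 5.0)  # hypothetical scale and shift
samples = [sample_qg(g, rng) for _ in range(50_000)]
sample_mean = statistics.fmean(samples)
sample_std = statistics.pstdev(samples)
print(sample_mean, sample_std)
```

Since Laplace(0, 1) has mean 0 and variance 2, the pushed-forward samples should have mean close to b and standard deviation close to a·√2. The group element therefore parametrizes the candidate distribution without any exponential-family machinery.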

The exponential map and Lie-group updates

Goal is to find a group element g* that minimizes a given energy function
Exponential map is used to move in the direction of fastest descent
The exponential map is a smooth function that maps the tangent space at the identity (the Lie algebra) to the group
For matrix groups, the exponential map is the matrix exponential, defined by its Taylor series; for diagonal matrices it reduces to the elementwise exponential
An update of the form g ← g exp(−αX) moves against the direction X with a step-size of α
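
A minimal sketch of the update g ← g exp(−αX) for the two scalar groups mentioned earlier. In (R, +) the exponential map is the identity and the group operation is addition, so the update is an ordinary gradient-style step; in (R_{>0}, ×) it becomes a multiplicative step. The function names and numbers are illustrative assumptions.

```python
import math


def additive_step(g, x, alpha):
    """Update in (R, +): exp is the identity, so g <- g - alpha * x."""
    return g - alpha * x


def multiplicative_step(g, x, alpha):
    """Update in (R_{>0}, x): g <- g * exp(-alpha * x), which keeps g positive."""
    return g * math.exp(-alpha * x)


g_add = g_mul = 0.1
for _ in range(5):
    g_add = additive_step(g_add, x=1.0, alpha=0.05)
    g_mul = multiplicative_step(g_mul, x=1.0, alpha=0.05)

print(g_add, g_mul)
```

The additive iterate crosses zero and becomes negative, while the multiplicative iterate shrinks but stays positive: the update cannot leave the group.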

Simplifying gradients through reparametrization

We will use the group’s exponential map to derive a new learning rule.
We start by showing the simplification of the gradient computation by using a change of variable.
We parametrize T_{q_g}Q by the vectors in the Lie algebra and then compute the differential of E on vectors expressed in this way.
We use linear maps dL_g(e) : T_e G → T_g G between the vector spaces.
We compute the perturbations of E at q_g by the tangent vectors.
We use a reparameterization trick to calculate the differential of E at q_g.
We update g using the exponential map.
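
The reparameterization step can be sketched for the scalar multiplicative group, assuming the action g • θ = gθ. Differentiating E[ℓ(g exp(tX) θ0)] at t = 0, with θ0 drawn from q0, gives X · E[gθ0 ℓ'(gθ0)], so only gradients of the loss with respect to θ are needed. The sketch covers only the expected-loss part of the energy (the entropy term is omitted), and the quadratic loss, the exponential base distribution, and all names are hypothetical choices. A finite difference on the same samples serves as a check.

```python
import math
import random


def loss(theta):
    """Toy quadratic loss (assumption)."""
    return 0.5 * (theta - 3.0) ** 2


def loss_grad(theta):
    """Derivative of the toy loss with respect to theta."""
    return theta - 3.0


def lie_algebra_gradient(g, base_samples):
    """Monte Carlo estimate of d/dt E[loss(g * exp(t) * theta0)] at t = 0."""
    return sum(g * t0 * loss_grad(g * t0) for t0 in base_samples) / len(base_samples)


def expected_loss(g, base_samples):
    """Monte Carlo estimate of E[loss(g * theta0)] on fixed base samples."""
    return sum(loss(g * t0) for t0 in base_samples) / len(base_samples)


rng = random.Random(0)
base_samples = [rng.expovariate(1.0) for _ in range(20_000)]  # theta0 ~ q0
g = 1.0
grad = lie_algebra_gradient(g, base_samples)

# Central finite difference along the curve t -> g * exp(t), same samples.
h = 1e-3
finite_diff = (expected_loss(g * math.exp(h), base_samples)
               - expected_loss(g * math.exp(-h), base_samples)) / (2 * h)
print(grad, finite_diff)
```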

The new learning rule

Lie-Group BLR uses an exponential-map update, which stays within the manifold
The operator appearing in the update is the Riemannian-manifold analogue of the transpose
The metric is a positive-definite, non-degenerate, symmetric, bilinear form on the tangent spaces of Q
The direction of fastest descent can be obtained using gradients with respect to θ (∇_θ)
Fisher metric is used because it arises as the second-order differential of the objective function
Momentum can be included in Lie-Group BLR

New algorithms for deep learning

Lie-group BLR can be used to design new algorithms for deep learning
Lie-group BLR extends BLR
Use cases are given for the additive, multiplicative, and affine groups
The linear approximation of the exponential-map update coincides with BLR
Lie-group BLR generalizes the update obtained by Khan and Rue (2021)
Lie-group BLR can be used to train networks with desirable biologically-plausible attributes
Linearization of Lie-group BLR recovers BLR update for Rayleigh distributions
Lie-group BLR can be written in terms of λ by squaring and reciprocating
BLR update for Rayleigh distributions can be simplified to coincide with Lie-group BLR

The diagonal affine group

The affine group combines translations and scaling.
The exponential map for this group is more complicated.
Choosing q0 as a Dirac delta measure gives us gradient descent.
If q0 is chosen as the normal distribution N(0, I), then Q is an exponential family.
The linear approximation in α to this update is the BLR from Khan and Rue (2021).
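
To illustrate why the exponential map of the affine group is more involved, here is a sketch for the 1-D case. It assumes the standard matrix representation of an affine element (a, b) as [[a, b], [0, 1]], with Lie-algebra elements [[x, y], [0, 0]]. The exponential then has the closed form (e^x, y (e^x − 1) / x), which the code checks against a truncated Taylor series. The function names and numeric values are illustrative.

```python
import math


def matmul(p, q):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def expm_taylor(m, terms=30):
    """Matrix exponential of a 2x2 matrix via its truncated Taylor series."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    term = [[1.0, 0.0], [0.0, 1.0]]
    for k in range(1, terms):
        # term now holds m^k / k!
        term = [[v / k for v in row] for row in matmul(term, m)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result


def affine_exp(x, y):
    """Closed-form exp of the Lie-algebra element (x, y) of the 1-D affine group."""
    scale = math.exp(x)
    shift = y * (math.expm1(x) / x if x != 0.0 else 1.0)
    return scale, shift


x, y = 0.7, -1.2
series = expm_taylor([[x, y], [0.0, 0.0]])
scale, shift = affine_exp(x, y)
print(series[0], (scale, shift))
```

The scale part is a plain exponential, but the shift part is coupled to the scale through the factor (e^x − 1) / x. To first order in x this factor is 1, so small steps look like independent additive updates of scale and shift.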

Numerical experiments

Compare Lie-group BLR to existing methods
Report performance for predictive marginal probability
Compute probability from optimal group element
Approximate integral using 32 samples from q0
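
The sample-based approximation can be sketched as follows: draw 32 samples θ0 from q0, map them through the optimal group element, and average the model's predictive probabilities. The logistic model, the Gaussian q0, the diagonal affine element, and every number below are hypothetical stand-ins for the networks used in the experiments.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predictive_probability(x, g, num_samples=32, seed=0):
    """Average p(y=1 | x, theta) over theta = a * theta0 + b with theta0 ~ q0."""
    a, b = g
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        theta0 = [rng.gauss(0.0, 1.0) for _ in x]  # q0 = N(0, I), an assumption
        theta = [ai * t + bi for ai, t, bi in zip(a, theta0, b)]
        total += sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return total / num_samples


g_star = ([0.5, 0.5], [2.0, -1.0])  # hypothetical optimal group element
x = [1.0, 0.5]
p_marginal = predictive_probability(x, g_star)

# Plug-in prediction that ignores parameter uncertainty, for comparison.
p_plugin = sigmoid(sum(bi * xi for bi, xi in zip(g_star[1], x)))
print(p_marginal, p_plugin)
```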

Additive vs. multiplicative learning

Additive and multiplicative group updates are compared when applied to neural network training
The algorithms are used to train an MLP and a CNN
Results are summarized in Table 1
Multiplicative learning leads to sparse, localized and compositional traits
Multiplicative learning improves negative log-likelihood (NLL) and expected calibration error (ECE)
Sparse nature of multiplicative filters is explained by entropy and mean of a distribution being tied together
Sharpness of filters is interpreted as neuronal task specialization
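
For readers unfamiliar with the two metrics, here is a minimal sketch of NLL and ECE for binary predictions. It uses the common equal-width binning definition of ECE; the bin count and the toy predictions are assumptions, and the paper's exact evaluation protocol is not stated in this summary.

```python
import math


def negative_log_likelihood(probs, labels):
    """Mean NLL for binary labels given predicted P(y = 1)."""
    return -sum(math.log(p if y == 1 else 1.0 - p)
                for p, y in zip(probs, labels)) / len(labels)


def expected_calibration_error(probs, labels, num_bins=10):
    """ECE: bin-weighted gap between mean confidence and accuracy."""
    bins = [[] for _ in range(num_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, 1.0 if pred == y else 0.0))
    ece = 0.0
    for items in bins:
        if items:
            mean_conf = sum(c for c, _ in items) / len(items)
            accuracy = sum(a for _, a in items) / len(items)
            ece += len(items) / len(labels) * abs(mean_conf - accuracy)
    return ece


probs = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]  # hypothetical predictions
labels = [1, 1, 1, 0, 1, 0]
ece = expected_calibration_error(probs, labels)
nll = negative_log_likelihood(probs, labels)
print(ece, nll)
```

Lower is better for both: NLL penalizes confident mistakes, while ECE measures how far stated confidence is from observed accuracy.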

The affine learning rule

The affine learning update from Sec. 4.3 is compared to state-of-the-art natural-gradient variational inference methods
The ResNet-20 architecture reaches around 91% accuracy when trained with SGD
Making the base distribution q0 more heavy-tailed improves calibration measures
The affine update works for any base distribution q0
No additional damping term required, easier to tune

Discussion

Proposes Lie-group BLR, an extension of BLR
Does not rely on specific parameterization of EFs
Enables gradient computations via reparametrization trick
Automatically keeps updates on manifold
Shows 3 use cases for algorithm design in deep learning

A. Mathematical details

Fisher metric is the second order differential of the objective function
It is obtained by perturbing the energy functional by a mean zero function
The linear term is the first differential and the quadratic term is the second differential
The second differential can be written as a bilinear form
The differential of the entropy term in dE(q) is calculated by integration by parts
The Lie-Group Bayesian Learning Rule is given by a scalar matrix
Algorithm 1 is for the additive group
Algorithm 2 is for the multiplicative group
Tangent vectors to Q are given as mean-zero functions
The Fisher information metric is calculated as a scalar matrix
The update rule is given by a step size
The group parameter g and the natural parameter λ are componentwise reciprocals of each other
Log-normal distributions with a fixed variance parameter fit into this family scheme
The Lie algebra of G is given by vectors X ∈ R^P
The Fisher bilinear form is independent of g and only depends on q0
