Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Bayesian Learning Rule provides a framework for generic algorithm design
- Difficult to use due to parameterization, gradients, and updates
- Extension based on Lie-groups simplifies difficulties
- New algorithm for deep learning with desirable attributes
- Exploits Lie-group structures for new algorithm design
Paper Content
Introduction
- Bayesian Learning Rule (BLR) provides a general framework to derive algorithms from optimization, deep learning, and graphical models
- BLR uses natural-gradient descent to find approximations of the generalized posterior distribution
- BLR has been used to design new algorithms for uncertainty estimation in deep learning
- BLR can be difficult to use for three reasons
- Extension of BLR based on Lie-groups proposed to address difficulties
- Lie-group BLR uses group’s exponential map to update candidate distributions
- Gradient computations simplified by reparameterization trick
- Update naturally stays within the manifold
- Use cases for algorithm design in deep learning using additive, multiplicative, and affine groups
- New algorithm with multiplicative group gives rise to networks with nodes that are forced to be either excitatory or inhibitory
The bayesian learning rule
- BLR aims to find a posterior candidate in a space of candidate distributions
- Balancing the two terms requires an exploration-exploitation tradeoff
- Problem can be rewritten as an inference problem
- When the loss corresponds to the log-joint distribution of a Bayesian model, the solution coincides with the posterior distribution
- BLR is a natural-gradient descent algorithm
- BLR can recover many existing algorithms from a variety of fields
- Design of new algorithms is possible
- BLR can be difficult to use in many cases
- Computing the gradient with respect to µ is not always straightforward
- λ obtained by BLR may not always be valid natural parameters
The lie-group bayesian learning rule
- Proposing a Lie-group based extension of the BLR
- Describing Lie groups and their actions
- Parameterization and exponential map
- Deriving the new learning rule
Lie groups and their actions
- Lie-group is a set with a binary operation that satisfies associativity, identity element and inverses
- Smooth manifold is locally diffeomorphic to Euclidean space
- Examples of Lie-groups include (R, +) and (R >0 , ×)
- Cartesian product of two Lie-groups is also a Lie-group
- Action of Lie-group on parameter-space is a smooth map
- Example of action is (A, b) • θ = Aθ + b
Lie group parametrization
- G is an action on a space of measures
- Pushforwards are used to define another action on the space of measures
- A base distribution q0 is given with positive density
- The space of candidate distributions Q is the orbit of q0 under the action of G
- Every q in Q can be parametrized by group elements g
- Examples of EFs that can be parameterized this way include Gaussian and Bernoulli distributions
- This parameterization is useful for using non-EF distributions such as the Laplace distribution
The exponential map and lie group updates
- Goal is to find a group element g* that minimizes a given energy function
- Exponential map is used to move in the direction of fastest descent
- Exponential map is a smooth function that folds the tangent space at identity to the group
- For diagonal matrices, exponential map is given by Taylor series
- Update of the form g ← g exp(−αX) is used to move in the direction of X with a step-size of α
Simplifying gradients through reparametrization
- We will use the group’s exponential map to derive a new learning rule.
- We start by showing the simplification of the gradient computation by using a change of variable.
- We parametrize T qg Q by the vectors in the Lie algebra and then compute the differential of E on vectors expressed in this way.
- We use linear maps dL g (e) : T e G → T g G between the vector spaces.
- We compute the perturbations of E at q g by the tangent vectors.
- We use a reparameterization trick to calculate the differential of E at q g .
- We update the g using the exponential map.
The new learning rule
- Lie-Group BLR uses an update to stay within the manifold
- Operator is the Riemannian manifold analogue of the transpose
- Metric is a positive-definite, non-degenerate, symmetric, bilinear form on the tangent spaces of Q
- Fastest direction can be obtained using ∇ θ
- Fisher metric is used because it arises as the second-order differential of the objective function
- Momentum can be included in Lie-Group BLR
New algorithms for deep learning
- Lie-group BLR can be used to design new algorithms for deep learning
- Lie-group BLR is extended from BLR
- Lie-group BLR uses additive, multiplicative, and affine groups
- Linear approximation of the map coincides with BLR
- Lie-group BLR generalizes the update obtained by Khan and Rue (2021)
- Lie-group BLR can be used to train networks with desirable biologically-plausible attributes
- Linearization of Lie-group BLR recovers BLR update for Rayleigh distributions
- Lie-group BLR can be written in terms of λ by squaring and reciprocating
- BLR update for Rayleigh distributions can be simplified to coincide with Lie-group BLR
The diagonal affine group
- The affine group combines translations and scaling.
- The exponential map for this group is more complicated.
- Choosing q0 as a Dirac delta measure gives us gradient descent.
- If q0 is chosen as the normal distribution N (0, I) then Q is an exponential family.
- The linear approximation in α to this update is the BLR from Khan and Rue (2021).
Numerical experiments
- Compare Lie-group BLR to existing methods
- Report performance for predictive marginal probability
- Compute probability from optimal group element
- Approximate integral using 32 samples from q0
Additive vs. multiplicative learning
- Additive and multiplicative group updates are compared when applied to neural network training
- Algorithms are used to train a MLP and CNN
- Results are summarized in Table 1
- Multiplicative learning leads to sparse, localized and compositional traits
- Multiplicative learning improves NLL and ECE
- Sparse nature of multiplicative filters is explained by entropy and mean of a distribution being tied together
- Sharpness of filters is interpreted as neuronal task specialization
The affine learning rule
- Affine learning update from Sec. 4.3 compared to state-of-the-art natural-gradient variational inference methods
- ResNet-20 architecture reaches around 91% when trained with SGD
- Making q 0 distribution more heavy tailed improves calibration measures
- Affine update works for any base distribution q 0
- No additional damping term required, easier to tune
Discussion
- Proposes Lie-group BLR, an extension of BLR
- Does not rely on specific parameterization of EFs
- Enables gradient computations via reparametrization trick
- Automatically keeps updates on manifold
- Shows 3 use cases for algorithm design in deep learning
A mathematical details
- Fisher metric is the second order differential of the objective function
- It is obtained by perturbing the energy functional by a mean zero function
- The linear term is the first differential and the quadratic term is the second differential
- The second differential can be written as a bilinear form
- The differential of the entropy term in dE(q) is calculated by integration by parts
- The Lie-Group Bayesian Learning Rule is given by a scalar matrix
- Algorithm 1 is for the additive group
- Algorithm 2 is for the multiplicative group
- Tangent vectors to Q are given as mean-zero functions
- The Fisher information metric is calculated as a scalar matrix
- The update rule is given by a step size
- The group parameter g and the natural parameter λ are componentwise reciprocals of each other
- Log-normal distributions with a fixed variance parameter fit into this family scheme
- The Lie algebra of G is given by vectors X ∈ R P
- The Fisher bilinear form is independent of g and only depends on q0