Abstract

  • QTD is a distributional reinforcement learning algorithm
  • QTD has been successful in large-scale applications
  • QTD updates are non-linear and may have multiple fixed points
  • This paper provides a proof of convergence to the fixed points of a related family of dynamic programming procedures with probability 1

Paper Content

Introduction

  • Distributional reinforcement learning predicts the full probability distribution of future returns
  • A widely-used family of algorithms for this is based on learning quantiles of the return distribution
  • This approach has been successful in combination with deep reinforcement learning
  • Little is known about its behaviour from a theoretical viewpoint
  • QTD updates rely on asymmetric L1 losses
  • This paper proves the convergence of QTD
  • The proof uses stochastic approximation theory with differential inclusions
  • The paper also analyses the limit points of QTD and bounds the approximation error

Background

  • Introduce background concepts
  • Introduce notation

Markov decision processes

  • Finite state and action spaces
  • Transition kernel and reward distribution
  • Discount factor
  • Policy determines trajectory distribution

Predicting expected returns and the return distribution

  • Return is a random variable with sources of randomness from actions, state transitions, and rewards
  • A single scalar summary of performance is given by the expectation of the return
  • Distributional reinforcement learning is concerned with learning the probability distribution over returns
  • Learning the probability distribution provides more signal for an agent to learn from
  • Distributional reinforcement learning algorithms typically work with a subset of distributions that are amenable to parametrisation on a computer
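
To make the distinction between the expected return and the return distribution concrete, here is a minimal sketch that samples discounted returns by simulation. The environment (a single state with i.i.d. Bernoulli rewards), the truncation horizon, and the name `sample_returns` are assumptions chosen for illustration, not anything taken from the paper.

```python
import numpy as np

def sample_returns(num_samples, gamma=0.5, p=0.5, horizon=50, seed=0):
    """Sample truncated discounted returns with i.i.d. Bernoulli(p) rewards (toy assumption)."""
    rng = np.random.default_rng(seed)
    rewards = rng.binomial(1, p, size=(num_samples, horizon)).astype(float)
    discounts = gamma ** np.arange(horizon)
    return rewards @ discounts  # one discounted return per simulated trajectory

returns = sample_returns(20_000)
expected_return = returns.mean()                           # the single scalar summary
return_quantiles = np.quantile(returns, [0.1, 0.5, 0.9])   # a coarse view of the distribution
```

In this toy setting the expected return is p / (1 - γ) = 1, while the quantiles describe how the returns spread around that value.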

Monte Carlo and temporal-difference learning

  • TD learning is an approximation of Monte Carlo learning
  • TD learning uses a bootstrapped approximation of the return
  • TD learning is lower variance than Monte Carlo learning
  • TD learning shares information across states
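
For reference, the standard tabular forms of the two updates for expected returns can be written as follows. The step size α, the sampled return G from state x, and the sampled transition (r, x') are notation assumed here rather than taken from the paper.

```latex
\begin{align*}
\text{Monte Carlo:} \quad & V(x) \leftarrow V(x) + \alpha \bigl( G - V(x) \bigr), \\
\text{TD:} \quad & V(x) \leftarrow V(x) + \alpha \bigl( r + \gamma V(x') - V(x) \bigr).
\end{align*}
```

The TD target r + γV(x') replaces the full sampled return G with a one-step reward plus the current estimate at the next state, which is the bootstrapping referred to above.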

Quantile temporal-difference learning and quantile dynamic programming

  • Presents the main algorithms under study
  • Covers quantile regression, QTD, and quantile dynamic programming (QDP) in turn

Quantile regression

  • Monte Carlo algorithm can be adapted to learn about the return distribution
  • Probability distribution representation is used to approximate the return distribution
  • Quantile temporal-difference learning uses an equally-weighted mixture of Dirac deltas
  • Quantile-based approach aims to have particle locations approximate certain quantiles of η^π(x)
  • Generalised inverse CDF of ν is used to uniquely specify a quantile for each level τ
  • Quantile regression loss is used to estimate τ-quantiles of the return distribution
  • Negative gradient of the loss is used to motivate an update rule
  • Update rule is essentially the application of stochastic gradient descent method for quantile regression
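
The update rule described above can be sketched as stochastic gradient descent on the quantile regression loss. This is a minimal illustration under stated assumptions: the quantile levels τ_i = (2i - 1)/(2m), the step-size schedule 1/k^0.75, the standard normal sample distribution, and the function names are all choices made for this example rather than details confirmed by the summary.

```python
import numpy as np

def quantile_levels(m):
    """Midpoint quantile levels tau_i = (2i - 1) / (2m) (assumed convention)."""
    return (2 * np.arange(1, m + 1) - 1) / (2 * m)

def quantile_regression_sgd(samples, m, step_exponent=0.75):
    """Estimate m quantiles from a stream of samples, one SGD step per sample."""
    taus = quantile_levels(m)
    theta = np.zeros(m)
    for k, z in enumerate(samples, start=1):
        alpha = 1.0 / k ** step_exponent
        # Negative gradient of the quantile loss: tau_i - 1{z < theta_i}.
        theta += alpha * (taus - (z < theta))
    return theta

rng = np.random.default_rng(0)
samples = rng.standard_normal(50_000)
theta = quantile_regression_sgd(samples, m=4)
empirical = np.quantile(samples, quantile_levels(4))  # reference values for comparison
```

Each estimate moves up by α·τ_i when the sample lies above it and down by α·(1 - τ_i) when the sample lies below it, so it settles where a fraction τ_i of samples fall below.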

Quantile temporal-difference learning

  • Quantile temporal-difference learning algorithm (QTD) is a modification of the Monte Carlo algorithm
  • QTD uses an approximate sample from the return distribution derived from an observed transition
  • QTD updates its parameters on the basis of sample transitions
  • QTD updates include a term τ_i that is distinct for each quantile index i
  • QTD provides theoretical guarantees as to what the algorithm will converge to
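
A minimal sketch of a tabular QTD update follows, assuming the commonly used form in which θ(x, i) is moved by α times the average over j of τ_i - 1{r + γθ(x', j) < θ(x, i)}. The three-state episodic chain, the convention that the terminal state's row stays at zero, the step-size schedule, and all function names are hypothetical choices for illustration.

```python
import numpy as np

def quantile_levels(m):
    return (2 * np.arange(1, m + 1) - 1) / (2 * m)

def qtd_row_update(theta, x, r, x_next, alpha, gamma):
    """Return the updated quantile estimates for state x after one transition."""
    m = theta.shape[1]
    targets = r + gamma * theta[x_next]  # m bootstrapped samples of the return
    # below[i]: fraction of targets strictly below the current estimate theta[x, i]
    below = (targets[None, :] < theta[x][:, None]).mean(axis=1)
    return theta[x] + alpha * (quantile_levels(m) - below)

def run_qtd(num_steps=30_000, m=5, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((3, m))  # row 2 is the terminal state and is never updated (assumption)
    for k in range(1, num_steps + 1):
        alpha = 1.0 / k ** 0.75
        new_theta = theta.copy()
        # State 0: deterministic reward 1, next state 1.
        new_theta[0] = qtd_row_update(theta, 0, 1.0, 1, alpha, gamma)
        # State 1: Uniform(0, 1) reward, then termination.
        new_theta[1] = qtd_row_update(theta, 1, rng.uniform(), 2, alpha, gamma)
        theta = new_theta  # both states were updated from the same old estimates
    return theta

theta = run_qtd()
```

In this example the estimates at state 1 should approach the quantiles of the Uniform(0, 1) reward, and those at state 0 should approach 1 + γ times the state-1 estimates, since the transition out of state 0 is deterministic.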

Motivating examples

  • QTD can exhibit a variety of behaviours in small environments
  • QTD can be used to estimate Gaussian random variables
  • QTD can be used to estimate the median of a distribution
  • QTD can converge to a point or a set depending on the environment
  • QTD’s behaviour can be affected by the ‘smoothness’ of the reward distributions
  • QTD can perform a random walk over a set in certain environments

Quantile dynamic programming

  • QTD update given in Equation (10) moves θ(x, i) in the direction of the τ_i-quantiles of the distribution of the random variable R + γθ(X′, J).
  • Quantile dynamic programming (QDP) calculates these quantiles iteratively.
  • QDP is parametrised by interpolation parameters λ ∈ [0, 1]^{X × [m]}.
  • QTD approximates the behaviour of QDP without requiring access to transition structure and reward distributions.
  • QTD and QDP have the same asymptotic behaviour.
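
The iterative quantile computation can be sketched for a small Markov reward process with known transitions. This is an illustrative sketch under several assumptions not stated in the summary: rewards are deterministic per state, a single scalar `lam` is shared by all (x, i) pairs, and the interpolation is taken as (1 - lam) times the smallest τ_i-quantile plus lam times the largest. The helper names are hypothetical.

```python
import numpy as np

def quantile_interval(atoms, weights, tau, tol=1e-12):
    """Return the smallest and largest tau-quantiles of a weighted discrete distribution."""
    order = np.argsort(atoms)
    sorted_atoms, cdf = atoms[order], np.cumsum(weights[order])
    last = len(sorted_atoms) - 1
    # First atom whose CDF value is >= tau, and first atom whose CDF value is > tau.
    lower = sorted_atoms[min(np.searchsorted(cdf, tau - tol, side="left"), last)]
    upper = sorted_atoms[min(np.searchsorted(cdf, tau + tol, side="right"), last)]
    return lower, upper

def qdp_step(theta, P, r, gamma, lam):
    """One sweep of QDP for a Markov reward process with deterministic rewards r[x]."""
    n, m = theta.shape
    taus = (2 * np.arange(1, m + 1) - 1) / (2 * m)
    new_theta = np.empty_like(theta)
    for x in range(n):
        atoms = (r[x] + gamma * theta).ravel()  # one atom per (next state, index j)
        weights = np.repeat(P[x], m) / m        # probability of each atom
        for i, tau in enumerate(taus):
            lower, upper = quantile_interval(atoms, weights, tau)
            new_theta[x, i] = (1 - lam) * lower + lam * upper
    return new_theta

def run_qdp(theta, P, r, gamma, lam, num_sweeps=100):
    for _ in range(num_sweeps):
        theta = qdp_step(theta, P, r, gamma, lam)
    return theta

P = np.array([[0.5, 0.5], [0.5, 0.5]])
r = np.array([0.0, 1.0])
fixed_point_a = run_qdp(np.zeros((2, 2)), P, r, gamma=0.5, lam=0.5)
fixed_point_b = run_qdp(np.full((2, 2), 10.0), P, r, gamma=0.5, lam=0.5)
```

Starting from two different initial tables, the sweeps should arrive at the same table for a fixed `lam`. In this example the quantile levels land exactly on flat regions of the target CDF, so changing `lam` changes the result.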

Convergence of quantile dynamic programming

  • QDP decomposes into several operators
  • Algorithm 2 manipulates tables of estimated quantiles
  • Distribution associated with quantiles can be referred to
  • Transformation performed by Algorithm 2 has two parts
  • First part assigns distribution of reward and next state
  • Second part approximates and projects the distribution
  • Operator Π_λ T^π is equivalent to QDP
  • Understanding long-term behaviour of QDP requires understanding Π_λ T^π

Convergence analysis

  • Π_λ T^π is a contraction mapping with respect to an appropriate metric over return-distribution functions
  • The Wasserstein-∞ metric w∞ and its extension to return-distribution functions are used
  • T^π is a γ-contraction with respect to w∞
  • Π_λ cannot expand distances as measured by w∞
  • Hence the composition Π_λ T^π is a γ-contraction with respect to w∞
  • QDP and QTD converge to the same fixed points
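
The contraction argument can be written out in two steps. Here w∞ is also used for its extension to return-distribution functions, and η, η' denote two arbitrary return-distribution functions; this notation is assumed for the sketch.

```latex
\begin{align*}
w_\infty\bigl(\Pi_\lambda T^\pi \eta,\; \Pi_\lambda T^\pi \eta'\bigr)
  &\le w_\infty\bigl(T^\pi \eta,\; T^\pi \eta'\bigr)
     && \text{($\Pi_\lambda$ does not expand distances)} \\
  &\le \gamma\, w_\infty(\eta, \eta')
     && \text{($T^\pi$ is a $\gamma$-contraction)}
\end{align*}
```

Provided the underlying metric space is complete, Banach's fixed-point theorem then gives a unique fixed point of Π_λ T^π for each λ, and repeated application of the operator converges to it at a geometric rate γ.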

Convergence of quantile temporal-difference learning

  • The analysis focuses on a synchronous version of QTD
  • Synchronous QTD updates every state at each step using independent transitions
  • Theorem 5.1 states that QTD converges almost surely to the set of fixed points of the projected distributional Bellman operators
  • The theorem does not require finite-variance conditions on rewards
  • Step size conditions are weaker than the typical Robbins-Monro conditions used in classical TD analyses
  • The proof is based on the ODE method for stochastic approximation
  • The connection to differential equations and differential inclusions is elucidated

The QTD differential equation

  • Equation (13) yields an expected increment
  • Assumption 5.2 guarantees that two “difficult” cases of flat and vertical regions of CDFs do not arise
  • Equation (13) is a noisy discretisation of the differential equation referred to as the QTD differential equation
  • Assumption 5.2 guarantees the global existence and uniqueness of solutions to the QTD differential equation
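
As a sketch of where the differential equation comes from, assume the QTD update takes the common form shown in the first line below, with J uniform on {1, ..., m} and F_{x,θ} denoting the CDF of R + γθ(X', J) given the current state x. These symbols are assumptions made for this illustration and may differ from the paper's own notation.

```latex
\begin{align*}
\theta(x,i) &\leftarrow \theta(x,i)
   + \alpha \, \frac{1}{m} \sum_{j=1}^{m}
     \Bigl( \tau_i - \mathbf{1}\bigl\{ r + \gamma\,\theta(x', j) < \theta(x,i) \bigr\} \Bigr), \\
\mathbb{E}\bigl[\text{increment of } \theta(x,i)\bigr]
  &= \alpha \Bigl( \tau_i - \mathbb{P}\bigl( R + \gamma\,\theta(X', J) < \theta(x,i) \bigr) \Bigr), \\
\partial_t \theta_t(x,i)
  &= \tau_i - F_{x,\theta_t}\bigl( \theta_t(x,i) \bigr).
\end{align*}
```

The last line replaces the strict-inequality probability with the CDF, which is valid when the CDF has no jumps, so that the probabilities of "<" and "≤" coincide.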

The QTD differential inclusion

  • Assumption 5.2 is lifted, causing complications.
  • Right-hand side of QTD ODE is modified to a strict inequality.
  • Right-hand side of differential equation is not continuous, so solutions may not exist.
  • Filippov’s proposal is used to relax the definition of the dynamics at points of discontinuity.
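
The discontinuity can be seen in a scalar toy case, sketched here under the assumption that the target distribution is a single Dirac mass, so the expected increment is τ - 1{atom < θ}. The function names and the Euler discretisation are illustrative choices only.

```python
def expected_increment(theta, atom, tau):
    """Expected QTD-style increment when the target is a single Dirac mass at `atom`."""
    return tau - float(atom < theta)

def euler_path(theta0, atom, tau, step, num_steps):
    """Euler discretisation of d(theta)/dt = expected_increment(theta)."""
    theta, path = theta0, [theta0]
    for _ in range(num_steps):
        theta += step * expected_increment(theta, atom, tau)
        path.append(theta)
    return path

path = euler_path(theta0=0.0, atom=1.0, tau=0.5, step=0.01, num_steps=2000)
```

The velocity jumps from τ (at or below the atom) to τ - 1 (above it) and is never zero, so the discretised path chatters around the atom instead of resting on it. A Filippov-style relaxation allows any velocity in the interval [τ - 1, τ] at the atom itself; this interval contains 0, which lets a solution remain there.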

Solutions of differential inclusions

  • Definition 5.4: Path (z_t) is a solution to the differential inclusion ∂_t z_t ∈ H(z_t) if there is an integrable function g: [0, ∞) → R^n such that z_t = z_0 + ∫_0^t g_s ds for all t ≥ 0 and g_t ∈ H(z_t) for almost all t ≥ 0
  • Proposition 5.5: Set-valued map H: R^n ⇒ R^n is a Marchaud map with constant C > 0
  • Global solutions to differential inclusion guaranteed under any initial conditions
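
For readers unfamiliar with the term, the usual definition of a Marchaud map, as used in the stochastic approximation literature following Benaïm et al. (2005), is sketched below. The paper's exact wording may differ.

```latex
A set-valued map $H : \mathbb{R}^n \rightrightarrows \mathbb{R}^n$ is Marchaud if
\begin{enumerate}
  \item its graph $\{(z, h) : h \in H(z)\}$ is closed;
  \item $H(z)$ is non-empty, compact, and convex for every $z \in \mathbb{R}^n$;
  \item there is a constant $C > 0$ with
        $\sup_{h \in H(z)} \|h\| \le C\,(1 + \|z\|)$ for every $z \in \mathbb{R}^n$.
\end{enumerate}
```

The linear growth bound in the third condition is what rules out finite-time blow-up, and so allows solutions to be extended to all of [0, ∞).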

Asymptotic behaviour of differential inclusion trajectories

  • Goal is to show trajectories of QTD differential inclusion approach fixed points of QDP
  • Lyapunov function is a key tool
  • Definition of Lyapunov function based on Benaïm et al. (2005)
  • Lyapunov function decreases along trajectories of differential inclusion
  • Lyapunov function is minimal on Λ (set of fixed points of family of QDP algorithms)

QTD as a stochastic approximation to the QTD differential inclusion

  • Theorem 5.1 is proven using Theorem 5.7 and Proposition 5.8
  • Theorem 5.7 requires a Marchaud map, Lyapunov function, and a sequence of iterates
  • The iterates must satisfy a martingale difference condition and be bounded
  • The boundedness of the iterates is proven in Proposition 5.8
  • The Lyapunov function is constructed in Proposition 5.11
  • Relaxation of the dynamics is required to define a valid continuous-time dynamical system
  • Relaxing the dynamics too much results in too many solutions that don’t exhibit the desired behaviour

A Lyapunov function for the QDP fixed points

  • A Lyapunov function exists to prove Theorem 5.1
  • First consider the case where Assumption 5.2 holds, so that all projections Π_λ behave identically on the image of T^π
  • A Lyapunov function is given by Proposition 5.10
  • The proof of Proposition 5.10 shows that the expected update under QTD moves the estimate in the same direction as gradient descent on a squared loss from the fixed point
  • Proposition 5.11 states the Lyapunov result in the general case

Extension to asynchronous QTD

  • QTD is usually implemented in a synchronous way, but can also be done asynchronously.
  • Asynchronous QTD updates a single state at a time.
  • Step size β_{x,k} depends on both x and k.
  • Updates can be done online or from a replay buffer.
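
An online, asynchronous variant can be sketched as follows: only the currently visited state is updated, and its step size is computed from that state's own update count. The episodic chain, the random start state, the schedule β = 1/N_x^0.75, and the function names are hypothetical choices for illustration; the update form is the commonly used QTD rule, assumed here.

```python
import numpy as np

def quantile_levels(m):
    return (2 * np.arange(1, m + 1) - 1) / (2 * m)

def async_qtd(num_episodes=40_000, m=5, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    taus = quantile_levels(m)
    theta = np.zeros((3, m))         # row 2: terminal state, never updated (assumption)
    visits = np.zeros(3, dtype=int)  # per-state update counts
    for _ in range(num_episodes):
        x = 0 if rng.random() < 0.3 else 1  # hypothetical start-state distribution
        while x != 2:
            # State 0 -> 1 with reward 1; state 1 -> terminal with Uniform(0, 1) reward.
            r, x_next = (1.0, 1) if x == 0 else (rng.uniform(), 2)
            visits[x] += 1
            beta = 1.0 / visits[x] ** 0.75  # step size depends on this state's own count
            targets = r + gamma * theta[x_next]
            below = (targets[None, :] < theta[x][:, None]).mean(axis=1)
            theta[x] = theta[x] + beta * (taus - below)
            x = x_next
    return theta, visits

theta, visits = async_qtd()
```

Because state 0 is visited less often than state 1, a step size that depended only on a global counter would shrink faster than state 0 accumulates updates; indexing the schedule by the per-state count avoids this.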

Analysis of the QTD limit points

  • QTD/QDP limit points will generally not coincide with the true return distributions
  • Each return-distribution function is in the image of the projection
  • Approximation error is not immediately clear
  • Wasserstein-1 metric is used to measure approximation error
  • Increasing m can make errors accumulated in dynamic programming arbitrarily small
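
The effect of m on the representation error can be illustrated on a single distribution. The sketch below measures only the one-step projection error, in Wasserstein-1 distance, of approximating Uniform(0, 1) by m equally weighted atoms at the levels τ_i = (2i - 1)/(2m). It does not reproduce the paper's bound on the fixed-point error; the uniform target and the function name are assumptions chosen because the answer has a simple closed form.

```python
import numpy as np

def w1_projection_error_uniform(m, grid_size=100_000):
    """W1 distance between Uniform(0, 1) and its m-atom quantile approximation."""
    # W1 equals the integral of |F^{-1}(u) - G^{-1}(u)| over u in (0, 1).
    u = (np.arange(grid_size) + 0.5) / grid_size  # midpoint grid; also F^{-1}(u) for Uniform(0, 1)
    taus = (2 * np.arange(1, m + 1) - 1) / (2 * m)
    atom_index = np.floor(u * m).astype(int)      # atom that covers quantile level u
    return np.abs(u - taus[atom_index]).mean()

errors = {m: w1_projection_error_uniform(m) for m in (1, 2, 5, 10, 50)}
```

For this target the error works out to 1/(4m), so it shrinks in proportion to 1/m as more atoms are used.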

Instance-dependent bounds

  • Worst-case projection error is assumed in all states
  • Quality of fixed point can be improved in environments where this is not the case
  • Example of instance-dependent quality bound is described
  • Practical advice for practitioners: use quantile representation in environments with mostly deterministic transitions
  • Variants of Proposition 6.3 are possible

Qualitative analysis of QDP fixed points

  • QTD and QDP can learn an arbitrarily accurate approximation of the return-distribution function with enough atoms.
  • In a two-state Markov decision process, QDP with m = 2 heavily skews the distribution to the right.
  • QDP behaves like an affine policy evaluation operator on X × [m] locally around the fixed point.
  • Visualizing which particles are assigned to one another by a QDP operator application can help understand the behaviour of QDP.
  • Increasing m prevents pathological self-loops/small cycles from “leaking out” and degrading the quality of other quantile estimates.
  • Local quantile back-up diagrams can be used to develop intuition and further analysis of QDP and QTD.

Related work

  • The ODE method for stochastic approximation was introduced by Ljung (1977) and developed further by Kushner and Clark (1978); Benaïm et al. (2005) extended it to differential inclusions
  • Differential inclusions have been used in a variety of fields, including control theory, economics, differential game theory, and mechanics
  • Quantile regression was introduced by Koenker and Bassett (1978)
  • Quantile temporal-difference learning combines quantile regression with the bootstrapping approach
  • Distributional reinforcement learning based on quantiles was introduced by Dabney et al. (2018b)

Conclusion

  • QTD is a popular and effective distributional reinforcement learning algorithm
  • Analysis of QTD requires tools from differential inclusions and stochastic approximation theory
  • The analysis uses weaker conditions than are typical in analyses of TD algorithms
  • Establishes soundness of QTD
  • Further research into theory, practice and applications of QTD is important
  • Interaction of QTD with function approximation and adaptive optimization should be studied

Appendix

  • Proof of Proposition 4.3 and Theorem 5.7 provided
  • Intuition of proof is that structure of QTD differential inclusion moves furthest coordinates back towards the origin
  • Martingale noise cannot cause divergence
  • Notation introduced to compare successive iterates
  • Lemma A.1 states that infimum over λ in equation is attained
  • Lyapunov function is continuous, non-negative and takes on the value 0 only on the set of fixed points
  • Decreasing property of Lyapunov function established
  • Reasoning about effects of perturbing λ to show that Lyapunov function is decreasing