Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- QTD is a distributional reinforcement learning algorithm
- QTD has been successful in large-scale applications
- QTD updates are non-linear and may have multiple fixed points
- This paper provides a proof of convergence to the fixed points of a related family of dynamic programming procedures with probability 1
Paper Content
Introduction
- Distributional reinforcement learning predicts the full probability distribution of future returns
- A widely-used family of algorithms for this is based on learning quantiles of the return distribution
- This approach has been successful in combination with deep reinforcement learning
- Little is known about its behaviour from a theoretical viewpoint
- QTD updates rely on asymmetric L1 losses
- This paper proves the convergence of QTD
- The proof uses stochastic approximation theory with differential inclusions
- The paper also analyses the limit points of QTD and bounds the approximation error
Background
- Introduce background concepts
- Introduce notation
Markov decision processes
- Finite state and action spaces
- Transition kernel and reward distribution
- Discount factor
- Policy determines trajectory distribution
Predicting expected returns and the return distribution
- Return is a random variable with sources of randomness from actions, state transitions, and rewards
- A single scalar summary of performance is given by the expectation of the return
- Distributional reinforcement learning is concerned with learning the probability distribution over returns
- Learning the probability distribution provides more signal for an agent to learn from
- Distributional reinforcement learning algorithms typically work with a subset of distributions that are amenable to parametrisation on a computer
Monte carlo and temporal-difference learning
- TD learning is an approximation of Monte Carlo learning
- TD learning uses a bootstrapped approximation of the return
- TD learning is lower variance than Monte Carlo learning
- TD learning shares information across states
Quantile temporal-difference learning and quantile dynamic programming
- Presenting main algorithms of study
- Exploring computer science paper
Quantile regression
- Monte Carlo algorithm can be adapted to learn about the return distribution
- Probability distribution representation is used to approximate the return distribution
- Quantile temporal-difference learning uses an equally-weighted mixture of Dirac deltas
- Quantile-based approach aims to have particle locations approximate certain quantiles of η π (x)
- Generalised inverse CDF of ν is used to uniquely specify a quantile for each level τ
- Quantile regression loss is used to estimate τ-quantiles of the return distribution
- Negative gradient of the loss is used to motivate an update rule
- Update rule is essentially the application of stochastic gradient descent method for quantile regression
Quantile temporal-difference learning
- Quantile temporal-difference learning algorithm (QTD) is a modification of the Monte Carlo algorithm
- QTD uses an approximate sample from the return distribution derived from an observed transition
- QTD updates its parameters on the basis of sample transitions
- QTD updates include a distinct term τ i
- QTD provides theoretical guarantees as to what the algorithm will converge to
Motivating examples
- QTD can exhibit a variety of behaviours in small environments
- QTD can be used to estimate Gaussian random variables
- QTD can be used to estimate the median of a distribution
- QTD can converge to a point or a set depending on the environment
- QTD’s behaviour can be affected by the ‘smoothness’ of the reward distributions
- QTD can perform a random walk over a set in certain environments
Quantile dynamic programming
- QTD update given in Equation (10) moves θ(x, i) in the direction of the τ i-quantiles of the distribution of the random variable R + θ(X , J).
- Quantile dynamic programming (QDP) calculates these quantiles iteratively.
- QDP is parametrised by interpolation parameters λ ∈ [m].
- QTD approximates the behaviour of QDP without requiring access to transition structure and reward distributions.
- QTD and QDP have the same asymptotic behaviour.
Convergence of quantile dynamic programming
- QDP decomposes into several operators
- Algorithm 2 manipulates tables of estimated quantiles
- Distribution associated with quantiles can be referred to
- Transformation performed by Algorithm 2 has two parts
- First part assigns distribution of reward and next state
- Second part approximates and projects the distribution
- Operator Π λ T π is equivalent to QDP
- Understanding long-term behaviour of QDP requires understanding Π λ T π
Convergence analysis
- Π λ T π is a contraction mapping with respect to an appropriate metric over return-distribution functions
- Wasserstein-∞ metric and its extension to return-distribution functions is used
- T π is a γ-contraction with respect to w∞
- Π λ cannot expand distances as measured by w∞
- Π λ is a γ-contraction with respect to w∞
- QDP and QTD converge to the same fixed points
Convergence of quantile temporal-difference learning
- QTD is a synchronous version of a computer science algorithm
- QTD updates states using independent transitions
- Theorem 5.1 states that QTD converges almost surely to the set of fixed points of the projected distributional Bellman operators
- The theorem does not require finite-variance conditions on rewards
- Step size conditions are weaker than the typical Robbins-Monro conditions used in classical TD analyses
- The proof is based on the ODE method for stochastic approximation
- The connection to differential equations and differential inclusions is elucidated
The qtd differential equation
- Equation (13) yields an expected increment
- Assumption 5.2 guarantees that two “difficult” cases of flat and vertical regions of CDFs do not arise
- Equation (13) is a noisy discretisation of the differential equation referred to as the QTD differential equation
- Assumption 5.2 guarantees the global existence and uniqueness of solutions to the QTD differential equation
The qtd differential inclusion
- Assumption 5.2 is lifted, causing complications.
- Right-hand side of QTD ODE is modified to a strict inequality.
- Right-hand side of differential equation is not continuous, so solutions may not exist.
- Filippov’s proposal is used to relax the definition of the dynamics at points of discontinuity.
Solutions of differential inclusions
- Definition 5.4: Path (zt) is a solution to differential inclusion ∂t zt ∈ H(zt) if there is an integrable function g: [0, ∞) → Rn
- Proposition 5.5: Set-valued map H: Rn ⇒ Rn is a Marchaud map with constant C > 0
- Global solutions to differential inclusion guaranteed under any initial conditions
Asymptotic behaviour of differential inclusion trajectories
- Goal is to show trajectories of QTD differential inclusion approach fixed points of QDP
- Lyapunov function is a key tool
- Definition of Lyapunov function based on Benaïm et al. (2005)
- Lyapunov function decreases along trajectories of differential inclusion
- Lyapunov function is minimal on Λ (set of fixed points of family of QDP algorithms)
Qtd as a stochastic approximation to the qtd differential inclusion
- Theorem 5.1 is proven using Theorem 5.7 and Proposition 5.8
- Theorem 5.7 requires a Marchaud map, Lyapunov function, and a sequence of iterates
- The iterates must satisfy a martingale difference condition and be bounded
- The boundedness of the iterates is proven in Proposition 5.8
- The Lyapunov function is constructed in Proposition 5.11
- Relaxation of the dynamics is required to define a valid continuous-time dynamical system
- Relaxing the dynamics too much results in too many solutions that don’t exhibit the desired behaviour
A lyapunov function for the qdp fixed points
- A Lyapunov function exists to prove Theorem 5.1
- Assumption 5.2 holds, meaning all projections Π λ behave identically on the image of T π
- A Lyapunov function is given by Proposition 5.10
- The proof of Proposition 5.10 shows that the expected update under QTD moves the estimate in the same direction as gradient descent on a squared loss from the fixed point
- Proposition 5.11 states the Lyapunov result in the general case
Extension to asynchronous qtd
- QTD is usually implemented in a synchronous way, but can also be done asynchronously.
- Asynchronous QTD updates a single state at a time.
- Step size β x,k depends on both x and k.
- Updates can be done online or from a replay buffer.
Analysis of the qtd limit points
- QTD/QDP limiting points will not be the same as true return distribution
- Each return-distribution function is in the image of the projection
- Approximation error is not immediately clear
- Wasserstein-1 metric is used to measure approximation error
- Increasing m can make errors accumulated in dynamic programming arbitrarily small
Instance-dependent bounds
- Worst-case projection error is assumed in all states
- Quality of fixed point can be improved in environments where this is not the case
- Example of instance-dependent quality bound is described
- Practical advice for practitioners: use quantile representation in environments with mostly deterministic transitions
- Variants of Proposition 6.3 are possible
Qualitative analysis of qdp fixed points
- QTD and QDP can learn an arbitrarily accurate approximation of the return-distribution function with enough atoms.
- In a two-state Markov decision process, QDP with m = 2 heavily skews the distribution to the right.
- QDP behaves like an affine policy evaluation operator on X × [m] locally around the fixed point.
- Visualizing which particles are assigned to one another by a QDP operator application can help understand the behaviour of QDP.
- Increasing m prevents pathological self-loops/small cycles from “leaking out” and degrading the quality of other quantile estimates.
- Local quantile back-up diagrams can be used to develop intuition and further analysis of QDP and QTD.
Related work
- Stochastic approximation theory with differential inclusions was introduced by Ljung (1977) and extended by Kushner and Clark (1978)
- Differential inclusions have been used in a variety of fields, including control theory, economics, differential game theory, and mechanics
- Quantile regression was introduced by Koenker and Bassett (1978)
- Quantile temporal-difference learning combines quantile regression with the bootstrapping approach
- Distributional reinforcement learning based on quantiles was introduced by Dabney et al. (2018b)
Conclusion
- QTD is a popular and effective distributional reinforcement learning algorithm
- Analysis of QTD requires tools from differential inclusions and stochastic approximation theory
- Weaker conditions are used in the analysis of TD algorithms
- Establishes soundness of QTD
- Further research into theory, practice and applications of QTD is important
- Interaction of QTD with function approximation and adaptive optimization should be studied
- Proof of Proposition 4.3 and Theorem 5.7 provided
- Intuition of proof is that structure of QTD differential inclusion moves furthest coordinates back towards the origin
- Martingale noise cannot cause divergence
- Notation introduced to compare successive iterates
- Lemma A.1 states that infimum over λ in equation is attained
- Lyapunov function is continuous, non-negative and takes on the value 0 only on the set of fixed points
- Decreasing property of Lyapunov function established
- Reasoning about effects of perturbing λ to show that Lyapunov function is decreasing