Abstract
- Studies offline policy learning, which aims to learn an optimal decision rule for a given population
- Existing methods rely on a uniform overlap assumption, which can be unrealistic in many situations
- Proposed algorithm optimizes lower confidence bounds of policy values
- Data-dependent upper bound for suboptimality of algorithm only depends on overlap for optimal policy and complexity of policy class
- New self-normalized type concentration inequality developed for inverse-propensity-weighting estimators
Paper Content
Introduction
- Policy learning aims to create an optimal decision rule for a given population
- Policy learning is used in healthcare, advertising, and recommendation systems
- Offline data may be collected by executing a fixed and potentially suboptimal policy
- Alternatively, offline data may be collected by running an adaptive learning algorithm
- Policy learning involves two components: policy evaluation and policy optimization
- Good policy learning performance relies on accurate evaluation of all policies
- Uniform overlap assumption is often violated in practice
- Fundamental challenge of policy learning is counterfactual reasoning
- Estimation of policy value can have a huge variance if evaluated at out-of-distribution actions
- Quality of assessment of policy value varies across policies
- Need to quantify statistical confidence in policy value and incorporate into learning process
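The variance issue described above can be illustrated with a small simulation. This is a minimal sketch under assumed settings, not the paper's experiment: a two-armed problem without contexts, Bernoulli rewards with hypothetical means 0.4 and 0.6, and a plain inverse-propensity-weighting (IPW) estimate of the value of "always play arm 1". The function names `ipw_value` and `ipw_std` are made up for this example.

```python
import numpy as np

def ipw_value(actions, rewards, propensity, target_action):
    """IPW estimate of the value of the constant policy 'always play target_action'.

    `propensity` is the (known) probability with which the behavior policy
    logs `target_action`.
    """
    weights = (actions == target_action) / propensity
    return float(np.mean(weights * rewards))

def ipw_std(p_target, n=500, reps=500, seed=0):
    """Monte Carlo standard deviation of the IPW estimate for arm 1
    when arm 1 is logged with probability `p_target`."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(reps):
        actions = (rng.random(n) < p_target).astype(int)
        mean_reward = np.where(actions == 1, 0.6, 0.4)  # hypothetical arm means
        rewards = rng.binomial(1, mean_reward).astype(float)
        estimates.append(ipw_value(actions, rewards, p_target, 1))
    return float(np.std(estimates))

std_good_overlap = ipw_std(p_target=0.5)   # arm 1 is well covered
std_poor_overlap = ipw_std(p_target=0.02)  # arm 1 is rarely logged
print(std_good_overlap, std_poor_overlap)
```

Under these assumptions the estimate for the rarely logged arm is far noisier, which is why the quality of a value estimate differs across policies.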
A pessimism-based framework
- Proposed novel solution to offline policy learning with two ingredients: pessimism principle and generalized empirical Bernstein inequality
- Algorithm optimizes lower confidence bounds (LCBs) instead of point estimates of policy values
- A generic data-dependent upper bound on the suboptimality of the algorithm is established
- The upper bound involves only the complexity of the policy class and the estimation error of the optimal policy
- Oracle property of approach: adapts to overlap of optimal policy in data
- Developed uniform concentration inequality for offline policy evaluation
- Standard O(1/√T) rate for learning performance when e(X, π*(X)) is lower bounded by a constant
- Polynomial learning rate when propensity for optimal policy is decaying at a polynomial rate
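The difference between greedy and pessimistic selection can be seen in a toy comparison. The numbers below are hypothetical, and the "width" is a generic uncertainty proxy rather than the paper's regularizer; the sketch only shows how subtracting a policy-dependent penalty can change which policy is chosen.

```python
import numpy as np

def select_greedy(estimates):
    """Pick the policy with the largest point estimate."""
    return int(np.argmax(estimates))

def select_pessimistic(estimates, widths, beta):
    """Pick the policy with the largest lower confidence bound
    LCB = estimate - beta * width."""
    lcb = estimates - beta * widths
    return int(np.argmax(lcb))

# Hypothetical point estimates and uncertainty proxies for two policies:
# policy 1 looks better, but its estimate is much less reliable.
estimates = np.array([0.55, 0.70])
widths = np.array([0.02, 0.30])

greedy_choice = select_greedy(estimates)
pessimistic_choice = select_pessimistic(estimates, widths, beta=1.0)
print(greedy_choice, pessimistic_choice)
```

Here the greedy rule selects the poorly covered policy, while the LCB rule prefers the policy whose value is known more precisely.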
Background
- X and Y denote the space of contexts and rewards respectively
- A contextual bandit model C is specified by an action set A, a context distribution PX and a set of reward distributions
- Policy learning is done from an offline dataset collected from C
Data collecting process
- Context Xt is drawn from PX at each time t
- Action At is taken according to the behavior policy
- Reward Yt is µ(Xt, At) + εt
- Behavior policy is potentially adaptive
- At is the treatment option taken for Xt
- Yt(a) follows the population joint distribution
- Conditional independence condition: Yt(a) ⊥⊥ At | Xt, Ht
- Bounded reward: Yt ∈ [0, 1]
- Known behavior policy: et(x, a | Ht) for all (x, a) ∈ X × A are known to the learner
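The data collecting process above can be sketched as a short simulation. Everything specific here is an assumption made for illustration: contexts are uniform on [0, 1], rewards are Bernoulli (so they lie in [0, 1]), and the adaptive behavior policy is a simple epsilon-greedy rule that ignores the context. The names `collect_data`, `eps_greedy_behavior`, and `mu` are hypothetical.

```python
import numpy as np

def mu(x, a):
    """Hypothetical mean reward function with values in [0, 1]."""
    return 0.3 + 0.4 * x if a == 1 else 0.5

def eps_greedy_behavior(x, history, eps=0.2):
    """Adaptive behavior policy: returns e_t(x, . | H_t) as a probability vector.
    Uses per-arm empirical means from the history (context is ignored here)."""
    means = []
    for arm in (0, 1):
        ys = [y for (_, a, y, _) in history if a == arm]
        means.append(np.mean(ys) if ys else 0.5)
    probs = np.full(2, eps / 2)
    probs[int(np.argmax(means))] += 1 - eps
    return probs

def collect_data(T, behavior, seed=0):
    """Simulate T rounds and log (X_t, A_t, Y_t, e_t(X_t, A_t | H_t))."""
    rng = np.random.default_rng(seed)
    history = []
    for _ in range(T):
        x = float(rng.random())                 # X_t drawn from P_X (assumed uniform)
        probs = behavior(x, history)            # propensities given the history
        a = int(rng.choice(len(probs), p=probs))
        y = float(rng.binomial(1, mu(x, a)))    # reward with mean mu(X_t, A_t)
        history.append((x, a, y, float(probs[a])))
    return history

data = collect_data(200, eps_greedy_behavior)
print(data[:3])
```

Because the propensity of the chosen action is logged at each step, the learner has the "known behavior policy" information that the estimators rely on.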
Policy learning and performance metric
- Π is a collection of policies, each of which is a mapping from X to A
- The average policy value of a policy π in Π is defined using the contextual bandit model C
- The optimal policy π* maximizes the policy value
- The goal is to learn a policy π and measure the learning performance by its suboptimality
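For reference, the quantities in this list can be written out as follows. This is a simplified restatement that assumes contexts are i.i.d. from PX and that µ does not change over time; the paper's own notation may additionally index these quantities by the model C and the horizon T, and the symbol for suboptimality below is chosen for this note.

```latex
Q(\pi) = \mathbb{E}_{X \sim P_X}\bigl[\mu(X, \pi(X))\bigr],
\qquad
\pi^* \in \operatorname*{arg\,max}_{\pi \in \Pi} Q(\pi),
\qquad
\mathcal{L}(\hat{\pi}) = Q(\pi^*) - Q(\hat{\pi}) \;\ge\; 0 .
```

The suboptimality is nonnegative for any learned policy in Π because π* maximizes Q over Π.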
Policy class complexity
- Natarajan dimension is used to quantify the complexity of a policy class
- Natarajan dimension is a generalization of the VC dimension
- Finite upper bounds for the Natarajan dimension have been established for many policy classes
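To make the definition concrete, the Natarajan dimension of a tiny finite policy class can be computed by brute force. The sketch below uses the standard textbook definition of N-shattering: a set S is N-shattered if there are two labelings f1 and f2 that disagree at every point of S such that, for every subset of S, some policy agrees with f1 on the subset and with f2 off it. The function names are hypothetical, and the search is exponential, so it is only meant for toy examples.

```python
from itertools import combinations, product

def is_n_shattered(points, policies, actions):
    """Check whether `points` is N-shattered by `policies` (dicts: point -> action)."""
    m = len(points)
    # Behaviors of the policy class restricted to the candidate points.
    patterns = {tuple(pi[x] for x in points) for pi in policies}
    for f1 in product(actions, repeat=m):
        for f2 in product(actions, repeat=m):
            if any(a == b for a, b in zip(f1, f2)):
                continue  # f1 and f2 must differ at every point
            # Every mix (f1 on a subset, f2 on its complement) must be realized.
            if all(
                tuple(f1[i] if pick[i] else f2[i] for i in range(m)) in patterns
                for pick in product([True, False], repeat=m)
            ):
                return True
    return False

def natarajan_dimension(domain, policies, actions):
    """Largest size of an N-shattered subset of `domain` (0 if none)."""
    dim = 0
    for m in range(1, len(domain) + 1):
        if any(is_n_shattered(s, policies, actions) for s in combinations(domain, m)):
            dim = m
        else:
            break  # subsets of shattered sets are shattered, so we can stop
    return dim

# Constant policies over three actions: only single points can be shattered.
domain = [0, 1, 2]
actions = [0, 1, 2]
constant_policies = [{x: a for x in domain} for a in actions]
dim_constant = natarajan_dimension(domain, constant_policies, actions)

# All maps from two points to two actions: this reduces to the VC dimension.
all_binary = [dict(zip([0, 1], vals)) for vals in product([0, 1], repeat=2)]
dim_binary = natarajan_dimension([0, 1], all_binary, [0, 1])
print(dim_constant, dim_binary)
```

With two actions the definition coincides with VC shattering, which is the sense in which the Natarajan dimension generalizes the VC dimension.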
Related work
- Policy learning from a fixed class is an area at the intersection of economics, causal inference and statistical machine learning.
- Classical works focus on i.i.d. data collected by a fixed behavior policy that is known or unknown and estimable.
- These works all assume a uniform overlap condition, i.e., the propensity for all actions in the behavior policy is uniformly lower bounded by a constant.
- When the behavior policy is known, the uniform overlap assumption is not necessary for efficient policy learning.
- There is a recent interest in policy learning from adaptively collected data.
- This setting is more challenging because the observations are no longer i.i.d.
- To tackle these challenges, existing works assume the propensities for all actions are deterministically lower bounded over time.
- We introduce a novel algorithm for policy learning and show that the overlap of the optimal policy is sufficient for efficient policy learning.
- Pessimism was initially proposed in offline reinforcement learning for learning an optimal policy in Markov decision processes.
- Variance regularization has also appeared in statistical learning and empirical risk minimization.
- Our method extends the setting in this line of work to unbounded random variables and adaptively collected hence mutually dependent data.
- We develop a novel self-normalized-type concentration inequality for empirical risks.
- Our work is generally related to reweighting-based approaches for off-policy evaluation in offline RL and contextual bandits.
- We develop a novel empirical Bernstein’s inequality to provide uncertainty quantification for inverse-weighted estimators.
Pessimistic policy learning
- Introducing the pessimism principle to eliminate reliance on uniform overlap assumption
- Optimizing pessimistic policy value with policy-dependent regularization term
- Proposition 3.1 states that suboptimality of policy only depends on regularization term of optimal policy
- Learning performance guarantee and uniform concentration event differ from greedy algorithms in literature
- Seek data-dependent bound for every policy in optimization problem to improve upper bound
Construction of the estimator and the regularizer
- Instantiate generic idea from Section 3.2 into concrete algorithm
- Estimator Q(π) and regularizer R(π) for all π ∈ Π
- Estimator is AIPW-type estimator
- Regularizer is β · V(π)
- Scaling factor β depends on complexity of policy class Π
- V(π) is a proxy for the standard deviation of Q(π)
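The construction above can be sketched in a few lines for a finite policy class. The AIPW score used here is the standard doubly robust form, µ̂(Xt, a) + 1{At = a} / et(Xt, a) · (Yt − µ̂(Xt, a)). The uncertainty proxy `v` is a simplified stand-in chosen for this example and should not be read as the paper's exact V(π); `mu_hat`, `beta`, and the function names are likewise assumptions.

```python
import numpy as np

def aipw_scores(X, A, Y, e, mu_hat):
    """gamma[t, a] = mu_hat(X_t, a) + 1{A_t = a} / e[t, a] * (Y_t - mu_hat(X_t, a))."""
    T, K = e.shape
    m = np.array([[mu_hat(X[t], a) for a in range(K)] for t in range(T)])
    indicator = A[:, None] == np.arange(K)[None, :]
    return m + indicator / e * (Y[:, None] - m)

def pessimistic_policy(X, A, Y, e, mu_hat, policies, beta):
    """Return the index and LCB of the policy maximizing Q_hat(pi) - beta * V(pi)."""
    T = len(Y)
    rows = np.arange(T)
    gamma = aipw_scores(X, A, Y, e, mu_hat)
    best, best_lcb = None, -np.inf
    for idx, pi in enumerate(policies):
        acts = np.array([pi(x) for x in X])
        q_hat = gamma[rows, acts].mean()
        # Simplified proxy (an assumption): large when pi picks poorly covered actions.
        v = np.sqrt(np.sum(1.0 / e[rows, acts])) / T
        lcb = q_hat - beta * v
        if lcb > best_lcb:
            best, best_lcb = idx, lcb
    return best, float(best_lcb)

# Tiny hand-checkable example with two rounds and two actions.
X = np.array([0.0, 0.0])
A = np.array([0, 1])
Y = np.array([1.0, 0.0])
e = np.full((2, 2), 0.5)                 # known propensities e_t(X_t, a)
mu_hat = lambda x, a: 0.5                # hypothetical outcome model
policies = [lambda x: 0, lambda x: 1]    # "always 0" and "always 1"

gamma = aipw_scores(X, A, Y, e, mu_hat)
best, best_lcb = pessimistic_policy(X, A, Y, e, mu_hat, policies, beta=0.1)
print(gamma, best, best_lcb)
```

In this toy example both policies receive the same penalty because the propensities are uniform; with uneven propensities the penalty would be larger for policies that choose rarely logged actions.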
Learning from a fixed behavior policy
- Fixed behavior policy used to collect offline data
- Upper and lower bounds on suboptimality of pessimistic learning algorithm
Suboptimality upper bound
- Theorem 5.2 provides an upper bound for the suboptimality of pessimistic policy learning
- Assumption 5.1 is required for the theorem to hold
- The upper bound is data-dependent and is small whenever the value of the optimal policy can be estimated accurately
- Assumption 5.1 is mild and does not require any assumptions on the adaptive data collecting process
- Corollary 5.6 provides a polynomial upper bound on the suboptimality of pessimistic policy learning
- The upper bound improves upon the results developed in other papers
- The upper bound is efficient even when the assumptions in greedy learning do not hold
Pessimistic policy learning is minimax optimal
- Pessimistic policy learning is the best one can do given offline data from a fixed behavior policy
- Theorem 4.4 establishes a minimax lower bound for policy learning in a policy class
- The lower bound in Theorem 4.4 matches the upper bound in Corollary 4.3 up to logarithmic factors of T and K
Learning from adaptive behavior policies
- Pessimistic policy learning from adaptively collected data allows the behavior policy to change over time.
- An additional mild condition is imposed that the sampling probability has a deterministic lower bound.
Minimax optimality
- Investigated minimax optimality of pessimistic policy learning when data is adaptively collected
- Worked under polynomial decay setting for clear interpretability
- Fundamental difficulty of policy learning with adaptive behavior policy is captured by sampling propensities for optimal policy
- Minimax lower bound matches upper bound, illustrating key role of overlap of optimal policy in fundamental limit of learning from offline dataset
Proof sketch of main results
- Proof sketches are provided for the main results
- Section 6.1 focuses on the upper bound in the fixed policy case
- Section 6.2 focuses on the upper bound in the adaptive policy case
- Notation is simplified to omit dependence on C when context is clear
Proof of upper bound for the fixed policy case
- Estimation error for any (x, a) can be decomposed into two terms
- Symmetrization via Rademacher process to bound term (i)
- Lemma 6.1 bounds probability of term (i) exceeding a constant
- Lemma 6.2 establishes uniform upper bound for tail probability of Rademacher process
- Lemma 6.3 controls term (ii)
- Combining results from Lemmas 6.1, 6.2 and 6.3, Theorem 4.1 is proven
Proof of upper bound for the adaptive policy case
- Triangle inequality used to control terms (i) and (ii)
- Decoupled tangent sequence used to control term (i)
- Symmetrization used to move supremum over π out of probability
- Tail bound of term (i) turned into tree Rademacher process
- Self-normalized concentration inequality used to control tail bound
- Lemma 6.8 used to control term (ii)
- Union bound used to combine terms (i) and (ii)
- Result used to prove Theorem 5.2
Discussion
- Introduce new algorithm based on pessimism principle for policy learning from offline data
- Algorithm promises efficient learning as long as optimal policy is sufficiently covered by data set
- Developed generalized empirical Bernstein inequality for estimators with unbounded empirical losses
- Potential extensions to ERM, efficient and scalable algorithms, and sequential decision making
A Cross-fitted pessimistic policy learning
- Algorithm provides a cross-fitted version for fixed behavior policy case
- Index set is randomly split into two disjoint folds
- Fold-specific estimator µ(k) is used to obtain an estimator for µ(x, a)
- Estimation error of Q(π) can be bounded
- Regularization term is constructed
- Uniform concentration results can be obtained
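The two-fold splitting step can be sketched as follows. This is an illustration under assumptions rather than the paper's procedure: the outcome model is a per-arm sample mean that ignores contexts, and the names `fit_arm_means` and `cross_fitted_mu` are hypothetical. The key point it shows is that the model applied to one fold is fitted only on the other fold.

```python
import numpy as np

def fit_arm_means(A, Y, K):
    """Hypothetical outcome model: per-arm sample mean (context is ignored).
    Falls back to the overall mean for arms that do not appear."""
    overall = Y.mean()
    return np.array([Y[A == a].mean() if np.any(A == a) else overall
                     for a in range(K)])

def cross_fitted_mu(A, Y, K, seed=0):
    """Randomly split indices into two disjoint folds; predictions for each
    fold come from a model fitted on the other fold."""
    rng = np.random.default_rng(seed)
    T = len(Y)
    perm = rng.permutation(T)
    folds = [perm[: T // 2], perm[T // 2:]]
    mu_hat = np.empty((T, K))
    for k in (0, 1):
        other = folds[1 - k]
        mu_hat[folds[k]] = fit_arm_means(A[other], Y[other], K)
    return mu_hat, folds

A = np.array([0, 1, 0, 1, 0, 1, 0, 1])
Y = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0])
mu_hat, folds = cross_fitted_mu(A, Y, K=2)
print(mu_hat)
```

The resulting `mu_hat[t, a]` values could then be plugged into an AIPW-type score for observation t without reusing that observation in the fit.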
B Proof of fixed-policy results
- Event E is defined
- Symmetrization trick using Rademacher random variables is used to control probability of E
- Lemma B.1 is used to control deviation of 1
- Union bound is used to control probability of E
- Tower property is used to bound probability of E
- Hoeffding’s inequality is used to bound probability of E
- Natarajan’s lemma is used to bound probability of E
- Markov’s inequality is used to bound probability of E
- Exchangeability involving {(At, Yt)}, t = 1, …, T, is used
- Bernstein’s inequality is used to bound probability of E
- Boundedness of Q(π) is used
- Boundedness of e(Xt, π*(Xt)) is used
- S is N-shattered by Π
- For any v ∈ V, there exists some π*_v
- Fixed behavior policy is defined
- Total-Variation distance is used
- KL divergence is bounded
- Union bound is used
C Proof of adaptive results
- Define event of interest
- Define two additional events
- Lemma C.1 controls conditional probability of E1 and E2
- Bound left-hand side
- Recursive symmetrization for t = T
- Recursive symmetrization for t = T - 1
- Continuing recursive symmetrization
- Proof of Lemma 6.7
- Proof of Lemma 6.8
- Proof of Corollary 5.6
- Proof of Theorem 5.7
E Auxiliary lemmas
- Natarajan’s lemma characterizes the number of distinct realizations of a policy
- Lemma E.1 provides a uniform bound over a finite set with polynomially growing size
- Lemma E.2 is a self-normalized concentration inequality
- Lemma E.3 is Hoeffding’s inequality
- Lemma E.4 is Freedman’s inequality, with Bernstein’s inequality as a special case
- Lemma B.1 provides a bound on the expected regret
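As a quick sanity check on one of the tools listed above, the textbook two-sided Hoeffding bound for the mean of n i.i.d. variables in [0, 1], P(|mean − µ| ≥ ε) ≤ 2 exp(−2nε²), can be compared against a simulation. The form stated in the paper's Lemma E.3 may differ; the Bernoulli(0.5) setting and the values of n and ε are assumptions for this sketch.

```python
import numpy as np

def hoeffding_bound(n, eps):
    """Two-sided Hoeffding tail bound for the mean of n i.i.d. variables in [0, 1]."""
    return 2.0 * np.exp(-2.0 * n * eps ** 2)

rng = np.random.default_rng(0)
n, eps, reps = 100, 0.1, 20000

# Empirical frequency with which the sample mean deviates from 0.5 by at least eps.
samples = rng.binomial(1, 0.5, size=(reps, n))
empirical_tail = float(np.mean(np.abs(samples.mean(axis=1) - 0.5) >= eps))
bound = float(hoeffding_bound(n, eps))
print(empirical_tail, bound)
```

The simulated tail frequency sits below the bound, as expected, since Hoeffding's inequality is distribution-free and therefore conservative for any particular distribution.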