Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Studies offline policy learning, which aims to learn an optimal decision rule for a given population
Existing methods rely on a uniform overlap assumption, which can be unrealistic in many situations
Proposed algorithm optimizes lower confidence bounds of policy values
Data-dependent upper bound for suboptimality of algorithm only depends on overlap for optimal policy and complexity of policy class
New self-normalized type concentration inequality developed for inverse-propensity-weighting estimators

Paper Content

Introduction

Policy learning aims to create an optimal decision rule for a given population
Policy learning is used in healthcare, advertising, and recommendation systems
Data is collected by executing a fixed and potentially suboptimal policy
Data is also collected by running an adaptive learning algorithm
Policy learning involves two components: policy evaluation and policy optimization
A good performance of policy learning relies on accurate evaluation for all policies
Uniform overlap assumption is often violated in practice
Fundamental challenge of policy learning is counterfactual reasoning
Estimation of policy value can have a huge variance if evaluated at out-of-distribution actions
Quality of assessment of policy value varies across policies
Need to quantify statistical confidence in policy value and incorporate into learning process

A pessimism-based framework

Proposed novel solution to offline policy learning with two ingredients: pessimism principle and generalized empirical Bernstein inequality
Algorithm optimizes lower confidence bounds (LCBs) instead of point estimates of policy values
Generic data-dependent upper bound for performance of algorithm
Complexity measure of policy and estimation error of optimal policy in upper bound
Oracle property of approach: adapts to overlap of optimal policy in data
Developed uniform concentration inequality for offline policy evaluation
Standard O(1/√T) rate for learning performance when e(X, π*(X)) is lower bounded by a constant
Polynomial learning rate when propensity for optimal policy is decaying at a polynomial rate

Background

X and Y denote the space of contexts and rewards respectively
A contextual bandit model C is specified by an action set A, a context distribution PX and a set of reward distributions
Policy learning is done from an offline dataset collected from C

Data collecting process

Context Xt is drawn from PX at each time t
Action At is taken according to the behavior policy
Reward Yt is µ(Xt, At) + εt
Behavior policy is potentially adaptive
At is the treatment option taken for Xt
Yt(a) follows population joint distribution
Conditional independence condition: Yt(a) ⊥⊥ At | Xt, Ht
Bounded reward: Yt ∈ [0, 1]
Known behavior policy: et(x, a | Ht) for all (x, a) ∈ X × A are known to the learner
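The data collecting process above can be sketched for the simplest case of a fixed, context-independent behavior policy. Everything specific here is an assumption made for illustration: the uniform context distribution, the mean reward function `mu`, the Bernoulli rewards (which keep Yt in [0, 1]), and the function name `collect_offline_data`.

```python
import numpy as np

def mu(x, a, K):
    # Hypothetical mean reward mu(x, a); values stay inside [0.2, 0.8].
    return 0.5 + 0.3 * np.cos(2 * np.pi * (x - a / K))

def collect_offline_data(T, propensities, seed=0):
    """Simulate T rounds with a fixed behavior policy whose propensities are known."""
    rng = np.random.default_rng(seed)
    K = len(propensities)
    X = rng.uniform(size=T)                          # contexts X_t ~ P_X (assumed uniform)
    A = rng.choice(K, size=T, p=propensities)        # actions A_t from the behavior policy
    Y = rng.binomial(1, mu(X, A, K)).astype(float)   # bounded rewards with mean mu(X_t, A_t)
    E = np.asarray(propensities)[A]                  # logged propensity of the action taken
    return X, A, Y, E

X, A, Y, E = collect_offline_data(T=1000, propensities=[0.6, 0.3, 0.1])
```

The third action is taken only 10% of the time, so any policy that relies on it is poorly covered by the data.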

Policy learning and performance metric

Π is a collection of policies, each of which is a mapping from X to A
The average policy value of a policy π in Π is defined using the contextual bandit model C
The optimal policy π* maximizes the policy value
The goal is to learn a policy π and measure the learning performance by its suboptimality
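As a small worked example of policy value and suboptimality, the sketch below evaluates a finite class of threshold policies in a toy two-action model. The mean reward `mu`, the uniform context distribution, and the threshold class are all assumptions chosen so that the values can be checked by hand.

```python
import numpy as np

def mu(x, a):
    # Hypothetical mean reward: action 1 pays x, action 0 pays 1 - x.
    return np.where(a == 1, x, 1.0 - x)

def threshold_policy(c):
    # Policy that plays action 1 when the context exceeds c, else action 0.
    return lambda x: (x > c).astype(int)

def policy_value(policy, n_grid=100_001):
    # Approximates E[mu(X, policy(X))] for X uniform on [0, 1] with a fine grid.
    x = np.linspace(0.0, 1.0, n_grid)
    return float(np.mean(mu(x, policy(x))))

thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]
values = {c: policy_value(threshold_policy(c)) for c in thresholds}
best_c = max(values, key=values.get)                        # optimal policy within the class
suboptimality = {c: values[best_c] - v for c, v in values.items()}
```

In this toy model the value of threshold c is c − c² + 1/2, so the best threshold is 0.5 with value 0.75, and every other policy has strictly positive suboptimality.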

Policy class complexity

Natarajan dimension is used to quantify the complexity of a policy class
Natarajan dimension is a generalization of the VC dimension
Finite upper bounds for the Natarajan dimension have been established for many policy classes
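To make the complexity measure concrete, the sketch below counts how many distinct action patterns a policy class produces on a fixed set of contexts and compares the count with the bound m^d · K^(2d) from Natarajan's lemma. The example class (one-dimensional thresholds with K = 2 actions, where the Natarajan dimension coincides with the VC dimension d = 1) is an assumption chosen for illustration.

```python
import numpy as np

def distinct_realizations(points, thresholds):
    # Each threshold c induces the action pattern (1{x > c}) over the given points.
    patterns = {tuple((points > c).astype(int)) for c in thresholds}
    return len(patterns)

m, K, d = 20, 2, 1
points = np.arange(1, m + 1) / (m + 1)       # m fixed contexts in (0, 1)
thresholds = np.linspace(0.0, 1.0, 2001)     # dense grid of candidate policies
count = distinct_realizations(points, thresholds)
natarajan_bound = m ** d * K ** (2 * d)      # m^d * K^(2d)
```

Although the grid contains 2001 policies, they induce only m + 1 = 21 distinct patterns on these points, well within the bound of 80.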

Related work

Policy learning from a fixed class is an area at the intersection of economics, causal inference and statistical machine learning.
Classical works focus on i.i.d. data collected by a fixed behavior policy that is known or unknown and estimable.
These works all assume a uniform overlap condition, i.e., the propensity for all actions in the behavior policy is uniformly lower bounded by a constant.
When the behavior policy is known, the uniform overlap assumption is not necessary for efficient policy learning.
There is a recent interest in policy learning from adaptively collected data.
This setting is more challenging because the observations are no longer i.i.d.
To tackle these challenges, existing works assume the propensities for all actions are deterministically lower bounded over time.
We introduce a novel algorithm for policy learning and show that the overlap of the optimal policy is sufficient for efficient policy learning.
Pessimism was initially proposed in offline reinforcement learning for learning an optimal policy in Markov decision processes.
Variance regularization has also appeared in statistical learning and empirical risk minimization.
Our method extends the setting in this line of work to unbounded random variables and to adaptively collected, hence mutually dependent, data.
We develop a novel self-normalized-type concentration inequality for empirical risks.
Our work is generally related to reweighting-based approaches for off-policy evaluation in offline RL and contextual bandits.
We develop a novel empirical Bernstein’s inequality to provide uncertainty quantification for inverse-weighted estimators.

Pessimistic policy learning

Introducing the pessimism principle to eliminate reliance on uniform overlap assumption
Optimizing pessimistic policy value with policy-dependent regularization term
Proposition 3.1 states that suboptimality of policy only depends on regularization term of optimal policy
Learning performance guarantee and uniform concentration event differ from greedy algorithms in literature
Seek data-dependent bound for every policy in optimization problem to improve upper bound
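The reason only the regularization term of the optimal policy appears can be traced through a short, standard argument. The sketch below assumes that the uniform event |Q̂(π) − Q(π)| ≤ R(π) holds for all π ∈ Π and that π̂ maximizes Q̂(π) − R(π); the hat notation and the constant 2 come from this generic argument and may differ from the exact statement of Proposition 3.1.

```latex
\begin{align*}
Q(\pi^*) - Q(\hat\pi)
  &= \underbrace{Q(\pi^*) - \bigl[\hat Q(\pi^*) - R(\pi^*)\bigr]}_{\le\, 2R(\pi^*) \text{ on the event}}
   + \underbrace{\bigl[\hat Q(\pi^*) - R(\pi^*)\bigr] - \bigl[\hat Q(\hat\pi) - R(\hat\pi)\bigr]}_{\le\, 0 \text{ by optimality of } \hat\pi} \\
  &\quad + \underbrace{\bigl[\hat Q(\hat\pi) - R(\hat\pi)\bigr] - Q(\hat\pi)}_{\le\, 0 \text{ on the event}} \\
  &\le 2R(\pi^*).
\end{align*}
```

The regularizer of the learned policy cancels in the last term, which is why no overlap condition on policies other than π* is needed.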

Construction of the estimator and the regularizer

Instantiate generic idea from Section 3.2 into concrete algorithm
Estimator Q(π) and regularizer R(π) for all π ∈ Π
Estimator is AIPW-type estimator
Regularizer is β · V(π)
Scaling factor β depends on complexity of policy class Π
V(π) is a proxy for the standard deviation of Q(π)
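A minimal sketch of selecting a policy by its lower confidence bound Q(π) − β · V(π) follows. The AIPW score has the usual doubly robust form; the function `uncertainty_proxy` is a simplified stand-in for V(π) and not the paper's exact regularizer, `beta` is a hand-picked constant rather than the complexity-dependent scaling, and all function names and the four-sample toy data are assumptions.

```python
import numpy as np

def aipw_value(pi_actions, A, Y, e_pi, mu_hat_pi):
    # AIPW score: outcome model plus inverse-propensity-weighted residual.
    match = (A == pi_actions).astype(float)
    gamma = mu_hat_pi + match / e_pi * (Y - mu_hat_pi)
    return float(gamma.mean())

def uncertainty_proxy(pi_actions, A, e_pi):
    # Simplified stand-in for V(pi): grows when pi relies on low-propensity actions.
    match = (A == pi_actions).astype(float)
    return float(np.sqrt(np.sum(match / e_pi ** 2)) / len(A))

def pessimistic_select(policies, X, A, Y, propensity, mu_hat, beta):
    # Returns the policy name with the largest lower confidence bound, plus all scores.
    scores = {}
    for name, pi in policies.items():
        a_pi = pi(X)
        e_pi = propensity(X, a_pi)
        q = aipw_value(a_pi, A, Y, e_pi, mu_hat(X, a_pi))
        scores[name] = q - beta * uncertainty_proxy(a_pi, A, e_pi)
    return max(scores, key=scores.get), scores

# Toy data: action 1 is observed once, with a lucky reward.
X = np.zeros(4)
A = np.array([0, 0, 0, 1])
Y = np.array([1.0, 0.0, 1.0, 1.0])
policies = {
    "always_0": lambda x: np.zeros(len(x), dtype=int),
    "always_1": lambda x: np.ones(len(x), dtype=int),
}
propensity = lambda x, a: np.where(a == 0, 0.75, 0.25)   # known behavior policy
mu_hat = lambda x, a: np.zeros(len(x))                   # trivial outcome model (plain IPW)

greedy_choice, _ = pessimistic_select(policies, X, A, Y, propensity, mu_hat, beta=0.0)
lcb_choice, lcb_scores = pessimistic_select(policies, X, A, Y, propensity, mu_hat, beta=1.0)
```

On this toy data the greedy rule (beta = 0) picks `always_1` on the strength of a single heavily weighted observation, while the pessimistic rule (beta = 1) picks the better-covered `always_0`.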

Learning from a fixed behavior policy

Fixed behavior policy used to collect offline data
Upper and lower bounds on suboptimality of pessimistic learning algorithm

Suboptimality upper bound

Theorem 5.2 provides an upper bound for the suboptimality of pessimistic policy learning
Assumption 5.1 is required for the theorem to hold
The upper bound is data-dependent and small as long as the value of the optimal policy can be estimated
Assumption 5.1 is mild and does not require any assumptions on the adaptive data collecting process
Corollary 5.6 provides a polynomial upper bound on the suboptimality of pessimistic policy learning
The upper bound improves upon the results developed in other papers
The upper bound is efficient even when the assumptions in greedy learning do not hold

Pessimistic policy learning is minimax optimal

Pessimistic policy learning is the best effort given offline data from a fixed behavior policy
Theorem 4.4 establishes a minimax lower bound for policy learning in a policy class
The lower bound in Theorem 4.4 matches the upper bound in Corollary 4.3 up to logarithmic factors of T and K

Learning from adaptive behavior policies

Pessimistic policy learning from adaptively collected data allows the behavior policy to change over time.
An additional mild condition is imposed that the sampling probability has a deterministic lower bound.
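One way to picture an adaptive behavior policy with a deterministic lower bound on the sampling probability is an epsilon-greedy rule whose exploration rate decays polynomially. The sketch below is an assumption for illustration only (context-free arms, Bernoulli rewards, exploration rate t^(−alpha)); it is not the paper's data-generating process.

```python
import numpy as np

def run_adaptive_collection(T, means, alpha=0.5, seed=0):
    """Epsilon-greedy logging with eps_t = t^(-alpha), so every propensity is at least eps_t / K."""
    rng = np.random.default_rng(seed)
    K = len(means)
    counts, sums = np.zeros(K), np.zeros(K)
    A = np.empty(T, dtype=int)
    Y = np.empty(T)
    E = np.empty((T, K))                      # full propensity vector e_t(. | H_t) per round
    for t in range(1, T + 1):
        eps = t ** (-alpha)
        est = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
        probs = np.full(K, eps / K)           # exploration floor shared by all actions
        probs[np.argmax(est)] += 1.0 - eps    # remaining mass on the current best arm
        a = rng.choice(K, p=probs)
        y = float(rng.binomial(1, means[a]))
        counts[a] += 1
        sums[a] += y
        A[t - 1], Y[t - 1], E[t - 1] = a, y, probs
    return A, Y, E

A, Y, E = run_adaptive_collection(T=500, means=[0.3, 0.7])
```

The propensities depend on the history through the running reward estimates, so the observations are not i.i.d., yet each action keeps a known floor of t^(−alpha) / K at round t.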

Minimax optimality

Investigated minimax optimality of pessimistic policy learning when data is adaptively collected
Worked under polynomial decay setting for clear interpretability
Fundamental difficulty of policy learning with adaptive behavior policy is captured by sampling propensities for optimal policy
Minimax lower bound matches upper bound, illustrating key role of overlap of optimal policy in fundamental limit of learning from offline dataset

Proof sketch of main results

Proof sketches are provided for the main results
Section 6.1 focuses on the upper bound in the fixed policy case
Section 6.2 focuses on the upper bound in the adaptive policy case
Notation is simplified to omit dependence on C when context is clear

Proof of upper bound for the fixed policy case

Estimation error for any (x, a) can be decomposed into two terms
Symmetrization via Rademacher process to bound term (i)
Lemma 6.1 bounds probability of term (i) exceeding a constant
Lemma 6.2 establishes uniform upper bound for tail probability of Rademacher process
Lemma 6.3 controls term (ii)
Combining results from Lemmas 6.1, 6.2 and 6.3, Theorem 4.1 is proven

Proof of upper bound for the adaptive policy case

Triangle inequality used to control terms (i) and (ii)
Decoupled tangent sequence used to control term (i)
Symmetrization used to move supremum over π out of probability
Tail bound of term (i) turned into tree Rademacher process
Self-normalized concentration inequality used to control tail bound
Lemma 6.8 used to control term (ii)
Union bound used to combine terms (i) and (ii)
Result used to prove Theorem 5.2

Discussion

Introduce new algorithm based on pessimism principle for policy learning from offline data
Algorithm promises efficient learning as long as optimal policy is sufficiently covered by data set
Developed generalized empirical Bernstein inequality for estimators with unbounded empirical losses
Extension to ERM, efficient and scalable algorithms, and sequential decision making

A cross-fitted pessimistic policy learning

Algorithm provides a cross-fitted version for fixed behavior policy case
Index set is randomly split into two disjoint folds
Estimator µ(k) is used to obtain an estimator for µ(x, a)
Estimation error of Q(π) can be bounded
Regularization term is constructed
Uniform concentration results can be obtained
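The cross-fitting step can be sketched as follows: indices are split at random into two disjoint folds, an outcome model is fitted on one fold, and its predictions are used only on the other. The per-action sample mean used as the outcome model, the direction of the fold assignment, and the function names are assumptions for illustration.

```python
import numpy as np

def fit_action_means(A, Y, K):
    # Hypothetical outcome model: per-action average reward, ignoring the context.
    means = np.zeros(K)
    for a in range(K):
        mask = A == a
        if mask.any():
            means[a] = Y[mask].mean()
    return means

def cross_fitted_mu_hat(A, Y, K, seed=0):
    """Return a (T, K) array of outcome predictions, each row fitted on the other fold."""
    rng = np.random.default_rng(seed)
    T = len(A)
    perm = rng.permutation(T)
    folds = [perm[: T // 2], perm[T // 2:]]   # two disjoint folds covering all indices
    mu_hat = np.zeros((T, K))
    for k in (0, 1):
        train, held_out = folds[1 - k], folds[k]
        mu_hat[held_out] = fit_action_means(A[train], Y[train], K)
    return mu_hat, folds

A = np.array([0, 1, 0, 1, 0, 1, 0, 1])
Y = np.array([0.2, 0.9, 0.1, 0.8, 0.3, 0.7, 0.2, 0.6])
mu_hat, folds = cross_fitted_mu_hat(A, Y, K=2)
```

Because the prediction for round t never uses the reward Yt itself, the fitted outcome model is independent of the observation it is applied to within each fold.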

B proof of fixed-policy results

Event E is defined
Symmetrization trick using Rademacher random variables is used to control probability of E
Lemma B.1 is used to control deviation of 1
Union bound is used to control probability of E
Tower property is used to bound probability of E
Hoeffding’s inequality is used to bound probability of E
Natarajan’s lemma is used to bound probability of E
Markov’s inequality is used to bound probability of E
Exchangeability involving {At, Yt}, t = 1, …, T, is used
Bernstein’s inequality is used to bound probability of E
Boundedness of Q(π) is used
Boundedness of e(X t , π * (X t )) is used
S is N-shattered by Π
For any v ∈ V, there exists some π * v
Fixed behavior policy is defined
Total-Variation distance is used
KL divergence is bounded
Union bound is used

C proof of adaptive results

Define event of interest
Define two additional events
Lemma C.1 controls conditional probability of E1 and E2
Bound left-hand side
Recursive symmetrization for t = T
Recursive symmetrization for t = T - 1
Continuing recursive symmetrization
Proof of Lemma 6.7
Proof of Lemma 6.8
Proof of Corollary 5.6
Proof of Theorem 5.7

E auxiliary lemmas

Natarajan’s lemma characterizes the number of distinct realizations of a policy
Lemma E.1 provides a uniform bound over a finite set with polynomially growing size
Lemma E.2 is a self-normalized concentration inequality
Lemma E.3 is Hoeffding’s inequality
Lemma E.4 is Freedman’s inequality, with Bernstein’s inequality as a special case
Lemma B.1 provides a bound on the expected regret
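For reference, the textbook forms of two of the inequalities named above are written out below. These are the standard statements; the versions used as Lemmas E.3 and E.4 in the paper may carry different constants or conditions.

```latex
% Hoeffding's inequality: X_1, ..., X_n independent with X_i \in [a_i, b_i].
\[
\mathbb{P}\Bigl( \Bigl| \sum_{i=1}^{n} \bigl( X_i - \mathbb{E}[X_i] \bigr) \Bigr| \ge t \Bigr)
  \le 2 \exp\Bigl( - \frac{2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \Bigr),
  \qquad t > 0.
\]

% Freedman's inequality: (D_i) a martingale difference sequence adapted to (\mathcal{F}_i),
% with D_i \le b almost surely and W_n = \sum_{i=1}^{n} \mathbb{E}[D_i^2 \mid \mathcal{F}_{i-1}].
\[
\mathbb{P}\Bigl( \sum_{i=1}^{n} D_i \ge t \ \text{and} \ W_n \le \sigma^2 \Bigr)
  \le \exp\Bigl( - \frac{t^2}{2 (\sigma^2 + b t / 3)} \Bigr),
  \qquad t, \sigma^2 > 0.
\]
```

When the D_i are independent, W_n is deterministic and the second bound reduces to Bernstein's inequality, which is the special case mentioned for Lemma E.4.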
