Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- This paper examines the computational complexity of learning a Hidden Markov Model (HMM).
- It proposes an interactive access model, in which the algorithm can query for samples from the conditional distributions of the HMMs.
- This model enables computationally efficient learning algorithms, bypassing cryptographic hardness.
- Algorithms are presented for two settings: one with query access to exact conditional probabilities, and one with samples from the conditional distributions.
- The performance of the algorithm depends on a new parameter, called the fidelity of the HMM.
- The algorithms can be viewed as generalizations and robustifications of Angluin’s $L^*$ algorithm.
Paper Content
Introduction
- HMMs are used to model temporal and sequential phenomena
- HMMs have low description complexity, expressivity to capture long-range dependencies, and efficient inference algorithms
- HMMs are used in many fields
- Estimating/learning HMMs is computationally difficult
- We focus on distribution learning in total variation distance
- Maximum likelihood estimation is known to be statistically efficient, but not computationally efficient
- We consider interactive access to the HMM
- We show how a generalization of the L* algorithm can efficiently learn any HMM given exact conditional probabilities
- We show an algorithm that is efficient for all HMMs with “high fidelity”
- We introduce a new representation for distributions over exponentially large domains
- We introduce a new perturbation argument for mitigating error amplification over long sequences
Preliminaries
- Let O denote a finite observation space and O* denote observation sequences of arbitrary length
- Consider a distribution Pr[•] over T random variables x1, …, xT with a sequential ordering
- Pr[x1, x2, …, xT] is written in lieu of Pr[X1=x1, …, XT=xT], omitting explicit reference to the random variables X1, …, XT
- For a set of futures F and a set of histories H, Pr[F|H] denotes the |F| x |H| matrix whose (i, j)th entry is Pr[f|h], where f is the ith future in F and h is the jth history in H
- Hidden Markov Models provide a low-complexity parametrization for distributions over observation sequences
- An HMM with S hidden states is specified by an initial distribution µ, an emission matrix O, and a state transition matrix T
- A distribution has rank r if, for every t ∈ [T], the conditional probability matrix $\Pr[O^{\le T-t} \mid O^t]$ has rank at most r
- An HMM with S hidden states has rank at most S
- The exact conditional probability oracle takes as input observation sequences h and f, of lengths t ≤ T and T − t respectively, and returns the scalar Pr[f|h]
- The conditional sampling oracle takes as input an observation sequence h of length t ≤ T and returns an observation sequence f of length T − t sampled from Pr[•|h]
- Learning goal is distribution learning in total variation distance
- Algorithm should compute an estimate $\widehat{\Pr}[\cdot]$ such that, with probability at least 1 − δ, the total variation distance between Pr[•] and $\widehat{\Pr}[\cdot]$ is at most ε
- Algorithm should have computational complexity that scales polynomially in r, T, |O|, 1/ε and log(1/δ)
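The two access models above can be made concrete on a toy HMM. The following is a minimal sketch, not the paper's code: the class name `TinyHMM`, its parameter layout, and the example numbers are all assumptions chosen for illustration. It computes Pr[f|h] with the standard forward recursion and draws a future from Pr[•|h] one observation at a time.

```python
import random
from itertools import product


class TinyHMM:
    """Toy HMM (hypothetical helper, not from the paper)."""

    def __init__(self, mu, trans, emit, horizon):
        self.mu = mu            # mu[s] = Pr[first hidden state = s]
        self.trans = trans      # trans[s][s2] = Pr[next state = s2 | state = s]
        self.emit = emit        # emit[s][o] = Pr[observation = o | state = s]
        self.horizon = horizon  # T, the total sequence length

    def prob(self, seq):
        """Pr[seq] for a prefix seq, via the forward recursion."""
        n = len(self.mu)
        belief = list(self.mu)
        for o in seq:
            weighted = [belief[s] * self.emit[s][o] for s in range(n)]
            belief = [sum(weighted[s] * self.trans[s][s2] for s in range(n))
                      for s2 in range(n)]
        return sum(belief)

    def cond_prob(self, f, h):
        """Exact conditional probability oracle: returns Pr[f | h]."""
        return self.prob(list(h) + list(f)) / self.prob(list(h))

    def cond_sample(self, h, rng):
        """Conditional sampling oracle: draws f ~ Pr[. | h] of length T - len(h).

        Assumes Pr[h] > 0, otherwise the sampling weights are all zero.
        """
        seq = list(h)
        n_obs = len(self.emit[0])
        while len(seq) < self.horizon:
            weights = [self.prob(seq + [o]) for o in range(n_obs)]
            seq.append(rng.choices(range(n_obs), weights=weights)[0])
        return seq[len(h):]


hmm = TinyHMM(mu=[0.5, 0.5],
              trans=[[0.9, 0.1], [0.2, 0.8]],
              emit=[[0.8, 0.2], [0.3, 0.7]],
              horizon=3)
rng = random.Random(0)
future = hmm.cond_sample([0], rng)  # a length-2 future given history [0]
total = sum(hmm.prob(list(s)) for s in product(range(2), repeat=3))
cond_total = sum(hmm.cond_prob(list(f), [0]) for f in product(range(2), repeat=2))
```

Both `total` and `cond_total` should equal 1 up to floating-point error, since they sum a distribution over all sequences and over all futures of a fixed history respectively.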
Our results
- Algorithm 1 can learn any HMM in polynomial time given access to an exact probability oracle
- Algorithm 1 requires 0 < ε, δ < 1
- Algorithm 1 returns an efficiently represented approximation of the distribution
- Algorithm 2 is a robust version of L*
- Algorithm 2 depends on a spectral property of a distribution called fidelity
- Open Problem 1.6: Is there a computationally efficient algorithm for learning any low rank distribution given access to a conditional sampling oracle?
Technical overview
- Low rank distributions are challenging to learn.
- Notation is introduced to explain the challenges.
- Estimating matrices is necessary for distribution learning.
- Low rank property does not provide an efficient representation of the distribution.
Background: observable operators and hard instances
- The structure of an HMM can be exploited to obtain an efficient learning algorithm.
- Probability of a sequence can be written using observable operator representation.
- Operators can be estimated when T and O have full column rank.
- Rank deficient HMMs are hard instances.
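The observable operator representation mentioned above can be illustrated with a small sketch. It uses the textbook form, in which the operator for observation o combines the transition matrix with the emission probabilities of o, and the probability of a sequence is obtained by applying the operators in order to the initial distribution and summing the result. The function names and the example parameters are assumptions for illustration; the brute-force sum over hidden paths is included only as a sanity check.

```python
from itertools import product

MU = [0.5, 0.5]                   # initial state distribution
TRANS = [[0.9, 0.1], [0.2, 0.8]]  # TRANS[s][s2] = Pr[next = s2 | current = s]
EMIT = [[0.8, 0.2], [0.3, 0.7]]   # EMIT[s][o] = Pr[obs = o | state = s]


def observable_operator(trans, emit, o):
    """A_o[s2][s] = Pr[next state = s2, obs = o | state = s]."""
    n = len(trans)
    return [[trans[s][s2] * emit[s][o] for s in range(n)] for s2 in range(n)]


def matvec(a, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def prob_via_operators(mu, trans, emit, seq):
    """Pr[seq] = 1^T A_{x_T} ... A_{x_1} mu."""
    v = list(mu)
    for o in seq:
        v = matvec(observable_operator(trans, emit, o), v)
    return sum(v)


def prob_brute_force(mu, trans, emit, seq):
    """Pr[seq] by summing over every hidden state path (seq must be nonempty)."""
    n = len(mu)
    total = 0.0
    for path in product(range(n), repeat=len(seq)):
        p = mu[path[0]]
        for i, o in enumerate(seq):
            p *= emit[path[i]][o]
            if i + 1 < len(seq):
                p *= trans[path[i]][path[i + 1]]
        total += p
    return total


seq = [0, 1, 1]
p_ops = prob_via_operators(MU, TRANS, EMIT, seq)
p_brute = prob_brute_force(MU, TRANS, EMIT, seq)
```

The operator product costs time linear in the sequence length, while the brute-force sum is exponential in it; the two values agree up to rounding.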
Efficient representation
- Rank deficient HMMs require efficient representation of the distribution
- Any submatrix of $\Pr[F_t \mid H_t]$ with the same rank as the entire matrix can be used to build an efficient representation
- Exploiting a circulant structure in the matrices $\{\Pr[F_t \mid H_t]\}_{t \le T}$ makes it possible to model the evolution of the coefficients
- Sequence probabilities can be expressed by iterated application of the circulant structure
- Estimating operators requires interactive access and a novel error propagation argument
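The role of a basis can be spelled out with a short worked equation. This is a sketch under the assumption that $B_t \subseteq H_t$ is a set of histories whose columns span the column space of $\Pr[F_t \mid H_t]$, with $F_t$ the set of all futures of length $T - t$; the symbol $\beta_b(h)$ for the coefficients is notation chosen here for illustration.

```latex
% Every history's column is a linear combination of the basis columns:
\Pr[f \mid h] \;=\; \sum_{b \in B_t} \beta_b(h)\, \Pr[f \mid b]
\qquad \text{for all } f \in F_t,\ h \in H_t .

% Summing both sides over all futures f \in F_t, each conditional
% distribution sums to one, so the coefficients sum to one as well
% (individual coefficients may still be negative):
1 \;=\; \sum_{f \in F_t} \Pr[f \mid h]
  \;=\; \sum_{b \in B_t} \beta_b(h) \sum_{f \in F_t} \Pr[f \mid b]
  \;=\; \sum_{b \in B_t} \beta_b(h) .
```

Under this view, a history is summarized by at most $|B_t|$ coefficients rather than by its full column of conditional probabilities, which has exponentially many entries.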
Error propagation
- Finding and estimating operators is difficult
- Error amplification can arise from repeated application of learned operators
- Estimating operators is discussed in Section 2.4 and finding the basis in Section 2.5
- Estimator is defined in terms of estimated operators
- Total variation distance is defined
- Two strategies for bounding expression are discussed
- New perturbation analysis is introduced
- Tracks error in space of coefficients
- Sum of scalars is small via inductive argument
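Why repeated application of learned operators can amplify error is visible in a standard telescoping identity. The block below is a generic illustration rather than the paper's argument: it writes $A_\tau$ for the true operator applied at step $\tau$ and $\widehat{A}_\tau$ for its estimate, and the norm assumptions in the second display are hypothetical.

```latex
% Telescoping decomposition of the error of a product of t operators:
\widehat{A}_t \cdots \widehat{A}_1 \;-\; A_t \cdots A_1
\;=\; \sum_{\tau=1}^{t}
  \widehat{A}_t \cdots \widehat{A}_{\tau+1}\,
  \bigl(\widehat{A}_\tau - A_\tau\bigr)\,
  A_{\tau-1} \cdots A_1 .

% If \|\widehat{A}_\tau - A_\tau\| \le \epsilon, \|A_\tau\| \le 1 and hence
% \|\widehat{A}_\tau\| \le 1 + \epsilon in a submultiplicative norm, then
\bigl\| \widehat{A}_t \cdots \widehat{A}_1 - A_t \cdots A_1 \bigr\|
\;\le\; t\,\epsilon\,(1+\epsilon)^{t-1} .
```

A bound of this naive form can grow exponentially in $t$ when the operator norms exceed one, which is the kind of blow-up that an analysis tracking the error in the space of coefficients is meant to avoid.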
Estimating operators
- Estimate operators using conditional sampling oracle
- Linear regression may not work due to small singular values
- Preconditioner introduced to stabilize system
- Preconditioner reduces size of matrix
- Entries of matrix can be estimated using conditional samples
- Preconditioner amplifies singular values of matrix
- Fidelity introduced to ensure large singular values of matrix
- Fidelity captures previously studied positive results for learning HMMs
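The concern about small singular values can be seen in a two-dimensional toy problem. The sketch below is an illustration under assumed numbers, not the paper's estimator and not its preconditioner: it solves a noisy linear system whose matrix has singular values 1 and `sigma`, and reports how far the recovered solution is from the truth.

```python
def solve_2x2(m, y):
    """Solve m x = y for a 2x2 matrix m using Cramer's rule."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [(d * y[0] - b * y[1]) / det, (a * y[1] - c * y[0]) / det]


def recovery_error(sigma, noise):
    """Max coordinate error when solving diag(1, sigma) x = y + noise."""
    m = [[1.0, 0.0], [0.0, sigma]]
    x_true = [1.0, 1.0]
    # Noisy right-hand side: the exact product m @ x_true plus additive noise.
    y = [m[0][0] * x_true[0] + m[0][1] * x_true[1] + noise,
         m[1][0] * x_true[0] + m[1][1] * x_true[1] + noise]
    x_hat = solve_2x2(m, y)
    return max(abs(x_hat[i] - x_true[i]) for i in range(2))


well_conditioned = recovery_error(sigma=1.0, noise=1e-3)
ill_conditioned = recovery_error(sigma=1e-4, noise=1e-3)
```

The same noise level produces an error of roughly `noise` in the well-conditioned case and roughly `noise / sigma` in the ill-conditioned one, which is why a lower bound on the relevant singular values matters for sample-based estimation.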
Finding the basis
- Challenge is to find bases $\{B_t\}_{t \in [T]}$
- Random sampling approach works for high fidelity distributions
- Basis finding is the final issue to address for Theorem 1
- Adaptation of Angluin’s L* algorithm used to find bases
- Algorithm checks if predictions are accurate for polynomially many random sequences
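What a basis of histories looks like can be shown on a toy distribution. The sketch below is a brute-force illustration, not Angluin's L* and not the paper's Algorithm 1: it enumerates every history of length t, which takes exponential time, whereas the L*-style approach described above grows the basis from counterexamples. The distribution (an equal mixture of two i.i.d. coin sequences, which has rank at most 2) and all function names are assumptions made for the example. Exact `Fraction` arithmetic keeps the rank test free of rounding issues.

```python
from fractions import Fraction
from itertools import product

WEIGHTS = [Fraction(1, 2), Fraction(1, 2)]  # mixture weights
BIAS = [Fraction(1, 4), Fraction(3, 4)]     # Pr[obs = 1] under each component


def seq_prob(seq):
    """Pr[seq] under the two-component mixture."""
    total = Fraction(0)
    for w, p in zip(WEIGHTS, BIAS):
        term = w
        for o in seq:
            term *= p if o == 1 else 1 - p
        total += term
    return total


def cond_prob(f, h):
    """Stand-in for the exact conditional probability oracle Pr[f | h]."""
    return seq_prob(h + f) / seq_prob(h)


def rank(rows):
    """Rank of a list of equal-length vectors, by exact Gaussian elimination."""
    rows = [list(r) for r in rows]
    rk = 0
    n_cols = len(rows[0]) if rows else 0
    for col in range(n_cols):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(rk + 1, len(rows)):
            factor = rows[i][col] / rows[rk][col]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[rk])]
        rk += 1
    return rk


def find_basis(t, horizon):
    """Greedily keep histories whose conditional-probability vector is new."""
    futures = list(product([0, 1], repeat=horizon - t))
    basis, vectors = [], []
    for h in product([0, 1], repeat=t):
        col = [cond_prob(f, h) for f in futures]
        if rank(vectors + [col]) > len(vectors):
            basis.append(h)
            vectors.append(col)
    return basis


basis = find_basis(t=2, horizon=4)
```

For this mixture, two histories suffice at every step even though there are four histories of length 2, matching the rank bound of 2.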
Learning with conditional probabilities (Theorem 1)
- Theorem 1 states that Algorithm 1 can return an approximation of a probability distribution in poly(r, T, 1/ε, log(1/δ)) time
- Notation is introduced to define histories and futures of length t
- Probabilities associated to empty string are defined
- Bases for the distribution are formally defined
- Structural result is introduced to generate coefficients using |O|T matrices of size r x r
- Equation (7) is introduced to represent the probability distribution
- Solution $A_{o,t}$ is introduced for Equation (7)
Algorithm
- Algorithm 2 requires the user to provide ε, δ, ∆* and r.
- Algorithm 2 relies on the efficient representation provided by Proposition 3.2.
- Algorithm 2 estimates the operators $A_{o,t-1}$ using conditional samples and linear regression.
Analysis
- Access to exact conditional probability oracle is no longer available, only samples can be obtained
- Robust bases must be defined to control estimation errors
- Estimation algorithm and proof provided in Appendix B.6
- Estimation error for operators $A_{o,t}$ can be characterized
- Errors in induced distributions can be bounded using structured error
Discussion
- Interactive access to hidden Markov models can circumvent computational barriers to efficient learning.
- All low rank distributions with a certain fidelity property can be efficiently learned.
- Fidelity captures assumptions considered in prior work on learning of HMMs.
- Overcomplete setting of Sharan et al. admits bases of size S with fidelity 1/poly(S).
- Reliance on fidelity parameter is the main limitation of results.
- Open problem is to show that ignoring small directions preserves low rank property.
- Algorithm 1 with access to an exact probability oracle runs in poly(r, T, 1/ε, log(1/δ)) time.
B.1 finding robust basis
- Finding a robust basis can be defined by a covariance matrix
- The norm of the basis is upper bounded
- The two distributions are only a small factor apart
- Approximations of projections and coefficients need to be learned
- Error of the approximation is small
- Process requires many conditional samples
B.3 perturbation analysis: error in coefficients
- Learn approximations of operators $A_{o,t}$ to compute probabilities
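- Let $A_{x_{1:t}}$ and $\widehat{A}_{x_{1:t}}$ denote the products of the true and estimated operator matrices, $A_{x_t,t-1} \cdots A_{x_1,0}$ and $\widehat{A}_{x_t,t-1} \cdots \widehat{A}_{x_1,0}$, along the sequence $x_{1:t}$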
- $B_{x_{1:t}} \subset H_{t+1}$ is a subset of histories of length t+1
- The $\ell_1$ norms of the associated coefficients $\gamma_{x_{1:t}}$ and $\gamma^{\perp}_{x_{1:t}}$ grow moderately
- For any observation sequence x, the $\ell_1$ norm of the coefficients can be bounded
- Let $\alpha(x_t, B_{x_{1:t-1}})$ denote the matrix whose columns are $\alpha(x_t, b)$ for $b \in B_{x_{1:t-1}}$
- Let $\alpha^{\perp}(x_t, B_{x_{1:t-1}})$, $\alpha(x_t, V^{\perp}_{t-1})$ and $\alpha^{\perp}(x_t, V^{\perp}_{t-1})$ be defined analogously
- Recursion from Proposition B.5 has solution
- Algorithm 2 with access to a conditional sampling oracle runs in polynomial time
- It returns an approximation $\widehat{\Pr}[\cdot]$ satisfying $\mathrm{TV}(\Pr, \widehat{\Pr}) \le \varepsilon$ with probability at least 1 − δ
- Let $\{B_t\}_{t \in [T]}$ be a basis of the distribution Pr[•]
- Define operators $A_{o,t}$ under the basis $\{B_t\}_{t \in [T]}$
- The covariance matrix associated to $B_t$ has an eigenvalue decomposition
- Let $d^*_t$ be the restriction of the distribution $d_t$ to the set $\mathrm{dom}^+(t)$
- Let $\beta(x) \in \mathrm{span}(V_t)$ be the coefficients associated to history x
- The coefficients $\beta(x)$ are uniquely defined in $\mathrm{span}(V_t)$
- The entries of $\beta(x)$ sum to one, even though some entries could be negative
- Existence of operators which can be used to construct coefficients
B.6 estimating covariance matrix in Frobenius norm
- Estimate the objects needed for the operator $A_{o,t}$
- Lemma B.12 states that, with probability 1 − δ, we can learn an estimate of $s(b^*, x)$
- Define $s(b^*, x)$ as a sum, where $b^* \in B_t$ and x is a history of length t
- Define Pr[•|b] for $b \in B_t$
- Sample $m = \frac{1}{2 c^2 n^3 p^2} \log(2/\delta)$ random futures from Pr[•|x]
- Estimate $\Pr[f \mid b^*]$ and $d(f)$
- Define α-regular future
- Perform test A(f, b) for each future f and basis history b
- Estimate $q(bo)$ and $\Sigma_{B_t}$ for all $b \in B_t$, observations o and times $t \in [T]$
- The algorithm can learn parity with noise and covers all previously known positive results
- Define the distribution induced by parity with noise
- Proposition C.4 shows that this distribution has rank ≤ 2T and fidelity $(1-2\alpha)^2/2$
- Define overcomplete HMMs
- Proposition C.7 shows that this distribution has rank S and fidelity $1/\mathrm{poly}(S)$
D general algorithm for finding approximate basis
- Definition D.1 defines an approximate basis for a probability vector
- Theorem 3 presents a main result on how to build an approximate basis for a regular low rank distribution
- The regularity assumption on the distribution can be removed using ideas from Appendix B.6
D.1 learning coefficients
- Check if there exists a β(x) such that a certain condition is met
- Define an $\ell_2$ approximation error
- Use relative probabilities for regular distributions to build a guess for approximate basis
- Use poly(T, 1/ε, 1/α, log(1/δ)) many conditional samples to get estimates
- Use a 2-smooth function and a standard uniform convergence argument
- Use Hoeffding’s inequality
- Use an Elliptical Potential Lemma
- Choose C, H and n
- Find a counterexample
- Show that the overall error of the basis is small
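The use of Hoeffding's inequality to size the number of conditional samples can be sketched as follows. This is the textbook two-sided bound for averages of independent [0, 1]-valued variables, not the specific constants of the paper; the function names and the simulated event are assumptions for illustration.

```python
import math
import random


def hoeffding_sample_size(eps, delta):
    """Smallest m with 2 * exp(-2 * m * eps**2) <= delta.

    With m i.i.d. samples in [0, 1], the empirical mean is then within eps
    of the true mean with probability at least 1 - delta.
    """
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))


def estimate_probability(sample_event, m, rng):
    """Empirical frequency of a {True, False}-valued event over m draws."""
    return sum(1 for _ in range(m) if sample_event(rng)) / m


m = hoeffding_sample_size(eps=0.1, delta=0.05)

# Simulated stand-in for "the sampled future equals f": an event of probability 0.3.
rng = random.Random(0)
m_safe = hoeffding_sample_size(eps=0.1, delta=1e-6)
estimate = estimate_probability(lambda r: r.random() < 0.3, m_safe, rng)
```

The sample size grows like 1/ε² but only logarithmically in 1/δ, which is consistent with the log(1/δ) dependence in the stated running times.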
E helper propositions
- Proposition E.1 (Hoeffding’s inequality) states that for independent random variables with a lower and upper bound, the sum of these random variables can be bounded by a certain value.
- Davis-Kahan theorem is used in the work.
- Algorithm 2 with access to a conditional sampling oracle runs in poly(r, T, |O|, 1/∆*, 1/ε, log(1/δ)) time and returns an efficiently represented approximation $\widehat{\Pr}[\cdot]$ satisfying $\mathrm{TV}(\Pr, \widehat{\Pr}) \le \varepsilon$ with probability at least 1 − δ.
- Lemma 4.2 states that with probability 1 − δ, $\{S_t\}_{t \in [T]}$ form ∆-robust bases for Pr[•].
- Lemma 4.4 states that we can learn approximations $\widehat{A}_{o,t}$ for all observations o ∈ O and t ∈ [T] in poly(r, |O|, T, 1/ε, 1/∆, log(1/δ)) time, with a guarantee that holds with probability 1 − δ for any unit vector v.
- Lemma 4.5 states that the functions Pr[•] and $\widehat{\Pr}[\cdot]$ are close in TV distance: $\mathrm{TV}(\Pr, \widehat{\Pr}) \le 2|O|T\varepsilon$.
- Definition B.14 states that Test A(f, b) passes if the empirical estimate $\widehat{\Pr}[f_\tau \mid b f_{1:\tau-1}] > 2\alpha$ for all τ ∈ [t] and fails otherwise.
- Proposition B.17 states that $\Pr[F_b \mid b] \le O(|O|T\alpha)$.
- Definition B.15 states that a future f is α-irregular for history $b \in B_t$ if there exists some τ ∈ [t] (τ can depend on b) such that $\Pr[f_\tau \mid b f_{1:\tau-1}] < \alpha$.
- Proposition B.18 states that $\Pr[F_b \mid b] \le |O|T\alpha$.
- Definition B.19 states that the estimate $\widehat{\Pr}[f \mid b]$ is set to 0 if test A(f, b) fails and is set to the estimate from Proposition B.16 if test A(f, b) passes.
- Proposition B.20 states that $|\widehat{\Pr}[f \mid b] - \Pr[f \mid b]| \le \gamma \Pr[f \mid b]$.
- Proposition D.4 states that there exists h ≤ H such that $\min_{\beta \in \mathbb{R}^h,\ \|\beta\|_2 \le C} L_{B_h, b_{h+1}}(\beta) \le \varepsilon$.
- Proposition D.5 states that there exists h ≤ H such that $\min_{\beta \in \mathbb{R}^h,\ \|\beta\|_2 \le C} L_{B_h, b_{h+1}}(\beta) \le \varepsilon$.
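Test A(f, b) and the truncated estimate from Definitions B.14 and B.19 above can be sketched in a few lines. The sketch takes the per-step empirical estimates of the conditional probabilities along the future as given. Since the estimator of Proposition B.16 is not reproduced in this summary, the product of the per-step estimates (the chain rule) is used as a labeled stand-in; the function names are hypothetical.

```python
def passes_test_a(step_estimates, alpha):
    """Test A: every per-step estimate along the future must exceed 2 * alpha."""
    return all(p > 2 * alpha for p in step_estimates)


def truncated_estimate(step_estimates, alpha):
    """Return 0 if Test A fails, otherwise a chain-rule product estimate.

    The product is a stand-in for the estimate of Proposition B.16,
    which is not specified in this summary.
    """
    if not passes_test_a(step_estimates, alpha):
        return 0.0
    estimate = 1.0
    for p in step_estimates:
        estimate *= p
    return estimate


regular = truncated_estimate([0.5, 0.4], alpha=0.1)     # all steps above 0.2
irregular = truncated_estimate([0.5, 0.15], alpha=0.1)  # second step at most 0.2
```

Zeroing out futures that fail the test discards only sequences containing a low-probability step, which Propositions B.17 and B.18 bound in total conditional mass.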