homework

In order to illustrate some comments made to my previous post, I suggest
the following homework problem:

We are dealing with a hidden Markov model (HMM) with n internal states, which produces as output a sequence of Hs and Ts. We know that it is ergodic (*), we are told that it is biased such that either p(H) > 2p(T) or p(T) > 2p(H) (provided it runs long enough), and we observe a sequence HTHHT... of N outputs.

How big does N have to be, as a function of n, to determine with some confidence whether we are dealing with one case or the other?

If we do not know n (**), how large would N have to be?

And what does an uninformed prior look like in this case? Does it have to be an improper prior as long as we do not know n, or should we somehow weight by the (inverse of the) complexity of the HMMs?

(*) Every internal state can eventually be reached and every internal state
will eventually be left, but the transition probabilities might be small.

(**) This makes it somewhat similar to the case of the previous blog post, but of course there we did not know anything about the underlying model, other than that there is a strong bias.
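For concreteness, the setup can be sketched in Python; the particular transition and emission probabilities below are my own illustrative choices, not part of the problem statement:

```python
import random

def sample_hmm(transition, emit_p_heads, n_steps, seed=0):
    """Run a hidden Markov chain and record an 'H' or 'T' output at each step.

    transition[i][j] is the probability of moving from state i to state j;
    emit_p_heads[i] is the probability that state i outputs 'H'.
    """
    rng = random.Random(seed)
    state = 0
    outputs = []
    for _ in range(n_steps):
        outputs.append('H' if rng.random() < emit_p_heads[state] else 'T')
        r, cum = rng.random(), 0.0
        for j, p in enumerate(transition[state]):
            cum += p
            if r < cum:
                state = j
                break
    return ''.join(outputs)

# A 2-state ergodic example: every state can be reached and left, and
# both states lean toward H, so in the long run p(H) > 2p(T).
transition = [[0.9, 0.1],
              [0.2, 0.8]]
seq = sample_hmm(transition, emit_p_heads=[0.8, 0.9], n_steps=30)
```

The question is then how long such a sequence has to be before the observer can tell which side the bias is on.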

wolfgang said...

I should add that I think none of these questions has a good answer.
Consider an HMM where one particular state is difficult to Enter, with probability E << 1, and yet, once the HMM is in this state, it is even more unlikely to eXit, with probability X << E.
But as long as the HMM is in this state it mostly produces Hs (or Ts), so that all of the bias might come from this one state.
But in order to have a reasonable chance to encounter it one needs a sequence of length N > 1/E.

Now my problem is that, assuming a uniform distribution for E between 0 and a < 1, the integral Int( dE 1/E ) diverges, and so does the expectation value for N ...
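A minimal numerical version of this construction (the emission probabilities and the particular values of E and X are my own arbitrary picks) shows how almost all of the long-run bias can hide in the hard-to-reach state:

```python
# Sketch of the two-state construction: state 0 ("normal") emits H with
# probability 0.5, state 1 ("sticky") emits H with probability 0.95.
E = 1e-4   # probability of entering the sticky state
X = 1e-6   # probability of exiting it, X << E

# Stationary distribution of the 2-state chain: pi_sticky = E / (E + X).
pi_sticky = E / (E + X)

# Long-run head frequency: a mixture of the two states' emission rates.
p_heads = (1 - pi_sticky) * 0.5 + pi_sticky * 0.95

# Yet a typical run needs on the order of 1/E steps just to enter the
# sticky state once, so short sequences look like a fair coin.
N_to_enter = 1 / E
```

So the chain is strongly H-biased in the stationary sense, while any sequence much shorter than 1/E steps is overwhelmingly likely to show no bias at all.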

RZ said...

my intuition is that we have to wait at least a typical correlation length, which is proportional to the inverse of the gap between the steady state (eigenvalue 1) and the second-largest eigenvalue of the Markov matrix. If we don't know anything about the matrix, there is not much to be said. In fact, as Wolfgang points out, even looking at the sequence and tracking correlation decays might not help if we get caught in a "meta-stable" region of the phase space.

Isn't this the stochastic version of the "attractor embedding" idea of Takens?
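The correlation-length estimate is easy to compute once the Markov matrix is given; the sticky two-state matrix below is my own toy example, not anything from the post:

```python
import numpy as np

# Row-stochastic transition matrix of a sticky two-state chain (toy example).
P = np.array([[0.99, 0.01],
              [0.01, 0.99]])

# Sort the eigenvalue moduli in decreasing order; for a stochastic
# matrix the largest is always exactly 1 (the steady state).
moduli = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
gap = moduli[0] - moduli[1]

# The relaxation ("correlation") length scales as the inverse gap.
correlation_length = 1.0 / gap
```

Here the gap is 0.02 and the correlation length is 50 steps; as the off-diagonal entries shrink, the gap closes and the required observation time grows without bound.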

wolfgang said...

>> my intuition is that we have to wait at least a typical correlation length

I agree.
So if we are dealing with the general case of a biased coin toss (e.g. thrown by a clever stage magician) there is not much we can do.
So are we back to assigning p(H) = 1/2, at least for a (long) sequence of N coin tosses?
In other words do we have to treat the strongly biased coin as a fair coin?

RZ said...

I would say that we have to treat the statement "this is a strongly biased coin with an *unknown* bias, which can be observed only after an unknown number of flips" as a statement carrying a very limited amount of information.

Compare:

wolfgang said...

>> the statement "this is a strongly biased coin with an *unknown* bias, which can be observed only after an unknown number of flips"

The statement is that the coin is indeed strongly biased, either p(H) > 2p(T) or p(T) > 2p(H), but we don't know where the bias comes from.
The fact that in general the bias can be determined only after an unknown number of flips was *deduced*, not stated (and I could be wrong about that).

RZ said...

How was it deduced?

wolfgang said...

>> How was it deduced?
see my first comment to this post and your first comment to this post 8-)

RZ said...

:)
I'd call it a clarification, not a deduction.

Cosma said...

I think that every hidden Markov model where the underlying Markov chain is ergodic is \psi-mixing, and so the probability that the observed fraction of heads departs from its expectation by more than h goes to zero like exp{-CNh^2}, where C is an ugly expression that depends on the rate at which the \psi-mixing coefficients go to zero. We would only need to take h=1/6 (yes?), but the problem of course is C. My recollection is that while C > 0, it can be made arbitrarily small, essentially by the kind of metastability Wolfgang suggests.

This is all from my memory of Paul Shields's Ergodic Theory of Discrete Sample Paths, which I will try to check later today.
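Taking the bound at face value, one can solve exp(-C N h^2) <= delta for N; the confidence level delta and the values of C below are my own illustrative choices, and the point is only how the answer scales as C shrinks:

```python
import math

def n_needed(C, h=1/6, delta=0.05):
    """Smallest N with exp(-C * N * h**2) <= delta, i.e. the sample size
    at which the deviation probability drops below delta."""
    return math.log(1 / delta) / (C * h**2)

# With a constant C of order 1, a few hundred tosses suffice ...
n_good = n_needed(C=1.0)

# ... but the required N blows up as C -> 0, which is exactly the
# metastability problem: nothing stops C from being tiny.
n_bad = n_needed(C=1e-4)
```

With h = 1/6 and delta = 0.05 this gives roughly 108 tosses for C = 1, and over a million for C = 10^-4.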

wolfgang said...

>> while C > 0, it can be made arbitrarily small

one question is whether C can be made arbitrarily small; another interesting question is about the 'expectation value' of C, assuming some 'uninformed prior' distribution for the internal configuration of the HMM.

As you have probably guessed, my main question is whether the Bayesian method works only if it deals with a restricted (set of) models.

wolfgang said...

On the other hand, I wonder if one could show (numerically?) that some 'simple strategies' would work more often than not.
For example, "if you have observed a sequence of N outcomes and the number of heads > 0.6*N, then assume p(H) > p(T)" could be such a 'simple strategy'...
It might fail completely in some cases (e.g. metastable states) but still be correct in most cases, and thus be the best possible 'uninformed guess'.
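The rule itself is a one-liner; a sketch, with the 0.6 threshold from the comment made a parameter:

```python
def simple_strategy(seq, threshold=0.6):
    """The 'uninformed guess': call the coin H-biased if the observed head
    fraction exceeds the threshold, T-biased if it falls below
    1 - threshold, and stay undecided otherwise."""
    frac_h = seq.count('H') / len(seq)
    if frac_h > threshold:
        return 'p(H) > p(T)'
    if frac_h < 1 - threshold:
        return 'p(T) > p(H)'
    return 'undecided'
```

On any sequence that has already escaped the metastable traps this gives the right answer; the open question is how often, under some prior over HMMs, the traps win.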

Cosma said...

>> As you have probably guessed, my main question is if the Bayesian method works only if it deals with restricted (set of) models

I'm glad you asked that question.

Cosma said...

It turns out that the exponential part of the psi-mixing convergence bound does not depend on the process, but there is a complicated polynomial-order pre-factor which does, and which I think can be made arbitrarily large for any given finite sample size, though of course it's dominated as the sample size grows. In fact, Shields has a homework problem which involves showing that if all you know about the coin-tossing is that it's ergodic, the convergence can be arbitrarily slow. I haven't done the problem, but from his hints it looks like the construction is basically pasting together parts of the state space with different biases, and adding very, very, very slow transitions between them. It's actually somewhat depressing to see that ergodicity alone is so unhelpful.

wolfgang said...

>> It's actually somewhat depressing
But on the other hand it is uplifting that a simple strategy works better than I would have thought.