The basic model is the same as the one presented in Section 4.1. The hidden state sequence is denoted by $M = (M_1, \ldots, M_T)$ and the other parameters by $\boldsymbol{\theta}$. The exact form of $\boldsymbol{\theta}$ will be specified later. The observations $\mathbf{x}(t)$, given the corresponding hidden state, are assumed to be Gaussian with a diagonal covariance matrix.

Given the HMM state sequence $M$, the individual observations are assumed to be independent. Therefore the likelihood of the data can be written as

$$p(X \mid M, \boldsymbol{\theta}) = \prod_{t=1}^{T} p(\mathbf{x}(t) \mid M_t, \boldsymbol{\theta}). \tag{5.1}$$

Because of the Markov property, the prior distribution of the hidden state sequence can also be written in factorial form:

$$p(M \mid \boldsymbol{\theta}) = p(M_1) \prod_{t=2}^{T} p(M_t \mid M_{t-1}). \tag{5.2}$$

The factors of Equations (5.1) and (5.2) are defined to be

$$p(M_1 = i) = \pi_i \tag{5.3}$$

$$p(M_t = j \mid M_{t-1} = i) = a_{ij} \tag{5.4}$$

$$p(\mathbf{x}(t) \mid M_t = i, \boldsymbol{\theta}) = \prod_{k} N\!\left( x_k(t);\; m_{ik},\, \exp(2 v_{ik}) \right). \tag{5.5}$$
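To make the factorisation concrete, the joint probability of an observation sequence and a given state path is the product of the initial-state probability, the transition probabilities, and the diagonal-Gaussian emission densities. The following sketch evaluates this product in the log domain; the function names and parameter layout are ours, purely illustrative:

```python
import numpy as np

def gauss_logpdf_diag(x, mean, log_std):
    # log N(x; mean, diag(exp(2 * log_std))) -- diagonal covariance,
    # with the variance parameterised as exp(2 v) as in the text
    var = np.exp(2.0 * log_std)
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var))

def log_likelihood(states, X, log_pi, log_A, means, log_stds):
    # log p(X, M) = log pi_{M_1} + sum log a_{M_{t-1} M_t} + sum log emissions
    ll = log_pi[states[0]]
    for t in range(1, len(states)):
        ll += log_A[states[t - 1], states[t]]
    for t, s in enumerate(states):
        ll += gauss_logpdf_diag(X[t], means[s], log_stds[s])
    return ll
```

Working in the log domain avoids numerical underflow when the sequence is long, since the product of many small probabilities becomes a sum of log terms.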

The priors of all the parameters defined above are

$$\boldsymbol{\pi} \sim \operatorname{Dirichlet}(\boldsymbol{\pi};\, \mathbf{u}^{(\pi)}) \tag{5.6}$$

$$\mathbf{a}_i \sim \operatorname{Dirichlet}(\mathbf{a}_i;\, \mathbf{u}^{(A)}) \tag{5.7}$$

$$m_{ik} \sim N(m_{ik};\; m_{m_k},\, \exp(2 v_{m_k})) \tag{5.8}$$

$$v_{ik} \sim N(v_{ik};\; m_{v_k},\, \exp(2 v_{v_k})). \tag{5.9}$$
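One way to see what these priors express is to draw a complete parameter set from them: Dirichlet draws for the initial-state and transition probabilities, and Gaussian draws for the means and for the parameters that encode the log-standard-deviations. The hyperparameter values below are illustrative, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_dims = 3, 2

# Dirichlet priors on the initial-state distribution and on each
# row of the transition matrix (pseudo-counts chosen arbitrarily)
u_pi = np.ones(n_states)
u_A = np.ones(n_states)
pi = rng.dirichlet(u_pi)
A = np.vstack([rng.dirichlet(u_A) for _ in range(n_states)])

# Gaussian priors on the means and on the v-parameters; the emission
# variance exp(2 v) is positive by construction
m = rng.normal(0.0, 1.0, size=(n_states, n_dims))
v = rng.normal(0.0, 0.5, size=(n_states, n_dims))
variances = np.exp(2.0 * v)
```

The Dirichlet draws are automatically valid probability vectors, and the exponential map guarantees valid variances, so every sample from the prior is a well-formed HMM.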

These should be written as conditional distributions, conditioned on the parameters of the hyperprior, but the conditioning variables have been dropped to simplify the notation.

The parameters $\mathbf{u}^{(\pi)}$ and $\mathbf{u}^{(A)}$ of the Dirichlet priors are fixed. Their values should be chosen to reflect true prior knowledge of the possible initial states and transition probabilities of the chain. In our example of speech recognition, where the states of the HMM represent different phonemes, these values could, for instance, be estimated from textual data.
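For the phoneme example, one simple hypothetical recipe is to turn initial-state and transition counts gathered from text into Dirichlet pseudo-counts. The toy corpus and the add-one smoothing below are invented for illustration:

```python
from collections import Counter

# Toy corpus of phoneme sequences (purely illustrative)
corpus = [["s", "a", "m"], ["m", "a", "s"], ["s", "a", "s"]]
phonemes = sorted({p for seq in corpus for p in seq})
index = {p: i for i, p in enumerate(phonemes)}

# Count which phonemes start a sequence and which transitions occur
init_counts = Counter(seq[0] for seq in corpus)
trans_counts = Counter((a, b) for seq in corpus for a, b in zip(seq, seq[1:]))

# Observed counts plus a smoothing constant become Dirichlet pseudo-counts
u_pi = [1 + init_counts[p] for p in phonemes]
u_A = [[1 + trans_counts[(a, b)] for b in phonemes] for a in phonemes]
```

Larger pseudo-counts make the prior more informative; the smoothing constant keeps unseen transitions possible rather than assigning them zero prior mass.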

All the other parameters $m_{ik}$ and $v_{ik}$ have higher hierarchical priors. As the number of parameters in such priors grows quickly, only the full structure of the hierarchical prior of $m_{ik}$ is given. It is:

$$m_{ik} \sim N(m_{ik};\; m_{m_k},\, \exp(2 v_{m_k})) \tag{5.10}$$

$$m_{m_k} \sim N(m_{m_k};\; m_{m_m},\, \exp(2 v_{m_m})), \qquad v_{m_k} \sim N(v_{m_k};\; m_{v_m},\, \exp(2 v_{v_m})) \tag{5.11}$$

$$m_{m_m},\, v_{m_m},\, m_{v_m},\, v_{v_m} \sim \text{noninformative priors}. \tag{5.12}$$

The hierarchical prior of $m_{ik}$, for example, can be summarised as follows:

- The different components $k$ of the mean vector have different priors, whereas the vectors corresponding to different states of the HMM share a common prior, which is parameterised with $m_{m_k}$ and $v_{m_k}$.
- The parameters $m_{m_k}$ and $v_{m_k}$ corresponding to different components of the original vector share a common prior parameterised with $m_{m_m}$, $v_{m_m}$, $m_{v_m}$ and $v_{v_m}$.
- The parameters at the top of the hierarchy have fixed noninformative priors.
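Ancestral sampling makes this sharing pattern visible: each vector component gets its own middle-level hyperparameters, which are then shared across all HMM states. The sketch below uses our own variable names, with the top-level values simply fixed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_dims = 4, 3

# Top level: shared scalar hyperparameters (fixed here for illustration;
# in the model they would carry noninformative priors)
m_mm, v_mm = 0.0, 0.0    # govern the per-component means
m_vm, v_vm = -1.0, 0.0   # govern the per-component log-stds

# Middle level: one hyperparameter pair per vector component k,
# shared by all HMM states
m_mk = rng.normal(m_mm, np.exp(v_mm), size=n_dims)
v_mk = rng.normal(m_vm, np.exp(v_vm), size=n_dims)

# Bottom level: the state-specific means; every state i draws its
# component k from the same prior N(m_mk[k], exp(2 * v_mk[k]))
m = rng.normal(m_mk, np.exp(v_mk), size=(n_states, n_dims))
```

Broadcasting the middle-level arrays over the state axis is exactly the sharing described above: components differ, states do not.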

The set of model parameters $\boldsymbol{\theta}$ consists of all these parameters and all the parameters of the hierarchical priors.

In the hierarchical structure formulated above, the Gaussian prior for the mean of a Gaussian is a conjugate prior. Thus the posterior will also be Gaussian.
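The conjugacy claim can be checked numerically with the standard precision-weighted update for the mean of a Gaussian with known variance (our notation, not the text's):

```python
import numpy as np

def posterior_of_mean(prior_mean, prior_var, obs, obs_var):
    # Conjugate update for the mean of a Gaussian with known variance:
    # posterior precision = prior precision + n * likelihood precision,
    # posterior mean = precision-weighted combination of prior and data
    n = len(obs)
    post_prec = 1.0 / prior_var + n / obs_var
    post_mean = (prior_mean / prior_var + np.sum(obs) / obs_var) / post_prec
    return post_mean, 1.0 / post_prec
```

With one observation and equal prior and observation variances, the posterior mean lands halfway between the prior mean and the observation, and the posterior variance is halved, as the precision-weighting formulas predict.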

The parameterisation of the variance with $\sigma^2 = \exp(2v)$ is somewhat less conventional. The conjugate prior for the variance of a Gaussian is the *inverse gamma* distribution. Adding a new level of hierarchy for the parameters of such a distribution would, however, be significantly more difficult. The present parameterisation allows adding similar layers of hierarchy for the parameters of the priors of $m$ and $v$. In this parameterisation the posterior of $v$ is not exactly Gaussian, but it may be approximated with one. The exponential function ensures that the variance is always positive, and the posterior will thus be closer to a Gaussian.