From a theoretical point of view, several limitations restrict the applicability of HMMs to phoneme modeling. First, the assumption that phonemes are invariable speech units is insufficient. A comparison of the same phoneme pronounced in different utterances reveals the well-known coarticulation effect: the articulatory apparatus adapts the current phoneme to its neighbors, so that each realization varies slightly. Additional variation arises from changes in the data collection conditions and in the speaker. To take all of this variation into account, several different HMMs would actually have to be trained for each phoneme. Another possibility would be to apply a more complicated state structure that allows skips and branches instead of the simple unidirectional chain of Figure 1; a sketch of such a topology is given below. However, trainability and the scarcity of training data restrict the use of very sophisticated models.
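As an illustrative sketch (Python with NumPy; the state count and probability values below are hypothetical, not taken from an actual system), the difference between the simple chain and a topology allowing skips can be expressed directly in the transition matrix, where a zero entry forbids the corresponding transition:

```python
import numpy as np

# Hypothetical 3-state phoneme HMM topologies (rows: from-state, cols: to-state).
# Simple left-to-right chain (as in Figure 1): each state may only loop
# on itself or advance to the next state.
A_chain = np.array([
    [0.6, 0.4, 0.0],   # state 0 -> {0, 1}
    [0.0, 0.7, 0.3],   # state 1 -> {1, 2}
    [0.0, 0.0, 1.0],   # state 2 (final state of the phoneme model)
])

# Left-to-right topology with a skip transition: state 0 may jump directly
# to state 2, modeling e.g. a shortened realization of the phoneme.
A_skip = np.array([
    [0.6, 0.3, 0.1],   # state 0 -> {0, 1, 2}; 0 -> 2 is the skip
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
])

# Both matrices must be row-stochastic.
assert np.allclose(A_chain.sum(axis=1), 1.0)
assert np.allclose(A_skip.sum(axis=1), 1.0)
```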
The basic assumptions of Markov models, namely that only the present state affects the transition probabilities and that successive observations are independent, make simple probabilistic inference possible, but they do not hold accurately for speech. Both the production of the signal and the feature extraction are affected by neighboring events, so the models can only be treated as approximations. The static density models are likewise only approximations, since not all phonemes generate quasistationary signals. If only a few Gaussian mixture components are used per phoneme, there is a further possibility of mismatch, because the variation around the cluster centroids in the input space is not inherently Gaussian. Such non-Gaussian clusters can be modeled accurately only with a sufficiently large number of mixture components, as the sketch below illustrates.
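The point can be demonstrated with a minimal sketch, assuming scikit-learn is available; the cluster shape and component counts are hypothetical. Fitting mixtures with an increasing number of components to a non-Gaussian cluster raises the average log-likelihood, since the extra components absorb the departures from Gaussianity:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical non-Gaussian cluster: points on a noisy arc, standing in
# for the distribution of feature vectors of one phoneme.
theta = rng.uniform(0.0, np.pi, size=1000)
X = np.column_stack([np.cos(theta), np.sin(theta)])
X += rng.normal(scale=0.1, size=(1000, 2))

# A single Gaussian fits such a cluster poorly; adding mixture
# components improves the fit of the density model.
for k in (1, 2, 4, 8):
    gmm = GaussianMixture(n_components=k, covariance_type="diag",
                          random_state=0).fit(X)
    print(f"{k} component(s): avg. log-likelihood = {gmm.score(X):.3f}")
```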
Numerical difficulties arise from the wide dynamic range of the Gaussian densities. To prevent the sequence probabilities from underflowing, special techniques are needed: for example, the probabilities can be normalized at each time step, or the multiplication in (17) can be replaced by a summation of logarithmized probabilities [Rabiner, 1989]. Combining the discrete mixture weights and transition probabilities with density function values of much wider dynamic range may sometimes cause problems as well.
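As a minimal illustration of the log-domain technique (a Python/NumPy sketch, not part of the cited reference; all model values are hypothetical), the forward recursion can replace the multiplications of (17) by additions of log-probabilities, with the sums over states carried out via log-sum-exp:

```python
import numpy as np

def forward_log(log_A, log_b, log_pi):
    """Forward algorithm computed in the log domain [Rabiner, 1989].

    log_A:  (N, N) log transition probabilities a_ij
    log_b:  (T, N) log observation densities b_j(o_t) for each frame
    log_pi: (N,)   log initial state probabilities

    Products of small probabilities become sums of log-probabilities,
    and the sums over states use logaddexp, so the sequence probability
    never underflows.
    """
    T, N = log_b.shape
    log_alpha = log_pi + log_b[0]
    for t in range(1, T):
        # log-sum-exp over predecessor states replaces the summation
        # of alpha_t(i) * a_ij in the linear-domain recursion.
        log_alpha = np.logaddexp.reduce(log_alpha[:, None] + log_A,
                                        axis=0) + log_b[t]
    return np.logaddexp.reduce(log_alpha)

# Hypothetical 2-state model over 3 frames, values for illustration only.
log_A = np.log(np.array([[0.7, 0.3],
                         [0.2, 0.8]]))
log_pi = np.log(np.array([0.9, 0.1]))
log_b = np.log(np.array([[0.5, 0.1],
                         [0.4, 0.2],
                         [0.1, 0.6]]))
print("log P(O | model) =", forward_log(log_A, log_b, log_pi))
```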
Nevertheless, despite their inaccuracy and theoretical incompleteness, mixture density HMMs work very well in practice, and they are applied in almost all current ASR systems. The reason is perhaps that, because their structure is simple and mathematically tractable, the models can be extended easily. It can be shown [Niles and Silverman, 1990] that HMM training is actually very close to some ANN methods. In practice, the complexity of the data and the theoretical limitations of HMMs can then be partly overcome by brute force, i.e., by increasing the codebook sizes, the input dimensions, and the number of models. As is known for neural computation in general, very complex systems can be modeled to an appropriate accuracy by connecting a sufficient number of simple units [Gorin and Mammone, 1994]. Special care must be taken, however, to ensure that there is enough training data and that the data represent the task well. Otherwise the huge models will be over-fitted and will generalize poorly to independent test data. Such heavily extended models are of the black-box type and do not necessarily offer intelligible insights into the system.