It can be shown [Juang, 1985] that maximum likelihood (ML) training of HMMs converges to at least a locally optimal set of parameter values, so it is the task of the initialization to ensure that the obtained result is as close as possible to the global optimum. In practice, the large training databases make the error function so complicated that it is not feasible to iterate the HMM training until final convergence, since progress usually becomes very slow after the first few iterations. The ML criterion is not particularly effective at minimizing the number of misclassifications, but it was nevertheless used in the experiments to compare initializations, because advanced training algorithms such as segmental GPD [Juang and Katagiri, 1992] or maximum mutual information (MMI) [Bahl et al., 1986] are even more vulnerable to poor initialization. Indeed, the normal procedure for HMM training is to apply GPD or MMI only to models first trained by ML methods (in Figure 3, GPD or MMI would substitute for the corrective training).
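The practice of stopping the ML iterations once progress becomes slow can be sketched as a simple relative-improvement criterion. The helper below is illustrative only: `step` stands for one re-estimation pass (e.g. Baum-Welch) returning the new training-set log-likelihood, and the tolerance value is an assumption, not a setting from the experiments.

```python
def train_until_slow(step, rel_tol=1e-4, max_iters=50):
    """Run re-estimation steps, each returning the new log-likelihood,
    until the relative improvement falls below rel_tol (or max_iters
    is reached).  Returns the last log-likelihood and the step count."""
    prev = step()
    for i in range(1, max_iters):
        cur = step()
        if abs(cur - prev) < rel_tol * abs(prev):
            return cur, i + 1
        prev = cur
    return prev, max_iters
```

With a stub sequence of log-likelihoods such as `-1000, -900, -890, -889.95`, the loop stops at the fourth step, where the gain (0.05) drops below the relative tolerance.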
In addition to comparing the average test set error rates after a fixed number of ML iterations, the performance of the differently initialized models is analyzed by taking the speed of convergence into account. Here, the speed of convergence refers to the number of training epochs needed before the recognition result is close to a benchmark result. The benchmark result is used as a substitute for the optimal result, simply because determining, or even defining, the optimal result is problematic. The benchmark is obtained by finding out how low an error rate can actually be achieved with the ML method under study: the model giving the lowest average error rate after a long training session is selected. A tested model is considered close enough to the benchmark when its average recognition result does not differ significantly from the benchmark, as judged by the matched-pairs and McNemar statistical tests [Gillick and Cox, 1989].
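McNemar's test, as used above, compares two recognizers on the same test tokens and considers only the discordant pairs: tokens one recognizer got right and the other got wrong. A minimal sketch of the exact two-sided version (the function name and counts are illustrative; the experiments' exact variant follows Gillick and Cox [1989]):

```python
from math import comb

def mcnemar_exact(n01, n10):
    """Exact two-sided McNemar test.

    n01: tokens recognizer A got right and recognizer B got wrong.
    n10: the opposite case.
    Tokens both recognizers got right (or both wrong) do not enter
    the statistic.  Under the null hypothesis of equal performance,
    the discordant outcomes are Binomial(n01 + n10, 0.5)."""
    n = n01 + n10
    # two-sided binomial tail probability, clipped at 1
    tail = sum(comb(n, k) for k in range(min(n01, n10) + 1))
    return min(2.0 * tail / 2.0 ** n, 1.0)
```

For example, with 8 discordant tokens favouring one recognizer and 2 favouring the other, the p-value is 0.109, so the difference would not be called significant at the usual 5% level.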
The results in Publication 6 show that there are remarkable differences between the initialization methods. By the criteria described above, the SOM and LVQ initialization methods suit the MDHMMs best. SOM was selected as the initialization for the further training method comparisons because of its fast implementation. For phoneme-wise tied MDHMMs (see Figure 1), the SOM initialization can be implemented very efficiently, since each mixture density codebook can be initialized independently.
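The independent codebook initialization can be sketched as follows: a small one-dimensional SOM is trained on the feature vectors assigned to one phoneme, and the resulting unit vectors serve as the initial mixture means of that phoneme's codebook. The grid topology, decay schedules, and parameter values below are illustrative assumptions, not the settings used in the experiments.

```python
import random

def som_codebook(data, n_units, epochs=20, lr0=0.5, seed=0):
    """Train a 1-D SOM on a list of feature vectors; the trained unit
    vectors can serve as initial mean vectors of one mixture density
    codebook.  Each phoneme's codebook is trained independently on
    the feature vectors assigned to that phoneme."""
    rng = random.Random(seed)
    # start from randomly picked training vectors
    units = [list(rng.choice(data)) for _ in range(n_units)]
    radius0 = n_units / 2.0
    total = epochs * len(data)
    t = 0
    for _ in range(epochs):
        for x in data:
            frac = t / total
            lr = lr0 * (1.0 - frac)                    # learning rate decays to 0
            radius = max(radius0 * (1.0 - frac), 0.5)  # neighborhood shrinks
            # best-matching unit: smallest squared Euclidean distance
            bmu = min(range(n_units),
                      key=lambda i: sum((u - v) ** 2
                                        for u, v in zip(units[i], x)))
            for i in range(n_units):
                d = abs(i - bmu)  # distance on the 1-D unit grid
                if d <= radius:
                    h = lr * (1.0 - d / (radius + 1.0))
                    units[i] = [u + h * (v - u) for u, v in zip(units[i], x)]
            t += 1
    return units
```

Because each update moves a unit toward a training vector by a factor below one, the units stay within the convex hull of the phoneme's data, which is what makes them reasonable starting points for the mixture means.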