The preprocessing of the acoustic signal is essentially the same for all the experiments, except that the sampling rate was increased for the new data from 12.8 kHz to 16 kHz, a rate commonly available on workstations. The acoustic features used throughout this work consist of the mel-cepstrum coefficients and the RMS value of the signal. The basic feature vectors for the experiments are 20-component cepstra, but extended feature vectors such as averaged, concatenated, and delta cepstra were also tested (Publication 4), as was a version in which the concatenated vectors use only the first 10-15 coefficients (Publication 6).
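The extended feature vectors mentioned above can be illustrated with a minimal sketch. The window sizes, the edge handling (repeating the first and last frame), and the function names `delta`, `concatenate`, and `average` are assumptions made for illustration; the exact definitions used in Publications 4 and 6 may differ.

```python
from typing import List, Optional

Frame = List[float]  # one cepstral vector, e.g. 20 coefficients


def _clamped(frames: List[Frame], t: int) -> Frame:
    """Return frame t, repeating the first/last frame beyond the edges."""
    return frames[min(max(t, 0), len(frames) - 1)]


def delta(frames: List[Frame], lag: int = 1) -> List[Frame]:
    """Delta cepstra as a simple difference c[t+lag] - c[t-lag] (assumed form)."""
    out = []
    for t in range(len(frames)):
        nxt, prv = _clamped(frames, t + lag), _clamped(frames, t - lag)
        out.append([a - b for a, b in zip(nxt, prv)])
    return out


def concatenate(frames: List[Frame], context: int = 1,
                keep: Optional[int] = None) -> List[Frame]:
    """Stack frames t-context..t+context into one vector.

    If `keep` is given, only the first `keep` coefficients of each
    frame are used, as in the reduced concatenated vectors.
    """
    out = []
    for t in range(len(frames)):
        vec: Frame = []
        for k in range(-context, context + 1):
            f = _clamped(frames, t + k)
            vec.extend(f if keep is None else f[:keep])
        out.append(vec)
    return out


def average(frames: List[Frame], context: int = 1) -> List[Frame]:
    """Average each coefficient over frames t-context..t+context."""
    out = []
    for t in range(len(frames)):
        window = [_clamped(frames, t + k) for k in range(-context, context + 1)]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out
```

With 20-component cepstra, `concatenate(frames, context=1, keep=10)` yields 30-component vectors, which shows how limiting the number of coefficients keeps the concatenated vectors at a manageable dimension.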
The HMM structure has been under continuous development in this work. Some basic assumptions of the system have remained unchanged: the simple temporal structure of unidirectional chains without skips (see Figure 1), the principle of using one HMM for each phoneme, and the exclusion of some rare phonemes, which leaves a total of 22 HMMs covering only the common Finnish phonemes and the silences directly before and after the word. The building blocks for the output densities of the states have been Gaussians with a shared diagonal covariance matrix.
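As a rough sketch of these two assumptions, the following code builds the transition matrix of a unidirectional chain without skips and evaluates the log-density of a Gaussian with a diagonal covariance. The number of states, the self-loop probability of 0.5, and the function names are hypothetical choices for illustration and are not taken from the publications.

```python
import math
from typing import List


def left_to_right_transitions(n_states: int,
                              self_loop: float = 0.5) -> List[List[float]]:
    """Transition matrix of a unidirectional chain without skips.

    From state i the model may only stay in i or advance to i + 1.
    The last state is given a self-loop of 1.0 here; in a full system
    the exit to the next phoneme model would be handled separately.
    """
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i + 1 < n_states:
            A[i][i] = self_loop
            A[i][i + 1] = 1.0 - self_loop
        else:
            A[i][i] = 1.0
    return A


def diag_gaussian_logpdf(x: List[float], mean: List[float],
                         var: List[float]) -> float:
    """log N(x; mean, diag(var)).

    `var` holds the diagonal of the covariance matrix; passing the same
    `var` for every Gaussian corresponds to a shared covariance.
    """
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )
```

Because every entry above the first superdiagonal is zero, no state can be skipped, and sharing one variance vector across all Gaussians keeps the number of free parameters small.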
The most common types of mixture density HMMs are compared, for example, in Publication 3 in terms of the number of parameters, the recognition speed, and the achieved error rate. The structure called the phoneme-wise tied mixture density HMM (PWMHMM; see Figure 1) performed best and was chosen as the baseline method for both the further developments and the online prototype system. The current average phoneme error rate for the seven speakers used in Publication 6 is 5.3%, measured on the 350-word test sets. By applying several successive training methods, a lower error rate (4.8%) can be achieved.
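The idea of phoneme-wise tying can be sketched as follows, under the assumption that all states of one phoneme's HMM share a common codebook of Gaussians while each state keeps its own mixture weights. The function names and the toy codebook are hypothetical, and the sketch is not the implementation used in the publications.

```python
import math
from typing import List


def diag_gaussian_logpdf(x: List[float], mean: List[float],
                         var: List[float]) -> float:
    """log N(x; mean, diag(var)) for a diagonal covariance."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def state_log_density(x: List[float], codebook_means: List[List[float]],
                      shared_var: List[float], weights: List[float]) -> float:
    """Log output density of one state in a tied-mixture HMM.

    `codebook_means` is the Gaussian codebook shared by all states of
    the phoneme; `weights` are the mixture weights of this state only.
    The sum over components is computed with the log-sum-exp trick.
    """
    logs = [
        math.log(w) + diag_gaussian_logpdf(x, m, shared_var)
        for w, m in zip(weights, codebook_means)
        if w > 0.0
    ]
    peak = max(logs)
    return peak + math.log(sum(math.exp(v - peak) for v in logs))
```

Two states of the same phoneme would call `state_log_density` with the same `codebook_means` and `shared_var` but different `weights`, so the Gaussian evaluations can be computed once per frame and reused across the states of that phoneme.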