ASR system.

For a brief overview of the ASR system used in the experiments, see Figure 2. Recognition proceeds in five successive phases. The contribution of this work is mainly restricted to the middle phase, where the output density models of the HMM states provide, for each state, the likelihood that the state would produce the currently observed feature vector of the signal. This part consumes over 50 % of the total computational load of the recognition process. Some experiments were made for Publications 4 and 6 with different extensions of the feature vectors, which belong to the second phase in Figure 2. The other three main phases of the recognition system are currently rather straightforward implementations of conventional algorithms, with only some minor practical tricks to make the processing more efficient.

The preprocessing of the acoustic signal is basically the same for all the experiments, except that the sampling rate was increased for the new data from 12.8 kHz to 16 kHz, a rate commonly available on workstations. The acoustic features used throughout this work consist of the mel-cepstrum coefficients and the RMS value of the signal. The basic feature vectors for the experiments are 20-component cepstra, but extended feature vectors such as averaged, concatenated, and delta cepstra were also tested (Publication 4), as was a version where the concatenated vectors use only the first 10-15 coefficients (Publication 6).
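The extended feature vectors mentioned above can be illustrated with a minimal sketch. The exact definitions used in Publications 4 and 6 are not given in this section, so the window sizes, the edge handling (clamping to the first and last frame), and the function names delta, average and concatenate are assumptions chosen for illustration only.

```python
from typing import List, Optional

Frame = List[float]  # one cepstral vector, e.g. 20 coefficients


def _clamped(frames: List[Frame], t: int) -> Frame:
    """Return frame t, clamping the index to the valid range (assumed edge rule)."""
    return frames[min(max(t, 0), len(frames) - 1)]


def delta(frames: List[Frame], k: int = 1) -> List[Frame]:
    """Simple difference delta: c[t+k] - c[t-k] for every frame."""
    return [
        [a - b for a, b in zip(_clamped(frames, t + k), _clamped(frames, t - k))]
        for t in range(len(frames))
    ]


def average(frames: List[Frame], context: int = 1) -> List[Frame]:
    """Average each frame with its `context` neighbours on both sides."""
    width = 2 * context + 1
    out = []
    for t in range(len(frames)):
        window = [_clamped(frames, t + off) for off in range(-context, context + 1)]
        out.append([sum(col) / width for col in zip(*window)])
    return out


def concatenate(frames: List[Frame], context: int = 1,
                keep: Optional[int] = None) -> List[Frame]:
    """Concatenate each frame with its neighbours.

    If `keep` is given, only the first `keep` coefficients of every
    frame are used, as in the reduced variant with 10-15 coefficients.
    """
    out = []
    for t in range(len(frames)):
        vec: Frame = []
        for off in range(-context, context + 1):
            f = _clamped(frames, t + off)
            vec.extend(f if keep is None else f[:keep])
        out.append(vec)
    return out
```

With 20-component cepstra, concatenate(frames, context=1, keep=10) would give 30-component vectors, which shows how truncating the coefficients limits the growth of the feature dimension.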

The HMM structure has been under continuous development in this work. The basic assumptions that have remained unchanged are the simple temporal structure of uni-directional chains without skips (see Figure 1), the principle of using one HMM for each phoneme, and the exclusion of some rare phonemes, which leaves a total of 22 HMMs covering only the common Finnish phonemes and the silences directly before and after the word. The building blocks for the output densities of the states have been Gaussians with a shared diagonal covariance matrix.
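The two structural assumptions above can be sketched as follows: a uni-directional chain without skips allows only a self-loop and a step to the next state, and a diagonal covariance reduces the Gaussian log-density to a sum over components. The function names, the number of states, the value of p_stay, and the treatment of the last state are illustrative assumptions, not values taken from this work.

```python
import math
from typing import List


def left_to_right_transitions(n_states: int, p_stay: float = 0.5) -> List[List[float]]:
    """Transition matrix of a uni-directional chain without skips.

    State i may only stay in i or move to i + 1. The last state is
    given a self-loop of 1.0 here as a simplification; in a full
    recognizer the exit would lead to the next phoneme model.
    """
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i + 1 < n_states:
            A[i][i] = p_stay
            A[i][i + 1] = 1.0 - p_stay
        else:
            A[i][i] = 1.0
    return A


def log_gaussian_diag(x: List[float], mean: List[float], var: List[float]) -> float:
    """Log-density of a Gaussian with diagonal covariance.

    `var` holds the diagonal of the covariance matrix; with a shared
    covariance the same `var` is passed for every Gaussian.
    """
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )
```

Because the covariance is shared, the log-determinant term is the same for every Gaussian and could be computed once rather than per component.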

The most common types of mixture density HMMs are compared, for example, in Publication 3 in terms of the number of parameters, the recognition speed, and the achieved error rate. The structure called the phoneme-wise tied mixture density HMM (PWMHMM) (Figure 1) performed best and was chosen as the baseline method for both the further developments and the online prototype system. The current average phoneme error rate for the seven speakers used in Publication 6 is 5.3%, measured on the 350-word test sets. By applying several successive training methods, a lower error rate (4.8%) can be achieved.
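As a rough illustration of phoneme-wise tying, the sketch below assumes that all states of one phoneme HMM share a single codebook of Gaussians and differ only in their mixture weights; the precise PWMHMM definition is in Publication 3, not in this section. The names state_log_likelihoods, codebook_means, shared_var and state_weights are hypothetical.

```python
import math
from typing import List


def log_gaussian_diag(x: List[float], mean: List[float], var: List[float]) -> float:
    """Log-density of a Gaussian with diagonal covariance `var`."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def logsumexp(values: List[float]) -> float:
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def state_log_likelihoods(x: List[float],
                          codebook_means: List[List[float]],
                          shared_var: List[float],
                          state_weights: List[List[float]]) -> List[float]:
    """Log-likelihood of feature vector x for every state of one phoneme.

    The Gaussian log-densities are evaluated once per phoneme and then
    reused by every state, which only applies its own mixture weights.
    Components with zero weight are skipped.
    """
    comp = [log_gaussian_diag(x, m, shared_var) for m in codebook_means]
    return [
        logsumexp([math.log(w) + c for w, c in zip(weights, comp) if w > 0.0])
        for weights in state_weights
    ]
```

Under this assumption the Gaussian evaluations, which dominate the cost of the likelihood phase, are shared by all states of a phoneme, so the per-state cost is only a weighted sum.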


Mikko Kurimo
11/7/1997