
Recognition error rate.

The test task used throughout the thesis is automatic speech recognition with an unlimited vocabulary. The recognition tests use Finnish speech data collected between 1991 and 1996 in the Laboratory of Computer and Information Science. For each of over 20 speakers there are at least four recording sessions, normally collected over a span of a couple of months. In each session the speaker uttered a word list of over 300 different Finnish words. To keep the computational load tolerable, only a subset of the speakers is used in the tests that compare the algorithms and methods in this thesis. The speaker-dependent recognition models are trained on three word sets and tested on the remaining one. To get a reliable view of the error rate, the leave-one-out principle is used: each set in turn serves as the test set, and the average error rate over all the sets is reported. The error rate is the number of all phoneme errors (inserted, deleted, and changed phonemes) divided by the total number of phonemes. To gain statistical significance for the model comparisons, the error rates of all tests for all tested speakers are averaged. The rate of correct phonemes is sometimes given as well, but because it does not react to phoneme insertion errors, it is less illustrative than the phoneme error rate.
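The error measures described above can be illustrated with a minimal Python sketch. It is not the scoring code used in the thesis. It assumes that errors are counted from a minimum edit-distance alignment between the reference and the recognized phoneme sequence, that the total number of phonemes is counted in the reference, and that the rate of correct phonemes is (N - substitutions - deletions) / N. The function names and the example words are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple

Pair = Tuple[Sequence[str], Sequence[str]]  # (reference, recognized)


def count_errors(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) of a minimum edit-distance alignment."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = (total, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    cost = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, k = cost[i - 1][j - 1]
            match = ref[i - 1] == hyp[j - 1]
            diag = (t, s, d, k) if match else (t + 1, s + 1, d, k)
            t, s, d, k = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, k)
            t, s, d, k = cost[i][j - 1]
            insertion = (t + 1, s, d, k + 1)
            # Tuples compare on the total cost first, so the total is minimized.
            cost[i][j] = min(diag, deletion, insertion)
    _, subs, dels, ins = cost[n][m]
    return subs, dels, ins


def error_and_correct_rate(pairs: Iterable[Pair]) -> Tuple[float, float]:
    """Pool the errors of one test set and return (error rate, correct rate)."""
    subs = dels = ins = total = 0
    for ref, hyp in pairs:
        s, d, k = count_errors(ref, hyp)
        subs, dels, ins = subs + s, dels + d, ins + k
        total += len(ref)
    return (subs + dels + ins) / total, (total - subs - dels) / total


def leave_one_out_error(test_sets: List[List[Pair]]) -> float:
    """Average the error rates of the tests, one per held-out word set."""
    rates = [error_and_correct_rate(pairs)[0] for pairs in test_sets]
    return sum(rates) / len(rates)


# One inserted phoneme: the error rate reacts, the correct rate does not.
err, corr = error_and_correct_rate([(list("talo"), list("tallo"))])
print(err, corr)
```

In the example the recognized sequence contains one extra phoneme, so the error rate is 1/4 while the rate of correct phonemes stays at 1, which is why the latter is the less illustrative figure.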

No post-processing is applied to the results, so that all the differences produced by the compared models remain visible. Post-processing can exploit language-dependent syntax and point out uncommon phoneme combinations in the raw phoneme sequences. An optional post-processing module in the applied ASR system is based on the Dynamically Expanding Context (DEC) algorithm [Kohonen, 1986a]. Distinguishing long phonemes such as /AA/ from their short counterparts is another source of frequent errors. Here, the distinction is made using phoneme-dependent duration limits learned iteratively during the model training. This simple separation does not take any context information into account. In Finnish, mismatches between the written and spoken forms of words are rare, but these errors, as well as some unmodeled rare phonemes, raise the lowest obtainable error rate.
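As a rough illustration of the context-free duration test, the following Python sketch relabels a recognized phoneme as its long version when the segment lasts at least as long as a phoneme-dependent limit. Everything specific here is an assumption: the limit values, the use of frames as the unit, the comparison with "at least", and the function name are hypothetical, and the iterative learning of the limits during training is not shown.

```python
from typing import Dict

# Hypothetical duration limits in frames; in the thesis the limits are
# learned during model training, and their actual values are not given here.
DURATION_LIMITS: Dict[str, int] = {"A": 12, "I": 10, "S": 14}


def apply_duration_limit(phoneme: str, n_frames: int, limits: Dict[str, int]) -> str:
    """Return the long label (e.g. "AA") if the segment reaches the limit."""
    limit = limits.get(phoneme)
    if limit is not None and n_frames >= limit:
        return phoneme * 2  # long version written by doubling, as in /AA/
    return phoneme  # short phoneme, or no limit known for this phoneme


print(apply_duration_limit("A", 15, DURATION_LIMITS))
print(apply_duration_limit("A", 8, DURATION_LIMITS))
```

Because the decision depends only on the phoneme label and the segment duration, neighbouring phonemes and speaking rate have no effect on it, which matches the remark that no context information is used.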

Although the recognition test settings look similar in all publications included in the thesis, error rates must be compared between the publications with special care. The speech database was revised in 1995, and the experiments mainly use the data collected after that, because it covers a broader variety of speakers, ranging from ASR researchers to novices. In some of the experiments the leave-one-out averaging of the results was abandoned in order to test the methods with a larger number of speakers. The set of training words was also extended from the previous 311 to 350 by including some more uncommon words for a better balance among the phoneme combinations. Since other, more independent high-quality Finnish databases suitable for similar experiments have unfortunately not been available, the data from 1991, which uses a slightly different sampling rate than the new data, is still used to verify some of the main results.


  
Figure 2: The main phases of the ASR by HMMs.


Mikko Kurimo
11/7/1997