
Introduction to this thesis project.

ASR research in the Laboratory of Information and Computer Science has long traditions, dating back to the first phoneme recognition system developed about 20 years ago. That system segmented and labeled speech into phoneme classes using the Learning Subspace Method, producing phonemic strings that were 81 % correct for one speaker [Jalanko, 1980]. With suitable post-processing, the system obtained a corresponding word recognition accuracy of 94 % in a closed thousand-word vocabulary. Since then, the development of new algorithms and recognition methods in the laboratory, together with the availability of dramatically more powerful computers, has raised the average speaker-dependent rate of correct phonemes in phoneme strings, for an unlimited vocabulary and without post-processing, to about 97 %. This rate corresponds to the average phoneme error rate of about 5 % measured in this work over seven speakers and 350 test words.

The research on LVQ and HMMs for ASR started in the laboratory in 1990 with experiments on LVQ-trained codebooks for discrete observation density HMMs [Torkkola et al., 1991, Kohonen, 1991]. The resulting system served as the basis for several reports [Kurimo, 1992, Mäntysalo, 1992, Utela, 1992]. Continuous density HMMs were first applied as a reference method, but after the introduction of semi-continuous density models and LVQ-based training methods, the work leading to [Kurimo, 1994] and this thesis began to take shape. The motivation for this work has been to enhance the modeling and training methods in order to decrease the recognition errors produced by the HMM decoding system, and to study the effects of extending the system to higher-dimensional feature vectors.

In this work, two ANN paradigms, the Self-Organizing Map (SOM) and Learning Vector Quantization (LVQ), are used with HMMs to incorporate some especially useful properties into the modeling of the phonemes. The SOM is applied to initialize and train the output densities of the HMM states (see Section 3.2 for a closer explanation). The output density model is a mixture of Gaussian density functions, and the SOM organizes the codebooks of Gaussians into a grid such that Gaussians responding well to similar features are expected to lie near each other in the grid. The resulting smooth and ordered representation of the feature space assists in fully exploiting the modeling capacity and enables fast search. LVQ is used as a simple method to increase the discrimination ability of the density codebook in order to minimize the recognition error rate.
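The following minimal NumPy sketch illustrates this arrangement: a SOM trained on feature vectors acts as a shared codebook of Gaussian means, one HMM state evaluates its output density as a weighted mixture over that codebook, and a single LVQ1 step nudges the codebook toward better discrimination between phoneme classes. The function names, parameter values, and the single shared variance are illustrative assumptions made for this sketch only, not the actual formulation used in the thesis, which is detailed in Section 3.2.

    import numpy as np

    def train_som(data, grid_shape=(8, 8), epochs=10, lr0=0.5, radius0=4.0):
        """Train a 2-D SOM whose units act as the Gaussian mean vectors
        (the density codebook); all parameter values are hypothetical."""
        h, w = grid_shape
        dim = data.shape[1]
        rng = np.random.default_rng(0)
        codebook = rng.normal(size=(h * w, dim))
        coords = np.array([(i, j) for i in range(h) for j in range(w)], dtype=float)
        n_steps = epochs * len(data)
        step = 0
        for _ in range(epochs):
            for x in data:
                frac = step / n_steps
                lr = lr0 * (1.0 - frac)
                radius = radius0 * (1.0 - frac) + 1e-3
                # Best-matching unit for the current feature vector.
                bmu = np.argmin(np.sum((codebook - x) ** 2, axis=1))
                # Grid-neighborhood update: units close to the winner on the map
                # are pulled along too, which yields the ordered, smooth codebook.
                d2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
                neigh = np.exp(-d2 / (2.0 * radius ** 2))
                codebook += lr * neigh[:, None] * (x - codebook)
                step += 1
        return codebook

    def state_likelihood(x, codebook, weights, var=1.0):
        """Mixture-of-Gaussians output density of one HMM state over the shared
        codebook (a single spherical variance is kept for brevity)."""
        dim = x.shape[0]
        diff = codebook - x
        log_g = (-0.5 * np.sum(diff ** 2, axis=1) / var
                 - 0.5 * dim * np.log(2.0 * np.pi * var))
        return float(np.sum(weights * np.exp(log_g)))

    def lvq1_step(x, codebook, labels, target, lr=0.05):
        """One LVQ1 correction: move the nearest codebook vector toward x if its
        phoneme label matches the target class, away from x otherwise."""
        bmu = np.argmin(np.sum((codebook - x) ** 2, axis=1))
        sign = 1.0 if labels[bmu] == target else -1.0
        codebook[bmu] += sign * lr * (x - codebook[bmu])
        return codebook

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        feats = rng.normal(size=(500, 12))        # stand-in for cepstral feature vectors
        book = train_som(feats, grid_shape=(6, 6))
        w = np.full(len(book), 1.0 / len(book))   # uniform mixture weights for one state
        print(state_likelihood(feats[0], book, w))

The sketch fixes the mixture weights and variance only to stay self-contained; it is meant to show how a single codebook, organized on a SOM grid and shared between states in the semi-continuous fashion mentioned above, can serve both the density evaluation and the LVQ correction.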

