Paper abstracts


Large Vocabulary Statistical Language Modeling for Continuous Speech Recognition in Finnish

Authors

Vesa Siivola, Mikko Kurimo, and Krista Lagus

Published

In Proceedings of the 7th European Conference on Speech Communication and Technology, EUROSPEECH'2001, Aalborg, Denmark, 2001.

Abstract

Entire paper (32 kB)


Arabic Documents Indexing and Classification Based on Latent Semantic Analysis and Self-Organizing Map

Authors

Chafic Mokbel, Hanna Greige, Charles Sarraf, and Mikko Kurimo

Published

In Proceedings of the IEEE workshop on Natural Language Processing in Arabic, Beirut, Lebanon, 2001.

Abstract

This paper describes an Arabic document indexing system based on a hybrid "Latent Semantic Analysis" (LSA) and "Self-Organizing Maps" (SOM) algorithm. The approach has the advantage of being completely statistical and of automatically inferring the indices from the document database. A rule-based stemming method is also proposed for the Arabic language. The whole system has been tested on a database formed of the "Alnahar" newspaper articles for 1999. Document clustering and the first few retrieval experiments have provided encouraging results.


Thematic Indexing of Spoken Documents by Using Self-Organizing Maps

Author

Mikko Kurimo

Published

Accepted for publication in Speech Communication.

Abstract

A method is presented to provide a useful searchable index for spoken audio documents. The task differs from traditional (text) document indexing, because large audio databases are decoded by automatic speech recognition and decoding errors occur frequently. The idea in this paper is to take advantage of the large size of the database and select the best index terms for each document with the help of other documents close to it in a semantic vector space. First, the audio stream is converted into a text stream by a speech recognizer. Then the text of each story is represented by a document vector which is the normalized sum of the word vectors in the story. A large collection of document vectors is used to train a self-organizing map to find the clusters and latent semantic structures in the collection. Because the news stories are quite short and include speech recognition errors, the idea of smoothing the document vectors using the thematic clusters determined by the self-organizing map is introduced to obtain a better index. The application in this paper is the indexing and retrieval of broadcast news on radio and TV. Test results are given using the evaluation data from the TREC spoken document retrieval task.
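As a rough illustration of the document-vector construction and cluster smoothing described above, here is a minimal sketch. The toy vocabulary, the random stand-in word vectors and SOM units, and the two-nearest-cluster smoothing rule are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy word vectors (in the paper these span a semantic space; here random).
vocab = {w: rng.normal(size=8) for w in ["nokia", "stock", "market", "rain"]}

def doc_vector(words):
    """Document vector: normalized sum of the word vectors in the story."""
    v = np.sum([vocab[w] for w in words], axis=0)
    return v / np.linalg.norm(v)

def smooth(doc_vec, som_units, k=2):
    """Smooth a document vector toward its k closest SOM cluster centroids,
    reducing the effect of recognition errors in short stories."""
    d = np.linalg.norm(som_units - doc_vec, axis=1)
    nearest = som_units[np.argsort(d)[:k]]
    v = doc_vec + nearest.mean(axis=0)
    return v / np.linalg.norm(v)

som_units = rng.normal(size=(16, 8))   # stand-in for a trained SOM codebook
v = doc_vector(["nokia", "stock", "market"])
sv = smooth(v, som_units)
```

The smoothed vector pulls the document toward its thematic neighborhood, so a story whose decoded text is noisy still lands near the cluster of related stories.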

Entire manuscript (263 kB)


Indexing spoken audio by LSA and SOMs

Author

Mikko Kurimo

Published

In Proceedings of the European Signal Processing Conference, EUSIPCO'2000, Tampere, Finland, 2000.

Abstract

This paper presents an indexing system for spoken audio documents. The framework is a European long-term research project, THISL, for indexing and retrieval of broadcast news. The current indexing system applies latent semantic analysis (LSA) and self-organizing maps (SOM) to map the documents into a semantic vector space and display the semantic structures of the document collection. The objective is to enhance the indexing of parts with a high word error rate by smoothing the document vectors with other documents close to them. Experimental results are provided using the test data of the TREC SDR (spoken document retrieval) track.

Entire paper (179 kB)


Fast Latent Semantic Indexing of Spoken Documents by using Self-Organizing Maps

Author

Mikko Kurimo

Published

In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP'2000, Istanbul, Turkey, 2000.

Abstract

This paper describes a new latent semantic indexing (LSI) method for spoken audio documents. The framework is indexing broadcast news from radio and TV as a combination of large vocabulary continuous speech recognition (LVCSR), natural language processing (NLP) and information retrieval (IR). For indexing, the documents are presented as vectors of word counts, whose dimensionality is rapidly reduced by random mapping (RM). The obtained vectors are projected into the latent semantic subspace determined by SVD, where the vectors are then smoothed by a self-organizing map (SOM). The smoothing by the closest document clusters is important here, because the documents are often short and have a high word error rate (WER). As the clusters in the semantic subspace reflect the news topics, the SOMs provide an easy way to visualize the index and query results and to explore the database. Test results are reported for TREC's spoken document retrieval databases.
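The random mapping (RM) step mentioned above can be sketched as follows. The dimensions are illustrative, not the paper's actual vocabulary and subspace sizes, and the sign-matrix construction is one common choice of random projection:

```python
import numpy as np

rng = np.random.default_rng(1)

vocab_size, reduced_dim = 5000, 100    # illustrative sizes only
# Fixed random matrix of +-1 entries, scaled so lengths are roughly preserved.
R = rng.choice([-1.0, 1.0], size=(vocab_size, reduced_dim)) / np.sqrt(reduced_dim)

def random_map(word_counts):
    """Random mapping: project a high-dimensional word-count vector down
    with a fixed random matrix; pairwise distances are roughly preserved."""
    return word_counts @ R

a = np.zeros(vocab_size); a[[3, 10, 42]] = 1.0
b = np.zeros(vocab_size); b[[3, 10, 99]] = 1.0
c = np.zeros(vocab_size); c[[700, 800, 900]] = 1.0

# Documents sharing words stay closer after projection than unrelated ones.
ra, rb, rc = random_map(a), random_map(b), random_map(c)
```

Because R is fixed and cheap to apply, the dimensionality can be reduced rapidly before the more expensive SVD projection.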

Entire paper (184 kB)


Indexing Audio Documents by using Latent Semantic Analysis and SOM

Author

Mikko Kurimo

Published

In Erkki Oja and Samuel Kaski, editors, Kohonen Maps, pages 363-374, Elsevier, 1999.

Abstract

This paper describes an important application for state-of-the-art automatic speech recognition, natural language processing and information retrieval systems. Methods for enhancing the indexing of spoken documents by using latent semantic analysis and self-organizing maps are presented, motivated and tested. The idea is to extract extra information from the structure of the document collection and use it for more accurate indexing by generating new index terms and stochastic index weights. Indexing methods are evaluated for two broadcast news databases (one French and one English) using the average document perplexity defined in this paper and test queries analyzed by human experts.

Entire chapter (57 kB)


Latent Semantic Indexing by Self-Organizing Map

Authors

Mikko Kurimo and Chafic Mokbel

Published

In Proceedings of the ESCA ETRW workshop on Accessing Information in Spoken Audio, pages 25-30, Cambridge, UK, 1999.

Abstract

An important problem for the information retrieval from spoken documents is how to extract those relevant documents which are poorly decoded by the speech recognizer. In this paper we propose a stochastic index for the documents based on the Latent Semantic Analysis (LSA) of the decoded document contents. The original LSA approach uses Singular Value Decomposition to reduce the dimensionality of the documents. As an alternative, we propose a computationally more feasible solution using Random Mapping (RM) and Self-Organizing Maps (SOM). The motivation for clustering the documents by SOM is to reduce the effect of recognition errors and to extract new characteristic index terms. Experimental indexing results are presented using relevance judgments for the retrieval results of test queries and using a document perplexity defined in this paper to measure the power of the index models.

Entire paper (40 kB)


Improving Vocabulary Independent HMM Decoding Results by Using the Dynamically Expanding Context

Author

Mikko Kurimo

Published

In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seattle, WA, USA, 1998.

Abstract

A method is presented to correct phoneme strings produced by a vocabulary independent speech recognizer. The method first extracts the N best matching result strings using mixture density hidden Markov models (HMMs) trained by neural networks. Then the strings are corrected by rules generated automatically by the Dynamically Expanding Context (DEC). Finally, the corrected string candidates and the extra alternatives proposed by the DEC are ranked according to the likelihood score of the best HMM path generating the obtained string. The experiments show that N need not be very large and that the method is able to decrease recognition errors even on test data that has no words in common with the training data of the speech recognizer.

Entire paper (28 kB)

Poster presented in ICASSP (88 kB)


Self Organization in Mixture Densities of HMM based Speech Recognition

Author

Mikko Kurimo

Published

In Proceedings of ESANN'98, European Symposium on Artificial Neural Networks, pages 237-242, Bruges, Belgium, 1998.

Abstract

In this paper experiments are presented to apply Self-Organizing Map (SOM) and Learning Vector Quantization (LVQ) for training mixture density hidden Markov models (HMMs) in automatic speech recognition. The decoding of spoken words into text is made using speaker dependent, but vocabulary and context independent phoneme HMMs. Each HMM has a set of states and the output density of each state is a unique mixture of the Gaussian densities. The mixture densities are trained by segmental versions of SOM and LVQ3. SOM is applied to initialize and smooth the mixture densities and LVQ3 to simply and robustly decrease recognition errors.

Entire paper (56 kB)


Training Mixture Density HMMs with SOM and LVQ

Author

Mikko Kurimo

Published

In Computer Speech and Language, Volume 11, Number 4, pages 321-343, October 1997.

Abstract

Entire paper


Comparison results for segmental training algorithms for mixture density HMMs

Author

Mikko Kurimo

Published

In Proceedings of the 5th European Conference on Speech Communication and Technology, EUROSPEECH'97, pages 87-90, Rhodes, Greece, 1997.

Abstract

This work presents experiments on four segmental training algorithms for mixture density HMMs. The segmental versions of SOM and LVQ3 suggested by the author are compared against the conventional segmental K-means and the segmental GPD. The recognition task used as a test bench is speaker dependent, but vocabulary independent, automatic speech recognition. The output density function of each state in each model is a mixture of multivariate Gaussian densities. The neural network methods SOM and LVQ are applied to learn the parameters of the density models from the mel-cepstrum features of the training samples. The segmental training improves the segmentation and the model parameters alternately to obtain the best possible result, because the segmentation and the segment classification depend on each other. It suffices to start the training process by dividing the training samples approximately into phoneme samples.
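For orientation, here is a minimal sketch of the plain (non-segmental) LVQ3 update rule that underlies the segmental variant compared above. The learning rate, window width, and epsilon values are arbitrary illustrations, and the function name is hypothetical:

```python
import numpy as np

def lvq3_step(x, label, codebook, classes, alpha=0.05, eps=0.3, window=0.2):
    """One LVQ3 update: adjust the two codebook vectors nearest to the
    training vector x according to whether their classes match its label."""
    d = np.linalg.norm(codebook - x, axis=1)
    i, j = np.argsort(d)[:2]
    if classes[i] == label and classes[j] == label:
        # Both correct: small update toward x (the epsilon rule of LVQ3).
        codebook[i] += eps * alpha * (x - codebook[i])
        codebook[j] += eps * alpha * (x - codebook[j])
    elif (classes[i] == label) != (classes[j] == label):
        # One correct, one wrong: update only if x falls inside the window
        # around the midplane of the two vectors.
        s = (1 - window) / (1 + window)
        if min(d[i] / d[j], d[j] / d[i]) > s:
            for k in (i, j):
                sign = 1.0 if classes[k] == label else -1.0
                codebook[k] += sign * alpha * (x - codebook[k])
    return codebook

cb = np.array([[0.0], [1.0]])        # one codebook vector per class
out = lvq3_step(np.array([0.45]), 0, cb, classes=[0, 1])
```

In the example call, the class-0 vector moves toward the sample and the class-1 vector moves away, sharpening the decision boundary between the two classes.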

Entire paper (32 kB)


SOM based density function approximation for mixture density HMMs

Author

Mikko Kurimo

Published

In Proceedings of the Workshop on Self-Organized Maps, WSOM'97, pages 8-13, Espoo, Finland, 1997.

Abstract

This paper explains how some properties of the Self-Organizing Maps (SOMs) can be exploited in the density models used in continuous density hidden Markov models (HMMs). The three main ideas are the suitable initialization of the centroids for the Gaussian mixtures, the smoothing of the HMM parameters, and the use of the topology for fast density approximations. The methods are tested here in the automatic speech recognition framework, where the task is to decode the phonetic transcription of spoken words by speaker dependent, but vocabulary independent, phoneme models. The results show that the average number of final recognition errors is over 15 % smaller if the traditional K-means based initialization is substituted by the SOM. The method described for fast SOM density approximation reduces the total recognition time by over 40 % for the current online system compared to the default, which uses independent complete searches for the best-matching units.

Entire paper (58 kB)


Using Self-Organizing Maps and Learning Vector Quantization for Mixture Density Hidden Markov Models

Author

Mikko Kurimo

Published

PhD thesis, Helsinki University of Technology, Espoo, Finland, 1997.
Faculty of Information Technology in the Department of Computer Science
Supervisor: Professor Teuvo Kohonen

Abstract

This work presents experiments to recognize pattern sequences using hidden Markov models (HMMs). The pattern sequences in the experiments are computed from speech signals and the recognition task is to decode the corresponding phoneme sequences. The training of the HMMs of the phonemes using the collected speech samples is a difficult task because of the natural variation in the speech. Two neural computing paradigms, the Self-Organizing Map (SOM) and the Learning Vector Quantization (LVQ) are used in the experiments to improve the recognition performance of the models.

An HMM consists of sequential states which are trained to model the feature changes in the signal produced during the modeled process. The output densities applied in this work are mixtures of Gaussian density functions. SOMs are applied to initialize and train the mixtures to give a smooth and faithful representation of the feature vector space defined by the corresponding training samples. The SOM maps similar feature vectors to nearby units, which is exploited here in experiments to improve the recognition speed of the system.

LVQ provides simple but efficient stochastic learning algorithms to improve the classification accuracy in pattern recognition problems. Here, LVQ is applied to develop an iterative training method for the mixture density HMMs, which increases both the modeling accuracy of the states and the discrimination between the models of different phonemes. Experiments are also made with LVQ based corrective tuning methods for the mixture density HMMs, which aim at improving the models by learning from the observed recognition errors in the training samples.

The suggested HMM training methods are tested using the Finnish speech database collected in the Neural Networks Research Centre at Helsinki University of Technology. Statistically significant improvements compared to the best conventional HMM training methods are obtained using the speaker dependent but vocabulary independent phoneme models. The decrease in the average number of phoneme recognition errors for the tested speakers has been around 10 percent in the applied test material.

Entire thesis (189 kB) The www page


Speech Recognition

Authors

Mikko Kurimo and Panu Somervuo

Published

In Triennial Report 1994 - 1996, Neural Networks Research Center & Laboratory of Computer and Information Science, Helsinki University of Technology, pages 48-54, March 1997.

Abstract

This report includes a brief description of the most important projects on this topic.

Entire report (74 kB)


Using SOM and LVQ for HMM Training

Author

Mikko Kurimo

Published

In Triennial Report 1994 - 1996, Neural Networks Research Center & Laboratory of Computer and Information Science, Helsinki University of Technology, pages 55-60, March 1997.

Abstract

This report includes a brief description of the most important projects on this topic.

Entire report (80 kB)


Training Mixture Density HMMs with SOM and LVQ

Author

Mikko Kurimo

Published

Helsinki University of Technology, Laboratory of Computer and Information Science, Report A43, January 1997, 26 pages.

Abstract

The objective of this paper is to present experiments and discussions of how some neural network algorithms can help phoneme recognition with mixture density hidden Markov models (MDHMMs). In MDHMMs the modeling of the stochastic observation processes associated with the states is based on estimating the probability density function of the short-time observations in each state as a mixture of Gaussian densities. The Learning Vector Quantization (LVQ) is used to increase the discrimination between different phoneme models both during the initialization of the Gaussian codebooks and during the actual MDHMM training. The Self-Organizing Map (SOM) is applied to provide a suitably smoothed mapping of the training vectors to accelerate the convergence of the actual training. The obtained codebook topology can also be exploited in the recognition phase to speed up the calculations approximating the observation probabilities. The experiments with LVQ and SOMs show reductions both in the average phoneme recognition error rate and in the computational load compared to the maximum likelihood training and the Generalized Probabilistic Descent (GPD). The lowest final error rate, however, is obtained by using several training algorithms successively. Additional reductions of about 40 % in the error rate relative to the online system are obtained by using the same training methods, but with more advanced and higher-dimensional feature vectors.

Entire report (116 kB)


Using the Self-Organizing Map to Speed up the Probability Density Estimation for Speech Recognition with mixture density HMMs

Authors

Mikko Kurimo and Panu Somervuo

Published

In Proceedings of the International Conference on Spoken Language Processing, ICSLP'96, pages 358-361, Philadelphia, PA, USA, 1996.

Abstract

This paper presents methods to improve the probability density estimation in hidden Markov models for phoneme recognition by exploiting the Self-Organizing Map (SOM) algorithm. The advantage of using the SOM is based on the approximative topology created between the mixture densities by training the Gaussian mean vectors, used as the kernel centers, with the SOM algorithm. The topology makes the neighboring mixtures respond strongly to the same inputs, and so most of the nearest mixtures used to approximate the current observation probability will be found in the topological neighborhood of the "winner" mixture. Knowledge about the previous winners is also used to speed up the search for the new winners. Tree-search SOMs and segmental SOM training are studied, aiming at a faster search and suitability for HMM training. The framework for the presented experiments includes mel-cepstrum features and phoneme-wise tied mixture density HMMs.
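The neighborhood-restricted winner search described above can be sketched as follows. The map size, search radius, and the function name `winner_near` are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
grid_h, grid_w, dim = 8, 8, 4
som = rng.normal(size=(grid_h, grid_w, dim))  # stand-in for trained mean vectors

def winner_near(x, prev, radius=2):
    """Approximate winner search: because the SOM preserves topology and
    consecutive speech frames are similar, only the neighborhood of the
    previous winner is examined instead of the full codebook."""
    r0, c0 = prev
    rows = range(max(0, r0 - radius), min(grid_h, r0 + radius + 1))
    cols = range(max(0, c0 - radius), min(grid_w, c0 + radius + 1))
    best, best_d = prev, np.inf
    for r in rows:
        for c in cols:
            d = np.linalg.norm(som[r, c] - x)
            if d < best_d:
                best, best_d = (r, c), d
    return best

x = som[3, 4] + 0.01          # a frame very close to the unit at (3, 4)
w = winner_near(x, prev=(3, 3))
```

With a radius of 2 on an 8x8 map, each frame examines at most 25 units instead of 64, and the saving grows with the map size.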

Entire paper (61 kB)


Segmental LVQ3 Training for Phoneme-wise Tied Mixture Density HMMs

Author

Mikko Kurimo

Published

In Proceedings of the European Signal Processing Conference, EUSIPCO'96, pages 1599-1602, Trieste, Italy, 1996.

Abstract

This work presents training methods and recognition experiments for phoneme-wise tied mixture densities in hidden Markov models (HMM). The system trains speaker dependent, but vocabulary independent, phoneme models for the recognition of Finnish words. The Learning Vector Quantization (LVQ) methods are applied to increase the discrimination between the phoneme models. A segmental LVQ3 training is proposed to substitute the LVQ2 based corrective tuning as a parameter estimation method. The experiments indicate that the new method can provide comparable recognition accuracy, but with less training and more robustness with respect to the initial models. Experiments to scale up the current system by introducing context vectors and larger mixture pools show up to a 40 % reduction of recognition errors compared to the earlier results in [kurimo94.nnsp].

Entire paper (60 kB) Slides (124 kB)


Hybrid training method for tied mixture density hidden Markov models using Learning Vector Quantization and Viterbi estimation

Author

Mikko Kurimo

Published

In John Vlontzos, Jenq-Neng Hwang, Elizabeth Wilson, editors, Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, NNSP'94, pages 362-371, Ermioni, Greece, 1994.

Abstract

In this work the output density functions of hidden Markov models are phoneme-wise tied mixture Gaussians. For training these tied mixture density HMMs, modified versions of the Viterbi training and LVQ based corrective tuning are described. The initialization of the mean vectors of the mixture Gaussians is performed by first composing small Self-Organizing Maps representing each phoneme and then combining them into a single large codebook to be trained by Learning Vector Quantization (LVQ). The experiments on the proposed training methods are accomplished using a speech recognition system for Finnish phoneme sequences. Compared to the corresponding continuous density and semi-continuous HMMs in [kurimo94.issipnn] and [kurimo93.esp] with respect to the number of parameters, the recognition time and the average error rate, the performance of the phoneme-wise tied mixture HMMs is superior.

Entire paper (46 kB)


Corrective Tuning by Applying LVQ for Continuous Density and Semi-continuous Markov Models

Author

Mikko Kurimo

Published

In Proceedings of International Symposium on Speech, Image Processing and Neural Networks, ISSIPNN'94, pages 718-721, Hong Kong, 1994.

Abstract

In this work the objective is to increase the accuracy of speaker dependent phonetic transcription of spoken utterances using continuous density and semi-continuous HMMs. Experiments with LVQ based corrective tuning indicate that the average recognition error rate can be decreased by about 5% -- 10%. Experiments are also made to increase the efficiency of the Viterbi decoding by a discriminative approximation of the output probabilities of the states in the Markov models. Using only a few of the nearest components of the mixture density functions instead of every component decreases both the recognition error rate (5% -- 10% for CDHMMs) and the execution time (about 50% for SCHMMs). The lowest average error rates achieved were about 5.6%.
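The few-nearest-components approximation of the mixture output density can be sketched like this. Spherical unit-variance Gaussians are assumed for simplicity, whereas the actual system uses multivariate mixture densities over cepstral features:

```python
import numpy as np

def mixture_loglik(x, means, weights, var=1.0, k=None):
    """Observation log-likelihood of a spherical Gaussian mixture, optionally
    approximated with only the k nearest components (k=None uses them all)."""
    d2 = np.sum((means - x) ** 2, axis=1)
    if k is not None:
        idx = np.argsort(d2)[:k]           # keep the k closest components
        d2, weights = d2[idx], weights[idx]
    dim = means.shape[1]
    log_comp = np.log(weights) - 0.5 * (d2 / var + dim * np.log(2 * np.pi * var))
    return np.logaddexp.reduce(log_comp)   # log of the summed component densities

rng = np.random.default_rng(4)
means = rng.normal(size=(25, 3))
weights = np.full(25, 1 / 25)
x = means[7] + 0.05                        # observation near component 7
full = mixture_loglik(x, means, weights)
approx = mixture_loglik(x, means, weights, k=3)
```

Dropping the distant components only discards terms that are exponentially small, so the approximation is a tight lower bound while the per-observation cost falls from all components to the k nearest.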

Entire paper (31 kB)


Application of Learning Vector Quantization and Self-Organizing Maps for training continuous density and semi-continuous Markov models

Author

Mikko Kurimo

Published

Licentiate's thesis, Helsinki University of Technology, Espoo, Finland, February 1994.
Faculty of Information Technology in the Department of Computer Science
Supervisor: Professor Teuvo Kohonen

Abstract

Experiments on variations of continuous density and semi-continuous hidden Markov models (CDHMM and SCHMM) as part of a speech recognition system are performed. The work is oriented toward testing different combinations of the neural network methods developed in the Laboratory of Computer and Information Science at Helsinki University of Technology and more conventional statistical parameter estimation methods based on the maximum likelihood (ML) principle.

Self-Organizing Maps (SOM) and Learning Vector Quantization (LVQ) are applied to the initialization of the mean vectors of the mixture Gaussian densities for CDHMMs and SCHMMs to reduce the amount of required ML estimation and to achieve more discriminative phoneme models. Experiments are also made with an LVQ-based corrective tuning method by which the HMMs can be further enhanced for lower recognition error rates.

Some approximations for the computationally complex determination of the continuous output densities for the CDHMMs and SCHMMs are also tested. The suggested approximations reduce the number of covariance parameters in the multivariate mixture Gaussian densities and the number of mixtures actually used in the observation probability computations. The lowest average phoneme recognition error rates achieved by the novel combinations of training methods were about 5.6 %.

Entire thesis (316 kB)


Using LVQ to Enhance Semi-continuous Hidden Markov Models for Phonemes

Author

Mikko Kurimo

Published

In Proceedings of 3rd European Conference on Speech Communication and Technology, EUROSPEECH'93, pages 1731-1734, Berlin, Germany, 1993.

Abstract

Experiments are made to enhance the discrimination ability of the SCHMMs by applying Learning Vector Quantization. The SCHMMs are used for the modeling of phonemes in a speaker-dependent speech recognition application to create the phonetic transcriptions of spoken utterances. The probability density functions for the cepstral feature vectors produced in each state of each model are modeled by mixtures of multivariate Gaussian density functions. The mean vectors of the Gaussian densities are chosen by clustering the feature vectors of the training samples using the Self-Organizing Map (SOM). Then the Gaussians are modified to correspond better to the Bayesian decision surfaces between phonemes by tuning the mean vectors with the LVQ. The experiments indicate that by this careful placement of the mean vectors the recognition error rates for the SCHMMs decrease significantly. LVQ algorithms can also be successfully applied after Baum-Welch or Viterbi training to slightly modify the Gaussians using training samples which would otherwise be incorrectly recognized. This kind of error corrective tuning also eliminates some of the recognition errors in the test data.

Keywords

HMMs, LVQ, SOM, semi-continuous

Entire paper (52 kB)


Application of Self-Organizing Maps and LVQ in training continuous density hidden Markov models for phonemes

Authors

Mikko Kurimo and Kari Torkkola

Published

In Proceedings of the International Conference on Spoken Language Processing, ICSLP'92, pages 543-546, Banff, Canada, 1992.

Abstract

We present experiments in using neural network based methods to initialize continuous observation density hidden Markov models (CDHMMs). Proper initialization provides an easy way to avoid an excessive number of iterations when maximum likelihood algorithms are used to estimate the parameters of CDHMMs. This is important in, for example, phoneme based automatic speech recognition, where the output density functions of the states of HMMs are complex and a lot of training data must be used. In our work CDHMMs are used as phoneme models in the task of transcribing speech into phoneme sequences. The probability density function of the output distribution for a state is approximated by a mixture of a large number of multivariate Gaussian density functions (typically 25). We present experiments on initializing the means of the mixture Gaussians by Self-Organizing Maps (SOMs) and Learning Vector Quantization (LVQ). The results of the experiments indicate that initialization by SOMs speeds up the convergence in ML parameter estimation when the error rate is used as a measure. The same applies to LVQ, especially combined with the segmental K-means algorithm.

Entire paper (45 kB)


Combining LVQ with continuous density hidden Markov models in speech recognition

Authors

Mikko Kurimo and Kari Torkkola

Published

In Proceedings of the SPIE's Conference on Neural and Stochastic Methods in Image and Signal Processing, pages 726-734, San Diego, USA, 1992.

Abstract

We propose the use of Self-Organizing Maps (SOMs) and Learning Vector Quantization (LVQ) as an initialization method for the training of continuous observation density hidden Markov models (CDHMMs). We apply CDHMMs to model phonemes in the transcription of speech into phoneme sequences. The Baum-Welch maximum likelihood estimation method is very sensitive to the initial parameter values if the observation densities are represented by mixtures of many Gaussian density functions. We suggest that the training of CDHMMs be done in two phases. First, the vector quantization methods are applied to find suitable placements for the means of the Gaussian density functions to represent the observed training data. The maximum likelihood estimation is then used to find the mixture weights and state transition probabilities and to re-estimate the Gaussians to get the best possible models. The result of initializing the means of the distributions by SOMs or LVQ is that good recognition results can be achieved using essentially fewer Baum-Welch iterations than are needed with random initial values. Also in the segmental K-means algorithm the number of iterations can be markedly reduced with a suitable initialization. We furthermore experiment with enhancing the discriminatory power of the phoneme models by adaptively training the state output distributions using the LVQ algorithm.
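The two-phase idea, placing the Gaussian means by vector quantization before any maximum likelihood estimation, can be sketched with a plain competitive-learning stand-in. This is not the actual SOM/LVQ procedure of the paper; the data, sizes, and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def place_means(data, n_means, epochs=5, alpha=0.1):
    """Phase 1 of the two-phase scheme: vector quantization places the
    Gaussian mean vectors on the training data before ML estimation
    (a simple competitive-learning stand-in for the SOM/LVQ step)."""
    means = data[rng.choice(len(data), n_means, replace=False)].copy()
    for _ in range(epochs):
        for x in data:
            i = np.argmin(np.linalg.norm(means - x, axis=1))
            means[i] += alpha * (x - means[i])   # move the winner toward x
    return means

# Two well-separated toy clusters; the means should land near both of them,
# giving Baum-Welch a good starting point for the mixture densities.
data = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
means = place_means(data, n_means=2)
```

Phase 2 would then run Baum-Welch (or segmental K-means) from these means to estimate the mixture weights, covariances and transition probabilities.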

Entire paper (48 kB)


Training Continuous Density Hidden Markov Models in Association with Self-Organizing Maps and LVQ

Authors

Mikko Kurimo and Kari Torkkola

Published

In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, pages 174-183, Copenhagen, Denmark, 1992.

Abstract

We propose a novel initialization method for continuous observation density hidden Markov models (CDHMMs) that is based on Self-Organizing Maps (SOMs) and Learning Vector Quantization (LVQ). Our framework is to transcribe speech into phoneme sequences using CDHMMs as phoneme models. When numerous mixtures of, for example, Gaussian density functions are used to model the observation distributions of CDHMMs, good initial values are necessary for the Baum-Welch estimation to converge satisfactorily. We experiment with rapidly constructing good initial values by SOMs, and with enhancing the discriminatory power of the phoneme models by adaptively training the state output distributions using the LVQ algorithm. Experiments indicate that an improvement over the pure Baum-Welch and the segmental K-means procedures can be obtained using the proposed method.

Entire paper (48 kB)


Combinations of Adaptive Vector Quantization Methods and Hidden Markov Models in Speech Recognition (original title in Finnish: Adaptiivisten vektorikvantisointimenetelmien ja kätkettyjen Markov-mallien kombinaatioita puheentunnistuksessa)

Author

Mikko Kurimo

Published

Master's thesis, Helsinki University of Technology, February 1992.

Abstract

This thesis investigates combining continuous observation density hidden Markov models with a speech recognition system based on adaptive vector quantization. The research is part of the speech recognition work carried out in the laboratory of information technology at Helsinki University of Technology under Research Professor Teuvo Kohonen.

Phonemes are modeled as consecutive states of the speech signal. The states are defined by the probability distributions of the feature vectors, formed from instantaneous cepstral coefficients computed from the speech signal, and of the transitions between the states. The feature vector distribution is approximated by a combination of many partially overlapping normal distributions. The models are trained on the collected speech material, using vector quantization methods for the optimal placement of the peaks of the normal distributions and iterative Baum-Welch estimation for determining the other parameters.

The experiments show that the probability distributions of the feature vectors in the state model of the speech signal are complex and their approximations require many parameters. The best strategy proved to be concentrating more on locating the peaks of the distributions than on describing their shape. Using the Self-Organizing Map and Learning Vector Quantization alongside Baum-Welch estimation in training the continuous Markov models produced better recognition results than Baum-Welch training alone. The recognition results obtained were more accurate than those of the discrete observation density Markov models currently used in the speech recognition system, although not fully comparable.

Entire thesis (281 kB)


Improving Short-Time Speech Frame Recognition Results by Using Context

Authors

Kari Torkkola and Mikko Kokkonen and Mikko Kurimo and Pekka Utela

Published

In Proceedings of the 2nd European Conference on Speech Communication and Technology, EUROSPEECH'91, pages 793-796, Milano, Italy, 1991.

Abstract

This paper focuses on comparing three approaches to improving the accuracy of classifying short-time speech frames into phoneme classes by taking into account the classifications of nearby frames, which are also individually classified. We investigate whether this improvement affects the accuracy of transcribing speech into phoneme sequences using two different decoding schemes, one based on simple durational rules and the other on hidden Markov models (HMMs). The experiments indicate that the recognition accuracies can indeed be improved significantly by taking the local context into account.

Entire paper (36 kB)


Status report of the Finnish phonetic typewriter project

Authors

Kari Torkkola, Jari Kangas, Pekka Utela, Sami Kaski, Mikko Kokkonen, Mikko Kurimo, and Teuvo Kohonen

Published

In Proceedings of the International Conference on Artificial Neural Networks (ICANN-91), pages 771-776, Espoo, Finland, June 24-28 1991. North-Holland.

Abstract

In connection with a speech recognizer whose aim is to produce phonemic transcriptions of arbitrary spoken utterances, we investigate the combined effect of several improvements at different stages of phoneme recognition.

The core of the basic recognition system is Learning Vector Quantization (LVQ1) [1]. This algorithm was originally used to classify FFT based short-time feature vectors into phonemic classes. The phonemic decoding phase was earlier based on simple durational rules [2] [3].

At the feature level, we now study the effect of using mel-scale cepstral features and of concatenating several consecutive feature vectors to include context. At the output of the vector quantization, a comparison of three approaches to taking the classifications of feature vectors in local context into account is presented. The rule based phonemic decoding is compared to decoding employing hidden Markov models (HMMs). As earlier, an optional grammatical post-correction method (DEC) is applied.

Experiments conducted with three male speakers indicate that it is possible to significantly increase the phonemic transcription accuracy of the previous configuration. By using appropriately liftered cepstra, concatenating three adjacent feature vectors, and using HMM based phonemic decoding, the error rate can be decreased from 14.0 % to 5.8 %.

Entire paper (40 kB)


Mikko Kurimo <Mikko.Kurimo@hut.fi>
Last modified: Mon Jul 9 13:27:35 EEST 2001