 
 
 
 
 
 
 
  
 Next: Properties of the data
 Up: Speech data
 Previous: Speech data
     Contents 
The digitised speech samples were turned into observation vectors
for the algorithms by the following preprocessing steps:
- The signal was high-pass filtered to emphasise the important
  higher frequencies.  This was done with a first order FIR filter
  having a transfer function of the form $H(z) = 1 - a z^{-1}$.
- A 256-point Fourier transform with Hamming windowing was
  calculated for short overlapping segments.  Consecutive segments
  overlapped by half of their length.
- The frequency axis was mapped to the Mel scale to emphasise the
  features that matter most for understanding speech.  This
  gave a 30-component vector for each segment.
- The logarithms of the energies on the Mel scale were used as the
  observations.
These steps form a rather standard preprocessing procedure for
speech recognition [54,33].
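The four steps above can be sketched in code roughly as follows.  This is only an illustrative sketch, not the implementation used for the experiments: the sampling rate, the pre-emphasis coefficient `alpha`, the use of triangular Mel filters, the analytic Mel formula and all function names are assumptions made for the example.  Only the 256-point transform, the Hamming window, the half-segment overlap and the 30 output components come from the text.

```python
import numpy as np

def hz_to_mel(f_hz):
    # Common analytic approximation of the Mel scale (an assumption here).
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters (assumed shape) with edges equally spaced in Mel
    # between 0 Hz and the Nyquist frequency.
    edges_hz = mel_to_hz(
        np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2))
    bin_hz = np.arange(n_fft // 2 + 1) * sample_rate / n_fft
    bank = np.zeros((n_filters, bin_hz.size))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i:i + 3]
        rising = (bin_hz - lo) / (mid - lo)
        falling = (hi - bin_hz) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_mel_features(signal, sample_rate, alpha=0.95, n_fft=256, n_filters=30):
    signal = np.asarray(signal, dtype=float)
    if signal.size < n_fft:
        raise ValueError("signal is shorter than one analysis segment")
    # Step 1: first-order FIR high-pass filter, y[n] = x[n] - alpha * x[n-1].
    # The value of alpha is hypothetical.
    emphasised = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Step 2: Hamming-windowed segments overlapping by half, then the FFT.
    hop = n_fft // 2
    n_frames = 1 + (emphasised.size - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasised[idx] * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    # Step 3: pool the spectral energy into n_filters Mel-scale bands.
    mel_energy = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    # Step 4: logarithm of the energies; the small constant avoids log(0).
    return np.log(mel_energy + 1e-10)
```

Calling `log_mel_features(x, 16000)` on a one-dimensional array `x` returns one 30-component row per segment.  The 16 kHz sampling rate in this call is again only an example value.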
The Mel scale of frequencies has been designed to model the frequency
response of the human ear.  The scale is constructed by asking a
naïve listener to indicate when she perceives a sound to have double
or half of the frequency of a reference tone.  The resulting scale is
close to linear at frequencies below 1000 Hz and nearly logarithmic
above that [54].
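The shape of the scale can be illustrated with a widely used analytic approximation, $m = 2595 \log_{10}(1 + f/700)$.  This formula is an assumption made for the example and is not necessarily the mapping used in the preprocessing described here; the function name is likewise hypothetical.

```python
import math

def hz_to_mel(f_hz):
    # Common analytic approximation of the Mel scale (assumed, see above).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

if __name__ == "__main__":
    # Print the Mel value for each frequency as it doubles.
    for f_hz in (125, 250, 500, 1000, 2000, 4000, 8000):
        print(f"{f_hz:5d} Hz -> {hz_to_mel(f_hz):7.1f} mel")
```

Under this approximation 1000 Hz maps to roughly 1000 mel.  Well above 1000 Hz each doubling of the frequency adds a nearly constant number of mels, approaching $2595 \log_{10} 2 \approx 781$, which is the logarithmic behaviour described above.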
Figure 7.1 shows an example of what the
preprocessed data looks like.
Figure 7.1:
An example of the preprocessed spectrogram of a speech
      segment.  Time increases from left to right and frequency from
      bottom to top.  White areas correspond to low energy of the signal
      and dark areas to high energy.  The word in the segment is
      "JOHTOPÄÄTÖKSIÄ", meaning "conclusions".  Every letter in
      the written word corresponds to one phoneme in speech.  The
      silent areas in the middle correspond to the consonants t, p, t
      and k, thus revealing the segmentation of the utterance into
      phonemes.
Antti Honkela
2001-05-30