Next: Comparison with other models Up: Speech data Previous: Preprocessing   Contents

Properties of the data set

One distinctive property of the data is that it is not very continuous. This is due to the poor frequency resolution of the Fourier transform, which was computed over relatively short time windows in the preprocessing.
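The link between window length and frequency resolution can be illustrated with a small sketch. The sampling rate and window lengths below are hypothetical values chosen for illustration only; the actual preprocessing parameters are not given in this section.

```python
import numpy as np


def frequency_resolution(sample_rate_hz, window_length):
    """Spacing in Hz between adjacent DFT bins for a window of the given length."""
    freqs = np.fft.rfftfreq(window_length, d=1.0 / sample_rate_hz)
    return freqs[1] - freqs[0]


# Hypothetical sampling rate, NOT taken from the source.
SAMPLE_RATE_HZ = 8000.0

for window_length in (64, 256, 1024):
    duration_ms = 1000.0 * window_length / SAMPLE_RATE_HZ
    resolution_hz = frequency_resolution(SAMPLE_RATE_HZ, window_length)
    print(f"{window_length:5d} samples = {duration_ms:6.1f} ms "
          f"-> bin spacing {resolution_hz:8.4f} Hz")
```

The bin spacing equals the sampling rate divided by the window length, so shortening the window coarsens the frequency axis in direct proportion.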

The same data set was used with the static NFA model in [25]. The part used in these experiments consisted of spectrograms of 24 individual words, spoken by 20 different speakers. The preprocessed data consisted of 2547 spectrogram vectors with 30 components.

Figure 7.2: The remaining energy of the speech data as a function of the number of extracted components using linear and nonlinear factor analysis.

To study the dimensionality of the data, linear and nonlinear factor analysis were applied to it. The results are shown in Figure 7.2. All the NFA experiments used an MLP network with 30 hidden neurons. The data manifold is clearly nonlinear: nonlinear factor analysis is able to explain the data equally well with fewer components than linear factor analysis. The difference is especially clear when the number of components is relatively small. Even though this analysis uses only static models, it can be used to estimate a lower bound for the number of continuous hidden states used in the experiments with dynamical models.
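A curve of the kind plotted in Figure 7.2 can be sketched for the linear case as follows. This is only an illustration under stated assumptions: PCA is used as a stand-in for linear factor analysis, "remaining energy" is taken to mean the fraction of variance left unexplained, and the data is synthetic with the same shape as the speech data (2547 vectors with 30 components). The nonlinear factor analysis itself is not reproduced here.

```python
import numpy as np


def remaining_energy(data, max_components):
    """Fraction of variance left unexplained by the best linear
    reconstruction using 1..max_components components (PCA stand-in)."""
    centered = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    energy = singular_values ** 2
    explained = np.cumsum(energy) / energy.sum()
    return 1.0 - explained[:max_components]


# Synthetic stand-in for the speech data (assumption, not the real data set):
# 5 latent factors passed through a nonlinearity, plus a little noise.
rng = np.random.default_rng(0)
latent = rng.standard_normal((2547, 5))
mixing = rng.standard_normal((5, 30))
data = np.tanh(latent @ mixing) + 0.05 * rng.standard_normal((2547, 30))

curve = remaining_energy(data, 30)
for n_components in (1, 2, 4, 8, 16, 30):
    print(f"{n_components:2d} components: remaining energy "
          f"{curve[n_components - 1]:.4f}")
```

Comparing such a linear curve with the corresponding curve of a nonlinear model shows how many components each needs to reach the same remaining energy.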

Figure 7.3: A short fragment of the data used in the experiment. The first subfigure shows the original data, the second shows the reconstruction from 8 nonlinear components and the last shows the reconstruction from 8 linear components. Both models were static and did not use any temporal information about the signals, so the results would have been exactly the same for any permutation of the data vectors.

Figure 7.4: Extracted NFA factors corresponding to the data fragment in Figure 7.3, rotated with linear ICA.

A small segment of the original data and its reconstructions from eight nonlinear and eight linear components are shown in Figure 7.3. The reconstructed spectrograms are somewhat smoother than the original ones. Still, all the discriminative features of the original spectrum are well preserved in the nonlinear reconstruction. This suggests that the dropped components mostly correspond to noise. The linear reconstruction is not as good, especially at the beginning of the fragment.
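The caption of Figure 7.3 notes that a static model gives the same results for any permutation of the data vectors. The sketch below checks this for a linear reconstruction from 8 components. It again uses PCA as a stand-in for linear factor analysis and random synthetic data, both of which are assumptions made for illustration; the function name `pca_reconstruct` is hypothetical.

```python
import numpy as np


def pca_reconstruct(data, n_components):
    """Project the data onto its top principal components and map it back."""
    mean = data.mean(axis=0)
    centered = data - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]  # shape: (n_components, dimension)
    return mean + (centered @ basis.T) @ basis


# Synthetic 30-dimensional data (assumption, not the real spectrograms).
rng = np.random.default_rng(1)
data = rng.standard_normal((200, 30)) @ rng.standard_normal((30, 30))

reconstruction = pca_reconstruct(data, 8)

# Shuffle the order of the data vectors and reconstruct again.
permutation = rng.permutation(len(data))
reconstruction_shuffled = pca_reconstruct(data[permutation], 8)

# Each vector gets the same reconstruction regardless of its position,
# up to floating-point rounding.
order_independent = np.allclose(reconstruction[permutation],
                                reconstruction_shuffled)
print("order independent:", order_independent)
```

The check passes because the model sees only the set of vectors, not their order; a dynamical model that uses temporal information would not have this property.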

The extracted nonlinear factors, rotated with linear ICA, are shown in Figure 7.4. They are rather smooth, so it seems plausible that dynamical models would be able to model the data better. The representation of the data given by the nonlinear factors is, however, somewhat difficult to interpret: it is hard to see how the individual factors affect the predicted outputs.

Antti Honkela 2001-05-30