
Experiment settings

The task in each experiment is to reconstruct missing values, and the mean square error of the reconstructions is used to compare the methods. Two data sets are used: speech data and the Boston housing data. The data collection mechanism is assumed to be ignorable [3]. A mechanism is nonignorable, for instance, when out-of-scale measurements are marked as missing.
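The comparison criterion can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function name `reconstruction_mse` and the toy values are hypothetical, and it assumes that the error is averaged only over the entries that were marked as missing, which the text does not state explicitly.

```python
def reconstruction_mse(true_rows, reconstructed_rows, missing_mask):
    """Mean square error over the entries marked as missing (assumed convention)."""
    squared_errors = [
        (t - r) ** 2
        for t_row, r_row, m_row in zip(true_rows, reconstructed_rows, missing_mask)
        for t, r, m in zip(t_row, r_row, m_row)
        if m  # only entries hidden from the model contribute
    ]
    if not squared_errors:
        raise ValueError("mask marks no entries as missing")
    return sum(squared_errors) / len(squared_errors)


# Toy example: two 3-dimensional vectors with one hidden value each.
true_rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
reconstructed_rows = [[1.0, 2.5, 3.0], [3.0, 5.0, 6.0]]
missing_mask = [[False, True, False], [True, False, False]]

mse = reconstruction_mse(true_rows, reconstructed_rows, missing_mask)
```

Restricting the average to the hidden entries keeps the observed values, which any method can copy unchanged, from diluting the differences between methods.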

The first data set consists of spectrograms of real-world Finnish speech from several speakers. Short term spectra are windowed to 30 dimensions with a standard preprocessing procedure for speech recognition. It is clear that a dynamic model [12] would give better reconstructions, but in this case the temporal information is left out to ease the comparison of the models. Half of the roughly 5000 samples are used as test data with some missing values. Missing values are set in four different ways to measure different properties of the algorithms (Figure 3):

1. 38 percent of the values are set to be missing randomly in 4 × 4 patches (Figure 4). This is the main setting, since it is the most realistic.
2. 10 percent of the values are set to be missing at random, independently of any neighbours. This is an easier setting, since simple smoothing using nearby values would give fine reconstructions.
3. The training and testing sets are randomly permuted before missing values are set in 4 × 4 patches as in setting 1. The training set now contains vectors more similar to those in the test set.
4. The training and testing sets are permuted and 10 percent of the values are set to be missing independently of any neighbours.
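The two masking schemes in the list above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the patches are placed, so the sketch drops 4 × 4 patches at uniformly random positions (overlaps allowed) until the target fraction is reached. The function names, the random seed, and the 2500 × 30 test-set shape (half of roughly 5000 samples) are hypothetical.

```python
import random


def independent_mask(n_rows, n_cols, fraction, rng):
    """Mark each value as missing independently with the given probability."""
    return [[rng.random() < fraction for _ in range(n_cols)] for _ in range(n_rows)]


def patch_mask(n_rows, n_cols, fraction, patch, rng):
    """Drop patch x patch blocks at random positions until the target is reached.

    Assumes fraction < 1 and that the data is at least one patch in size.
    The final block may overshoot the target by fewer than patch * patch values.
    """
    mask = [[False] * n_cols for _ in range(n_rows)]
    target = int(fraction * n_rows * n_cols)
    n_missing = 0
    while n_missing < target:
        r0 = rng.randrange(n_rows - patch + 1)
        c0 = rng.randrange(n_cols - patch + 1)
        for r in range(r0, r0 + patch):
            for c in range(c0, c0 + patch):
                if not mask[r][c]:  # overlapping patches are counted once
                    mask[r][c] = True
                    n_missing += 1
    return mask


def missing_fraction(mask):
    """Fraction of entries marked as missing."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


rng = random.Random(0)                           # arbitrary seed
patches = patch_mask(2500, 30, 0.38, 4, rng)     # as in settings 1 and 3
scattered = independent_mask(2500, 30, 0.10, rng)  # as in settings 2 and 4

# Settings 3 and 4 additionally permute the samples before the split,
# for example by shuffling a list of sample indices with rng.shuffle.
```

With consecutive rows taken as consecutive frames, a 4 × 4 patch hides the same frequency band in neighbouring frames, so smoothing over nearby values helps much less than with independently scattered missing values.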


  
Figure 3: Four different experiment settings with the speech data try to measure different properties of the algorithms.

The second data set is the Boston housing data, which is publicly available at [2]. It concerns housing values in suburbs of Boston. The data set contains 506 vectors of 13 dimensions, excluding one binary attribute. Four of the 13 values are common to all suburbs within a town, and each town consists of 1 to 50 suburbs. 70% of the data vectors are used as training data and the rest as testing data, which has 10% of its values missing at random.
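The split sizes for the Boston housing data can be sketched as follows. This is a hedged illustration only: the text does not say whether the vectors are shuffled before the split or how the 70% is rounded, so the random permutation, the rounding down, the seed, and the name `split_indices` are all assumptions.

```python
import random


def split_indices(n_samples, train_fraction, rng):
    """Randomly partition sample indices into training and testing parts."""
    indices = list(range(n_samples))
    rng.shuffle(indices)  # assumed: the split is random
    n_train = int(train_fraction * n_samples)  # assumed: rounded down
    return indices[:n_train], indices[n_train:]


rng = random.Random(0)  # arbitrary seed
train_idx, test_idx = split_indices(506, 0.7, rng)

# Hide each of the 13 values of a test vector independently with probability 0.10.
test_mask = [[rng.random() < 0.10 for _ in range(13)] for _ in test_idx]
```

Under these assumptions the split gives 354 training vectors and 152 testing vectors, and only the testing vectors receive missing values.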


Tapani Raiko
2001-09-26