Missing data.

A frequently occurring problem in applying statistical methods is that of missing data: some of the components of the data vectors are not available for all data items, or may not even be applicable or defined. Several simple (e.g., Dixon, 1979) and more complex (e.g., Dempster et al., 1977) approaches have been proposed for tackling this problem, which affects all of the clustering and projection methods alike.

In the case of the SOM, the problem of missing data can be treated as follows: when choosing the winning unit by Equation 5, the input vector can be compared with the reference vectors using only those components that are available in the input vector. Note that none of the reference vector components is missing. If only a small proportion of the components of the data vector is missing, the result of the comparison will be statistically fairly accurate. When the reference vectors are then adapted using Equation 6, only the components that are available in the input vector will be modified.
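
The treatment described above can be sketched in a few lines of Python. Equations 5 and 6 are not reproduced in this section, so the sketch assumes the usual SOM forms: a Euclidean nearest-neighbour winner search and an update that moves each reference vector toward the input by a factor h[i]. The function names, the use of None to mark a missing component, and the factor list h (standing in for the learning rate times the neighborhood function) are illustrative assumptions, not part of the original text.

```python
def masked_sq_distance(x, m):
    """Squared Euclidean distance using only the components present in x."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, m) if xi is not None)


def find_winner(x, reference_vectors):
    """Index of the reference vector closest to x on the available components."""
    return min(
        range(len(reference_vectors)),
        key=lambda i: masked_sq_distance(x, reference_vectors[i]),
    )


def update(x, reference_vectors, h):
    """Adapt the reference vectors in place, skipping missing components of x.

    h[i] is an assumed per-unit adaptation factor (learning rate times
    neighborhood value); how it is computed is outside this sketch.
    """
    for m, h_i in zip(reference_vectors, h):
        for k, xk in enumerate(x):
            if xk is not None:
                m[k] += h_i * (xk - m[k])


# Two units, one input vector whose second component is missing.
refs = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
x = [0.9, None, 0.8]
c = find_winner(x, refs)       # comparison ignores the missing component
update(x, refs, h=[0.0, 0.5])  # second component of every unit stays untouched
```

Because the reference vectors themselves are always complete, the masked distance is well defined for every unit, and a component that is missing from the input simply leaves the corresponding reference vector component where it was.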

It has been demonstrated that better results can be obtained with the approach described above than by discarding the data items from which components are missing [Samad and Harp, 1992]. However, for data items from which the majority of the indicators are missing it is not justifiable to assume that the winner selection is accurate. A reasonable compromise, used in Publication 2, is to discard data items with too many (exceeding a chosen proportion) missing values from the learning process. Even the discarded samples can, however, be tentatively displayed on the map after it has been organized.
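
The compromise of excluding items with too many missing values could look roughly like the following. The threshold value 0.5 and the function names are assumptions made for illustration; the text only says that a proportion is chosen, not which one. Discarded items are returned rather than dropped, so that they can still be placed on the map once it has been organized.

```python
def missing_fraction(x):
    """Proportion of components of x that are missing (marked by None)."""
    return sum(1 for xi in x if xi is None) / len(x)


def split_by_missingness(data, max_missing=0.5):
    """Separate items usable for learning from those with too many gaps.

    max_missing is a hypothetical threshold; items whose missing proportion
    exceeds it are kept aside instead of being used in training.
    """
    kept, discarded = [], []
    for x in data:
        if missing_fraction(x) <= max_missing:
            kept.append(x)
        else:
            discarded.append(x)
    return kept, discarded


data = [
    [1.0, 2.0, 3.0, 4.0],     # complete
    [1.0, None, None, None],  # three of four components missing
    [None, 2.0, 3.0, None],   # half of the components missing
]
kept, discarded = split_by_missingness(data, max_missing=0.5)
```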

Note: Although the SOM as such can be used to explore incomplete data sets, some preprocessing methods may have problems with missing components of the input data items. For example, normalization of the data vectors cannot be done in a straightforward manner. Normalization of the variance of each component separately is, in contrast, a viable operation even for incomplete data sets.
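
Normalizing the variance of each component separately remains feasible because the statistics of a component can be estimated from the items in which that component is present. The sketch below is one possible reading of that remark: it also subtracts the component mean, which is an added assumption, and it assumes that every component is observed in at least one item. Missing values are marked with None and left missing in the output.

```python
import math


def component_stats(data, k):
    """Mean and standard deviation of component k over the items that have it."""
    values = [x[k] for x in data if x[k] is not None]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)


def normalize_variance(data):
    """Scale each component to unit variance, leaving missing entries as None."""
    dim = len(data[0])
    stats = [component_stats(data, k) for k in range(dim)]
    result = []
    for x in data:
        row = []
        for k, xk in enumerate(x):
            if xk is None:
                row.append(None)
            else:
                mean, std = stats[k]
                # A constant component carries no information; map it to 0.
                row.append((xk - mean) / std if std > 0 else 0.0)
        result.append(row)
    return result


data = [[1.0, 10.0], [3.0, None], [None, 30.0]]
normalized = normalize_variance(data)
```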

Sami Kaski
Mon Mar 31 23:43:35 EET DST 1997