Notes on statistical accuracy

Next: Relations and differences between Up: Self-organizing maps Previous: Some variants

Notes on statistical accuracy

The question of how large the SOM grid should be has a simple answer if there is an unlimited number of learning samples available: as large as the available computational resources allow. In Publication 1 this is essentially the case.

If the data set consists of a finite sample from a larger data set, then the question of the statistical accuracy of the map must be considered, i.e., whether the correct positions of the reference vectors can be estimated accurately based on the available data.

The question is especially important if the map is to be used later for displaying new data items. If the new items can be assumed to follow the same distribution as the items that were present while learning, and if the map is statistically accurate, then the new items can be displayed as accurately as the old ones. In high-dimensional spaces it is in general very difficult to achieve sufficient statistical accuracy since the samples are necessarily sparse. By fitting lower-dimensional structures like the SOM grid to the data set in an unsupervised manner, however, there is some hope of a generalizability of the results.

A possible rule-of-thumb for choosing a suitable size for the map might be that the number of free parameters should be at most, say, a certain fraction of the number of values used for estimating them. It seems, however, that it is difficult to compute the number of free parameters in the SOM, since the neighborhood kernel restricts the placement of the reference vectors. A rule-of-thumb that is based directly on the total number of parameters in the SOM may then be overly conservative.

If the goal of data analysis is merely to visualize a given data set for exploratory purposes as in Publication 2, and if no new samples need to be mapped later, then the map can be taught to represent the given input data set with a sufficient statistical accuracy by sampling with replacement even from a small data set during the computation of the map. Then, of course, there is no guarantee that the map will generalize well to new data. The SOM is useful, however, even for maps having more units than there are data items. Even if the map passed through all data items, it would do so in an ordered manner and the distances between close-by data item pairs could be visualized on the map displays using the methods discussed in Section 6.4.2.

A brief introduction to one application of the SOM may serve to illustrate this point. A one-dimensional circular SOM can in effect be used for finding an approximate solution to the traveling salesman problem [Angéniol et al., 1988, Budinich, 1996]: find the shortest route that passes through all of the data items once. The map forms a path that follows the data distribution, i.e., extends close to each data item. A two-dimensional map follows the data in a similar manner. It forms an ``elastic network'' that can follow the data the more closely the more there are map units. (Note, however, that the ``smoothness'' or regularity of the mapping can be controlled independently of the map size by changing the width of the neighborhood kernel.)

Setting the number of nodes approximately equal to the number of the input samples seems to be a useful rule-of-thumb for many applications when the data sets are relatively small. More extensive empirical investigations should be done, however, for example by cross-validating the results utilizing either the cost function (Eq. 7) or the measures presented in Publication 7.

Next: Relations and differences between Up: Self-organizing maps Previous: Some variants

Sami Kaski
Mon Mar 31 23:43:35 EET DST 1997