Feature extraction, the choice of suitable representations for the data items, is a key step in the analysis. All unsupervised methods merely illustrate some structures in the data set, and the structures are ultimately determined by the features chosen to represent the data items. The usefulness of different preprocessing methods depends strongly on the application. Therefore, a comprehensive treatment would be an immense task, and only the basic approaches used in the case studies can be introduced here.
If there exists some a priori knowledge of the process that generated the data, this information should of course be used in choosing the features. This is the case with the EEG signal (Publication 1): it is known that the frequency content of the signal depends, for instance, on the vigilance of the individual. The better the features can be tailored to reflect the requirements of the task, the better the results will be. In general, however, such tailoring requires considerable expertise both in the application area and in the data analysis methodology. Some experiments with a method that is potentially useful as an automatic feature extraction stage are reported in Publication 8.
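As a toy illustration of such knowledge-driven feature extraction (a minimal sketch, not the actual preprocessing of Publication 1), one may describe a signal by its relative spectral power in the conventional EEG frequency bands. The band edges below follow the standard delta/theta/alpha/beta convention; the sampling rate and the synthetic test signal are assumptions for the example.

```python
import numpy as np

def band_power_features(signal, fs, bands):
    """Relative spectral power in each frequency band (a common EEG feature)."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2          # power spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)     # bin frequencies in Hz
    total = spectrum.sum()
    return np.array([spectrum[(freqs >= lo) & (freqs < hi)].sum() / total
                     for lo, hi in bands])

# Standard EEG bands (delta, theta, alpha, beta), in Hz.
BANDS = [(0.5, 4), (4, 8), (8, 13), (13, 30)]

fs = 128                             # hypothetical sampling rate
t = np.arange(0, 4, 1.0 / fs)
eeg = np.sin(2 * np.pi * 10 * t)     # synthetic 10 Hz "alpha" oscillation
features = band_power_features(eeg, fs, BANDS)
# For this signal nearly all power falls in the alpha band (8-13 Hz).
```

Such a fixed-length feature vector, rather than the raw signal, would then be given to the unsupervised method.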
Sophisticated feature extraction methods are also required when mining symbolic information such as text, for which no evident, automatically available semantic features exist. Similarity relations computed from the forms of the words would convey almost no information about their meaning and use. Contextual information can, however, be used for constructing useful similarity relationships for textual data [Ritter and Kohonen, 1989]. A system that represents words by representations of their ``average context'', and documents by suitably processed word category histograms, is discussed in Publications 3, 4, 5, and 6.
Besides the choice of the features, their scaling must also be chosen before applying the SOM algorithm. If there exists knowledge of the relative importance of the components of the data items, the corresponding dimensions of the input space can be scaled according to this information. The importance may in some applications be estimated automatically by, for example, an entropy-based criterion (Publication 4). If no such criterion is available, the variances of all components may be scaled to an equal value, as was done in Publication 2.
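Scaling all components to equal variance amounts to dividing each component by its standard deviation, as in the following sketch (the data here is synthetic; constant components are left unchanged by assumption, to avoid division by zero).

```python
import numpy as np

def scale_to_unit_variance(X):
    """Scale each component (column) of the data matrix to unit variance."""
    std = X.std(axis=0)
    std[std == 0] = 1.0        # leave constant components unchanged
    return X / std

rng = np.random.default_rng(0)
# Two components with wildly different scales; without rescaling, the
# second would dominate any Euclidean distance computation in the SOM.
X = rng.normal(scale=[1.0, 100.0], size=(1000, 2))
Xs = scale_to_unit_variance(X)
```

After this step both components contribute comparably to the distances on which the SOM algorithm is based.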
The importance of choosing a set of indicators that describes the phenomenon of interest and nothing else, and of scaling the indicators properly, cannot be overstressed. Even the best analysis methods cannot compensate for mistakes made at this stage.