There exist several methods for quickly producing and visualizing simple summaries of data sets [Tukey, 1977]. For example, the so-called five-number summary, consisting of the smallest and largest data values, the median, and the first and third quartiles, can be visualized as a drawing in which each number corresponds to some constituent, such as the height of a box.
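As a minimal sketch, the five-number summary can be computed with the Python standard library. The function name `five_number_summary` and the sample values are illustrative assumptions. The quartiles are obtained here by linear interpolation (the `"inclusive"` method of `statistics.quantiles`); other quartile conventions exist and may give slightly different values.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum).

    Quartiles use linear interpolation between order statistics; this is
    one of several common conventions, chosen here only for illustration.
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


# Hypothetical one-dimensional data set.
summary = five_number_summary([7, 1, 3, 9, 5])
print(summary)
```

In a box-style drawing the two quartiles would typically delimit the box, the median would mark a line inside it, and the extremes would set the ends of the whiskers.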

Such simple methods are very useful for summarizing low-dimensional data sets, but as the dimensionality increases their ability to visualize interdimensional relations soon degrades.

In this section some methods for illustrating *structures*, i.e.,
multivariate relations between data items, in high-dimensional data
sets will be discussed. The treatment will be restricted to
methods that regard the inputs as metric vectors and that can be used
without making assumptions about the distribution of the data. It is
also assumed that no external information like class labels is
available on the data items. The illustrations will then be driven
solely by the actual structures in the data and not by prespecified
assumptions about the class structure. Although the analysis is
unsupervised, the possible class labels may be used *afterwards*
to aid in the interpretation of the results; then they do not
affect the structures that have been found.

The vectors in the input data set will be denoted by $\mathbf{x}_k$,
$k = 1, \ldots, N$. Here $\mathbf{x}_k \in \mathbb{R}^n$. In statistics it is
customary to call the components of the data vectors
*observations* recorded on *variables*. Here the mathematical
terminology will be preferred, however. The components may also be
called *features* as is customary in pattern recognition
literature.

In this section the emphasis will be on methods that illustrate
structures in given, prespecified data sets. It may be useful to note,
however, that in practical applications the *selection* and
*preprocessing* of the data may be even more important than the
choice of the analysis method. For example, changes in the relative
scales of the features have a drastic effect on the results of most of
the methods that will be presented: the larger the scale of a
component the more the component affects the result. It is, however,
very difficult to give general guidelines for the highly
application-specific task of preprocessing; the approaches used in some case
studies will be discussed in Section 7.1.
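The effect of feature scales can be illustrated with a small sketch, assuming that the Euclidean distance is used to compare data vectors. The data values, the units, and the helper names `nearest_neighbor` and `rescale_first` are hypothetical. Changing the unit of a single feature changes which item is considered closest to a query vector.

```python
import math


def nearest_neighbor(query, items):
    """Return the index of the item closest to query in Euclidean distance."""
    return min(range(len(items)), key=lambda i: math.dist(query, items[i]))


def rescale_first(vector, factor):
    """Multiply the first component of a two-component vector by factor."""
    return (vector[0] * factor, vector[1])


# Hypothetical data items: (height in metres, weight in kilograms).
items = [(1.60, 70.0), (1.95, 71.0)]
query = (1.62, 72.0)

# With height in metres, the weight difference dominates the distance.
before = nearest_neighbor(query, items)

# With height in centimetres, the height difference dominates instead.
items_cm = [rescale_first(v, 100.0) for v in items]
query_cm = rescale_first(query, 100.0)
after = nearest_neighbor(query_cm, items_cm)

print(before, after)
```

The data are the same in both cases; only the unit of the first feature differs, yet the nearest item changes from the second to the first.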

The following questions play a central role in applying a method to large, high-dimensional data sets: what kinds of structures is the method able to extract from the data set, how does it illustrate the structures, does it reduce the dimensionality of the data, and does it reduce the number of data items?

- Visualization of high-dimensional data items
- Clustering methods
- Projection methods
- Self-organizing maps
- Relations and differences between SOM and MDS

Mon Mar 31 23:43:35 EET DST 1997