METHODS FOR EXPLORATORY DATA ANALYSIS

Several methods exist for quickly producing and visualizing simple summaries of data sets [Tukey, 1977]. For example, the so-called five-number summary, which consists of the smallest and largest data values, the median, and the first and third quartiles, can be visualized as a drawing in which each number corresponds to some graphical element, such as the height of a box.

Such simple methods are very useful for summarizing low-dimensional data sets, but as the dimensionality increases their ability to visualize interdimensional relations soon degrades.
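The five-number summary mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name `five_number_summary` is hypothetical, and the quartiles are computed here as Tukey-style hinges (medians of the lower and upper halves of the sorted data), which is one of several quartile conventions and an assumption not fixed by the text.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, lower hinge, median, upper hinge, max).

    Assumption: quartiles are taken as hinges, i.e. medians of the
    lower and upper halves of the sorted data; when the number of
    values is odd, the overall median is included in both halves.
    """
    data = sorted(values)
    if not data:
        raise ValueError("need at least one value")
    n = len(data)
    half = (n + 1) // 2          # size of each half (shares the median if n is odd)
    lower = data[:half]
    upper = data[n - half:]
    return (data[0], median(lower), median(data), median(upper), data[-1])


# Hypothetical toy data, one-dimensional.
print(five_number_summary([7, 1, 3, 9, 5, 11, 13]))  # (1, 4.0, 7, 10.0, 13)
```

Applied to each variable separately, such a summary describes one dimension at a time, which is why it says little about relations between dimensions.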

In this section some methods for illustrating structures, i.e., multivariate relations between data items, in high-dimensional data sets will be discussed. The treatment will be restricted to methods that regard the inputs as metric vectors and that can be used without making assumptions about the distribution of the data. It is also assumed that no external information, such as class labels, is available on the data items. The illustrations will then be driven solely by the actual structures in the data and not by prespecified assumptions about the class structure. Although the analysis is unsupervised, any available class labels may be used afterwards to aid in the interpretation of the results; used in this way, they do not affect the structures that have been found.

The vectors in the input data set will be denoted by $\mathbf{x}_k$, $k = 1, \ldots, N$. Here $\mathbf{x}_k \in \mathbb{R}^n$. In statistics it is customary to call the components of the data vectors observations recorded on variables. Here the mathematical terminology will be preferred, however. The components may also be called features, as is customary in the pattern recognition literature.

In this section the emphasis will be on methods that illustrate structures in given, prespecified data sets. It may be useful to note, however, that in practical applications the selection and preprocessing of the data may be even more important than the choice of the analysis method. For example, changes in the relative scales of the features have a drastic effect on the results of most of the methods that will be presented: the larger the scale of a component, the more the component affects the result. It is, however, very difficult to give general guidelines for the very application-specific task of preprocessing; the approaches used in some case studies will be discussed in Section 7.1.
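The effect of feature scaling can be made concrete with a small sketch. It assumes Euclidean distance between data vectors, and the toy data, the helper names `nearest` and `rescale`, and the choice of units are all hypothetical. Changing the unit of one feature is enough to change which item is closest to a query vector.

```python
import math


def nearest(query, points):
    """Index of the point closest to `query` in Euclidean distance."""
    return min(range(len(points)), key=lambda i: math.dist(query, points[i]))


def rescale(vec, factors):
    """Multiply each component of `vec` by the matching scale factor."""
    return tuple(v * f for v, f in zip(vec, factors))


# Hypothetical items described by (height in metres, weight in kilograms).
points = [(1.60, 71.0), (1.95, 70.0)]
query = (1.62, 70.0)

# With height in metres, the weight component dominates the distance.
raw_nearest = nearest(query, points)

# Expressing height in centimetres multiplies that component by 100,
# so the height difference now dominates instead.
factors = (100.0, 1.0)
scaled_points = [rescale(p, factors) for p in points]
scaled_nearest = nearest(rescale(query, factors), scaled_points)

print(raw_nearest, scaled_nearest)  # 1 0
```

Neither answer is the correct one in general; which scaling is appropriate depends on the application, which is the point made above about preprocessing.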

The following questions play a central role in applying a method to large, high-dimensional data sets: what kinds of structures is the method able to extract from the data set, how does it illustrate the structures, does it reduce the dimensionality of the data, and does it reduce the number of data items?

Sami Kaski
Mon Mar 31 23:43:35 EET DST 1997