It is relatively easy to answer well-specified questions about the statistical nature of a well-understood data set, such as ``how large should an aeroplane cockpit be to accommodate 95% of the potential pilots?'' In this example the data may be assumed to be normally distributed, and it is straightforward to estimate the threshold. The more data is available, the more accurately the question can be answered.
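As a minimal sketch of such a threshold estimate, assume that the relevant body measurement is normally distributed and that a one-sided upper limit covering 95% of the population is wanted. The function name `normal_threshold`, the measurement (sitting height in cm), and the synthetic samples are illustrative assumptions, not data from this work.

```python
import random
from statistics import NormalDist, mean, stdev


def normal_threshold(samples, coverage=0.95):
    """Estimate the value below which `coverage` of the population falls,
    assuming the samples come from a normal distribution."""
    mu = mean(samples)       # sample mean
    sigma = stdev(samples)   # sample standard deviation
    return NormalDist(mu, sigma).inv_cdf(coverage)


# Synthetic measurements (assumption: mean 90 cm, standard deviation 4 cm).
rng = random.Random(0)
small = [rng.gauss(90.0, 4.0) for _ in range(50)]
large = small + [rng.gauss(90.0, 4.0) for _ in range(5000)]

# Under the assumed distribution the exact threshold is about 96.58 cm;
# the estimate from the larger sample is typically closer to it.
print(normal_threshold(small))
print(normal_threshold(large))
```

The estimate depends on the data only through the sample mean and standard deviation, which is why additional data makes the answer more accurate in this well-specified setting.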

If, on the other hand, the data is not well-understood and the problem
is not well-specified, an increase in the amount of data may even have
the opposite effect. This holds for multivariate data in particular.
If the goal is simply to make sense of a data set, to
generate sensible hypotheses, or to find some interesting novel
patterns, it paradoxically seems that the more data there is available,
the more difficult it is to understand the data set. The structures
are hidden among the large amounts of multivariate data. When
exploring a data set for new insights, only methods that *discover
and illustrate effectively the structures in the data* can be of
help. Such methods, applied to large data sets, are the topic of this
work.

A data-driven search for statistical insights and models is
traditionally called *exploratory data analysis*
[Hoaglin, 1982, Jain and Dubes, 1988, Tukey, 1977, Velleman and Hoaglin, 1981] in the statistical
literature. The process of making statistical inferences often
consists of an exploratory, data-driven phase, followed by a
confirmatory phase in which the reproducibility of the results is
investigated. There thus exists a wealth of applications in which data
sets need to be summarized to gain insight into them; the goal in this work
is to present a data set in a form that is easily understandable but
that at the same time preserves as much of the essential information
in the data as possible.

Exploratory data analysis methods can be used as tools in knowledge
discovery in databases (KDD)
[Fayyad, 1996, Fayyad et al., 1996a, Fayyad et al., 1996c, Simoudis, 1996]. In this
relatively recently established field the emphasis is on the whole
interactive process of knowledge discovery, that is, the discovery of novel
patterns or structures in the data. The process consists of a
multitude of steps starting from setting up the goals to evaluating
the results, and possibly reformulating the goals based on the
results. *Data mining* is one step in the discovery
process, a step in which suitable tools from many other disciplines,
including exploratory data analysis, are used to find interesting
patterns in the data. Depending on the goals of the data mining
process, essentially any kind of pattern recognition
[Devijver and Kittler, 1982, Fu, 1974, Fukunaga, 1972, Therrien, 1989, Schalkoff, 1992], machine
learning [Forsyth, 1989, Langley, 1996, Michalski, 1983], or multivariate
analysis [Cooley and Lohnes, 1971, Hair, Jr. et al., 1984, Kendall, 1975] algorithm may be useful;
for recent examples, cf. Fayyad et al. (1996b).
An essential novelty in the field then lies in emphasizing the
discovery of previously unknown structures from vast databases, and in
emphasizing the importance of considering the whole process.

In this work, exploratory data analysis methods that illustrate the structures in data sets are applied to large databases. The tool in this endeavor will be the self-organizing map (SOM) [Kohonen, 1982, Kohonen, 1995c]. Some properties that distinguish the SOM from other data mining tools are that it is numerical instead of symbolic, nonparametric, and capable of learning without supervision. The numerical nature of the method enables it to treat numerical statistical data naturally and to represent graded relationships. Because the method does not require supervision and is nonparametric, meaning here that no assumptions about the distribution of the data need to be made, it may even find quite unexpected structures in the data.
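To make the numerical and unsupervised character of the method concrete, the following is a minimal sketch of the commonly used online SOM update with a Gaussian neighborhood on a one-dimensional map. It is an illustration only, not the implementation used in this work: the function name `train_som`, the linearly decaying learning rate and neighborhood radius, and the synthetic input data are all assumptions.

```python
import math
import random


def train_som(data, n_units=10, n_steps=2000, seed=0):
    """Fit a 1-D chain of model vectors to unlabeled numerical data."""
    rng = random.Random(seed)
    # Initialize the model vectors from randomly chosen data samples.
    units = [list(rng.choice(data)) for _ in range(n_units)]
    for t in range(n_steps):
        frac = t / n_steps
        alpha = 0.5 * (1.0 - frac)                         # learning rate (assumed schedule)
        radius = max(0.5, (n_units / 2.0) * (1.0 - frac))  # neighborhood width (assumed schedule)
        x = rng.choice(data)                               # no class labels are used
        # Best-matching unit: the model vector closest to the input.
        c = min(
            range(n_units),
            key=lambda i: sum((a - m) ** 2 for a, m in zip(x, units[i])),
        )
        # Move the winner and its neighbors on the map toward the input.
        for i in range(n_units):
            h = math.exp(-((i - c) ** 2) / (2.0 * radius ** 2))
            units[i] = [m + alpha * h * (a - m) for a, m in zip(x, units[i])]
    return units


# Synthetic two-dimensional data (assumption: uniform in the unit square).
rng = random.Random(1)
data = [(rng.random(), rng.random()) for _ in range(200)]
units = train_som(data)
```

The sketch uses only distances and weighted averages of the input vectors, and it needs neither labels nor an assumed distribution for the data.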

In this thesis the relation of the SOM to some other data visualization and clustering methods is first analyzed in Section 6. Then, recipes for using the SOM in exploratory data analysis are given in Section 7. The areas of application that are treated in the Publications are introduced in Section 8, and finally two recent developments in the methodology are discussed in Section 9.

Mon Mar 31 23:43:35 EET DST 1997