
MDL-based cost functions for supervised and unsupervised learning

 A common problem in pattern recognition is that one wants to recognise objects in given raw data. One might have a large amount of raw data available, but only a small amount of labelled data. The typical approach is to extract features from the raw data in an unsupervised manner and then to find a mapping from the features to the corresponding labels (desired outputs) in a supervised manner. The whole process can be described by $I_U \rightarrow (F_U = I_S) \rightarrow D_S$, where the arrows stand for mappings, $I$ for input, $F$ for feature, $D$ for desired output, $U$ for unsupervised and $S$ for supervised.

Let us denote the model by $M$. In supervised learning one wants to find a mapping from the input data $I_S$ to the desired output data $D_S$. The mapping is described by the model $M_S$, whose structure and parameters are to be estimated. In a conventional neural network, the cost function $C$ measures the error in the mapping: $C(D_S - M_S(I_S))$. The MDL-based cost function for supervised learning is $L(D_S \mid M_S, I_S) + L(M_S)$: the description length of the output data given the model and the input data, plus the description length of the model. The first term corresponds to $C(D_S - M_S(I_S))$ and the second term penalises the complexity of the model. The input is assumed to be known by both the sender and the receiver and is not included in the description length.
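As a concrete illustration, the two terms can be evaluated numerically once a coding scheme is fixed. The sketch below is an assumption, not from the text: residuals are coded with a fixed-variance Gaussian code and each parameter costs a fixed number of bits.

```python
import math

def residual_bits(residual, sigma=1.0):
    # -log2 of a Gaussian density: bits to encode one residual
    # under a fixed-variance Gaussian code (an assumed coding choice)
    return 0.5 * math.log2(2 * math.pi * sigma ** 2) \
        + residual ** 2 / (2 * sigma ** 2 * math.log(2))

def supervised_mdl_cost(inputs, targets, model, n_params,
                        bits_per_param=16, sigma=1.0):
    # L(D_S | M_S, I_S): encode the outputs given the model and inputs
    data_bits = sum(residual_bits(d - model(i), sigma)
                    for i, d in zip(inputs, targets))
    # L(M_S): penalise the complexity of the model
    # (hypothetical fixed cost per parameter)
    model_bits = n_params * bits_per_param
    return data_bits + model_bits
```

With this scheme a model that fits the data pays only the parameter bits plus the Gaussian coding baseline, while a poorly fitting model pays extra bits for its large residuals.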

In unsupervised learning one wants to find features $F_U$ which compactly describe the input data $I_U$. The quality of the features is evaluated by estimating how much information about the inputs they preserve, that is, how well the inputs can be reconstructed from the features. We shall consider a generative model $M_U$ which defines the reconstruction mapping $\hat{I}_U = M_U(F_U)$. The reconstruction error, which measures the quality of the features, is then $C(I_U - M_U(F_U))$. We shall use vector quantisation to illustrate the use of a generative model.

In vector quantisation, the model consists of model vectors $\boldsymbol{m}_i$, which directly define the reconstruction mapping $M_U$: given an index $i$, the reconstruction is $\boldsymbol{m}_i$. The index $i$ thus plays the role of the feature. The quality of the feature is measured with the reconstruction error $C$, or quantisation error as it is often called in connection with vector quantisation. The feature is defined as the index that minimises the reconstruction error, or in other words, the index of the closest model vector: $F_U = M_U^{-1}(I_U) = \arg \min_{F} C(I_U - M_U(F))$. The model can be adapted by minimising the reconstruction error, that is, by moving the closest model vector even closer to the input. The features and the model are thus both defined as minimisers of the reconstruction error.
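The encode and adapt steps above can be sketched as follows. This is a minimal one-dimensional version in which the function names, the squared error as $C$, and the learning-rate update rule are assumptions made for the illustration:

```python
def encode(x, model_vectors):
    # F_U = M_U^{-1}(I_U): the index of the closest model vector,
    # using squared error as the reconstruction error C
    return min(range(len(model_vectors)),
               key=lambda i: (x - model_vectors[i]) ** 2)

def reconstruct(i, model_vectors):
    # M_U: given an index i, the reconstruction is m_i
    return model_vectors[i]

def adapt(x, model_vectors, lr=0.1):
    # minimise the reconstruction error by moving the closest
    # model vector even closer to the input
    i = encode(x, model_vectors)
    model_vectors[i] += lr * (x - model_vectors[i])
    return i
```

For example, with model vectors at 0 and 10, an input of 2 is encoded as index 0, and adapting moves the first model vector a step toward 2.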

In addition to the term measuring the reconstruction error, the MDL-based cost function for unsupervised learning has terms penalising the complexity of the model and the features: $L(I_U \mid M_U, F_U) + L(M_U) + L(F_U)$. The cost function is used in exactly the same way as with vector quantisation: the features and the model are found by minimising it.
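For the vector quantisation case, the three terms can be summed under hypothetical coding choices: a Gaussian code for the residuals, a uniform code over the $K$ indices for the features, and a fixed number of bits per model vector. All of these choices are assumptions for this sketch:

```python
import math

def vq_description_length(data, model_vectors,
                          sigma=1.0, bits_per_param=16):
    K = len(model_vectors)
    total = K * bits_per_param                        # L(M_U)
    for x in data:
        # closest model vector under squared error
        i = min(range(K), key=lambda j: (x - model_vectors[j]) ** 2)
        total += math.log2(K)                         # L(F_U): uniform code over K indices
        r = x - model_vectors[i]
        total += 0.5 * math.log2(2 * math.pi * sigma ** 2) \
            + r ** 2 / (2 * sigma ** 2 * math.log(2))  # L(I_U | M_U, F_U)
    return total
```

A codebook that reconstructs the data well yields a shorter total description length than one with large quantisation errors, which is exactly what minimising the cost function exploits.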

The features are often considered a subset of the parameters of the model, because they too have to be estimated from the data and are included in the cost function. The only difference between the features and the parameters is that the features change with each input element, whereas the parameters stay constant across data elements.

Harri Lappalainen