Next: Ensemble learning Up: Information-theoretic approaches to learning Previous: Coding and complexity

### Minimum message length inference

A different line of research, which leads to ensemble learning, is represented by [130,106,131,44]. Like stochastic complexity, it also operates on parameterised statistical models rather than computer programmes.

We shall illustrate the treatment of real-valued parameters by considering a simple model with one parameter $\theta$ which determines the probability for the observation $x$. The probabilities $p(x \mid \theta)$ and $p(\theta)$ are assumed to be specified. Then we ask how, using the given model, we can encode $x$ into a message using the least number of bits. Assume that we require $x$ to be encoded with the precision $\varepsilon$. The purpose is not actually to encode $x$ or send a message to someone, but to learn about $\theta$ by imagining a coding scheme.

Wallace and Boulton [130] suggested the following two-part message: first encode $\theta$ discretised with precision $\delta\theta$ using the prior probability $p(\theta)$, and then encode $x$ using the model $p(x \mid \hat\theta)$ and the encoded value $\hat\theta$. This will produce a code with length $L(\hat\theta, \delta\theta) = -\log [\delta\theta \, p(\hat\theta)] - \log [\varepsilon \, p(x \mid \hat\theta)]$. If $\delta\theta$ is small, the number of bits used for the first part of the message is approximately

$$
L_1(\hat\theta, \delta\theta) = -\log [\delta\theta \, p(\hat\theta)] = -\log p(\hat\theta) - \log \delta\theta \tag{12}
$$

The number of bits used for the second part of the message depends on the discretised value $\hat\theta$ of the parameter. Assuming the target value for $\theta$ was $\theta_0$, the discretised value $\hat\theta$ lies between $\theta_0 - \delta\theta/2$ and $\theta_0 + \delta\theta/2$ with roughly uniform probability. This means that the expected number of bits used for the second part of the message is approximately

$$
L_2(\theta_0, \delta\theta) = -\frac{1}{\delta\theta} \int_{\theta_0 - \delta\theta/2}^{\theta_0 + \delta\theta/2} \log p(x \mid \theta) \, d\theta \tag{13}
$$

If $\delta\theta$ is small, it is possible to approximate the length of the second part of the message by using a second-order Taylor series expansion of $\log p(x \mid \theta)$ about $\theta_0$.
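Carrying this expansion out explicitly (a standard step, sketched here from the definitions above): expanding $\log p(x \mid \theta)$ about $\theta_0$ gives

$$
-\log p(x \mid \hat\theta) \approx -\log p(x \mid \theta_0) - (\hat\theta - \theta_0) \frac{\partial \log p(x \mid \theta_0)}{\partial \theta} - \frac{(\hat\theta - \theta_0)^2}{2} \frac{\partial^2 \log p(x \mid \theta_0)}{\partial \theta^2} \,.
$$

Averaging over $\hat\theta$ uniform on $[\theta_0 - \delta\theta/2,\; \theta_0 + \delta\theta/2]$, the first-order term vanishes by symmetry and $\mathrm{E}[(\hat\theta - \theta_0)^2] = \delta\theta^2/12$, so equation (13) becomes

$$
L_2(\theta_0, \delta\theta) \approx -\log p(x \mid \theta_0) - \frac{\delta\theta^2}{24} \frac{\partial^2 \log p(x \mid \theta_0)}{\partial \theta^2} \,.
$$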

Given $x$, the total message length is a function of $\theta_0$ and $\delta\theta$. The result of learning is the optimal value for both, which minimises the message length. Looking at the equations for $L_1$ and $L_2$, it is evident that there is an optimal value for $\delta\theta$: increasing the size of the discretisation bins decreases $L_1$ due to the term $-\log \delta\theta$, but it increases $L_2$ because the discretisation errors increase and the expected deviation of $\hat\theta$ from the optimal $\theta$ grows larger.
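This trade-off is easy to see numerically. The sketch below uses a hypothetical example not taken from the text: a single observation $x$ from a Gaussian model $p(x \mid \theta) = \mathcal{N}(x;\, \theta, \sigma^2)$ with a broad Gaussian prior $p(\theta) = \mathcal{N}(\theta;\, 0, \tau^2)$, fixing $\theta_0$ at the peak of $p(x \mid \theta)$ and scanning $\delta\theta$:

```python
import numpy as np

# Hypothetical example: p(x|theta) = N(x; theta, sigma^2), prior N(0, tau^2).
sigma, tau = 1.0, 10.0
x = 0.7
theta0 = x  # peak of p(x | theta)

def log2_gauss(z, mean, std):
    """log2 of a Gaussian density."""
    return -0.5 * np.log2(2 * np.pi * std**2) - (z - mean)**2 / (2 * std**2 * np.log(2))

def message_length(delta, n_grid=2001):
    # Part 1: encode theta_0 discretised with precision delta using the prior,
    # as in equation (12):  -log2 p(theta_0) - log2 delta.
    part1 = -log2_gauss(theta0, 0.0, tau) - np.log2(delta)
    # Part 2: expected -log2 p(x | theta_hat) with theta_hat roughly uniform
    # on [theta0 - delta/2, theta0 + delta/2], as in equation (13)
    # (numerical average over a fine grid).
    theta_hat = np.linspace(theta0 - delta / 2, theta0 + delta / 2, n_grid)
    part2 = np.mean(-log2_gauss(x, theta_hat, sigma))
    return part1 + part2

deltas = np.linspace(0.1, 10.0, 500)
lengths = [message_length(d) for d in deltas]
best = deltas[int(np.argmin(lengths))]

# Coarser bins shrink part 1 but inflate part 2; an interior optimum exists.
# The second-order analysis predicts delta* = sqrt(12) * sigma here, since
# -d^2/dtheta^2 ln p(x|theta) = 1/sigma^2 for this Gaussian model.
print(best, np.sqrt(12) * sigma)
```

The interior minimum found by the scan agrees with the second-order prediction, since $\log p(x \mid \theta)$ is exactly quadratic in this example.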

The optimal value for $\delta\theta$ depends on how quickly $p(x \mid \theta)$ drops as $\theta$ is moved further away from the optimal value, and it turns out that the optimal $\delta\theta$ is linearly dependent on the width of the maximum peak of $p(x \mid \theta)$. Therefore the optimal $\theta_0$ will tell the most plausible value for the parameter, and $\delta\theta$ will tell how uncertain that value is.
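The linear dependence can be made explicit by minimising the $\delta\theta$-dependent part of the total message length, using the Taylor-expanded form of the second part (a sketch, using the same notation as above):

$$
\frac{\partial}{\partial \delta\theta} \left[ -\log \delta\theta - \frac{\delta\theta^2}{24} \frac{\partial^2 \log p(x \mid \theta_0)}{\partial \theta^2} \right] = -\frac{1}{\delta\theta} - \frac{\delta\theta}{12} \frac{\partial^2 \log p(x \mid \theta_0)}{\partial \theta^2} = 0
$$

which gives

$$
\delta\theta_{\mathrm{opt}} = \sqrt{\frac{12}{-\,\partial^2 \log p(x \mid \theta_0) / \partial \theta^2}} \,.
$$

Since the width of the peak of $p(x \mid \theta)$ also scales as $(-\partial^2 \log p / \partial \theta^2)^{-1/2}$, the optimal $\delta\theta$ is proportional to that width.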

An accessible introduction to learning based on this coding scheme, known as minimum message length inference, can be found in [97,96,6].

Harri Valpola
2000-10-31