Minimum message length inference

Next: Ensemble learning Up: Information-theoretic approaches to learning Previous: Coding and complexity

Minimum message length inference

A different line of research, which leads to ensemble learning, is represented by [130,106,131,44]. Like stochastic complexity, it also operates on parameterised statistical models rather than computer programmes.

We shall illustrate the treatment of real valued parameters by considering a simple model with one parameter $\theta$ which determines the probability for the observation x. The probabilities $p(x \vert \theta)$ and $p(\theta)$ are assumed to be specified. Then we ask how, using the given model, we can encode x into a message using the least amount of bits. Assume that we require x to be encoded with the precision $\epsilon_x$ . The purpose is not actually to encode x or send a message to someone, but to learn about $\theta$ by imagining a coding scheme.

Wallace and Boulton [130] suggested the following two-part message: first encode $\theta$ discretised with precision $\epsilon_\theta$ using the prior probability $p(\theta)$ and then encode x using the model $p(x \vert \theta)$ and the encoded $\theta$ . This will produce a code with length $L(\theta) + L(x \vert \theta)$ . If $\epsilon_\theta$ is small, the number of bits used for the first part of the message is approximately

$\begin{displaymath}L(\theta) \approx -\log [p(\theta) \epsilon_\theta]. \end{displaymath}$

(12)

The number of bits used for the second part of the message depends on the discretised value of $\theta$ . Assuming the target value for $\theta$ was $\theta_0$ , the discretised value lies between $\theta_0 - \epsilon_\theta/2$ and $\theta_0 + \epsilon_\theta/2$ with roughly uniform probability. This means that the expected number of bits used for the second part is approximately

$\begin{displaymath}L(x \vert \theta) = \int_{\theta_0-\epsilon_\theta/2} ^{\the... ...silon_\theta/2} -\log [p(x \vert \theta) \epsilon_x] d\theta. \end{displaymath}$

(13)

If $\epsilon_\theta$ is small, it is possible to approximate the length of the second part of the message by using a second order Taylor series expansion of $-\log p(x \vert \theta)$ about $\theta_0$ .

Given x, the total message length is a function of $\theta_0$ and $\epsilon_\theta$ . The result of learning is the optimal value for both, which minimises the message length. Looking at the equations for $L(\theta)$ and $L(x \vert \theta)$ , it is evident that there is an optimal value for $\epsilon_\theta$ , because increasing the size of the discretisation bins will decrease $L(\theta)$ due to the term $-\log \epsilon_\theta$ , but it will increase the term $L(x \vert \theta)$ because the discretisation errors will increase and the expected deviations from optimal $\theta_0$ grow larger.

The optimal value for $\epsilon_\theta$ depends on how quickly $p(x \vert \theta)$ drops as $\theta$ is moved further away from the optimal value, and it turns out that the optimal $\epsilon_\theta$ is linearly dependent on the width of the maximum peak of $p(x \vert \theta)$ . Therefore the optimal $\theta_0$ will tell the most plausible value for the parameter, and $\epsilon_\theta$ will tell how uncertain the value is.

An accessible introduction to learning based on this coding scheme, known as minimum message length inference, can be found in [97,96,6].

Next: Ensemble learning Up: Information-theoretic approaches to learning Previous: Coding and complexity

Harri Valpola
2000-10-31