Minimum message length inference

A different line of research, which leads to ensemble learning, is represented by [130,106,131,44]. Like stochastic complexity, it also operates on parameterised statistical models rather than computer programmes.

We shall illustrate the treatment of real-valued parameters by considering a simple model with one parameter $\theta$ which determines the probability of the observation x. The probabilities $p(x \vert \theta)$ and $p(\theta)$ are assumed to be specified. We then ask how, using the given model, x can be encoded into a message with the fewest bits. Assume that x is required to be encoded with precision $\epsilon_x$. The purpose is not actually to encode x or send a message to anyone, but to learn about $\theta$ by imagining a coding scheme.

Wallace and Boulton [130] suggested the following two-part message: first encode $\theta$ discretised with precision $\epsilon_\theta$ using the prior probability $p(\theta)$ and then encode x using the model $p(x \vert \theta)$ and the encoded $\theta$. This will produce a code with length $L(\theta) + L(x \vert \theta)$. If $\epsilon_\theta$ is small, the number of bits used for the first part of the message is approximately

\begin{displaymath}
L(\theta) \approx -\log [p(\theta) \epsilon_\theta] .
\end{displaymath} (12)


The number of bits used for the second part of the message depends on the discretised value of $\theta$. Assuming that the target value for $\theta$ is $\theta_0$, the discretised value lies between $\theta_0 - \epsilon_\theta/2$ and $\theta_0 + \epsilon_\theta/2$ with roughly uniform probability. This means that the expected number of bits used for the second part is approximately

\begin{displaymath}
L(x \vert \theta) = \frac{1}{\epsilon_\theta}
\int_{\theta_0-\epsilon_\theta/2}^{\theta_0+\epsilon_\theta/2}
-\log [p(x \vert \theta) \epsilon_x] \, d\theta .
\end{displaymath} (13)


If $\epsilon_\theta$ is small, the length of the second part of the message can be approximated using a second-order Taylor series expansion of $-\log p(x \vert \theta)$ about $\theta_0$.
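As a sketch of this approximation, assume the discretisation error is uniformly distributed over the bin, so that the linear term of the expansion averages out and the squared error has mean $\epsilon_\theta^2/12$. The expected length of the second part then becomes approximately

\begin{displaymath}
L(x \vert \theta) \approx -\log [p(x \vert \theta_0) \epsilon_x]
+ \frac{\epsilon_\theta^2}{24} \,
\frac{\partial^2}{\partial \theta^2} \left[ -\log p(x \vert \theta) \right]_{\theta = \theta_0} .
\end{displaymath}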

Given x, the total message length is a function of $\theta_0$ and $\epsilon_\theta$. The result of learning is the pair of values which minimises the message length. Looking at the equations for $L(\theta)$ and $L(x \vert \theta)$, it is evident that there is an optimal value for $\epsilon_\theta$: increasing the size of the discretisation bins decreases $L(\theta)$ due to the term $-\log \epsilon_\theta$, but increases $L(x \vert \theta)$ because the discretisation errors grow and the expected deviations from the optimal $\theta_0$ become larger.
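Under the quadratic approximation sketched above, the total message length and the discretisation width minimising it are, roughly,

\begin{displaymath}
L(\theta) + L(x \vert \theta) \approx
-\log [p(\theta_0) \epsilon_\theta]
- \log [p(x \vert \theta_0) \epsilon_x]
+ \frac{\epsilon_\theta^2}{24} H(\theta_0) ,
\qquad
\epsilon_\theta^{\mathrm{opt}} \approx \sqrt{12 / H(\theta_0)} ,
\end{displaymath}

where $H(\theta_0)$ denotes the second derivative of $-\log p(x \vert \theta)$ with respect to $\theta$ at $\theta_0$.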

The optimal value for $\epsilon_\theta$ depends on how quickly $p(x \vert \theta)$ drops as $\theta$ moves away from the optimal value, and it turns out that the optimal $\epsilon_\theta$ is proportional to the width of the highest peak of $p(x \vert \theta)$. The optimal $\theta_0$ therefore indicates the most plausible value for the parameter, and $\epsilon_\theta$ indicates how uncertain that value is.
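The following small numerical sketch (not part of the original treatment; the Gaussian model, the broad prior and all constants are chosen purely for illustration) evaluates the two-part message length of the coding scheme above for a toy model $x \sim N(\theta, \sigma^2)$ and confirms that the best discretisation width is close to $\sqrt{12}\,\sigma$, i.e. proportional to the width of the likelihood peak. Lengths are measured in nats rather than bits; this does not change the location of the optimum.

\begin{verbatim}
import numpy as np

x, sigma = 0.7, 0.5    # one observation, known noise std (toy values)
prior_std = 10.0       # broad Gaussian prior p(theta), hypothetical choice
eps_x = 1e-3           # fixed precision with which x is encoded

def neg_log_lik(theta):
    # -log p(x | theta) for x ~ N(theta, sigma^2), in nats
    return 0.5 * ((x - theta) / sigma) ** 2 + np.log(sigma * np.sqrt(2 * np.pi))

def neg_log_prior(theta):
    # -log p(theta) for theta ~ N(0, prior_std^2), in nats
    return 0.5 * (theta / prior_std) ** 2 + np.log(prior_std * np.sqrt(2 * np.pi))

def message_length(theta0, eps_theta):
    # first part: -log[p(theta0) * eps_theta], cf. Eq. (12)
    part1 = neg_log_prior(theta0) - np.log(eps_theta)
    # second part: average of -log[p(x|theta) * eps_x] over the bin,
    # a numerical version of Eq. (13)
    thetas = np.linspace(theta0 - eps_theta / 2, theta0 + eps_theta / 2, 201)
    part2 = np.mean(neg_log_lik(thetas)) - np.log(eps_x)
    return part1 + part2

# fix theta0 at the maximum-likelihood value and scan eps_theta
eps_grid = np.linspace(0.1, 5.0, 400)
lengths = [message_length(x, e) for e in eps_grid]
best = eps_grid[int(np.argmin(lengths))]
print("optimal eps_theta ", round(best, 2))
print("sqrt(12) * sigma  ", round(np.sqrt(12) * sigma, 2))
\end{verbatim}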

An accessible introduction to learning based on this coding scheme, known as minimum message length inference, can be found in [97,96,6].

