Next: Ensemble learning Up: Information-theoretic approaches to learning Previous: Coding and complexity

### Minimum message length inference

A different line of research, which leads to ensemble learning, is represented by [130,106,131,44]. Like stochastic complexity, it also operates on parameterised statistical models rather than computer programmes.

We shall illustrate the treatment of real-valued parameters by considering a simple model with one parameter $\theta$ which determines the probability for the observation $x$. The probabilities $p(x \mid \theta)$ and $p(\theta)$ are assumed to be specified. Then we ask how, using the given model, we can encode $x$ into a message using the least number of bits. Assume that we require $x$ to be encoded with the precision $\varepsilon$. The purpose is not actually to encode $x$ or send a message to someone, but to learn about $\theta$ by imagining a coding scheme.

Wallace and Boulton [130] suggested the following two-part message: first encode $\theta$ discretised with precision $\delta\theta$ using the prior probability $p(\theta)$, and then encode $x$ using the model $p(x \mid \hat\theta)$ and the encoded value $\hat\theta$. This will produce a code with length $L(\hat\theta, \delta\theta) = -\log [\delta\theta \, p(\hat\theta)] - \log [\varepsilon \, p(x \mid \hat\theta)]$. If $\delta\theta$ is small, the number of bits used for the first part of the message is approximately

$$
L_1(\hat\theta, \delta\theta) = -\log [\delta\theta \, p(\hat\theta)] = -\log p(\hat\theta) - \log \delta\theta \tag{12}
$$

The number of bits used for the second part of the message depends on the discretised value $\hat\theta$ of the parameter. Assuming the target value for $\theta$ was $\theta_0$, the discretised value $\hat\theta$ lies between $\theta_0 - \delta\theta/2$ and $\theta_0 + \delta\theta/2$ with roughly uniform probability. This means that the expected number of bits used for the second part of the message is approximately

$$
L_2(\theta_0, \delta\theta) = -\frac{1}{\delta\theta} \int_{\theta_0 - \delta\theta/2}^{\theta_0 + \delta\theta/2} \log p(x \mid \theta) \, d\theta \tag{13}
$$

If $\delta\theta$ is small, it is possible to approximate the length of the second part of the message by using a second-order Taylor series expansion of $\log p(x \mid \theta)$ about $\theta_0$.
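Carrying this expansion out explicitly (a standard step, sketched here from the definitions above): expanding $\log p(x \mid \theta)$ about $\theta_0$ gives

$$
-\log p(x \mid \hat\theta) \approx -\log p(x \mid \theta_0) - (\hat\theta - \theta_0) \frac{\partial \log p(x \mid \theta_0)}{\partial \theta} - \frac{(\hat\theta - \theta_0)^2}{2} \frac{\partial^2 \log p(x \mid \theta_0)}{\partial \theta^2} \,.
$$

Averaging over $\hat\theta$ uniform on $[\theta_0 - \delta\theta/2,\; \theta_0 + \delta\theta/2]$, the first-order term vanishes by symmetry and $\mathrm{E}[(\hat\theta - \theta_0)^2] = \delta\theta^2/12$, so equation (13) becomes

$$
L_2(\theta_0, \delta\theta) \approx -\log p(x \mid \theta_0) - \frac{\delta\theta^2}{24} \frac{\partial^2 \log p(x \mid \theta_0)}{\partial \theta^2} \,.
$$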

Given $x$, the total message length is a function of $\theta_0$ and $\delta\theta$. The result of learning is the optimal value for both, which minimises the message length. Looking at the equations for $L_1$ and $L_2$, it is evident that there is an optimal value for $\delta\theta$: increasing the size of the discretisation bins decreases $L_1$ due to the term $-\log \delta\theta$, but it increases $L_2$ because the discretisation errors increase and the expected deviation of $\hat\theta$ from the optimal $\theta$ grows larger.
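This trade-off is easy to see numerically. The sketch below uses a hypothetical example not taken from the text: a single observation $x$ from a Gaussian model $p(x \mid \theta) = \mathcal{N}(x;\, \theta, \sigma^2)$ with a broad Gaussian prior $p(\theta) = \mathcal{N}(\theta;\, 0, \tau^2)$, fixing $\theta_0$ at the peak of $p(x \mid \theta)$ and scanning $\delta\theta$:

```python
import numpy as np

# Hypothetical example: p(x|theta) = N(x; theta, sigma^2), prior N(0, tau^2).
sigma, tau = 1.0, 10.0
x = 0.7
theta0 = x  # peak of p(x | theta)

def log2_gauss(z, mean, std):
    """log2 of a Gaussian density."""
    return -0.5 * np.log2(2 * np.pi * std**2) - (z - mean)**2 / (2 * std**2 * np.log(2))

def message_length(delta, n_grid=2001):
    # Part 1: encode theta_0 discretised with precision delta using the prior,
    # as in equation (12):  -log2 p(theta_0) - log2 delta.
    part1 = -log2_gauss(theta0, 0.0, tau) - np.log2(delta)
    # Part 2: expected -log2 p(x | theta_hat) with theta_hat roughly uniform
    # on [theta0 - delta/2, theta0 + delta/2], as in equation (13)
    # (numerical average over a fine grid).
    theta_hat = np.linspace(theta0 - delta / 2, theta0 + delta / 2, n_grid)
    part2 = np.mean(-log2_gauss(x, theta_hat, sigma))
    return part1 + part2

deltas = np.linspace(0.1, 10.0, 500)
lengths = [message_length(d) for d in deltas]
best = deltas[int(np.argmin(lengths))]

# Coarser bins shrink part 1 but inflate part 2; an interior optimum exists.
# The second-order analysis predicts delta* = sqrt(12) * sigma here, since
# -d^2/dtheta^2 ln p(x|theta) = 1/sigma^2 for this Gaussian model.
print(best, np.sqrt(12) * sigma)
```

The interior minimum found by the scan agrees with the second-order prediction, since $\log p(x \mid \theta)$ is exactly quadratic in this example.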

The optimal value for $\delta\theta$ depends on how quickly $p(x \mid \theta)$ drops as $\theta$ is moved further away from the optimal value, and it turns out that the optimal $\delta\theta$ is linearly dependent on the width of the maximum peak of $p(x \mid \theta)$. Therefore the optimal $\theta_0$ will tell the most plausible value for the parameter, and $\delta\theta$ will tell how uncertain that value is.
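The linear dependence can be made explicit by minimising the $\delta\theta$-dependent part of the total message length, using the Taylor-expanded form of the second part (a sketch, using the same notation as above):

$$
\frac{\partial}{\partial \delta\theta} \left[ -\log \delta\theta - \frac{\delta\theta^2}{24} \frac{\partial^2 \log p(x \mid \theta_0)}{\partial \theta^2} \right] = -\frac{1}{\delta\theta} - \frac{\delta\theta}{12} \frac{\partial^2 \log p(x \mid \theta_0)}{\partial \theta^2} = 0
$$

which gives

$$
\delta\theta_{\mathrm{opt}} = \sqrt{\frac{12}{-\,\partial^2 \log p(x \mid \theta_0) / \partial \theta^2}} \,.
$$

Since the width of the peak of $p(x \mid \theta)$ also scales as $(-\partial^2 \log p / \partial \theta^2)^{-1/2}$, the optimal $\delta\theta$ is proportional to that width.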

An accessible introduction to learning based on this coding scheme, known as minimum message length inference, can be found in [97,96,6].

Harri Valpola
2000-10-31