The minimum message length encoding, which was discussed in
section 4.3.2, is not decodable; that is, it does not allow the
receiver to actually recover *x*. This is because in order to decode
$\theta$, the receiver would need to know the precision $\epsilon$ in
addition to the prior knowledge $p(\theta)$, but $\epsilon$ was not
included in the message. In [131], Wallace and Freeman
argued that the values and the accuracies of the parameters are not
independent, and that one can construct a decodable message with almost the
same code length as in minimum message length coding.

Later, Hinton and van Camp introduced a ``bits-back'' argument to show that, with a clever coding scheme, a decodable message is obtained without encoding $\epsilon$ [44]. The sender uses the following coding scheme:

1. Compute a distribution $q(\theta)$ based on the observation *x* (the algorithm for doing this is defined later).
2. Pick $\theta$ from the distribution $q(\theta)$ and encode it with a very high prespecified precision $\epsilon$ ($\epsilon$ is close to zero) using the prior $p(\theta)$.
3. Encode *x* using the encoded $\theta$ and $p(x \mid \theta)$.
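The encoding scheme above can be sketched numerically. In the following toy example (the Gaussian model, the observation value, and all names are my own choices, not from the text), the first part of the message gets longer as $\epsilon$ shrinks, but the secondary message hidden in the choice of $\theta$ grows by exactly the same amount, so the net cost $\log_2 q(\theta)/p(\theta)$ stays fixed:

```python
import math
import random

def npdf(t, mu, var):
    """Density of a normal N(mu, var) at t."""
    return math.exp(-(t - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy model (my choice, not from the text): prior p(theta) = N(0, 1),
# likelihood p(x|theta) = N(theta, 1), and q(theta) the exact posterior.
x = 1.3
q_mu, q_var = x / 2, 0.5

random.seed(0)
theta = random.gauss(q_mu, math.sqrt(q_var))  # steps 1-2: pick theta from q

nets = []
for eps in (1e-3, 1e-6, 1e-9):
    first_part = -math.log2(npdf(theta, 0.0, 1.0) * eps)     # encode theta with p(theta), accuracy eps
    refund     = -math.log2(npdf(theta, q_mu, q_var) * eps)  # secondary-message bits hidden in the choice
    second_part = -math.log2(npdf(x, theta, 1.0))            # step 3: encode x with p(x|theta)
    nets.append(first_part - refund)
    print(f"eps={eps:g}: first part {first_part:6.1f} bits, "
          f"second part {second_part:.2f} bits, net {nets[-1]:.6f} bits")
# The net cost of the first part, log2 q(theta)/p(theta), does not depend on eps.
```

The same cancellation of $\epsilon$ is what the derivation below establishes in general.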

At first glance it would seem that the length of the first part of the message,

$$
-\int q(\theta) \log \big( p(\theta)\,\epsilon \big)\, d\theta ,
\tag{17}
$$

is very high due to the small $\epsilon$. It turns out that this is not the case, however. The trick is that in step 2, the sender can encode a secondary message in the choice of $\theta$. The average length of this message is

$$
-\int q(\theta) \log \big( q(\theta)\,\epsilon \big)\, d\theta .
\tag{18}
$$

Although the length of the first part of the message is very high, most of it is actually used by the secondary message. In the end, the receiver will get those bits back, so to say, and hence the name bits-back argument. The total length of the part which is used for encoding $\theta$ is therefore

$$
-\int q(\theta) \log \big( p(\theta)\,\epsilon \big)\, d\theta
+ \int q(\theta) \log \big( q(\theta)\,\epsilon \big)\, d\theta .
\tag{19}
$$

It might not be immediately clear that the receiver can decode the secondary message hidden in the choice of $\theta$. At the receiving end, the decoding proceeds as follows:

1. Decode $\theta$ from the first part of the message using $p(\theta)$.
2. Decode *x* from the second part of the message using $p(x \mid \theta)$.
3. Run the same algorithm as the sender used for computing the distribution $q(\theta)$ based on *x*.
4. Decode the secondary message from $\theta$ using $q(\theta)$.
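A minimal discrete sketch of how bits can be hidden in the choice of $\theta$ and read back by the receiver (the four-valued distribution and all names are my own illustration; a practical scheme would use arithmetic coding):

```python
# q(theta) over theta in {0, 1, 2, 3}; dyadic probabilities keep the example exact.
q = [0.5, 0.25, 0.125, 0.125]
cdf = [0.0]
for p in q:
    cdf.append(cdf[-1] + p)

def embed(bits):
    """Sender: read the secondary bits as a binary fraction u in [0,1) and
    send the theta whose CDF cell [cdf[i], cdf[i+1]) contains u."""
    u = sum(b / 2 ** (i + 1) for i, b in enumerate(bits))
    return next(i for i in range(len(q)) if cdf[i] <= u < cdf[i + 1])

def recover(theta):
    """Receiver: after recomputing q, extract every secondary bit that the
    CDF cell of the received theta fully determines (about -log2 q(theta) bits)."""
    lo, hi = cdf[theta], cdf[theta + 1]
    bits = []
    while True:
        if hi <= 0.5:              # whole cell in [0, 0.5): next bit is 0
            bits.append(0); lo, hi = 2 * lo, 2 * hi
        elif lo >= 0.5:            # whole cell in [0.5, 1): next bit is 1
            bits.append(1); lo, hi = 2 * lo - 1, 2 * hi - 1
        else:                      # cell straddles 0.5: bit undetermined, stop
            return bits

payload = [1, 1, 0]
theta = embed(payload)
print(theta, recover(theta))
```

With this dyadic $q$, choosing cell $i$ determines exactly $-\log_2 q(\theta_i)$ leading bits of the secondary message, matching the average length in (18); in a full implementation, bits left undetermined by one choice would be carried over to the next encoded variable.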

Computing all the terms, it turns out that $\epsilon$ appears in both (17) and (18), but since these terms appear with opposite signs, $\epsilon$ disappears from the equations. The remaining terms are

$$
\int q(\theta) \log \frac{q(\theta)}{p(\theta)}\, d\theta .
\tag{20}
$$

This is the number of bits actually effectively used for encoding $\theta$ in the first part of the message. The length of the second part of the message is

$$
-\int q(\theta) \log p(x \mid \theta)\, d\theta .
\tag{21}
$$

The total length of the message is thus

$$
\int q(\theta) \log \frac{q(\theta)}{p(x \mid \theta)\, p(\theta)}\, d\theta ,
\tag{22}
$$

which differs from the cost function used in ensemble learning only by the term $\log p(x)$, which can be neglected as it does not depend on the approximation $q(\theta)$. The bits-back argument thus results in ensemble learning.
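This relation can be checked numerically. The sketch below (the conjugate Gaussian model, the numbers, and all names are my own illustration, not from the text) integrates the total message length for a non-optimal $q(\theta)$ and confirms that it equals the Kullback-Leibler divergence between $q(\theta)$ and the exact posterior minus $\log_2 p(x)$:

```python
import math

def npdf(t, mu, var):
    """Density of N(mu, var) at t."""
    return math.exp(-(t - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy conjugate model (my choice): prior p(theta) = N(0, 1), likelihood
# p(x|theta) = N(theta, 1), so the posterior is N(x/2, 1/2) and p(x) = N(x; 0, 2).
x = 1.3
post_mu, post_var = x / 2, 0.5
evidence = npdf(x, 0.0, 2.0)

# A deliberately non-optimal approximation q(theta).
q_mu, q_var = 0.4, 0.3

lo, hi, n = -8.0, 8.0, 20001
h = (hi - lo) / (n - 1)
total = 0.0   # total message length: E_q[log2 q(theta) / (p(x|theta) p(theta))]
kl = 0.0      # KL divergence from q to the exact posterior, in bits
for i in range(n):
    t = lo + i * h
    qt = npdf(t, q_mu, q_var)
    total += h * qt * math.log2(qt / (npdf(x, t, 1.0) * npdf(t, 0.0, 1.0)))
    kl += h * qt * math.log2(qt / npdf(t, post_mu, post_var))

print(total, kl - math.log2(evidence))  # the two agree
```

Minimizing the total message length over $q(\theta)$ is therefore the same as minimizing the KL divergence to the posterior, since $\log_2 p(x)$ is fixed by the data.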

This derivation of ensemble learning is useful because it gives an intuitive meaning to the cost function used in ensemble learning: its different terms can be interpreted as the numbers of bits used for encoding the various parameters and the observations. It is then possible to see how many bits have been used for representing each parameter and to draw conclusions about the importance of that parameter to the model.

At first glance, it might seem that minimum message length and bits-back coding are quite different, the former using a finite accuracy and the latter a very high accuracy $\epsilon$. It turns out, however, that minimum message length coding can be seen as a special case of bits-back coding where $q(\theta)$ is chosen to be a uniform distribution between $\theta_0 - \epsilon'/2$ and $\theta_0 + \epsilon'/2$, where $\theta_0$ is the parameter value sent in the message and $\epsilon'$ is the finite accuracy used in the minimum message length scheme.
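This equivalence can also be checked numerically. In the sketch below (the Gaussian model and all names are my own illustration, not from the text), a two-part message of minimum-message-length type, with the parameter encoded to a finite accuracy, has almost exactly the same length as the bits-back total length evaluated with a uniform $q(\theta)$ of that width:

```python
import math

def npdf(t, mu, var):
    """Density of N(mu, var) at t."""
    return math.exp(-(t - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy model (my choice): prior N(0, 1), likelihood N(theta, 1), observation x.
x = 1.3
theta0 = x / 2           # parameter value sent in the MML-style message
eps = 0.01               # finite MML accuracy

# Two-part MML-style length: theta0 to accuracy eps, then x given theta0.
mml_len = (-math.log2(npdf(theta0, 0.0, 1.0) * eps)
           - math.log2(npdf(x, theta0, 1.0)))

# Bits-back total length with q uniform on [theta0 - eps/2, theta0 + eps/2]:
# E_q[log2 q(theta) / (p(x|theta) p(theta))], by midpoint quadrature.
n = 2000
h = eps / n
qdens = 1.0 / eps
bb_len = 0.0
for i in range(n):
    t = theta0 - eps / 2 + (i + 0.5) * h
    bb_len += h * qdens * math.log2(qdens / (npdf(x, t, 1.0) * npdf(t, 0.0, 1.0)))

print(mml_len, bb_len)  # nearly equal for small eps
```

The agreement holds because, over a narrow uniform $q$, the entropy term contributes $\log_2(1/\epsilon')$ bits while the prior and likelihood terms are essentially evaluated at $\theta_0$, reproducing the two-part code length.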