next up previous contents
Next: Specification of the model Up: Ensemble learning Previous: Cost function

Bits-back argument

The minimum message length encoding, which was discussed in section 4.3.2, is not decodable, that is, it is not sufficient to actually find x. This is because in order to decode $\theta$, the receiver should know the precision $\epsilon_\theta$ in addition to the prior knowledge $p(\theta)$, but $\epsilon_\theta$ was not included in the message. In [131], Wallace and Freeman argued that the values and the accuracies of the parameters are not independent, and one can construct a decodable message with almost the same code length as in minimum message length coding.

Later Hinton and van Camp introduced a ``bits-back'' argument to show that by a clever coding scheme, a decodable message is obtained without encoding $\epsilon_\theta$ [44]. The sender uses the following coding scheme:

Compute a distribution $q(\theta \vert x)$ based on the observation x (the algorithm for doing this is defined later).
Pick $\theta$ from the distribution $q(\theta \vert x)$ and encode it with a very high prespecified precision ( $\epsilon_\theta$ is close to zero) using $p(\theta)$.
Encode x using the encoded $\theta$ and $p(x \vert \theta)$.
It would seem that the sender has had to use a very large number of bits, because the length of the first part

\begin{displaymath}L(\theta) = -\int q(\theta \vert x) \log [p(\theta) \epsilon_\theta] d\theta
\end{displaymath} (17)

is very high due to small $\epsilon_\theta$. It turns out that this is not the case, however. The trick is that in step 2, the sender can encode a secondary message in the choice of $\theta$. The average length of this message is

\begin{displaymath}L(\theta \vert q) = - \int q(\theta \vert x) \log [q(\theta \vert x)
\epsilon_\theta] d\theta.
\end{displaymath} (18)

Although the length of the first part of the message is very high, most of it is actually used by the secondary message. In the end, the receiver will get those bits back, so to say, and hence the name bits-back argument. The total length of the part which is used for encoding x is

\begin{displaymath}L(\theta) + L(x \vert \theta) - L(\theta \vert q).
\end{displaymath} (19)

It might not be immediately clear that the receiver can decode the secondary message hidden in the choice of $\theta$. In the receiving end, the decoding proceeds as follows:
Decode $\theta$ from the first part of the message using $p(\theta)$.
Decode x from the second part of the message using $p(x \vert \theta)$.
Run the same algorithm as the sender used for computing the distribution $q(\theta \vert x)$ based on x.
Decode the secondary message from $\theta$ using $q(\theta \vert x)$.
Again, the whole point in devising the coding scheme is not the actual transmission of x, but learning about $\theta$. In this case the information about $\theta$ comes in the form of distribution $q(\theta \vert x)$.

Computing all the terms, it turns out that $\log \epsilon_\theta$appears in both $L(\theta)$ and $L(\theta \vert q)$, but since these terms appear with opposite signs, $\epsilon_\theta$ disappears from the equations. The remaining terms are

 \begin{displaymath}L(\theta) - L(\theta \vert q) = \int q(\theta \vert x) \log
\frac{q(\theta \vert x)}{p(\theta)} d\theta.
\end{displaymath} (20)

This is the number of bits actually effectively used for encoding $\theta$ in the first part of the message. The length of the second part of the message is

 \begin{displaymath}L(x \vert \theta) = - \int q(\theta \vert x) \log [p(x \vert \theta) \epsilon_x]
\end{displaymath} (21)

The total length of the message is thus

\begin{displaymath}L(\theta) - L(\theta \vert q) + L(x \vert \theta) = \int q(\t...
...}{p(x \vert \theta) p(\theta)} d\theta - \log
\epsilon_x \, ,
\end{displaymath} (22)

which differs from the cost function (17) used in ensemble learning only by the term $\log \epsilon_x$ which can be neglected as it does not depend on the approximation $q(\theta \vert x)$. The bits-back argument thus results in ensemble learning.

This derivation for ensemble learning is useful because it gives an intuitive meaning for the cost function used in ensemble learning. Different terms of the cost function can be interpreted as the number of bits used for encoding various parameters and the observations. It is then possible to see how many bits have been used for representing different parameters and draw conclusions about the importance of the parameter to the model.

At first glance, it might seem that minimum message length and bits-back coding are quite different, the former using finite accuracy $\epsilon_\theta$ and the latter a very high accuracy $\epsilon_\theta
\approx 0$. It turns out, however, that the minimum message length coding can be seen as a special case of bits-back coding where $q(\theta \vert x)$ is chosen to be a uniform distribution between $\theta_0
- \epsilon_\theta/2$ and $\theta_0 + \epsilon_\theta/2$, where $\epsilon_\theta$ is the finite accuracy used in the minimum message length scheme.

next up previous contents
Next: Specification of the model Up: Ensemble learning Previous: Cost function
Harri Valpola