next up previous
Next: EM and MAP Up: Ensemble learning Previous: Model selection in ensemble

Connection to coding

The cost function can be derived within MDL framework. The intuitive idea behind MDL is that in order to be able to encode data compactly one has to find a good model for it. Coding is not actually done; only the formula for the code length is needed. The optimal code length of x is $L(x) = -\log P(x)$ bits, and therefore coding has a close connection to the probabilistic framework. Some people prefer to think in terms of coding lengths, some in term of probabilities.

In coding, the sender and the receiver have to agree on the structure of the message in advance. This corresponds to having to state one's priors in the Bayesian framework.

A simple example: let x be the observation to be coded and y be a real valued parameter. We first encode y with accuracy dy. This part of the message takes $L(y) = -\log p(y)dy$ bits. Then we encode x with accuracy dx using the value y. This takes $L(x \vert y) =
-\log p(x \vert y) dx$ bits. The accuracy of the encoding of the observations can be agreed on in advance, but there is a question of how to decide the accuracy dy. The second part of the code would be shortest if y would be exactly in the maximum likelihood solution. If y is encoded in too high accuracy, however, the first part of the code will be very long. If y is encoded in too low accuracy, the first part will be short but deviations from the optimal y will increase the second part of the code.

The bits-back argument, [1], overcomes the problem by using infinitesimally small dy but picking the y from a distribution q(y) and encoding a secondary message in the choice of y. Since the distribution q(y) is not needed for decoding x, both the sender and the receiver can run the same algorithm for determining q(y) from x. After finding out q(y), the receiver can decode the secondary message which can have $-\log q(y) dy$ bits. As these bits will be got back in the end, the expected message length is

L(x) = $\displaystyle E_q \{-\log p(x\vert y)dx - \log p(y) dy + \log q(y) dy \}$  
  = $\displaystyle D(q(y) \vert\vert p(y \vert x)) - \log p(x) dx = C_y(x) - \log dx \, .$ (13)

The amount of bits used for individual parameter $\theta$ can be measured by looking at $D(q(\theta) \vert\vert p(\theta))$

It is interesting to notice that although the message has two parts, y and x, if q(y) is chosen to be p(y | x), the expected code length is equal to that of the optimal one-part message where the sender encodes x directly using the marginal probability $p(x) =
\int p(x \vert y) p(y) dy$.

next up previous
Next: EM and MAP Up: Ensemble learning Previous: Model selection in ensemble
Harri Lappalainen