Information theory offers a simple, intuitive point of view on learning. If we succeed in finding a very simple description for the observations, the argument goes, then we must have found interesting structure in the data. Description length and probability are tightly linked: according to Shannon's coding theory, the shortest expected description length for a proposition equals the negative logarithm of the probability of the proposition. Viewed like this, the information-theoretic approach to learning is nothing other than using a different scale for measuring beliefs, and any learning method derived in the information-theoretic context can be readily translated into the Bayesian context by a simple change of scale. This is not to say that information theory lacks an independent justification in coding theory.
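The link between description length and probability can be illustrated with a toy distribution (the four-symbol alphabet below is an illustrative choice, not from the text): an optimal code assigns each symbol a codeword of about -log2(p) bits, and the expected description length equals the entropy of the distribution.

```python
import math

# Toy distribution over four symbols. Shannon's coding theory says the
# optimal codeword length for a symbol with probability p is -log2(p) bits.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

for symbol, p in probs.items():
    print(f"{symbol}: {-math.log2(p):.0f} bits")

# The expected description length is the entropy of the distribution.
entropy = -sum(p * math.log2(p) for p in probs.values())
print("expected length:", entropy, "bits")  # 1.75 bits
```

Note that a low-probability symbol gets a long codeword; finding a model under which the data are probable is exactly finding a model under which the data have a short description.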

Concepts from the Bayesian framework often have intuitive interpretations in the coding context. Prior probabilities, for instance, translate into the specification of a coding scheme: optimal encoding of an observation produces a seemingly random string of ones and zeros, so the decoder needs some prior knowledge, namely the instructions for decoding. This corresponds to the Bayesian prior. Similarly, slight approximations to exact Bayesian learning translate into slightly suboptimal coding schemes. Ensemble learning is an example of a Bayesian approximation scheme whose roots are in coding schemes.

It may be that much of the success of approximation schemes first derived in the information-theoretic context is due to the widespread misuse of probability densities and MAP estimates. In the coding context it is easier to see that what matters is probability mass, not probability density: in order to measure the number of bits needed for coding a real-valued variable, the precision of the coding has to be specified. This corresponds to specifying a volume around a point in the space and thus determines a probability mass.
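A minimal sketch of this point, assuming a standard Gaussian model and an arbitrary quantisation precision eps (both illustrative choices): the code length for a real-valued observation is the negative log of a probability mass, density times precision, so the density alone never determines a bit count.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Gaussian; an illustrative model choice."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def code_length_bits(x, eps):
    """Bits to encode x at precision eps: -log2 of the probability MASS
    of an interval of width eps around x, approximated as density * eps."""
    mass = gaussian_pdf(x) * eps
    return -math.log2(mass)

# Halving the precision interval costs exactly one extra bit,
# whatever the density value at x happens to be.
print(code_length_bits(0.5, 0.01))
print(code_length_bits(0.5, 0.005))  # one bit more than the line above
```

The constant one-bit cost of halving eps is the coding-context counterpart of the fact that a density is only meaningful relative to a chosen volume element.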