Information theory studies communication and the information content
of random variables. Its foundations were laid by Claude Shannon, who
studied communication over noisy channels and showed that the
information content of observing a discrete random variable *X* taking
the value *x* is -log_{2} *P*(*X* = *x*) bits [117]. According to
Shannon's source coding theorem, this is the number of bits needed for
coding the observation under an optimal code. Later, Rissanen and
Langdon developed arithmetic coding, a practical algorithm for
attaining code lengths arbitrarily close to this optimum
[108,109].
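
As a minimal illustration of this relationship, assuming a small hand-picked distribution over four symbols (invented purely for the example), the sketch below computes the information content -log_{2} *P*(*x*) of each outcome and the entropy of the distribution, which is the expected number of bits per symbol under optimal coding.

```python
import math

# A toy distribution over four symbols, invented only for illustration.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Shannon information content of observing each symbol: -log2 P(x) bits.
info = {x: -math.log2(px) for x, px in p.items()}
print(info)  # {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 3.0}

# The expected code length under optimal coding equals the entropy.
entropy = sum(-px * math.log2(px) for px in p.values())
print(entropy)  # 1.75 bits per symbol on average
```

Arithmetic coding is one practical way of approaching these ideal code lengths.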

Solomonoff [119,120], Kolmogorov [67] and Chaitin [15] independently developed algorithmic information theory, which is based on the idea that the complexity of a binary string is equated with the length of the shortest computer programme that produces the string. This measure, known as Kolmogorov complexity, is not computable because of the so-called halting problem [41]. By taking into account the time it takes to run a programme, Kolmogorov complexity can be modified so that it becomes computable. The resulting measure is known as Levin complexity [77]. Good introductions to coding and complexity can be found, for instance, in [78,18].

In principle, Levin complexity can be used for learning, as demonstrated in [116]. However, the model space, which includes all computer programmes, lacks the structure that would help in developing efficient learning algorithms. This means that although the method is applicable in principle, it cannot learn complex models in practice.

From a Bayesian point of view, a complexity measure can be interpreted
as a prior over binary strings. By using Shannon's formula in the
other direction, the length *L*(*C*) of a computer programme *C* can be
interpreted to define a prior over programmes as
*P*(*C*) = 2^{-L(C)}.
Each programme assigns probability
*P*(*S* | *C*) = 1 to the string *S* that it generates. It is then clear that when the first *n* bits *S*_{n} of a string are observed, the
model can be used to predict the continuation of the string using Bayes' rule and
the marginalisation principle.
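
The following sketch shows this prediction scheme at a toy scale, with the obvious caveat that enumerating real programmes is infeasible: the three "programmes" below, their bit lengths and their outputs, are hypothetical stand-ins. A programme contributes to the prediction only if its output is consistent with the observed prefix, so that *P*(*S*_{n} | *C*) = 1, and the consistent programmes are weighted by the prior 2^{-L(C)}.

```python
# Hypothetical "programmes", each summarised by its length in bits and the
# (truncated) string it generates; these stand-ins are for illustration only.
programmes = [
    {"length": 3, "output": "0101010101"},   # e.g. a short "repeat 01" programme
    {"length": 5, "output": "0100100010"},
    {"length": 7, "output": "0101111111"},
]

def predict_next_bit(observed: str) -> float:
    """Posterior-weighted probability that the bit after `observed` is '1'."""
    # Likelihood P(S_n | C) is 1 if C's output begins with S_n and 0 otherwise,
    # so only consistent programmes remain; the prior is P(C) = 2^(-L(C)).
    consistent = [c for c in programmes if c["output"].startswith(observed)]
    weights = [2.0 ** -c["length"] for c in consistent]
    p_one = sum(
        w for w, c in zip(weights, consistent)
        if c["output"][len(observed)] == "1"
    ) / sum(weights)
    return p_one

print(predict_next_bit("0101"))  # roughly 0.06, dominated by the shortest programme
```

The shortest consistent programme dominates the prediction, but the longer consistent programmes still contribute their share of the posterior mass.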

The marginalisation principle states that the prediction should use all
programmes *C* weighted by their posterior probabilities. However,
developments originating from the coding tradition often assume that
only the shortest programme *C* is used. An example of this is the
stochastic complexity developed by Rissanen [107], which measures
complexity using ordinary parameterised statistical models in place of
computer programmes. With simple models, at least, this leads to
practical algorithms for finding the ``shortest'' model which accounts
for the observations. Stochastic complexity extends ML estimation so
that different model structures can be dealt with, but like ML
estimation, it yields only point estimates.
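
The difference between full marginalisation and the point-estimate shortcut can be sketched in the same toy setting as above; the candidate code lengths and next-bit predictions below are again invented for illustration.

```python
# Hypothetical programmes consistent with the observed prefix, summarised by
# (code length in bits, predicted next bit); values are illustrative only.
candidates = [(3, "0"), (7, "1"), (9, "1")]

# Full marginalisation: every candidate contributes with weight 2^(-L(C)).
weights = [2.0 ** -length for length, _ in candidates]
p_one = sum(w for w, (_, bit) in zip(weights, candidates) if bit == "1") / sum(weights)
print(p_one)  # roughly 0.07: a soft prediction that preserves uncertainty

# Point-estimate shortcut: keep only the shortest programme, as in the coding
# tradition; the prediction collapses to a hard 0/1 answer.
shortest = min(candidates, key=lambda c: c[0])
print(1.0 if shortest[1] == "1" else 0.0)  # 0.0
```

This collapse to a single model is the sense in which stochastic complexity, like ML estimation, yields only point estimates.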