Information theory studies communication and the information content of random variables. Its foundations were laid by Claude Shannon, who studied communication over noisy channels and was able to show that the information content of observing a discrete random variable $X$ taking the value $x$ is $-\log_2 P(X = x)$ bits. According to Shannon's coding theory, this is the number of bits needed for coding the observation using optimal coding. Later Rissanen and Langdon developed arithmetic coding, a practical algorithm for attaining a code which is arbitrarily close to optimal [108,109].
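As a small illustration (the function names here are our own, not from the text), Shannon's formula and the resulting expected code length can be computed directly:

```python
import math

def information_content(p):
    """Shannon information content, in bits, of observing an
    outcome that has probability p."""
    return -math.log2(p)

# A fair coin flip carries exactly one bit of information.
print(information_content(0.5))  # 1.0

# The entropy of a distribution is the expected information
# content, i.e. the average optimal code length per symbol.
probs = [0.5, 0.25, 0.25]
entropy = sum(p * information_content(p) for p in probs)
print(entropy)  # 1.5
```

An optimal code assigns short codewords to probable outcomes; arithmetic coding approaches this bound without needing integer codeword lengths per symbol.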
Solomonoff [119,120], Kolmogorov and Chaitin independently developed algorithmic information theory, which is based on the idea that the complexity of a binary string is equated with the length of the shortest computer programme which produces that string. This measure, known as Kolmogorov complexity, is not computable because of the so-called halting problem. By also taking into account the time it takes to run a programme, Kolmogorov complexity can be modified so that it becomes computable. This is known as Levin complexity. Good introductions to coding and complexity can be found, for instance, in [78,18].
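As a sketch (the notation is ours, not from the text), the two measures can be contrasted for a universal machine $U$, a programme $p$ of length $|p|$, and the running time $t(p)$ of $p$:

```latex
K(x)  = \min_{p} \,\bigl\{\, |p| \;:\; U(p) = x \,\bigr\}
\qquad\text{(Kolmogorov complexity)}
\\[1ex]
Kt(x) = \min_{p} \,\bigl\{\, |p| + \log_2 t(p) \;:\; U(p) = x \,\bigr\}
\qquad\text{(Levin complexity)}
```

The additive $\log_2 t(p)$ term penalises slow programmes, which makes it possible to enumerate candidates in order of increasing $Kt$ and thus renders the measure computable.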
In principle, Levin complexity can be used for learning. However, the model space, which includes all computer programmes, lacks the structure that would help in developing efficient learning algorithms. This means that although applicable in principle, the method cannot learn complex models in practice.
From a Bayesian point of view, a complexity measure can be interpreted as a prior over binary strings. By using Shannon's formula in the other direction, the lengths $L(C)$ of the computer programmes $C$ can be interpreted to define a prior over programmes as $P(C) = 2^{-L(C)}$. Each programme gives a probability $P(S \mid C) = 1$ for the string $S$ that the programme $C$ generates. It is then clear that when the $n$ first bits $S_n$ of a string are observed, it is possible to use the model to predict the continuation of the string using Bayes' rule and the marginalisation principle.
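The construction above can be sketched with a toy model space (our own construction, not from the text): each "programme" is a finite repeating pattern that deterministically generates an infinite string, the prior is $2^{-L(C)}$ over the description length, and the next bit is predicted by Bayes' rule:

```python
from fractions import Fraction

# Toy "programmes": finite descriptions (repeating patterns) that
# each generate one infinite binary string deterministically.
programmes = ["0", "1", "01", "10", "01011"]

def output(prog, n):
    """First n bits of the string generated by the programme."""
    return (prog * n)[:n]

def prior(prog):
    """Prior P(C) = 2^{-L(C)} from the description length L(C)."""
    return Fraction(1, 2 ** len(prog))

def predict_next(observed):
    """Posterior-weighted (marginalised) probability that the
    next bit is 1, given the observed prefix S_n."""
    n = len(observed)
    # P(S_n | C) is 1 if the programme reproduces the observed
    # bits and 0 otherwise, so the posterior is the renormalised
    # prior over the consistent programmes.
    consistent = [p for p in programmes if output(p, n) == observed]
    z = sum(prior(p) for p in consistent)
    return sum(prior(p) for p in consistent
               if output(p, n + 1)[-1] == "1") / z

# Both "01" (length 2) and "01011" (length 5) explain "0101",
# but the shorter programme dominates the posterior.
print(predict_next("0101"))  # 1/9
```

The shorter consistent programme predicts 0 as the next bit and the longer one predicts 1; marginalisation weights them by $2^{-2}$ and $2^{-5}$ respectively, giving probability $1/9$ to a 1.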
The marginalisation principle states that the prediction should use all
programmes C weighted by their posterior probabilities. However,
developments originating from the coding tradition often assume using
only the shortest programme C. An example of this is the stochastic
complexity developed by Rissanen. It replaces
the computer programmes with ordinary parameterised statistical models in
measuring the complexity. With simple models, at least, this leads to
practical algorithms for finding the ``shortest'' model which explains
the observations. Stochastic complexity extends ML estimation so that
different model structures can be dealt with, but like ML estimation,
it yields only point estimates.
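The difference between marginalisation and a point estimate can be made concrete with a toy comparison (our own construction, not from the text): two models that both explain the data, one with a shorter description than the other, and the predictions each approach gives.

```python
from fractions import Fraction

# Each model: (description length in bits, predicted P(next bit = 1)).
# Both models fit the observed data exactly, so the posterior is
# proportional to the prior 2^{-L}.
models = [(2, Fraction(0)), (5, Fraction(1))]

z = sum(Fraction(1, 2 ** L) for L, _ in models)

# Marginalised (fully Bayesian) prediction: average the models'
# predictions, weighted by their posterior probabilities.
marginal = sum(Fraction(1, 2 ** L) * p for L, p in models) / z

# Point-estimate prediction: keep only the shortest model and
# discard the rest, as in shortest-programme style approaches.
shortest = min(models)[1]

print(marginal)   # 1/9 -- the longer model still contributes
print(shortest)   # 0   -- the alternative is ignored entirely
```

The point estimate throws away the uncertainty about the model, whereas the marginalised prediction retains a small probability mass on the longer model's prediction.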