
Coding and complexity

Information theory studies communication and the information content of random variables. Its foundations were laid by Claude Shannon, who studied communication over noisy channels and showed that the information content of observing a discrete random variable X is $L(X) = -\log_2 P(X)$ bits [117]. According to Shannon's coding theory, this is the number of bits needed to encode the observation under optimal coding. Later, Rissanen and Langdon developed arithmetic coding, a practical algorithm which attains a code length arbitrarily close to this optimum [108,109].
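
As a minimal illustration of Shannon's formula, the following Python sketch computes the optimal code lengths $-\log_2 P(X)$ for a small made-up symbol distribution, together with their expectation, the entropy, which is the lower bound that arithmetic coding can approach. The symbol probabilities are invented for the example.

import math

def shannon_code_length(p):
    # Optimal code length in bits for an outcome of probability p: -log2 p.
    return -math.log2(p)

# A hypothetical four-symbol source; the probabilities are made up.
probabilities = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

for symbol, p in probabilities.items():
    print(f"symbol {symbol}: P = {p:<6} -> {shannon_code_length(p):.1f} bits")

# The expected code length is the entropy of the source, the limit that
# optimal (e.g. arithmetic) coding can approach.
entropy = sum(p * shannon_code_length(p) for p in probabilities.values())
print(f"entropy = {entropy:.2f} bits per symbol")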

Solomonoff [119,120], Kolmogorov [67] and Chaitin [15] independently developed algorithmic information theory, which is based on the idea that the complexity of a binary string is equated with the length of the shortest computer programme that produces the string. This measure, known as Kolmogorov complexity, is not computable because of the so-called halting problem [41]. By also taking into account the time it takes to run a programme, Kolmogorov complexity can be modified into a computable measure known as Levin complexity [77]. Good introductions to coding and complexity can be found, for instance, in [78,18].

In principle, Levin complexity can be used for learning, as demonstrated in [116]. However, the model space, which includes all computer programmes, lacks the structure that would help in developing efficient learning algorithms. This means that although the method is applicable in principle, it cannot learn complex models in practice.

From a Bayesian point of view, a complexity measure can be interpreted as a prior over binary strings. By using Shannon's formula in the other direction, the lengths L(C) of the computer programmes C can be interpreted as defining a prior over programmes, $P(C) = 2^{-L(C)}$. Each programme assigns probability P(S | C) = 1 to the string S that the programme C generates. It is then clear that when the first n bits $S_n$ of a string are observed, the model can be used to predict the continuation of the string using Bayes' rule and the marginalisation principle.
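
The following Python sketch illustrates this kind of prediction under simplifying assumptions: the "programmes", their description lengths and the strings they generate are all invented for the example, and only a handful of programmes are enumerated instead of all of them. Each programme consistent with the observed prefix contributes to the prediction with weight $2^{-L(C)}$.

# Toy prediction with a prior P(C) = 2^(-L(C)) over a few invented "programmes".
# Each programme deterministically generates one binary string, so P(S_n | C)
# is 1 if the programme's output starts with the observed prefix and 0 otherwise.

programmes = {
    # name: (description length L(C) in bits, i-th bit of the generated string)
    "all zeros":    (3, lambda i: "0"),
    "all ones":     (3, lambda i: "1"),
    "alternating":  (5, lambda i: "01"[i % 2]),
    "zeros then 1": (8, lambda i: "1" if i == 7 else "0"),
}

def predict_next_bit(observed):
    # P(next bit = 1 | observed) by Bayes' rule, marginalised over the
    # programmes consistent with the observation.
    n = len(observed)
    weights = {}
    for name, (length, bit) in programmes.items():
        output = "".join(bit(i) for i in range(n + 1))
        if output[:n] == observed:            # P(S_n | C) = 1
            weights[name] = 2.0 ** (-length)  # prior P(C) = 2^(-L(C))
    total = sum(weights.values())
    return sum(w for name, w in weights.items()
               if programmes[name][1](n) == "1") / total

# After seven observed zeros, the short "all zeros" programme dominates.
print(predict_next_bit("0000000"))   # approximately 0.03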

The marginalisation principle states that the prediction should use all programmes C weighted by their posterior probabilities. However, developments originating in the coding tradition often assume that only the shortest programme C is used. An example of this is the stochastic complexity developed by Rissanen [107]. It replaces the computer programmes by ordinary parameterised statistical models when measuring the complexity. With simple models, at least, this leads to practical algorithms for finding the ``shortest'' model which explains the observations. Stochastic complexity extends ML estimation so that different model structures can be compared, but like ML estimation, it yields only point estimates.
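
As a rough illustration of this idea, the sketch below selects a polynomial model structure by minimising a two-part code length, using the familiar asymptotic approximation to stochastic complexity, $-\log P(\mathrm{data} \mid \hat{\theta}) + \frac{k}{2}\log n$ for a model with k parameters. The data are synthetic, and the modelling details (Gaussian noise, polynomial regression) are chosen only for the example.

# Model-structure selection with an approximate two-part code length
# (the asymptotic form -log P(data | ML parameters) + (k/2) log n).
# The data below are synthetic; the true polynomial degree is 2.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = np.linspace(-1.0, 1.0, n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.1, size=n)

def code_length(degree):
    # Approximate stochastic complexity (in bits) of a polynomial model.
    coeffs = np.polyfit(x, y, degree)            # ML estimate of the coefficients
    residuals = y - np.polyval(coeffs, x)
    sigma2 = residuals.var()                     # ML estimate of the noise variance
    neg_log_lik = 0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    k = degree + 2                               # coefficients plus noise variance
    return neg_log_lik / np.log(2) + 0.5 * k * np.log2(n)

for degree in range(6):
    print(f"degree {degree}: {code_length(degree):.1f} bits")
# The shortest description should be attained near the true degree 2.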

