
The basic rules of Bayesian probability theory are the sum rule and the product rule:

P(A | C) + P(¬A | C) = 1

P(AB | C) = P(A | C) P(B | AC)

Here C stands for all the background assumptions. Often C is left implicit:

P(A) + P(¬A) = 1

P(AB) = P(B) P(A | B)

If B_{1}, ..., B_{n} are n mutually exclusive and exhaustive explanations for A, the marginalisation principle can be derived from the sum and product rules:

P(A) = P(AB_{1}) + ... + P(AB_{n}) = P(A | B_{1}) P(B_{1}) + ... + P(A | B_{n}) P(B_{n})

In other words, the probability of A is obtained by going through all
the possible explanations for A.

Bayes' rule tells how the probabilities of the hypotheses
(explanations) change when A is observed:

P(B_{i} | A) = P(B_{i}) P(A | B_{i}) / P(A)

The probabilities of the explanations which agree well with the
observation A are increased.
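
As a small numerical illustration of the marginalisation principle and Bayes' rule, the following sketch computes P(A) and the posterior probabilities P(B_{i} | A) for three explanations. The function name `posterior` and all the numbers are hypothetical and chosen only for this example; exact fractions are used so that the results can be checked by hand.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Return ([P(B_i | A) for each i], P(A)).

    priors[i] is P(B_i) and likelihoods[i] is P(A | B_i).
    The explanations B_i are assumed to be mutually exclusive and exhaustive.
    """
    # Marginalisation: P(A) = sum_i P(A | B_i) P(B_i)
    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    # Bayes' rule: P(B_i | A) = P(B_i) P(A | B_i) / P(A)
    post = [p * l / evidence for p, l in zip(priors, likelihoods)]
    return post, evidence


# Hypothetical numbers: three explanations with equal prior probabilities.
priors = [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]
# How well each explanation agrees with the observation A.
likelihoods = [Fraction(9, 10), Fraction(1, 2), Fraction(1, 10)]

post, evidence = posterior(priors, likelihoods)
print("P(A) =", evidence)
print("P(B_i | A) =", post)
```

The explanation that agrees best with A gains probability at the expense of the others, and the posterior probabilities still sum to one.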

Classical deductive logic deals with inference from rules using binary yes/no truth values. Bayesian probability theory can be derived from axioms which describe inference with uncertain truth values. Bayesian probability can thus be seen as an extension of classical logic to uncertain truth values.

This extension also makes it possible to describe inductive logic: a set of hypotheses about possible worlds is chosen as premisses, and the observations then support some hypotheses and count against others. Inductiveness is thus embedded in the premisses and the actual inference is deductive. A logic capable of inductive inference must be able to represent uncertainty, since the observations usually do not confirm or reject any hypothesis completely but only support it or count against it to some extent.

p(Mw | DI) = p(Mw | I) p(D | MwI) / p(D | I)

This posterior density of the model structure M and parameters w, given the data D and background information I, can be used for predicting new quantities by means of the marginalisation principle. With neural networks, often only one structure is selected, but according to Bayesian probability theory the right way is to use all structures and parameters, weighted by their posterior density.

The more complex a model, the better it can represent the data. Still, it is not reasonable to choose an overly complicated model. Assume, for instance, that we have observed the data D. An extreme example of a complicated model is one that simply states that the observation is D. It explains the observation completely but cannot be used for anything else.

Intuitively it is clear that a simpler explanation is better
than a more complex one if both explain the observations
equally well. The information-theoretically motivated method *Minimum
Message Length* (MML) expresses this exactly. It deals with the
length of the description of the data when using a model. First the model
is described and then the discrepancy between the model and the data.
Their total description length is then minimised.
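
The idea of a two-part description can be sketched as follows. This is only a simplified illustration and not the full MML method: the cost of describing each model, in bits, is a made-up number, and the cost of describing the data given the model is taken to be -log2 of the probability the model assigns to the data. The "memorise" model corresponds to the extreme model that simply states the observation: it needs no bits for the discrepancy, but its own description must contain all of the data.

```python
import math


def data_bits(p_head, flips):
    """Bits needed to encode the flips (1 = head, 0 = tail) under a model
    that treats them as independent with head probability p_head."""
    return sum(-math.log2(p_head if x == 1 else 1.0 - p_head) for x in flips)


# Hypothetical data: 14 heads and 2 tails.
flips = [1] * 14 + [0] * 2

# (bits to describe the model, bits to describe the data given the model).
# The model costs are assumptions made up for this example.
candidates = {
    "fair": (2.0, data_bits(0.5, flips)),
    "biased": (5.0, data_bits(0.875, flips)),
    # The memorising model spells out every flip inside the model itself.
    "memorise": (2.0 + len(flips), 0.0),
}

total_bits = {name: m + d for name, (m, d) in candidates.items()}
best = min(total_bits, key=total_bits.get)

for name, bits in total_bits.items():
    print(f"{name}: {bits:.2f} bits")
print("shortest total description:", best)
```

In this toy setting the moderately complex model gives the shortest total description: the fair coin is cheap to describe but explains the data poorly, while the memorising model explains the data perfectly but is as expensive to describe as the data itself.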

The minimisation of Kullback-Leibler information yields, as a special case, practically the same formula as MML.


Updated 13.10.1998

Harri Lappalainen