
The Bayes rule and the marginalisation principle

The Bayes rule was formulated by the Reverend Thomas Bayes in the 18th century (Bayes, 1958). It can be derived from very basic axioms (Cox, 1946). The Bayes rule describes how to update one's beliefs upon receiving new information. In the following, $\mathcal{H}$ stands for the assumed model, $\boldsymbol{X}$ for the observations (or data), and $\boldsymbol{\Theta}$ for the unknown variables. $p(\boldsymbol{\Theta}\mid\mathcal{H})$ is the prior distribution, that is, the distribution of the unknown variables before making the observation. The posterior distribution is

$\displaystyle p(\boldsymbol{\Theta}\mid \boldsymbol{X},\mathcal{H}) = \frac{p(\boldsymbol{X}\mid \mathcal{H},\boldsymbol{\Theta})p(\boldsymbol{\Theta}\mid\mathcal{H})}{p(\boldsymbol{X}\mid \mathcal{H})}.$

(2.1)

The term $p(\boldsymbol{X}\mid \mathcal{H},\boldsymbol{\Theta})$ is called the likelihood of the unknown variables given the data and the term $p(\boldsymbol{X}\mid \mathcal{H})$ is called the evidence (or marginal likelihood) of the model.
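
As an illustration, the Bayes rule (2.1) can be evaluated numerically when $\boldsymbol{\Theta}$ takes only a few values. The following is a minimal sketch, not part of the original text: the coin-tossing setting, the binomial likelihood, the candidate values, and the name `posterior` are all assumptions chosen for the example.

```python
from math import comb

def posterior(thetas, prior, heads, tosses):
    """Apply the Bayes rule (2.1) over a finite set of candidate values of Theta."""
    # Likelihood p(X | H, Theta): binomial probability of the observed counts
    # (the binomial model plays the role of H; it is an assumption of this sketch).
    likelihood = [
        comb(tosses, heads) * t**heads * (1 - t) ** (tosses - heads)
        for t in thetas
    ]
    # Numerator of (2.1): likelihood times prior.
    joint = [l * p for l, p in zip(likelihood, prior)]
    # Evidence p(X | H): the numerator summed over all candidate values.
    evidence = sum(joint)
    return [j / evidence for j in joint], evidence

# Hypothetical candidate values for the probability of heads, with a uniform prior.
thetas = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]

# Hypothetical observation X: 3 heads in 4 tosses.
post, evidence = posterior(thetas, prior, heads=3, tosses=4)
for t, p in zip(thetas, post):
    print(f"Theta = {t}: posterior = {p:.4f}")
```

With these made-up numbers the posterior shifts towards the larger values of $\boldsymbol{\Theta}$, because they explain the observed heads better than the smaller ones.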

The marginalisation principle specifies how a learning system can predict or generalise. The probability of observing $A$ with prior knowledge of $\boldsymbol{X},\mathcal{H}$ is

$\displaystyle p(A \mid \boldsymbol{X},\mathcal{H}) = \int p(A \mid \boldsymbol{\Theta}, \boldsymbol{X},\mathcal{H}) p(\boldsymbol{\Theta}\mid \boldsymbol{X},\mathcal{H}) d\boldsymbol{\Theta}.$

(2.2)

That is, the probability of observing $A$ is obtained by summing or integrating over all the different explanations $\boldsymbol{\Theta}$. The term $p(A \mid \boldsymbol{\Theta}, \boldsymbol{X},\mathcal{H})$ is the probability of $A$ given a particular explanation $\boldsymbol{\Theta}$, and it is weighted by the probability of that explanation, $p(\boldsymbol{\Theta}\mid \boldsymbol{X},\mathcal{H})$.
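
When $\boldsymbol{\Theta}$ takes only finitely many values, the integral in (2.2) reduces to a weighted sum. The sketch below shows this under stated assumptions: the candidate values, the posterior weights, and the function name `predictive` are hypothetical, and $A$ is assumed to be independent of $\boldsymbol{X}$ once $\boldsymbol{\Theta}$ is known.

```python
def predictive(p_a_given_theta, posterior):
    """Discrete form of (2.2): sum over explanations, weighted by the posterior."""
    return sum(a * w for a, w in zip(p_a_given_theta, posterior))

# Hypothetical candidate values of Theta (probability of heads) and
# hypothetical posterior weights p(Theta | X, H); both are made up for illustration.
thetas = [0.2, 0.5, 0.8]
posterior = [0.1, 0.3, 0.6]

# A = "the next toss is heads", so p(A | Theta, X, H) = Theta here
# (assumption: A does not depend on X once Theta is known).
p_heads_next = predictive(thetas, posterior)
print(p_heads_next)
```

The prediction is an average over all explanations rather than the prediction of the single most probable one, so it always lies between the smallest and the largest value of $p(A \mid \boldsymbol{\Theta}, \boldsymbol{X},\mathcal{H})$.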

Using the marginalisation principle, the evidence term can be written as

$\displaystyle p(\boldsymbol{X}\mid \mathcal{H}) = \int p(\boldsymbol{X}\mid \boldsymbol{\Theta}, \mathcal{H}) p(\boldsymbol{\Theta}\mid \mathcal{H}) d\boldsymbol{\Theta}.$

(2.3)

This emphasises the role of the evidence term as a normalisation coefficient: it is the integral of the numerator of the Bayes rule (2.1) over $\boldsymbol{\Theta}$. Sometimes the integral cannot be computed exactly, but fortunately this is not always necessary. For example, when comparing the posterior probabilities of different instantiations of the hidden variables, the evidence does not depend on $\boldsymbol{\Theta}$ and therefore cancels out.
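
To spell out the cancellation, consider two instantiations, here called $\boldsymbol{\Theta}_1$ and $\boldsymbol{\Theta}_2$ (labels introduced only for this illustration). Dividing the Bayes rule (2.1) for one by the Bayes rule for the other gives

$\displaystyle \frac{p(\boldsymbol{\Theta}_1\mid \boldsymbol{X},\mathcal{H})}{p(\boldsymbol{\Theta}_2\mid \boldsymbol{X},\mathcal{H})} = \frac{p(\boldsymbol{X}\mid \mathcal{H},\boldsymbol{\Theta}_1)p(\boldsymbol{\Theta}_1\mid\mathcal{H})}{p(\boldsymbol{X}\mid \mathcal{H},\boldsymbol{\Theta}_2)p(\boldsymbol{\Theta}_2\mid\mathcal{H})},$

where the common factor $p(\boldsymbol{X}\mid \mathcal{H})$ has dropped out, so the comparison needs only the likelihoods and the priors.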


Tapani Raiko 2006-11-21