Model selection

According to the marginalisation principle, the correct way to handle competing models in the Bayesian framework is to use all of them, weighting their results by their respective posterior probabilities. This is computationally demanding, so it may be desirable to use only one model even though this does not lead to optimal results. The rest of this section concentrates on finding suitable criteria for choosing a single ``best'' model.
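The difference between averaging over all models and committing to one of them can be sketched in a few lines. This is only an illustration: the function names `model_average` and `best_single_model`, as well as the predictions and posterior probabilities used below, are hypothetical and not taken from the text.

```python
def model_average(predictions, posterior_probs):
    """Weight each model's prediction by its posterior probability."""
    total = sum(posterior_probs)
    # Normalise in case the supplied probabilities do not sum exactly to one.
    weights = [p / total for p in posterior_probs]
    return sum(w * y for w, y in zip(weights, predictions))


def best_single_model(predictions, posterior_probs):
    """Return the prediction of the single most probable model."""
    best = max(range(len(posterior_probs)), key=lambda i: posterior_probs[i])
    return predictions[best]


# Hypothetical scalar predictions from three models and their
# (assumed known) posterior probabilities.
preds = [1.0, 2.0, 4.0]
probs = [0.2, 0.5, 0.3]

averaged = model_average(preds, probs)
selected = best_single_model(preds, probs)
print("averaged:", averaged, "selected:", selected)
```

The averaged prediction uses every model, whereas the selected one discards all but the most probable model, which is cheaper but in general not optimal.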

Occam's razor is an old scientific principle which states that when trying to explain some phenomenon, the simplest model that can adequately explain it should be chosen. There is no point in choosing an overly complex model when even a much simpler one would do. A very complex model will be able to fit the given data almost perfectly but it will not be able to generalise very well. On the other hand, very simple models will not be able to describe the essential features of the data. One must make a compromise and choose a model with enough but not too much complexity [57,10].

MacKay showed in [37] how this can be done in the Bayesian framework by evaluating and comparing model evidences. The evidence of a model $\mathcal{H}_i$ is defined as the probability $p(\boldsymbol{X}\vert \mathcal{H}_i)$ of the data given the model. This is just the scaling factor in the denominator of Bayes' theorem in Equation (3.1), and it can be evaluated as shown in Equation (3.2).

A true Bayesian comparison of the models would require using Bayes' theorem to get the posterior probability of each model as

$\displaystyle p(\mathcal{H}_i \vert \boldsymbol{X}) \propto p(\boldsymbol{X}\vert \mathcal{H}_i) p(\mathcal{H}_i).$

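(3.9)

When there is no reason to prefer any model a priori, the prior $p(\mathcal{H}_i)$ is the same for all models, the posterior probability is directly proportional to the evidence, and the models can be ranked by their evidences alone.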
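As a rough sketch of Equation (3.9) in practice, the snippet below turns log-evidences and model priors into normalised posterior model probabilities. The function name `model_posteriors` and the log-evidence values are assumptions made for illustration. Working in the log domain and subtracting the maximum before exponentiating is a common way to avoid numerical underflow, not something prescribed by the text.

```python
import math


def model_posteriors(log_evidences, priors=None):
    """Posterior model probabilities from log p(X|H_i) and priors p(H_i)."""
    n = len(log_evidences)
    if priors is None:
        # Assume a uniform prior over the models when none is given.
        priors = [1.0 / n] * n
    # log of evidence times prior, i.e. the unnormalised log posterior
    log_joint = [le + math.log(p) for le, p in zip(log_evidences, priors)]
    # Subtract the maximum before exponentiating to avoid underflow.
    m = max(log_joint)
    unnorm = [math.exp(v - m) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical log-evidences for three models.
log_ev = [-105.0, -100.0, -102.0]
post = model_posteriors(log_ev)
print(post)
```

Under the default uniform prior, the ordering of the posterior probabilities matches the ordering of the evidences.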

Figure 3.2 shows how the model evidence can be used to choose the right model.^3.2 In the figure, the horizontal axis ranges over all possible data sets. Each curve shows the evidence of one model as a function of the data set. As the distributions $p(\boldsymbol{X}\vert \mathcal{H}_i)$ are all normalised, the area under every curve is the same.

A simple model like $\mathcal{H}_1$ can only describe a small range of possible data sets. It gives high evidence for those but little or nothing for the rest. A very complex model like $\mathcal{H}_2$ can describe a much larger variety of data sets. It therefore has to spread its predictions more thinly than model $\mathcal{H}_1$ and gives lower evidence for simple data sets. For a data set like $\boldsymbol{X}_1$, lying in the middle of the range, both extremes lose to model $\mathcal{H}_3$, which is just complex enough for that data and is therefore the model called for by Occam's razor.

**Figure 3.2:** How model evidence embodies Occam's razor [37]. For a given data set $\boldsymbol{X}_1$, the model with just the right complexity will have the highest evidence.
$\includegraphics[width=.6\textwidth]{pics/evidence}$
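The trade-off described above can be reproduced with a toy example that is not part of the original text. Assume the data are sequences of $n$ coin flips and compare two hypothetical models: a simple one that fixes the probability of heads at 0.5, and a flexible one that places a uniform prior on the unknown probability of heads. Both evidences are normalised over all possible sequences, so the flexible model must give less evidence to balanced sequences in order to cover the unbalanced ones.

```python
from math import comb, factorial


def evidence_fixed(n, k):
    """Evidence of one sequence with k heads in n flips under a fair coin."""
    return 0.5 ** n


def evidence_flexible(n, k):
    """Evidence of the same sequence when the heads probability theta has a
    uniform prior: the integral of theta^k (1 - theta)^(n - k) over [0, 1],
    which equals k! (n - k)! / (n + 1)!."""
    return factorial(k) * factorial(n - k) / factorial(n + 1)


n = 10
for k in range(n + 1):
    simple = evidence_fixed(n, k)
    flexible = evidence_flexible(n, k)
    winner = "simple" if simple > flexible else "flexible"
    print(f"k={k:2d}  simple={simple:.6f}  flexible={flexible:.6f}  -> {winner}")

# Both evidences sum to one over all 2^n sequences; there are comb(n, k)
# distinct sequences with exactly k heads.
total_fixed = sum(comb(n, k) * evidence_fixed(n, k) for k in range(n + 1))
total_flexible = sum(comb(n, k) * evidence_flexible(n, k) for k in range(n + 1))
print("totals:", total_fixed, total_flexible)
```

In this sketch the simple model has the higher evidence for roughly balanced sequences, while the flexible model has the higher evidence for strongly unbalanced ones, mirroring the behaviour of $\mathcal{H}_1$ and $\mathcal{H}_2$ in the figure.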

From this point on, the explicit references to the model $\mathcal{H}$ in expressions for different probabilities are omitted. This is done purely to simplify the notation. It should be noted that in the Bayesian framework, all probabilities are conditional on some assumptions because they are always subjective.