According to the marginalisation principle, the correct way to compare different models in the Bayesian framework is to use all of them, weighting their results by the respective posterior probabilities. This approach is computationally demanding, however, and it may be desirable to use only one model even if that does not lead to optimal results. The rest of this section concentrates on finding suitable criteria for choosing a single ``best'' model.
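To make the marginalisation principle concrete, the following sketch averages predictions with posterior model probabilities. It is a minimal illustration, not from the original text: the evidence values, the equal model priors and the per-model predictions are all made-up placeholders.

```python
import numpy as np

# Log-evidences P(D | H_i) for three hypothetical models
# (placeholder values, e.g. produced by the methods of this chapter).
log_evidence = np.array([-104.2, -101.7, -103.5])

# Assumed equal model priors P(H_i); the posterior P(H_i | D) is then
# proportional to the evidence alone.
log_prior = np.log(np.full(3, 1.0 / 3.0))

# Posterior model probabilities via Bayes' theorem, normalised in
# log space for numerical stability.
log_post = log_evidence + log_prior
log_post -= log_post.max()
post = np.exp(log_post)
post /= post.sum()

# Hypothetical per-model predictions of some quantity of interest.
predictions = np.array([0.31, 0.42, 0.38])

# Marginalisation: average the predictions, weighted by the
# posterior model probabilities.
bma_prediction = float(np.dot(post, predictions))
print(post, bma_prediction)
```

Choosing a single model, as discussed below, amounts to replacing this weighted average with the single term that has the largest weight.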
Occam's razor is an old scientific principle stating that when trying to explain some phenomenon, the simplest model that can adequately explain it should be chosen. There is no point in choosing an overly complex model when a much simpler one would do. A very complex model can fit the given data almost perfectly, but it will not generalise well. On the other hand, a very simple model cannot describe the essential features of the data. One must therefore compromise and choose a model that is complex enough, but not more [57,10].
MacKay showed in [37] how this can be done in the Bayesian framework by evaluating and comparing model evidences. The evidence of a model $\mathcal{H}$ is defined as the probability $P(D \mid \mathcal{H})$ of the data given the model. This is just the scaling factor in the denominator of Bayes' theorem in Equation (3.1), and it can be evaluated as shown in Equation (3.2).
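For convenience, the evidence can also be written out as a marginalisation over the model parameters. The notation below ($D$ for the data, $\boldsymbol{\theta}$ for the parameters of model $\mathcal{H}$) is assumed here and may differ slightly from the symbols used in Equation (3.2):
$$
P(D \mid \mathcal{H}) = \int P(D \mid \boldsymbol{\theta}, \mathcal{H})\, P(\boldsymbol{\theta} \mid \mathcal{H})\, d\boldsymbol{\theta}.
$$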
True Bayesian comparison of the models would require using Bayes' theorem to get the posterior probability of the model as
$$
P(\mathcal{H} \mid D) = \frac{P(D \mid \mathcal{H})\, P(\mathcal{H})}{P(D)} \propto P(D \mid \mathcal{H})\, P(\mathcal{H}).
$$
When the prior probabilities $P(\mathcal{H})$ of the candidate models are assumed to be equal, the comparison reduces to comparing the evidences alone.
Figure 3.2 shows how the model evidence can be used to choose the right model. In the figure, the horizontal axis ranges over all the possible data sets. The curves show the values of the evidence for different models and different data sets. As the distributions are all normalised, the areas under the curves are equal.

A simple model like $\mathcal{H}_1$ can describe only a small range of possible data sets. It gives high evidence for those but practically none for the rest. A very complex model like $\mathcal{H}_3$ can describe a much larger variety of data sets. It therefore has to spread its predictions more thinly than model $\mathcal{H}_1$ and gives lower evidence for simple data sets. For a data set lying in the middle of the range, both extremes lose to the intermediate model $\mathcal{H}_2$, which is just good enough for that data and is therefore exactly the model called for by Occam's razor.
Figure 3.2: The evidence given by models of different complexity for all the possible data sets.
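The behaviour sketched in the figure can be reproduced numerically in a toy setting. In the sketch below (an illustration assumed for this text, not part of the original), the possible data sets are all outcomes of $N$ coin flips, summarised by the number of heads $k$; a rigid model fixes the head probability, while a flexible one marginalises over it.

```python
import numpy as np
from math import comb

N = 10                    # number of coin flips
ks = np.arange(N + 1)     # each k indexes one possible "data set"

def evidence_fixed(p):
    # Simple model: head probability fixed at p, no free parameters.
    # The evidence P(k | H) is just the binomial likelihood.
    return np.array([comb(N, k) * p**k * (1 - p)**(N - k) for k in ks])

def evidence_uniform(n_grid=2001):
    # Flexible model: head probability is a free parameter with a
    # uniform prior; the evidence marginalises it out numerically,
    # which gives P(k | H) = 1 / (N + 1) for every k.
    ps = np.linspace(0.0, 1.0, n_grid)
    lik = np.array([[comb(N, k) * p**k * (1 - p)**(N - k) for p in ps]
                    for k in ks])
    return np.trapz(lik, ps, axis=1)

e_simple = evidence_fixed(0.5)   # concentrated around k = N/2
e_complex = evidence_uniform()   # spread evenly over all k

# Both evidences are normalised over all possible data sets, so the
# flexible model must assign lower evidence where the rigid one peaks.
print(e_simple.sum(), e_complex.sum())   # both ~1.0
print(e_simple[5], e_complex[5])         # ~0.246 vs ~0.091
```

The rigid model wins for data sets close to its predictions and the flexible one for extreme data sets; adding a third model of intermediate complexity produces the crossing pattern of Figure 3.2.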
After this point, the explicit references to the model $\mathcal{H}$ in the expressions for different probabilities are omitted. This is done purely to simplify the notation. It should be noted that in the Bayesian framework, all probabilities are conditional on some assumptions, because they are always subjective.