BAYESIAN LEARNING IN PRACTICE

The previous section outlined how learning, reasoning and decision making should function in theory. The theory does not, however, take computational and storage requirements into account. In realistic situations, exact computation following Bayes' rule, the marginalisation principle and the rule of expected utility is almost always computationally prohibitive. An important research topic has therefore long been the development of methods that yield practical approximations to the exact theory.
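As a concrete illustration, all three operations can be carried out exactly when the hypothesis space is a small discrete set. The sketch below (the coin-flipping setting and all numbers are made-up assumptions, not from the text) applies Bayes' rule, marginalisation and expected utility in turn; with continuous parameters or structured models these sums become high-dimensional integrals, which is where the computational difficulty arises.

```python
# Exact Bayesian inference over a small discrete hypothesis space.
# Hypothetical example: a coin with unknown bias, with three candidate
# bias values serving as the entire hypothesis space.

hypotheses = [0.3, 0.5, 0.7]
prior = {h: 1.0 / len(hypotheses) for h in hypotheses}

# Observed data: 8 heads out of 10 flips.
heads, flips = 8, 10

# Bayes' rule: posterior is proportional to likelihood times prior.
likelihood = {h: h**heads * (1 - h)**(flips - heads) for h in hypotheses}
evidence = sum(likelihood[h] * prior[h] for h in hypotheses)
posterior = {h: likelihood[h] * prior[h] / evidence for h in hypotheses}

# Marginalisation: predictive probability of heads on the next flip,
# averaging over all hypotheses instead of picking a single best one.
p_heads = sum(h * posterior[h] for h in hypotheses)

# Expected utility: choose the bet with the higher expected payoff
# (a unit payoff for a correct guess, nothing otherwise).
utility = {"bet_heads": p_heads, "bet_tails": 1 - p_heads}
best_action = max(utility, key=utility.get)
```

With three hypotheses the sums are trivial; the point is that each extra parameter multiplies the number of terms, so exact enumeration quickly becomes infeasible.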

Currently the frequentist interpretation is the prevailing statistical philosophy. According to this view, probability measures the relative frequencies of different outcomes in an (imaginary) infinite sequence of trials. Probability is defined for random variables, which can take different values in different trials. Both the frequency and belief interpretations have coexisted since the early days of probability theory (see, e.g., [75]). The advent of quantum physics gave the frequentist view a strong impetus because probability came to be seen as a measurable property of the physical world.

The Bayesian and frequentist schools use different language and methods. In frequentist statistics, a hypothesis or a parameter of a model cannot have a probability because it is not a random variable and does not take different values in different trials. The methods developed within the frequentist school include estimators and confidence intervals for parameters and *P*-values for hypothesis testing; prior and posterior probabilities do not enter the calculations.

In Bayesian statistics, probability measures degree of belief, and it is therefore admissible to talk about the probabilities of hypotheses and parameters. The probability distribution of a parameter, for instance, quantifies the uncertainty about its value. Many Bayesian statisticians avoid talking about random variables altogether, because one is almost always referring to uncertainty about the outcomes of experiments, not to some intrinsic property of the world.
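The idea that a parameter's distribution quantifies uncertainty can be illustrated with the standard Beta-Bernoulli example (the coin-bias setting and all counts below are illustrative assumptions): under a uniform prior, the posterior of a coin's bias after k heads in n flips is Beta(k+1, n-k+1), and its spread shrinks as evidence accumulates.

```python
# Posterior uncertainty about a coin's bias under a uniform prior.
# The posterior after k heads in n flips is Beta(k + 1, n - k + 1);
# its mean and standard deviation follow from the Beta distribution.

def beta_mean_std(k, n):
    a, b = k + 1, n - k + 1
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var ** 0.5

m_small, s_small = beta_mean_std(8, 10)      # 8 heads in 10 flips
m_large, s_large = beta_mean_std(800, 1000)  # same ratio, 100x the data

# The posterior mean is similar in both cases, but with more data the
# belief concentrates: the posterior standard deviation is much smaller.
```

The point is that the parameter itself never "takes different values"; the distribution expresses how uncertain the observer is about its one fixed value.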

The prevalence of frequentist language and methods has caused misconceptions about the Bayesian view, and not all researchers are ready to accept the Bayesian approach to statistical inference as theoretically optimal. Here are answers to some common arguments against the Bayesian approach to learning:

- ``Prior probabilities are needed in Bayesian learning whereas other methods do not need them.''
Learning cannot start from a vacuum, so prior assumptions of some sort are always needed. In some learning algorithms these assumptions are implicit, but they exist nonetheless. Bayesian learning requires the assumptions to be stated explicitly, which makes it easier to locate possible flaws in them.

- ``Bayesian learning works only if the true model is included in the hypothesis space.''
True models exist only in theoretical constructs. Bayesian statistics takes no stand on true models because it only talks about beliefs in propositions. In practice, prior knowledge about models almost never states that one of them is true, only that some models predict the observations better than others. The posterior probability then measures not the belief that a certain hypothesis is true, but the belief that the observations are best predicted by that hypothesis.

- ``Various efficient learning methods are not derived from Bayesian probability theory but they still work.''
It is true that efficient learning algorithms can be developed without reference to Bayesian probability theory; the evolution of the brain, for instance, was certainly not guided by any theory. This does not mean, however, that Bayesian probability theory is an inappropriate framework for understanding such algorithms. In practice they can be interpreted as approximations to the theoretically optimal exact Bayesian approach, which often makes it easier to investigate their underlying assumptions and limitations and, in some cases, to generalise the methods.
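The last answer can be made concrete with a familiar case: weight-decay (ridge) regularisation, usually motivated without any reference to probability, coincides with the posterior mode (MAP estimate) under an explicit Gaussian prior on the weights. The sketch below uses a made-up one-parameter regression; the data, regularisation strength and step size are assumptions for illustration only.

```python
# Weight decay as an implicit Gaussian prior: the ridge solution of
# a one-parameter regression equals the posterior mode obtained by
# minimising the negative log posterior with gradient descent.

xs = [0.5, 1.0, 1.5, 2.0, 2.5]
ys = [0.9, 2.1, 2.9, 4.2, 4.8]   # roughly y = 2x plus noise
lam = 0.5                        # weight-decay strength

# Closed-form ridge estimate: minimises sum (y - w x)^2 + lam w^2.
w_ridge = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

# The same value maximises the posterior under the explicit model
# y ~ N(w x, sigma^2) with prior w ~ N(0, sigma^2 / lam): minimise
# the (scaled) negative log posterior by gradient descent.
w = 0.0
for _ in range(2000):
    grad = sum(-2 * x * (y - w * x) for x, y in zip(xs, ys)) + 2 * lam * w
    w -= 0.01 * grad
```

Reading the penalty as a prior makes the algorithm's hidden assumption explicit: weight decay asserts, before seeing any data, that small weights are more believable than large ones.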

The remainder of this section discusses the following practical topics:

- Probability density for real-valued variables
- Methods for approximating the posterior probability
- Information-theoretic approaches to learning
- Ensemble learning
- Specification of the model and priors