
Probability density for real valued variables

In symbolic representations, the propositions are discrete and resemble simple statements of natural language. When trying to learn models of the environment, the problem with discrete propositions is that an unimaginably large number of them is needed to cover all the possible states of the world. The alternative is to build models which have real valued variables. This allows one to manipulate a vast number of elementary propositions by manipulating real valued functions, probability densities.

Following the usual Bayesian convention, probability density is denoted by a lower case p and ordinary probability by a capital P throughout this thesis. We also use the convenient shorthand notation where p(x | y) means the distribution of the belief in the value of x given y. An alternative notation would be fX|Y(x | y), which makes explicit the fact that p(x | y) is not the same function as, for instance, p(u | v). In cases where ordinary probability needs to be distinguished from probability density, it is called probability mass, in analogy to physical mass and density.

Bayes' rule looks exactly the same for probability densities as it does for probability mass. If a and b are real valued variables, Bayes' rule takes the following form:

\begin{displaymath}p(a \vert b C) = \frac{p(a \vert C) p(b \vert a C)}{p(b \vert C)}.
\end{displaymath} (9)
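Since the densities are ordinary functions, eq. (9) can be checked pointwise in a small numerical sketch. The Gaussian prior and likelihood below, and their variances, are illustrative assumptions under which the posterior has a known closed form; none of the numbers come from the text:

```python
import math

def gauss(x, mean, var):
    # Gaussian density with the given mean and variance
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

# Hypothetical setting: p(a | C) = N(a; 0, 1) and p(b | a C) = N(b; a, 0.25).
# The posterior is then the known Gaussian N(a; 0.8 b, 0.2) and the
# evidence is p(b | C) = N(b; 0, 1.25).
def posterior_via_bayes(a, b):
    # p(a | b C) = p(a | C) p(b | a C) / p(b | C), eq. (9)
    return gauss(a, 0.0, 1.0) * gauss(b, a, 0.25) / gauss(b, 0.0, 1.25)

b = 1.0
for a in (-1.0, 0.0, 0.8, 2.0):
    assert abs(posterior_via_bayes(a, b) - gauss(a, 0.8 * b, 0.2)) < 1e-12
```

In this conjugate Gaussian case the denominator p(b | C) is itself available in closed form, which is what makes the pointwise check possible.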

This is convenient but also dangerous. It is all too easy to talk about ``the single most probable model'' when one is actually talking about the model which has the highest probability density. This is dangerous because probability density is a derived quantity and has no role per se in probability theory. This becomes more evident when looking at the marginalisation principle

\begin{displaymath}p(b \vert C) = \int p(a \vert C) p(b \vert a C) da
\end{displaymath} (10)

or the rule of expected utility

\begin{displaymath}U(A) = \int p(b \vert A) U(A b) db
\end{displaymath} (11)

written for probability densities. Notice that for probability densities the sums change into integrals. In the integrals, the contribution to the probability p(b | C) or to the utility U(A) is zero at any single point if the density is finite. Only a range of nonzero width makes a nonzero contribution. It is then evident that a high density per se is not important; what matters is the overall probability mass in the vicinity of a model.
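Both integrals can be sketched as crude Riemann sums. In the following, the Gaussian densities, the quadratic utility, and all numerical values are illustrative assumptions rather than anything from the text; the sums are checked against the corresponding closed-form answers:

```python
import math

def gauss(x, mean, var):
    # Gaussian density with the given mean and variance
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def riemann(f, lo, hi, n=4000):
    # crude left-endpoint Riemann sum of f over [lo, hi]
    dx = (hi - lo) / n
    return sum(f(lo + i * dx) for i in range(n)) * dx

# Marginalisation, eq. (10): p(b | C) = int p(a | C) p(b | a C) da,
# with p(a | C) = N(a; 0, 1) and p(b | a C) = N(b; a, 0.25);
# analytically the marginal is p(b | C) = N(b; 0, 1.25).
b = 1.0
evidence = riemann(lambda a: gauss(a, 0.0, 1.0) * gauss(b, a, 0.25), -8.0, 8.0)
assert abs(evidence - gauss(b, 0.0, 1.25)) < 1e-4

# Expected utility, eq. (11): U(A) = int p(b | A) U(A b) db with a
# hypothetical quadratic utility U(A b) = -(b - t)^2; analytically the
# integral equals -(variance + (mean - t)^2).
mean, var, t = 1.0, 0.25, 0.0
utility = riemann(lambda x: gauss(x, mean, var) * -((x - t) ** 2),
                  mean - 8.0, mean + 8.0)
assert abs(utility + (var + (mean - t) ** 2)) < 1e-3
```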


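The closing point, that what matters is the mass near a model rather than the density at it, can be illustrated with a hypothetical mixture density. The weights, locations and widths below are invented for the example: a very narrow spike attains by far the highest density, yet a fixed-width neighbourhood of the broad bump holds far more probability mass.

```python
import math

def normal_cdf(x, mean, std):
    # cumulative distribution function of a Gaussian
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

# Hypothetical mixture: a narrow spike (weight 0.001, std 1e-4, at x = 0)
# and a broad bump (weight 0.999, std 1, at x = 3).
def density(x):
    def g(x, mean, std):
        return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))
    return 0.001 * g(x, 0.0, 1e-4) + 0.999 * g(x, 3.0, 1.0)

def mass(lo, hi):
    # exact probability mass of the mixture on [lo, hi]
    return (0.001 * (normal_cdf(hi, 0.0, 1e-4) - normal_cdf(lo, 0.0, 1e-4))
          + 0.999 * (normal_cdf(hi, 3.0, 1.0) - normal_cdf(lo, 3.0, 1.0)))

# The density is far higher at the spike ...
assert density(0.0) > 5 * density(3.0)
# ... yet an equal-width neighbourhood of the broad mode holds much more mass.
assert mass(2.95, 3.05) > 10 * mass(-0.05, 0.05)
```

A point estimate that maximises the density would pick the spike at x = 0, even though almost all of the probability mass lies near x = 3.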
Harri Valpola
2000-10-31