
There are two schools of thought on the interpretation of probability. In classical statistics, probability is interpreted as a limiting frequency when an experiment is repeated infinitely many times. For instance, in throwing a die, the probability of getting a three is one in six (exactly so only if the die is ideal).

In everyday language, however, probability is understood in a wider sense. One can, for example, speak about the probability of rain tomorrow, even though the event is unique and there is no way its frequency could be measured by repeated experiments. Moreover, different people can assign the same event different probabilities. This is natural, since different people have different background knowledge and beliefs.

The interpretation of Bayesian probability theory is very close to everyday language: probability expresses how strongly someone believes in something. Belief is always subjective and depends on background knowledge. The notation P(A | B) means: how plausible A seems if B is assumed. Often the background knowledge is left implicit in the notation, and P(A) can thus mean different things depending on which background assumptions are used. It is good to remember, however, that according to the Bayesian interpretation there is no absolute probability, since there is no absolutely correct set of background assumptions.

Sometimes the interpretation of probability has no effect on how the actual computations are conducted or on their result. For the probabilities in dice throwing, for example, the interpretation makes no difference. From the point of view of learning and intelligent systems, however, the difference in interpretation is significant.

The propositions for which probabilities are defined obey the rules of Boolean algebra. A Boolean algebra is defined on a set of elements with two binary operations, sum and product, and a unary operation, complement, denoted here by ¬. The set of axioms defining a Boolean algebra is

There exist elements 0 and 1, which are not equal. | | [A1]

AB = BA | A+B = B+A | [A2]

A(B+C) = (AB)+(AC) | A+(BC) = (A+B)(A+C) | [A3]

1A = A | 0+A = A | [A4]

A¬A = 0 | A+¬A = 1 | [A5]

The axioms on the same row are dual: by exchanging product with sum and 0 with 1, one can transform between the dual axioms. Let's denote the axioms in the left-hand column by a and those in the right-hand column by b; i.e., A2a means the axiom AB = BA. From the axioms one can derive the following lemmas

¬¬A = A | | [L1]

AA = A | A+A = A | [L2] |

¬1 = 0 | ¬0 = 1 | [L3] |

AB = 0 & A+B = 1 => B = ¬A | | [L4]

0A = 0 | 1+A = 1 | [L5] |

A(A+B) = A | A+AB = A | [L6] |

A(BC) = (AB)C | A+(B+C) = (A+B)+C | [L7] |

¬A(AB) = 0 | ¬A+(A+B) = 1 | [L8] |

¬(AB) = ¬A+¬B | ¬(A+B) = ¬A¬B | [L9] |

AB = 1 => A = 1 | A+B = 0 => A = 0 | [L10] |

Boolean logic is obtained when only the elements 0 and 1 are taken in the algebra. Zero is interpreted as false and one as true. The product corresponds to the and operation, the sum to the or operation, and the complement to the negation operation.
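Since the two-element algebra is finite, the axioms and lemmas can be verified by exhaustive enumeration. The following sketch (not part of the original text) checks the axioms A2–A5 and, as a sample lemma, De Morgan's laws [L9] over {0, 1}:

```python
# Brute-force check that {0, 1} with and/or/not forms a Boolean algebra.
from itertools import product as cartesian

def AND(a, b): return a & b          # product
def OR(a, b):  return a | b          # sum
def NOT(a):    return 1 - a          # complement

for A, B, C in cartesian((0, 1), repeat=3):
    assert AND(A, B) == AND(B, A)                          # A2a
    assert OR(A, B) == OR(B, A)                            # A2b
    assert AND(A, OR(B, C)) == OR(AND(A, B), AND(A, C))    # A3a
    assert OR(A, AND(B, C)) == AND(OR(A, B), OR(A, C))     # A3b
    assert AND(1, A) == A and OR(0, A) == A                # A4
    assert AND(A, NOT(A)) == 0 and OR(A, NOT(A)) == 1      # A5
    assert NOT(AND(A, B)) == OR(NOT(A), NOT(B))            # L9a (De Morgan)
    assert NOT(OR(A, B)) == AND(NOT(A), NOT(B))            # L9b (De Morgan)

print("axioms and De Morgan's laws hold in {0, 1}")
```

The same loop can be extended to any of the lemmas L1–L10; each reduces to a finite truth-table check in the two-element case.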

Sum Rule: P(A | B) + P(¬A | B) = 1

If one wishes to verify the truth of AB, one can first verify A and then verify B assuming A. Hence P(AB | C) is evidently a function of P(A | C) and P(B | AC). The product rule states that this function is a product.

Product Rule: P(AB | C) = P(A | C) P(B | AC)

Probability is a real number between zero and one. The probability is not defined if the background assumptions, the premises, are contradictory; P(A | B¬B), for example, is undefined.
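Both rules can be checked numerically on a small joint distribution. The distribution below is made up purely for illustration; conditional probabilities such as P(B | AC) are computed as ratios of probability masses:

```python
# Hypothetical joint distribution over two binary propositions A and B,
# conditional on some fixed background C (numbers invented for illustration).
P = {(1, 1): 0.30, (1, 0): 0.20, (0, 1): 0.40, (0, 0): 0.10}

def prob(pred):
    """P(pred | C): total mass of the outcomes satisfying the predicate."""
    return sum(p for (a, b), p in P.items() if pred(a, b))

P_A    = prob(lambda a, b: a == 1)               # P(A | C)
P_notA = prob(lambda a, b: a == 0)               # P(¬A | C)
P_AB   = prob(lambda a, b: a == 1 and b == 1)    # P(AB | C)
P_B_gA = P_AB / P_A                              # P(B | AC) as a mass ratio

# Sum rule: P(A | C) + P(¬A | C) = 1
assert abs(P_A + P_notA - 1) < 1e-12
# Product rule: P(AB | C) = P(A | C) P(B | AC)
assert abs(P_AB - P_A * P_B_gA) < 1e-12
```

Any normalised table of non-negative numbers passes these checks, which reflects that the two rules constrain how probabilities combine, not what their values are.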

Using the rules of arithmetic and Boolean algebra, all the other rules of Bayesian probability theory can be derived from the sum and product rules. As an example, let's take the derivation of the generalised sum rule. In what follows, the rule applied is indicated at each step, unless only basic arithmetic is used.

P(A+B | C) = | [L1] |

P(¬¬(A+B) | C) = | [L9b] |

P(¬(¬A¬B) | C) = | [Sum Rule] |

1 - P(¬A¬B | C) = | [Product Rule] |

1 - P(¬A | C) P(¬B | ¬AC) = | [Sum Rule] |

1 - P(¬A | C) [1 - P(B | ¬AC)] = | |

1 - P(¬A | C) + P(¬A | C) P(B | ¬AC) = | [Sum Rule] |

P(A | C) + P(¬A | C) P(B | ¬AC) = | [Product Rule] |

P(A | C) + P(¬AB | C) = | [A2a] |

P(A | C) + P(B¬A | C) = | [Product Rule] |

P(A | C) + P(B | C) P(¬A | BC) = | [Sum Rule] |

P(A | C) + P(B | C) [1 - P(A | BC)] = | |

P(A | C) + P(B | C) - P(B | C) P(A | BC) = | [Product Rule] |

P(A | C) + P(B | C) - P(BA | C) = | [A2a] |

P(A | C) + P(B | C) - P(AB | C) |
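The result P(A+B | C) = P(A | C) + P(B | C) - P(AB | C) can be checked numerically on a made-up joint distribution (the numbers below are hypothetical):

```python
# Numerical check of the generalised sum rule on an invented joint table.
P = {(1, 1): 0.25, (1, 0): 0.15, (0, 1): 0.35, (0, 0): 0.25}

def prob(pred):
    """P(pred | C): total mass of the outcomes satisfying the predicate."""
    return sum(p for (a, b), p in P.items() if pred(a, b))

lhs = prob(lambda a, b: a == 1 or b == 1)          # P(A+B | C)
rhs = (prob(lambda a, b: a == 1)                   # P(A | C)
       + prob(lambda a, b: b == 1)                 # P(B | C)
       - prob(lambda a, b: a == 1 and b == 1))     # - P(AB | C)
assert abs(lhs - rhs) < 1e-12
```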

Usually, of course, not all the intermediate results are presented. From the sum and product rules one can also derive the equations P(1 | A) = 1 and P(A | B) > 0 => P(A | AB) = 1. Let's denote x = P(1 | A). Then

1 - x = 1 - P(1 | A) = P(0 | A) = P(10 | A) = P(1 | A) P(0 | 1A) = x(1 - x) => x² - 2x + 1 = 0,

where P(0 | 1A) = P(0 | A) = 1 - x because 1A = A by axiom A4a. The only solution of the equation is x = 1. On the other hand,

P(A | B) = P(AA | B) = P(A | B) P(A | AB),

and it follows that P(A | AB) = 1 if P(A | B) > 0.

Assume next that the propositions B_{1}, B_{2}, ..., B_{n} are mutually exclusive (B_{i}B_{j} = 0 when i ≠ j) and exhaustive (B_{1} + B_{2} + ... + B_{n} = 1). The generalised sum rule then gives

P(AB_{1}+AB_{2} | C) = P(AB_{1} | C) +
P(AB_{2} | C) - P(AB_{1}AB_{2} | C) =
P(AB_{1} | C) + P(AB_{2} | C).

This follows from AB_{1}AB_{2} =
A(B_{1}B_{2}) = A0 = 0. Adding AB_{3} gives

P(AB_{1}+AB_{2}+AB_{3} | C) =
P(AB_{1} | C) + P(AB_{2} | C) + P(AB_{3} | C)
- P((AB_{1} + AB_{2})AB_{3} | C) =
P(AB_{1} | C) + P(AB_{2} | C) + P(AB_{3} | C).

Continuing to AB_{n} results in

P(AB_{1} + AB_{2} + ... + AB_{n} | C) =
P(AB_{1} | C) + P(AB_{2} | C) + ... + P(AB_{n}
| C).

On the other hand, since AB_{1} + AB_{2} + ... +
AB_{n} = A(B_{1} + B_{2} + ... +
B_{n}) = A1 = A, we have

P(A | C) = P(AB_{1} | C) + P(AB_{2} | C) + ... +
P(AB_{n} | C).

By applying the product rule we get the marginalisation principle

P(A | C) = P(A | B_{1}C) P(B_{1} | C) + ... +
P(A | B_{n}C) P(B_{n} | C).

The significance of the principle becomes clear when the propositions B_{i} are interpreted as possible explanations for A. The probability of A is then the sum of the probabilities which the different explanations give for A, weighted by the probabilities of the explanations.
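As a small sketch (with invented numbers), the marginalisation principle is just a weighted sum of the likelihoods over the explanations:

```python
# Marginalisation over three mutually exclusive, exhaustive explanations
# B1, B2, B3. All numbers below are hypothetical, for illustration only.
prior      = [0.5, 0.3, 0.2]    # P(Bi | C); must sum to 1
likelihood = [0.9, 0.4, 0.1]    # P(A | Bi C)

# P(A | C) = sum over i of P(A | Bi C) P(Bi | C)
P_A = sum(l * p for l, p in zip(likelihood, prior))
assert abs(sum(prior) - 1) < 1e-12   # exhaustive, mutually exclusive
```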

Bayes' rule can be derived from the product rule. It tells how the probabilities of the explanations change when A is observed.

P(B_{i} | AC) = P(B_{i} | C) P(A | B_{i}C)
/ P(A | C)

P(B_{i} | C) is the probability before obtaining the knowledge about A, and it is called the prior probability of B_{i}. Correspondingly, P(B_{i} | AC) is called the posterior probability of B_{i}. One can see from Bayes' rule that the posterior probabilities of explanations B_{i} which explain A well are higher than their prior probabilities, and vice versa.

An example will hopefully illuminate the use of Bayes' rule. Let A = I
have a fever, B_{1} = I have the flu and B_{2} = no flu =
¬B_{1}. Let's assume that I know the probabilities P(A |
B_{1}C), P(A | B_{2}C) and P(B_{1} | C), i.e.,
the probabilities of having a fever when having the flu, of having a
fever without the flu, and of having the flu in the first place. Let's
assign them the numerical values P(A | B_{1}C) = 0.95, P(A |
B_{2}C) = 0.05 and P(B_{1} | C) = 0.1. According to
the marginalisation principle, the probability of having a fever is

P(A | C) = P(A | B_{1}C) P(B_{1} | C) + P(A |
B_{2}C) P(B_{2} | C) = 0.95 * 0.1 + 0.05 * 0.9 =
0.095 + 0.045 = 0.14.

The probability of having the flu is originally fairly small, only one in ten. If it now turns out that I have a fever, the probability of the flu increases:

P(B_{1} | AC) = P(B_{1} | C) P(A | B_{1}C)
/ P(A | C) = 0.1 * 0.95 / 0.14 ≈ 0.68.
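The arithmetic of the flu example above can be reproduced in a few lines (the probability values are the ones given in the text):

```python
# Fever/flu example: marginalisation followed by Bayes' rule.
P_B1 = 0.1            # P(B1 | C): prior probability of flu
P_A_given_B1 = 0.95   # P(A | B1 C): probability of fever given flu
P_A_given_B2 = 0.05   # P(A | B2 C): probability of fever given no flu

# Marginalisation principle: P(A | C)
P_A = P_A_given_B1 * P_B1 + P_A_given_B2 * (1 - P_B1)

# Bayes' rule: posterior P(B1 | A C)
P_B1_given_A = P_B1 * P_A_given_B1 / P_A

print(round(P_A, 2))           # 0.14
print(round(P_B1_given_A, 2))  # 0.68
```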

Together, the marginalisation principle and Bayes' rule tell how the belief in a hypothesis changes when observations are made, and how the beliefs in hypotheses are taken into account when making predictions based on them.

With real-valued quantities, the probability of any particular
value is usually 0. If, for instance, a measurement indicates that the
length of a pencil is about 16 cm, the probability of the length being
*exactly* 16 cm is zero. The probability that the length is
between 15 cm and 17 cm can, in contrast, easily be very close to
one.

The phenomenon is the same as in measuring a mass. A single point of an object has no mass, but a finite volume does. Just as the density of an object equals its mass divided by its volume, the probability density is the probability of a range divided by the length of the range.
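The pencil example can be sketched numerically. Assume, purely for illustration, that the measurement uncertainty is Gaussian with mean 16 cm and standard deviation 0.5 cm (these numbers are not from the text):

```python
# Point probability vs. interval probability vs. density for a
# hypothetical Gaussian measurement model (mean 16 cm, sd 0.5 cm).
from math import erf, sqrt

MEAN, SD = 16.0, 0.5

def cdf(x):
    """Cumulative probability of the Gaussian measurement model."""
    return 0.5 * (1 + erf((x - MEAN) / (SD * sqrt(2))))

# The probability of any single exact value is zero...
print(cdf(16.0) - cdf(16.0))         # 0.0
# ...but an interval carries nonzero probability mass:
p_interval = cdf(17.0) - cdf(15.0)   # P(15 cm < length < 17 cm)
print(round(p_interval, 3))          # 0.954
# Density at a point = probability of a shrinking interval / its length:
h = 1e-6
density_at_16 = (cdf(16 + h) - cdf(16 - h)) / (2 * h)
```

The last line is exactly the mass-divided-by-volume analogy: the density at 16 cm is the limit of interval probability per unit length.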

Bayes' rule remains the same when probability densities are used instead of probabilities.

Often probability mass is denoted by a capital P and density by a lower-case p, but it usually becomes clear from the context whether probability mass or density is meant.
