Let us model a set of observations, $\boldsymbol{x} = \{x(t) \mid t = 1, \ldots, T\}$, by a
Gaussian distribution parametrised by mean *m* and log-std $v = \ln \sigma$.
We shall approximate the posterior distribution by
*q*(*m*,*v*) =
*q*(*m*)*q*(*v*), where both *q*(*m*) and *q*(*v*) are Gaussian. The
parametrisation with log-std is chosen because the posterior of *v* is
closer to Gaussian than the posterior of the std $\sigma$ or the variance
$\sigma^2$ would be. (Notice that the parametrisation yielding close-to-Gaussian
posterior distributions is connected to the uninformative priors discussed
in section 5.1.)

Let the priors for *m* and *v* be Gaussian with means $\mu_m$ and $\mu_v$
and variances $\sigma_m^2$ and $\sigma_v^2$, respectively.
The joint density of the observations and the parameters *m* and *v* is

$$p(\boldsymbol{x}, m, v) = \left[\prod_{t=1}^{T} p(x(t) \mid m, v)\right] p(m)\,p(v). \tag{19}$$

As we can see, the posterior is a product of many simple terms.
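This structure is easy to verify numerically. As a sketch (not part of the original derivation; the prior parameters below are arbitrary illustrative values), the log of the joint density is just a sum of simple log-terms, one per factor:

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log-density of a Gaussian, evaluated elementwise."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

def log_joint(x, m, v, mu_m=0.0, var_m=100.0, mu_v=0.0, var_v=100.0):
    """ln p(x, m, v) as in equation 19: one log-term per factor.

    Each x(t) is Gaussian with mean m and variance e^{2v}; the priors
    on m and v are Gaussian. Prior parameters here are placeholders.
    """
    log_lik = np.sum(log_gauss(x, m, np.exp(2 * v)))
    return log_lik + log_gauss(m, mu_m, var_m) + log_gauss(v, mu_v, var_v)

x = np.array([0.9, 1.1, 1.3])
print(log_joint(x, m=1.0, v=0.0))
```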

Let us denote by $\bar{m}$ and $\tilde{m}$ the posterior mean and
variance of *m*, so that

$$q(m) = \frac{1}{\sqrt{2\pi\tilde{m}}}\exp\left[-\frac{(m-\bar{m})^2}{2\tilde{m}}\right]. \tag{20}$$

The distribution $q(v)$, with mean $\bar{v}$ and variance $\tilde{v}$, is analogous.

The cost function is now

$$\begin{aligned} C &= \int q(m, v) \ln \frac{q(m, v)}{p(\boldsymbol{x}, m, v)} \, dm \, dv \\ &= \langle \ln q(m) \rangle + \langle \ln q(v) \rangle - \sum_{t} \langle \ln p(x(t) \mid m, v) \rangle - \langle \ln p(m) \rangle - \langle \ln p(v) \rangle. \end{aligned} \tag{21}$$

We see that the cost function has many terms, all of which are expectations over
$q(m, v) = q(m)\,q(v)$. The term arising from the prior $p(m)$ is

$$\left\langle -\ln p(m) \right\rangle = \frac{(\bar{m} - \mu_m)^2 + \tilde{m}}{2\sigma_m^2} + \frac{1}{2}\ln 2\pi\sigma_m^2. \tag{22}$$

A similar term, with $m$ replaced by $v$ (and $\mu_m$, $\sigma_m^2$ by $\mu_v$, $\sigma_v^2$), comes from $\langle -\ln p(v) \rangle$.
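As a quick sanity check of equation 22, the expectation can be estimated by Monte Carlo and compared with the closed form; all parameter values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Posterior q(m) and prior p(m) parameters (illustrative values).
m_bar, m_tilde = 0.8, 0.05        # posterior mean and variance of m
mu_m, var_m = 0.0, 2.0            # prior mean and variance

# Monte Carlo estimate of <-ln p(m)> under q(m).
m = rng.normal(m_bar, np.sqrt(m_tilde), size=1_000_000)
mc = np.mean((m - mu_m) ** 2 / (2 * var_m) + 0.5 * np.log(2 * np.pi * var_m))

# Closed form from equation 22.
exact = ((m_bar - mu_m) ** 2 + m_tilde) / (2 * var_m) \
        + 0.5 * np.log(2 * np.pi * var_m)

print(mc, exact)   # the two estimates agree closely
```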

The terms where the expectation is taken over
$q(m)$ and $q(v)$ are also simple since

$$\ln q(m) = -\frac{(m - \bar{m})^2}{2\tilde{m}} - \frac{1}{2}\ln 2\pi\tilde{m}, \tag{23}$$

which means that we only need to be able to compute the expectation of $(m - \bar{m})^2$ over the Gaussian $q(m)$.

This expectation is, by definition, the variance of $q(m)$, which shows that

$$\left\langle (m - \bar{m})^2 \right\rangle = \tilde{m}. \tag{24}$$

Integrating the equation 23 and substituting equation 24 thus yields

$$\left\langle \ln q(m) \right\rangle = -\frac{1}{2}\ln 2\pi\tilde{m} - \frac{1}{2} = -\frac{1}{2}\ln 2\pi e\tilde{m}. \tag{25}$$
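Equation 25, the negative entropy of a Gaussian, can be checked the same way; the $q(m)$ parameters below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

m_bar, m_tilde = 0.8, 0.05   # illustrative q(m) mean and variance

# Monte Carlo estimate of <ln q(m)> under q(m).
m = rng.normal(m_bar, np.sqrt(m_tilde), size=1_000_000)
log_q = -(m - m_bar) ** 2 / (2 * m_tilde) - 0.5 * np.log(2 * np.pi * m_tilde)
mc = np.mean(log_q)

# Closed form from equation 25.
exact = -0.5 * np.log(2 * np.pi * np.e * m_tilde)

print(mc, exact)
```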

A similar term, with $\tilde{m}$ replaced by $\tilde{v}$, comes from $\langle \ln q(v) \rangle$.

The last terms are of the form $\langle -\ln p(x(t) \mid m, v) \rangle$.
Again we will find that the factorisation
$q(m, v) = q(m)\,q(v)$ simplifies the computation of these terms. Recall that $x(t)$ was assumed to be Gaussian with mean $m$ and variance $e^{2v}$. The
term over which the expectation is taken is thus

$$-\ln p(x(t) \mid m, v) = \frac{(x(t) - m)^2}{2}\, e^{-2v} + \frac{1}{2}\ln 2\pi + v. \tag{26}$$

The expectation over the term $(x(t) - m)^2$ is $(x(t) - \bar{m})^2 + \tilde{m}$, analogously to equation 24, and due to the factorisation the expectation of $e^{-2v}$ can be taken over $q(v)$ alone:

$$\left\langle e^{-2v} \right\rangle = \int q(v)\, e^{-2v}\, dv = e^{2\tilde{v} - 2\bar{v}}. \tag{27}$$

This shows that taking the expectation over equation 26 yields a term

$$\left\langle -\ln p(x(t) \mid m, v) \right\rangle = \frac{1}{2}\left[(x(t) - \bar{m})^2 + \tilde{m}\right] e^{2\tilde{v} - 2\bar{v}} + \frac{1}{2}\ln 2\pi + \bar{v}. \tag{28}$$
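The identity in equation 27 is a special case of the Gaussian moment generating function and is easy to confirm by sampling; the $q(v)$ parameters below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Posterior q(v) parameters (illustrative values).
v_bar, v_tilde = 0.3, 0.02

# Monte Carlo estimate of <exp(-2v)> under q(v) = N(v_bar, v_tilde).
v = rng.normal(v_bar, np.sqrt(v_tilde), size=1_000_000)
mc = np.mean(np.exp(-2 * v))

# Closed form from equation 27.
exact = np.exp(2 * v_tilde - 2 * v_bar)

print(mc, exact)
```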

Collecting together all the terms, we obtain the following cost function:

$$\begin{aligned} C ={} & \sum_{t=1}^{T} \left\{ \frac{1}{2}\left[(x(t) - \bar{m})^2 + \tilde{m}\right] e^{2\tilde{v} - 2\bar{v}} + \frac{1}{2}\ln 2\pi + \bar{v} \right\} \\ & + \frac{(\bar{m} - \mu_m)^2 + \tilde{m}}{2\sigma_m^2} + \frac{1}{2}\ln 2\pi\sigma_m^2 + \frac{(\bar{v} - \mu_v)^2 + \tilde{v}}{2\sigma_v^2} + \frac{1}{2}\ln 2\pi\sigma_v^2 \\ & - \frac{1}{2}\ln 2\pi e\tilde{m} - \frac{1}{2}\ln 2\pi e\tilde{v}. \end{aligned} \tag{29}$$

Assuming $\sigma_m^2$ and $\sigma_v^2$ are very large, the prior terms can be
neglected and the minimum of
the cost function can be solved by setting the gradient of the cost
function *C* to zero. This yields the following:

$$\bar{m} = \frac{1}{T}\sum_{t=1}^{T} x(t) \tag{30}$$

$$\tilde{m} = \frac{1}{T}\, e^{2\bar{v} - 2\tilde{v}} \tag{31}$$

$$\bar{v} = \tilde{v} + \frac{1}{2}\ln\left[\frac{1}{T}\sum_{t=1}^{T}\left((x(t) - \bar{m})^2 + \tilde{m}\right)\right] \tag{32}$$

$$\tilde{v} = \frac{1}{2T} \tag{33}$$

In case $\sigma_m^2$ and $\sigma_v^2$ cannot be assumed very large, the equations for $\bar{m}$ and $\bar{v}$ are not that simple, but the solution can still be obtained by solving the zero point of the gradient.
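With broad priors, equations 30–33 can be turned directly into a fixed-point iteration: equations 30 and 33 are decoupled, while equations 31 and 32 must be iterated. The sketch below (function and variable names are my own) fits the approximation to data drawn from the model:

```python
import numpy as np

def fit_gaussian_vb(x, n_iter=100):
    """Iterate equations 30-33 to a fixed point.

    Returns the posterior means and variances (m_bar, m_tilde,
    v_bar, v_tilde) of the mean m and log-std v, assuming very
    broad priors on both parameters.
    """
    T = len(x)
    m_bar = np.mean(x)                    # eq. 30, decoupled
    v_tilde = 1.0 / (2 * T)               # eq. 33, decoupled
    m_tilde, v_bar = np.var(x) / T, 0.0   # initial guesses
    for _ in range(n_iter):               # eqs. 31-32 are coupled
        v_bar = v_tilde + 0.5 * np.log(np.mean((x - m_bar) ** 2 + m_tilde))
        m_tilde = np.exp(2 * v_bar - 2 * v_tilde) / T
    return m_bar, m_tilde, v_bar, v_tilde

# Data drawn from the model with m = 1 and unit std (v = 0).
rng = np.random.default_rng(2)
x = rng.normal(1.0, 1.0, size=100)
print(fit_gaussian_vb(x))
```

The coupled updates contract quickly (the dependence of equation 31 on equation 32 is damped by the factor $1/T$), so a few iterations already suffice in practice.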

Figure 3 shows a comparison of the true posterior distribution and the approximate posterior. The data set consisted of 100 points drawn from a model with *m* = 1 and a fixed *v*.
The contours of both distributions are centred in the same region, corresponding to a model that underestimates *m*. The contours of the two distributions are qualitatively similar, although the true distribution is not symmetric about the mean value of *v*.