The goal is to estimate the posterior pdf of all the unknown variables of the model. This is done by ensemble learning, which amounts to fitting a simple, parametric approximation to the actual posterior pdf [5]. The cost function C is the misfit between the approximation and the actual posterior, measured by the Kullback-Leibler information, which is sensitive to the probability mass of densities. This is the most important advantage over maximum a posteriori (MAP) estimation, which is computationally less expensive but is sensitive to probability density, not mass. This is why MAP estimation suffers from overfitting, which would be a serious problem here since there are so many estimated variables, whereas ensemble learning is able to avoid it.
For the time being, let us denote the set of all observation vectors x(t) by X and all the other parameters by a vector $\boldsymbol{\theta}$. The actual posterior pdf is thus

$$ p(\boldsymbol{\theta} \mid X) = \frac{p(X, \boldsymbol{\theta})}{p(X)} . $$
The joint pdf $p(X, \boldsymbol{\theta})$ is obtained from the definition of the model in (3)-(14), and $p(X)$ is a normalising factor which does not depend on the unknown variables.
Let us denote the approximation of the posterior pdf by $q(\boldsymbol{\theta})$. In order for the cost function to be computable in practice, a simple factorial form needs to be chosen for the approximation $q(\boldsymbol{\theta})$.
The maximally factorial form would be

$$ q(\boldsymbol{\theta}) = \prod_i q(\theta_i) . \tag{15} $$
The assumption of a factorial $q(\boldsymbol{\theta})$ is equivalent to assuming the unknown variables to be independent given the observations. This is of course not true, but the approximation has to be made in order to obtain a practical algorithm. The only exception to this maximally factorial form is that the index $M_i(t)$ of the Gaussian and the corresponding source $s_i(t)$ are allowed to have a posterior dependency, that is, the terms $q(M_i(t), s_i(t))$ are not factorised further.
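In other words, the approximation used here is of the form (spelling out the exception; the index $j$, introduced only for this display, runs over all the remaining unknown variables)

$$ q(\boldsymbol{\theta}) = \prod_{i,t} q\big(M_i(t), s_i(t)\big) \prod_j q(\theta_j) . $$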
The approximation $q(\boldsymbol{\theta})$ should be chosen so that it fits the actual posterior as closely as possible. This is accomplished by choosing $q$ to be Gaussian for all variables other than the sources, and for the sources choosing $q(M_i(t), s_i(t)) = q(M_i(t))\, q(s_i(t) \mid M_i(t))$, where $q(s_i(t) \mid M_i(t))$ is Gaussian. Let us denote the mean and variance of $q(\theta_i)$ by $\bar{\theta}_i$ and $\tilde{\theta}_i$, respectively. The result of learning is then an estimate of $\bar{\theta}_i$ and $\tilde{\theta}_i$, which give the posterior mean and variance of all the unknown variables.
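To make the parametric form concrete, the following is a minimal sketch (not the authors' implementation; all class and variable names are hypothetical) of how such an approximation could be represented: one mean and variance per Gaussian factor, and for each source the responsibilities $q(M_i(t))$ together with the conditional means and variances of $q(s_i(t) \mid M_i(t))$.

```python
# A minimal sketch (not the authors' implementation) of the parametric
# posterior approximation q: a Gaussian factor (posterior mean, variance)
# for each ordinary parameter, and for each source s_i(t) the
# responsibilities q(M_i(t)) with the conditional Gaussians q(s_i(t) | M_i(t)).
# All names are hypothetical.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GaussianFactor:
    mean: float = 0.0   # posterior mean (theta bar)
    var: float = 1.0    # posterior variance (theta tilde)


@dataclass
class SourceFactor:
    # One entry per value of the mixture index M_i(t).
    responsibilities: List[float] = field(default_factory=list)  # q(M_i(t))
    cond_means: List[float] = field(default_factory=list)        # mean of q(s|M)
    cond_vars: List[float] = field(default_factory=list)         # var of q(s|M)

    def mean(self) -> float:
        # Posterior mean of s_i(t): responsibility-weighted conditional means.
        return sum(r * m for r, m in zip(self.responsibilities, self.cond_means))

    def var(self) -> float:
        # Posterior variance of s_i(t) via the law of total variance.
        mu = self.mean()
        return sum(
            r * (v + (m - mu) ** 2)
            for r, m, v in zip(self.responsibilities, self.cond_means, self.cond_vars)
        )


# Hypothetical containers for the whole approximation q(theta).
q_params: Dict[str, GaussianFactor] = {"mixing_weight_11": GaussianFactor(0.2, 0.01)}
q_sources: Dict[str, SourceFactor] = {
    "s_1(1)": SourceFactor([0.7, 0.3], [-1.0, 2.0], [0.5, 0.5])
}
print(q_sources["s_1(1)"].mean(), q_sources["s_1(1)"].var())
```

In this sketch the posterior mean and variance of a source follow from the mixture by the law of total variance, while each other variable is summarised directly by the mean and variance of its Gaussian factor.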
The term $p(X)$ is constant with respect to the unknown parameters. Instead of the pure Kullback-Leibler information, it is therefore possible to use the following cost function:

$$ C = \int q(\boldsymbol{\theta}) \ln \frac{q(\boldsymbol{\theta})}{p(X, \boldsymbol{\theta})}\, d\boldsymbol{\theta} . $$
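Since $p(X, \boldsymbol{\theta}) = p(\boldsymbol{\theta} \mid X)\, p(X)$, a standard rearrangement gives

$$ C = \int q(\boldsymbol{\theta}) \ln \frac{q(\boldsymbol{\theta})}{p(\boldsymbol{\theta} \mid X)}\, d\boldsymbol{\theta} \;-\; \ln p(X), $$

so C differs from the Kullback-Leibler information between $q(\boldsymbol{\theta})$ and the true posterior only by the constant $-\ln p(X)$, and minimising C is equivalent to minimising the misfit.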
Due to the simple factorial forms of $q(\boldsymbol{\theta})$ and $p(X, \boldsymbol{\theta})$, the cost function splits into simple terms which are easy to compute. Consequently, it is also easy to differentiate the cost function with respect to $\bar{\theta}_i$ and $\tilde{\theta}_i$ and to use the derivatives for constructing the learning algorithm.
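For illustration (a standard consequence of the factorial form; the exact terms depend on the model structure in (3)-(14)), the cost function can be written as

$$ C = \mathrm{E}_q\!\left[\ln q(\boldsymbol{\theta})\right] - \mathrm{E}_q\!\left[\ln p(X, \boldsymbol{\theta})\right], $$

where the first expectation splits into a sum over the factors of $q$: for a Gaussian factor with variance $\tilde{\theta}_i$ it evaluates in closed form to $-\tfrac{1}{2}\ln(2\pi e\,\tilde{\theta}_i)$, and for the terms $q(M_i(t), s_i(t))$ it is a finite sum over the discrete index $M_i(t)$. The second expectation splits correspondingly over the factors of the joint pdf, yielding terms whose derivatives with respect to $\bar{\theta}_i$ and $\tilde{\theta}_i$ are available in closed form.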