Cost Function

The goal is to estimate the posterior pdf of all the unknown variables of the model. This is done by ensemble learning, which amounts to fitting a simple, parametric approximation to the actual posterior pdf [5]. The cost function C is the misfit between the approximation and the actual posterior, measured by the Kullback-Leibler information, which is sensitive to the probability mass of densities. This is the most important advantage over maximum a posteriori (MAP) estimation, which is computationally less expensive but is sensitive to probability density, not mass. For this reason MAP estimation suffers from overfitting, which would be a serious problem here since there are so many estimated variables, whereas ensemble learning is able to avoid it.
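As a concrete illustration of the mass-versus-density distinction, the following NumPy sketch (a toy example of ours, not from the paper) evaluates the closed-form Kullback-Leibler divergence between two univariate Gaussians: a very narrow approximation has high density at its mean, as a point estimate effectively does, yet covers little of the probability mass of a broad posterior.

    import numpy as np

    def kl_gauss(m_q, v_q, m_p, v_p):
        """Closed-form KL( N(m_q, v_q) || N(m_p, v_p) ) for univariate Gaussians."""
        return 0.5 * (np.log(v_p / v_q) + (v_q + (m_q - m_p) ** 2) / v_p - 1.0)

    # A very narrow q places high density at its mean but matches almost none of
    # the probability mass of a broad p, so its Kullback-Leibler misfit is large.
    print(kl_gauss(0.0, 1e-4, 0.0, 1.0))   # approx 4.1: poor mass match
    print(kl_gauss(0.0, 1.0,  0.0, 1.0))   # 0.0: q matches p exactly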

For the time being, let us denote the set of all observation vectors x(t) by X and all the other unknown variables by a vector $\boldsymbol{\theta}$. The actual posterior pdf is thus $p(\boldsymbol{\theta} \vert X) = p(X, \boldsymbol{\theta}) / p(X)$. The joint pdf $p(X, \boldsymbol{\theta})$ is obtained from the definition of the model in (3)-(14), and p(X) is a normalising factor which does not depend on the unknown variables.

Let us denote the approximation of the posterior pdf by $q(\boldsymbol{\theta})$. In order for the cost function to be computable in practice, a simple factorial form needs to be chosen for the approximation $q(\boldsymbol{\theta})$. The maximally factorial form would be

$\displaystyle q(\boldsymbol{\theta}) = \prod_i q(\theta_i) \, .$     (15)

Notice that we use the usual convention for probability density functions: q with different arguments denotes different functions.

The assumption of a factorial $q(\boldsymbol{\theta})$ is equivalent to assuming the unknown variables to be independent given the observations. This is not true, of course, but the approximation has to be made in order to obtain a practical algorithm. The only exception to the maximally factorial form is that the index $M_i(t)$ of the Gaussian and the corresponding source $s_i(t)$ are allowed to have posterior dependence, that is, the terms $q(M_i(t), s_i(t))$ are not factorised further.

The approximation $q(\theta_i)$ should be chosen so that it fits the actual posterior as closely as possible. This is accomplished by choosing $q(\theta_i)$ to be Gaussian for all variables other than the sources, and for the sources choosing $q(M_i(t), s_i(t)) = Q(M_i(t)) \, q(s_i(t) \vert M_i(t))$, where $q(s_i(t) \vert M_i(t))$ is Gaussian.

Let us denote the mean and variance of $q(\theta_i)$ by $\bar{\theta}_i$ and $\tilde{\theta}_i$, respectively. The result of learning is then an estimate of $\boldsymbol{\bar{\theta}}$ and $\boldsymbol{\tilde{\theta}}$, which give the posterior means and variances of all the unknown variables.
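As a hypothetical illustration of how such a factorial approximation could be stored in practice, the sketch below keeps a mean and a variance for every Gaussian factor and, for each source, the discrete weights of $q(M_i(t))$ together with the Gaussian statistics of $q(s_i(t) \vert M_i(t))$ for every mixture index; the array names and sizes are our own inventions, not taken from the paper.

    import numpy as np

    # Hypothetical sizes, chosen only for illustration.
    n_sources, n_steps, n_mix = 3, 200, 4

    # Gaussian factors q(theta_i) for the non-source variables:
    # one posterior mean and one posterior variance per variable.
    q_other_mean = np.zeros(50)
    q_other_var  = np.ones(50)

    # For each source s_i(t): discrete weights q(M_i(t)) together with the
    # Gaussian q(s_i(t) | M_i(t)) for every mixture index, so that the pair
    # (M_i(t), s_i(t)) keeps its joint posterior dependency.
    q_M      = np.full((n_sources, n_steps, n_mix), 1.0 / n_mix)
    q_s_mean = np.zeros((n_sources, n_steps, n_mix))
    q_s_var  = np.ones((n_sources, n_steps, n_mix))

    # Marginal posterior mean and variance of each source follow from the mixture.
    s_mean = np.sum(q_M * q_s_mean, axis=-1)
    s_var  = np.sum(q_M * (q_s_var + q_s_mean ** 2), axis=-1) - s_mean ** 2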

The term p(X) is constant with respect to the unknown parameters. Instead of the pure Kullback-Leibler information $K(q(\boldsymbol{\theta}) \,\vert\vert\, p(\boldsymbol{\theta} \vert X))$, it is therefore possible to use the following cost function:

 
$\displaystyle C(\boldsymbol{\bar{\theta}}, \boldsymbol{\tilde{\theta}}) = K(q(\boldsymbol{\theta}) \,\vert\vert\, p(\boldsymbol{\theta} \vert X)) - \ln p(X) = \int q(\boldsymbol{\theta}) \ln \frac{q(\boldsymbol{\theta})}{p(\boldsymbol{\theta} \vert X)} \, d\boldsymbol{\theta} - \ln p(X) = \int q(\boldsymbol{\theta}) \ln \frac{q(\boldsymbol{\theta})}{p(X, \boldsymbol{\theta})} \, d\boldsymbol{\theta} \, .$     (16)

Notice that the variables $M_i(t)$ are discrete, so the corresponding terms are summed over, not integrated over, in the Kullback-Leibler information. For simplicity this is not written out explicitly in (16).
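For intuition about (16), the following sketch estimates the cost by Monte Carlo in a one-dimensional toy model of our own, not the model of this paper: $\theta \sim N(0, 1)$, $x \vert \theta \sim N(\theta, 1)$ and $q(\theta) = N(m, v)$. When $q(\theta)$ equals the exact posterior N(x/2, 1/2), the Kullback-Leibler term vanishes and the cost reduces to $-\ln p(X)$.

    import numpy as np

    rng = np.random.default_rng(0)

    def log_gauss(z, mean, var):
        return -0.5 * (np.log(2 * np.pi * var) + (z - mean) ** 2 / var)

    def cost(x, m, v, n_samples=200_000):
        """Monte Carlo estimate of C = E_q[ ln q(theta) - ln p(x, theta) ] for the
        toy model theta ~ N(0, 1), x | theta ~ N(theta, 1), with q(theta) = N(m, v)."""
        theta = m + np.sqrt(v) * rng.standard_normal(n_samples)
        return np.mean(log_gauss(theta, m, v)
                       - log_gauss(theta, 0.0, 1.0) - log_gauss(x, theta, 1.0))

    x = 1.3
    print(cost(x, 0.0, 1.0))       # mismatched q: cost exceeds -ln p(x)
    print(cost(x, x / 2, 0.5))     # q equal to the exact posterior N(x/2, 1/2)
    print(0.5 * np.log(4 * np.pi) + x ** 2 / 4)   # exact -ln p(x) = -ln N(x; 0, 2)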

Due to the simple factorial forms of $q(\boldsymbol{\theta})$ and $p(X, \boldsymbol{\theta})$, the cost function splits into simple terms which are easy to compute. Consequently, it is also easy to differentiate the cost function with respect to $\boldsymbol{\bar{\theta}}$ and $\boldsymbol{\tilde{\theta}}$ and to use the derivatives for constructing the learning algorithm.
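In the same toy setting as above (again an invented example, not the model of this paper) the splitting is explicit: the cost is the sum of the negative entropy of $q$ and the expectations of $-\ln p(\theta)$ and $-\ln p(x \vert \theta)$, each available in closed form, and plain gradient descent on the posterior mean and variance recovers the exact posterior.

    import numpy as np

    def cost_terms(x, m, v):
        """Closed-form cost for the toy model theta ~ N(0,1), x | theta ~ N(theta,1):
        the expectation over q(theta) = N(m, v) splits into one term per factor."""
        neg_entropy = -0.5 * np.log(2 * np.pi * np.e * v)                 # E_q[ln q]
        prior_term  = 0.5 * np.log(2 * np.pi) + 0.5 * (m ** 2 + v)        # E_q[-ln p(theta)]
        like_term   = 0.5 * np.log(2 * np.pi) + 0.5 * ((x - m) ** 2 + v)  # E_q[-ln p(x|theta)]
        return neg_entropy + prior_term + like_term

    def gradients(x, m, v):
        # Derivatives of the terms above with respect to the posterior mean and variance.
        dC_dm = 2.0 * m - x
        dC_dv = 1.0 - 0.5 / v
        return dC_dm, dC_dv

    x, m, v = 1.3, 0.0, 1.0
    for _ in range(500):
        dm, dv = gradients(x, m, v)
        m -= 0.05 * dm
        v -= 0.05 * dv
    print(m, v)   # approaches the exact posterior mean x/2 = 0.65 and variance 0.5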



 