This section describes how to compute the posterior mean and variance
of the outputs $f_k(\mathbf{s}(t))$ of the MLP network. Ordinarily the
inputs, weights and biases of an MLP network have fixed values. Here
the inputs $\mathbf{s}(t)$, the weights $\mathbf{A}$ and $\mathbf{B}$ and the
biases $\mathbf{a}$ and $\mathbf{b}$ have posterior distributions, which means
that we also have a posterior distribution over the outputs. One way to
evaluate the posterior mean and variance would be to propagate distributions
instead of fixed values through the network. Whole distributions
would be quite tricky to deal with, however, and therefore we
characterise the distributions by their mean and variance only.
The sources have mixture-of-Gaussians posterior distributions, for which it is
easy to compute the mean and variance:
\[
\bar{s}_i(t) = \sum_l \dot{s}_{il}(t)\, \bar{s}_{il}(t) ,
\qquad
\tilde{s}_i(t) = \sum_l \dot{s}_{il}(t) \left[ \tilde{s}_{il}(t) + \left( \bar{s}_{il}(t) - \bar{s}_i(t) \right)^2 \right] ,
\]
where $\dot{s}_{il}(t)$, $\bar{s}_{il}(t)$ and $\tilde{s}_{il}(t)$ denote the
mixing proportion, mean and variance of the $l$th Gaussian in the mixture of
source $s_i(t)$.
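The following is a minimal NumPy sketch (not part of the original text) of this mixture mean and variance computation; the function and argument names are my own.

```python
import numpy as np

def mixture_mean_var(weights, means, variances):
    """Posterior mean and variance of one mixture-of-Gaussians source s_i(t).

    weights, means, variances: arrays of shape (n_components,) holding the
    mixing proportions, component means and component variances.
    """
    w = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    w = w / w.sum()                        # normalise the mixing proportions
    mean = np.sum(w * means)               # weighted mean of the component means
    var = np.sum(w * (variances + (means - mean) ** 2))
    return mean, var

# Example: a two-component mixture
m, v = mixture_mean_var([0.3, 0.7], [-1.0, 2.0], [0.5, 0.2])
```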
Then the sources are multiplied by the first layer weight matrix
$\mathbf{A}$ and the bias $\mathbf{a}$ is added. Let us denote the result
by $\mathbf{y}(t) = \mathbf{A}\mathbf{s}(t) + \mathbf{a}$.
Since the sources, weights
and biases are all mutually independent a posteriori, the following
equations hold:
\[
\bar{y}_j(t) = \sum_i \bar{A}_{ji}\, \bar{s}_i(t) + \bar{a}_j ,
\]
\[
\tilde{y}_j(t) = \sum_i \left[ \bar{A}_{ji}^2\, \tilde{s}_i(t)
  + \tilde{A}_{ji}\, \bar{s}_i^2(t) + \tilde{A}_{ji}\, \tilde{s}_i(t) \right] + \tilde{a}_j .
\]
The expression for the posterior variance $\tilde{y}_j(t)$ follows from the identity
\[
\mathrm{Var}\{xy\} = \bar{x}^2\, \tilde{y} + \tilde{x}\, \bar{y}^2 + \tilde{x}\, \tilde{y} , \qquad (23)
\]
which holds for independent $x$ and $y$ with means $\bar{x}$, $\bar{y}$ and variances $\tilde{x}$, $\tilde{y}$.
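As an illustration (not from the original), here is a short NumPy sketch of this first layer propagation under the stated independence assumption; the array names are my own, with element-wise posterior means and variances stored in separate arrays.

```python
import numpy as np

def linear_layer_stats(A_mean, A_var, a_mean, a_var, s_mean, s_var):
    """Posterior mean and variance of y(t) = A s(t) + a when the sources,
    weights and biases are mutually independent a posteriori.

    A_mean, A_var: (n_hidden, n_sources) element-wise statistics of A
    a_mean, a_var: (n_hidden,) statistics of the bias a
    s_mean, s_var: (n_sources,) statistics of the sources s(t)
    """
    y_mean = A_mean @ s_mean + a_mean
    # Var{A_ji s_i(t)} = Abar^2 svar + Avar sbar^2 + Avar svar, from identity (23)
    y_var = (A_mean ** 2) @ s_var + A_var @ (s_mean ** 2) + A_var @ s_var + a_var
    return y_mean, y_var
```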
For computing the posterior mean of the output
$g_j(y_j(t))$ of a
hidden neuron, we shall utilise the second order Taylor series
expansion of $g_j$ around the posterior mean $\bar{y}_j(t)$ of its
input. This means that we approximate
\[
g_j(y_j(t)) \approx g_j(\bar{y}_j(t))
  + g_j'(\bar{y}_j(t)) \left[ y_j(t) - \bar{y}_j(t) \right]
  + \frac{1}{2}\, g_j''(\bar{y}_j(t)) \left[ y_j(t) - \bar{y}_j(t) \right]^2 . \qquad (24)
\]
Since the posterior mean of $y_j(t)$ is by definition $\bar{y}_j(t)$,
the second term vanishes when evaluating the posterior mean, while the
posterior mean of $\left[ y_j(t) - \bar{y}_j(t) \right]^2$ is by definition the
posterior variance $\tilde{y}_j(t)$.
We thus have
\[
\bar{g}_j(t) = g_j(\bar{y}_j(t)) + \frac{1}{2}\, g_j''(\bar{y}_j(t))\, \tilde{y}_j(t) . \qquad (25)
\]
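A small sketch of (25), assuming g = tanh (the choice of nonlinearity is my assumption here; any twice-differentiable g works the same way):

```python
import numpy as np

def hidden_mean(y_mean, y_var):
    """Posterior mean of g(y) according to Eq. (25), with g = tanh."""
    g = np.tanh(y_mean)
    d2g = -2.0 * g * (1.0 - g ** 2)      # second derivative of tanh
    return g + 0.5 * d2g * y_var
```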
The expansion was truncated at the second order because these are the
terms whose posterior mean can be expressed using only the posterior mean and
variance of the input. Higher order terms would require higher
order cumulants of the input, which would increase the
computational complexity with little extra benefit.
For the posterior variance of
$g_j(y_j(t))$, the second order expansion
would result in terms which need higher than second order knowledge
about the input. Therefore we shall use the first order Taylor
series expansion, which yields the following approximation for the
posterior variance of
$g_j(y_j(t))$:
\[
\tilde{g}_j(t) = \left[ g_j'(\bar{y}_j(t)) \right]^2 \tilde{y}_j(t) . \qquad (26)
\]
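And correspondingly for (26), under the same tanh assumption:

```python
import numpy as np

def hidden_var(y_mean, y_var):
    """Posterior variance of g(y) according to Eq. (26), with g = tanh."""
    dg = 1.0 - np.tanh(y_mean) ** 2      # first derivative of tanh
    return dg ** 2 * y_var
```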
The next step is to compute the mean and variance of the output after
the second layer mapping. The outputs are given by
$f_k(t) = \sum_j B_{kj}\, g_j(t) + b_k$.
The equation for the posterior mean $\bar{f}_k(t)$
is analogous to that of $\bar{y}_j(t)$:
\[
\bar{f}_k(t) = \sum_j \bar{B}_{kj}\, \bar{g}_j(t) + \bar{b}_k . \qquad (27)
\]
The equation for the posterior variance $\tilde{f}_k(t)$ is more
complicated than that of $\tilde{y}_j(t)$, however, since the $s_i(t)$ are
independent a posteriori but the $g_j(t)$ are not. This is because each
$s_i(t)$ affects several -- potentially all -- $g_j(t)$. In other
words, each $s_i(t)$ affects each $f_k(t)$ through several paths which
interfere. This interference needs to be taken into account when
computing the posterior variance of $f_k(t)$.
We shall use a first order approximation of the mapping
$\mathbf{f}(\mathbf{s}(t))$ for measuring the interference. This is
consistent with the first order approximation of the nonlinearities
$g_j$ and yields the following equation for the posterior variance of
$f_k(t)$:
\[
\tilde{f}_k(t) = \sum_i \left[ \frac{\partial f_k(t)}{\partial s_i(t)} \right]^2 \tilde{s}_i(t)
  + \sum_j \left[ \bar{B}_{kj}^2\, \tilde{g}^{*}_j(t)
  + \tilde{B}_{kj}\, \bar{g}_j^2(t) + \tilde{B}_{kj}\, \tilde{g}_j(t) \right] + \tilde{b}_k ,
\]
where the posterior means of the partial derivatives are obtained by the
chain rule,
\[
\frac{\partial f_k(t)}{\partial s_i(t)} = \sum_j \bar{B}_{kj}\, g_j'(\bar{y}_j(t))\, \bar{A}_{ji} ,
\]
and $\tilde{g}^{*}_j(t)$ denotes the posterior variance of $g_j(t)$ without
the contribution from the sources. It can be computed as follows:
\[
\tilde{g}^{*}_j(t) = \left[ g_j'(\bar{y}_j(t)) \right]^2 \tilde{y}^{*}_j(t) ,
\qquad
\tilde{y}^{*}_j(t) = \sum_i \left[ \tilde{A}_{ji}\, \bar{s}_i^2(t) + \tilde{A}_{ji}\, \tilde{s}_i(t) \right] + \tilde{a}_j .
\]
Notice that $\tilde{s}_i(t)$ appears in the expression for
$\tilde{y}^{*}_j(t)$ and $\tilde{g}_j(t)$ in the expression for
$\tilde{f}_k(t)$. These terms do not
contribute to the interference, however, because they are the parts which
are randomised by the multiplication with $A_{ji}$ or $B_{kj}$, and
randomising the phase destroys the interference, to use an analogy
from physics.
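To tie the steps together, here is a minimal NumPy sketch (not from the original text) that propagates posterior means and variances through the whole mapping f(s(t)) = B g(A s(t) + a) + b, using tanh for g and the interference-aware variance formula above; the function and array names are my own.

```python
import numpy as np

def mlp_output_stats(A_m, A_v, a_m, a_v, B_m, B_v, b_m, b_v, s_m, s_v):
    """Propagate posterior means (_m) and variances (_v) through
    f(s) = B tanh(A s + a) + b, following the approximations above.
    The tanh nonlinearity and the array names are assumptions of this sketch."""
    # First layer: y = A s + a
    y_m = A_m @ s_m + a_m
    y_v = (A_m ** 2) @ s_v + A_v @ (s_m ** 2 + s_v) + a_v
    # Variance of y without the contribution from the sources (used for g*_j)
    y_v_star = A_v @ (s_m ** 2 + s_v) + a_v
    # Hidden nonlinearity, Eqs. (25)-(26)
    g = np.tanh(y_m)
    dg = 1.0 - g ** 2
    d2g = -2.0 * g * dg
    g_m = g + 0.5 * d2g * y_v
    g_v = dg ** 2 * y_v
    g_v_star = dg ** 2 * y_v_star
    # Second layer mean, Eq. (27)
    f_m = B_m @ g_m + b_m
    # Interference term: posterior means of df_k/ds_i = sum_j B_kj g'_j A_ji
    J = (B_m * dg) @ A_m                 # shape (n_outputs, n_sources)
    f_v = (J ** 2) @ s_v + (B_m ** 2) @ g_v_star + B_v @ (g_m ** 2 + g_v) + b_v
    return f_m, f_v
```

The matrix J holds the posterior means of the partial derivatives, so the first term of f_v is the source contribution measured through all interfering paths, while the remaining terms collect the contributions that are randomised by the weights and biases.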
Harri Lappalainen
2000-03-03