Next: Update Rules Up: Cost Function Previous: Terms of the Cost

Posterior Mean and Variance of f(s(t))

This section describes how to compute the posterior mean and variance of the outputs fk(s(t)) of the MLP network. Ordinarily the inputs, weights and biases of an MLP network have fixed values. Here the inputs s(t), the weights A, B and the biases a, b have posterior distributions, which means that the outputs also have a posterior distribution. One way to evaluate the posterior mean and variance is to propagate distributions, rather than fixed values, through the network. Whole distributions would be quite tricky to deal with, and therefore we shall characterise the distributions by their mean and variance only.

The sources have mixture-of-Gaussians distributions for which it is easy to compute the mean and variance:

$\bar{s}_i(t) = \sum_l \dot{s}_{il}(t) \bar{s}_{il}(t)$ (19)

$\tilde{s}_i(t) = \sum_l \dot{s}_{il}(t) [\tilde{s}_{il}(t) + (\bar{s}_{il}(t) - \bar{s}_i(t))^2] \, .$ (20)
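As a concrete illustration of Eqs. (19)-(20), the mixture statistics can be computed with a few array operations. This NumPy sketch is not part of the original derivation; the function name and the flat array layout (one entry per mixture component l) are assumptions made here for illustration.

```python
import numpy as np

def mog_mean_var(pi, mu, var):
    """Posterior mean and variance of a mixture-of-Gaussians source.

    pi  -- mixing proportions (s-dot_il in Eq. 19-20), shape (L,)
    mu  -- component means (s-bar_il), shape (L,)
    var -- component variances (s-tilde_il), shape (L,)
    """
    mean = np.sum(pi * mu)                           # Eq. (19)
    variance = np.sum(pi * (var + (mu - mean) ** 2))  # Eq. (20)
    return mean, variance
```

For example, two equally weighted, zero-variance components at -1 and +1 give mean 0 and variance 1, the between-component spread alone.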

The sources are then multiplied by the first layer weight matrix A, and the bias a is added. Let us denote the result by $y_j(t) = a_j + \sum_i A_{ji} s_i(t)$. Since the sources, weights and biases are all mutually independent a posteriori, the following equations hold:
$\bar{y}_j(t) = \bar{a}_j + \sum_i \bar{A}_{ji} \bar{s}_i(t)$ (21)

$\tilde{y}_j(t) = \tilde{a}_j + \sum_i \left( \bar{A}_{ji}^2 \tilde{s}_i(t) + \tilde{A}_{ji} [\bar{s}_i^2(t) + \tilde{s}_i(t)] \right) \, .$ (22)

Equation (22) follows from the identity

\begin{displaymath}\mathrm{var}(\alpha) = \langle\alpha^2\rangle - \langle\alpha\rangle^2 \, .
\end{displaymath} (23)
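Equations (21)-(22) amount to two matrix-vector products per layer. The sketch below, with names and array shapes of my own choosing, shows the propagation through the affine part of a layer; it relies only on the mutual posterior independence of A, a and s assumed in the text.

```python
import numpy as np

def affine_mean_var(A_mean, A_var, a_mean, a_var, s_mean, s_var):
    """Propagate posterior mean and variance through y = a + A s
    when A, a and s are mutually independent a posteriori.

    A_mean, A_var -- elementwise mean/variance of A, shape (J, I)
    a_mean, a_var -- mean/variance of the bias a, shape (J,)
    s_mean, s_var -- mean/variance of the sources s, shape (I,)
    """
    y_mean = a_mean + A_mean @ s_mean                       # Eq. (21)
    y_var = a_var + A_mean**2 @ s_var + A_var @ (s_mean**2 + s_var)  # Eq. (22)
    return y_mean, y_var
```

A quick sanity check against the identity (23): for scalar A with mean 2, variance 1 and scalar s with mean 3, variance 4, var(As) = <A^2><s^2> - <A>^2<s>^2 = 5·13 - 4·9 = 29, which the formula reproduces.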

For computing the posterior mean of the output gj(yj(t)) of a hidden neuron, we shall utilise the second order Taylor series expansion of gj around the posterior mean $\bar{y}_j(t)$ of its input. This means that we approximate
$g_j(y_j(t)) \approx g_j(\bar{y}_j(t)) + (y_j(t) - \bar{y}_j(t)) g'_j(\bar{y}_j(t)) + \frac{1}{2} (y_j(t) - \bar{y}_j(t))^2 g''_j(\bar{y}_j(t)) \, .$ (24)

Since the posterior mean of yj(t) is by definition $\bar{y}_j(t)$, the second term vanishes when evaluating the posterior mean, while the posterior mean of $(y_j(t) - \bar{y}_j(t))^2$ is by definition the posterior variance $\tilde{y}_j(t)$. We thus have

$\bar{g}_j(y_j(t)) \approx g_j(\bar{y}_j(t)) + \frac{1}{2} \tilde{y}_j(t) g''_j(\bar{y}_j(t)) \, .$ (25)

The second order expansion was chosen because its terms are exactly those whose posterior mean can be expressed in terms of the posterior mean and variance of the input. Higher order terms would have required higher order cumulants of the input, which would have increased the computational complexity with little extra benefit.

For the posterior variance of gj(yj(t)), the second order expansion would introduce terms requiring higher than second order knowledge of the inputs. Therefore we shall use the first order Taylor series expansion, which yields the following approximation for the posterior variance of gj(yj(t)):

$\tilde{g}_j(y_j(t)) \approx [g'_j(\bar{y}_j(t))]^2 \tilde{y}_j(t) \, .$ (26)
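Equations (25)-(26) only need the activation function and its first two derivatives at the input mean. The sketch below uses tanh purely as an illustrative choice of gj (the derivation in the text holds for any twice-differentiable activation); the function name is an assumption made here.

```python
import numpy as np

def tanh_mean_var(y_mean, y_var):
    """Approximate posterior mean (Eq. 25, second order) and variance
    (Eq. 26, first order) of g(y) = tanh(y), given the posterior mean
    and variance of the input y. tanh is an illustrative activation."""
    g = np.tanh(y_mean)
    g1 = 1.0 - g**2          # g'(y)  for g = tanh
    g2 = -2.0 * g * g1       # g''(y) for g = tanh
    g_mean = g + 0.5 * y_var * g2   # Eq. (25)
    g_var = g1**2 * y_var           # Eq. (26)
    return g_mean, g_var
```

At $\bar{y} = 0$ the curvature g'' vanishes and g' = 1, so the input mean and variance pass through unchanged, as expected from (25)-(26).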

The next step is to compute the mean and variance of the output after the second layer mapping. The outputs are given by $f_k(t) = b_k + \sum_j B_{kj} g_j(t)$. The equation for the posterior mean $\bar{f}_k(t)$ is similar to (21):

$\bar{f}_k(t) = \bar{b}_k + \sum_j \bar{B}_{kj} \bar{g}_j(t) \, .$ (27)

The equation for the posterior variance $\tilde{f}_k(t)$ is more complicated than (22), however, since the si(t) are independent a posteriori but the gj(t) are not. This is because each si(t) affects several -- potentially all -- gj(t). In other words, each si(t) affects each fk(t) through several paths which interfere. This interference needs to be taken into account when computing the posterior variance of fk(t).

We shall use a first order approximation of the mapping f(s(t)) for measuring the interference. This is consistent with the first order approximation of the nonlinearities gj and yields the following equation for the posterior variance of fk(t):

$\tilde{f}_k(t) \approx \sum_i \left( \frac{\partial f_k(t)}{\partial s_i(t)} \right)^2 \tilde{s}_i(t) + \tilde{b}_k + \sum_j \left( \bar{B}_{kj}^2 \tilde{g}^*_j(t) + \tilde{B}_{kj} [\bar{g}_j^2(t) + \tilde{g}_j(t)] \right) \, ,$ (28)

where the posterior means of the partial derivatives are obtained by the chain rule
$\frac{\partial f_k(t)}{\partial s_i(t)} = \sum_j \frac{\partial f_k(t)}{\partial g_j(t)} \frac{\partial g_j(t)}{\partial y_j(t)} \frac{\partial y_j(t)}{\partial s_i(t)} = \sum_j \bar{B}_{kj} g'_j(\bar{y}_j(t)) \bar{A}_{ji}$ (29)

and $\tilde{g}^*_j(t)$ denotes the posterior variance of gj(t) without the contribution from the sources. It can be computed as follows:
$\tilde{y}^*_j(t) = \tilde{a}_j + \sum_i \tilde{A}_{ji} [\bar{s}_i^2(t) + \tilde{s}_i(t)]$ (30)

$\tilde{g}^*_j(t) \approx [g'_j(\bar{y}_j(t))]^2 \tilde{y}^*_j(t) \, .$ (31)
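Putting Eqs. (27)-(31) together, the output statistics can be assembled from quantities already computed in the earlier steps. The sketch below is an illustrative assembly of the second layer only, with argument names and shapes assumed here: it takes the hidden-layer statistics (g_mean, g_var, the source-free variance g_var_star of Eq. 31, and the derivatives g' at the input means) as given, forms the Jacobian means of Eq. (29), and evaluates Eqs. (27)-(28).

```python
import numpy as np

def output_mean_var(B_mean, B_var, b_mean, b_var,
                    A_mean, gprime, s_var,
                    g_mean, g_var, g_var_star):
    """Posterior mean (Eq. 27) and variance (Eq. 28) of the outputs
    f = b + B g(y), with the interference of the paths from each
    source s_i handled by the chain-rule derivatives of Eq. (29).

    Shapes: B_* (K, J); b_* (K,); A_mean (J, I); gprime, g_* (J,);
    s_var (I,).
    """
    f_mean = b_mean + B_mean @ g_mean                 # Eq. (27)
    # Posterior means of the partial derivatives df_k/ds_i, Eq. (29):
    # sum_j B_kj g'_j(y-bar_j) A_ji, computed as a (K, I) matrix.
    J = (B_mean * gprime) @ A_mean
    f_var = (J**2 @ s_var                             # interfering source paths
             + b_var
             + B_mean**2 @ g_var_star                 # Eq. (31) contribution
             + B_var @ (g_mean**2 + g_var))           # Eq. (28), last term
    return f_mean, f_var
```

In the degenerate one-source, one-hidden-unit case with deterministic B and b, the variance reduces to the squared chain-rule derivative times the source variance, matching the first term of (28).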

Notice that $\tilde{s}_i(t)$ appears in (30) and $\tilde{g}_j(t)$ appears in (28). These terms do not contribute to the interference, however, because they are the parts which are randomised by multiplication with Aji or Bkj, and randomising the phase destroys interference, to use an analogy from physics.

Harri Lappalainen