

Probabilistic computations for MLP networks

In this appendix, it is shown how to evaluate the distribution of the outputs of an MLP network $ \mathbf{f}$, assuming the inputs $ \mathbf{s}$ and all the network weights are independent and Gaussian. These equations are needed in evaluating the ensemble learning cost function of the Bayesian NSSM in Section 6.2.

The exact model for our single-hidden-layer MLP network $ \mathbf{f}$ is

$\displaystyle \mathbf{f}(\mathbf{s}) = \mathbf{B}\boldsymbol{\varphi}(\mathbf{A}\mathbf{s}+ \mathbf{a}) + \mathbf{b}.$ (B.1)

The structure is shown in Figure B.1. All the parameters are assumed to be independent, and their distributions are

$\displaystyle s_i$ $\displaystyle \sim N(\overline{s}_i, \widetilde{s}_i)$ (B.2)
$\displaystyle A_{ij}$ $\displaystyle \sim N(\overline{A}_{ij}, \widetilde{A}_{ij})$ (B.3)
$\displaystyle B_{ij}$ $\displaystyle \sim N(\overline{B}_{ij}, \widetilde{B}_{ij})$ (B.4)
$\displaystyle a_{i}$ $\displaystyle \sim N(\overline{a}_i, \widetilde{a}_i)$ (B.5)
$\displaystyle b_{i}$ $\displaystyle \sim N(\overline{b}_i, \widetilde{b}_i).$ (B.6)

Figure B.1: The structure of an MLP network with one hidden layer.

The computations in the first layer of the network can be written as $ y_i = a_i + \sum_j A_{ij} s_j$. Since all the parameters involved are independent, the mean and the variance of $ y_i$ are

$\displaystyle \overline{y}_i$ $\displaystyle = \overline{a}_i + \sum_j \overline{A}_{ij} \overline{s}_j$ (B.7)
$\displaystyle \widetilde{y}_i$ $\displaystyle = \widetilde{a}_i + \sum_j \left[ \overline{A}_{ij}^2 \widetilde{s}_j + \widetilde{A}_{ij} \left( \overline{s}_j^2 + \widetilde{s}_j \right) \right].$ (B.8)

Equation (B.8) follows from the identity

$\displaystyle \operatorname{Var}[ \alpha ] = \operatorname{E}[ \alpha^2 ] - \operatorname{E}[ \alpha ]^2.$ (B.9)
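
For concreteness, the propagation of the means and variances through the first layer in Equations (B.7)-(B.8) could be implemented for instance as follows. This is a minimal NumPy sketch; the function and variable names are illustrative only and not part of the actual implementation.

import numpy as np

def first_layer_stats(s_mean, s_var, A_mean, A_var, a_mean, a_var):
    """Mean and variance of y = A s + a, Eqs. (B.7)-(B.8)."""
    # Eq. (B.7): mean of y_i = a_i + sum_j A_ij s_j
    y_mean = a_mean + A_mean @ s_mean
    # Eq. (B.8): variance, using Var[alpha] = E[alpha^2] - E[alpha]^2 (Eq. B.9)
    y_var = a_var + (A_mean ** 2) @ s_var + A_var @ (s_mean ** 2 + s_var)
    return y_mean, y_var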

The nonlinear activation function is handled with a truncated Taylor series approximation about the mean $ \overline{y}_i$ of the inputs. Using a second-order approximation for the mean and a first-order approximation for the variance yields

$\displaystyle \overline{\varphi}(y_i)$ $\displaystyle \approx \varphi(\overline{y}_i) + \frac{1}{2} \varphi''(\overline{y}_i) \widetilde{y}_i$ (B.10)
$\displaystyle \widetilde{\varphi}(y_i)$ $\displaystyle \approx \left[ \varphi'(\overline{y}_i) \right]^2 \widetilde{y}_i.$ (B.11)

These approximations are used because they are the best ones that can be expressed using only the mean and variance of the input.
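
For the hyperbolic tangent, which is assumed here as a concrete choice of $ \varphi$ (any twice-differentiable activation works in the same way), the approximations (B.10)-(B.11) could be computed for example as follows; the names are again illustrative.

import numpy as np

def phi_stats(y_mean, y_var):
    """Approximate mean and variance of phi(y) for phi = tanh, Eqs. (B.10)-(B.11)."""
    phi = np.tanh(y_mean)
    d_phi = 1.0 - phi ** 2           # phi'(y)  = 1 - tanh(y)^2
    dd_phi = -2.0 * phi * d_phi      # phi''(y) = -2 tanh(y) (1 - tanh(y)^2)
    # Eq. (B.10): second-order approximation of the mean
    phi_mean = phi + 0.5 * dd_phi * y_var
    # Eq. (B.11): first-order approximation of the variance
    phi_var = d_phi ** 2 * y_var
    return phi_mean, phi_var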

The computations of the output layer are given by $ f_i(\mathbf{s}) = b_i + \sum_j B_{ij} \varphi(y_j)$. This may look the same as the computation in the first layer, but there is an important difference: the $ y_j$ are no longer independent. Their dependence does not, however, affect the evaluation of the mean of the outputs, which is

$\displaystyle \overline{f}_i(\mathbf{s}) = \overline{b}_i + \sum_j \overline{B}_{ij} \overline{\varphi}(y_j).$ (B.12)

The dependence between the $ y_j$ arises from the fact that each $ s_i$ may potentially affect all of them. Hence, the variances of $ s_i$ would be taken into account incorrectly if the $ y_j$ were assumed to be independent.

Two possibly interfering paths can be seen in Figure B.1. Let us assume that the net weight of the left path is $ 1$ and that of the right path is $ -1$, so that the two paths cancel each other out. If, however, the outputs of the hidden layer are incorrectly assumed to be independent, the estimated variance of the output will be greater than zero. The same effect can also occur the other way round, when constructive interference between the two paths leads to an underestimated output variance.
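
As a small numerical illustration of this point, consider a toy setting (assumed here for illustration) with a linear activation, deterministic weights and a single input of unit variance, where the two paths of Figure B.1 have net weights $ +1$ and $ -1$:

# Two hidden paths with net weights +1 and -1 that cancel at the output
A = [1.0, 1.0]           # first-layer weights of the two paths
B = [1.0, -1.0]          # output-layer weights of the two paths
s_var = 1.0              # variance of the single input s

# Assuming the hidden units independent overestimates the output variance:
naive_var = sum(b**2 * a**2 * s_var for a, b in zip(A, B))   # = 2.0
# The Jacobian-based rule of Eq. (B.13) recovers the exact value:
jacobian = sum(b * a for a, b in zip(A, B))                  # = 0.0
exact_var = jacobian**2 * s_var                              # = 0.0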

The effects of different components of the inputs on the outputs can be measured using the Jacobian matrix $ \partial \mathbf{f}(\mathbf{s}) / \partial \mathbf{s}$ of the mapping $ \mathbf{f}$ with elements $ (\partial f_i(\mathbf{s}) / \partial s_j)$. This leads to the approximation for the output variance

\begin{displaymath}\begin{split}\widetilde{f}_i(\mathbf{s}) \approx &\sum_j \left[ \frac{\partial f_i(\mathbf{s})}{\partial s_j} \right]^2 \widetilde{s}_j + \widetilde{b}_i \\ &+ \sum_j \left[ \overline{B}_{ij}^2 \widetilde{\varphi}^*(y_j) + \widetilde{B}_{ij} \left( \overline{\varphi}^2(y_j) + \widetilde{\varphi}(y_j) \right) \right] \end{split}\end{displaymath} (B.13)

where $ \widetilde{\varphi}^*(y_j)$ denotes the posterior variance of $ \varphi(y_j)$ without the contribution of the input variance. It can be computed as

$\displaystyle \widetilde{y}_i^*$ $\displaystyle = \widetilde{a}_i + \sum_j \widetilde{A}_{ij} \left( \overline{s}_j^2 + \widetilde{s}_j \right)$ (B.14)
$\displaystyle \widetilde{\varphi}^*(y_i)$ $\displaystyle \approx \left[ \varphi'(\overline{y}_i) \right]^2 \widetilde{y}_i^*.$ (B.15)

The required partial derivatives can be evaluated efficiently at the mean of the inputs using the chain rule

$\displaystyle \frac{\partial f_i(\mathbf{s})}{\partial s_j} = \sum_k \frac{\partial f_i(\mathbf{s})}{\partial y_k} \frac{\partial y_k}{\partial s_j} = \sum_k \overline{B}_{ik} \varphi'(\overline{y}_k) \overline{A}_{kj}.$ (B.16)
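
Putting the pieces together, the output mean and variance of Equations (B.12)-(B.16) could be evaluated along the following lines. The sketch reuses the first_layer_stats and phi_stats helpers above and again assumes a tanh activation; all names are illustrative rather than part of the actual implementation.

import numpy as np

def mlp_output_stats(s_mean, s_var, A_mean, A_var, a_mean, a_var,
                     B_mean, B_var, b_mean, b_var):
    """Approximate mean and variance of f(s), Eqs. (B.12)-(B.16)."""
    y_mean, y_var = first_layer_stats(s_mean, s_var, A_mean, A_var, a_mean, a_var)
    phi_mean, phi_var = phi_stats(y_mean, y_var)

    # Eqs. (B.14)-(B.15): variance of phi(y) without the input contribution
    y_var_star = a_var + A_var @ (s_mean ** 2 + s_var)
    d_phi = 1.0 - np.tanh(y_mean) ** 2
    phi_var_star = d_phi ** 2 * y_var_star

    # Eq. (B.12): output mean
    f_mean = b_mean + B_mean @ phi_mean

    # Eq. (B.16): Jacobian df_i/ds_j = sum_k B_ik phi'(y_k) A_kj
    J = B_mean @ (d_phi[:, None] * A_mean)

    # Eq. (B.13): output variance; the input variance enters via the Jacobian
    f_var = ((J ** 2) @ s_var + b_var
             + (B_mean ** 2) @ phi_var_star
             + B_var @ (phi_mean ** 2 + phi_var))
    return f_mean, f_var

Dropping the first term of (B.13) and using $ \widetilde{\varphi}(y_j)$ in place of $ \widetilde{\varphi}^*(y_j)$ would correspond to the naive independence assumption criticised above.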

