

Probabilistic computations for MLP networks

In this appendix, it is shown how to evaluate the distribution of the outputs of an MLP network $ \mathbf{f}$, assuming the inputs $ \mathbf{s}$ and all the network weights are independent and Gaussian. These equations are needed in evaluating the ensemble learning cost function of the Bayesian NSSM in Section 6.2.

The exact model for our single-hidden-layer MLP network $ \mathbf{f}$ is

$\displaystyle \mathbf{f}(\mathbf{s}) = \mathbf{B}\boldsymbol{\varphi}(\mathbf{A}\mathbf{s}+ \mathbf{a}) + \mathbf{b}.$ (B.1)

The structure is shown in Figure B.1. All the parameters are assumed to be independent, and their distributions are

$\displaystyle s_i$ $\displaystyle \sim N(\overline{s}_i, \widetilde{s}_i)$ (B.2)
$\displaystyle A_{ij}$ $\displaystyle \sim N(\overline{A}_{ij}, \widetilde{A}_{ij})$ (B.3)
$\displaystyle B_{ij}$ $\displaystyle \sim N(\overline{B}_{ij}, \widetilde{B}_{ij})$ (B.4)
$\displaystyle a_{i}$ $\displaystyle \sim N(\overline{a}_i, \widetilde{a}_i)$ (B.5)
$\displaystyle b_{i}$ $\displaystyle \sim N(\overline{b}_i, \widetilde{b}_i).$ (B.6)

Figure B.1: The structure of an MLP network with one hidden layer.

The computations in the first layer of the network can be written as $ y_i = a_i + \sum_j A_{ij} s_j$. Since all the parameters involved are independent, the mean and the variance of $ y_i$ are

$\displaystyle \overline{y}_i$ $\displaystyle = \overline{a}_i + \sum_j \overline{A}_{ij} \overline{s}_j$ (B.7)
$\displaystyle \widetilde{y}_i$ $\displaystyle = \widetilde{a}_i + \sum_j \left[ \overline{A}_{ij}^2 \widetilde{s}_j + \widetilde{A}_{ij} \left( \overline{s}_j^2 + \widetilde{s}_j \right) \right].$ (B.8)

Equation (B.8) follows from the identity

$\displaystyle \operatorname{Var}[ \alpha ] = \operatorname{E}[ \alpha^2 ] - \operatorname{E}[ \alpha ]^2.$ (B.9)
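
For concreteness, the propagation of the means and variances through the first layer in Equations (B.7)-(B.8) could be implemented for instance as follows. This is a minimal NumPy sketch; the function and variable names are illustrative only and not part of the actual implementation.

import numpy as np

def first_layer_stats(s_mean, s_var, A_mean, A_var, a_mean, a_var):
    """Mean and variance of y = A s + a, Eqs. (B.7)-(B.8)."""
    # Eq. (B.7): mean of y_i = a_i + sum_j A_ij s_j
    y_mean = a_mean + A_mean @ s_mean
    # Eq. (B.8): variance, using Var[alpha] = E[alpha^2] - E[alpha]^2 (Eq. B.9)
    y_var = a_var + (A_mean ** 2) @ s_var + A_var @ (s_mean ** 2 + s_var)
    return y_mean, y_var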

The nonlinear activation function is handled with a truncated Taylor series approximation about the mean $ \overline{y}_i$ of the inputs. Using a second-order approximation for the mean and a first-order approximation for the variance yields

$\displaystyle \overline{\varphi}(y_i)$ $\displaystyle \approx \varphi(\overline{y}_i) + \frac{1}{2} \varphi''(\overline{y}_i) \widetilde{y}_i$ (B.10)
$\displaystyle \widetilde{\varphi}(y_i)$ $\displaystyle \approx \left[ \varphi'(\overline{y}_i) \right]^2 \widetilde{y}_i.$ (B.11)

These approximations are used because they are the best ones that can be expressed using only the mean and variance of the input.
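
For the hyperbolic tangent, which is assumed here as a concrete choice of $ \varphi$ (any twice-differentiable activation works in the same way), the approximations (B.10)-(B.11) could be computed for example as follows; the names are again illustrative.

import numpy as np

def phi_stats(y_mean, y_var):
    """Approximate mean and variance of phi(y) for phi = tanh, Eqs. (B.10)-(B.11)."""
    phi = np.tanh(y_mean)
    d_phi = 1.0 - phi ** 2           # phi'(y)  = 1 - tanh(y)^2
    dd_phi = -2.0 * phi * d_phi      # phi''(y) = -2 tanh(y) (1 - tanh(y)^2)
    # Eq. (B.10): second-order approximation of the mean
    phi_mean = phi + 0.5 * dd_phi * y_var
    # Eq. (B.11): first-order approximation of the variance
    phi_var = d_phi ** 2 * y_var
    return phi_mean, phi_var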

The computations of the output layer are given by $ f_i(\mathbf{s}) = b_i + \sum_j B_{ij} \varphi(y_j)$. This may look the same as the computation in the first layer, but there is an important difference: the $ y_j$ are no longer independent. Their dependence does not, however, affect the evaluation of the mean of the outputs, which is

$\displaystyle \overline{f}_i(\mathbf{s}) = \overline{b}_i + \sum_j \overline{B}_{ij} \overline{\varphi}(y_j).$ (B.12)

The dependence between the $ y_j$ arises from the fact that each $ s_i$ may potentially affect all of them. Hence, the variances of $ s_i$ would be taken into account incorrectly if the $ y_j$ were assumed to be independent.

Two possibly interfering paths can be seen in Figure B.1. Let us assume that the net weight of the left path is $ 1$ and that of the right path is $ -1$, so that the two paths cancel each other out. If, however, the outputs of the hidden layer are incorrectly assumed to be independent, the estimated variance of the output will be greater than zero. The same effect can also occur the other way round, when constructive interference between the two paths leads to an underestimated output variance.
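
As a small numerical illustration of this point, consider a toy setting (assumed here for illustration) with a linear activation, deterministic weights and a single input of unit variance, where the two paths of Figure B.1 have net weights $ +1$ and $ -1$:

# Two hidden paths with net weights +1 and -1 that cancel at the output
A = [1.0, 1.0]           # first-layer weights of the two paths
B = [1.0, -1.0]          # output-layer weights of the two paths
s_var = 1.0              # variance of the single input s

# Assuming the hidden units independent overestimates the output variance:
naive_var = sum(b**2 * a**2 * s_var for a, b in zip(A, B))   # = 2.0
# The Jacobian-based rule of Eq. (B.13) recovers the exact value:
jacobian = sum(b * a for a, b in zip(A, B))                  # = 0.0
exact_var = jacobian**2 * s_var                              # = 0.0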

The effects of different components of the inputs on the outputs can be measured using the Jacobian matrix $ \partial \mathbf{f}(\mathbf{s}) / \partial \mathbf{s}$ of the mapping $ \mathbf{f}$ with elements $ (\partial f_i(\mathbf{s}) / \partial s_j)$. This leads to the approximation for the output variance

\begin{displaymath}\begin{split}\widetilde{f}_i(\mathbf{s}) \approx &\sum_j \left[ \frac{\partial f_i(\mathbf{s})}{\partial s_j} \right]^2 \widetilde{s}_j + \widetilde{b}_i \\ &+ \sum_j \left[ \overline{B}_{ij}^2 \widetilde{\varphi}^*(y_j) + \widetilde{B}_{ij} \left( \overline{\varphi}^2(y_j) + \widetilde{\varphi}(y_j) \right) \right] \end{split}\end{displaymath} (B.13)

where $ \widetilde{\varphi}^*(y_j)$ denotes the posterior variance of $ \varphi(y_j)$ without the contribution of the input variance. It can be computed as

$\displaystyle \widetilde{y}_i^*$ $\displaystyle = \widetilde{a}_i + \sum_j \widetilde{A}_{ij} \left( \overline{s}_j^2 + \widetilde{s}_j \right)$ (B.14)
$\displaystyle \widetilde{\varphi}^*(y_i)$ $\displaystyle \approx \left[ \varphi'(\overline{y}_i) \right]^2 \widetilde{y}_i^*.$ (B.15)

The required partial derivatives can be evaluated efficiently at the mean of the inputs using the chain rule

$\displaystyle \frac{\partial f_i(\mathbf{s})}{\partial s_j} = \sum_k \frac{\partial f_i(\mathbf{s})}{\partial y_k} \frac{\partial y_k}{\partial s_j} = \sum_k \overline{B}_{ik} \varphi'(\overline{y}_k) \overline{A}_{kj}.$ (B.16)
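
Putting the pieces together, the output mean and variance of Equations (B.12)-(B.16) could be evaluated along the following lines. The sketch reuses the first_layer_stats and phi_stats helpers above and again assumes a tanh activation; all names are illustrative rather than part of the actual implementation.

import numpy as np

def mlp_output_stats(s_mean, s_var, A_mean, A_var, a_mean, a_var,
                     B_mean, B_var, b_mean, b_var):
    """Approximate mean and variance of f(s), Eqs. (B.12)-(B.16)."""
    y_mean, y_var = first_layer_stats(s_mean, s_var, A_mean, A_var, a_mean, a_var)
    phi_mean, phi_var = phi_stats(y_mean, y_var)

    # Eqs. (B.14)-(B.15): variance of phi(y) without the input contribution
    y_var_star = a_var + A_var @ (s_mean ** 2 + s_var)
    d_phi = 1.0 - np.tanh(y_mean) ** 2
    phi_var_star = d_phi ** 2 * y_var_star

    # Eq. (B.12): output mean
    f_mean = b_mean + B_mean @ phi_mean

    # Eq. (B.16): Jacobian df_i/ds_j = sum_k B_ik phi'(y_k) A_kj
    J = B_mean @ (d_phi[:, None] * A_mean)

    # Eq. (B.13): output variance; the input variance enters via the Jacobian
    f_var = ((J ** 2) @ s_var + b_var
             + (B_mean ** 2) @ phi_var_star
             + B_var @ (phi_mean ** 2 + phi_var))
    return f_mean, f_var

Dropping the first term of (B.13) and using $ \widetilde{\varphi}(y_j)$ in place of $ \widetilde{\varphi}^*(y_j)$ would correspond to the naive independence assumption criticised above.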

