
Cost function

The cost function was already outlined in section 2.2; we can now go into more detail. Let us denote $X = \{\mathbf{x}(t) \vert t\}$ and $S = \{\mathbf{s}_1(t), \mathbf{s}_2(t) \vert t\}$, and let $\boldsymbol{\theta}$ denote all the unknown parameters of the model. For notational simplicity, let us denote all the unknown variables by $\boldsymbol{\xi} = \{S, \boldsymbol{\theta}\}$. The cost function is then the Kullback-Leibler divergence between an approximating density $Q(\boldsymbol{\xi})$ and the posterior density $P(\boldsymbol{\xi} \vert X)$:

$\begin{displaymath}C_{\mathrm{KL}} = \int d\boldsymbol{\xi} \, Q(\boldsymbol{\xi}) \log \frac{Q(\boldsymbol{\xi})}{P(\boldsymbol{\xi} \vert X)}. \end{displaymath}$

(5)

Equation 5 requires two things: the exact form of the posterior density $P(\boldsymbol{\xi} \vert X)$ and its parametric approximation $Q(\boldsymbol{\xi})$.
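As a toy illustration of equation 5, the following sketch evaluates the integral numerically for a single variable and compares it with the known closed form for two Gaussians. It assumes, purely for illustration, that both $Q$ and $P$ are one-dimensional Gaussians; in the model discussed here the true posterior is not Gaussian. The function names are hypothetical.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log-density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def kl_numeric(q_mean, q_var, p_mean, p_var, half_width=12.0, steps=20001):
    """Trapezoidal approximation of the integral of Q log(Q / P).

    Assumption: integrating over +/- half_width standard deviations of Q
    captures essentially all of the probability mass of Q.
    """
    lo = q_mean - half_width * math.sqrt(q_var)
    hi = q_mean + half_width * math.sqrt(q_var)
    h = (hi - lo) / (steps - 1)
    total = 0.0
    for k in range(steps):
        x = lo + k * h
        log_q = gaussian_log_pdf(x, q_mean, q_var)
        log_p = gaussian_log_pdf(x, p_mean, p_var)
        weight = 0.5 if k in (0, steps - 1) else 1.0  # trapezoid end points
        total += weight * math.exp(log_q) * (log_q - log_p)
    return total * h

def kl_closed_form(q_mean, q_var, p_mean, p_var):
    """Closed-form KL divergence between two 1-D Gaussians (Q relative to P)."""
    return 0.5 * (math.log(p_var / q_var)
                  + (q_var + (q_mean - p_mean) ** 2) / p_var
                  - 1.0)

numeric = kl_numeric(0.0, 0.5, 1.0, 2.0)
exact = kl_closed_form(0.0, 0.5, 1.0, 2.0)
print(numeric, exact)
```

The divergence is zero when $Q$ equals $P$ and grows as the two densities move apart, which is what makes it usable as a cost.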

According to Bayes' rule, the posterior pdf of the unknown variables $S$ and $\boldsymbol{\theta}$ is

$\begin{displaymath}P(S, \boldsymbol{\theta} \vert X) = \frac{P(X \vert S, \boldsymbol{\theta}) P(S \vert \boldsymbol{\theta}) P(\boldsymbol{\theta})}{P(X)}. \end{displaymath}$

The term $P(X \vert S,\boldsymbol{\theta})$ is obtained from equations 2-4: the distribution of the data is the same as that of the noise $\mathbf{n}(t)$, except that its mean is shifted by the prediction given by the model. Let us denote the vector of the means of $\mathbf{n}(t)$ by $\boldsymbol{\mu}$ and the vector of the variances by $\boldsymbol{\sigma}^2$. The distribution $P(x_i(t) \vert \mathbf{s}_1(t), \mathbf{s}_2(t), \boldsymbol{\theta})$ is thus Gaussian with mean $\mathbf{a}_{i\cdot} [\mathbf{f}( \mathbf{B} \mathbf{s}_2(t) + \mathbf{b}) + \mathbf{s}_1(t)]+\mu_i$ and variance $\sigma_i^2$. Here $\mathbf{a}_{i\cdot}$ denotes the $i$th row vector of $\mathbf{A}$. As usual, the noise components $n_i(t)$ are assumed to be independent, and therefore $P(X \vert S,\boldsymbol{\theta}) = \prod_{t,i} P(x_i(t) \vert \mathbf{s}_1(t), \mathbf{s}_2(t), \boldsymbol{\theta})$.
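The factorised likelihood can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the nonlinearity $\mathbf{f}$ is assumed here to be an element-wise tanh (the activation function itself is specified elsewhere in the paper), matrices are plain nested lists, and all function names are hypothetical.

```python
import math

def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def predicted_mean(A, B, b, mu, s1, s2, f=math.tanh):
    """Mean of x(t): A [f(B s2 + b) + s1] + mu.

    Assumption: f is applied element-wise; tanh is only a placeholder.
    """
    hidden = [f(u + b_k) + s1_k
              for u, b_k, s1_k in zip(mat_vec(B, s2), b, s1)]
    return [m + mu_i for m, mu_i in zip(mat_vec(A, hidden), mu)]

def log_likelihood(X, S1, S2, A, B, b, mu, sigma2, f=math.tanh):
    """log P(X | S, theta) as a sum over time steps t and components i."""
    total = 0.0
    for x, s1, s2 in zip(X, S1, S2):
        mean = predicted_mean(A, B, b, mu, s1, s2, f)
        for x_i, m_i, v_i in zip(x, mean, sigma2):
            # Log-density of a Gaussian with mean m_i and variance v_i.
            total += -0.5 * (math.log(2.0 * math.pi * v_i)
                             + (x_i - m_i) ** 2 / v_i)
    return total

# Tiny example: two observed components, two hidden units, one time step.
A = [[1.0, 0.0], [0.0, 2.0]]
B = [[1.0], [-1.0]]
b = [0.0, 0.0]
mu = [0.5, -0.5]
sigma2 = [1.0, 1.0]
print(log_likelihood([[0.5, -0.5]], [[0.0, 0.0]], [[0.0]], A, B, b, mu, sigma2))
```

Because the noise components are independent, the product over $t$ and $i$ becomes a plain double sum of log-densities.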

The terms $P(S \vert \boldsymbol{\theta})$ and $P(\boldsymbol{\theta})$ are also products of simple Gaussian distributions, and they are obtained directly from the definition of the model structure. The term $P(X)$ is not a function of any of the parameters of the model and can be neglected when minimising the cost function.
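As a worked step, substituting Bayes' rule into equation 5 and using $\int d\boldsymbol{\xi} \, Q(\boldsymbol{\xi}) = 1$ separates the cost function into one expectation per factor, where $E_Q\{\cdot\}$ denotes expectation over $Q(S, \boldsymbol{\theta})$:

$\begin{displaymath}C_{\mathrm{KL}} = E_Q \{ \log Q(S, \boldsymbol{\theta}) \} - E_Q \{ \log P(X \vert S, \boldsymbol{\theta}) \} - E_Q \{ \log P(S \vert \boldsymbol{\theta}) \} - E_Q \{ \log P(\boldsymbol{\theta}) \} + \log P(X). \end{displaymath}$

The last term does not depend on $Q$, so dropping it shifts the cost by a constant and leaves the minimiser unchanged.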

The approximation $Q(S,\boldsymbol{\theta})$ needs to be simple for mathematical tractability and computational efficiency. We assume that it is a Gaussian density with a diagonal covariance matrix. This means that the approximation is a product of independent distributions: $Q(\boldsymbol{\xi}) = \prod_i Q_i(\xi_i)$. The parameters of each $Q_i(\xi_i)$ are its mean and variance, which will be denoted by $\hat{\xi}_i$ and $\tilde{\xi}_i$, respectively.
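A short sketch of what the diagonal assumption buys: the joint log-density is a sum of one-dimensional terms, and the number of parameters of $Q$ grows linearly with the number of unknown variables instead of quadratically. The function names are hypothetical, and the comparison with a full-covariance Gaussian is added here only for illustration.

```python
import math

def log_q(xi, means, variances):
    """log Q(xi) = sum_i log Q_i(xi_i) for a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(xi, means, variances)
    )

def parameter_counts(n):
    """Parameters needed for n variables: (diagonal, full covariance).

    Diagonal: n means and n variances.
    Full: n means and n (n + 1) / 2 distinct covariance entries.
    """
    return 2 * n, n + n * (n + 1) // 2

print(log_q([0.1, -0.2], [0.0, 0.0], [1.0, 0.5]))
print(parameter_counts(1000))
```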

Both the posterior density $P(S, \boldsymbol{\theta} \vert X)$ and its approximation $Q(S,\boldsymbol{\theta})$ are products of simple Gaussian terms, which simplifies the cost function considerably: it splits into expectations of many simple terms. The terms of the form $E_Q \{ \log Q_i(\xi_i) \}$ are the negative entropies of Gaussians and have the values $-(1 + \log 2\pi\tilde{\xi}_i)/2$. The most difficult terms are of the form $-E_Q \{ \log P(x_i(t) \vert \mathbf{s}_1(t), \mathbf{s}_2(t), \boldsymbol{\theta} ) \}$. They are approximated by applying second-order Taylor series expansions of the nonlinear activation functions, as explained in [5]. The remaining terms are expectations of simple Gaussian terms, which can be computed as in [6].
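The three kinds of terms can be illustrated with a minimal sketch. It makes two simplifying assumptions that are not taken from the source: the mean and variance of the Gaussian term in `expected_neg_log_gaussian` are treated as fixed numbers (in the model they may themselves be unknown parameters), and the activation function is taken to be tanh. The Taylor step shows only the generic second-order estimate of an expectation, not the full procedure of [5]. All function names are hypothetical.

```python
import math

def negative_entropy(q_var):
    """E_Q{log Q_i(xi_i)} for a Gaussian Q_i with variance q_var."""
    return -0.5 * (1.0 + math.log(2.0 * math.pi * q_var))

def expected_neg_log_gaussian(q_mean, q_var, p_mean, p_var):
    """-E_Q{log N(xi; p_mean, p_var)} for xi ~ N(q_mean, q_var) under Q.

    Assumption: p_mean and p_var are fixed numbers, not random variables.
    Uses E_Q{(xi - p_mean)^2} = (q_mean - p_mean)^2 + q_var.
    """
    return (0.5 * math.log(2.0 * math.pi * p_var)
            + ((q_mean - p_mean) ** 2 + q_var) / (2.0 * p_var))

def taylor_mean_tanh(q_mean, q_var):
    """Second-order Taylor estimate of E_Q{tanh(xi)}.

    E{f(xi)} is approximated by f(m) + f''(m) * var / 2, with
    f''(x) = -2 tanh(x) (1 - tanh(x)^2) for f = tanh.
    """
    t = math.tanh(q_mean)
    second_derivative = -2.0 * t * (1.0 - t * t)
    return t + 0.5 * second_derivative * q_var

# One variable's contribution to the cost: entropy term plus Gaussian term.
print(negative_entropy(0.5) + expected_neg_log_gaussian(0.0, 0.5, 1.0, 2.0))
print(taylor_mean_tanh(0.5, 0.1))
```

Adding the first two functions for a single variable gives the Kullback-Leibler divergence between two one-dimensional Gaussians, so the sum is zero when the two densities coincide.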

The cost function $C_{\mathrm{KL}}$ is a function of $\hat{\xi}_i$ and $\tilde{\xi}_i$, i.e., the posterior means and variances of the latent variables and the parameters of the network. This is because learning estimates a whole distribution for the latent variables and the parameters rather than a point estimate.


Harri Lappalainen
1999-05-25