Next: Simulations Up: Multi-Layer Perceptrons as Previous: Model structure

# Cost function

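The cost function was already outlined in section 2.2. We can now go into more detail. Let us denote $X = (x(1), \ldots, x(T))$ and $S = (s(1), \ldots, s(T))$, and let $\theta$ denote all the unknown parameters of the model. For notational simplicity, let us denote all the unknown variables by $\xi = (S, \theta)$. The cost function is then

$$C_{\mathrm{KL}} = E_Q\left\{ \ln \frac{Q(\xi)}{P(\xi \mid X)} \right\} = \int Q(\xi) \ln \frac{Q(\xi)}{P(\xi \mid X)} \, d\xi \qquad (5)$$

The two things needed for equation 5 are the exact formulation of the posterior density $P(S, \theta \mid X)$ and its parametric approximation $Q(S, \theta)$.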
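As an illustration of what such a cost measures, the following minimal sketch evaluates the expectation of $\ln Q - \ln P$ under $Q$ numerically for a single unknown variable. It is a toy under stated assumptions, not the procedure used in this paper: the "posterior" is simply taken to be a known 1D Gaussian, and the names `gauss_logpdf`, `kl_cost_grid` and `kl_gauss` are hypothetical helpers.

```python
import math

def gauss_logpdf(x, mean, var):
    """Log density of a 1D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def kl_cost_grid(q_mean, q_var, log_p, half_width=10.0, n=20001):
    """Approximate E_Q{ln Q(xi) - ln P(xi | X)} with the trapezoidal rule.

    log_p is a callable returning the log posterior density at xi.
    The grid spans q_mean +/- half_width standard deviations of Q.
    """
    sd = math.sqrt(q_var)
    lo = q_mean - half_width * sd
    step = 2.0 * half_width * sd / (n - 1)
    total = 0.0
    for k in range(n):
        xi = lo + k * step
        log_q = gauss_logpdf(xi, q_mean, q_var)
        weight = 0.5 if k in (0, n - 1) else 1.0  # trapezoid end points
        total += weight * math.exp(log_q) * (log_q - log_p(xi))
    return total * step

def kl_gauss(m1, v1, m2, v2):
    """Closed-form KL(N(m1, v1) || N(m2, v2)), used here only as a check."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

# Toy check: an assumed Gaussian "posterior" N(1.0, 2.0) approximated by Q = N(0.5, 1.0).
numeric = kl_cost_grid(0.5, 1.0, lambda xi: gauss_logpdf(xi, 1.0, 2.0))
exact = kl_gauss(0.5, 1.0, 1.0, 2.0)
```

The two values should agree closely, and the cost drops to zero only when the approximation matches the posterior exactly.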

According to Bayes' rule, the posterior pdf of the unknown variables $S$ and $\theta$ is

$$P(S, \theta \mid X) = \frac{P(X \mid S, \theta) \, P(S \mid \theta) \, P(\theta)}{P(X)}$$

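The term $P(X \mid S, \theta)$ is obtained from equations 2-4; the distribution of the data is the same as for the noise $n(t)$ except that its mean is corrected by the prediction given by the model. The noise $n(t)$ has a vector of means and a vector of variances, with one entry of each per component. The distribution $P(x_i(t) \mid s(t), \theta)$ is thus Gaussian: its mean is the $i$th noise mean plus the model's prediction of $x_i(t)$, which involves $a_i$, the $i$th row vector of $A$, and its variance is the $i$th noise variance. As usual, the noise components $n_i(t)$ are assumed to be independent, and therefore $P(X \mid S, \theta) = \prod_{t,i} P(x_i(t) \mid s(t), \theta)$.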
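Because the noise components are independent and Gaussian, the logarithm of the data term is a plain sum over components and time steps. The sketch below shows this under the assumption that the model predictions are already available as numbers; the function name `data_log_likelihood` and its argument layout are hypothetical, and the model that produces the predictions (equations 2-4) is deliberately left out.

```python
import math

def data_log_likelihood(X, predictions, noise_mean, noise_var):
    """Sum of Gaussian log densities over time steps t and components i.

    X[t][i]            observed component i at time t
    predictions[t][i]  model prediction for that component (assumed given)
    noise_mean[i]      mean of noise component i
    noise_var[i]       variance of noise component i
    """
    total = 0.0
    for x_t, pred_t in zip(X, predictions):
        for x, pred, mu, var in zip(x_t, pred_t, noise_mean, noise_var):
            mean = pred + mu  # noise mean corrected by the model prediction
            total += -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)
    return total

# One time step, two components, observations exactly at the predicted means.
example = data_log_likelihood(
    X=[[0.0, 1.0]],
    predictions=[[0.0, 0.5]],
    noise_mean=[0.0, 0.5],
    noise_var=[1.0, 1.0],
)
```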

The terms $P(S \mid \theta)$ and $P(\theta)$ are also products of simple Gaussian distributions, and they are obtained directly from the definition of the model structure. The term $P(X)$ is not a function of any of the parameters of the model and can be neglected.

The approximation $Q(S, \theta)$ needs to be simple for mathematical tractability and computational efficiency. We assume that it is a Gaussian density with a diagonal covariance matrix. This means that the approximation is a product of independent distributions, one for each unknown variable $\xi_i$: $Q(\xi) = \prod_i Q_i(\xi_i)$. The parameters of each $Q_i(\xi_i)$ are the mean and variance, which will be denoted by $\bar{\xi}_i$ and $\tilde{\xi}_i$, respectively.
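A diagonal Gaussian approximation is fully described by one mean and one variance per unknown variable, and its log density is a sum of independent 1D terms. The following minimal sketch shows this factorization; the function name `log_q` and the list-based representation are assumptions made for illustration only.

```python
import math

def log_q(xi, means, variances):
    """ln Q(xi) for a diagonal Gaussian: a sum of independent 1D log densities."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(xi, means, variances)
    )

# Two unknown variables with their own means and variances.
joint = log_q([0.3, -1.0], [0.0, 0.5], [2.0, 0.1])
# The same value, obtained by treating each variable separately.
separate = log_q([0.3], [0.0], [2.0]) + log_q([-1.0], [0.5], [0.1])
```

The storage and the computation both grow linearly with the number of unknown variables, which is the practical benefit of dropping the off-diagonal covariances.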

Both the posterior density and its approximation are products of simple Gaussian terms, which simplifies the cost function considerably: it splits into expectations of many simple terms. The terms of the form $E_Q\{\ln Q_i(\xi_i)\}$ are the negative entropies of Gaussians and have the values $-(1 + \ln 2\pi\tilde{\xi}_i)/2$, where $\tilde{\xi}_i$ is the variance of $Q_i$. The most difficult terms are of the form $-E_Q\{\ln P(x_i(t) \mid s(t), \theta)\}$. They are approximated by applying second-order Taylor series expansions of the nonlinear activation functions, as explained in [5]. The remaining terms are expectations of simple Gaussian terms, which can be computed as in [6].
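The two kinds of terms can be sketched as follows. The first function is the closed-form negative entropy of a 1D Gaussian. The second propagates a Gaussian through a nonlinearity with a second-order Taylor expansion for the mean; it assumes a tanh activation and a first-order approximation for the output variance, both of which are illustrative choices and not necessarily the exact scheme of [5]. All function names are hypothetical.

```python
import math

def gaussian_negative_entropy(var):
    """E_Q{ln Q} for a 1D Gaussian with variance var: -(1 + ln(2*pi*var)) / 2."""
    return -0.5 * (1.0 + math.log(2.0 * math.pi * var))

def tanh_moments_taylor(mean, var):
    """Approximate mean and variance of tanh(a) for a ~ N(mean, var).

    Mean, second-order expansion around `mean`:
        E{f(a)} ~ f(mean) + 0.5 * f''(mean) * var
    Variance, first-order term only (an assumption of this sketch):
        Var{f(a)} ~ f'(mean)**2 * var
    """
    f = math.tanh(mean)
    d1 = 1.0 - f * f      # f'(a)  = 1 - tanh(a)^2
    d2 = -2.0 * f * d1    # f''(a) = -2 tanh(a) (1 - tanh(a)^2)
    return f + 0.5 * d2 * var, d1 * d1 * var

def tanh_mean_quadrature(mean, var, half_width=10.0, n=20001):
    """Reference value of E{tanh(a)} by the trapezoidal rule."""
    sd = math.sqrt(var)
    lo = mean - half_width * sd
    step = 2.0 * half_width * sd / (n - 1)
    total = 0.0
    for k in range(n):
        a = lo + k * step
        density = math.exp(-0.5 * (a - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)
        weight = 0.5 if k in (0, n - 1) else 1.0  # trapezoid end points
        total += weight * density * math.tanh(a)
    return total * step

approx_mean, approx_var = tanh_moments_taylor(0.5, 0.01)
reference_mean = tanh_mean_quadrature(0.5, 0.01)
```

For small input variances the second-order estimate of the mean is noticeably closer to the reference than simply evaluating the nonlinearity at the input mean; the approximation degrades as the variance grows.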

The cost function $C_{\mathrm{KL}}$ is a function of $\bar{\xi}$ and $\tilde{\xi}$, i.e., the posterior means and variances of the latent variables and the parameters of the network. This is because, instead of a point estimate, a whole distribution is estimated for the latent variables and the parameters during learning.
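To make the dependence on means and variances concrete, here is a deliberately tiny example with a single unknown variable, for which every expectation is available in closed form. The setup is an assumption chosen for illustration (a standard Gaussian prior and one observation with additive Gaussian noise) and is far simpler than the network model of this paper; `toy_cost` is a hypothetical name. The constant $\ln P(x)$ is dropped from the cost.

```python
import math

def toy_cost(m, v, x, noise_var):
    """Cost as a function of the posterior mean m and variance v of Q = N(m, v).

    Assumed toy model: prior xi ~ N(0, 1), observation x = xi + noise,
    noise ~ N(0, noise_var). The constant ln P(x) is dropped.
    """
    # E_Q{ln Q(xi)}: negative entropy of the approximation.
    neg_entropy = -0.5 * (1.0 + math.log(2.0 * math.pi * v))
    # -E_Q{ln P(x | xi)}, using E_Q{(x - xi)^2} = (x - m)^2 + v.
    likelihood = 0.5 * math.log(2.0 * math.pi * noise_var) + ((x - m) ** 2 + v) / (2.0 * noise_var)
    # -E_Q{ln P(xi)}, using E_Q{xi^2} = m^2 + v.
    prior = 0.5 * math.log(2.0 * math.pi) + (m * m + v) / 2.0
    return neg_entropy + likelihood + prior

x, noise_var = 1.2, 0.5
# Exact posterior of this toy model, known in closed form.
v_star = noise_var / (1.0 + noise_var)
m_star = x / (1.0 + noise_var)
best = toy_cost(m_star, v_star, x, noise_var)
```

In this toy case the cost is minimized when the mean and variance of the approximation equal those of the exact posterior, and moving either of them away from those values increases the cost.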

Harri Lappalainen
1999-05-25