The cost function was already outlined in section 2.2.
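We can now go into more detail. Let us denote the observed data by $X$ and the latent variables by $S$, and let $\theta$ denote all the unknown parameters of the model. For notational simplicity, let us denote all the unknown variables by $\xi = (S, \theta)$. The cost function is then

$$C_{KL} = E_q\left\{ \ln \frac{q(\xi)}{p(\xi \mid X)} \right\} = \int q(\xi) \ln \frac{q(\xi)}{p(\xi \mid X)} \, d\xi, \qquad (5)$$

where $q(\xi)$ is the parametric approximation of the posterior density $p(\xi \mid X)$.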

Equation 5 requires two things: the exact form of the posterior density $p(S, \theta \mid X)$ and its parametric approximation $q(S, \theta)$.

According to Bayes' rule, the posterior pdf of the unknown variables $S$ and $\theta$ is

$$p(S, \theta \mid X) = \frac{p(X \mid S, \theta)\, p(S \mid \theta)\, p(\theta)}{p(X)}.$$

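The likelihood term $p(X \mid S, \theta)$ is obtained from equations 2-4: given the latent variables and the parameters, the data have the same Gaussian distribution as the noise, centred on the model's prediction.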

The terms $p(S \mid \theta)$ and $p(\theta)$ are also products of simple Gaussian distributions and are obtained directly from the definition of the model structure. The term $p(X)$ is not a function of any of the parameters of the model and can be neglected.
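
To make explicit why the evidence term can be dropped, the cost can be expanded with Bayes' rule. This is a sketch under assumed notation: $q$ stands for the approximating density, $\theta$ for the model parameters, and the cost is taken to be the Kullback-Leibler divergence that the subscript of $C_{KL}$ suggests.

```latex
\begin{aligned}
C_{KL} &= E_q\left\{ \ln \frac{q(S,\theta)}{p(S,\theta \mid X)} \right\} \\
       &= E_q\{\ln q(S,\theta)\} - E_q\{\ln p(X \mid S,\theta)\}
          - E_q\{\ln p(S \mid \theta)\} - E_q\{\ln p(\theta)\} + \ln p(X)
\end{aligned}
```

The last term does not depend on $q$, so it shifts the cost by a constant and does not affect which approximation minimises it.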

The approximation $q(S, \theta)$ needs to be simple for mathematical tractability and computational efficiency. We assume that it is a Gaussian density with a diagonal covariance matrix. This means that the approximation is a product of independent distributions: $q(\xi) = \prod_i q_i(\xi_i)$, where $\xi_i$ runs over the individual unknown variables. The parameters of each $q_i(\xi_i)$ are the mean and the variance, which will be denoted by $\bar{\xi}_i$ and $\tilde{\xi}_i$, respectively.
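
The equivalence between a diagonal-covariance Gaussian and a product of independent one-dimensional Gaussians can be checked numerically. The following Python sketch is only an illustration; the function names are hypothetical and do not come from the original work.

```python
import math


def log_normal(x, mean, variance):
    """Log-density of a one-dimensional Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * variance) - (x - mean) ** 2 / (2.0 * variance)


def log_q(values, means, variances):
    """Log-density of the factorised approximation: a sum of independent terms."""
    return sum(log_normal(x, m, v) for x, m, v in zip(values, means, variances))


def log_q_joint_diagonal(values, means, variances):
    """The same density written as one multivariate Gaussian with diagonal covariance."""
    n = len(values)
    log_det = sum(math.log(v) for v in variances)
    mahalanobis = sum((x - m) ** 2 / v for x, m, v in zip(values, means, variances))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + mahalanobis)
```

Both functions return the same value for any inputs with positive variances, so the approximation is fully described by one mean and one variance per unknown variable.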

Both the posterior density and its approximation are products of simple Gaussian terms, which simplifies the cost function considerably: it splits into expectations of many simple terms. The terms of the form $E_q\{\ln q_i(\xi_i)\}$ are the negative entropies of Gaussians and have the values $-\frac{1}{2} \ln(2 \pi e \tilde{\xi}_i)$, where $\tilde{\xi}_i$ is the variance of $q_i$. The most difficult terms are of the form $-E_q\{\ln p(x \mid S, \theta)\}$, where $x$ is a single observed component and the nonlinear mapping appears inside the expectation. They are approximated by applying second-order Taylor series expansions of the nonlinear activation functions, as explained in [5]. The remaining terms are expectations of simple Gaussian terms and can be computed as in [6].
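
The three kinds of terms mentioned above can be illustrated with a small Python sketch. It is not the authors' implementation: the function names are hypothetical, the noise variance is held fixed for simplicity, and `tanh` is assumed as the activation function purely for the example.

```python
import math
import random


def gaussian_negative_entropy(variance):
    """E_q[ln q(x)] for a one-dimensional Gaussian q with the given variance."""
    return -0.5 * math.log(2.0 * math.pi * math.e * variance)


def expected_gaussian_nll(x, mean_bar, mean_tilde, noise_var):
    """E_q[-ln N(x; m, noise_var)] when m ~ N(mean_bar, mean_tilde) under q.

    noise_var is treated as a known constant (an assumption of this sketch).
    """
    expected_sq_error = (x - mean_bar) ** 2 + mean_tilde
    return 0.5 * math.log(2.0 * math.pi * noise_var) + expected_sq_error / (2.0 * noise_var)


def taylor2_mean_tanh(s_bar, s_tilde):
    """Second-order Taylor approximation of E[tanh(s)] for s ~ N(s_bar, s_tilde)."""
    f = math.tanh(s_bar)
    second_derivative = -2.0 * f * (1.0 - f * f)  # d^2/ds^2 tanh(s) at s_bar
    return f + 0.5 * second_derivative * s_tilde


def monte_carlo_mean(fn, mean, variance, n=200_000, seed=0):
    """Sampling estimate of E[fn(s)] for s ~ N(mean, variance), used as a check."""
    rng = random.Random(seed)
    std = math.sqrt(variance)
    return sum(fn(rng.gauss(mean, std)) for _ in range(n)) / n
```

Comparing `taylor2_mean_tanh(s_bar, s_tilde)` with `monte_carlo_mean(math.tanh, s_bar, s_tilde)` for a small posterior variance gives a feel for how accurate the expansion is; the approximation degrades as the variance grows.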

The cost function $C_{KL}$ is a function of the posterior means $\bar{\xi}$ and variances $\tilde{\xi}$ of the latent variables and the parameters of the network. This is because, instead of a point estimate, a whole distribution is estimated for the latent variables and the parameters during learning.