The cost function was already outlined in section 2.2.
We can now go into more detail. Let us denote the observations by $X$, the latent variables (sources) by $S$, and let $\theta$ denote all the unknown parameters of the model. For notational simplicity, let us denote all the unknown variables by $\xi = (S, \theta)$.

The cost function is then
\[
C_{\mathrm{KL}} = \int q(\xi) \ln \frac{q(\xi)}{p(\xi \mid X)} \, d\xi .
\]
According to Bayes' rule, the posterior pdf of the unknown variables $S$ and $\theta$ is
\[
p(S, \theta \mid X) = \frac{p(X \mid S, \theta)\, p(S \mid \theta)\, p(\theta)}{p(X)} .
\]
The terms $p(X \mid S, \theta)$ and $p(S \mid \theta)\, p(\theta)$ are also products of simple Gaussian distributions, and they are obtained directly from the definition of the model structure. The term $p(X)$ is not a function of any of the parameters of the model and can be neglected.
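Spelling out this substitution in the notation introduced above, the cost separates into expectations over $q$ plus a constant:
\[
C_{\mathrm{KL}} = \mathrm{E}\{\ln q(\xi)\}
 - \mathrm{E}\{\ln p(X \mid S, \theta)\}
 - \mathrm{E}\{\ln p(S \mid \theta)\}
 - \mathrm{E}\{\ln p(\theta)\}
 + \ln p(X),
\]
where the expectations are taken over $q(\xi)$ and the last term can be dropped because it does not depend on $\xi$.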
The approximation $q(\xi)$ needs to be simple for mathematical tractability and computational efficiency. We assume that it is a Gaussian density with a diagonal covariance matrix. This means that the approximation is a product of independent distributions: $q(\xi) = \prod_i q(\xi_i)$. The parameters of each $q(\xi_i)$ are the mean and variance, which will be denoted by $\bar{\xi}_i$ and $\tilde{\xi}_i$, respectively.
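As a concrete illustration of this fully factorial form, the following minimal Python sketch represents such an approximation simply as one mean and one variance per unknown variable; the class name and the numerical values are illustrative and not part of the model above.

```python
import numpy as np

class DiagonalGaussian:
    """Fully factorial Gaussian: one mean and one variance per unknown."""

    def __init__(self, mean, var):
        self.mean = np.asarray(mean, dtype=float)   # posterior means
        self.var = np.asarray(var, dtype=float)     # posterior variances

    def logpdf(self, x):
        # With a diagonal covariance the joint density is a product of
        # scalar Gaussians, so its logarithm is a sum over the variables.
        x = np.asarray(x, dtype=float)
        return float(np.sum(-0.5 * (np.log(2.0 * np.pi * self.var)
                                    + (x - self.mean) ** 2 / self.var)))

# Example with three unknown variables (hypothetical values).
q = DiagonalGaussian(mean=[0.0, 1.0, -0.5], var=[0.10, 0.20, 0.05])
print(q.logpdf([0.1, 0.9, -0.4]))
```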
Both the posterior density and its approximation are products of simple Gaussian terms, which simplifies the cost function considerably: it splits into expectations of many simple terms. The terms of the form $\mathrm{E}\{\ln q(\xi_i)\}$ are the negative entropies of Gaussians and have the values $-\tfrac{1}{2} \ln(2 \pi e \tilde{\xi}_i)$. The most difficult terms are those of the form $-\mathrm{E}\{\ln p(x_i(t) \mid s(t), \theta)\}$. They are approximated by applying second-order Taylor series expansions of the nonlinear activation functions, as explained in [5]. The rest of the terms are expectations of simple Gaussian terms, which can be computed as in [6].
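As a rough illustration of what these simple terms look like in practice, the following Python sketch evaluates the two kinds of terms for a single scalar unknown, assuming a tanh activation and a fixed observation-noise variance; the function names, the choice of tanh, and the numbers are illustrative assumptions rather than part of the model definition above.

```python
import numpy as np

def neg_entropy_gaussian(var):
    # E{ln q} for a scalar Gaussian with variance `var`:
    # the negative entropy, -1/2 * ln(2*pi*e*var).
    return -0.5 * np.log(2.0 * np.pi * np.e * var)

def tanh_moments(s_mean, s_var):
    # Propagate the posterior mean and variance of s through f(s) = tanh(s)
    # using a second-order Taylor expansion around the posterior mean.
    f = np.tanh(s_mean)
    df = 1.0 - f ** 2          # f'(s_mean)
    d2f = -2.0 * f * df        # f''(s_mean)
    f_mean = f + 0.5 * d2f * s_var
    f_var = (df ** 2) * s_var
    return f_mean, f_var

def expected_neg_log_gaussian(x, m_mean, m_var, noise_var):
    # E{-ln p(x | m)} with p(x | m) = N(x; m, noise_var), when the mean m
    # is itself Gaussian with mean m_mean and variance m_var under q.
    return 0.5 * (np.log(2.0 * np.pi * noise_var)
                  + ((x - m_mean) ** 2 + m_var) / noise_var)

# Toy usage: one latent variable feeding one observation through tanh.
f_mean, f_var = tanh_moments(s_mean=0.3, s_var=0.05)
data_term = expected_neg_log_gaussian(x=0.25, m_mean=f_mean, m_var=f_var,
                                      noise_var=0.01)
entropy_term = neg_entropy_gaussian(var=0.05)
print(data_term + entropy_term)
```

Summing terms of this kind over all observations, sources, and parameters yields the full cost described in the text.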
The cost function $C_{\mathrm{KL}}$ is a function of $\bar{\xi}_i$ and $\tilde{\xi}_i$, i.e., the posterior means and variances of the latent variables and the parameters of the network. This is because, instead of finding a point estimate, a whole distribution is estimated for the latent variables and the parameters during learning.