The cost function was already outlined in section 2.2.
We can now go into more detail. Let us denote the observations by $X = \{\mathbf{x}(t)\}$ and the latent variables by $S = \{\mathbf{s}(t)\}$, and let $\boldsymbol{\theta}$ denote all the unknown parameters of the model. For notational simplicity, let us denote all the unknown variables by $\boldsymbol{\xi} = (S, \boldsymbol{\theta})$. The cost function is then
\[
C_{\mathrm{KL}} = \int q(S, \boldsymbol{\theta}) \ln \frac{q(S, \boldsymbol{\theta})}{p(S, \boldsymbol{\theta} \mid X)} \, dS \, d\boldsymbol{\theta}
= E\!\left\{ \ln \frac{q(S, \boldsymbol{\theta})}{p(S, \boldsymbol{\theta} \mid X)} \right\},
\]
where the expectation is taken over the approximation $q(S, \boldsymbol{\theta})$ of the posterior pdf.
According to Bayes' rule, the posterior pdf of the unknown variables $S$ and $\boldsymbol{\theta}$ is
\[
p(S, \boldsymbol{\theta} \mid X) = \frac{p(X \mid S, \boldsymbol{\theta})\, p(S \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})}{p(X)}.
\]
The terms $p(X \mid S, \boldsymbol{\theta})$ and $p(S \mid \boldsymbol{\theta})$ are also products of simple Gaussian distributions, and they are obtained directly from the definition of the model structure. The term $p(X)$ is not a function of any of the parameters of the model and can be neglected.
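Substituting Bayes' rule into the cost function makes this explicit: writing the expectation term by term, the cost decomposes as
\[
C_{\mathrm{KL}} = E\{\ln q(S, \boldsymbol{\theta})\} - E\{\ln p(X \mid S, \boldsymbol{\theta})\} - E\{\ln p(S \mid \boldsymbol{\theta})\} - E\{\ln p(\boldsymbol{\theta})\} + \ln p(X),
\]
and the last term can be dropped since it does not depend on the approximation $q$ or on any of the unknown variables.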
The approximation $q(S, \boldsymbol{\theta})$ needs to be simple for mathematical tractability and computational efficiency. We assume that it is a Gaussian density with a diagonal covariance matrix. This means that the approximation is a product of independent distributions:
\[
q(S, \boldsymbol{\theta}) = q(\boldsymbol{\xi}) = \prod_i q(\xi_i).
\]
The parameters of each $q(\xi_i)$ are the mean and variance, which will be denoted by $\bar{\xi}_i$ and $\tilde{\xi}_i$, respectively.
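Explicitly, each factor is thus a univariate Gaussian in the corresponding unknown variable,
\[
q(\xi_i) = N(\xi_i;\, \bar{\xi}_i, \tilde{\xi}_i) = \frac{1}{\sqrt{2\pi\tilde{\xi}_i}} \exp\!\left( -\frac{(\xi_i - \bar{\xi}_i)^2}{2\tilde{\xi}_i} \right),
\]
so fitting the approximation amounts to adjusting one mean and one variance per unknown variable.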
Both the posterior density $p(S, \boldsymbol{\theta} \mid X)$ and its approximation $q(S, \boldsymbol{\theta})$ are products of simple Gaussian terms, which simplifies the cost function considerably: it splits into expectations of many simple terms. The terms of the form $E\{\ln q(\xi_i)\}$ are the negative entropies of Gaussians and have the values
\[
E\{\ln q(\xi_i)\} = -\tfrac{1}{2}\left( 1 + \ln 2\pi\tilde{\xi}_i \right).
\]
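This value follows directly from taking the expectation of the Gaussian log-density under $q(\xi_i)$ itself:
\[
E\{\ln q(\xi_i)\} = E\!\left\{ -\tfrac{1}{2}\ln 2\pi\tilde{\xi}_i - \frac{(\xi_i - \bar{\xi}_i)^2}{2\tilde{\xi}_i} \right\}
= -\tfrac{1}{2}\ln 2\pi\tilde{\xi}_i - \tfrac{1}{2},
\]
since $E\{(\xi_i - \bar{\xi}_i)^2\} = \tilde{\xi}_i$ under $q(\xi_i)$.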
The most difficult terms are of the form $-E\{\ln p(\mathbf{x}(t) \mid \mathbf{s}(t), \boldsymbol{\theta})\}$, since they contain the nonlinear mapping of the network. They are approximated by applying second-order Taylor series expansions of the nonlinear activation functions, as explained in [5]. The rest of the terms are expectations of simple Gaussian terms, which can be computed as in [6].
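To illustrate the idea behind this approximation (a generic sketch rather than the exact formulas of [5]): for a scalar activation function $g$ applied to a variable with posterior mean $\bar{\xi}$ and variance $\tilde{\xi}$, expanding $g$ to second order around $\bar{\xi}$ yields the approximate output statistics
\[
E\{g(\xi)\} \approx g(\bar{\xi}) + \tfrac{1}{2}\, g''(\bar{\xi})\, \tilde{\xi},
\qquad
\mathrm{Var}\{g(\xi)\} \approx \left[ g'(\bar{\xi}) \right]^2 \tilde{\xi},
\]
which propagate the posterior means and variances through the nonlinearities and make the remaining expectations tractable.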
The cost function $C_{\mathrm{KL}}$ is a function of the posterior means $\bar{\xi}_i$ and variances $\tilde{\xi}_i$, i.e., the posterior means and variances of the latent variables and of the parameters of the network. This is because, instead of finding a point estimate, a whole distribution is estimated for the latent variables and the parameters during learning.
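Written out in full, the free parameters of the approximation are one mean and one variance per unknown variable, so that, with time-indexed latent variables for example, the cost takes the form
\[
C_{\mathrm{KL}} = C_{\mathrm{KL}}\!\left( \{\bar{s}_i(t), \tilde{s}_i(t)\},\, \{\bar{\theta}_j, \tilde{\theta}_j\} \right),
\]
and minimizing it with respect to all of these quantities fits the full factorial Gaussian approximation rather than a single point estimate.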