The approximations in (24) and (25) can give rise to problems with ill-defined posterior variances of the sources, the first layer weights A, or the biases.
This is because the approximations take into account only the local behaviour of the nonlinearities g of the hidden neurons. With MLP networks the posterior is typically multimodal, and therefore, in a valley between two maxima, it is possible that the second-order derivative of the logarithm of the posterior w.r.t. a parameter is positive.
This means that the derivative of the Cp part of the cost function with respect to the posterior variance of that parameter is negative, leading to a negative estimate of the variance in (28).
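As an illustration of the sign problem, consider the following sketch, which assumes that (28) has the standard fixed-point form obtained by setting the derivative of the total cost C = Cp + Cq with respect to a posterior variance, here denoted $\tilde{\theta}$, to zero, with Cq contributing a term $-\tfrac{1}{2}\ln\tilde{\theta}$ per parameter; both the notation and this form are assumptions made for the illustration rather than quotations of (28):
\[
  \frac{\partial C}{\partial \tilde{\theta}}
  = \frac{\partial C_p}{\partial \tilde{\theta}} - \frac{1}{2\tilde{\theta}} = 0
  \quad\Longrightarrow\quad
  \tilde{\theta} = \left( 2\,\frac{\partial C_p}{\partial \tilde{\theta}} \right)^{-1},
\]
which is negative whenever the derivative of Cp with respect to $\tilde{\theta}$ is negative.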
It is easy to see that the problem is due to the local estimate of g, since the logarithm of the posterior eventually has to go to negative infinity. The derivative of the Cp term w.r.t. the posterior variance will thus be positive for large values of the variance, but the local estimate of g fails to account for this.
In order to discourage the network from adapting itself to areas of parameter space where these problems might occur, and to deal with the problem if it nevertheless occurs, the terms in (24) which give rise to a negative derivative of Cp with respect to the posterior variance will be neglected in the computation of the gradients. As this can only make the variance estimate in (28) smaller, it leads, in general, to increased accuracy of the approximations in (24) and (25), since the local approximations are more accurate for small posterior variances.
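The following is a minimal numerical sketch of this remedy, assuming the same fixed-point form for the variance as in the display above; the per-term decomposition of the derivative of Cp and the function name are illustrative assumptions, not the actual implementation.

import numpy as np

def variance_estimate(dCp_dvar_terms):
    # Fixed-point estimate var = 1 / (2 * dCp/dvar), assuming the total cost
    # C = Cp + Cq with Cq contributing -0.5*ln(var) per parameter (assumption).
    # dCp_dvar_terms: hypothetical per-term contributions to the derivative of
    # Cp w.r.t. the posterior variance of a single parameter.
    terms = np.asarray(dCp_dvar_terms, dtype=float)
    # Neglect the terms with negative derivative, as described in the text, so
    # that the summed derivative stays positive and the variance well defined.
    dCp_dvar = np.sum(np.maximum(terms, 0.0))
    return 1.0 / (2.0 * dCp_dvar)

# Example: one contribution is negative, e.g. from a local estimate of g in a
# valley between two posterior maxima; keeping it would give a negative variance.
terms = [0.8, -1.6, 0.3]
print(variance_estimate(terms))      # 1 / (2 * 1.1), well defined and positive
print(1.0 / (2.0 * sum(terms)))      # naive estimate would be negative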