When constructing a learning algorithm which is based on approximations of the cost function, it is important to make sure that learning does not drive the network into areas of the parameter space where the approximations are no longer valid.
The approximations in (34) and (35) are based
on a roughly quadratic or linear behaviour of the nonlinearities.
This assumption is quite good if the posterior variance
of the inputs to the hidden neurons is not very large.
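As an illustration of how the accuracy of such Taylor-type approximations depends on the input variance, the following Python sketch compares second-order estimates of the posterior mean and variance of a hidden-neuron output with Monte Carlo references. The tanh nonlinearity, the exact forms assumed for (34) and (35), and all names below are assumptions made for this sketch only.

import numpy as np

rng = np.random.default_rng(0)

def g(x):                       # hidden-neuron nonlinearity (tanh assumed)
    return np.tanh(x)

def dg(x):                      # g'
    return 1.0 - np.tanh(x) ** 2

def ddg(x):                     # g''
    return -2.0 * np.tanh(x) * (1.0 - np.tanh(x) ** 2)

def taylor_moments(xi_mean, xi_var):
    # Second-order Taylor estimates of E[g(xi)] and Var[g(xi)],
    # in the spirit of approximations (34)-(35) (assumed form).
    g_mean = g(xi_mean) + 0.5 * ddg(xi_mean) * xi_var
    g_var = dg(xi_mean) ** 2 * xi_var
    return g_mean, g_var

def mc_moments(xi_mean, xi_var, n=200_000):
    # Monte Carlo reference for the same moments.
    samples = g(rng.normal(xi_mean, np.sqrt(xi_var), size=n))
    return samples.mean(), samples.var()

for xi_var in (0.05, 2.0):      # small vs. large posterior variance of the input
    print(xi_var, taylor_moments(1.0, xi_var), mc_moments(1.0, xi_var))

For a small input variance the two estimates agree closely, while for a large variance the Taylor estimates drift away from the Monte Carlo values, which is the regime the text warns against.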
Since the approximations take into account only the local behaviour of the nonlinearities $g_j$, and since MLP networks typically have multimodal posterior distributions, there must be areas of the parameter space where the second-order derivative of the logarithm of the posterior probability with respect to one of the parameters $\theta$ is positive. This means that the derivative $\partial C / \partial \tilde{\theta}$ of the cost function with respect to the posterior variance $\tilde{\theta}$ of that parameter is negative, which in turn means that it appears that the cost function can be made arbitrarily small by letting $\tilde{\theta}$ grow.
It is easy to see that the problem is due to the local estimate of $g_j$, since the logarithm of the posterior eventually has to go to negative infinity. The derivative $\partial C / \partial \tilde{\theta}$ will thus be positive for large $\tilde{\theta}$, but the local estimate of $g_j$ fails to account for this.
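This effect can be demonstrated numerically. The sketch below assumes a Gaussian approximation $q$ with mean $\bar{\theta}$ and variance $\tilde{\theta}$, a cost of the form $\mathrm{E}_q[\log q - \log p]$, and a bimodal log-posterior; all of these are illustrative assumptions, not the model of the text. Because the log-posterior has a positive second derivative at the posterior mean, the local quadratic estimate of the cost decreases without bound as $\tilde{\theta}$ grows, whereas the exact cost eventually increases because the log-posterior goes to negative infinity.

import numpy as np

rng = np.random.default_rng(1)

def log_p(theta):
    # Bimodal (unnormalised) log-posterior; it is locally convex near theta = 0,
    # i.e. its second derivative is positive there.
    return np.log(0.5 * np.exp(-0.5 * (theta + 2.0) ** 2)
                  + 0.5 * np.exp(-0.5 * (theta - 2.0) ** 2))

theta_mean, h = 0.0, 1e-3
# Numerical second derivative of log p at the posterior mean (positive here).
d2 = (log_p(theta_mean + h) - 2 * log_p(theta_mean) + log_p(theta_mean - h)) / h ** 2

for theta_var in (0.1, 1.0, 10.0, 100.0):
    # E_q[log q] term of the cost, identical in both estimates.
    log_q_term = -0.5 * np.log(2 * np.pi * np.e * theta_var)
    # Local quadratic estimate of E_q[-log p], as used by the approximate cost.
    local = -log_p(theta_mean) - 0.5 * d2 * theta_var
    # Monte Carlo estimate of the exact E_q[-log p].
    exact = -log_p(rng.normal(theta_mean, np.sqrt(theta_var), 200_000)).mean()
    print(f"var={theta_var:7.1f}  approximate cost={local + log_q_term:9.3f}"
          f"  exact cost={exact + log_q_term:9.3f}")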
In order to discourage the network from adapting itself into areas of parameter space where these problems might occur, and to deal with the problem if it nevertheless occurs, the terms in (34) which have a negative contribution to $\partial C / \partial \tilde{\theta}$ will be neglected in the computation of the gradients. As this can only make the estimate of $\tilde{\theta}$ in (41) smaller, it leads, in general, to increased accuracy of the approximations in (34) and (35).
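A minimal sketch of this stabilisation is given below, assuming that (41) is a fixed-point update of a posterior variance of the form $\tilde{\theta} = (2\,\partial C/\partial \tilde{\theta})^{-1}$; that form, and the example numbers, are assumptions made for illustration only. Dropping the negative contributions keeps the accumulated derivative positive, so the resulting variance estimate can only shrink relative to the unmodified update.

def stabilized_variance_update(grad_terms):
    # Accumulate dC/dtheta_var from the per-term contributions of the
    # approximate cost, neglecting terms with a negative contribution.
    grad = sum(term for term in grad_terms if term > 0.0)
    if grad <= 0.0:
        raise ValueError("no positive contributions; variance update undefined")
    # Assumed fixed-point form of (41): theta_var = 1 / (2 * dC/dtheta_var).
    return 1.0 / (2.0 * grad)

# Hypothetical per-term contributions to dC/dtheta_var for one parameter.
terms = [0.8, -0.3, 1.5, -2.5]
print(stabilized_variance_update(terms))   # uses only the positive terms 0.8 and 1.5
print(1.0 / (2.0 * sum(terms)))            # the unmodified update would give a negative "variance"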