When constructing a learning algorithm which is based on approximations of the cost function, it is important to make sure that learning does not drive the network into areas of the parameter space where the approximations are no longer valid.

The approximations in (34) and (35) are based on a roughly quadratic or linear behaviour of the nonlinearities. This assumption is quite good if the posterior variance of the inputs to the hidden neurons is not very large.
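For concreteness, here is a sketch of the kind of moment approximation involved, assuming (34) and (35) are the usual second-order Taylor approximations around the posterior mean $\bar{x}_j$ of the input to hidden neuron $j$, with posterior variance $\tilde{x}_j$ (the exact forms are those given by (34) and (35) themselves):

$$
\bar{g}_j \approx g_j(\bar{x}_j) + \tfrac{1}{2}\, g_j''(\bar{x}_j)\, \tilde{x}_j ,
\qquad
\tilde{g}_j \approx \left[ g_j'(\bar{x}_j) \right]^2 \tilde{x}_j .
$$

Both expressions rely on $g_j$ being close to quadratic (for the mean) and close to linear (for the variance) over the range of inputs that carries most of the posterior mass, which is exactly the assumption stated above.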

Since the approximations take into account only the local behaviour of the nonlinearities $g_j$, and MLP networks typically have multimodal posterior distributions, there must be areas of the parameter space where the second order derivative of the logarithm of the posterior probability with respect to one of the parameters $\theta_i$ is positive. This means that $\partial C_p / \partial \tilde{\theta}_i$ is negative, which in turn means that it appears that the cost function can be made arbitrarily small by letting the posterior variance $\tilde{\theta}_i$ grow.
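The instability can be seen from how the cost depends on a single posterior variance. Assuming a Gaussian posterior approximation $q$, the entropy part of the cost contributes $-\tfrac{1}{2} \ln \tilde{\theta}_i$ up to a constant, while the approximated $C_p$ is locally linear in $\tilde{\theta}_i$:

$$
C(\tilde{\theta}_i) \approx -\tfrac{1}{2} \ln \tilde{\theta}_i
+ \frac{\partial C_p}{\partial \tilde{\theta}_i}\, \tilde{\theta}_i + \text{const} .
$$

With the local approximation, $\partial C_p / \partial \tilde{\theta}_i \approx \tfrac{1}{2}\, \partial^2 C_p / \partial \bar{\theta}_i^2$, which is negative exactly when the second derivative of the log posterior is positive; a negative slope then makes the approximated cost decrease without bound as $\tilde{\theta}_i \to \infty$.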

It is easy to see that the problem is due to the local estimate of $g_j$, since the logarithm of the posterior probability eventually has to go to negative infinity. The derivative $\partial C_p / \partial \tilde{\theta}_i$ will thus be positive for large $\tilde{\theta}_i$, but the local estimate of $g_j$ fails to account for this.
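The failure mode is easy to reproduce in a toy setting. The following sketch (not from the paper; the density and numbers are purely illustrative) compares the exact expectation $\mathrm{E}_q[-\ln p]$ with its local second-order estimate for a one-dimensional log-density that has a region of positive curvature: the exact cost eventually grows with the posterior variance, while the local estimate keeps decreasing.

```python
# Toy illustration (not from the paper): a 1-D "posterior" whose log-density
# ln p(theta) = -(theta**4 - 2*theta**2) has positive curvature around theta = 0.
def neg_log_p(theta):
    return theta**4 - 2 * theta**2

theta_mean = 0.0  # posterior mean, chosen inside the positive-curvature region

for var in [0.1, 0.5, 1.0, 2.0, 5.0]:
    # Exact C_p = E_q[-ln p] for q = N(theta_mean, var), using Gaussian moments:
    # E[theta^2] = mean^2 + var,  E[theta^4] = mean^4 + 6*mean^2*var + 3*var^2.
    exact = (theta_mean**4 + 6 * theta_mean**2 * var + 3 * var**2
             - 2 * (theta_mean**2 + var))
    # Local second-order estimate: -ln p(mean) + 0.5 * (-ln p)''(mean) * var.
    curvature = 12 * theta_mean**2 - 4          # (-ln p)''(theta) = 12*theta^2 - 4
    local = neg_log_p(theta_mean) + 0.5 * curvature * var
    print(f"var = {var:4.1f}   exact C_p = {exact:7.2f}   local estimate = {local:7.2f}")
```

As the variance grows, the exact expectation turns upward because the log-density goes to negative infinity in every direction, while the local estimate keeps decreasing, mirroring the situation described above.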

In order to discourage the network from adapting itself into areas of the parameter space where these problems might occur, and to deal with the problem if it nevertheless occurs, the terms in (34) which make a negative contribution to $\partial C_p / \partial \tilde{\theta}_i$ will be neglected in the computation of the gradients. As this can only make the estimate of $\tilde{\theta}_i$ in (41) smaller, it leads, in general, to an increase in the accuracy of the approximations in (34) and (35).
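If, as is typical for this type of ensemble-learning scheme, (41) is the fixed-point update for the posterior variance obtained by setting $\partial C / \partial \tilde{\theta}_i = 0$,

$$
\tilde{\theta}_i = \left[ 2\, \frac{\partial C_p}{\partial \tilde{\theta}_i} \right]^{-1} ,
$$

then the effect of the rule can be read off directly: neglecting negative contributions can only increase $\partial C_p / \partial \tilde{\theta}_i$, and therefore only decrease $\tilde{\theta}_i$, which keeps the posterior variances in the region where the local approximations (34) and (35) remain valid.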