The figure shows some possible transfer functions. A good nonlinearity should have a region that is close to linear and another that is saturated. The linear part makes it possible to use the unit as a linear one, while the saturated region makes sparse representations possible, as was seen in Section . When the input fluctuates inside the saturated region, the output does not vary appreciably. Thus, the input does not always need to be determined precisely for the output to be precise. A nonlinearity with two flat parts can be used as a binary unit that has two typical output values and almost nothing in between.
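To illustrate the point about saturation, the following sketch evaluates a Gaussian-shaped nonlinearity f(s) = exp(-s^2) (an assumed form, matching the nonlinearity adopted later in this section) and compares how the same small input fluctuation propagates in the near-linear and saturated regions:

```python
import math

def f(s):
    # Gaussian-shaped nonlinearity (assumed form)
    return math.exp(-s ** 2)

delta = 0.1  # small fluctuation of the input

# Near s = 0.7 the function is approximately linear,
# so the output moves noticeably with the input.
linear_change = abs(f(0.7 + delta) - f(0.7))

# Deep in the saturated tail (s = 3) the same fluctuation
# has almost no effect on the output.
saturated_change = abs(f(3.0 + delta) - f(3.0))

print(linear_change, saturated_change)
```

The change in the saturated region is orders of magnitude smaller, which is why the input need not be determined precisely there for the output to be precise.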
Some expected values after the nonlinearity can be evaluated for Gaussian input [20]. Therefore we restrict the nonlinearity to follow immediately after a Gaussian variable node. The mean, the variance and the expected exponential of the output of the nonlinearity then have integral expressions. For most nonlinear functions it is impossible to compute them analytically, but for the function $f(s) = \exp(-s^2)$ the mean and variance do have analytical expressions, and therefore it is used in this work. For an input $s \sim N(\mu, \sigma^2)$, the required expectations of the outputs are
$$\mathrm{E}[f(s)] = \frac{1}{\sqrt{1+2\sigma^2}} \exp\!\left( -\frac{\mu^2}{1+2\sigma^2} \right), \qquad \mathrm{Var}[f(s)] = \frac{1}{\sqrt{1+4\sigma^2}} \exp\!\left( -\frac{2\mu^2}{1+4\sigma^2} \right) - \mathrm{E}[f(s)]^2 .$$
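For reference, the mean follows by completing the square inside the Gaussian integral; a sketch of the derivation, assuming the nonlinearity f(s) = exp(-s^2) and a Gaussian input s ~ N(mu, sigma^2):

```latex
\mathrm{E}\!\left[ e^{-s^2} \right]
  = \int \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left( -s^2 - \frac{(s-\mu)^2}{2\sigma^2} \right) \mathrm{d}s
  = \frac{1}{\sqrt{1+2\sigma^2}}
    \exp\!\left( -\frac{\mu^2}{1+2\sigma^2} \right)
```

Replacing $e^{-s^2}$ by $e^{-2s^2}$ in the same integral gives $\mathrm{E}[f(s)^2]$ with $2\sigma^2$ replaced by $4\sigma^2$ and $\mu^2$ by $2\mu^2$, from which the variance follows as $\mathrm{E}[f(s)^2] - \mathrm{E}[f(s)]^2$.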
Gaussian radial basis functions (RBF) [24] use the same nonlinearity, but the input is the distance from a certain point in the source space rather than one of the sources directly. Ghahramani and Roweis [20] used Gaussian RBF approximators with the EM algorithm to model nonlinear dynamical systems. They found that with the Gaussian nonlinearity the required integrals become tractable.
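As a quick numerical sanity check, the analytical mean and variance above can be compared against Monte Carlo estimates; a sketch, again assuming the nonlinearity exp(-s^2) (the function name and parameter values here are illustrative):

```python
import math
import random

def analytic_moments(mu, var):
    """Mean and variance of exp(-s^2) for s ~ N(mu, var),
    from the standard Gaussian integral (assumed nonlinearity)."""
    mean = math.exp(-mu ** 2 / (1 + 2 * var)) / math.sqrt(1 + 2 * var)
    second = math.exp(-2 * mu ** 2 / (1 + 4 * var)) / math.sqrt(1 + 4 * var)
    return mean, second - mean ** 2

random.seed(0)
mu, var = 0.5, 0.64
samples = [math.exp(-random.gauss(mu, math.sqrt(var)) ** 2)
           for _ in range(200_000)]
mc_mean = sum(samples) / len(samples)
mc_var = sum((x - mc_mean) ** 2 for x in samples) / len(samples)

an_mean, an_var = analytic_moments(mu, var)
print(mc_mean, an_mean, mc_var, an_var)
```

The sampled and analytical moments agree to within Monte Carlo error, which is the tractability that makes the Gaussian nonlinearity attractive here.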
Another possibility would be to use the error function $\operatorname{erf}(s) = \frac{2}{\sqrt{\pi}} \int_0^s e^{-t^2} \, \mathrm{d}t$, since its mean can be evaluated analytically and its variance can be bounded from above [15]. This is useful because increasing the variance also increases the cost function, so minimising an upper bound of the cost guarantees that the cost itself stays low. Murphy [50] used the logistic function approximated iteratively with a Gaussian. Valpola [62] approximated the same function with a truncated Taylor series.
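The analytical mean in the error-function case is a standard Gaussian-integral identity, E[erf(s)] = erf(mu / sqrt(1 + 2 sigma^2)) for s ~ N(mu, sigma^2); a small sketch checking it against sampling (the parameter values are illustrative):

```python
import math
import random

def erf_mean(mu, var):
    # Closed-form mean of erf(s) for s ~ N(mu, var)
    return math.erf(mu / math.sqrt(1.0 + 2.0 * var))

random.seed(1)
mu, var = 0.3, 0.5
n = 200_000
mc = sum(math.erf(random.gauss(mu, math.sqrt(var)))
         for _ in range(n)) / n

print(mc, erf_mean(mu, var))
```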
Hornik [26] and Funahashi [17] have independently shown that MLP networks are universal approximators; that is, given enough hidden units, the mapping from inputs to outputs can approximate any measurable function to any desired degree of accuracy. This result was proven for any non-decreasing nonlinearity $f(s)$ that has the limits $\lim_{s \rightarrow -\infty} f(s) = 0$ and $\lim_{s \rightarrow \infty} f(s) = 1$. Unfortunately, that is not true for the function $f(s) = \exp(-s^2)$, which is not non-decreasing.
Future work might include a comparison with other nonlinearities, and perhaps the property of universal approximation could be proven, at least on a finite interval.