Recall the Gaussian node in Section 3.1.
The variance is parameterised using the exponential function as
$\exp(-v)$. This is because then the mean $\left\langle v \right\rangle$
and expected exponential $\left\langle \exp v \right\rangle$
of the input $v$
suffice for evaluating the cost function,
as will be shown shortly. Consequently, the cost function can be
minimised using the gradients with respect to these expectations.
The gradients are computed backwards from the child
nodes, but otherwise
our learning method differs markedly from standard back-propagation
\citep{Haykin98}.
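To anticipate that derivation, here is a brief sketch; the overline and
tilde notation for posterior means and variances is an assumption made
only for this illustration. If the posterior approximation factorises
into Gaussians $q(s)$, $q(m)$ and $q(v)$, the relevant term of the cost
function under $s \sim \mathcal{N}(m, \exp(-v))$ is
\[
\bigl\langle -\ln p(s \mid m, v) \bigr\rangle
= \tfrac{1}{2} \Bigl[ \left\langle \exp v \right\rangle
\bigl( (\overline{s} - \overline{m})^2 + \widetilde{s} + \widetilde{m} \bigr)
- \overline{v} + \ln 2\pi \Bigr],
\]
where $\left\langle \exp v \right\rangle = \exp(\overline{v} + \widetilde{v}/2)$
for a Gaussian $q(v)$. The input $v$ thus enters the cost only through
its mean and expected exponential.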
Another important reason for using the parameterisation $\exp(-v)$ for
the prior variance of a Gaussian random variable
is that the posterior
distribution of $v$
then becomes approximately Gaussian, provided
that the prior mean
of $v$
is Gaussian, too (see for example
Section 7.1 or \citealp{Lappal-Miskin00}).
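A brief sketch of why this holds, under the assumed model
$s \sim \mathcal{N}(m, \exp(-v))$ with a Gaussian prior
$v \sim \mathcal{N}(\mu_v, \sigma_v^2)$; the symbols $\mu_v$ and
$\sigma_v^2$ are introduced here only for illustration. The negative
log posterior of $v$,
\[
-\ln p(v \mid s, m) = \tfrac{1}{2} e^{v} (s - m)^2 - \tfrac{1}{2} v
+ \frac{(v - \mu_v)^2}{2\sigma_v^2} + \text{const},
\]
has second derivative $\tfrac{1}{2} e^{v} (s - m)^2 + \sigma_v^{-2} > 0$
everywhere, so the posterior is log-concave and unimodal, and a Gaussian
approximation fitted at its mode is typically accurate.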
The conjugate prior distribution for the inverse of the prior variance
of a Gaussian random variable is the gamma distribution
\citep{Gelman95}. Using such a gamma prior pdf causes the posterior
distribution to be a gamma distribution, too, which is mathematically convenient.
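As a concrete illustration of this conjugacy (a standard result, stated
here for $n$ independent observations $x_1, \ldots, x_n$ with known mean
$\mu$, an assumed setting): if the precision $\tau = \sigma^{-2}$ has the
prior $\tau \sim \operatorname{Gamma}(\alpha, \beta)$ and
$x_i \sim \mathcal{N}(\mu, \tau^{-1})$, then
\[
p(\tau \mid x_1, \ldots, x_n)
= \operatorname{Gamma}\!\Bigl(\alpha + \tfrac{n}{2},\;
\beta + \tfrac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^2 \Bigr),
\]
so the posterior stays in the gamma family with simply updated parameters.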
However, the conjugate prior pdf of the second parameter of the gamma
distribution is quite intractable. Hence the gamma distribution
is not suitable for developing hierarchical variance models.
The logarithm of a gamma distributed variable is approximately Gaussian distributed
\citep{Gelman95}, justifying the adopted parameterisation
$\exp(-v)$. However, it should be noted that both the gamma and the
$\exp(-v)$ distributions are used as prior pdfs mainly because they make
the estimation of the posterior pdf mathematically tractable
\citep{Lappal-Miskin00}; one cannot claim that either of these choices
would be correct.
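The quality of the Gaussian approximation to the log-gamma distribution
can be quantified (a standard property, added here for illustration):
if $\tau \sim \operatorname{Gamma}(\alpha, \beta)$, then
\[
\mathrm{E}[\ln \tau] = \psi(\alpha) - \ln \beta, \qquad
\operatorname{Var}[\ln \tau] = \psi'(\alpha),
\]
where $\psi$ is the digamma function. The skewness of $\ln \tau$ is
$\psi''(\alpha) / \psi'(\alpha)^{3/2}$, which tends to zero as $\alpha$
grows, so $\ln \tau$ is close to Gaussian already for moderate values of
the shape parameter.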