A Gaussian variable $s$ has two inputs $m$ and $v$ and prior probability $p(s \mid m, v) = \mathcal{N}(s;\, m, \exp(-v))$. The variance is parametrised this way because then the mean $\bar{v}$ and expected exponential $\langle \exp v \rangle$ of $v$ suffice for computing the cost function. It can be shown that when $s$, $m$ and $v$ are mutually independent, i.e. $q(s, m, v) = q(s)\,q(m)\,q(v)$, the term $C_p = \mathrm{E}\{-\ln p(s \mid m, v)\}$ yields

$$C_p = \tfrac{1}{2}\left\{ \langle \exp v \rangle \left[ (\bar{s} - \bar{m})^2 + \mathrm{Var}\{s\} + \mathrm{Var}\{m\} \right] - \bar{v} + \ln 2\pi \right\}. \tag{1}$$

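Under these definitions the cost term is a simple function of the propagated statistics. The following sketch (the helper name is hypothetical, not part of any toolbox described here) evaluates the term $\mathrm{E}\{-\ln p(s \mid m, v)\}$ from the posterior means and variances of $s$ and $m$ and the mean and expected exponential of $v$:

```python
import math

def gaussian_cost_cp(s_mean, s_var, m_mean, m_var, v_mean, v_expexp):
    """Cost term C_p = E{-ln p(s|m,v)} for a Gaussian variable s whose
    prior mean is m and whose prior variance is exp(-v).  Only the
    posterior means/variances of s and m and the mean and expected
    exponential of v are needed."""
    return 0.5 * (v_expexp * ((s_mean - m_mean) ** 2 + s_var + m_var)
                  - v_mean + math.log(2 * math.pi))

# With an observed value s = 1.2 (zero variance) and fixed inputs
# m = 0.5, v = 0 (so <exp v> = 1, i.e. unit prior variance), C_p
# reduces to the ordinary negative Gaussian log-likelihood.
cp = gaussian_cost_cp(1.2, 0.0, 0.5, 0.0, 0.0, 1.0)
```

With all variances zero the expression collapses to $-\ln \mathcal{N}(s; m, \exp(-v))$, which is a convenient sanity check.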
For observed variables this is the only term in the cost function, but for latent variables there is also $C_q = \mathrm{E}\{\ln q(s)\}$: the part resulting from the expectation of $\ln q(s)$. The posterior approximation is defined to be Gaussian with mean $\bar{s}$ and variance $\tilde{s}$: $q(s) = \mathcal{N}(s;\, \bar{s}, \tilde{s})$. This yields

$$C_q = \mathrm{E}\{\ln q(s)\} = -\tfrac{1}{2} \ln(2\pi e \tilde{s}), \tag{2}$$

which is the negative entropy of a Gaussian variable with variance $\tilde{s}$. The parameters $\bar{s}$ and $\tilde{s}$ are to be optimised during learning.
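Since $C_q$ depends only on the posterior variance, it is a one-liner; the sketch below (hypothetical helper name) also verifies it against a numerical integral of $q(s)\ln q(s)$ for an arbitrary test variance:

```python
import math

def gaussian_cost_cq(s_var):
    """C_q = E{ln q(s)} = -0.5*ln(2*pi*e*s_var): the negative entropy
    of the Gaussian posterior q(s).  Note that it is independent of
    the posterior mean."""
    return -0.5 * math.log(2 * math.pi * math.e * s_var)
```

A midpoint-rule integration of $\int q(s)\ln q(s)\,ds$ over a wide grid agrees with the closed form to high precision.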

The output of a latent Gaussian node trivially provides expectation and variance: $\langle s \rangle = \bar{s}$ and $\mathrm{Var}\{s\} = \tilde{s}$. The expected exponential can be shown to be $\langle \exp s \rangle = \exp(\bar{s} + \tilde{s}/2)$. The outputs of observed nodes are scalar values instead of distributions and thus $\langle s \rangle = s$, $\mathrm{Var}\{s\} = 0$ and $\langle \exp s \rangle = \exp s$.
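The expected-exponential formula is the mean of a log-normal distribution, which can be confirmed by Monte Carlo sampling; the helper name and the test parameters below are arbitrary:

```python
import math
import random

def expected_exp(s_mean, s_var):
    """Expected exponential <exp s> of s ~ N(s_mean, s_var):
    the log-normal mean exp(s_mean + s_var/2)."""
    return math.exp(s_mean + 0.5 * s_var)

# Monte Carlo sanity check with arbitrary test values.
random.seed(0)
s_mean, s_var = 0.3, 0.5
n = 200000
mc = sum(math.exp(random.gauss(s_mean, math.sqrt(s_var)))
         for _ in range(n)) / n
# mc should lie close to expected_exp(0.3, 0.5).
```

For an observed node the variance is zero, and the formula degenerates to the plain exponential $\exp s$, matching the text above.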

The posterior distribution $q(s)$ of a latent Gaussian node can be updated as follows. 1) First, the gradients of $C_p$ w.r.t. $\langle s \rangle$, $\mathrm{Var}\{s\}$ and $\langle \exp s \rangle$ are computed. 2) Second, the terms in $C_p$ which depend on $\bar{s}$ and $\tilde{s}$ are assumed to be of the form $C_p = M \langle s \rangle + V\left[(\langle s \rangle - \bar{s}_0)^2 + \mathrm{Var}\{s\}\right] + E \langle \exp s \rangle$, where $\bar{s}_0$ is the current posterior mean and $M = \partial C_p / \partial \langle s \rangle$, $V = \partial C_p / \partial \mathrm{Var}\{s\}$ and $E = \partial C_p / \partial \langle \exp s \rangle$. This assumption holds exactly if the output of the node is propagated to Gaussian nodes only and not to discrete nodes. If the output is used by a discrete node with a soft-max prior, this term gives an upper bound of $C_p$, as will be explained later. 3) Third, the minimum of $C = C_p + C_q$ is solved. This can be done analytically if $E = 0$; otherwise the minimum is obtained iteratively.
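Assuming the terms of the cost that depend on the posterior mean and variance take the form $M\bar{s} + V[(\bar{s}-\bar{s}_0)^2 + \tilde{s}] + E\exp(\bar{s}+\tilde{s}/2)$, with $C_q = -\tfrac{1}{2}\ln(2\pi e\tilde{s})$, the update can be sketched as below. The function name is hypothetical, and the fixed-point scheme is only one way to solve the stationarity conditions, not necessarily the scheme used in the actual implementation:

```python
import math

def update_gaussian_posterior(M, V, E, s0, iters=100):
    """Minimise C(sm, sv) = M*sm + V*((sm - s0)**2 + sv)
    + E*exp(sm + sv/2) - 0.5*log(2*pi*e*sv) over the posterior
    mean sm and variance sv (assumes V > 0 and E >= 0)."""
    if E == 0.0:
        # Analytic minimum: dC/dsm = 0 and dC/dsv = 0 give
        # sm = s0 - M/(2V) and sv = 1/(2V).
        return s0 - M / (2.0 * V), 1.0 / (2.0 * V)
    # Fixed-point iteration on the stationarity conditions.
    sm, sv = s0, 1.0 / (2.0 * V)
    for _ in range(iters):
        g = E * math.exp(sm + 0.5 * sv)    # current E * <exp s>
        sv = 1.0 / (2.0 * V + g)           # from dC/dsv = 0
        sm = s0 - (M + g) / (2.0 * V)      # from dC/dsm = 0
    return sm, sv
```

When $E = 0$ the exponential term vanishes and the quadratic-plus-entropy objective has the closed-form minimum shown in the first branch; otherwise both conditions couple through $\exp(\bar{s} + \tilde{s}/2)$ and must be iterated.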