In ensemble learning, the goal is to approximate the posterior pdf of all the unknown values in the model. Let us denote the observations by X. Everything else in the model is unknown, i.e., the sources, parameters and hyperparameters. Let us denote all these unknowns by the vector \theta. The cost function measures the misfit between the actual posterior pdf p(\theta | X) and its approximation q(\theta).
The posterior is approximated as a product of independent Gaussian distributions

    q(\theta) = \prod_i q_i(\theta_i) = \prod_i N(\theta_i; \bar{\theta}_i, \tilde{\theta}_i),    (16)

where \bar{\theta}_i and \tilde{\theta}_i denote the posterior mean and variance of the unknown variable \theta_i.
The functional form of the cost function is given in Chap. 6. The cost function can be interpreted to measure the misfit between the actual posterior p(\theta | X) and its factorial approximation q(\theta). It can also be interpreted as measuring the number of bits it would take to encode X when approximating the posterior pdf of the unknown variables by q(\theta).
The cost function is minimised with respect to the posterior means \bar{\theta}_i and variances \tilde{\theta}_i of the unknown variables. The end result of the learning is therefore not just a point estimate of the unknown variables, but a distribution over them.
The simple factorising form of the approximation makes the cost function computationally tractable. The cost function can be split into two terms, Cq and Cp, where the former is the expectation of ln q(\theta) and the latter is the expectation of -ln p(X, \theta), both taken over the approximation q(\theta).
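For reference, a sketch of this split in the notation assumed above (the precise functional form is the one given in Chap. 6):

    C = E_q\{ \ln q(\theta) - \ln p(X, \theta) \} = C_q + C_p,
    C_q = E_q\{ \ln q(\theta) \}, \qquad C_p = -E_q\{ \ln p(X, \theta) \},
    C = D\big( q(\theta) \,\|\, p(\theta | X) \big) - \ln p(X).

Since the Kullback-Leibler divergence D is non-negative, -C is a lower bound on ln p(X), which underlies the coding interpretation mentioned above.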
It turns out that the term Cq is not a function of the posterior means \bar{\theta}_i of the parameters, only of the posterior variances \tilde{\theta}_i. It has a similar term for each unknown variable:

    C_q = E_q\{ \ln q(\theta) \} = \sum_i -\frac{1}{2} \ln \left( 2 \pi e \tilde{\theta}_i \right).    (17)
Most of the terms of Cp are also trivial. The Gaussian densities in (8)-(15), with variances parametrised as e^{2v}, yield terms of the form

    -E_q\{ \ln N(\theta; m, e^{2v}) \}
        = \frac{1}{2} E_q\{ (\theta - m)^2 \}\, E_q\{ e^{-2v} \} + E_q\{ v \} + \frac{1}{2} \ln 2\pi    (18)
        = \frac{1}{2} \left[ (\bar{\theta} - \bar{m})^2 + \tilde{\theta} + \tilde{m} \right] e^{2\tilde{v} - 2\bar{v}} + \bar{v} + \frac{1}{2} \ln 2\pi,    (20)

where the expectation factorises because \theta, m and v are posteriorly independent under q.
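The closed form above can be checked numerically. The following NumPy sketch (all numbers hypothetical) compares a Monte Carlo estimate of the expectation with the closed-form expression, assuming the e^{2v} parametrisation used above:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000

    # Hypothetical posterior means (bar) and variances (tilde) of theta, m and v.
    theta_bar, theta_tilde = 0.3, 0.04
    m_bar, m_tilde = -0.1, 0.02
    v_bar, v_tilde = 0.5, 0.01

    # Draw from the factorial Gaussian posterior q.
    theta = rng.normal(theta_bar, np.sqrt(theta_tilde), n)
    m = rng.normal(m_bar, np.sqrt(m_tilde), n)
    v = rng.normal(v_bar, np.sqrt(v_tilde), n)

    # Monte Carlo estimate of E_q{ -ln N(theta; m, exp(2v)) }.
    mc = np.mean(0.5 * np.log(2 * np.pi) + v + 0.5 * (theta - m) ** 2 * np.exp(-2 * v))

    # Closed-form expression, as in (18)-(20) above.
    cf = (0.5 * ((theta_bar - m_bar) ** 2 + theta_tilde + m_tilde) * np.exp(2 * v_tilde - 2 * v_bar)
          + v_bar + 0.5 * np.log(2 * np.pi))

    print(mc, cf)  # the two values agree up to Monte Carlo error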
The function f consists of two multiplications by matrices with a nonlinearity in between. The posterior mean and variance of a product u = yz of two posteriorly independent variables are
    \bar{u} = \bar{y} \bar{z}, \qquad \tilde{u} = \bar{y}^2 \tilde{z} + \tilde{y} \bar{z}^2 + \tilde{y} \tilde{z}.    (22)
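As an illustration of how the product rule is used for the matrix multiplications, the following NumPy sketch (shapes and names are hypothetical) propagates posterior means and variances through a single affine mapping W s + b with posteriorly independent Gaussian factors. It returns the total variance only; the separation of source-originated paths is illustrated further below.

    import numpy as np

    def linear_mean_var(W_bar, W_tilde, s_bar, s_tilde, b_bar, b_tilde):
        """Posterior mean and total variance of W s + b, assuming all factors
        of q are independent Gaussians (a sketch, not the paper's exact code)."""
        mean = W_bar @ s_bar + b_bar
        # Each term W_ij * s_j is a product of independent variables, so its
        # variance is W_bar^2 s_tilde + W_tilde s_bar^2 + W_tilde s_tilde, as in
        # (22), and the variances of the independent terms add up.
        var = (W_bar ** 2) @ s_tilde + W_tilde @ (s_bar ** 2) + W_tilde @ s_tilde + b_tilde
        return mean, var

    # Hypothetical sizes: 3 hidden neurons, 2 sources.
    rng = np.random.default_rng(0)
    W_bar, W_tilde = rng.normal(size=(3, 2)), 0.01 * np.ones((3, 2))
    s_bar, s_tilde = rng.normal(size=2), 0.1 * np.ones(2)
    b_bar, b_tilde = np.zeros(3), 0.01 * np.ones(3)

    print(linear_mean_var(W_bar, W_tilde, s_bar, s_tilde, b_bar, b_tilde))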
Let us now pick one hidden neuron having nonlinearity g and input \xi, i.e., the hidden neuron is computing g(\xi). At this point we are not assuming any particular form of g, although we are going to use g(\xi) = \tanh \xi in all the experiments; the following derivation is general and can be applied to any sufficiently smooth function g.
In order to be able to compute the posterior mean and variance of the function g, we apply a Taylor series expansion around the posterior mean \bar{\xi} of the input. We choose the second-order expansion when computing the mean and the first-order expansion when computing the variance. The choice is purely practical; higher-order expansions could be used as well, but these are the ones that can be computed from the posterior mean and variance of the inputs alone.
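Written out, the resulting approximations are the following (a sketch in the notation assumed above, with \tilde{\xi} denoting the posterior variance of the input):

    \bar{g}(\xi) \approx g(\bar{\xi}) + \frac{1}{2} g''(\bar{\xi}) \tilde{\xi}, \qquad
    \tilde{g}(\xi) \approx \left[ g'(\bar{\xi}) \right]^2 \tilde{\xi}.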
The reason why the outputs of the hidden neurons are posteriorly dependent is that the value of one source can potentially affect all the outputs. This is illustrated in Fig. 4. Each source affects the output of the whole network through several paths, and in order to be able to determine the variance of the outputs, the paths originating from different sources need to be kept separate. This is done by keeping track of the partial derivatives of the posterior means of the outputs with respect to the posterior means of the sources.
Equation (26) shows how the total posterior variance of the output of one of the hidden neurons can be split into terms originating from each source, plus a term which contains the variance originating from the weights and biases, i.e., those variables which affect any one output through only a single path.
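Schematically, in the notation assumed above (a sketch of the decomposition rather than a verbatim copy of (26)), the split for a hidden neuron with input \xi_j reads

    \tilde{g}(\xi_j) \approx \sum_i \left[ \frac{\partial \bar{g}(\xi_j)}{\partial \bar{s}_i} \right]^2 \tilde{s}_i + \tilde{g}^*(\xi_j),

where \tilde{s}_i denotes the posterior variance of source s_i and \tilde{g}^*(\xi_j) collects the variance originating from the weights and biases.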
The only approximations made in the computation are those approximating the effect of the nonlinearity; if the hidden neurons were linear, the computation would be exact. The nonlinearity of the hidden neurons is dealt with by linearising around the posterior mean of the inputs of the hidden neurons. The smaller the variances of the inputs, the more accurate this approximation is; with increasing nonlinearity and increasing input variance, the approximation gets worse.
Compared to the ordinary forward phase of an MLP network, the computational complexity is greater by about a factor of 5N, where N is the number of sources. The factor five is due to propagating distributions instead of plain values, and the need to keep the paths originating from different sources separate explains the factor N. Fortunately, much of the extra computation can be put to good use later on when adapting the distributions of the variables.
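To make the bookkeeping concrete, the following NumPy sketch propagates posterior means and variances through a two-layer network of the form B tanh(A s + a) + b, keeping the source-originated variance paths separate from the weight- and bias-originated variance. All names, shapes and the choice g = tanh are assumptions of this sketch, not the paper's code.

    import numpy as np

    def forward_dist(A, A_var, a, a_var, B, B_var, b, b_var, s, s_var):
        """Propagate posterior means and variances through B tanh(A s + a) + b.

        A sketch under assumed notation; shapes: A (H, N), B (M, H), s (N,).
        *_var arrays hold the posterior variances of the corresponding means.
        Source-originated variance is tracked per source via partial derivatives,
        so correlations induced by shared sources are handled; the variance from
        weights and biases is collected into a separate 'other' term.
        """
        # First layer: xi = A s + a.
        xi = A @ s + a
        dxi_ds = A.copy()                               # d xi_h / d s_i
        xi_var_other = A_var @ (s**2 + s_var) + a_var   # variance from A and a only
        xi_var = (dxi_ds**2) @ s_var + xi_var_other     # total variance of xi

        # Nonlinearity: second-order Taylor for the mean, first order for the variance.
        g = np.tanh(xi)
        g1 = 1.0 - g**2                                 # tanh'
        g2 = -2.0 * g * g1                              # tanh''
        g_mean = g + 0.5 * g2 * xi_var
        dg_ds = g1[:, None] * dxi_ds                    # d g_h / d s_i
        g_var_other = g1**2 * xi_var_other
        g_var = (dg_ds**2) @ s_var + g_var_other        # total variance of g

        # Second layer: o = B g + b.
        o = B @ g_mean + b
        do_ds = B @ dg_ds                               # sum the paths per source first,
        o_var_other = (B**2) @ g_var_other + B_var @ (g_mean**2 + g_var) + b_var
        o_var = (do_ds**2) @ s_var + o_var_other        # ... then square and add s_var
        return o, o_var

    # Hypothetical sizes: 2 sources, 4 hidden neurons, 3 observations.
    rng = np.random.default_rng(1)
    H, N, M = 4, 2, 3
    o, o_var = forward_dist(
        rng.normal(size=(H, N)), 0.01 * np.ones((H, N)),
        np.zeros(H), 0.01 * np.ones(H),
        rng.normal(size=(M, H)), 0.01 * np.ones((M, H)),
        np.zeros(M), 0.01 * np.ones(M),
        rng.normal(size=N), 0.1 * np.ones(N))
    print(o, o_var)

The per-source derivative arrays explain the factor N in the complexity estimate above: summing the paths over hidden neurons before squaring is what keeps the source-originated variance correct.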