
Feedforward and backward computations

The feedforward computations start with the parameters of the posterior approximation of the unknown variables of the model. For the factors, the parameters of the posterior approximation are the posterior mean $\bar{s}_i(t)$, the posterior variance $\mathring{s}_i(t)$ and the dependence $\breve{s}_i(t,t-1)$. The end result of the feedforward computations is the value of the cost function C.
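For reference, the parametrisation behind these quantities, introduced in the previous section, can be read as a conditionally Gaussian posterior approximation; the following restatement is an interpretation of that section rather than a quotation:

\begin{displaymath}
q(s_i(t) \vert s_i(t-1), {\mathbf{X}}) = N\!\left(s_i(t);\; \bar{s}_i(t) + \breve{s}_i(t,t-1)\,[s_i(t-1) - \bar{s}_i(t-1)],\; \mathring{s}_i(t)\right) ,
\end{displaymath}

so that the marginalised variance follows the forward recursion (8), $\tilde{s}_i(t) = \mathring{s}_i(t) + \breve{s}_i^2(t,t-1)\,\tilde{s}_i(t-1)$, which is used below.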

The first stage of the computations is the iteration of (8) to obtain the marginalised posterior mean $\bar{s}_i(t)$ and variance $\tilde{s}_i(t)$ of the factors. Thereafter the computations proceed as in the NLFA algorithm: the means and variances are propagated through the MLP networks. The final stage, the computation of the cost function, differs only in the terms $\int q(s_i(t) \vert s_i(t-1), {\mathbf{X}}) \ln q(s_i(t) \vert s_i(t-1), {\mathbf{X}}) \, ds_i(t)$ and $-\int q({\boldsymbol{\theta}} \vert {\mathbf{X}}) \ln p(s_i(t) \vert {\boldsymbol{\theta}}) \, d{\boldsymbol{\theta}}$. In the NLFA algorithm, the former had the form

\begin{displaymath}
-\frac{1}{2} \ln 2\pi e \tilde{s}_i(t) \, ,
\end{displaymath} (9)

but it now has the form

\begin{displaymath}
-\frac{1}{2} \ln 2\pi e \mathring{s}_i(t) \, .
\end{displaymath} (10)

The latter term can be shown to yield

\begin{displaymath}
\frac{1}{2}\left[(\bar{s}_i(t) - \bar{g}_i(t))^2 + \mathring{s}_i(t) + \tilde{g}_i^*(t) + \left(\breve{s}_i(t,t-1) - \frac{\partial g_i(t)}{\partial s_i(t-1)}\right)^2 \tilde{s}_i(t-1)\right] e^{2\tilde{v}_i - 2\bar{v}_i} + \bar{v}_i + \frac{1}{2} \ln 2\pi \, ,
\end{displaymath} (11)

where the $i$th component of the vector ${\mathbf{g}}({\mathbf{s}}(t-1))$ is denoted by $g_i(t)$, the variance parameter of the $i$th factor by $v_i$, and the posterior variance of $g_i(t)$ without the contribution of $s_i(t-1)$, that is, assuming $s_i(t-1)$ fixed, by $\tilde{g}_i^*(t)$. Notice that if $\breve{s}_i(t,t-1)$ is zero, the term inside the square brackets takes the form $(\bar{s}_i(t) - \bar{g}_i(t))^2 + \mathring{s}_i(t) + \tilde{g}_i(t)$ because $\tilde{g}_i^*(t)$ is defined to be $\tilde{g}_i(t) - [\partial g_i(t) / \partial s_i(t-1)]^2 \tilde{s}_i(t-1)$.
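As an illustration of the feedforward stage, the following Python/NumPy sketch accumulates, for a single factor, the marginalised variances and the cost terms (10) and (11). All names are hypothetical, and the recursion (8) is assumed to have the form $\tilde{s}_i(t) = \mathring{s}_i(t) + \breve{s}_i^2(t,t-1)\,\tilde{s}_i(t-1)$ stated above; in the full algorithm these contributions are summed over the factors and combined with the terms produced by propagating the means and variances through the MLP networks.

\begin{verbatim}
import numpy as np

def forward_cost_single_factor(s_bar, s_ring, s_breve, g_bar, g_tilde_star,
                               dg_dsprev, v_bar, v_tilde):
    """Feedforward sketch for one factor s_i(1..T).

    s_bar[t], s_ring[t] : posterior mean and conditional variance of s_i(t)
    s_breve[t]          : dependence of s_i(t) on s_i(t-1) (zero at t = 0)
    g_bar[t]            : posterior mean of g_i(t)
    g_tilde_star[t]     : posterior variance of g_i(t) without the
                          contribution of s_i(t-1)
    dg_dsprev[t]        : posterior mean of d g_i(t) / d s_i(t-1)
    v_bar, v_tilde      : posterior mean and variance of the noise parameter v_i
    Returns the marginalised variances and this factor's cost contribution.
    """
    T = len(s_bar)
    s_tilde = np.empty(T)
    cost = 0.0
    for t in range(T):
        prev = s_tilde[t - 1] if t > 0 else 0.0
        # Assumed form of Eq. (8): marginalised variance, forward in time.
        s_tilde[t] = s_ring[t] + s_breve[t] ** 2 * prev
        # Eq. (10): expectation of ln q(s_i(t) | s_i(t-1), X).
        cost += -0.5 * np.log(2.0 * np.pi * np.e * s_ring[t])
        # Eq. (11): expectation of -ln p(s_i(t) | s(t-1), theta).
        bracket = ((s_bar[t] - g_bar[t]) ** 2 + s_ring[t] + g_tilde_star[t]
                   + (s_breve[t] - dg_dsprev[t]) ** 2 * prev)
        cost += (0.5 * bracket * np.exp(2.0 * v_tilde - 2.0 * v_bar)
                 + v_bar + 0.5 * np.log(2.0 * np.pi))
    return s_tilde, cost
\end{verbatim}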

In the backward phase, the gradient of the cost function C w.r.t. the parameters of the posterior approximation is computed by the back-propagation algorithm, that is, the steps of the feedforward computations are reversed and the gradient of the cost function is propagated backwards to the parameters of the posterior approximation. Since the essential modification to the feedforward phase of the NLFA algorithm is (8), this is also the essential modification in the backward computations.

The cost function is a function of the parameters of the posterior approximation. In the computation of the cost function, the marginalised posterior variances $\tilde{s}_i(t)$ of the factors are used as intermediate variables, and hence the gradient is also computed through these variables. Let us use the notation $C(\tilde{s}_i(t))$ to mean that C is considered to be a function of the intermediate variables $\tilde{s}_i(1)$, $\ldots$, $\tilde{s}_i(t)$ in addition to the parameters of the posterior approximation. The gradient computations resulting from (8) by the chain rule are then as follows:

\begin{displaymath}
\frac{\partial C}{\partial \mathring{s}_i(t)} =
\frac{\partial C(\tilde{s}_i(t))}{\partial \mathring{s}_i(t)} +
\frac{\partial C(\tilde{s}_i(t))}{\partial \tilde{s}_i(t)}
\end{displaymath} (12)


\begin{displaymath}
\frac{\partial C}{\partial \breve{s}_i(t, t-1)} =
\frac{\partial C(\tilde{s}_i(t))}{\partial \breve{s}_i(t, t-1)} +
2 \frac{\partial C(\tilde{s}_i(t))}{\partial \tilde{s}_i(t)} \,
\breve{s}_i(t, t-1) \tilde{s}_i(t-1)
\end{displaymath} (13)


\begin{displaymath}
\frac{\partial C(\tilde{s}_i(t))}{\partial \tilde{s}_i(t)} =
\frac{\partial C(\tilde{s}_i(t+1))}{\partial \tilde{s}_i(t)} +
\frac{\partial C(\tilde{s}_i(t+1))}{\partial \tilde{s}_i(t+1)} \, \breve{s}_i^2(t+1, t)
\end{displaymath} (14)

The terms $\partial C(\tilde{s}_i(t))/\partial \mathring{s}_i(t)$ and $\partial C(\tilde{s}_i(t))/\partial \breve{s}_i(t,t-1)$ can be computed from (10) and (11), while $\partial C(\tilde{s}_i(t+1))/\partial \tilde{s}_i(t)$ also includes terms originating from the mappings ${\mathbf{f}}$ and ${\mathbf{g}}$, as their feedforward computation starts with the posterior means $\bar{s}_i(t)$ and variances $\tilde{s}_i(t)$.
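To make the backward recursion concrete, the following sketch (Python/NumPy, hypothetical names) organises (12)-(14) for one factor, assuming the direct partial derivatives of $C(\tilde{s}_i(t))$, obtained from (10), (11) and the mappings ${\mathbf{f}}$ and ${\mathbf{g}}$, are already available as arrays:

\begin{verbatim}
import numpy as np

def backward_gradients(dC_dsring_direct, dC_dsbreve_direct, dC_dstilde_direct,
                       s_breve, s_tilde):
    """Backward sketch for one factor, Eqs. (12)-(14).

    dC_dsring_direct[t]  : dC(s_tilde_i(t)) / d s_ring_i(t), e.g. from Eq. (10)
    dC_dsbreve_direct[t] : dC(s_tilde_i(t)) / d s_breve_i(t,t-1), from Eq. (11)
    dC_dstilde_direct[t] : dC(s_tilde_i(t+1)) / d s_tilde_i(t), i.e. the part
                           not coming through the recursion (8)
    s_breve, s_tilde     : dependences and marginalised variances (forward pass)
    """
    T = len(s_tilde)
    dC_dstilde = np.zeros(T)
    dC_dsring = np.zeros(T)
    dC_dsbreve = np.zeros(T)
    for t in reversed(range(T)):
        # Eq. (14): total gradient w.r.t. the marginalised variance,
        # computed recursively backward in time.
        dC_dstilde[t] = dC_dstilde_direct[t]
        if t + 1 < T:
            dC_dstilde[t] += dC_dstilde[t + 1] * s_breve[t + 1] ** 2
        # Eq. (12): gradient w.r.t. the conditional variance.
        dC_dsring[t] = dC_dsring_direct[t] + dC_dstilde[t]
        # Eq. (13): gradient w.r.t. the dependence.
        prev = s_tilde[t - 1] if t > 0 else 0.0
        dC_dsbreve[t] = dC_dsbreve_direct[t] + 2.0 * dC_dstilde[t] * s_breve[t] * prev
    return dC_dsring, dC_dsbreve, dC_dstilde
\end{verbatim}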

In the adaptation, the posterior means $\bar{s}_i(t)$ of the factors are treated as in the NLFA algorithm, except for the correction of the step size which is discussed in Section 3.3. The variances $\mathring{s}_i(t)$ are adapted like $\tilde{s}_i(t)$ in the NLFA algorithm. The posterior dependence $\breve{s}_i(t,t-1)$ is adapted by solving $\partial C/\partial \breve{s}_i(t, t-1) = 0$, which yields

\begin{displaymath}
\breve{s}_i(t, t-1) = \frac{\frac{\partial g_i(t)}{\partial s_i(t-1)} \, e^{2\tilde{v}_i - 2\bar{v}_i}}{2 \frac{\partial C(\tilde{s}_i(t))}{\partial \tilde{s}_i(t)} + e^{2\tilde{v}_i - 2\bar{v}_i}} \, .
\end{displaymath} (15)

Equation (15) shows that $\breve{s}_i(t,t-1)$ depends on $\partial C(\tilde{s}_i(t))/\partial \tilde{s}_i(t)$, which in turn depends on $\breve{s}_i(t+1, t)$, as (14) shows. This means that the update of the dependencies $\breve{s}_i(t,t-1)$ and the computation of the gradient w.r.t. the marginalised variance $\tilde{s}_i(t)$ are done recursively backward in time, which is the counterpart of (8), where the marginalised variances are computed recursively forward in time.
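A sketch of this backward sweep, interleaving (14) and (15) for one factor, could look as follows (Python/NumPy, hypothetical names; for simplicity the direct gradient terms are treated as fixed during the sweep):

\begin{verbatim}
import numpy as np

def update_dependences(dC_dstilde_direct, s_breve, dg_dsprev, v_bar, v_tilde):
    """Backward-in-time update of the dependences, Eqs. (14)-(15).

    dC_dstilde_direct[t] : dC(s_tilde_i(t+1)) / d s_tilde_i(t), treated as fixed
    s_breve[t]           : current dependences (entry 0 has no predecessor)
    dg_dsprev[t]         : posterior mean of d g_i(t) / d s_i(t-1)
    v_bar, v_tilde       : posterior mean and variance of the noise parameter v_i
    """
    T = len(s_breve)
    exp_term = np.exp(2.0 * v_tilde - 2.0 * v_bar)
    dC_dstilde = np.zeros(T)
    s_breve = np.array(s_breve, dtype=float)  # work on a copy
    for t in reversed(range(T)):
        # Eq. (14), using the dependence s_breve[t+1] already updated
        # on this backward sweep.
        dC_dstilde[t] = dC_dstilde_direct[t]
        if t + 1 < T:
            dC_dstilde[t] += dC_dstilde[t + 1] * s_breve[t + 1] ** 2
        # Eq. (15): solve dC / d s_breve_i(t,t-1) = 0.
        if t > 0:
            s_breve[t] = dg_dsprev[t] * exp_term / (2.0 * dC_dstilde[t] + exp_term)
    return s_breve, dC_dstilde
\end{verbatim}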

