
Total Derivatives

When updates are done locally, information spreads slowly, because the states of different time slices affect each other only between updates. It is possible to anticipate this interaction with a suitable approximation. We obtain a novel update algorithm for the posterior means of the states by replacing the partial derivatives of the cost function w.r.t. the state means $ \overline{\mathbf{s}}(t)$ with (approximated) total derivatives

$\displaystyle \frac{d{\cal C}_{\mathrm{KL}}}{d\overline{\mathbf{s}}(t)} = \sum_{\tau} \frac{\partial{\cal C}_{\mathrm{KL}}}{\partial\overline{\mathbf{s}}(\tau)} \frac{\partial\overline{\mathbf{s}}(\tau)}{\partial\overline{\mathbf{s}}(t)}.$ (6)

They can be computed efficiently using the chain rule and dynamic programming, given that we can approximate the terms $ \frac{\partial\overline{\mathbf{s}}(t)}{\partial\overline{\mathbf{s}}(t-1)}$ and $ \frac{\partial\overline{\mathbf{s}}(t)}{\partial\overline{\mathbf{s}}(t+1)}$.
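The following sketch illustrates this dynamic-programming accumulation with two linear-time sweeps over the chain. It assumes the local partial derivatives and the coupling matrices are already available as NumPy arrays; the names partials, A and B are illustrative and not part of the original method.

\begin{verbatim}
import numpy as np

def total_derivatives(partials, A, B):
    """Approximate total derivatives of the cost w.r.t. the state means.

    partials[t] : partial derivative dC/ds_bar(t), shape (m,)
    A[t] : coupling d s_bar(t) / d s_bar(t-1), shape (m, m); A[0] unused
    B[t] : coupling d s_bar(t) / d s_bar(t+1), shape (m, m); B[T-1] unused
    """
    T = len(partials)
    # Backward sweep: u[t] collects the contributions of slices tau >= t,
    # since d s_bar(tau)/d s_bar(t) = d s_bar(tau)/d s_bar(t+1) * A[t+1].
    u = [None] * T
    u[T - 1] = partials[T - 1]
    for t in range(T - 2, -1, -1):
        u[t] = partials[t] + A[t + 1].T @ u[t + 1]
    # Forward sweep: v[t] collects the contributions of slices tau <= t.
    v = [None] * T
    v[0] = partials[0]
    for t in range(1, T):
        v[t] = partials[t] + B[t - 1].T @ v[t - 1]
    # partials[t] is counted in both sweeps; subtract it once.
    return [u[t] + v[t] - partials[t] for t in range(T)]
\end{verbatim}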

Before going into the details, let us outline the idea. The posterior distribution of the state $ \mathbf{s}(t)$ can be factored into three potentials: one from $ \mathbf{s}(t-1)$ (the past), one from $ \mathbf{s}(t+1)$ (the future), and one from $ \mathbf{x}(t)$ (the observation). We linearise the nonlinear mappings so that the three potentials become Gaussian. The posterior of $ \mathbf{s}(t)$ then also becomes Gaussian, with a mean that is the weighted average of the means of the three potentials, the weights being their inverse (co)variances. A change in the mean of one potential therefore shifts the posterior mean by an amount proportional to that potential's inverse (co)variance.
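As a toy illustration of this precision weighting, the following sketch combines three one-dimensional Gaussian potentials; the numbers are purely hypothetical and not taken from the model.

\begin{verbatim}
import numpy as np

# Means and (co)variances of the three potentials acting on s(t):
# past, future and observation (hypothetical numbers).
means      = np.array([0.5, 1.2, 0.9])
variances  = np.array([0.1, 0.4, 0.2])
precisions = 1.0 / variances

# Posterior of s(t): precision-weighted average of the potential means.
post_var  = 1.0 / precisions.sum()
post_mean = post_var * (precisions * means).sum()

# Shifting one potential's mean moves the posterior mean in proportion
# to that potential's weight (its precision relative to the total).
shift = 0.1
shifted_mean = post_var * (precisions * (means + [shift, 0.0, 0.0])).sum()
assert np.isclose(shifted_mean - post_mean, post_var * precisions[0] * shift)
\end{verbatim}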

The terms of the cost function that relate to $ \mathbf{s}(t)$ (see Equation (5.6) in [1], although the notation is somewhat different) are

\begin{displaymath}\begin{split}{\cal C}_{\mathrm{KL}}(\mathbf{s}(t)) &= \sum_{i=1}^{m}\frac{1}{2\Sigma_{s,ii}}\left\{\left[\overline{s}_i(t)-\overline{g}_i(\mathbf{s}(t-1))\right]^2+\widetilde{s}_i(t)+\widetilde{g}_i(\mathbf{s}(t-1))\right\}-\sum_{i=1}^{m}\frac{1}{2}\ln\widetilde{s}_i(t)\\
&\quad+\sum_{i=1}^{m}\frac{1}{2\Sigma_{s,ii}}\left\{\left[\overline{s}_i(t+1)-\overline{g}_i(\mathbf{s}(t))\right]^2+\widetilde{s}_i(t+1)+\widetilde{g}_i(\mathbf{s}(t))\right\}\\
&\quad+\sum_{k=1}^{n}\frac{1}{2\Sigma_{x,kk}}\left\{\left[\overline{x}_k(t)-\overline{f}_k(\mathbf{s}(t))\right]^2+\widetilde{f}_k(\mathbf{s}(t))\right\},\end{split}\end{displaymath} (7)

where $ \overline{\alpha}$ and $ \widetilde{\alpha}$ denote the mean and (co)variance of $ \alpha$ over the posterior approximation $ q$, and $ n$ and $ m$ are the dimensionalities of $ \mathbf{x}$ and $ \mathbf{s}$, respectively. Note that we assume diagonal noise covariances $ \mathbf{\Sigma}_x$ and $ \mathbf{\Sigma}_s$. The nonlinearities $ \mathbf{f}$ and $ \mathbf{g}$ are replaced by the linearisations

$\displaystyle \widehat{\mathbf{f}}(\mathbf{s}(t)) =\overline{\mathbf{f}}(\mathbf{s}_\mathrm{cur}(t))+\mathbf{J}_f(t)\left[\mathbf{s}(t)-\overline{\mathbf{s}}_\mathrm{cur}(t)\right]$ (8)

$\displaystyle \widehat{\mathbf{g}}(\mathbf{s}(t)) =\overline{\mathbf{g}}(\mathbf{s}_\mathrm{cur}(t))+\mathbf{J}_g(t)\left[\mathbf{s}(t)-\overline{\mathbf{s}}_\mathrm{cur}(t)\right],$ (9)

where the subscript $ \mathrm{cur}$ denotes the current estimate, which is held constant w.r.t. further changes in $ \mathbf{s}(t)$. The minimum of (7) under these linearisations is found at the zero of the gradient:

$\displaystyle \widetilde{\mathbf{s}}_\mathrm{opt}(t) =\left[\mathbf{\Sigma}_s^{-1}+\mathbf{J}_g(t)^\mathrm{T}\mathbf{\Sigma}_s^{-1}\mathbf{J}_g(t)+\mathbf{J}_f(t)^\mathrm{T}\mathbf{\Sigma}_x^{-1}\mathbf{J}_f(t)\right]^{-1}$ (10)

\begin{displaymath}\begin{split}\overline{\mathbf{s}}_\mathrm{opt}(t) &=\widetilde{\mathbf{s}}_\mathrm{opt}(t)\left\{\mathbf{\Sigma}_s^{-1}\left[\overline{\mathbf{g}}(\mathbf{s}_\mathrm{cur}(t-1))+\mathbf{J}_g(t-1)(\overline{\mathbf{s}}(t-1)-\overline{\mathbf{s}}_\mathrm{cur}(t-1))\right]\right.\\
&\quad+\mathbf{J}_g(t)^\mathrm{T}\mathbf{\Sigma}_s^{-1}\left[\overline{\mathbf{s}}(t+1)-\overline{\mathbf{g}}(\mathbf{s}_\mathrm{cur}(t))+\mathbf{J}_g(t)\overline{\mathbf{s}}_\mathrm{cur}(t)\right]\\
&\quad\left.+\mathbf{J}_f(t)^\mathrm{T}\mathbf{\Sigma}_x^{-1}\left[\overline{\mathbf{x}}(t)-\overline{\mathbf{f}}(\mathbf{s}_\mathrm{cur}(t))+\mathbf{J}_f(t)\overline{\mathbf{s}}_\mathrm{cur}(t)\right]\right\}.\end{split}\end{displaymath} (11)
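A minimal sketch of this single-slice update follows, assuming the current posterior means, the linearisation points of Equations (8)-(9) and the Jacobians are available as NumPy arrays; all argument names are illustrative.

\begin{verbatim}
import numpy as np

def slice_update(s_bar_prev, s_bar_next, x_bar,
                 s_cur, s_cur_prev,
                 g_bar_prev, g_bar_cur, f_bar_cur,
                 Jg_prev, Jg_cur, Jf_cur,
                 Sigma_s_inv, Sigma_x_inv):
    """Linearised update of q(s(t)) following Eqs. (10)-(11).

    s_bar_prev, s_bar_next : posterior means of s(t-1) and s(t+1)
    x_bar                  : (mean of the) observation x(t)
    s_cur, s_cur_prev      : current estimates s_cur(t), s_cur(t-1)
    g_bar_prev, g_bar_cur  : g evaluated at s_cur(t-1) and s_cur(t)
    f_bar_cur              : f evaluated at s_cur(t)
    Jg_prev, Jg_cur, Jf_cur: Jacobians J_g(t-1), J_g(t), J_f(t)
    """
    # Eq. (10): summed precisions of the past, future and observation
    # potentials, inverted to give the optimal (co)variance.
    prec = (Sigma_s_inv
            + Jg_cur.T @ Sigma_s_inv @ Jg_cur
            + Jf_cur.T @ Sigma_x_inv @ Jf_cur)
    s_tilde_opt = np.linalg.inv(prec)

    # Eq. (11): precision-weighted combination of the three potentials.
    past   = Sigma_s_inv @ (g_bar_prev + Jg_prev @ (s_bar_prev - s_cur_prev))
    future = Jg_cur.T @ Sigma_s_inv @ (s_bar_next - g_bar_cur + Jg_cur @ s_cur)
    obs    = Jf_cur.T @ Sigma_x_inv @ (x_bar - f_bar_cur + Jf_cur @ s_cur)
    s_bar_opt = s_tilde_opt @ (past + future + obs)
    return s_bar_opt, s_tilde_opt
\end{verbatim}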

The optimum mean reacts to changes in the past and in the future by

$\displaystyle \frac{\partial\overline{\mathbf{s}}_\mathrm{opt}(t)}{\partial\overline{\mathbf{s}}(t-1)} = \widetilde{\mathbf{s}}_\mathrm{opt}(t)\,\mathbf{\Sigma}_s^{-1}\,\mathbf{J}_g(t-1)$ (12)

$\displaystyle \frac{\partial\overline{\mathbf{s}}_\mathrm{opt}(t)}{\partial\overline{\mathbf{s}}(t+1)} = \widetilde{\mathbf{s}}_\mathrm{opt}(t)\,\mathbf{J}_g(t)^\mathrm{T}\mathbf{\Sigma}_s^{-1}.$ (13)

Finally, we assume that Equations (12) and (13) hold approximately even in the nonlinear case when the subscripts $ \mathrm{opt}$ are dropped. The linearisation matrices $ \mathbf{J}$ need to be computed anyway [7], so the computational overhead is rather small.
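For completeness, here is a sketch of how these couplings could be assembled and passed to the dynamic-programming routine shown after Equation (6); the function and variable names are again illustrative.

\begin{verbatim}
def couplings(s_tilde, Sigma_s_inv, Jg_prev, Jg_cur):
    """Approximate couplings of Eqs. (12)-(13), with the 'opt' subscripts
    dropped. These are the A[t] and B[t] matrices consumed by the
    total_derivatives() sketch above."""
    A_t = s_tilde @ Sigma_s_inv @ Jg_prev      # d s_bar(t) / d s_bar(t-1)
    B_t = s_tilde @ Jg_cur.T @ Sigma_s_inv     # d s_bar(t) / d s_bar(t+1)
    return A_t, B_t
\end{verbatim}

Collecting these matrices for every $ t$ and feeding them, together with the local partial derivatives of the cost, to the accumulation sketched earlier yields the approximate total derivatives of Equation (6) that drive the update of the posterior means.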

