

NONLINEAR STATE-SPACE MODEL

As a specific example, let us study the nonlinear state-space model (NSSM) introduced by Valpola and Karhunen (2002). The model is specified by the generative equations

$\displaystyle \mathbf{x}(t)$ $\displaystyle = \mathbf{f}(\mathbf{s}(t), \boldsymbol{\theta}_\mathbf{f}) + \mathbf{n}(t)$ (23)
$\displaystyle \mathbf{s}(t)$ $\displaystyle = \mathbf{s}(t-1) + \mathbf{g}(\mathbf{s}(t-1), \boldsymbol{\theta}_\mathbf{g}) + \mathbf{m}(t),$ (24)

where $ t$ is time, $ \mathbf{x}(t)$ are the observations, and $ \mathbf{s}(t)$ are the hidden states. The observation mapping $ \mathbf{f}$ and the dynamical mapping $ \mathbf{g}$ are nonlinear and are modeled with multilayer perceptron (MLP) networks. The observation noise $ \mathbf{n}$ and the process noise $ \mathbf{m}$ are assumed to be Gaussian. The latent states $ \mathbf{s}(t)$ are collectively denoted by $ \boldsymbol{\theta}_{\boldsymbol{S}}$. The model parameters include both the weights of the MLP networks and a number of hyperparameters; their posterior approximation is a Gaussian with a diagonal covariance. The posterior approximation of the states $ q(\boldsymbol{\theta}_{\boldsymbol{S}} \vert \boldsymbol{\xi}_{\boldsymbol{S}})$ is also Gaussian, but with some dependencies modeled: the different components of the state vectors are still assumed independent, while the correlation between the corresponding components $ s_j(t)$ and $ s_j(t-1)$ of consecutive state vectors is modeled. This is a realistic minimal assumption for modeling the dependence between $ \mathbf{s}(t)$ and $ \mathbf{s}(t-1)$ (Valpola and Karhunen, 2002).
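
As a concrete illustration, the following NumPy sketch samples data from the generative model (23)-(24). The one-hidden-layer tanh form of $ \mathbf{f}$ and $ \mathbf{g}$ matches the MLP structure described above, but the dimensions and noise levels are illustrative assumptions, not the settings used here.

    import numpy as np

    rng = np.random.default_rng(0)

    def mlp(s, W1, b1, W2, b2):
        # One-hidden-layer tanh MLP, the form used here for both f and g.
        return W2 @ np.tanh(W1 @ s + b1) + b2

    # Illustrative dimensions: 3 states, 5 observations, 10 hidden units.
    d_s, d_x, d_h = 3, 5, 10
    theta_f = (rng.normal(size=(d_h, d_s)), rng.normal(size=d_h),
               rng.normal(size=(d_x, d_h)), rng.normal(size=d_x))
    theta_g = (rng.normal(size=(d_h, d_s)), rng.normal(size=d_h),
               0.1 * rng.normal(size=(d_s, d_h)), np.zeros(d_s))

    T = 100
    s = np.zeros((T, d_s))
    x = np.zeros((T, d_x))
    for t in range(1, T):
        # Eq. (24): s(t) is s(t-1) plus g(s(t-1)) plus process noise m(t).
        s[t] = s[t-1] + mlp(s[t-1], *theta_g) + 0.01 * rng.normal(size=d_s)
    for t in range(T):
        # Eq. (23): x(t) is f(s(t)) plus observation noise n(t).
        x[t] = mlp(s[t], *theta_f) + 0.1 * rng.normal(size=d_x)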

Because of the nonlinearities, the model is not in the conjugate-exponential family, and the standard VB learning methods are not directly applicable. The bound (22) can nevertheless be evaluated by linearizing the MLP networks $ \mathbf{f}$ and $ \mathbf{g}$ using the technique of Honkela and Valpola (2005). This makes it possible to evaluate the gradient with respect to $ \boldsymbol{\xi}_{\boldsymbol{S}}$, $ \boldsymbol{\xi}_{\mathbf{f}}$, and $ \boldsymbol{\xi}_{\mathbf{g}}$ and to use a gradient-based optimizer to adapt the parameters. These variables are updated jointly rather than with an EM-like alternation, because the same heavy gradient computations are needed for all of them.
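
The core idea of the linearization can be sketched as follows: propagate the mean and the (diagonal) variance of a Gaussian input through the MLP via a first-order Taylor expansion around the mean. This is a simplified first-order version only; the actual technique of Honkela and Valpola (2005) contains further refinements. The resulting Gaussian output moments are what make the Gaussian likelihood terms of the bound (22), and hence its gradient, tractable.

    import numpy as np

    def mlp_linearized_moments(mu, var, W1, b1, W2, b2):
        # Propagate N(mu, diag(var)) through a tanh MLP by linearizing
        # around the mean (first-order sketch of the idea).
        a = W1 @ mu + b1
        h = np.tanh(a)
        # Jacobian of the MLP at mu: W2 @ diag(1 - tanh(a)^2) @ W1.
        J = W2 @ ((1.0 - h**2)[:, None] * W1)
        mean_out = W2 @ h + b2
        # Output variances: the diagonal of J diag(var) J^T.
        var_out = (J ** 2) @ var
        return mean_out, var_out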

The natural gradient with respect to the parameters of $ q(\boldsymbol{\theta}_{\boldsymbol{S}} \vert \boldsymbol{\xi}_{\boldsymbol{S}})$, $ q(\boldsymbol{\theta}_{\mathbf{f}} \vert \boldsymbol{\xi}_{\mathbf{f}})$, and $ q(\boldsymbol{\theta}_{\mathbf{g}} \vert \boldsymbol{\xi}_{\mathbf{g}})$ was simplified by applying the gradient-based updates only to the mean elements. For $ q(\boldsymbol{\theta}_{\boldsymbol{S}} \vert \boldsymbol{\xi}_{\boldsymbol{S}})$, the fully diagonal approximation of the inverse of the metric tensor given by Eqs. (6) and (11) was used. Since $ q(\boldsymbol{\theta}_{\mathbf{f}} \vert \boldsymbol{\xi}_{\mathbf{f}})$ and $ q(\boldsymbol{\theta}_{\mathbf{g}} \vert \boldsymbol{\xi}_{\mathbf{g}})$ already have diagonal covariances, no further approximations were needed for them. Under these assumptions the natural gradient for the mean elements is

$\displaystyle \tilde{\nabla}_{\boldsymbol{\mu}_q} \mathcal{F}(\boldsymbol{\xi})$ $\displaystyle = \mathrm{diag}(\mathbf{\Sigma}_q) \nabla_{\boldsymbol{\mu}_q} \mathcal{F}(\boldsymbol{\xi}),$ (25)

where $ \boldsymbol{\mu}_q$ is the mean of the variational approximation $ q(\boldsymbol{\theta}\vert \boldsymbol{\xi})$ and $ \mathrm{diag}(\mathbf{\Sigma}_q)$ is the diagonal of the corresponding covariance.
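
In code, Eq. (25) reduces to an elementwise product. In the usage sketch below, the step size gamma and the placeholder values are hypothetical choices for illustration:

    import numpy as np

    def natural_gradient_means(grad_mu, sigma2):
        # Eq. (25) with a diagonal covariance: scale each component of
        # the plain gradient by the corresponding posterior variance.
        return sigma2 * grad_mu

    # Hypothetical usage: one ascent step on the bound w.r.t. the means.
    mu = np.zeros(4)                           # current means (placeholder)
    sigma2 = np.full(4, 0.5)                   # current variances (placeholder)
    grad_mu = np.array([0.2, -0.1, 0.0, 0.3])  # plain gradient (placeholder)
    gamma = 0.1                                # step size, an assumption
    mu = mu + gamma * natural_gradient_means(grad_mu, sigma2)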

The variances were updated separately using a fixed-point update rule, as described by Valpola and Karhunen (2002). The correlation parameters of $ q(\boldsymbol{\theta}_{\boldsymbol{S}} \vert \boldsymbol{\xi}_{\boldsymbol{S}})$ were updated in an EM-like fashion, keeping all other parameters fixed. The remaining hyperparameters were updated by VBEM.
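
As a sketch of what such a fixed-point rule looks like, assume the bound decomposes as $ \mathcal{F} = \mathrm{E}_q[\ln p] + \mathrm{H}(q)$, so that the Gaussian entropy contributes $ 1/(2\sigma^2)$ to the gradient with respect to each variance; the exact rule of Valpola and Karhunen (2002) is stated in terms of their cost function and may differ in its details.

    import numpy as np

    def variance_fixed_point(grad_Ep_var):
        # Setting dF/dvar = grad_Ep_var + 1/(2 var) = 0 and solving for
        # var gives the fixed point below. grad_Ep_var is the gradient of
        # E_q[ln p] with respect to the variances; it is negative near a
        # sensible solution, so the result is positive.
        return -1.0 / (2.0 * grad_Ep_var)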

