

VB for nonlinear state-space models

As a specific example, we consider the nonlinear state-space model (NSSM) introduced in [5], which is specified by the generative model

$\mathbf{x}(t) = \mathbf{f}(\mathbf{s}(t), \boldsymbol{\theta}_\mathbf{f}) + \mathbf{n}(t)$   (15)
$\mathbf{s}(t) = \mathbf{s}(t-1) + \mathbf{g}(\mathbf{s}(t-1), \boldsymbol{\theta}_\mathbf{g}) + \mathbf{m}(t),$   (16)

where $t$ is time, $\mathbf{x}(t)$ are the observations, and $\mathbf{s}(t)$ are the hidden states. The observation mapping $\mathbf{f}$ and the dynamical mapping $\mathbf{g}$ are nonlinear and are modeled with multilayer perceptron (MLP) networks. The observation noise $\mathbf{n}$ and the process noise $\mathbf{m}$ are assumed Gaussian. The latent states $\mathbf{s}(t)$ are collectively denoted by $\boldsymbol{\theta}_{\boldsymbol{S}}$. The model parameters include both the weights of the MLP networks and a number of hyperparameters. The posterior approximation of these parameters is a Gaussian with a diagonal covariance. The posterior approximation of the states $q(\boldsymbol{\theta}_{\boldsymbol{S}} \vert \boldsymbol{\xi}_{\boldsymbol{S}})$ is a Gaussian Markov random field with a correlation between the corresponding components of subsequent state vectors, $s_j(t)$ and $s_j(t-1)$. This is a realistic minimum assumption for modeling the dependence of the state vectors $\mathbf{s}(t)$ and $\mathbf{s}(t-1)$ [5].
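To make the generative model (15)-(16) concrete, the following sketch draws a state sequence and observations using one-hidden-layer tanh MLPs for $\mathbf{f}$ and $\mathbf{g}$. The dimensions, weight scales, and noise variances are hypothetical illustration choices, not those used in [5].

    import numpy as np

    rng = np.random.default_rng(0)

    def mlp(W_in, b_in, W_out, b_out, s):
        # One-hidden-layer MLP with tanh hidden units, as used for f and g.
        return W_out @ np.tanh(W_in @ s + b_in) + b_out

    # Hypothetical sizes: 3-d states, 5-d observations, 10 hidden units, 100 steps.
    ds, dx, dh, T = 3, 5, 10, 100

    # Hypothetical MLP weights theta_f and theta_g, drawn at random.
    theta_f = (rng.normal(size=(dh, ds)), rng.normal(size=dh),
               rng.normal(size=(dx, dh)), rng.normal(size=dx))
    theta_g = (0.1 * rng.normal(size=(dh, ds)), np.zeros(dh),
               0.1 * rng.normal(size=(ds, dh)), np.zeros(ds))

    s = np.zeros((T, ds))
    x = np.zeros((T, dx))
    for t in range(1, T):
        # Eq. (16): previous state plus nonlinear change g plus process noise m(t).
        s[t] = s[t-1] + mlp(*theta_g, s[t-1]) + 0.01 * rng.normal(size=ds)
    for t in range(T):
        # Eq. (15): nonlinear observation mapping f plus observation noise n(t).
        x[t] = mlp(*theta_f, s[t]) + 0.1 * rng.normal(size=dx)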

Because of the nonlinearities, the model is not in the conjugate-exponential family, and standard VB learning methods are applicable only to the hyperparameters, not to the latent states or the weights of the MLPs. The bound (1) can nevertheless be evaluated by linearizing the MLP networks $\mathbf{f}$ and $\mathbf{g}$ using the technique of [7]. This allows evaluating the gradient with respect to $\boldsymbol{\xi}_{\boldsymbol{S}}$, $\boldsymbol{\xi}_{\mathbf{f}}$, and $\boldsymbol{\xi}_{\mathbf{g}}$, and using a gradient-based optimizer to adapt the parameters.
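The linearization propagates the means and variances of the Gaussian approximation through the networks. As a minimal sketch of the flavor of this computation for a single tanh layer, assuming a diagonal input covariance and omitting the refinements of the actual technique of [7] (the function and variable names are hypothetical):

    import numpy as np

    def propagate_tanh_layer(W, b, mu_in, var_in):
        # First-order Taylor linearization of h = tanh(W s + b) about mu_in:
        # the mean passes through the nonlinearity, the variance through
        # the squared Jacobian of tanh evaluated at the mean.
        a_mu = W @ mu_in + b         # pre-activation mean
        a_var = (W ** 2) @ var_in    # pre-activation variance (diagonal approx.)
        h_mu = np.tanh(a_mu)         # approximate output mean
        jac = 1.0 - h_mu ** 2        # tanh'(a_mu)
        return h_mu, jac ** 2 * a_var

The natural gradient for the mean elements is given by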

$\tilde{\nabla}_{\boldsymbol{\mu}_q} \mathcal{F}(\boldsymbol{\xi}) = \mathbf{\Sigma}_q \nabla_{\boldsymbol{\mu}_q} \mathcal{F}(\boldsymbol{\xi}),$   (17)

where $\boldsymbol{\mu}_q$ is the mean of the variational approximation $q(\boldsymbol{\theta} \vert \boldsymbol{\xi})$ and $\mathbf{\Sigma}_q$ is the corresponding covariance. The covariance of the model parameters is diagonal, while the inverse covariance of the latent states $\mathbf{s}(t)$ is block-diagonal with tridiagonal blocks. This implies that all computations with these matrices can be done in time linear in the number of parameters. The covariances were updated separately using a fixed-point update rule similar to (2), as described in [5].
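For the MLP weights and hyperparameters the covariance is diagonal, so (17) reduces to an elementwise product; for the states, multiplying by $\mathbf{\Sigma}_q$ amounts to solving a tridiagonal system per state component. A minimal sketch under these assumptions (names hypothetical; the precision of a component sequence $s_j(1), \dots, s_j(T)$ is passed as its diagonal and first off-diagonal):

    import numpy as np
    from scipy.linalg import solve_banded

    def natural_grad_weights(grad, var_q):
        # Diagonal covariance: (17) is an elementwise scaling of the gradient.
        return var_q * grad

    def natural_grad_state_component(grad, prec_diag, prec_off):
        # Symmetric tridiagonal precision Lambda for one component s_j(.):
        # Sigma_q grad = Lambda^{-1} grad, an O(T) banded solve.
        T = len(grad)
        ab = np.zeros((3, T))
        ab[0, 1:] = prec_off      # superdiagonal of Lambda
        ab[1, :] = prec_diag      # main diagonal of Lambda
        ab[2, :-1] = prec_off     # subdiagonal (symmetry)
        return solve_banded((1, 1), ab, grad)

Both operations take time linear in the number of parameters, in line with the complexity claim above.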

