

Nonlinear State-Space Models

Nonlinear dynamical factor analysis (NDFA) [17] is a powerful tool for modelling the dynamics of an unknown noisy system. NDFA scales only quadratically with the dimensionality of the observation space, so it is also suitable for modelling systems with fairly high dimensionality [17]. In NDFA, the observations $ \mathbf{x}(t)$ have been generated from the hidden state $ \mathbf{s}(t)$ by the following generative model:

$\displaystyle \mathbf{x}(t) = \mathbf{f}(\mathbf{s}(t),\boldsymbol{\theta}_\mathbf{f}) + \mathbf{n}(t)$ (1)
$\displaystyle \mathbf{s}(t) = \mathbf{g}(\mathbf{s}(t-1),\boldsymbol{\theta}_\mathbf{g}) + \mathbf{m}(t),$ (2)

where $ \boldsymbol{\theta}$ is a vector containing the model parameters and time $ t$ is discrete. The noise terms $ \mathbf{n}(t)$ and $ \mathbf{m}(t)$ are assumed to be Gaussian and white. Only the observations $ \mathbf{x}$ are known beforehand, and both the states $ \mathbf{s}$ and the mappings $ \mathbf{f}$ and $ \mathbf{g}$ are learned from the data. Multilayer perceptron (MLP) networks [6] are well suited to modelling both strong and mild nonlinearities. The MLP network models for $ \mathbf{f}$ and $ \mathbf{g}$ are

$\displaystyle \mathbf{f}(\mathbf{s}(t),\boldsymbol{\theta}_\mathbf{f}) = \mathbf{B} \tanh\left[ \mathbf{A} \mathbf{s}(t) + \mathbf{a} \right] + \mathbf{b}$ (3)
$\displaystyle \mathbf{g}(\mathbf{s}(t),\boldsymbol{\theta}_\mathbf{g}) = \mathbf{s}(t) + \mathbf{D} \tanh\left[ \mathbf{C} \mathbf{s}(t) + \mathbf{c} \right] + \mathbf{d},$ (4)

where the sigmoidal tanh nonlinearity is applied component-wise to its argument vector. The parameters $ \boldsymbol{\theta}$ include: (1) the weight matrices $ \mathbf{A}\dots\mathbf{D}$ and the bias vectors $ \mathbf{a}\dots\mathbf{d}$; (2) the parameters of the distributions of the noise signals $ \mathbf{n}(t)$ and $ \mathbf{m}(t)$ and of the column vectors of the weight matrices; (3) the hyperparameters describing the distributions of the biases and of the parameters in group (2). A small numerical sketch of the generative model (1)-(4) is given at the end of this section.

There are infinitely many models that can explain any given data. In Bayesian learning, all the possible explanations are averaged, weighted by their posterior probabilities. The posterior probability $ p(\mathbf{s},\boldsymbol{\theta}\mid\mathbf{x})$ of the states and the parameters after observing the data contains all the relevant information about them. Variational Bayesian learning approximates this posterior density by a parametric distribution $ q(\mathbf{s},\boldsymbol{\theta})$. The misfit is measured by the Kullback-Leibler divergence:

$\displaystyle C_{\mathrm{KL}} = \int q(\mathbf{s},\boldsymbol{\theta}) \log \frac{q(\mathbf{s},\boldsymbol{\theta})}{p(\mathbf{s},\boldsymbol{\theta}\mid\mathbf{x})} \,d\boldsymbol{\theta}\,d\mathbf{s}.$ (5)
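
The posterior $ p(\mathbf{s},\boldsymbol{\theta}\mid\mathbf{x})$ in (5) involves the evidence $ p(\mathbf{x})$, which is unknown, but a standard rewriting shows why the cost can nevertheless be evaluated and minimised. Substituting $ p(\mathbf{s},\boldsymbol{\theta}\mid\mathbf{x}) = p(\mathbf{x},\mathbf{s},\boldsymbol{\theta})/p(\mathbf{x})$ gives

$\displaystyle C_{\mathrm{KL}} = \int q(\mathbf{s},\boldsymbol{\theta}) \log \frac{q(\mathbf{s},\boldsymbol{\theta})}{p(\mathbf{x},\mathbf{s},\boldsymbol{\theta})} \,d\boldsymbol{\theta}\,d\mathbf{s} + \log p(\mathbf{x}),$

where the last term does not depend on $ q$. Minimising the integral term is therefore equivalent to minimising $ C_{\mathrm{KL}}$, and it only requires the joint density defined by the model (1)-(4) and the priors on $ \boldsymbol{\theta}$.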

The approximation $ q$ needs to be simple for mathematical tractability and computational efficiency. The variables are assumed to depend on each other in the following way:

$\displaystyle q(\mathbf{s},\boldsymbol{\theta}) = \prod_{t=1}^T \prod_{i=1}^m q(s_i(t)\mid s_i(t-1)) \prod_j q(\theta_j),$ (6)

where $ m$ is the dimensionality of the state space $ \mathbf{s}$. Furthermore, $ q$ is assumed to be Gaussian. Learning and inference happen by adjusting $ q$ such that the cost function $ C_{\mathrm{KL}}$ is minimised. A good initialisation and other measures are essential because the iterative learning algorithm can easily get stuck in a local minimum of the cost function. The standard initialisation is based on principal component analysis of the data augmented with embedding. Details can be found in [17].
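
As a concrete illustration of the generative model, the following sketch draws one state and observation sequence from equations (1)-(4) using small random MLP weights. It is only an illustration of the forward model, not the NDFA learning algorithm of [17]; the dimensionalities, noise levels, and weight scales are arbitrary choices made for the example.

import numpy as np

rng = np.random.default_rng(0)

dim_s, dim_x, T = 3, 8, 100   # state dim, observation dim, sequence length (arbitrary)
n_hidden = 20                 # number of MLP hidden units (arbitrary)

# Weights of the observation MLP f and the dynamics MLP g, eqs. (3)-(4).
A = 0.5 * rng.standard_normal((n_hidden, dim_s))
a = np.zeros(n_hidden)
B = 0.5 * rng.standard_normal((dim_x, n_hidden))
b = np.zeros(dim_x)
C = 0.5 * rng.standard_normal((n_hidden, dim_s))
c = np.zeros(n_hidden)
D = 0.1 * rng.standard_normal((dim_s, n_hidden))
d = np.zeros(dim_s)

def f(s):
    """Observation mapping, eq. (3): B tanh(A s + a) + b."""
    return B @ np.tanh(A @ s + a) + b

def g(s):
    """State dynamics mapping, eq. (4): s + D tanh(C s + c) + d."""
    return s + D @ np.tanh(C @ s + c) + d

# Draw one sequence from the generative model, eqs. (1)-(2),
# adding white Gaussian noise n(t) and m(t).
std_n, std_m = 0.1, 0.05      # noise standard deviations (arbitrary)
s = np.zeros((T, dim_s))
x = np.zeros((T, dim_x))
s[0] = rng.standard_normal(dim_s)
x[0] = f(s[0]) + std_n * rng.standard_normal(dim_x)
for t in range(1, T):
    s[t] = g(s[t - 1]) + std_m * rng.standard_normal(dim_s)   # eq. (2)
    x[t] = f(s[t]) + std_n * rng.standard_normal(dim_x)       # eq. (1)

In actual NDFA learning, only the observations $ \mathbf{x}$ would be available, and the weights, states, and noise parameters would be inferred by minimising $ C_{\mathrm{KL}}$.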
