

Nonlinear State-Space Models

Selecting actions based on a state-space model instead of the observations directly has several benefits: Firstly, it is more resistant to noise Raiko05IJCNN because it implicitly involves filtering. Secondly, the observations (without history) do not always carry enough information about the system state. Thirdly, when nonlinear dynamics are modelled by a function approximator such as a multilayer perceptron (MLP) network, a state-space model can find a representation of the state that is better suited to the approximation and thus more predictable.

Nonlinear dynamical factor analysis (NDFA) Valpola02NC is a powerful tool for system identification. It is based on a nonlinear state-space model learned in a variational Bayesian setting. NDFA scales only quadratically with the dimensionality of the observation space, so it is also suitable for modelling systems with fairly high dimensionality Valpola02NC.

In our model, the observation (or measurement) vector $ \mathbf{y}(t)$ is assumed to have been generated from the hidden state vector $ \mathbf{x}(t)$ driven by the control $ \mathbf{u}(t)$ by the following generative model:

$\displaystyle \left[ \begin{array}{c} \mathbf{u}(t) \\ \mathbf{x}(t) \end{array} \right] = \mathbf{g}\left(\left[ \begin{array}{c} \mathbf{u}(t-1) \\ \mathbf{x}(t-1) \end{array} \right],\boldsymbol{\theta}_\mathbf{g}\right)+\mathbf{v}(t),$ (1)
$\displaystyle \mathbf{y}(t) = \mathbf{f}(\mathbf{x}(t),\boldsymbol{\theta}_\mathbf{f})+\mathbf{w}(t),$ (2)

where $ \boldsymbol{\theta}$ is a vector containing the model parameters and time $ t$ is discrete. The process noise $ \mathbf{v}(t)$ and the measurement noise $ \mathbf{w}(t)$ are assumed to be independent, Gaussian, and white. Only the observations $ \mathbf{y}$ are known beforehand; both the states $ \mathbf{x}$ and the mappings $ \mathbf{f}$ and $ \mathbf{g}$ are learned from the data. In the context of system identification, this can be considered task-oriented identification because the model includes an internal forward model for predicting $ \mathbf{u}(t)$ Raiko05IJCNN. Note that the process noise $ \mathbf{v}(t)$ leaves the exact selection of the control signal $ \mathbf{u}(t)$ open.
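To make the structure of (1)-(2) concrete, the following minimal sketch draws one trajectory from the model, assuming the mappings $ \mathbf{f}$ and $ \mathbf{g}$ are given as callables; the dimensions and noise levels are illustrative assumptions only, not values from the paper.

    # Minimal sketch of drawing data from the model (1)-(2). The mappings f and g
    # are passed in as callables; dimensions and noise levels are assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    n_u, n_x, n_y = 2, 4, 6                     # control, state, observation dims (assumed)

    def simulate(f, g, T, sigma_v=0.1, sigma_w=0.1):
        s = np.zeros(n_u + n_x)                 # stacked [u(t); x(t)], arbitrary start
        U, X, Y = [], [], []
        for _ in range(T):
            s = g(s) + sigma_v * rng.standard_normal(n_u + n_x)   # Eq. (1): process noise v(t)
            u, x = s[:n_u], s[n_u:]
            y = f(x) + sigma_w * rng.standard_normal(n_y)         # Eq. (2): measurement noise w(t)
            U.append(u); X.append(x); Y.append(y)
        return np.array(U), np.array(X), np.array(Y)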

Multilayer perceptron (MLP) networks Haykin98 are well suited to modelling both strong and mild nonlinearities. The MLP network models for $ \mathbf{f}$ and $ \mathbf{g}$ are

$\displaystyle \mathbf{g}(\mathbf{x}(t),\boldsymbol{\theta}_\mathbf{g}) = \mathbf{x}(t) + \mathbf{B} \tanh \left[ \mathbf{A} \mathbf{x}(t) + \mathbf{a}\right] + \mathbf{b},$ (3)
$\displaystyle \mathbf{f}(\mathbf{x}(t),\boldsymbol{\theta}_\mathbf{f}) = \mathbf{D} \tanh \left[ \mathbf{C} \mathbf{x}(t) + \mathbf{c}\right] + \mathbf{d},$ (4)

where the sigmoidal tanh nonlinearity is applied component-wise to its argument vector. The parameters $ \boldsymbol{\theta}$ include: (1) the weight matrices $ \mathbf{A}\dots\mathbf{D}$, the bias vectors $ \mathbf{a}\dots\mathbf{d}$; (2) the parameters of the distributions of the noise signals $ \mathbf{w}(t)$ and $ \mathbf{v}(t)$ and the column vectors of the weight matrices; (3) the hyperparameters describing the distributions of biases and the parameters in group (2).
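As a companion to the simulation sketch above, the MLP forms (3)-(4) can be written out directly; the hidden-layer size and the random, unlearned weights below are assumptions for illustration, and $ \mathbf{g}$ is applied to the stacked control-and-state vector of Eq. (1).

    # Sketch of the MLP mappings (3)-(4) with assumed sizes and random (unlearned) weights.
    import numpy as np

    rng = np.random.default_rng(1)
    n_u, n_x, n_y, n_h = 2, 4, 6, 10            # n_h: number of hidden units (assumed)
    n_s = n_u + n_x                             # g acts on the stacked [u; x] vector

    A = 0.1 * rng.standard_normal((n_h, n_s)); a = np.zeros(n_h)
    B = 0.1 * rng.standard_normal((n_s, n_h)); b = np.zeros(n_s)
    C = 0.1 * rng.standard_normal((n_h, n_x)); c = np.zeros(n_h)
    D = 0.1 * rng.standard_normal((n_y, n_h)); d = np.zeros(n_y)

    def mlp_g(s):
        # Eq. (3): the identity term means the MLP only has to model the change of the state
        return s + B @ np.tanh(A @ s + a) + b

    def mlp_f(x):
        # Eq. (4): static observation mapping
        return D @ np.tanh(C @ x + c) + d

These callables could be passed to the simulation sketch above, e.g. U, X, Y = simulate(mlp_f, mlp_g, T=100).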

There are infinitely many models that can explain any given data. In Bayesian learning, all the possible explanations are averaged, weighted by their posterior probability. The posterior probability $ p(\mathbf{x},\boldsymbol{\theta}\mid\mathbf{y})$ of the states and the parameters, given the observed data, contains all the relevant information about them. Variational Bayesian learning approximates the posterior density by a parametric distribution $ q(\mathbf{x},\boldsymbol{\theta})$. The misfit is measured by the Kullback-Leibler divergence:

$\displaystyle C_{\mathrm{KL}}= \int q(\mathbf{x},\boldsymbol{\theta}) \log \frac{q(\mathbf{x},\boldsymbol{\theta})}{p(\mathbf{x},\boldsymbol{\theta}\mid\mathbf{y})} \, d\boldsymbol{\theta} \, d\mathbf{x},$ (5)

that is, the closer $ q$ is to the true Bayesian posterior, the smaller the cost function.
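Although (5) is written in terms of the intractable posterior, a standard rearrangement (sketched here for clarity) separates it into a term involving only the joint density of the model plus the constant log-evidence:

$\displaystyle C_{\mathrm{KL}}= \int q(\mathbf{x},\boldsymbol{\theta}) \log \frac{q(\mathbf{x},\boldsymbol{\theta})}{p(\mathbf{x},\boldsymbol{\theta},\mathbf{y})} \, d\boldsymbol{\theta} \, d\mathbf{x} + \log p(\mathbf{y}).$

Since $ \log p(\mathbf{y})$ does not depend on $ q$, minimising the first term minimises $ C_{\mathrm{KL}}$.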

The approximation $ q$ needs to be simple for mathematical tractability and computational efficiency. The variables are assumed to depend on each other in the following way:

$\displaystyle q(\mathbf{x},\boldsymbol{\theta}) = \prod_{t=1}^T \prod_{i=1}^m q(x_i(t)\mid x_i(t-1)) \prod_j q(\theta_j),$ (6)

where $ m$ is the dimensionality of the state space $ \mathbf{x}$. Furthermore, $ q$ is assumed to be Gaussian. To summarise, the distribution $ q$ is parametrised by the means and variances of the unknown states and model parameters, and by the covariances of consecutive state components. The mean of a variable, say $ \mathbf{x}(t)$, over the distribution $ q$ is denoted by $ E_q \left\{ \mathbf{x}(t) \right\}$.
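As an illustration only, the variational parameters of such a $ q$ could be collected into a few arrays; the names and shapes below are assumptions for this sketch, not those of the NDFA package.

    import numpy as np

    T, m, n_theta = 100, 6, 50          # sequence length, state dim, number of parameters (assumed)

    # Gaussian q over the states: a mean and a variance for every x_i(t), plus the
    # covariance between x_i(t) and x_i(t-1) carried by q(x_i(t) | x_i(t-1)).
    x_mean   = np.zeros((T, m))
    x_var    = np.ones((T, m))
    x_lagcov = np.zeros((T, m))         # cov(x_i(t), x_i(t-1)); first row unused

    # Gaussian q over the model parameters: fully factorised, so mean and variance only.
    theta_mean = np.zeros(n_theta)
    theta_var  = np.ones(n_theta)

    E_x_10 = x_mean[10]                 # E_q{ x(10) }: the mean of x(10) under q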

Inference (or state estimation) is performed by adjusting the values corresponding to the hidden states in $ q$ such that the cost function $ C_{\mathrm{KL}}$ is minimised. Learning (or system identification) is performed by adjusting both the hidden states and the model parameters in $ q$ to minimise $ C_{\mathrm{KL}}$. The same cost function can also be used for determining the model structure, e.g. the dimensionality of the state space. The NDFA package provides an iterative minimisation algorithm for this purpose. A good initialisation and other measures are essential to avoid getting stuck in a bad local minimum of the cost function. The standard initialisation for learning is based on principal component analysis of the data augmented with embedding. Details can be found in Valpola02NC.
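The initialisation idea can be illustrated roughly as follows; this is only a sketch of principal component analysis on time-embedded observations, not the procedure implemented in the NDFA package (see Valpola02NC for the actual details).

    # Rough sketch: initial state means from PCA of time-embedded observations.
    import numpy as np

    def pca_embed_init(Y, m, lags=2):
        """Y has shape (T, n_y); returns initial state means of shape (T - lags, m)."""
        T = Y.shape[0]
        # Augment each observation with `lags` delayed copies (time embedding).
        Z = np.hstack([Y[lags - k : T - k] for k in range(lags + 1)])
        Z = Z - Z.mean(axis=0)
        # The leading principal components serve as initial state means.
        U, S, Vt = np.linalg.svd(Z, full_matrices=False)
        return U[:, :m] * S[:m]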

