

Hierarchical Nonlinear Factor Analysis (HNFA)


In (linear) factor analysis, continuous-valued observation vectors $ {\mathbf{x}}(t)$ are generated from unknown factors (or sources) $ {\mathbf{s}}(t)$, a bias vector $ {\mathbf{b}}$, and noise $ {\mathbf{n}}(t)$ by $ {\mathbf{x}}(t) = {\mathbf{A}}{\mathbf{s}}(t) + {\mathbf{b}} + {\mathbf{n}}(t)$. The factors and the noise are assumed to be Gaussian and independent. The index $ t$ may represent time or the object of the observation. The mapping $ {\mathbf{A}}$, the factors, and parameters such as the noise variances are found using Bayesian learning. Factor analysis is closely related to principal component analysis (PCA). The unknown factors may represent real phenomena, or they may simply be auxiliary variables that induce dependencies between the observations.
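
The following minimal sketch (plain numpy, with arbitrary, hypothetical dimensions and noise level that are not taken from the paper) illustrates how observations are generated in the linear model:

    import numpy as np

    rng = np.random.default_rng(0)
    n_factors, n_obs, T = 3, 10, 100          # hypothetical sizes
    A = rng.normal(size=(n_obs, n_factors))   # mixing matrix A
    b = rng.normal(size=n_obs)                # bias vector b
    s = rng.normal(size=(n_factors, T))       # Gaussian factors s(t)
    n = 0.1 * rng.normal(size=(n_obs, T))     # Gaussian noise n(t)
    x = A @ s + b[:, None] + n                # x(t) = A s(t) + b + n(t)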

Hierarchical nonlinear factor analysis (HNFA) [11] generalises factor analysis by adding more layers of factors, forming a multi-layer perceptron type of network. In this paper, there are two layers of factors $ {\mathbf{h}}$ and $ {\mathbf{s}}$, and the mappings are:

$\displaystyle {\mathbf{h}}(t) = {\mathbf{B}}{\mathbf{s}}(t) + {\mathbf{b}} + {\mathbf{n}}_h(t)$ (1)
$\displaystyle {\mathbf{x}}(t) = {\mathbf{A}}{\mathbf{f}}[ {\mathbf{h}}(t) ] + {\mathbf{C}}{\mathbf{s}}(t) + {\mathbf{a}} + {\mathbf{n}}_x(t) \, ,$ (2)

where the nonlinearity $ {\mathbf{f}}(\xi) = \exp(-\xi^2)$ operates on each element separately. HNFA can easily be implemented using the Bayes Blocks software library [10,12]. The update rules are derived automatically in the manner described briefly below.
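
To make the notation concrete, here is an illustrative numpy sketch of the generative mappings (1)-(2); it is not the Bayes Blocks implementation, and the layer sizes and noise levels are hypothetical:

    import numpy as np

    rng = np.random.default_rng(0)
    dim_s, dim_h, dim_x, T = 3, 5, 10, 100              # hypothetical layer sizes
    B = rng.normal(size=(dim_h, dim_s)); b = rng.normal(size=dim_h)
    A = rng.normal(size=(dim_x, dim_h)); C = rng.normal(size=(dim_x, dim_s))
    a = rng.normal(size=dim_x)

    s = rng.normal(size=(dim_s, T))                              # upper-layer factors s(t)
    h = B @ s + b[:, None] + 0.1 * rng.normal(size=(dim_h, T))   # equation (1)
    fh = np.exp(-h ** 2)                                         # f(xi) = exp(-xi^2), elementwise
    x = A @ fh + C @ s + a[:, None] + 0.1 * rng.normal(size=(dim_x, T))  # equation (2)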

The unknown variables $ \boldsymbol{\theta}$ (factors, mappings, and parameters) are learned from the data with variational Bayesian learning [4]. A parametric distribution $ q(\boldsymbol{\theta})$ over the unknown variables $ \boldsymbol{\theta}$ is fitted to the true posterior distribution $ p(\boldsymbol{\theta}\mid \boldsymbol{X})$, where the matrix $ \boldsymbol{X}$ contains all the observations $ {\mathbf{x}}(t)$. The misfit is measured by the Kullback-Leibler divergence $ D( \cdot \parallel \cdot )$. An additional term $ -\log p(\boldsymbol{X})$ is included to avoid calculating the model evidence term $ p(\boldsymbol{X})=\int p(\boldsymbol{X},\boldsymbol{\theta}) d\boldsymbol{\theta}$. The cost function is

$\displaystyle \mathcal{C} = D( q(\boldsymbol{\theta}) \parallel p(\boldsymbol{\theta}\mid \boldsymbol{X}) ) - \log p(\boldsymbol{X}) = \left< \log \frac{ q(\boldsymbol{\theta}) }{ p(\boldsymbol{X},\boldsymbol{\theta}) } \right> \, ,$ (3)

where $ \left< \cdot \right>$ denotes the expectation over the distribution $ q(\boldsymbol{\theta})$. Note that since $ D( q \parallel p) \geq 0$, the cost function provides a lower bound on the model evidence: $ p(\boldsymbol{X}) \geq \exp (-\mathcal{C})$. The posterior approximation $ q(\boldsymbol{\theta})$ is chosen to be Gaussian with a diagonal covariance matrix.
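
The lower bound follows directly from decomposing the joint probability as $ p(\boldsymbol{X},\boldsymbol{\theta}) = p(\boldsymbol{\theta}\mid\boldsymbol{X}) \, p(\boldsymbol{X})$ inside the expectation in (3):

$\displaystyle \mathcal{C} = \left< \log q(\boldsymbol{\theta}) - \log p(\boldsymbol{X},\boldsymbol{\theta}) \right> = D( q(\boldsymbol{\theta}) \parallel p(\boldsymbol{\theta}\mid \boldsymbol{X}) ) - \log p(\boldsymbol{X}) \geq -\log p(\boldsymbol{X}) \, ,$

and exponentiating $ -\mathcal{C} \leq \log p(\boldsymbol{X})$ gives $ p(\boldsymbol{X}) \geq \exp(-\mathcal{C})$.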

It is possible, though slightly impractical, to also model discrete values in HNFA by using a discrete variable with a soft-max prior [12]. In the binary case, the $ i$th component of $ {\mathbf{x}}(t)$ is left as a latent auxiliary variable, and an observed binary variable $ y(t)$ is conditioned on it by $ p(y(t)=1\mid x_i(t)) = \frac{\exp x_i(t) }{ 1 + \exp x_i(t)}$. The general discrete case follows analogously, requiring more than one auxiliary component of $ {\mathbf{x}}(t)$. The experiments in Section 3 use a thousand copies of a binary variable with the same conditional probability. These can be united into a single variable by multiplying its cost by one thousand. Observing 800 ones and 200 zeros then corresponds to fixing the variable to a distribution that is one with probability 0.8 and zero with probability 0.2.
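
As a small illustration of the binary link and the weighting trick described above (plain numpy with a hypothetical latent value, not the Bayes Blocks code), the conditional probability and the corresponding weighted cost contribution can be evaluated as:

    import numpy as np

    def p_one(x_i):
        # p(y(t)=1 | x_i(t)) = exp(x_i) / (1 + exp(x_i)), i.e. the logistic function
        return 1.0 / (1.0 + np.exp(-x_i))

    # One variable whose cost is multiplied by 1000 stands in for a thousand copies;
    # observing 800 ones and 200 zeros fixes it to the distribution (0.8, 0.2),
    # giving the weighted negative log-likelihood below.
    x_i = 0.5                        # hypothetical latent value
    n_copies, frac_ones = 1000, 0.8
    p1 = p_one(x_i)
    cost = -n_copies * (frac_ones * np.log(p1) + (1.0 - frac_ones) * np.log(1.0 - p1))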


