Construction of probabilistic models

In order to apply the Bayesian approach to modelling, the model needs to be given in probabilistic terms, which means stating the joint distribution of all the variables in the model. In principle, any joint distribution can be regarded as a model, but in practice the joint distribution usually has a simple form, typically a product of simple terms.

As an example, we shall see how a generative model turns into a probabilistic model. Suppose we have a model which tells how a sequence $\vec{y} = y(1), \ldots, y(t)$ is transformed into a sequence $\vec{x} = x(1), \ldots, x(t)$:

$\begin{displaymath}x(t) = f(y(t), \theta) + n(t) \end{displaymath}$

(14)

This is called a generative model for $\vec{x}$ because it tells explicitly how the sequence $\vec{x}$ is generated from the sequence $\vec{y}$ through a mapping f parametrised by $\theta$. As it is usually unrealistic to assume that everything affecting $\vec{x}$ could be modelled exactly, the models typically include a noise term n(t).
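The generative reading of equation 14 can be sketched as a sampling routine. The mapping `f` below (a scaled tanh), the parameter values and the choice of zero-mean Gaussian noise are assumptions made only for illustration; the text does not fix any of them.

```python
import math
import random

def f(y, theta):
    # Hypothetical mapping chosen only for illustration.
    return theta * math.tanh(y)

def generate_x(ys, theta, sigma, rng):
    # x(t) = f(y(t), theta) + n(t), with n(t) drawn as zero-mean Gaussian noise
    # (the Gaussian form is an assumption of this sketch).
    return [f(y, theta) + rng.gauss(0.0, sigma) for y in ys]

rng = random.Random(0)                   # fixed seed for repeatability
ys = [0.5 * t for t in range(1, 6)]      # y(1), ..., y(5)
xs = generate_x(ys, theta=2.0, sigma=0.1, rng=rng)
clean = generate_x(ys, theta=2.0, sigma=0.0, rng=rng)  # sigma = 0 removes the noise
```

With `sigma=0.0` the output reduces to the deterministic mapping, which makes the role of the noise term easy to see.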

If y(t) and $\theta$ are given, then x(t) has the same distribution as n(t) except that it is offset by $f(y(t), \theta)$. This means that if n(t) is zero-mean Gaussian noise with variance $\sigma^2$, equation 14 translates into

$\begin{displaymath}x(t) \sim N(f(y(t), \theta), \sigma^2) \, , \end{displaymath}$

(15)

which is equivalent to

$\begin{displaymath}p(x(t) \vert y(t), \theta, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{[x(t) - f(y(t), \theta)]^2}{2\sigma^2}} \, . \end{displaymath}$

(16)
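As a minimal sketch, equation 16 can be evaluated directly, together with its logarithm, which is usually the more convenient quantity for long sequences. The mapping `f` is again a hypothetical choice that does not come from the text.

```python
import math

def f(y, theta):
    # Hypothetical mapping, an assumption made for this sketch.
    return theta * math.tanh(y)

def gaussian_density(x, y, theta, sigma):
    # Equation 16: density of x(t) given y(t), theta and sigma.
    resid = x - f(y, theta)
    norm = math.sqrt(2.0 * math.pi * sigma ** 2)
    return math.exp(-resid ** 2 / (2.0 * sigma ** 2)) / norm

def log_gaussian_density(x, y, theta, sigma):
    # Logarithm of the same density.
    resid = x - f(y, theta)
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - resid ** 2 / (2.0 * sigma ** 2)
```

The density peaks where x(t) equals $f(y(t), \theta)$ and falls off at a rate set by $\sigma$.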

The joint density of all the variables can then be written as

$\begin{displaymath}p(\vec{x}, \vec{y}, \theta, \sigma) = p(\vec{y}, \theta, \sigma) \prod_t p(x(t) \vert y(t), \theta, \sigma) \, . \end{displaymath}$

(17)

Usually the probability $p(\vec{y}, \theta, \sigma)$ is also stated in a factorised form, which makes the full joint probability density $p(\vec{x}, \vec{y}, \theta, \sigma)$ a product of many simple terms.
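The product form of equation 17 becomes a sum in the log domain. The sketch below assumes one possible factorisation of $p(\vec{y}, \theta, \sigma)$: independent standard normal terms for each y(t) and for $\theta$, and a unit-rate exponential density for $\sigma$. These priors and the mapping `f` are hypothetical choices for illustration, not ones given in the text.

```python
import math

def f(y, theta):
    # Hypothetical mapping, an assumption made for this sketch.
    return theta * math.tanh(y)

def log_normal(v, mean, std):
    # Log density of a Gaussian with the given mean and standard deviation.
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (v - mean) ** 2 / (2.0 * std ** 2)

def log_prior(ys, theta, sigma):
    # Assumed factorisation: p(y, theta, sigma) = p(theta) p(sigma) prod_t p(y(t)).
    lp = log_normal(theta, 0.0, 1.0)                 # theta ~ N(0, 1), hypothetical
    lp += -sigma                                     # sigma ~ Exponential(1), hypothetical
    lp += sum(log_normal(y, 0.0, 1.0) for y in ys)   # y(t) ~ N(0, 1), hypothetical
    return lp

def log_joint(xs, ys, theta, sigma):
    # Equation 17 in log form: log p(y, theta, sigma) + sum_t log p(x(t) | y(t), theta, sigma).
    if sigma <= 0.0:
        return -math.inf                             # sigma must be positive
    log_lik = sum(log_normal(x, f(y, theta), sigma) for x, y in zip(xs, ys))
    return log_prior(ys, theta, sigma) + log_lik
```

Every term is a simple one-dimensional density, which is the practical meaning of the joint density being a product of many simple terms.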

In supervised learning, the sequence $\vec{y}$ is assumed to be fully known, also for any future data, which means that the full joint probability $p(\vec{x}, \vec{y}, \theta, \sigma)$ is not needed, only

$\begin{displaymath}p(\vec{x}, \theta, \sigma \vert \vec{y}) = p(\vec{x} \vert \vec{y}, \theta, \sigma) p(\theta, \sigma \vert \vec{y}) \, . \end{displaymath}$

(18)

Typically $\vec{y}$ is assumed to be independent of $\theta$ and $\sigma$, i.e., $p(\vec{y}, \theta, \sigma) = p(\vec{y}) p(\theta, \sigma)$. This also means that $p(\theta, \sigma \vert \vec{y}) = p(\theta, \sigma)$, and thus only $p(\vec{x} \vert \vec{y}, \theta, \sigma)$, given by the generative model of equation 14, and the prior for the parameters $p(\theta, \sigma)$ are needed in supervised learning.
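In the supervised case the quantity of interest can therefore be sketched without any term for $p(\vec{y})$. The function below returns the logarithm of $p(\vec{x} \vert \vec{y}, \theta, \sigma) p(\theta, \sigma)$, which equals the log posterior of the parameters up to an additive constant. The mapping `f` and the priors are hypothetical choices, as is the small data set.

```python
import math

def f(y, theta):
    # Hypothetical mapping, an assumption made for this sketch.
    return theta * math.tanh(y)

def log_normal(v, mean, std):
    # Log density of a Gaussian with the given mean and standard deviation.
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (v - mean) ** 2 / (2.0 * std ** 2)

def log_posterior_unnorm(theta, sigma, xs, ys):
    # log p(x | y, theta, sigma) + log p(theta, sigma); p(y) never appears.
    if sigma <= 0.0:
        return -math.inf
    log_lik = sum(log_normal(x, f(y, theta), sigma) for x, y in zip(xs, ys))
    log_prior = log_normal(theta, 0.0, 1.0) - sigma   # hypothetical priors on theta and sigma
    return log_lik + log_prior

ys = [-1.0, -0.5, 0.5, 1.0]        # fully known inputs
xs = [f(y, 2.0) for y in ys]       # noise-free targets generated with theta = 2
```

Comparing `log_posterior_unnorm(2.0, 0.5, xs, ys)` with `log_posterior_unnorm(-2.0, 0.5, xs, ys)` shows the data favouring the generating value of $\theta$, since the assumed prior treats both values alike.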

If the probability $p(\vec{y})$ is not modelled in supervised learning, it is impossible to treat missing elements of the sequence $\vec{y}$. If the probability $p(\vec{y})$ is modelled, however, there are no problems. The posterior density is computed for all unknown variables, including the missing elements of $\vec{y}$. In fact, unsupervised learning can be seen as a special case where the whole sequence $\vec{y}$ is unknown. In a probabilistic framework, the treatment of any missing values is possible as long as the model defines the joint density of all the variables in the model. It is, for instance, easy to treat missing elements of the sequence $\vec{x}$ or to mix freely between supervised and unsupervised learning depending on how large a part of the sequence $\vec{y}$ is known.
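A minimal sketch of this treatment of missing values follows. A missing x(t) is marked with `None` and its factor is skipped, because $p(x(t) \vert y(t), \theta, \sigma)$ integrates to one over x(t). A missing y(t) is treated as an unknown: the caller supplies a candidate value, and the joint density, viewed as a function of that candidate, is proportional to its posterior. The mapping `f` and all priors are hypothetical choices.

```python
import math

def f(y, theta):
    # Hypothetical mapping, an assumption made for this sketch.
    return theta * math.tanh(y)

def log_normal(v, mean, std):
    # Log density of a Gaussian with the given mean and standard deviation.
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (v - mean) ** 2 / (2.0 * std ** 2)

def log_joint_with_missing(xs, ys, theta, sigma):
    # xs: observed x(t), or None where x(t) is missing.
    # ys: a value for every y(t); missing ones hold candidate values, since they
    #     are unknowns of the model just like theta and sigma.
    if sigma <= 0.0:
        return -math.inf
    total = log_normal(theta, 0.0, 1.0) - sigma       # hypothetical priors on theta and sigma
    for x, y in zip(xs, ys):
        total += log_normal(y, 0.0, 1.0)              # p(y(t)), hypothetical N(0, 1)
        if x is not None:
            total += log_normal(x, f(y, theta), sigma)
        # A missing x(t) integrates out of the joint density, so its factor is skipped.
    return total

# y(2) is unknown while x(2) is observed: compare two candidate values for y(2).
xs = [f(0.3, 2.0), f(0.8, 2.0)]
good = log_joint_with_missing(xs, [0.3, 0.8], 2.0, 0.5)
bad = log_joint_with_missing(xs, [0.3, -0.8], 2.0, 0.5)
```

Here `good` exceeds `bad`: the assumed prior on y(2) is symmetric, so the observed x(2) alone decides between the two candidates.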

Harri Lappalainen
2000-03-03