
Free Form Q

Instead of assuming a form for the approximate posterior distribution, we can derive the optimal separable distribution: the functions that minimise the cost function subject to the constraint that they are normalised.

Instead of learning the log-std $v = \ln \sigma$, we shall learn the inverse noise variance $\beta=\sigma^{-2}$. The prior on $\beta $ is assumed to be a Gamma distribution of the form

\begin{displaymath}
p\left(\beta\right) = \frac{b_\beta^{c_\beta}}{\Gamma\left(c_\beta\right)} \beta^{\left(c_\beta-1\right)} \exp\left(-b_\beta \beta\right)
\end{displaymath} (34)

Setting $b_\beta=c_\beta=10^{-3}$ leads to a broad prior in $\ln \beta$; this is equivalent to assuming that $\sigma_v$ is large in the log-std parametrisation.
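To see why, change variables to $u=\ln \beta$; the implied density (a short derivation added here for clarity) is

\begin{displaymath}
p(u) = p(\beta)\left\vert\frac{d\beta}{du}\right\vert = \frac{b_\beta^{c_\beta}}{\Gamma\left(c_\beta\right)} \exp\left(c_\beta u - b_\beta e^{u}\right)
\end{displaymath}

For $b_\beta=c_\beta=10^{-3}$ both terms in the exponent are negligible over a wide range of $u$, so the prior is nearly flat in $\ln \beta$.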

The cost function that must be minimised is now

C = $\displaystyle D\left(q(m,\beta) \vert\vert P(m,\beta \vert x)\right) - \ln P(x)$  
  = $\displaystyle \int q(m,\beta)\ln \frac{q(m,\beta)}{P(x, m, \beta)} dm d\beta$ (35)

If we assume a separable posterior, that is $q(m,\beta)=q(m)q(\beta)$ and substitute our priors into the cost function we obtain
C = $\displaystyle \int q(m)q(\beta)\ln \frac{q(m)q(\beta)}{\left[ \prod_t P(x(t) \vert m, \beta) \right] P(m) P(\beta)} dm d\beta$  
  = $\displaystyle \int q(m)q(\beta)\left[ \ln q(m) +\ln q(\beta) - \ln P(m) -\ln P(\beta) \right.$  
    $\displaystyle \left. -\sum_t \left( \frac{1}{2}\ln \frac{\beta}{2 \pi} -\frac{\beta \left(x(t)-m\right)^2}{2} \right)\right] dm d\beta$ (36)

Assuming we know $q(\beta)$ we can integrate over $\beta $ in C (dropping any terms that are independent of m) to obtain

\begin{displaymath}C= \int q(m)\left[\ln q(m) - \ln P(m) -\sum_t \left( -\frac{\bar{\beta} \left(x(t)-m\right)^2}{2} \right) \right] dm
\end{displaymath} (37)

where $\bar{\beta}$ is the average value of $\beta $ under the distribution $q(\beta)$. We can optimise the cost function with respect to q(m) by taking a functional derivative and setting it to zero.
$\displaystyle \frac{\partial C}{\partial q(m)}$ = $\displaystyle 1 + \ln q(m) - \ln P(m) -\sum_t \left( -\frac{\bar{\beta} \left(x(t)-m\right)^2}{2} \right) +\lambda_m$  
  = 0 (38)

where $\lambda_m$ is a Lagrange multiplier introduced to ensure that q(m) is normalised. Rearranging we see that
$\displaystyle \ln q(m)$ = $\displaystyle -1 -\frac{1}{2}\ln 2 \pi \sigma_m^2 - \frac{(m-\mu_m)^2}{2 \sigma_m^2} +\sum_t \left( -\frac{\bar{\beta} \left(x(t)-m\right)^2}{2} \right) -\lambda_m$ (39)

and so the approximate posterior distribution is a Gaussian with variance $\tilde{m}=\left(\sigma_m^{-2}+T \bar{\beta} \right)^{-1}$ and mean $\bar{m}=\tilde{m} \left(\frac{\mu_m}{\sigma_m^2}+\bar{\beta} \sum_t x(t)\right)$.
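The Gaussian form follows by gathering the terms in m and completing the square (a step spelled out here for clarity):

\begin{displaymath}
\ln q(m) = \mathrm{const} -\frac{1}{2}\left(\sigma_m^{-2} + T \bar{\beta}\right) m^2 + \left(\frac{\mu_m}{\sigma_m^2} + \bar{\beta} \sum_t x(t)\right) m
\end{displaymath}

Matching this quadratic against $-\frac{\left(m-\bar{m}\right)^2}{2 \tilde{m}}$ gives the stated variance $\tilde{m}$ and mean $\bar{m}$.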

We can obtain the optimum form for $q(\beta)$ by marginalising the cost function over m and dropping terms independent of $\beta $.

C = $\displaystyle \int q(\beta)\left[ \ln q(\beta)-\ln P(\beta) \right.$  
    $\displaystyle \left.-\sum_t \left( \frac{1}{2}\ln \beta -\frac{\beta \left(\left(x(t)-\bar{m}\right)^2+ \tilde{m}\right)}{2} \right) \right] d\beta$ (40)

Again we can perform a functional derivative to obtain
$\displaystyle \frac{\partial C}{\partial q(\beta)}$ = $\displaystyle 1 + \ln q(\beta) -\ln P(\beta)$  
    $\displaystyle -\sum_t \left( \frac{1}{2}\ln \beta -\frac{\beta \left(\left(x(t)-\bar{m}\right)^2+ \tilde{m}\right)}{2} \right)+\lambda_\beta$  
  = 0 (41)

and so

\begin{displaymath}
\ln q(\beta) = -1 +\sum_t \left( \frac{1}{2}\ln \beta -\frac{\beta \left(\left(x(t)-\bar{m}\right)^2+ \tilde{m}\right)}{2} \right) +\ln P(\beta) -\lambda_\beta
\end{displaymath} (42)
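Substituting $\ln P(\beta) = \mathrm{const} + \left(c_\beta-1\right)\ln \beta - b_\beta \beta$ and collecting the $\ln \beta$ and $\beta$ terms (a step spelled out here) gives

\begin{displaymath}
\ln q(\beta) = \mathrm{const} + \left(c_\beta + \frac{T}{2} - 1\right)\ln \beta - \left(b_\beta + \frac{\sum_t \left(\left(x(t)-\bar{m}\right)^2+ \tilde{m}\right)}{2}\right)\beta
\end{displaymath}

which is the logarithm of an unnormalised Gamma density.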

So the optimal posterior distribution is a Gamma distribution, with parameters $\hat{b}_\beta=b_\beta+\frac{\sum_t \left(\left(x(t)-\bar{m}\right)^2+ \tilde{m}\right)}{2}$ and $\hat{c}_\beta=c_\beta+\frac{T}{2}$. Therefore the expectation of $\beta $ under the posterior distribution is $\bar{\beta}=\frac{\hat{c}_\beta}{\hat{b}_\beta}$.

The optimal distributions for m and $\beta $ depend on each other (q(m) is a function of $\bar{\beta}$ and $q(\beta)$ is a function of $\bar{m}$ and $\tilde{m}$) so the optimal solutions can be found by iteratively updating q(m) and $q(\beta)$.
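As a concrete sketch of this iteration (not from the original text; the data set and the Gaussian prior hyperparameters below are illustrative assumptions), the coupled updates can be run in a few lines:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data matching the figure's setup: T = 100 points from m = 1, sigma = 0.1.
T = 100
x = rng.normal(1.0, 0.1, size=T)

# Prior hyperparameters. The Gamma prior on beta matches the text;
# the Gaussian prior on m (mu_m, sigma2_m) is an illustrative assumption.
mu_m, sigma2_m = 0.0, 1.0
b_beta = c_beta = 1e-3

beta_bar = 1.0  # initial guess for the mean of beta under q(beta)
for _ in range(50):
    # q(m) update: Gaussian with variance m_tilde and mean m_bar (eq. 39).
    m_tilde = 1.0 / (1.0 / sigma2_m + T * beta_bar)
    m_bar = m_tilde * (mu_m / sigma2_m + beta_bar * x.sum())
    # q(beta) update: Gamma with parameters b_hat, c_hat (eq. 42).
    b_hat = b_beta + 0.5 * np.sum((x - m_bar) ** 2 + m_tilde)
    c_hat = c_beta + 0.5 * T
    beta_bar = c_hat / b_hat  # mean of a Gamma(c_hat, b_hat) distribution

print(m_bar, beta_bar)  # m_bar close to 1, beta_bar close to 1/0.1**2
```

The updates converge in a handful of iterations because each one exactly minimises the cost with respect to its own factor while the other is held fixed.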

A general point is that the free-form optimisation of the cost function will typically lead to a set of iterative update equations, where each distribution is updated on the basis of the other distributions in the approximation.

We can also see that, if the parametrisation of the model is chosen appropriately, the optimal separable posterior has the same functional form as the prior: if the prior distributions are Gaussian, the posterior distributions are also Gaussian (and likewise for Gamma distributions). In this case we say that we have chosen conjugate priors.

Figure 4: Comparison of the true and approximate posterior distributions for a test set containing 100 data points drawn from a model with m=1 and $\sigma =0.1$. The plot on the left shows the true posterior distribution over m and $\beta $. The plot on the right shows the approximate posterior distribution derived by obtaining the optimal free-form separable distribution.

Figure 4 shows a comparison of the true posterior distribution and the approximate posterior. The data set is the same as for the fixed-form example. The contours of both distributions are centred in the same region, corresponding to a model that underestimates m. The contours of the two distributions are qualitatively similar, and the approximate distribution also exhibits the asymmetric density.

Harri Lappalainen