Derivations

An approximation for $-E_Q\{\ln p(s_i(t) \vert c_i, S_i, \gamma_i)\}$ is derived here. The derivation makes use of Jensen's inequality and a second order Taylor series expansion.

Let $g_{ij} = -\ln {\cal G}(s_i(t); S_{ij}, \gamma_{ij})$. The expectation to be approximated is then

\begin{displaymath}
E_Q\left\{\ln \sum_j e^{c_{ij}}\right\} - E_Q\left\{\ln \sum_j
e^{c_{ij}-g_{ij}} \right\}.\end{displaymath}
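The form of this expression is consistent with $p(s_i(t) \vert c_i, S_i, \gamma_i)$ being, presumably as defined in the main text, a mixture of Gaussians with softmax mixing proportions,

\begin{displaymath}
p(s_i(t) \vert c_i, S_i, \gamma_i) = \sum_j \frac{e^{c_{ij}}}{\sum_k e^{c_{ik}}}\,
{\cal G}(s_i(t); S_{ij}, \gamma_{ij}),
\end{displaymath}

whose negative logarithm is exactly $\ln \sum_j e^{c_{ij}} - \ln \sum_j e^{c_{ij}-g_{ij}}$.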

Let us first consider the latter term. The logarithm of the sum is a strictly convex function of $g_{ij}$. By Jensen's inequality, moving the expectation inside a convex function cannot increase the value. Replacing the latter expectation by

\begin{displaymath}
E_{s_i(t), c_i}\left\{\ln \sum_j e^{c_{ij}-E_{S_{ij}, \gamma_{ij}}
 \{g_{ij}\}} \right\}\end{displaymath}

can therefore only decrease the latter term and hence increase the whole expression. This is safe because we are trying to minimise the Kullback-Leibler information, and approximating it from above is thus conservative.
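Written out, the inequality being used is, for each fixed $s_i(t)$ and $c_i$,

\begin{displaymath}
E_{S_{ij}, \gamma_{ij}}\left\{\ln \sum_j e^{c_{ij}-g_{ij}}\right\} \geq
\ln \sum_j e^{c_{ij}-E_{S_{ij}, \gamma_{ij}}\{g_{ij}\}},
\end{displaymath}

and since it holds pointwise, it survives the remaining expectation over $s_i(t)$ and $c_i$.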

The expectation $E_{S_{ij},\gamma_{ij}}\{g_{ij}\}$ is similar to equation 1 and equals

\begin{displaymath}
\frac{[(s_i(t) - \hat{S}_{ij})^2 +
 \tilde{S}_{ij}]e^{2(\tilde{\gamma}_{ij}-\hat{\gamma}_{ij})} + \ln 2\pi}{2}
+ \hat{\gamma}_{ij}.\end{displaymath}
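This follows from the standard Gaussian expectations, under the assumption (suggested by the exponent above, but defined in the main text) that ${\cal G}$ is parametrised so that its standard deviation is $e^{\gamma_{ij}}$ and that $Q$ makes $S_{ij}$ and $\gamma_{ij}$ independent Gaussians with means $\hat{S}_{ij}$, $\hat{\gamma}_{ij}$ and variances $\tilde{S}_{ij}$, $\tilde{\gamma}_{ij}$:

\begin{displaymath}
E_{S_{ij}}\left\{(s_i(t) - S_{ij})^2\right\} = (s_i(t) - \hat{S}_{ij})^2 + \tilde{S}_{ij},
\qquad
E_{\gamma_{ij}}\left\{e^{-2\gamma_{ij}}\right\} = e^{2(\tilde{\gamma}_{ij}-\hat{\gamma}_{ij})}.
\end{displaymath}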

Let us denote $E_{S_{ij},\gamma_{ij}}\{g_{ij}\}$ by $\hat{g}_{ij}(s_i(t))$. At this point, the approximation equals

\begin{displaymath}
E_{c_i}\left\{\ln \sum_j e^{c_{ij}}\right\} - E_{s_i(t),
 c_i}\left\{\ln \sum_j e^{c_{ij}-\hat{g}_{ij}(s_i(t))} \right\}.\end{displaymath}

The terms inside the expectations are functions of $s_i(t)$ and $c_i$.

Next, let us consider a second order Taylor series expansion about $\hat{s}_i(t)$ and $\hat{c}_i$. Notice that the first order terms and all second order cross terms disappear in the expectations, and only the constant and the pure second order terms remain. This is because the variables are independent in the ensemble.
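The step uses only the following elementary fact: if $x$ has mean $\hat{x}$ and variance $\tilde{x}$ under $Q$, then for a twice differentiable $f$,

\begin{displaymath}
E\left\{f(x)\right\} \approx f(\hat{x}) + \frac{1}{2} f''(\hat{x})\, \tilde{x},
\end{displaymath}

because $E\{x - \hat{x}\} = 0$ removes the first order term, while for two independent variables $E\{(x-\hat{x})(y-\hat{y})\} = 0$ removes the cross terms.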

The constant term will be  
 \begin{displaymath}
 \ln \sum_j e^{\hat{c}_{ij}} - \ln \sum_j
 e^{\hat{c}_{ij}-\hat{g}_{ij}(\hat{s}_i(t))}\end{displaymath} (2)
and the remaining second order terms of $c_{ij}$ are

\begin{displaymath}
E_{c_i}\left\{\sum_j \frac{(c_{ij}-\hat{c}_{ij})^2}{2} \zeta_{ij} (1 -
\zeta_{ij})\right\}\end{displaymath}

and

\begin{displaymath}
-E_{c_i}\left\{\sum_j \frac{(c_{ij}-\hat{c}_{ij})^2}{2} \xi_{ij} (1 -
\xi_{ij})\right\},\end{displaymath}

where

\begin{displaymath}
\zeta_{ij} = \frac{e^{\hat{c}_{ij}}}{\sum_k e^{\hat{c}_{ik}}}\end{displaymath}

and

\begin{displaymath}
\xi_{ij} = \frac{e^{\hat{c}_{ij}-\hat{g}_{ij}(\hat{s}_i(t))}}
{\sum_k e^{\hat{c}_{ik}-\hat{g}_{ik}(\hat{s}_i(t))}}.\end{displaymath}
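In other words, $\zeta_{ij}$ and $\xi_{ij}$ are the softmax weights of the two log-sum-exponentials evaluated at the means, and the quadratic coefficients above are their diagonal second derivatives, for instance

\begin{displaymath}
\frac{\partial^2}{\partial c_{ij}^2} \ln \sum_k e^{c_{ik}}
 \Bigg\vert_{c_i = \hat{c}_i} = \zeta_{ij}(1 - \zeta_{ij}).
\end{displaymath}

The off-diagonal second derivatives produce only cross terms, which vanish as noted above, and taking the expectations then requires only $E_{c_i}\{(c_{ij}-\hat{c}_{ij})^2\} = \tilde{c}_{ij}$.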

Taking the expectations yields  
 \begin{displaymath}
 \sum_j \frac{\tilde{c}_{ij}}{2}[\zeta_{ij} (1 - \zeta_{ij}) - \xi_{ij}
 (1 - \xi_{ij})].\end{displaymath} (3)
The second order term of $s_i(t)$ will be

\begin{displaymath}
E_{s_i(t)}\left\{\frac{(s_i(t)-\hat{s}_i(t))^2}{2} (\phi_i + \chi_i^2
- \psi_i)\right\},\end{displaymath}

where

\begin{displaymath}
\phi_i = \sum_j \xi_{ij} e^{2(\tilde{\gamma}_{ij}-\hat{\gamma}_{ij})},\end{displaymath}

\begin{displaymath}
\chi_i = \sum_j \xi_{ij} (\hat{s}_i(t) - \hat{S}_{ij})
e^{2(\tilde{\gamma}_{ij}-\hat{\gamma}_{ij})}\end{displaymath}

and

\begin{displaymath}
\psi_i = \sum_j \xi_{ij} \left[ (\hat{s}_i(t) - \hat{S}_{ij})
e^{2(\tilde{\gamma}_{ij}-\hat{\gamma}_{ij})} \right]^2.\end{displaymath}
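Put together, $\phi_i + \chi_i^2 - \psi_i$ is the negative second derivative, with respect to $s_i(t)$, of the latter log-sum-exponential evaluated at the means. Writing $\hat{g}_{ij}'(\hat{s}_i(t)) = (\hat{s}_i(t) - \hat{S}_{ij}) e^{2(\tilde{\gamma}_{ij}-\hat{\gamma}_{ij})}$ and $\hat{g}_{ij}''(\hat{s}_i(t)) = e^{2(\tilde{\gamma}_{ij}-\hat{\gamma}_{ij})}$,

\begin{displaymath}
-\frac{\partial^2}{\partial s_i(t)^2} \ln \sum_j e^{c_{ij}-\hat{g}_{ij}(s_i(t))}
 \Bigg\vert_{\hat{s}_i(t), \hat{c}_i}
 = \sum_j \xi_{ij} \hat{g}_{ij}''
 + \Big(\sum_j \xi_{ij} \hat{g}_{ij}'\Big)^2
 - \sum_j \xi_{ij} (\hat{g}_{ij}')^2
 = \phi_i + \chi_i^2 - \psi_i.
\end{displaymath}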

Taking the expectation yields  
 \begin{displaymath}
 \frac{\tilde{s}_i(t)}{2}(\phi_i + \chi_i^2 - \psi_i).\end{displaymath} (4)
At this point, the approximation of the original expectation is thus the sum of terms in equations 2-4.

Some care has to be taken with the approximation resulting from the Taylor series expansion because it utilises only local information about the shape of the posterior pdf. For instance, if the mean $\hat{s}_i(t)$ happens to be in a valley between two Gaussians, the term in equation 4 will be negative. It then looks as if the Kullback-Leibler information could be decreased by increasing $\tilde{s}_i(t)$. This only holds for small $\tilde{s}_i(t)$, however. At some point after the distribution of $s_i(t)$ has become broader than the separation between the two Gaussians, the Kullback-Leibler information starts to increase as $\tilde{s}_i(t)$ increases.
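As a purely illustrative example, take two components with equal $\hat{c}_{ij}$ and identical parameter posteriors apart from the means $\hat{S}_{i1} = -a$ and $\hat{S}_{i2} = a$, with $e^{2(\tilde{\gamma}_{ij}-\hat{\gamma}_{ij})} = 1$, and let $\hat{s}_i(t) = 0$ lie in the middle of the valley, so that $\xi_{i1} = \xi_{i2} = 1/2$. Then

\begin{displaymath}
\phi_i = 1, \qquad \chi_i = 0, \qquad \psi_i = a^2,
\end{displaymath}

and $\phi_i + \chi_i^2 - \psi_i = 1 - a^2$ is negative whenever $a > 1$, that is, whenever the component means are more than one standard deviation away from $\hat{s}_i(t)$.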

In order to avoid the problem, only positive terms of equations 3 and 4 will be included. The final approximation is thus equation 2 plus the positive terms of equations 3 and 4.
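With $[x]_+ = \max(x, 0)$ and reading the positive terms as the individual positive summands of equation 3, the final approximation of $-E_Q\{\ln p(s_i(t) \vert c_i, S_i, \gamma_i)\}$ can be written compactly as

\begin{displaymath}
\ln \sum_j e^{\hat{c}_{ij}} - \ln \sum_j e^{\hat{c}_{ij}-\hat{g}_{ij}(\hat{s}_i(t))}
+ \sum_j \frac{\tilde{c}_{ij}}{2}\Big[\zeta_{ij}(1 - \zeta_{ij}) - \xi_{ij}(1 - \xi_{ij})\Big]_+
+ \frac{\tilde{s}_i(t)}{2}\Big[\phi_i + \chi_i^2 - \psi_i\Big]_+ .
\end{displaymath}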


Harri Lappalainen
7/10/1998