Next: About this document ... Up: Ensemble Learning for Independent Previous: References

Derivations

An approximation for $-E_Q\{\ln p(s_i(t) \vert c_i, S_i, \gamma_i)\}$ is derived here. The derivation makes use of Jensen's inequality and a second-order Taylor series expansion.

Let $g_{ij} = -\ln {\cal G}(s_i(t); S_{ij}, \gamma_{ij})$. The expectation to be approximated is then

$\begin{displaymath} E_Q\left\{\ln \sum_j e^{c_{ij}}\right\} - E_Q\left\{\ln \sum_j e^{c_{ij}-g_{ij}} \right\}.\end{displaymath}$

Let us first consider the latter term. The logarithm of the sum is a convex function of the $g_{ij}$. By Jensen's inequality, moving the expectation inside a convex function cannot increase its value. Note that the term enters with a negative sign. Replacing the latter expectation by

$\begin{displaymath} E_{s_i(t), c_i}\left\{\ln \sum_j e^{c_{ij}-E_{S_{ij}, \gamma_{ij}} \{g_{ij}\}} \right\}\end{displaymath}$

can therefore only increase the approximation. This is safe because we are trying to minimise the Kullback-Leibler information, and overestimating it is thus conservative.
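The direction of this bound can be checked numerically. The following is a minimal sketch, not part of the original derivation: the values of `c`, the sampling distributions of the `g` values and the helper names `log_sum_exp` and `jensen_gap` are all illustrative assumptions. It compares the sample average of $-\ln \sum_j e^{c_{ij}-g_{ij}}$ with the value obtained after moving the expectation over $g_{ij}$ inside the logarithm of the sum.

```python
import math
import random

def log_sum_exp(values):
    """Numerically stable ln(sum_j exp(values[j]))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def jensen_gap(c, g_samples):
    """Return (exact, bound) for -E{ln sum_j exp(c_j - g_j)}.

    `exact` averages over the samples of g; `bound` moves the
    expectation over g inside the logarithm of the sum.
    """
    n = len(g_samples)
    exact = -sum(log_sum_exp([cj - gj for cj, gj in zip(c, g)])
                 for g in g_samples) / n
    g_mean = [sum(g[j] for g in g_samples) / n for j in range(len(c))]
    bound = -log_sum_exp([cj - gj for cj, gj in zip(c, g_mean)])
    return exact, bound

# Hypothetical values chosen only for illustration.
rng = random.Random(0)
c = [0.3, -1.2, 0.8]
g_samples = [[rng.gauss(1.0, 0.5), rng.gauss(2.0, 1.0), rng.gauss(0.5, 0.2)]
             for _ in range(2000)]
exact, bound = jensen_gap(c, g_samples)
print(exact, bound)
```

Because Jensen's inequality also holds for the empirical distribution of the samples, `exact` should not exceed `bound` for any choice of samples, which is the sense in which the replacement is conservative.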

The expectation $E_{S_{ij},\gamma_{ij}}\{g_{ij}\}$ is similar to equation 1 and equals

$\begin{displaymath} \frac{[(s_i(t) - \hat{S}_{ij})^2 + \tilde{S}_{ij}]e^{2(\tilde{\gamma}_{ij}-\hat{\gamma}_{ij})} + \ln 2\pi} {2} + \hat{\gamma}_{ij}.\end{displaymath}$

Let us denote this by $\hat{g}_{ij}(s_i(t))$. At this point, the approximation equals

$\begin{displaymath} E_{c_i}\left\{\ln \sum_j e^{c_{ij}}\right\} - E_{s_i(t), c_i}\left\{\ln \sum_j e^{c_{ij}-\hat{g}_{ij}(s_i(t))} \right\}.\end{displaymath}$

The terms inside the expectations are functions of $s_i(t)$ and $c_i$.
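As an illustration, the expression for $\hat{g}_{ij}(s_i(t))$ can be written as a small function. This is a sketch under assumptions read off from the formula rather than stated in this section: ${\cal G}(s; S, \gamma)$ is taken to be a Gaussian with mean $S$ and standard deviation $e^{\gamma}$, and the ensemble distributions of $S_{ij}$ and $\gamma_{ij}$ are taken to be Gaussian with means $\hat{S}_{ij}$, $\hat{\gamma}_{ij}$ and variances $\tilde{S}_{ij}$, $\tilde{\gamma}_{ij}$. The function names are hypothetical.

```python
import math

def neg_log_gaussian(s, S, gamma):
    """-ln G(s; S, gamma), assuming mean S and standard deviation exp(gamma)."""
    return 0.5 * ((s - S) ** 2 * math.exp(-2.0 * gamma)
                  + math.log(2.0 * math.pi)) + gamma

def g_hat(s, S_mean, S_var, gamma_mean, gamma_var):
    """Expectation of g_ij over the (assumed Gaussian) ensemble of S_ij, gamma_ij."""
    # E{(s - S)^2} = (s - S_mean)^2 + S_var
    # E{exp(-2 gamma)} = exp(2 (gamma_var - gamma_mean)) for Gaussian gamma
    scale = math.exp(2.0 * (gamma_var - gamma_mean))
    return 0.5 * (((s - S_mean) ** 2 + S_var) * scale
                  + math.log(2.0 * math.pi)) + gamma_mean

# With zero posterior variances the expectation reduces to g_ij itself.
point = g_hat(0.7, 0.2, 0.0, -0.3, 0.0)
# Posterior uncertainty in S_ij and gamma_ij can only raise the expected cost.
spread = g_hat(0.7, 0.2, 0.1, -0.3, 0.05)
print(point, spread)
```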

Next, let us consider a second-order Taylor series expansion about $\hat{s}_i(t)$ and $\hat{c}_i$. Notice that the first-order terms and all second-order cross terms disappear in the expectations, and only the constant and the pure second-order terms remain. This is because the variables are independent in the ensemble.
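To spell this step out (a short worked illustration, with $x$ and $y$ introduced here to stand for any two distinct variables among $s_i(t)$ and the $c_{ij}$, and $\hat{x}$, $\tilde{x}$ for the mean and variance of $x$ in the ensemble):

$\begin{displaymath} E\{x - \hat{x}\} = 0, \qquad E\{(x - \hat{x})(y - \hat{y})\} = E\{x - \hat{x}\}\,E\{y - \hat{y}\} = 0, \qquad E\{(x - \hat{x})^2\} = \tilde{x}.\end{displaymath}$

The middle equality is where independence is used. Each pure second-order term therefore contributes one half of the corresponding second derivative, evaluated at the means, multiplied by the variance.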

The constant term will be

$\begin{displaymath} \ln \sum_j e^{\hat{c}_{ij}} - \ln \sum_j e^{\hat{c}_{ij}-\hat{g}_{ij}(\hat{s}_i(t))}\end{displaymath}$ (2)

and the remaining second-order terms in $c_{ij}$ are

$\begin{displaymath} E_{c_i}\left\{\sum_j \frac{(c_{ij}-\hat{c}_{ij})^2}{2} \zeta_{ij} (1 - \zeta_{ij})\right\}\end{displaymath}$

and

$\begin{displaymath} -E_{c_i}\left\{\sum_j \frac{(c_{ij}-\hat{c}_{ij})^2}{2} \xi_{ij} (1 - \xi_{ij})\right\},\end{displaymath}$

where

$\begin{displaymath} \zeta_{ij} = \frac{e^{\hat{c}_{ij}}}{\sum_k e^{\hat{c}_{ik}}}\end{displaymath}$

and

$\begin{displaymath} \xi_{ij} = \frac{e^{\hat{c}_{ij}-\hat{g}_{ij}(\hat{s}_i(t))}} {\sum_k e^{\hat{c}_{ik}-\hat{g}_{ik}(\hat{s}_i(t))}}.\end{displaymath}$

Taking the expectations yields

$\begin{displaymath} \sum_j \frac{\tilde{c}_{ij}}{2}[\zeta_{ij} (1 - \zeta_{ij}) - \xi_{ij} (1 - \xi_{ij})].\end{displaymath}$ (3)
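Equation 3 is straightforward to evaluate once $\zeta_{ij}$ and $\xi_{ij}$ are recognised as two softmax weightings. The sketch below is illustrative only: the function names and the numerical values are assumptions, and `g_hat_at_mean` stands for the list of values $\hat{g}_{ij}(\hat{s}_i(t))$, which are taken as given.

```python
import math

def softmax(values):
    """Normalised exponentials, shifted by the maximum for stability."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def c_term(c_mean, c_var, g_hat_at_mean):
    """Second-order term in c_ij (equation 3) for one source i."""
    zeta = softmax(c_mean)
    xi = softmax([c - g for c, g in zip(c_mean, g_hat_at_mean)])
    return sum(0.5 * v * (z * (1.0 - z) - x * (1.0 - x))
               for v, z, x in zip(c_var, zeta, xi))

# Hypothetical values chosen only for illustration.
c_mean = [0.5, -0.2, 1.0]
c_var = [0.1, 0.2, 0.05]
same = c_term(c_mean, c_var, [2.0, 2.0, 2.0])   # equal g_hat: zeta == xi
mixed = c_term(c_mean, c_var, [0.5, 3.0, 1.5])  # may be of either sign
print(same, mixed)
```

When all $\hat{g}_{ij}(\hat{s}_i(t))$ are equal, the two weightings coincide and the term vanishes. In general the sign of the term is not fixed.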

The second-order term in $s_i(t)$ is

$\begin{displaymath} E_{s_i(t)}\left\{\frac{(s_i(t)-\hat{s}_i(t))^2}{2} (\phi_i + \chi_i^2 - \psi_i)\right\},\end{displaymath}$

where

$\begin{displaymath} \phi_i = \sum_j \xi_{ij} e^{2(\tilde{\gamma}_{ij}-\hat{\gamma}_{ij})},\end{displaymath}$

$\begin{displaymath} \chi_i = \sum_j \xi_{ij} (\hat{s}_i(t) - \hat{S}_{ij}) e^{2(\tilde{\gamma}_{ij}-\hat{\gamma}_{ij})}\end{displaymath}$

and

$\begin{displaymath} \psi_i = \sum_j \xi_{ij} \left[ (\hat{s}_i(t) - \hat{S}_{ij}) e^{2(\tilde{\gamma}_{ij}-\hat{\gamma}_{ij})} \right]^2.\end{displaymath}$

Taking the expectation yields

$\begin{displaymath} \frac{\tilde{s}_i(t)}{2}(\phi_i + \chi_i^2 - \psi_i).\end{displaymath}$ (4)
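The factor $\phi_i + \chi_i^2 - \psi_i$ can be verified by direct differentiation. In the following sketch, $f$ and the primes (derivatives with respect to $s_i(t)$) are notation introduced here for illustration, and $\xi_{ij}$ is understood as the weight defined above evaluated at a general point. With $f = -\ln \sum_j e^{c_{ij} - \hat{g}_{ij}(s_i(t))}$,

$\begin{displaymath} f' = \sum_j \xi_{ij} \hat{g}'_{ij}, \qquad \xi'_{ij} = \xi_{ij} \left( \sum_k \xi_{ik} \hat{g}'_{ik} - \hat{g}'_{ij} \right),\end{displaymath}$

$\begin{displaymath} f'' = \sum_j \xi_{ij} \hat{g}''_{ij} + \left( \sum_j \xi_{ij} \hat{g}'_{ij} \right)^2 - \sum_j \xi_{ij} (\hat{g}'_{ij})^2.\end{displaymath}$

Since $\hat{g}'_{ij} = (s_i(t) - \hat{S}_{ij}) e^{2(\tilde{\gamma}_{ij}-\hat{\gamma}_{ij})}$ and $\hat{g}''_{ij} = e^{2(\tilde{\gamma}_{ij}-\hat{\gamma}_{ij})}$, evaluating $f''$ at $\hat{s}_i(t)$ and $\hat{c}_i$ gives the three sums $\phi_i$, $\chi_i^2$ and $\psi_i$ in turn.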

At this point, the approximation of the original expectation is thus the sum of the terms in equations 2, 3 and 4.

Some care has to be taken with the approximation resulting from the Taylor series expansion because it utilises only local information about the shape of the posterior pdf. For instance, if the mean $\hat{s}_i(t)$ happens to lie in a valley between two Gaussians, the term in equation 4 will be negative. It then looks as if the Kullback-Leibler information could be decreased by increasing $\tilde{s}_i(t)$. This holds only for small $\tilde{s}_i(t)$, however. Once the distribution of $s_i(t)$ has become broader than the separation between the two Gaussians, the Kullback-Leibler information starts to increase as $\tilde{s}_i(t)$ increases.
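The sign change can be reproduced with a small numerical example. The sketch below is an illustration under assumptions not taken from the text: two equally weighted unit-variance Gaussians centred at -2 and 2, zero posterior variance for $S_{ij}$ and $\gamma_{ij}$, and hypothetical function names. It evaluates $\phi_i + \chi_i^2 - \psi_i$ once in the valley and once on a mode.

```python
import math

def softmax(values):
    """Normalised exponentials, shifted by the maximum for stability."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def s_curvature(s_mean, c_mean, S_mean, S_var, gamma_mean, gamma_var):
    """phi + chi^2 - psi from equation 4, evaluated at s_mean."""
    scale = [math.exp(2.0 * (gv - gm)) for gm, gv in zip(gamma_mean, gamma_var)]
    g_hat = [0.5 * (((s_mean - Sm) ** 2 + Sv) * p + math.log(2.0 * math.pi)) + gm
             for Sm, Sv, gm, p in zip(S_mean, S_var, gamma_mean, scale)]
    xi = softmax([c - g for c, g in zip(c_mean, g_hat)])
    slopes = [(s_mean - Sm) * p for Sm, p in zip(S_mean, scale)]
    phi = sum(x * p for x, p in zip(xi, scale))
    chi = sum(x * d for x, d in zip(xi, slopes))
    psi = sum(x * d * d for x, d in zip(xi, slopes))
    return phi + chi ** 2 - psi

# Hypothetical mixture: equal weights, means -2 and 2, unit variances.
params = dict(c_mean=[0.0, 0.0], S_mean=[-2.0, 2.0], S_var=[0.0, 0.0],
              gamma_mean=[0.0, 0.0], gamma_var=[0.0, 0.0])
valley = s_curvature(0.0, **params)   # midway between the two Gaussians
on_mode = s_curvature(2.0, **params)  # at the centre of one Gaussian
print(valley, on_mode)
```

In this configuration the curvature is negative midway between the Gaussians and positive at a mode, so the term in equation 4 would reward a larger $\tilde{s}_i(t)$ in the first case only.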

To avoid this problem, only the positive terms of equations 3 and 4 are included. The final approximation is thus equation 2 plus the positive terms of equations 3 and 4.
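Putting the pieces together, the final approximation for one source value might be sketched as follows. Several choices here are assumptions rather than statements of the text: the Gaussian parameterisation (mean $S_{ij}$, standard deviation $e^{\gamma_{ij}}$), the function and argument names, and in particular the reading of "positive terms" as clipping each summand of equation 3 and the whole of equation 4 at zero.

```python
import math

def log_sum_exp(values):
    """Numerically stable ln(sum_j exp(values[j]))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def softmax(values):
    lse = log_sum_exp(values)
    return [math.exp(v - lse) for v in values]

def approx_cost(s_mean, s_var, c_mean, c_var,
                S_mean, S_var, gamma_mean, gamma_var):
    """Equation 2 plus the positive parts of equations 3 and 4."""
    scale = [math.exp(2.0 * (gv - gm)) for gm, gv in zip(gamma_mean, gamma_var)]
    g_hat = [0.5 * (((s_mean - Sm) ** 2 + Sv) * p + math.log(2.0 * math.pi)) + gm
             for Sm, Sv, gm, p in zip(S_mean, S_var, gamma_mean, scale)]
    shifted = [c - g for c, g in zip(c_mean, g_hat)]
    zeta, xi = softmax(c_mean), softmax(shifted)
    # Equation 2: constant term of the expansion.
    eq2 = log_sum_exp(c_mean) - log_sum_exp(shifted)
    # Equation 3: one summand per mixture component j.
    eq3_terms = [0.5 * v * (z * (1.0 - z) - x * (1.0 - x))
                 for v, z, x in zip(c_var, zeta, xi)]
    # Equation 4: second-order term in s_i(t).
    slopes = [(s_mean - Sm) * p for Sm, p in zip(S_mean, scale)]
    phi = sum(x * p for x, p in zip(xi, scale))
    chi = sum(x * d for x, d in zip(xi, slopes))
    psi = sum(x * d * d for x, d in zip(xi, slopes))
    eq4 = 0.5 * s_var * (phi + chi ** 2 - psi)
    # Assumed interpretation: negative contributions are dropped.
    return eq2 + sum(max(0.0, t) for t in eq3_terms) + max(0.0, eq4)

# Single unit-variance Gaussian: the expansion is exact in this case.
single = approx_cost(1.0, 0.5, [0.0], [0.0], [0.0], [0.0], [0.0], [0.0])
# Valley between two Gaussians: the negative equation 4 term is dropped.
valley = approx_cost(0.0, 0.3, [0.0, 0.0], [0.0, 0.0],
                     [-2.0, 2.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
print(single, valley)
```

With a single Gaussian component the cost is quadratic in $s_i(t)$, so the result coincides with the exact expectation. In the valley case only the constant term of equation 2 survives.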


Harri Lappalainen
7/10/1998