
The prior of the parameters $\boldsymbol{\theta}$

Let us denote the elements of the weight matrices of the MLP networks by $\mathbf{A} = (A_{ij})$, $\mathbf{B} = (B_{ij})$, $\mathbf{C} = (C_{ij})$ and $\mathbf{D} = (D_{ij})$. The bias vectors similarly consist of the elements $\mathbf{a} = (a_i)$, $\mathbf{b} = (b_i)$, $\mathbf{c} = (c_i)$ and $\mathbf{d} = (d_i)$.

All the elements of the weight matrices and the bias vectors are assumed to be independent and Gaussian. Their priors are as follows:

$\displaystyle p(A_{ij}) = N(A_{ij};\; 0, 1)$   (5.27)
$\displaystyle p(B_{ij}) = N(B_{ij};\; 0, \exp(2 v_{B_j}))$   (5.28)
$\displaystyle p(a_i) = N(a_i;\; m_a, \exp(2 v_a))$   (5.29)
$\displaystyle p(b_i) = N(b_i;\; m_b, \exp(2 v_b))$   (5.30)
$\displaystyle p(C_{ij}) = N(C_{ij};\; 0, \exp(2 v_{C_i}))$   (5.31)
$\displaystyle p(D_{ij}) = N(D_{ij};\; 0, \exp(2 v_{D_j}))$   (5.32)
$\displaystyle p(c_i) = N(c_i;\; m_c, \exp(2 v_c))$   (5.33)
$\displaystyle p(d_i) = N(d_i;\; m_d, \exp(2 v_d)).$   (5.34)

These distributions should again be written conditional on the corresponding hyperparameters, but the conditioning variables have been omitted here to keep the notation simpler.

Each of the bias vectors has a hierarchical prior that is shared among the elements of that particular vector. The hyperparameters $m_a$, $m_b$, $m_c$, $m_d$, $v_a$, $v_b$, $v_c$ and $v_d$ all have zero-mean Gaussian priors with standard deviation 100, which is a flat, essentially noninformative prior.
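Since the variance is written as $\exp(2v)$, the parameter $v$ is the logarithm of the corresponding standard deviation. As a minimal illustrative sketch (not part of the model implementation; the function and variable names are hypothetical), the log-prior of one bias vector and its hyperparameters could be evaluated with plain NumPy as follows:

import numpy as np

def log_gaussian(x, mean, log_std):
    """Elementwise log-density of N(x; mean, exp(2*log_std))."""
    return -0.5 * (np.log(2.0 * np.pi) + 2.0 * log_std
                   + (x - mean) ** 2 * np.exp(-2.0 * log_std))

def log_prior_bias(a, m_a, v_a):
    """Log-prior of a bias vector a whose elements share the
    hyperparameters m_a (mean) and v_a (log standard deviation)."""
    # p(a_i) = N(a_i; m_a, exp(2 v_a)) for every element, cf. Eq. (5.29)
    log_p_elements = np.sum(log_gaussian(a, m_a, v_a))
    # Flat hyperpriors: m_a, v_a ~ N(0, 100^2), i.e. log-std = log(100)
    log_p_hyper = (log_gaussian(m_a, 0.0, np.log(100.0))
                   + log_gaussian(v_a, 0.0, np.log(100.0)))
    return log_p_elements + log_p_hyper

# Example: a hypothetical ten-dimensional bias vector with hyperparameters at zero.
print(log_prior_bias(np.zeros(10), m_a=0.0, v_a=0.0))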

The structure of the priors of the weight matrices is much more interesting. The prior of $\mathbf{A}$ is fixed in order to resolve a scaling indeterminacy between the hidden states $\mathbf{s}(t)$ and the weights of the MLP networks: as Equation (5.19) shows, any rescaling of one of them can be compensated for by the other without affecting the results in any way. The other weight matrices $\mathbf{B}$, $\mathbf{C}$ and $\mathbf{D}$ have zero-mean priors with a common variance for all the weights related to a single hidden neuron.
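As a concrete illustration, assuming that in Equation (5.19) the hidden states enter the mapping only through the product $\mathbf{A} \mathbf{s}(t)$, the likelihood is left unchanged if the states and the columns of $\mathbf{A}$ are rescaled in opposite directions:

$\displaystyle \mathbf{A}\, \mathbf{s}(t) = \left( \mathbf{A} \boldsymbol{\Lambda}^{-1} \right) \left( \boldsymbol{\Lambda}\, \mathbf{s}(t) \right)$ for any invertible diagonal matrix $\boldsymbol{\Lambda}$.

Fixing the prior of $\mathbf{A}$ to unit variance removes this freedom by pinning down the overall scale of the hidden states.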

The remaining variance parameters from the priors of the weight matrices and from Equations (5.23), (5.25) and (5.26) again have hierarchical priors defined as

$\displaystyle p(v_{B_j}) = N(v_{B_j};\; m_{v_B}, \exp(2 v_{v_B}))$   (5.35)
$\displaystyle p(v_{C_i}) = N(v_{C_i};\; m_{v_C}, \exp(2 v_{v_C}))$   (5.36)
$\displaystyle p(v_{D_j}) = N(v_{D_j};\; m_{v_D}, \exp(2 v_{v_D}))$   (5.37)
$\displaystyle p(v_{n_k}) = N(v_{n_k};\; m_{v_n}, \exp(2 v_{v_n}))$   (5.38)
$\displaystyle p(v_{m_k}) = N(v_{m_k};\; m_{v_m}, \exp(2 v_{v_m}))$   (5.39)
$\displaystyle p(v_{s^0_k}) = N(v_{s^0_k};\; m_{v_s^0}, \exp(2 v_{v_s^0})).$   (5.40)

The hyperparameters of these distributions again have zero-mean Gaussian priors with standard deviation 100.
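Continuing the illustrative sketch above (reusing the hypothetical log_gaussian helper), the corresponding three-level hierarchy for one weight matrix, say $\mathbf{B}$ with one variance parameter per hidden neuron as in Equations (5.28) and (5.35), could be evaluated as:

def log_prior_weights_B(B, v_B, m_vB, v_vB):
    """Log-prior of the weight matrix B with one log-std v_B[j] per
    column (hidden neuron), cf. Eqs. (5.28) and (5.35)."""
    # p(B_ij) = N(B_ij; 0, exp(2 v_B[j])): broadcast v_B across the rows
    log_p_B = np.sum(log_gaussian(B, 0.0, v_B[np.newaxis, :]))
    # p(v_B[j]) = N(v_B[j]; m_vB, exp(2 v_vB))
    log_p_vB = np.sum(log_gaussian(v_B, m_vB, v_vB))
    # Flat hyperpriors on m_vB and v_vB: N(0, 100^2)
    log_p_top = (log_gaussian(m_vB, 0.0, np.log(100.0))
                 + log_gaussian(v_vB, 0.0, np.log(100.0)))
    return log_p_B + log_p_vB + log_p_top

# Example with a hypothetical 5 x 3 weight matrix and three hidden neurons.
rng = np.random.default_rng(0)
B = rng.normal(size=(5, 3))
print(log_prior_weights_B(B, v_B=np.zeros(3), m_vB=0.0, v_vB=0.0))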

