
Feedforward equations

We shall consider a standard MLP network with input, hidden and output layers. In order to write the feedforward equations in a compact form, we assign a unique index to every parameter and every neuron output. In our notation, $\xi_i$ can denote either the value of a parameter or the output of a neuron; that is, the $\xi_i$ denote any value that can be an input to a neuron. The set of indices of the parameters is denoted by $\mathcal{P}$ and the transfer functions of the neurons are denoted by $f_i$. The values $\xi_i$ are defined by equation 4.
 \begin{displaymath}
 \xi_i(\boldsymbol{\hat{\theta}}, t) \stackrel{\mathit{def}}{=} \left\{ \begin{array}{ll}
 \hat{\theta}_i & \mbox{$i \in \mathcal{P}$} \\
 I_i(t) & \mbox{input neurons} \\
 f_i(\xi_j \vert j \in \mathcal{J}_i) & \mbox{other neurons}
 \end{array} \right.
 \end{displaymath} (4)
Input and output data are parametrised by time t. We have used a shorthand notation for the arguments of a function: for example, $f(\xi_j \vert j \in \{2, 4, 5\}) = f(\xi_2, \xi_4, \xi_5)$. The set $\mathcal{J}_i$ thus contains the indices of the inputs of neuron i or, equivalently, the indices of the arguments of the function $f_i$. The network is assumed to be strictly feedforward, which means that $j \in \mathcal{J}_i$ implies j < i.
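
As a concrete illustration of equation 4, the following sketch evaluates all $\xi_i$ at a single time step t with one pass over the indices in increasing order. The container names and the function signature are hypothetical; the text does not fix an implementation.

\begin{verbatim}
def evaluate_xi(theta_hat, inputs_t, P, input_indices, J, f, n):
    """A minimal sketch of equation (4): compute xi_i for i = 0..n-1 at one time step t.

    theta_hat     -- dict: parameter index i in P -> estimated parameter value
    inputs_t      -- dict: input-neuron index i -> input datum at time t
    P             -- set of parameter indices
    input_indices -- set of input-neuron indices
    J             -- dict: neuron index i -> list of input indices J_i (strictly j < i)
    f             -- dict: neuron index i -> transfer function f_i
    """
    xi = {}
    for i in range(n):                  # strict feedforward ordering: xi[j] already computed
        if i in P:
            xi[i] = theta_hat[i]        # xi_i is a parameter value
        elif i in input_indices:
            xi[i] = inputs_t[i]         # xi_i is an input datum at time t
        else:
            xi[i] = f[i](*(xi[j] for j in J[i]))   # xi_i = f_i(xi_j | j in J_i)
    return xi
\end{verbatim}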

For hidden and output neurons the transfer functions $f_i$ are the same as in any conventional neural network: they can be sums of inputs multiplied by weights, sigmoids, radial basis functions, etc.
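
For example, a weighted-sum transfer function and a logistic sigmoid could serve as the $f_i$ in the sketch above. The convention of interleaving weight parameters and activations in the argument list is an assumption made for this sketch, not something fixed by the text.

\begin{verbatim}
import math

def weighted_sum(*xi_args):
    """An f_i whose arguments alternate weight, activation, weight, activation, ..."""
    weights = xi_args[0::2]
    activations = xi_args[1::2]
    return sum(w * x for w, x in zip(weights, activations))

def sigmoid(s):
    """An f_i applying the logistic sigmoid to a single summed input."""
    return 1.0 / (1.0 + math.exp(-s))
\end{verbatim}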

The cost function for supervised learning is L(MS) + L(DS | MS), as explained in section 1.1. The description lengths are computed according to equation 1, with the exception that the terms $-\ln \epsilon$ are omitted from L(DS | MS). For each $D_i$ we assign a function $f_j$ which is used to compute the term $-\ln p(D_i)$. The set of indices of these functions is denoted by $\mathcal{L_D}$. Similarly, the set $\mathcal{L_P}$ comprises the indices of the functions $f_j$ which evaluate the terms $-\ln p(\theta_i)$. We can now write down the cost function in terms of $\xi_i$ and $\epsilon_{\theta_i}$.
 \begin{displaymath}
 L(\boldsymbol{\hat{\theta}}, \boldsymbol{\epsilon_{\boldsymbol{\theta}}}) =
 \sum_{i \in \mathcal{L_P}} \xi_i -
 \sum_{i \in \mathcal{P}} \ln \epsilon_{\theta_i} +
 \sum_{t=1}^N \sum_{i \in \mathcal{L_D}}\xi_i(t)
 \end{displaymath} (5)
The first two terms correspond to L(MS) and the third term to L(DS | MS).
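
The following sketch evaluates the cost of equation 5 from the $\xi_i$ values returned by the evaluation loop above. It assumes one dictionary of $\xi$ values per time step; all names are again hypothetical.

\begin{verbatim}
import math

def description_length(xi_per_t, epsilon_theta, L_P, L_D, P):
    """A minimal sketch of equation (5).

    xi_per_t      -- list of dicts, the xi values for t = 1..N
                     (the xi_i with i in L_P do not depend on t, so the first step supplies them)
    epsilon_theta -- dict: parameter index i in P -> accuracy epsilon_theta_i
    L_P, L_D, P   -- index sets as in the text
    """
    xi_first = xi_per_t[0]
    parameter_terms = sum(xi_first[i] for i in L_P)               # the -ln p(theta_i) terms
    accuracy_terms = -sum(math.log(epsilon_theta[i]) for i in P)  # the -ln epsilon_theta_i terms
    data_terms = sum(xi_t[i] for xi_t in xi_per_t for i in L_D)   # sum over t of the -ln p(D_i) terms
    return parameter_terms + accuracy_terms + data_terms          # L(MS) terms plus L(DS | MS)
\end{verbatim}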

If a parameter $\theta_i$ does not have an associated neuron in the set $\mathcal{L_P}$, we tacitly assume the probability distribution $p(\theta_i)$ to be constant throughout the range of values of $\theta_i$, that is, we assume the value of the parameter to be uniformly distributed. It should be kept in mind that although the constant term $-\ln p(\theta_i)$ can be omitted when adapting the parameters and their accuracies, it still has to be taken into account when models with different parametrisations are compared.


  
Figure 2: The structure of an MLP network with an MDL-based cost function is shown schematically. The layers below the dotted line are the same as in a conventional MLP: input, hidden and output layers. The functions above the dotted line are used to compute the cost function L.
[Figure: mdl_mlp_str.eps]

The structure of the network is shown in figure 2. Desired outputs are marked by D, input neurons by I, and other neurons by f. The parameters of the network are not shown. The functions f above the dotted line are the ones used to compute the description length of the parameters and the data.


Harri Lappalainen
5/19/1998