Next: Computing the expectation of Up: Using an MDL-based cost Previous: MDL-based cost functions for

Measuring the description length

Typically neural networks have relatively simple structure but many parameters, and it is usually easy to design a satisfactory coding for the structure. We shall therefore concentrate on measuring the description length of the parameters of the model.

If the transmitted data consists of discrete elements with known probabilities, then the length needed in transmission can be computed. It is well-known that if the probability of a data element D is P(D), then the transmission of the element D using the optimal code takes $-\log_2 P(D)$ bits [9,8]. We shall hereafter measure information in base e instead of base 2: $-\log_2 P(D)$ bits equals $-\ln P(D)$ nats, where 1 nat equals $\log_2 e$ bits.

The output data and the parameters in the network have continuous values instead of discrete. Communicating a random continuous value at an infinite precision would require an infinite number of bits, and therefore the values have to be discretised. If the discretisation is done with precision $\epsilon_x$ , then the range of values is divided into small bins with size $\epsilon_x$ . Suppose we know the probability distribution p(x). If $\epsilon_x$ is small enough, p(x) is approximately linear throughout each bin and the probability of each bin is approximately $p(x)\epsilon_x$ , which means that the description length L of the value x is

$\begin{displaymath} L(x) = -\ln p(x)\epsilon_x = -\ln p(x) -\ln \epsilon_x .\end{displaymath}$ (1)

Equation 1 shows that in order to compute the description length, each continuos value must be assigned a probability distribution and an accuracy.

The model is described by its structure, its parameters $\boldsymbol{\theta}= [\theta_1, \theta_2, \dots]$ and their accuracies $\epsilon_{\boldsymbol{\theta}}$ . We shall consider a coding scheme where the model is used to assign a probability distribution p(D) for each data element $D \in \mathcal{D}$ . Some of the parameters may also be hyper parameters which are used to assign probability distribution $p_{\theta_i}$ for other parameters. The data is discretised with accuracy $\epsilon_{\mathcal{D}}$ , but that can be considered an integral part of the coding scheme. This is because the effect of $\epsilon_{\mathcal{D}}$ does not depend on the model: it always adds a constant $-N\ln \epsilon_{\mathcal{D}}$ term in the description length, and can therefore be omitted (N is the number of data elements). The accuracy of the parameters ( $\epsilon_{\boldsymbol{\theta}}$ ) does depend on the model, however, and must be taken into account.

In the literature, several different ways to measure the description length have been proposed. Our approach dates back to Wallace and Boulton [10] and similar approaches have been used in [4,11]. Alternative approaches have been presented e.g. in [6,7].

Suppose the target value for the parameter vector is $\boldsymbol{\theta}$ . Let us denote the actual discretised value by $\boldsymbol{\hat{\theta}}$ and the corresponding round-off error by $\boldsymbol{\tilde{\theta}} = \boldsymbol{\hat{\theta}}- \boldsymbol{\theta}$ . The errors depend on the locations of the borders between the bins, but there is no reason to favour one discretisation over the others, and therefore we should compute the expectation of the description length over the distribution of $\boldsymbol{\tilde{\theta}}$ ,which is approximately even in the range $[-\epsilon_{\boldsymbol{\theta}}/2, \epsilon_{\boldsymbol{\theta}}/2]$ . We are also going to assume that the errors are uncorrelated, that is, $E\{\tilde{\theta}_i\tilde{\theta}_j\} = 0$ .

The expected description length can be approximated by utilising the second order Taylor's series expansion of $L(\boldsymbol{\hat{\theta}})$ about $\boldsymbol{\theta}$ .

$\begin{displaymath} L(\boldsymbol{\hat{\theta}}) \approx L(\boldsymbol{\theta}) ... ...}_i \partial \hat{\theta}_j} \tilde{\theta}_i \tilde{\theta}_j\end{displaymath}$ (2)

Here the derivatives of L are evaluated at $\boldsymbol{\theta}$ .Taking expectation over $\boldsymbol{\tilde{\theta}}$ yields the cost function:

$\begin{displaymath} L_E(\boldsymbol{\theta}, \epsilon_{\boldsymbol{\theta}}) \st... ...ial^2 L}{\partial {\hat{\theta}_i}^2} E\{\tilde{\theta}_i^2\}.\end{displaymath}$ (3)

Here we have used $E\{\tilde{\theta}_i\} = 0$ for all i and $E\{\tilde{\theta}_i\tilde{\theta}_j\} = 0$ for all $i \neq j$ . The optimal parameters $\boldsymbol{\theta}$ and their accuracies $\epsilon_{\boldsymbol{\theta}}$ can be solved by minimising the expected description length L_E.

Since the cost function includes second order derivatives of L with respect to the parameters, it might be troublesome to use it with a large neural network. First the outputs have to be computed in a feedforward phase as usual. Approximations of the second order derivatives needed in the cost function can be computed in a process analogous to backpropagation. It involves two feedback phases: the first one is used to compute the first order derivatives and the second one to compute the second order derivatives. It is then laborious to use backpropagation to compute the gradient, because it has to be applied to all three phases, resulting in a total of six phases.

Fortunately we can overcome these difficulties by recognising that it is not the second order derivatives we wish to evaluate but the expectation over $\boldsymbol{\tilde{\theta}}$ .

Next: Computing the expectation of Up: Using an MDL-based cost Previous: MDL-based cost functions for

Harri Lappalainen
5/19/1998