If we neglect the effect of finite accuracy on the description length
and assume an infinite precision, then the parameter values that
minimise *L* are the ones that maximise the posterior probability of
the parameters, that is, the maximum a posteriori estimate. They are
not necessarily the values that minimise *L* when finite accuracy is
taken into account, however. Taking the expectation
effectively measures and penalises the
sensitivity of *L* to the values of the parameters, thus finding a
flatter minimum of *L*, which corresponds to more probability mass of
the posterior density.
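
The effect of taking the expectation can be illustrated with a minimal sketch. It is not the cost function of this paper: it assumes two hypothetical one-dimensional quadratic costs, `sharp` and `flat`, that share the same minimum value at zero, and it assumes the finite accuracy of the parameter can be modelled as Gaussian uncertainty with standard deviation `std`. The helper `expected_cost` and all names are illustrative only. For a quadratic cost with curvature *h*, the expectation exceeds the minimum by ½ *h* σ², so the sharper minimum is penalised more.

```python
import math


def expected_cost(cost, mean, std, n=2001, width=8.0):
    """Approximate E[cost(w)] for w ~ N(mean, std^2) by a normalised grid sum."""
    lo = mean - width * std
    step = 2.0 * width * std / (n - 1)
    total = 0.0
    norm = 0.0
    for i in range(n):
        w = lo + i * step
        # Unnormalised Gaussian weight; dividing by `norm` normalises it.
        p = math.exp(-0.5 * ((w - mean) / std) ** 2)
        total += p * cost(w)
        norm += p
    return total / norm


def sharp(w):
    # Hypothetical cost with high curvature (h = 100) around its minimum.
    return 0.5 * 100.0 * w ** 2


def flat(w):
    # Hypothetical cost with low curvature (h = 1) around its minimum.
    return 0.5 * 1.0 * w ** 2


std = 0.1  # assumed accuracy of the parameter

# Both costs have the same value (zero) at the minimum, so a point
# estimate cannot tell them apart; the expectation can.
penalty_sharp = expected_cost(sharp, 0.0, std) - sharp(0.0)
penalty_flat = expected_cost(flat, 0.0, std) - flat(0.0)

print("penalty at sharp minimum:", penalty_sharp)
print("penalty at flat minimum: ", penalty_flat)
```

Under these assumptions the two minima are indistinguishable by their minimum value alone, while the expected cost clearly prefers the flatter one.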

Using such MDL-based arguments, Hochreiter and Schmidhuber have arrived at a very similar algorithm, which they call flat minimum search (FMS) [2]. Because the penalty for the complexity of the model is so similar, the promising results they have achieved should be reproducible with our method. Indeed, in preliminary simulations, we have been able to reproduce experiment 1 in [2]: noisy classification.

Although the cost function in FMS is very similar to ours, it does not define a description length, and it is thus difficult to include measures of the complexity of the structure of the network. The computation of our cost function appears to be simpler, but, on the other hand, FMS does not assume independence of the parameters of the functions in the network. In FMS the user has to supply an extra parameter, a tolerable error, which regulates the trade-off between the description length of the parameters and that of the data. In our approach, the optimal accuracies for the parameters are estimated directly from the data.

MDL has been used with neural networks to find representations in an unsupervised manner in [1,12,13]. The main focus there has been on discrete-valued features, such as the indices of vectors in vector quantisation. This paper concentrates on the coding of real-valued parameters and features, and thus complements the previous work.