There is a close connection between coding and the probabilistic
framework. The optimal code length for the data $X$ is $-\log_2 P(X)$
bits, where $P(X)$ is the probability of observing $X$. In order to
encode the data compactly, one therefore has to find a good model for
it. Suppose a sender and a receiver have agreed on a model structure.
The message then consists of two parts: the parameters
$\boldsymbol{\theta}$ and the data $X$. It can be shown [25] that the
expected length of the message is
$$\mathrm{E}\{L(\boldsymbol{\theta}, X)\} = C - \log_2 dX,$$
where $dX$ is the required accuracy of the
data. Therefore minimising the cost function $C$ of ensemble learning
is equivalent to minimising the code length of the data.
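As a concrete illustration (not from the text), the identity can be checked numerically for a discrete toy model; all parameter values and data below are assumptions made for the sketch. For discrete data the accuracy term drops out ($dX = 1$), and choosing the approximating ensemble $q(\boldsymbol{\theta})$ equal to the true posterior makes the cost $C$ coincide exactly with the optimal code length $-\log_2 P(X)$, while any other $q$ gives a longer code:

```python
import math

# Toy model: i.i.d. binary data, three candidate Bernoulli parameters
# with a uniform prior (all values here are illustrative assumptions).
thetas = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]
X = [1, 1, 0, 1, 1, 1, 0, 1]  # observed binary sequence

def lik(theta, data):
    """P(X | theta) for i.i.d. Bernoulli observations."""
    p = 1.0
    for x in data:
        p *= theta if x == 1 else 1 - theta
    return p

# Optimal code length: -log2 P(X), with P(X) = sum_theta P(theta) P(X|theta).
pX = sum(pr * lik(th, X) for pr, th in zip(prior, thetas))
opt_len = -math.log2(pX)

def cost(q):
    """Ensemble cost C = sum_theta q(theta) log2 [q(theta) / (P(theta) P(X|theta))]."""
    return sum(qi * math.log2(qi / (pr * lik(th, X)))
               for qi, pr, th in zip(q, prior, thetas) if qi > 0)

# With q equal to the true posterior, C equals the optimal code length;
# with any other q (e.g. the uniform distribution), C is at least as large.
posterior = [pr * lik(th, X) / pX for pr, th in zip(prior, thetas)]
print(opt_len)                  # optimal code length -log2 P(X)
print(cost(posterior))          # equals opt_len
print(cost([1 / 3, 1 / 3, 1 / 3]))  # >= opt_len
```

The gap between $C$ and $-\log_2 P(X)$ is the Kullback-Leibler divergence between $q$ and the posterior, which is why minimising $C$ drives the code length towards the optimum.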