Learning proceeded in batches consisting of two nested loops. The
outer loop went through all the data vectors. For each data vector,
the posterior distribution *Q* of the latent variables was adapted in
an inner loop of 15 iterations. The distributions of the remaining
model parameters were updated at the end of each batch. The whole
learning consisted of 200 batches.
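The batch structure described above can be sketched as follows. This is an illustrative skeleton only: the function names, the posterior representation as means and variances, and the placeholder update rule are assumptions, since the actual variational update equations are not given in this passage.

```python
import numpy as np

def train(data, n_batches=200, n_inner=15, rng=None):
    """Sketch of the batch learning loop: an outer pass over the data
    vectors, an inner loop of 15 iterations adapting the posterior Q of
    the latent variables, and a per-batch update of the other parameters."""
    rng = rng or np.random.default_rng(0)
    n_vectors, dim = data.shape
    # Hypothetical posterior state: means and variances of the latent
    # variables (one latent vector per data vector) and of the weights.
    q_latent = {"mean": rng.normal(size=(n_vectors, dim)),
                "var": np.ones((n_vectors, dim))}
    params = {"mean": rng.normal(size=(dim, dim)),
              "var": np.ones((dim, dim))}

    def adapt_latents(x, mean, var):
        # Placeholder step; a real implementation would minimise the
        # variational cost with respect to Q for this data vector.
        return 0.9 * mean + 0.1 * x, var

    for batch in range(n_batches):
        for i in range(n_vectors):          # outer loop: all data vectors
            for _ in range(n_inner):        # inner loop of length 15
                q_latent["mean"][i], q_latent["var"][i] = adapt_latents(
                    data[i], q_latent["mean"][i], q_latent["var"][i])
        # Distributions of the remaining model parameters are updated
        # only once per batch (placeholder shrinkage stands in for the
        # real update).
        params["mean"] = 0.99 * params["mean"]
    return q_latent, params
```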

In the beginning, only the first, linear layer was generated. The second, nonlinear layer was generated after 20 batches, when the first layer had already found a rough representation of the data. To encourage the growth of the second layer, the standard deviations of the latent variables of the first layer were reduced by a factor of three after each batch for 10 batches, starting from the creation of the second layer. If this phase is left out, the network easily gets stuck in a local minimum where the training wheels, the latent variables of the first layer, represent the data while the second layer remains silent.
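The decay schedule for the first-layer latent variables can be written as a single multiplier. The function below is a sketch under the stated numbers (second layer created at batch 20, factor-of-three reduction for 10 batches); the argument names are illustrative.

```python
def latent_std_schedule(batch, layer2_created_at=20,
                        decay_batches=10, factor=3.0):
    """Multiplier applied to the standard deviations of the first-layer
    latent variables: reduced by a factor of three after each of the
    10 batches following the creation of the second layer, then held
    constant."""
    steps = min(max(batch - layer2_created_at, 0), decay_batches)
    return factor ** (-steps)
```

Before batch 20 the multiplier is 1; it then shrinks geometrically and stays at its final value once the 10-batch phase ends.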

The posterior variance of a parameter carries information about how certain the network is of its value. This gives the network the very useful property of being able to effectively prune away useless weights by increasing their posterior variances, thus decreasing the complexity of the network.
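One way to read off which weights have been pruned in this sense is to check whether a weight's posterior has collapsed back toward its prior, so that the weight no longer carries information. The function and thresholds below are illustrative assumptions, not part of the original method.

```python
import numpy as np

def pruned_mask(post_mean, post_var, prior_var=1.0, ratio=0.9):
    """A weight counts as effectively pruned when its posterior is close
    to the prior: variance near the prior variance and mean near zero.
    Both thresholds (`ratio` and the mean tolerance) are hypothetical."""
    near_prior_var = post_var / prior_var > ratio
    near_zero_mean = np.abs(post_mean) < 0.1 * np.sqrt(prior_var)
    return near_prior_var & near_zero_mean
```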

Some care has to be taken at the beginning of learning, since the network might otherwise get stuck in unwanted local minima. When the weights of the network are random and the network cannot yet find any structure in the data, the weights can get prematurely pruned away. To prevent this, the posterior variances of the weights and the latent variables were bounded for the first 50 batches.
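Bounding the posterior variances during the first 50 batches amounts to a simple clamp. A minimal sketch, assuming an upper bound applied elementwise (the bound value itself is an assumption; the text only states that the variances were bounded):

```python
import numpy as np

def bound_variances(var, batch, bound=1.0, n_bounded_batches=50):
    """Clamp posterior variances from above during the first 50 batches
    so that random initial weights cannot be prematurely pruned away.
    The value of `bound` is a hypothetical choice."""
    if batch < n_bounded_batches:
        return np.minimum(var, bound)
    return var
```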