A great deal of work remains to be done in developing the methods discussed in this thesis. The basic MLP structure can be extended in many ways. For instance, bilinear neurons that compute weighted products of two inputs could be useful. In some cases, there is strong reason to believe that the mapping from factors to observations includes functions of this type, but representing such mappings with neurons that compute simple weighted sums is difficult, albeit not impossible.
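As an illustration of the bilinear neuron mentioned above, the following minimal sketch computes a weighted sum of pairwise input products. The function name `bilinear_neuron`, the weight matrix `W` and the bias `b` are assumptions made for this example and do not appear in the methods of this thesis.

```python
import numpy as np

def bilinear_neuron(x, W, b=0.0):
    """Hypothetical bilinear neuron: sum over i, j of W[i, j] * x[i] * x[j], plus a bias.

    W[i, j] is the weight given to the product of inputs x[i] and x[j].
    """
    return float(x @ W @ x + b)

# Two inputs; the only non-zero weight multiplies the product x[0] * x[1].
x = np.array([2.0, 3.0])
W = np.array([[0.0, 1.0],
              [0.0, 0.0]])
product = bilinear_neuron(x, W)  # 1.0 * 2.0 * 3.0
```

A neuron of this kind represents the product of two inputs exactly with a single weight, whereas a network of weighted-sum neurons can only approximate it.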

One of the strong practical advantages of the Bayesian framework for learning is that different models and algorithms are easy to combine. An example of this was seen in the treatment of a non-Gaussian factor distribution, where the method used in publication II was replaced by the one borrowed from [3]. One aspect which could clearly be improved is to take into account the posterior dependences between the factors. The simpler approximation could still be used to provide a good initial guess.

This thesis concentrates on real-valued representations, but the extensive research conducted on discrete-valued representations and observations can also be utilised, because within the Bayesian framework it is straightforward to combine different models. Likely candidate models include belief networks [101], sigmoid belief networks [91,113], hidden Markov models [83], switching state-space models [30] and mixture models [89].

Missing observations pose no problem in the Bayesian framework, as they can be treated like any other unknown parameter of the model. This enables unsupervised learning to be used for tasks similar to those of supervised learning, but without the requirement to prespecify which of the observations are inputs and which are outputs.
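The handling of missing observations can be illustrated with a much simplified sketch. It assumes a linear mapping `A` from factors `S` to observations `X` and uses point estimates only, so it is not the full Bayesian treatment, in which the missing values would receive posterior distributions like any other unknown. Missing entries are marked with NaN; the names `masked_squared_error` and `fill_missing` are hypothetical.

```python
import numpy as np

def masked_squared_error(X, A, S):
    """Sum of squared reconstruction errors over the observed entries only."""
    observed = ~np.isnan(X)
    # Missing entries contribute nothing to the error term.
    residual = np.where(observed, X - A @ S, 0.0)
    return float(np.sum(residual ** 2))

def fill_missing(X, A, S):
    """Replace missing entries by the model's reconstruction A @ S."""
    observed = ~np.isnan(X)
    return np.where(observed, X, A @ S)

A = np.array([[1.0], [2.0]])      # assumed mapping: 1 factor -> 2 observations
S = np.array([[1.0, 2.0]])        # assumed factor values for 2 samples
X = np.array([[1.0, np.nan],      # one observation is missing
              [2.5, 4.0]])

error = masked_squared_error(X, A, S)
X_filled = fill_missing(X, A, S)
```

Because no entry is designated as an input or an output in advance, the same model can reconstruct whichever observations happen to be missing.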

In many large problems, the prior knowledge at hand suggests a modular structure for the MLP network, and this structure can be taken into account. It should also be easy to develop automatic procedures for pruning and model selection, because the cost function in ensemble learning can be used reliably for comparing models. This should be useful when learning large models, as well as in learning procedures where layers of neurons are added to the network one by one.

Factors governing the variance of other factors seem likely candidates for building blocks of mappings which are computationally efficient to learn but representationally powerful. These models are inspired by the properties of complex cells found in the visual area V1 (see, e.g., [62]), whose behaviour appears to match this function well. Models with factors resembling complex cells have been proposed, for example, in [66,14,29,54,53].
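One possible reading of a factor governing the variance of another factor is sketched below as a generative step. The parameterisation, in which a factor `u` sets the variance of a second factor `s` to exp(u), is an assumption chosen for illustration, as is the name `modulated_source`; the models cited above may use different forms.

```python
import numpy as np

def modulated_source(u, eps):
    """Scale unit-variance noise eps so that its variance given u is exp(u)."""
    return np.exp(0.5 * u) * eps

rng = np.random.default_rng(0)
n = 1000
u = rng.standard_normal(n)      # variance-governing factor (assumed Gaussian)
eps = rng.standard_normal(n)    # unit-variance Gaussian noise
s = modulated_source(u, eps)    # factor whose variance is governed by u
```

Given `u`, the factor `s` is Gaussian, but its marginal distribution is heavier-tailed than a Gaussian, because samples with large `u` are drawn with a larger variance.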

In order to match human capabilities, the models will need to represent objects and the relations between them. Not enough is known about how these are represented in the biological brain for that knowledge to be utilised directly in artificial neural network models. The work done in traditional artificial intelligence research (see, e.g., [112]) can, however, give good starting points.

One of the most important application areas for the methods developed in this thesis, and a fruitful source of new ideas, will probably be adaptive process control, because many processes have natural representations in terms of real-valued state-spaces and the controlled signals are also often analogue. Learning models of the environment from observations is also one of the most demanding problems in reinforcement learning, where an autonomous agent tries to make decisions based on external rewards (see, e.g., [124]).