In the above model, the parameters need to be assigned prior probabilities, and in a more general situation a prior probability for the model structure would typically also be needed.
In general, prior probabilities for variables should reflect one's beliefs about those variables. As people may have difficulty articulating such beliefs explicitly, some rules of thumb have been developed.
A lot of research has been conducted on uninformative priors. The term means that, in some sense, the prior gives as little information as possible about the value of the parameter and is therefore a good reference prior, although with complex models it is usually impractical to compute its exact form.
Roughly speaking, an uninformative prior can be defined as a prior that is uniform in a parametrisation where moving a given distance in parameter space always corresponds to a similar change in the probability distribution the model defines.
A simple example is provided by a Gaussian distribution parametrised by mean $\mu$ and variance $\sigma^2$. Doubling the variance always results in a qualitatively similar change in the distribution. Similarly, taking a step of size $\sigma$ in the mean always corresponds to a similar change in the distribution.
Reparametrisation by $u = \mu/\sigma$ and $v = \ln \sigma$ will give a parametrisation where equal changes in the parameters correspond to equal changes in the distribution. A uniform prior on $u$ and $v$ would correspond to the prior $p(\mu, \sigma) \propto 1/\sigma^2$ in the original parameter space. If there is additional knowledge that $\mu$ should be independent of $\sigma$, then $\mu$ and $v$ give the needed parameters and the prior is $p(\mu) \propto 1$ and $p(\sigma) \propto 1/\sigma$.
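This can be checked with the change-of-variables formula (a short verification, assuming the reparametrisation $u = \mu/\sigma$, $v = \ln \sigma$ given above):
\[
p(\mu,\sigma) \;=\; p(u,v)\,\left|\det\frac{\partial(u,v)}{\partial(\mu,\sigma)}\right|
\;=\; p(u,v)\,\left|\det\begin{pmatrix} 1/\sigma & -\mu/\sigma^{2} \\ 0 & 1/\sigma \end{pmatrix}\right|
\;\propto\; \frac{1}{\sigma^{2}} .
\]
Keeping $\mu$ itself instead of $u$ leaves only the factor $|\partial v/\partial\sigma| = 1/\sigma$, which gives $p(\mu) \propto 1$ and $p(\sigma) \propto 1/\sigma$ as stated.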
None of the above uninformative priors can actually be used because they are improper, meaning that they are not normalisable. This can easily be seen by considering a uniform distribution between $-\infty$ and $\infty$. These priors are nevertheless good references and hint at useful parametrisations for models.
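To spell out the improperness: a uniform density on the whole real axis would have to equal some constant $c > 0$, but
\[
\int_{-\infty}^{\infty} c \, d\mu = \infty
\]
for every $c > 0$, so no choice of $c$ makes the density integrate to one.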
Often it is possible to utilise the fact that the model has a set of parameters which play a similar role. It is, for instance, reasonable to assume that all biases in an MLP network have a similar distribution. This knowledge can be utilised by modelling these parameters with a common parametrised distribution. A prior then needs to be determined for the parameters of this common distribution, called hyperparameters. Since the hyperparameters control the distribution of a whole set of parameters, there should be fewer hyperparameters than parameters. The process can be iterated until all the structural knowledge has been used. In the end there are usually only a few priors left to determine, and since there is typically a lot of data, these priors usually have little significance for the learning process.
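As an illustration, a hierarchical prior of this kind could be sampled as follows (a minimal sketch; the Gaussian forms, scales and layer size are illustrative assumptions, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hyperparameters: one mean and one log-std shared by all biases.
# Their own priors are broad Gaussians (an illustrative assumption);
# working with the log-std makes the prior roughly uniform in ln(sigma).
m = rng.normal(0.0, 10.0)      # common mean of the biases
log_s = rng.normal(0.0, 2.0)   # common log standard deviation

# Parameters: all 30 biases of a hidden layer share the same
# parametrised distribution controlled by the two hyperparameters.
biases = rng.normal(m, np.exp(log_s), size=30)

# Only the two hyperpriors had to be fixed by hand, instead of
# choosing 30 separate priors for the individual biases.
print(f"m = {m:.2f}, sigma = {np.exp(log_s):.2f}")
print(biases[:5])
```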
For some it may be helpful to think about the prior in terms of coding. By using the formula $L(x) = -\log_2 P(x)$, any probabilities can be translated into code lengths. In coding terms, the prior means the aspects of the encoding which the sender and the receiver have agreed upon prior to the transmission of data.
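As a small numerical illustration of this translation (the probabilities below are made up):

```python
import math

# Shannon code length in bits: L(x) = -log2 P(x).
# Probable values get short codes, improbable values long codes.
for p in (0.5, 0.25, 0.01):
    print(f"P(x) = {p:<5} ->  L(x) = {-math.log2(p):5.2f} bits")
```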