In the above model, the parameters need to be assigned prior probabilities, and in a more general situation a prior probability for the model structure would typically also be needed.
In general, prior probabilities for variables should reflect one's beliefs about those variables. As people may have difficulty articulating such beliefs explicitly, some rules of thumb have been developed.
A lot of research has been conducted on uninformative priors. The term means that, in some sense, the prior gives as little information as possible about the value of the parameter and is therefore a good reference prior, although with complex models it is usually impractical to compute its exact form.
Roughly speaking, an uninformative prior can be defined as a prior that is uniform in a parametrisation where moving a given distance in parameter space always corresponds to a similar change in the probability distribution the model defines.
A simple example is provided by a Gaussian distribution parametrised by mean $\mu$ and variance $\sigma^2$. Doubling the variance always results in a qualitatively similar change in the distribution. Similarly, taking a step of size $\sigma$ in the mean always corresponds to a similar change in the distribution.
Reparametrisation by $u = \mu/\sigma$ and $v = \ln \sigma$ will give a parametrisation where equal changes in the parameters correspond to equal changes in the distribution. A uniform prior on $u$ and $v$ would correspond to the prior $p(\mu, \sigma) \propto 1/\sigma^2$ in the original parameter space. If there is additional knowledge that $\mu$ should be independent of $\sigma$, then $\mu$ and $v$ give the needed parameters and the prior is $p(\mu) \propto 1$ and $p(\sigma) \propto 1/\sigma$.
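This can be checked with the change-of-variables formula (a short verification, assuming the reparametrisation $u = \mu/\sigma$, $v = \ln \sigma$ given above):
\[
p(\mu,\sigma) \;=\; p(u,v)\,\left|\det\frac{\partial(u,v)}{\partial(\mu,\sigma)}\right|
\;=\; p(u,v)\,\left|\det\begin{pmatrix} 1/\sigma & -\mu/\sigma^{2} \\ 0 & 1/\sigma \end{pmatrix}\right|
\;\propto\; \frac{1}{\sigma^{2}} .
\]
Keeping $\mu$ itself instead of $u$ leaves only the factor $|\partial v/\partial\sigma| = 1/\sigma$, which gives $p(\mu) \propto 1$ and $p(\sigma) \propto 1/\sigma$ as stated.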
None of the above uninformative priors can actually be used because they are improper, meaning that they are not normalisable. This can easily be seen by considering a uniform distribution between $-\infty$ and $\infty$. These priors are nevertheless good references and hint at useful parametrisations for models.
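To spell out the improperness: a uniform density on the whole real axis would have to equal some constant $c > 0$, but
\[
\int_{-\infty}^{\infty} c \, d\mu = \infty
\]
for every $c > 0$, so no choice of $c$ makes the density integrate to one.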
Often it is possible to utilise the fact that the model has a set of parameters which play a similar role. It is, for instance, reasonable to assume that all biases in an MLP network have a similar distribution. This knowledge can be utilised by modelling these parameters with a common parametrised distribution. A prior then needs to be determined for the parameters of this common distribution, called hyperparameters. Since the hyperparameters control the distribution of a whole set of parameters, there should be fewer hyperparameters than parameters. The process can be iterated until all the structural knowledge has been used. In the end there are usually only a few priors left to determine, and since there is typically a lot of data, these priors usually have little significance for the learning process.
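As an illustration, a hierarchical prior of this kind could be sampled as follows (a minimal sketch; the Gaussian forms, scales and layer size are illustrative assumptions, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hyperparameters: one mean and one log-std shared by all biases.
# Their own priors are broad Gaussians (an illustrative assumption);
# working with the log-std makes the prior roughly uniform in ln(sigma).
m = rng.normal(0.0, 10.0)      # common mean of the biases
log_s = rng.normal(0.0, 2.0)   # common log standard deviation

# Parameters: all 30 biases of a hidden layer share the same
# parametrised distribution controlled by the two hyperparameters.
biases = rng.normal(m, np.exp(log_s), size=30)

# Only the two hyperpriors had to be fixed by hand, instead of
# choosing 30 separate priors for the individual biases.
print(f"m = {m:.2f}, sigma = {np.exp(log_s):.2f}")
print(biases[:5])
```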
For some it may be helpful to think about the prior in terms of coding. By using the formula $L(x) = -\log_2 P(x)$, any probabilities can be translated into code lengths. In coding terms, the prior means the aspects of the encoding which the sender and the receiver have agreed upon prior to the transmission of data.
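As a small numerical illustration of this translation (the probabilities below are made up):

```python
import math

# Shannon code length in bits: L(x) = -log2 P(x).
# Probable values get short codes, improbable values long codes.
for p in (0.5, 0.25, 0.01):
    print(f"P(x) = {p:<5} ->  L(x) = {-math.log2(p):5.2f} bits")
```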