Harri Lähdesmäki -- ProbTF

Back to Software page

ProbTF

Probabilistic method for transcription factor binding prediction from multiple data sources

Description:

ProbTF provides a flexible and principled probabilistic framework to predict transcription factor binding from multiple data sources. ProbTF uses the standard position specific frequency matrix (PSFM) and the d:th order Markovian background models as its building blocks (this choice is arbitrary though, and similar methods can be implemented using other models as well). In addition, ProbTF can incorporate basically any genome-level information (i.e., information at the level of individual nucleotides) in a probabilistic manner to estimate binding probabilities. For example, ProbTF can use evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites, ChIP-chip binding segments and other (prior) biological knowledge. ProbTF can be used to compute/answer two different types of quantities/questions: 1) the probability that the whole promoter has a binding site, and 2) the binding probability at each nucleotide position. ProbTF provides both a likelihood and Bayesian method, where the latter is implemented with an MCMC algorithm. Further, ProbTF can also be used to model combinatorial regulation by several TFs.

Download:

A Matlab implementation of our ProbTF tool and associated functions can be downloaded for non commercial use from here.
Tools have been tested and used in Matlab releases 2007a and 2008a (versions 7.4 and 7.6).

Usage:

Matlab functions are provided with instructions that should be sufficiently detailed to allow their use. In addition, the usage of ProbTF tools is illustrated in light of an example below.

On-Line tool:

For non-programmers, we have also developed an on-line web tool which implements our probabilistic TF binding prediction method using multiple data sources http://www.probtf.org.

Supplementary data:

Test data set and supplementary data are also publicly available at: http://www.probtf.org.

Reference:

In case you find our software tools useful and apply them in your own work, you can cite our article. Thanks!

Harri Lähdesmäki, Alistair G. Rust and Ilya Shmulevich,
Probabilistic inference of transcription factor binding from multiple data sources,
PLoS ONE, Vol. 3, No. 3, e1820, 2008.
[pdf, html] [pdf]

Authors:

Harri Lähdesmäki (TUT), Alistair Rust (ISB) and Ilya Shmulevich (ISB)

Contact:

harri dot lahdesmaki at tut dot fi

An example usage of the ProbTF tools.

(Practically the same example can be found from the 'ExampleScript_Myod1.m' script that is included in the downloable zip-file.)

We illustrate the ProbTF tools using the promoter sequence of Myod1 gene in mouse and position specific frequency matrices (PSFM) from open access JASPAR. Note that the results shown below do not correspond to any of the results shown in our article which were computed using copyrighted TRANFAC matrices.

'Myod1-ORegAnno.gff' file in the test set (also included in the zip-file) specifies the verified transcription factor binding sites in the Myod1 promoter. They include, among others, binding sites for TBP (TATA-box) and AP2 binding factors between base pairs 1771-1777 and 1671-1678 upstream from the transcription start site (TSS), respectively. Both binding sites are on the forward strand.

>> type('Myod1-OReGAnno.gff');

Myod1    ORegAnno    tfbs_exp    1671    1678    1    +    .    Factor "AP2:CCTGGGGA"    OREG0000157
Myod1    ORegAnno    tfbs_exp    1688    1693    1    +    .    Factor "GC2:CCGCCC"    OREG0000158
Myod1    ORegAnno    tfbs_exp    1706    1712    1    +    .    Factor "CCAAT-box:ATTGGCT"    OREG0000159
Myod1    ORegAnno    tfbs_exp    1723    1728    1    +    .    Factor "Sp1:CCGCCC"    OREG0000315
Myod1    ORegAnno    tfbs_exp    1771    1777    1    +    .    Factor "TATA-box:ATAAATA"    OREG0000316
Myod1    ORegAnno    tfbs_exp    1706    1712    1    +    .    Factor "UNKNOWN:ATTGGCT"    OREG0002151
Myod1    ORegAnno    tfbs_exp    1688    1693    1    +    .    Factor "UNKNOWN:CCGCCC"    OREG0002150

Load data

Read the promoter sequence from file 'Myod1.fa' (included in the test data set as well as in the zip-file) and convert to integers. We have made an arbitrary decision to focus on 2kb region but this choice can be changed.

>> [S,L] = readfastaseqs('Myod1.fa');
>> S = S{1};
>> S = basepairs2num(S);

We use evolutionary conservation to illustrate the data fusion. Many binding sites are under evolutionary selection and, thus, binding sites are often, but not always, conserved. Conservation scores are computed using the PhastCons method. PhastCons outputs for each base pair location the probability for that location being evolutionarily conserved. Upload evolutionary conservation data from 'Myod1-Merged.gff' file (also included in the test set and the zip-file). In order to make the "prior" slightly smoother, we scale the probability values between 0.4 and 0.6.

>> [PC,L] = readPhastConsScoresSingle('Myod1-Merged.gff');
>> PC = PC{1};
>> PC = (2*0.1)*PC + 0.5 - 0.1;

Let us use a 0:th order Markovian background model:

>> Bd = [0.2799 0.2201 0.2201 0.2799];

Let us take PSFM models for the TBP and AP2 factors from the JASPAR database, add one pseudo count and normalize the columns. (AP2 PSFM in JASPAR database is actually for human but we use it for mouse here.)

>> M1 = [61   16 352    3 354 268 360 222 155   56   83   82   82   68   77;
      145   46    0   10    0    0    3    2   44 135 147 127 118 107 101;
     152   18    2    2    5    0   10   44 157 150 128 128 128 139 140;
      31 309   35 374   30 121    6 121   33   48   31   52   61   75   71];
>> M1 = M1 + ones(size(M1));
>> M1 = M1./(ones(4,1)*sum(M1,1));
>> M_1 = {M1};

>> M2 = [ 0   0   0 22 19 55 53 19   9;
       0 185 185 71 57 44 30 16 78;
         185   0   0 46 61 67 91 137 79;
        0   0   0 46 48 19 11 13 19];
>> M2 = M2 + ones(size(M2));
>> M2 = M2./(ones(4,1)*sum(M2,1));
>> M_2 = {M2};

Compute binding predictions to the whole promoter

Now we are ready to compute the binding predictions. Let us first compute binding predictions for TBP to Myod1 promoter without and with the data fusion.

>> [Pc1,c_mu1,priorA1] = TFBS_DF_Likelihood(S,[],M_1,Bd,0);
>> [Pc2,c_mu2,priorA2] = TFBS_DF_Likelihood(S,PC,M_1,Bd,0);

The first output, 'Pc', specifies the probability of having 0,1,2,3,... binding sites. The most natural measure of binding, the probability of having at least one binding site, can be computed as shown below. Note that with the help of data fusion the binding prediction improves remarkably.

>> p1 = 1 - Pc1(1)

            p1 =

    0.2426

>> p2 = 1 - Pc2(1)

            p2 =

    0.8899

Let us compute the same binding predictions for AP2 to Myod1 promoter. Note that because AP2 binding site happens to be at a location that is believed to be non-conserved, the data fusion slightly decreases the predicted binding probability. Because the data fusion is done probabilistically, however, correct conclusion can still be made (i.e., the binding prediction is not set to zero as is typically done in the phylogenetic footprinting methods).

>> [Pc3,c_mu3,priorA3] = TFBS_DF_Likelihood(S,[],M_2,Bd,0);
>> [Pc4,c_mu4,priorA4] = TFBS_DF_Likelihood(S,PC,M_2,Bd,0);
>> p3 = 1 - Pc3(1)

p3 =

0.9365

>> p4 = 1 - Pc4(1)

p4 =

0.8591

Compute binding predictions at single nucleotide resolution

Next we compute binding predictions for TBP to Myod1 at single nucleotide resolution. For that purpose we need to use an MCMC sampling/Bayes version of our method. Argument 'priorA1' defines the logarithm of a prior over multiple motif configurations for each fixed number of motifs. Sounds complicated but that can be obtained from the output of 'TFBS_DF_Likelihood'. For the MCMC sampler, we set the burn-in sample size to 100 000 and then repeatedly draw 100 000 samples more until convergence criterion of 0.05 is reached. Maximum number of iterations is set to 1 000 000. The method actually runs two parallel (and independent) chains based on which the convergence is assessed. Both chains reach the convergence criterion immediately after 100 000 + 100 000 steps.

>> [postnm1,postloc1,postcomb1,totalsamples1] = ...
   TFBS_DF_Bayes_MCMC_convdiag(S,[],0,Bd,M_1,1,priorA1,[],...
   100000,100000,1000000,0.05,2);
burnin = 100000, MCMC samples = 100000

>> [postnm2,postloc2,postcomb2,totalsamples2] = ...
   TFBS_DF_Bayes_MCMC_convdiag(S,PC,0,Bd,M_1,1,priorA1,[],...
   100000,100000,1000000,0.05,2);
burnin = 100000, MCMC samples = 100000

The estimated binding locations are shown below, first for the case of no data fusion and then for the case where conservation data has been used. Horizontal axis corresponds to the location relative to TSS and y-axis shows (blue graph) the estimated binding probability. TSS is now located to left. The grey boxes show the verified binding locations. Note that the estimated locations are off by a few base pairs because the PSFM is slightly wider than the known binding site.

Without data fusion:
TBP_Myod1

With data fusion:
TBP_Myod1_PhastCons

We can also compute the binding probability to the whole promoter using the above Bayes estimation. Results are similar with those obtained with the frequentist method.

>> p5 = 1 - postnm1(1)

p5 =

0.2801

>> p6 = 1 - postnm2(1)

p6 =

0.9730

Combinatorial regulation by multiple TFs

Next we illustrate how our tools can be used to assess the probability of combinatorial regulation by multiple TFs. For that purpose, we use the above two verified binding sites (TBP and AP2) on Myod1 promoter. A simple approach is to compute an estimate by multiplying binding probabilities from individual runs (with and without data fusion), i.e.,

>> pp1 = p1*p3

pp1 =

0.2272

>> pp2 = p2*p4

pp2 =

0.7645

We can also analyze the two TFs simultaneously using our MCMC sampler. (The first function call is only needed to construct the prior 'priorA'.) Now the MCMC sampler needs 400 000 and 300 000 samples to reach convergence.

>> [Pc,c_mu,priorA] = TFBS_DF_Likelihood(S,[],M_12,Bd,0);
>> M_12 = {M1,M2};
>> [postnm3,postloc3,postcomb3,totalsamples3] = ...
   TFBS_DF_Bayes_MCMC_convdiag(S,[],0,Bd,M_12,1,priorA,[],...
   100000,100000,1000000,0.05,2);
burnin = 100000, MCMC samples = 100000
Increasing MCMC iterations to 200000
Increasing MCMC iterations to 300000
Increasing MCMC iterations to 400000

>> [postnm4,postloc4,postcomb4,totalsamples4] = ...
   TFBS_DF_Bayes_MCMC_convdiag(S,PC,0,Bd,M_12,1,priorA,[],...
   100000,100000,1000000,0.05,2);
burnin = 100000, MCMC samples = 100000
Increasing MCMC iterations to 200000
Increasing MCMC iterations to 300000

>> pp3 = 1 - postnm1(1)

pp3 =

    0.4476

>> pp4 = 1 - postnm2(1)

pp4 =

    0.9245

Binding predictions using both strands of the DNA

Binding predictions can also be done using both strands of the DNA and the results remain similar. This is illustrated below by predicting TBP binding to both strands of Myod1 promoter.

>> [Pc5,c_mu5,priorA5] = TFBS_DF_Likelihood_Double(S,[],M_1,Bd,0);
>> [Pc6,c_mu6,priorA6] = TFBS_DF_Likelihood_Double(S,PC,M_1,Bd,0);
>> p7 = 1 - Pc5(1)

p7 =

0.0630

>> p8 = 1 - Pc6(1)

p8 =

0.6109

Back to Software page

Last modified 9 Oct 2008.