Description:
ProbTF provides a flexible and principled probabilistic
framework to predict transcription factor binding from
multiple data sources. ProbTF uses the standard position specific
frequency matrix (PSFM) and the d:th order Markovian background models
as its building blocks (this choice is arbitrary though, and similar
methods can be implemented using other models as well). In addition,
ProbTF can incorporate basically any genome-level information (i.e.,
information at the level of individual nucleotides) in a probabilistic
manner to estimate binding probabilities. For example, ProbTF can use
evolutionary conservation, regulatory
potential, CpG islands, nucleosome positioning, DNase hypersensitive
sites, ChIP-chip binding segments and other (prior)
biological knowledge. ProbTF can be used to compute/answer two
different types of quantities/questions: 1) the probability that the
whole promoter has a binding site, and 2) the binding probability at
each nucleotide position. ProbTF provides both a likelihood and
Bayesian method, where the latter is implemented with an MCMC
algorithm. Further, ProbTF can also be used to model combinatorial
regulation by several TFs.
Download:
A Matlab implementation of our ProbTF tool and associated functions can be downloaded for non commercial use from here.
Tools have been tested and used in Matlab releases 2007a and 2008a (versions 7.4 and 7.6).
Usage:
Matlab functions are provided with instructions that should be
sufficiently detailed to allow their use. In addition, the usage of
ProbTF tools is illustrated in light of an example below.
On-Line tool:
For non-programmers, we have also developed an on-line web tool which
implements our probabilistic TF binding prediction method using
multiple data sources http://www.probtf.org.
Supplementary data:
Reference:
In case you find our software tools useful and apply them in your own work, you can cite our article. Thanks!
Harri Lähdesmäki, Alistair G. Rust and Ilya Shmulevich,
Probabilistic inference of transcription factor binding from multiple data sources,
PLoS ONE, Vol. 3, No. 3, e1820, 2008.
[ pdf, html] [ pdf]
Authors:
Harri Lähdesmäki (TUT), Alistair Rust (ISB) and Ilya Shmulevich (ISB)
Contact:
harri dot lahdesmaki at tut dot fi
An example usage of the ProbTF tools.
(Practically the same example can be found from the 'ExampleScript_Myod1.m' script that is included in the downloable zip-file.)
We illustrate the ProbTF tools using the promoter sequence of Myod1
gene in mouse and position specific frequency matrices (PSFM) from
open access JASPAR. Note
that the results shown below do not correspond to any of
the results shown in our article which were computed using
copyrighted TRANFAC matrices.
'Myod1-ORegAnno.gff' file in the test set (also included in the zip-file) specifies the verified
transcription factor binding sites in the Myod1 promoter. They include,
among others, binding sites for TBP (TATA-box) and AP2 binding factors
between base pairs 1771-1777 and 1671-1678 upstream from the
transcription start site (TSS), respectively. Both binding sites are on the forward
strand.
>> type('Myod1-OReGAnno.gff');
Myod1
ORegAnno tfbs_exp
1671 1678 1
+ . Factor
"AP2:CCTGGGGA" OREG0000157
Myod1
ORegAnno tfbs_exp
1688 1693 1
+ . Factor
"GC2:CCGCCC" OREG0000158
Myod1
ORegAnno tfbs_exp
1706 1712 1
+ . Factor
"CCAAT-box:ATTGGCT" OREG0000159
Myod1
ORegAnno tfbs_exp
1723 1728 1
+ . Factor
"Sp1:CCGCCC" OREG0000315
Myod1
ORegAnno tfbs_exp
1771 1777 1
+ . Factor
"TATA-box:ATAAATA" OREG0000316
Myod1
ORegAnno tfbs_exp
1706 1712 1
+ . Factor
"UNKNOWN:ATTGGCT" OREG0002151
Myod1
ORegAnno tfbs_exp
1688 1693 1
+ . Factor
"UNKNOWN:CCGCCC" OREG0002150
Load data
Read the promoter sequence from file 'Myod1.fa' (included in the
test data set as well as in the zip-file) and convert to integers. We
have made an arbitrary decision to focus on 2kb region but this choice
can be changed.
>> [S,L] = readfastaseqs('Myod1.fa');
>> S = S{1};
>> S = basepairs2num(S);
We use evolutionary conservation to illustrate the data fusion. Many binding sites are
under evolutionary selection and, thus, binding sites are often, but
not always, conserved. Conservation scores are computed using the
PhastCons method. PhastCons outputs for each base pair location the
probability for that location being evolutionarily conserved. Upload
evolutionary conservation data from 'Myod1-Merged.gff' file (also
included in the test set and the zip-file). In order
to make the "prior" slightly smoother, we scale the probability values
between 0.4 and 0.6.
>> [PC,L] = readPhastConsScoresSingle('Myod1-Merged.gff');
>> PC = PC{1};
>> PC = (2*0.1)*PC + 0.5 - 0.1;
Let us use a 0:th order Markovian background model:
>> Bd = [0.2799 0.2201 0.2201 0.2799];
Let us take PSFM models for the TBP and AP2 factors from the JASPAR
database, add one pseudo count and normalize the columns. (AP2 PSFM in
JASPAR database is actually for human but we use it for mouse here.)
>>
M1 = [61 16 352 3 354
268 360 222 155 56
83 82 82 68 77;
145 46 0
10 0 0
3 2 44 135 147
127 118 107 101;
152 18 2
2 5 0 10
44 157 150 128 128 128 139
140;
31 309 35 374 30
121 6 121 33
48 31 52 61
75 71];
>> M1 = M1 + ones(size(M1));
>> M1 = M1./(ones(4,1)*sum(M1,1));
>> M_1 = {M1};
>> M2 = [ 0 0 0 22 19 55 53 19 9;
0 185 185 71 57 44 30 16 78;
185 0 0 46 61 67 91 137 79;
0 0 0 46 48
19 11 13 19];
>> M2 = M2 + ones(size(M2));
>> M2 = M2./(ones(4,1)*sum(M2,1));
>> M_2 = {M2};
Compute binding predictions to the whole promoter
Now we are ready to compute the binding predictions. Let us first
compute binding predictions for TBP to Myod1 promoter without and
with the data fusion.
>> [Pc1,c_mu1,priorA1] = TFBS_DF_Likelihood(S,[],M_1,Bd,0);
>> [Pc2,c_mu2,priorA2] = TFBS_DF_Likelihood(S,PC,M_1,Bd,0);
The first output, 'Pc', specifies the probability of having 0,1,2,3,...
binding sites. The most natural measure of binding, the probability of
having at least one binding site, can be computed as shown below.
Note that with the help of data fusion the binding prediction improves
remarkably.
>> p1 = 1 - Pc1(1)
p1 =
0.2426
>> p2 = 1 - Pc2(1)
p2 =
0.8899
Let us compute the same binding predictions for AP2 to
Myod1 promoter. Note that because AP2 binding site happens to
be at a location that is believed to be non-conserved, the
data fusion slightly decreases the predicted binding probability.
Because the data fusion is done probabilistically, however, correct
conclusion can still be made (i.e., the binding prediction is not set
to zero as is typically done in the phylogenetic footprinting methods).
>> [Pc3,c_mu3,priorA3] = TFBS_DF_Likelihood(S,[],M_2,Bd,0);
>> [Pc4,c_mu4,priorA4] = TFBS_DF_Likelihood(S,PC,M_2,Bd,0);
>> p3 = 1 - Pc3(1)
p3 =
0.9365
>> p4 = 1 - Pc4(1)
p4 =
0.8591
Compute binding predictions at single nucleotide resolution
Next we compute binding predictions for TBP to Myod1 at single
nucleotide resolution. For that purpose we need to use an MCMC
sampling/Bayes version of our method. Argument 'priorA1' defines the
logarithm of a
prior over multiple motif configurations for each fixed number of
motifs. Sounds complicated but that can be obtained from the output of
'TFBS_DF_Likelihood'. For the MCMC sampler, we set the burn-in sample
size to 100 000 and then repeatedly draw 100 000 samples more until
convergence criterion of 0.05 is reached. Maximum number of iterations
is set to 1 000 000. The method actually runs two
parallel (and independent) chains based on which the convergence is
assessed. Both chains reach the convergence criterion immediately after
100 000 + 100 000 steps.
>> [postnm1,postloc1,postcomb1,totalsamples1] = ...
TFBS_DF_Bayes_MCMC_convdiag(S,[],0,Bd,M_1,1,priorA1,[],...
100000,100000,1000000,0.05,2);
burnin = 100000, MCMC samples = 100000
>> [postnm2,postloc2,postcomb2,totalsamples2] = ...
TFBS_DF_Bayes_MCMC_convdiag(S,PC,0,Bd,M_1,1,priorA1,[],...
100000,100000,1000000,0.05,2);
burnin = 100000, MCMC samples = 100000
The estimated binding locations are shown below, first for the case of
no data fusion and then for the case where conservation data has been
used. Horizontal axis corresponds to the location relative to TSS and
y-axis shows (blue graph) the estimated binding probability. TSS is now located to left. The
grey boxes show the verified binding locations. Note that the estimated
locations are off by a few base pairs because the PSFM is slightly
wider than the known binding site.
Without data fusion:
With data fusion:
We can also compute the binding probability to the
whole promoter using the above Bayes estimation. Results are similar
with those obtained with the frequentist method.
>> p5 = 1 - postnm1(1)
p5 =
0.2801
>> p6 = 1 - postnm2(1)
p6 =
0.9730
Combinatorial regulation by multiple TFs
Next we illustrate how our tools can be used to assess the probability
of combinatorial regulation by multiple TFs. For that purpose, we use
the above two verified binding sites (TBP and AP2) on Myod1 promoter. A
simple approach is to compute an estimate by multiplying binding
probabilities from individual runs (with and without data fusion), i.e.,
>> pp1 = p1*p3
pp1 =
0.2272
>> pp2 = p2*p4
pp2 =
0.7645
We can also analyze the two TFs simultaneously using our MCMC sampler.
(The first function call is only needed to construct the prior
'priorA'.) Now the MCMC sampler needs 400 000 and 300 000 samples to
reach convergence.
>> [Pc,c_mu,priorA] = TFBS_DF_Likelihood(S,[],M_12,Bd,0);
>> M_12 = {M1,M2};
>> [postnm3,postloc3,postcomb3,totalsamples3] = ...
TFBS_DF_Bayes_MCMC_convdiag(S,[],0,Bd,M_12,1,priorA,[],...
100000,100000,1000000,0.05,2);
burnin = 100000, MCMC samples = 100000
Increasing MCMC iterations to 200000
Increasing MCMC iterations to 300000
Increasing MCMC iterations to 400000
>> [postnm4,postloc4,postcomb4,totalsamples4] = ...
TFBS_DF_Bayes_MCMC_convdiag(S,PC,0,Bd,M_12,1,priorA,[],...
100000,100000,1000000,0.05,2);
burnin = 100000, MCMC samples = 100000
Increasing MCMC iterations to 200000
Increasing MCMC iterations to 300000
>> pp3 = 1 - postnm1(1)
pp3 =
0.4476
>> pp4 = 1 - postnm2(1)
pp4 =
0.9245
Binding predictions using both strands of the DNA
Binding predictions can also be done using both strands of the DNA and
the results remain similar. This is illustrated below by predicting TBP
binding to both strands of Myod1 promoter.
>> [Pc5,c_mu5,priorA5] = TFBS_DF_Likelihood_Double(S,[],M_1,Bd,0);
>> [Pc6,c_mu6,priorA6] = TFBS_DF_Likelihood_Double(S,PC,M_1,Bd,0);
>> p7 = 1 - Pc5(1)
p7 =
0.0630
>> p8 = 1 - Pc6(1)
p8 =
0.6109
|