BMERC : needle tools : Introduction : MRF score generation
This page describes the programs and dataflows required for threading protein cores according to Markov random field statistics. These programs generate the score and environment files used by the the needle protein threading program.
[need a proper introduction here, complete with references to published papers, threading context (i.e. relative to needle), etc. -- rgr, 20-Feb-97.]
For more information, please see the needle tools home page.
Given a library of cores, the steps for any statistical scoring scheme, such as MRF, are as follows:
Figure 2: Generating MRF files
The figure above shows the dataflow involved in generating MRF files from a set of exposure, core, and sequence files. For reference, the program names (in the rectangular boxes) are hyperlinked to the detailed command-line argument descriptions, and the descriptions of input and output files (in the ovals) are hyperlinked to their format definitions (which usually contain examples). Note: It is not necessary (and in fact not recommended!) to follow these links at a first reading. [Bug: Most of the hyperlinks don't work yet. -- rgr, 14-Mar-97.]
[this is the old textual version of the above information. -- rgr, 14-Mar-97.]
[this is really about pairwise score computation in general; should be rewritten as such, and GMT marginals put in that context. -- rgr, 30-Jan-98.]
The GMT "singleton" score term GMTS(r,aa) for a given core position r and amino acid aa is computed as an average of the logs (hence the "geometric mean") of a marginal term for each edge over the set of pairwise edges involving that residue. Each pairwise edge E = (r1,r2,env) in the set of all pairwise interactions PI denotes an interaction between a pair of residues at core positions r1 and r2, which positions could be occupied by amino acids aa1 and aa2, respectively. For any given residue r there are two subsets of edges E1(r) and E2(r) involving r in the first and second position respectively.
E1(r) = {E=(r1,r2,env)|E in PI and r=r1}
E2(r) = {E=(r1,r2,env)|E in PI and r=r2}
and therefore
sigmaE1(r)M(aa,env)+sigmaE2(r)Mt(aa,env)
GMTS(r,aa) = ----------------------------------------
|E1(r)|+|E2(r)|
Here, M and Mt are the GMT marginal terms
for the first and second residue, respectively. These are analogous to
the pairwise marginals Mp and
Mpt, which are defined as follows:
Mp(aa1,env) = -log[sigmaaa2P(aa1,aa2|env)];
and
Mpt(aa1,env) = -log[sigmaaa2P(aa1,aa2|env)].
Given the parwise marginals, the pairwise scores are computed thus:
PS(aa1,aa2,env) = -log[P(aa1,aa2|env)]-Mp(aa1,env)-Mpt(aa1,env)
An edge E can be thought of has having three sets of
attributes, in addition to the amino acids that may be threaded there:
those peculiar to the first core position, those peculiar to the second
core position, and those peculiar to the edge itself. A relevant core
position attribute could be exposure or secondary structure; the edge
itself could encode some notion of distance (e.g. "near" vs. "far"). In
the MRF scheme, the edge itself has no attributes (one could argue that
the same SS/different SS distinction constitutes such an attribute, but
the end result is the same for our purposes). To compute the GMT
marginal value for a given core position and amino acid, we wish to
consider the pairwise counts over all possible interactions such a core
position and amino acid could have, instead of just that environment.
We therefore compute each over a set S(env) of environments,
where S is defined by keeping one residue's attributes fixed
and varying the attributes of the other residue and those of the edge
itself. [this doesn't adequately explain the env versus
envt distinction, though. -- rgr, 5-Dec-97.]
Since mrf-scores does not have the details of the MRF pairwise environment definitions built-in (other than symmetry), these marginal sets must be defined externally by the -gmt-marginal-file option to the mrf-scores program. If this option is not supplied, the sets are made to be singletons, i.e. S(env)={env} for all env, which is equivalent to the pairwise marginals.
Marginals are therefore computed from the raw pairwise counts C[aa1,aa2,env] as follows:
M(aa1,env) = -log P(aa1|aa2,S(env));
and
Mt(aa1,env) = -log P(aa2|aa1,S(env))
where
sigmaenv in Ssigmaaa2C[aa1,aa2,env]
P(aa1|aa2,S) = ---------------------------------------------
sigmaaa1[sigmaenv in Ssigmaaa2C[aa1,aa2,env]]
and
sigmaenv in Ssigmaaa1C[aa2,aa1,env]
P(aa2|aa1,S) = ---------------------------------------------
sigmaaa2[sigmaenv in Ssigmaaa1C[aa1,aa2,env]]
Note that in general,
P(aa1|aa2,S) = P(aa2|aa1,S)
only if S contains only symmetric environments.
The GMT marginal file specifies these marginal sets completely. Here is the one for the "revised" WMS in its entirety. [must really decide what to call this. -- rgr, 8-Sep-97.]
# e-bur
5,6,7t,9t,11t,17,18,19t
# e-exp
6t,7,8,10t,12t,18t,19,20
# h-bur
1,2,3t,9,10,13,14,15t
# h-exp
2t,3,4,11,12,14t,15,16
The lines beginning with "#" are comments; they merely delimit
the four sets defined by the singleton states to the extent that they
are distinguished in the pairwise environments. This is the simplest
form of the file, consisting of lists of environment numbers, some of
which have a "t" (for "transpose") appended. All self-symmetric
environments appear exactly once, without the "t"; all others
appear exactly twice, once with a "t" and once without. The
singleton states thus divide the environments and their transposes into
equivalence classes. Given that the default class for the transpose of
a symmetric environment is that of the untransposed environment, the
file above implicitly describes a partition. (However, it would not pay
to make this partition explicit, since mentioning both the symmetric
environment "1" and its transpose "1t" would result in
including the same counts twice.) [but more on that when the complete
syntax is described. -- rgr, 8-Sep-97.]
For instance, the environment number 5 (strand_different_e1_e1 in the MRF environment encoding table above) appears in the first set ("e-bur") because it involves buried strand residues interacting with other buried strand residues; it is therefore self-symmetric, and so there is no "5t" environment.
[**finish**; describe detailed syntax. -- rgr, 8-Sep-97.]
The S? column records whether the environment is self-symmetric. If it is, this means that it is possible to switch the two residues and still end up with the same environment. Note that for the helix/strand cases, it is not even possible to switch the residues, so they are all asymmetric (and mrf-envs takes care to put the helix residue first). The -symmetric-pair-envs option to the mrf-counts and mrf-scores programs expects a list of such environments; the default is suitable for the encoding below.
| Mnemonic | Encoding | Interaction criteria | S? |
|---|---|---|---|
| helix_different_e1_e1 | 1 | CB distance < 7.2Å and theta > 90° | yes |
| helix_different_e1_e2 | 2 | no | |
| helix_different_e2_e1 | 3 | no | |
| helix_different_e2_e2 | 4 | yes | |
| strand_different_e1_e1 | 5 | CB distance < 7.2Å and:
|
yes |
| strand_different_e1_e2 | 6 | no | |
| strand_different_e2_e1 | 7 | no | |
| strand_different_e2_e2 | 8 | yes | |
| helix_strand_e1_e1 | 9 | CB distance < 7.2Å and theta > 90° | no |
| helix_strand_e1_e2 | 10 | no | |
| helix_strand_e2_e1 | 11 | no | |
| helix_strand_e2_e2 | 12 | no | |
| same_helix_e1_e1 | 13 | residues at ±1, ±3, ±4 only | yes |
| same_helix_e1_e2 | 14 | no | |
| same_helix_e2_e1 | 15 | no | |
| same_helix_e2_e2 | 16 | yes | |
| same_strand_e1_e1 | 17 | residues at ±2 only | yes |
| same_strand_e1_e2 | 18 | no | |
| same_strand_e2_e1 | 19 | no | |
| same_strand_e2_e2 | 20 | yes |