Introduction to MRF score generation

BMERC : needle tools : Introduction : MRF score generation


This page describes the programs and dataflows required for threading protein cores according to Markov random field statistics. These programs generate the score and environment files used by the the needle protein threading program.

[need a proper introduction here, complete with references to published papers, threading context (i.e. relative to needle), etc. -- rgr, 20-Feb-97.]

For more information, please see the needle tools home page.

Table of contents

  1. Introduction to MRF score generation
    1. Table of contents
    2. Outline of MRF generation programs and files
    3. GMT marginal sets
    4. MRF implementation details
      1. Environment encoding

Outline of MRF generation programs and files

Given a library of cores, the steps for any statistical scoring scheme, such as MRF, are as follows:

  1. For each member of the library, classify its singleton core positions and pairwise interactions into discrete "states" or "environments";
  2. Count all amino acids observed in each environment over the whole core set;
  3. Perform any desired statistical massaging to compensate for low counts; and
  4. Produce score files from the modified counts.
Note that the MRF code combines the last two steps, since the massaging is simple and doesn't demand a separate program. In principle, one could effect this separation by adding a separate counts preprocessing step and omitting to use the code built into the score file generation. mrf-envs mrf-counts mrf-scores mrf-scores

Figure 2: Generating MRF files

The figure above shows the dataflow involved in generating MRF files from a set of exposure, core, and sequence files. For reference, the program names (in the rectangular boxes) are hyperlinked to the detailed command-line argument descriptions, and the descriptions of input and output files (in the ovals) are hyperlinked to their format definitions (which usually contain examples). Note: It is not necessary (and in fact not recommended!) to follow these links at a first reading. [Bug: Most of the hyperlinks don't work yet. -- rgr, 14-Mar-97.]

[this is the old textual version of the above information. -- rgr, 14-Mar-97.]


GMT marginal sets

[this is really about pairwise score computation in general; should be rewritten as such, and GMT marginals put in that context. -- rgr, 30-Jan-98.]

The GMT "singleton" score term GMTS(r,aa) for a given core position r and amino acid aa is computed as an average of the logs (hence the "geometric mean") of a marginal term for each edge over the set of pairwise edges involving that residue. Each pairwise edge E = (r1,r2,env) in the set of all pairwise interactions PI denotes an interaction between a pair of residues at core positions r1 and r2, which positions could be occupied by amino acids aa1 and aa2, respectively. For any given residue r there are two subsets of edges E1(r) and E2(r) involving r in the first and second position respectively.

    E1(r) = {E=(r1,r2,env)|E in PI and r=r1}

    E2(r) = {E=(r1,r2,env)|E in PI and r=r2}
and therefore
		  sigmaE1(r)M(aa,env)+sigmaE2(r)Mt(aa,env)
    GMTS(r,aa) = ----------------------------------------
			     |E1(r)|+|E2(r)|
Here, M and Mt are the GMT marginal terms for the first and second residue, respectively. These are analogous to the pairwise marginals Mp and Mpt, which are defined as follows:
    Mp(aa1,env) = -log[sigmaaa2P(aa1,aa2|env)];
and
    Mpt(aa1,env) = -log[sigmaaa2P(aa1,aa2|env)].
Given the parwise marginals, the pairwise scores are computed thus:
    PS(aa1,aa2,env) = -log[P(aa1,aa2|env)]-Mp(aa1,env)-Mpt(aa1,env)
An edge E can be thought of has having three sets of attributes, in addition to the amino acids that may be threaded there: those peculiar to the first core position, those peculiar to the second core position, and those peculiar to the edge itself. A relevant core position attribute could be exposure or secondary structure; the edge itself could encode some notion of distance (e.g. "near" vs. "far"). In the MRF scheme, the edge itself has no attributes (one could argue that the same SS/different SS distinction constitutes such an attribute, but the end result is the same for our purposes). To compute the GMT marginal value for a given core position and amino acid, we wish to consider the pairwise counts over all possible interactions such a core position and amino acid could have, instead of just that environment. We therefore compute each over a set S(env) of environments, where S is defined by keeping one residue's attributes fixed and varying the attributes of the other residue and those of the edge itself. [this doesn't adequately explain the env versus envt distinction, though. -- rgr, 5-Dec-97.]

Since mrf-scores does not have the details of the MRF pairwise environment definitions built-in (other than symmetry), these marginal sets must be defined externally by the -gmt-marginal-file option to the mrf-scores program. If this option is not supplied, the sets are made to be singletons, i.e. S(env)={env} for all env, which is equivalent to the pairwise marginals.

Marginals are therefore computed from the raw pairwise counts C[aa1,aa2,env] as follows:

    M(aa1,env) = -log P(aa1|aa2,S(env));
and
    Mt(aa1,env) = -log P(aa2|aa1,S(env))
where
			sigmaenv in Ssigmaaa2C[aa1,aa2,env]
    P(aa1|aa2,S) = ---------------------------------------------
		    sigmaaa1[sigmaenv in Ssigmaaa2C[aa1,aa2,env]]
and
			sigmaenv in Ssigmaaa1C[aa2,aa1,env]
    P(aa2|aa1,S) = ---------------------------------------------
		    sigmaaa2[sigmaenv in Ssigmaaa1C[aa1,aa2,env]]
Note that in general,
    P(aa1|aa2,S) = P(aa2|aa1,S)
only if S contains only symmetric environments.

The GMT marginal file specifies these marginal sets completely. Here is the one for the "revised" WMS in its entirety. [must really decide what to call this. -- rgr, 8-Sep-97.]

    # e-bur
    5,6,7t,9t,11t,17,18,19t
    # e-exp
    6t,7,8,10t,12t,18t,19,20
    # h-bur
    1,2,3t,9,10,13,14,15t
    # h-exp
    2t,3,4,11,12,14t,15,16
The lines beginning with "#" are comments; they merely delimit the four sets defined by the singleton states to the extent that they are distinguished in the pairwise environments. This is the simplest form of the file, consisting of lists of environment numbers, some of which have a "t" (for "transpose") appended. All self-symmetric environments appear exactly once, without the "t"; all others appear exactly twice, once with a "t" and once without. The singleton states thus divide the environments and their transposes into equivalence classes. Given that the default class for the transpose of a symmetric environment is that of the untransposed environment, the file above implicitly describes a partition. (However, it would not pay to make this partition explicit, since mentioning both the symmetric environment "1" and its transpose "1t" would result in including the same counts twice.) [but more on that when the complete syntax is described. -- rgr, 8-Sep-97.]

For instance, the environment number 5 (strand_different_e1_e1 in the MRF environment encoding table above) appears in the first set ("e-bur") because it involves buried strand residues interacting with other buried strand residues; it is therefore self-symmetric, and so there is no "5t" environment.

[**finish**; describe detailed syntax. -- rgr, 8-Sep-97.]


MRF implementation details

[in progress; originally just a container for the new location of the pairwise encoding. may belong somewhere else. -- rgr, 13-Jan-98.]

Environment encoding

These are what the numbers in the
pairwise environment files generated by the mrf-envs program mean. The words "helix" and "strand" as used in the mnemonic names are self-explanatory; the words "same" and "different" refer to whether the two interacting residues are in the same or different strand or helix. Given these basic secondary structure classifications, some further criteria are applied to decide whether residues in these positions might actually interact. These fall into three classes: Finally, given that the residues are deemed to interact, the exposure for each residue is classified, reflected by the two suffixes (one for each residue), which have possible values of "e1" ("exposure level 1") means "buried", while "e2" means "exposed"; the threshold between these is controlled by the -pair-exp-threshold option.

The S? column records whether the environment is self-symmetric. If it is, this means that it is possible to switch the two residues and still end up with the same environment. Note that for the helix/strand cases, it is not even possible to switch the residues, so they are all asymmetric (and mrf-envs takes care to put the helix residue first). The -symmetric-pair-envs option to the mrf-counts and mrf-scores programs expects a list of such environments; the default is suitable for the encoding below.

Mnemonic Encoding Interaction criteria S?
helix_different_e1_e11 CB distance < 7.2Å and theta  > 90° yes
helix_different_e1_e22no
helix_different_e2_e13no
helix_different_e2_e24yes
strand_different_e1_e15 CB distance < 7.2Å and:
  • theta  < 90° if on same sheet or adjacent in same barrel; or
  • theta  > 90° if on different sheets or sheet/barrel or nonadjacent in same barrel.
yes
strand_different_e1_e26no
strand_different_e2_e17no
strand_different_e2_e28yes
helix_strand_e1_e19 CB distance < 7.2Å and theta  > 90° no
helix_strand_e1_e210no
helix_strand_e2_e111no
helix_strand_e2_e212no
same_helix_e1_e113 residues at ±1, ±3, ±4 only yes
same_helix_e1_e214no
same_helix_e2_e115no
same_helix_e2_e216yes
same_strand_e1_e117 residues at ±2 only yes
same_strand_e1_e218no
same_strand_e2_e119no
same_strand_e2_e220yes

Bob Rogers <rogers@darwin.bu.edu>
Last modified: Fri Nov 26 19:58:24 EST 1999