MRF programs

BMERC : needle tools : Programs : MRF programs


The following sections describe the individual members of the MRF suite, in "man page" style (more or less), with all parameters documented. See below for the list of known bugs and built-in limitations in these programs. See also the make-mrf-depends.pl script.

Table of contents

  1. MRF programs
    1. Table of contents
    2. sing-envs.pl
    3. mrf-envs
      1. mrf-envs command-line options
      2. Running mrf-envs
      3. mrf-envs hardwired options
    4. mrf-counts
    5. mrf-scores
    6. Known bugs in the MRF programs
    7. Limitations


sing-envs.pl

Given a core file and its corresponding exposure file, sing-envs.pl writes the singleton environments in
environment file format, to the standard output by default. Where there is overlap with mrf-envs, the arguments are compatible, though sing-envs.pl has more exposure options and doesn't produce pairwise files.

sing-envs.pl has two main advantages over mrf-envs when the latter is used to produce singleton environment files:

The main disadvantage is that sing-envs.pl is slower than mrf-envs, though not greatly so in the larger scheme of things (i.e. making counts and scores as well as environments). So the usual recipe is to use sing-envs.pl if one of the above additional features is required, and mrf-envs otherwise.

All command-line options are keyword-style, i.e. -name, followed by a value if -name requires one.

File options:

-core-file core-file-name
name of a "core" format file, e.g. "2mhr.core", required for input.
-exposure-file exposure-file-name
name of a .nexp (exposure) format file. [The file is interpreted as if it were the argument to the -vv-file option if the name ends in ".singl". This is not very satisfactory -- it is just a hack to avoid changing makefiles -- and may go away. -- rgr, 13-Jan-98.]
-nexp-file nexp-file-name
name of a .nexp (exposure) format file (i.e. like -exposure-file but forces interpretation as .nexp format).
-vv-file vv-file-name
name of a singleton visible volume file to be used for exposure purposes (i.e. like -exposure-file but forces interpretation as visible volume format).
-sequence-file sequence-file-name
ignored; this argument is accepted only for compatibility with mrf-envs.
-sing-env-file sing-env-file-name
write the singleton environments to the named file; the default is the standard output.
The core and exposure files are always required, but any of the -exposure-file, -nexp-file, or -vv-file options may be used to specify the exposure. Any may be specified as "-", in which case the standard input is read.

State definition options:

-n-singleton-states N (integer; default 20)
number of singleton states to define. State numbers are 1-based (i.e. from 1 through N  inclusive). N  must be at least 1 and no greater than S_MAX_NUM_COND. (Actually, only mrf-counts and mrf-scores need to worry about S_MAX_NUM_COND; sing-envs.pl doesn't care, so it doesn't enforce this limit.)
-sing-exp-bin-width exposure-value (positive float; default 12.5)
width in exposure units of the singleton exposure states (bins). For example, positions with exposure in the range 0 <= exposure < sing-exp-bin-width are labeled as state 1, positions with exposure sing-exp-bin-width <= exposure < 2*sing-exp-bin-width are labeled as state 2, etc. All exposures greater than n-singleton-states*sing-exp-bin-width are lumped together in the top state (state n-singleton-states). The default value is appropriate for Eisenberg fat alanine exposure (non-normalized, in square Ångstroms).
-sing-exp-bins threshold . . . (list of numbers in increasing order; default none)
specifies explicit exposure bin thresholds for variable width bins. The exposure values must be non-negative real numbers (integers or floats) supplied as separate arguments in increasing order, e.g.
       -sing-exp-bins 22 37 45 57 68
Given N values, the exposure will fall into bin 1 if less than or equal to the first number, into bin 2 if greater than the first number but less than or equal to the second number, . . ., and into bin N+1 if greater than the last number. Thus, the value 38.8 will fall into bin 3 out of 6 if given the -sing-exp-bins above.
-sing-ss-p
requests inclusion of secondary structure differences in singleton state definition, which is done by default. If specified (or defaulted), then n-singleton-states/2 exposure bins are used, with the first n-singleton-states/2 states for helices, and the next n-singleton-states/2 for strands. n-singleton-states will be rounded up to an even number if necessary. [could really use a better name for this argument. -- rgr, 18-Nov-96.]
-sing-no-ss-p
requests that counts for different secondary structures be combined in the same state. If specified, then the singleton state is the same as the exposure bin.

Other options:

-debug
requests debugging output on stderr.
-verbose
requests verbose output. All such output, like error messages, goes to stderr. [At present, this is used solely for debugging purposes. -- rgr, 13-Jan-98.]
Together, these options define the number and character of the singleton environments. Needless to say, these definitions must be applied consistently across a given model set for the counts and scores to make sense.


mrf-envs

Given a core and its corresponding sequence and exposure files, the mrf-envs program writes the singleton,
GMT, and/or pairwise MRF environment files in environment file format.

mrf-envs command-line options

All command-line options are keyword-style, i.e. -name, followed by a value if -name requires one.

File options:

-core-file core-file-name
name of a "core" format file, e.g. "2mhr.core", required for input.
-exposure-file exposure-file-name
name of a .nexp (exposure) format file, required for singleton and pairwise environments.
-sequence-file sequence-file-name
name of an IG format sequence file, e.g. "2mhr.seq", required for singleton and pairwise environments. [In principle, the environment could depend on the indices of the native sequence; in practice, the MRF environment does not, so this requirement could be lifted. -- rgr, 13-Jan-97.]
-energy-file energy-file-name
file of contact information based on CHARMm energy. If specified, this is used to further filter pairwise contacts. [The code for doing this still exists, though I haven't been able to test it. But we may not be able to continue to support this in any case; the input file apparently doesn't support chain ID's. -- rgr, 18-Nov-96.]
-sing-env-file sing-env-file-name
write the singleton environments to the named file.
-gmt-env-file gmt-file-name
write the singleton environments that go with the geometric mean term scores to gmt-file-name.
-pair-env-file pair-env-file-name
write the pairwise environments to the named file.
Each specified output file will be written as requested. The core, sequence, and exposure files are required in all cases, except that if only GMT environments are required, then only the core file is needed.

State definition options:

-n-singleton-states N (integer; default 20)
number of singleton states to define. State numbers are 1-based (i.e. from 1 through N  inclusive). N  must be at least 1 and no greater than S_MAX_NUM_COND. [Actually, only mrf-counts and mrf-scores need to worry about S_MAX_NUM_COND; mrf-envs doesn't really need to care, so the fact that it enforces this limit is something of a bug, though still useful. -- rgr, 14-Feb-97.] (Note that there is no -n-pairwise-states option, since changing the number is not meaningful without also changing the code that determines the pairwise environment.)
-sing-exp-bin-width exposure-value (float; default 12.5)
width in exposure units of the singleton exposure states (bins). For example, positions with exposure in the range 0 <= exposure < sing-exp-bin-width are labeled as state 1, positions with exposure sing-exp-bin-width <= exposure < 2*sing-exp-bin-width are labeled as state 2, etc. All exposures greater than n-singleton-states*sing-exp-bin-width are lumped together in the top state (state n-singleton-states). The default value is appropriate for Eisenberg fat alanine exposure (non-normalized, in square Ångstroms).
-sing-ss-p
requests inclusion of secondary structure differences in singleton state definition. If this option is specified (or defaulted), then the first n-singleton-states/2 states will apply to helices, and the next n-singleton-states/2 will apply to strands. n-singleton-states will be rounded up to an even number if necessary. [could really use a better name. -- rgr, 18-Nov-96.]
-sing-no-ss-p
requests that counts for different secondary structures be combined in the same state. If specified, then the singleton state depends only on exposure.
-pair-exp-threshold exposure-value (float; default 37.5)
threshold in exposure units between exposed and buried pairwise states (see the MRF environment encoding for details). The default value is appropriate for Eisenberg fat alanine exposure (non-normalized, in square Ångstroms). (Details of its calibration experiments have been lost, though. -- rgr, 23-Jan-97.)
Together, these options define the number and character of the singleton environments and (to a lesser degree) the pairwise environments. Needless to say, these definitions must be consistent across a given model set for the counts and scores to make sense.

[more -- table of resulting singleton states? -- rgr, 14-Feb-97.]

Other switch options:

-print-features
if specified, the pairwise features are written to the standard output, one per line. The values written are a pairwise interaction identifier (made from the core name and the residue core total indices), the residue index, 3-letter amino-acid abbreviation, and secondary structure letter ("H" or "E") for the first residue, the residue index, 3-letter amino-acid abbreviation, and secondary structure letter for the second residue, 1 if the residues lie on the same core segment (else 0), the distance between beta carbons, two angles [known only to me as "phi1" and "phi2"; the code doesn't document them -- rgr, 10-Jan-97], the angle in degrees between the two alpha-beta vectors, and the two exposure values.
-verbose
specifies verbose output. All such output, like error messages, goes to stderr. [At present, this is used solely for debugging purposes. -- rgr, 18-Nov-96.]

Running mrf-envs

[to be filled in later. probably not much to say. -- rgr, 10-Jan-97.]

[Note that the sequence file could be omitted altogether. In principle, the environment assignment function could use the sequence positions of the core residues; in practice, the MRF function does not. -- rgr, 23-Jan-97.]

mrf-envs hardwired options

None of these are implemented as command-line parameters yet; you need to edit the stat-env.h file & recompile if you want to change them.

Hardwired threshold options:

-angle-threshold angle (degrees)
minimum angle between alpha-beta vectors below which two residues are considered not to interact. [Actually, it's more complicated than that . . . -- rgr, 14-Feb-97.] The default is 90 degrees (i.e. the alpha-beta vectors must be closer to anti-parallel than parallel), which captures the notion that the beta carbons must be pointing approximately "at" each other. (Updated version of the ANGLE_THD compile-time switch).
-coulomb-threshold ???
Coulomb energy threshold. Contacts for which Van der Waals energy is greater than the -vdw-threshold value and Coulomb energy is less than the -coulomb-threshold value and exposure is less than EXP_THD*BIN_WIDTH [fix this] are retained; all others are discarded. The default value value is 0.5, which retains around 96% of contacts. (Updated version of the COULOMB_THD compile-time parameter.) [See also the caveat about energy files. -- rgr, 10-Jan-97.]
-distance-threshold dist (Ångstroms; default is 7.2 Å)
minimum distance between beta carbons beyond which two residues are never considered to interact. The code actually test for strictly less than this value. (Updated version of the D0 compile-time switch.)
EXP_THD val
exposure threshold used by the energy filtering code; see the -energy-file option.
-vdw-threshold units??
van der Waals threshold for contact (used when doing energy filtering; see the -energy-file option). Contacts for which the van der Waals energy is greater than the -vdw-threshold value and Coulomb energy is less than the -coulomb-threshold value and exposure is less than EXP_THD*BIN_WIDTH [fix this] are retained; all others are discarded. The default value is 2.0, which retains around 30% of contacts. (Updated version of the VDW_THD compile-time parameter.)
Oddball hardwired options:

These are compile-time switches for special cases I don't know what to do with; all of them are shut off by default. They are probably obsolete; they were not "de-supported" only because it wasn't a big burden to carry them along. But of course, that could change in the future. They are not implemented as command-line parameters; you need to edit the write-pairwise-environments.h file & recompile if you want to enable these.

COMPUTE_AM
compute (and print) adjacency matrices.
NNET
NNET == 1 means use neural-net format for pairwise environment files. Otherwise use standard format.


mrf-counts

mrf-counts takes the environment files (perhaps made by
mrf-envs) and the core and sequence files from which they were produced, and generates a file in "counts file" format. This file is written as specified by -write-counts, or to the standard output if the file name is missing. Requires that the -core-file and -sequence-file parameters be specified. Singleton and/or pairwise counts will be produced if the -sing-env-file and/or -pair-env-file parameters (respectively) are specified. If neither environment file is given, the result will have only loop and overall distribution counts.

Arguments:

-core-file core-file-name
name of a "core" format file, e.g. "2mhr.core", required for input.
-sequence-file sequence-file-name
name of an IG format sequence file, e.g. "2mhr.seq", required for input.
-alignment-file aln-file-name
optional file of aligned sequences, in homolog or .hlg file format.
-sing-env-file sing-env-file-name
reads the singleton environments from the named file. (Note that this is not the GMT environment file.)
-pair-env-file pair-env-file-name
reads the pairwise environments from the named file.
-symmetric-pair-envs env . . . (zero or more integers)
set of pairwise environments (encoded as small integers) for which amino acids are to be counted symmetrically; the default is correct for the environments used by the mrf-envs program. See the pair counting discussion below for an explanation of why mrf-counts needs to know this.
-write-counts [ count-file-name ]
specifies a file name to which to write the core's counts. Counts are written to the standard output if no file name is specified (i.e. no -write-counts parameter, or -write-counts is at the end of the line, or the parameter is followed immediately by another token starting with a "-", or the file name is "-" exactly).
Hardwired (compile-time) options:

These are defined in the mrf-counts.h file.

HARI
Put secondary structure labels in the first line of aln matrix and print the alignment (i.e., aln matrix). Not done by default.
MARK
For MARK == 0, count all contacts in homologs. For MARK == 1, only distinct contacts are counted, i.e. each amino acid (or AA pair) is counted at most once; this is the default.
If an -alignment-file is specified, but can't be opened, or contains a core sequence that is different from the one specified in the sequence file, an error message is printed, and the homolog alignments are ignored.

It is also an error if the specified singleton environments file contains a number of environments that is different from the number of core positions in the core.

For each core residue position, the singleton count is incremented based on the singleton environment and the observed amino acid. Then, if using homolog data, amino acids in the corresponding position are added to the counts, except that each amino acid is counted at most once (but see the MARK compile-time option above).

Pairwise interactions are handled in the same way, except using the <core position 1, core position 2, environment> triples that are found in the pairwise environment file. If counting homolog data, ALA/PHE is considered distinct from ALA/LYS; both are counted even though the ALA is repeated. And for asymmetric environments, ALA/PHE is distinct from PHE/ALA. However, for symmetric environments, ALA/PHE pairs are also counted as PHE/ALA pairs, and the count matrix for each symmetric environment is thereby forced to be its own transpose. mrf-scores needs to know which environments are symmetric in order to compensate for the fact that those off-diagonal elements have effectively been double-counted. But mrf-counts also needs to know this in order to count pairs in the homolog data symmetrically. If (e.g.) ALA/PHE appears in the native sequence, then PHE/ALA in the homolog data must not counted -- and mrf-scores can't make an after-the-fact correction for this, as it can for the double counting.

For purposes of counting, there are exactly three loop environments, based only on the length of the loop: 0-1, 2-5, and 6 or more. Although mrf-scores lumps the 0-1 and 2-5 environments before generating loop scores, they are nevertheless counted separately. (To change either of these, it is necessary to modify both the mrf-counts and mrf-scores code.) Loop residues in homolog data are counted only if they are aligned to actual (non-gap) residues in the core sequence. In other words, the columns that are headed by gaps in the core sequence are ignored. Therefore, the only alignments that matter are the homolog-to-core alignments, and not the homolog-to-homolog cases. [Note: The program has always treated loops this way, but Temple considers this a bug. Each loop should be independently assigned an environment based on its length, and for each environment, one should count all residues as for singletons (i.e. for each aligned position individually, a given amino acid is counted at most once). This means that true multiple alignments are required, since the exact registration of homolog loops with respect to each other makes a difference in the counts. -- rgr, 28-Mar-97.]


mrf-scores

The mrf-scores program can be used to produce score files from count and environment files after massaging them in various ways. It can also output its results as counts (in which case it doesn't need the environments). If it is not explictly asked to write anything, it writes counts to the standard output.

Note: Unlike mrf-envs and mrf-counts, mrf-scores processes its arguments as it parses them, so their order is important. In particular, the environment files are needed to write the GMT score files, so they must be specified before the appropriate score writing option, which must come after the necessary count processing options. All of the score file writing options take a file name (or two); all such consecutive options are collected together and executed in the order (singleton, loop, pair/GMT) for efficiency.

Count manipulation arguments

-add { count-file-name }
adds the contents of zero or more count files to the current counts.
-sub { count-file-name }
subtracts the contents of zero or more count files from the current counts. Note that this can result in negative counts.
-write-counts [ count-file-name ]
writes the current counts to the named file, or stdout if the file name is missing (i.e. -write-counts is the last parameter, or count-file-name is "-", or -write-counts is followed immediately by another token starting with a "-").

Score option arguments

These remain in effect until reset, and must be specified before generating scores. -min-pair-count and the -*-poisson options only apply when computing scores; their effect does not appear if the counts are subsequently written.

-symmetric-pair-envs env . . . (zero or more integers)
set of pairwise environments (encoded as small integers) for which amino acids are to be counted symmetrically. The default is correct for the environments used by the mrf-envs program. Needless to say, this must be the same as for the mrf-counts invocation that generated the counts, and must be supplied before generating score files (and is not necessary for simply massaging counts). See the mrf-counts discussion of pair counting for details.
-gmt-marginal-file filename
defines filename as holding the GMT marginal set definitions. The file is not actually read until just before GMT scores are generated. See the "GMT marginal sets" section for details.
-min-pair-count min (non-negative integer)
establishes min as the minimum count for all pairwise scores. This happens after applying the -pair-poisson value, if any. (Note that the WMS scoring scheme uses a minimum of 4 for pairwise counts; the default behavior is to leave the counts alone, which is not backward compatible.)
-sing-poisson N  (non-negative float)
specifies a value to be added to all singleton counts before processing into scores. The default is zero.
-pair-poisson N  (non-negative float)
specifies a value to be added to all pairwise counts before turning into scores. This happens before applying the -min-pair-count value, if any. The default is zero.
-loop-poisson N  (non-negative float)
specifies a value to be added to all loop counts before processing into scores. The default is zero.
-poisson
adds 1 to all singleton and loop counts. This is equivalent to
    -sing-poisson 1 -loop-poisson 1
and is supported only for backward compatibility.
-normalize
if specified, singleton and loop scores are computed as -log[P(aa|env)/P(aa)], where P(aa) is summed over all environments. By default, the score is computed as -log[P(aa|env)].
-loop=n
specifies the number of loop environments. Only two values are supported at present: [This may change in the near future; we'd like to be able to experiment with additional loop environments at different breakpoints. -- rgr, 26-Jun-98.]
-sing-aa-bins sebins-file-name
[new feature; not fully implemented yet. -- rgr, 10-Dec-97.]
-sing-env-file sing-env-file-name
reads the singleton environments from the named file. (Note that this must not be the GMT environment file name.) Required only when generating GMT scores.
-pair-env-file pair-env-file-name
reads the pairwise environments from the named file. Required only when generating GMT scores.

Score output arguments

Note: All of the score output options take a file name (or two); all such consecutive options are collected together and executed in the order (singleton, loop, pair/GMT) for efficiency.

-write-loop loop-file-name
writes loop scores.
-write-singletons sing-file-name
write the "true" singleton scores to the named file.
-write-gmt gmt-file-name
write the geometric mean term "singletons" on the GMT file gmt-file-name. If writing GMT scores, one must have specified both the -sing-env-file and -pair-env-file options previously.
-write-pair pair-file-name
write the pairwise scores on pair-file-name.
-write-pair-gmt pair-file-name gmt-file-name
equivalent to specifying -write-pair and -write-gmt on each file independently; included for backward compatibility. Both names are required, but one may specify "" for either name to suppress writing.
-write-pair-sing pair-file-name gmt-file-name
same as -write-pair-gmt; included for backward compatibility.

Miscellaneous arguments

-verbose
specifies verbose output. All such output, like error messages, goes to stderr. [At present, this is used solely for debugging purposes. -- rgr, 18-Nov-96.]
-buggy-min-pair-count min (non-negative integer)
sets all pairwise counts to at least min, an integer. Unlike -min-pair-count, this option takes effect immediately, and will be visible if the counts are subsequently written. Since it interacts badly with symmetric environments, it is used exclusively for testing/debugging.

When writing pairwise scores, the pairwise environments are needed to compute the GMT "singleton" terms, and the singleton environments are needed in the unlikely event that a given core position has no pairwise interactions.


Known bugs in the MRF programs

A line of asterisks means that the bug is significant, and still current.
  1. *** Sequence files will be read incorrectly by mrf-envs and mrf-counts if No error message is generated in any of these cases; strange behavior may result. -- rgr, 9-May-96. [The line length requirement has been relaxed somewhat, but the other problems are still effective. -- rgr, May-96.]
  2. In the get_core function, segment definition lines more than 7 chars long may overwrite the next residue (though that should not cause trouble). -- rgr, 23-May-96. [fixed by side effect; the whole routine was rewritten. -- rgr, 16-Dec-96.]
  3. In mrf-scores, loop counts are handled incorrectly if you do not specify the -poisson option. -- rgr, 27-Nov-96. [Fixed in Release 1.0. -- rgr, 24-Jun-97.]
  4. If you ask for score files based on negative counts, you will lose. -- rgr, 27-Nov-96. [Fixed in Release 1.0. -- rgr, 24-Jun-97.]
  5. *** In principle, mrf-scores should be able to write multiple score file sets per invocation, as for a cross-validated and a non-cross-validated set. In practice, generating the score files trashes the "current" count information. -- rgr, 13-Jan-97. [I think this has been fixed, but have not had time to test it yet. -- rgr, 30-Jan-97.]
  6. *** write_gmt_singleton_file (in new-stat/gmt.c), used by mrf-scores, has krufty code that still uses MAX_LEN_SEQ for compile-time allocation of per-residue information. The code does not check for overrun of these statically allocated arrays. -- rgr, 13-Jan-97.
  7. The various programs ought to be smarter about defaulting file names, e.g. the default sequence file name ought to be a file ending in ".seq" instead of ".core" in the same place as the core file. -- rgr, 13-Jan-97.
  8. mrf-counts and mrf-scores both have to know which pairwise environments are symmetric. The current implementations have the MRF values hard-wired. -- rgr, 13-Jan-97. [Fixed in Release 1.0 via the -symmetric-pair-envs parameter to each of these programs. -- rgr, 7-Mar-97.]
  9. *** get_aln doesn't check for MAX_NUM_HOM overflow. -- rgr, 19-Jan-97.
  10. mrf-counts requires both singleton and pairwise environments, even if one only wishes to count one or the other. -- rgr, 2-Dec-97. [Fixed in Release 1.0. -- rgr, 26-Jan-98.]


Limitations

Many data structures are not allocated dynamically by the MRF programs, so there are a number of built-in limitations, controlled by figurative constants. (In fact, the order of subscripts used by the code is such that you can't introduce dynamic allocation of many of the multidimensional arrays.) Ignoring things like scratch line and string sizes, which have been made relatively enormous, the important ones are:
MAX_LEN_SEQ (1000)
used mostly for static allocation of homolog alignment data. However, due to a bug, there is also a practical limit of 1000 on the cumulative length of all core segments.
MAX_NUM_CONTACTS (5000, in mrf-envs)
maximum number of energetic contacts, read from the -energy-file parameter value. [This could be dynamically allocated, except that we don't ever use the feature, so I can't really test it. -- rgr, 26-Nov-96.]
MAX_NUM_HOM (300, in mrf-counts)
number of homolog sequences, read from the -alignment-file parameter value.
MAX_NUM_COND (21, in mrf-counts and mrf-scores)
maximum size of pairwise counts arrays. [This would be difficult to allocate dynamically; one would need to change the subscript order throughout the code. -- rgr, 13-Jan-97.]
S_MAX_NUM_COND (20, in mrf-counts and mrf-scores)
maximum size of singleton counts arrays. [This would be difficult to allocate dynamically; one would need to change the subscript order throughout the code. -- rgr, 13-Jan-97.]

Bob Rogers <rogers@darwin.bu.edu>
Last modified: Fri Nov 26 19:39:39 EST 1999