MRF programs
The following sections describe the individual members of the MRF suite, in "man page" style (more or less), with all parameters documented. See below for the list of known bugs and built-in limitations in these programs. See also the make-mrf-depends.pl script.
sing-envs.pl has two main advantages over mrf-envs when the latter is used to produce singleton environment files:
All command-line options are keyword-style, i.e. -name, followed by a value if -name requires one.
File options:
State definition options:
-sing-exp-bins 22 37 45 57 68
Given N values, the exposure will fall into bin 1
if less than or equal to the first number, into bin 2 if greater
than the first number but less than or equal to the second
number, . . ., and into bin N+1 if greater than
the last number. Thus, the value 38.8 will fall into bin 3 out
of 6 if given the -sing-exp-bins above.
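The binning rule above can be sketched as follows. This is a minimal illustration, assuming the thresholds are given in ascending order; the function name `exposure_bin` is hypothetical and is not taken from the MRF source.

```python
from bisect import bisect_left

def exposure_bin(value, thresholds):
    """Return the 1-based bin for value, given N ascending thresholds (N+1 bins)."""
    # bisect_left counts the thresholds strictly less than value, so a value
    # equal to a threshold stays in the lower bin ("less than or equal to").
    return bisect_left(thresholds, value) + 1

bins = [22, 37, 45, 57, 68]      # the -sing-exp-bins example above
print(exposure_bin(38.8, bins))  # 3
```

A value exactly equal to a threshold (e.g. 37) lands in the lower bin (bin 2 here), matching the "less than or equal to" wording.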
Other options:
File options:
State definition options:
[more -- table of resulting singleton states? -- rgr, 14-Feb-97.]
Other switch options:
[Note that the sequence file could be omitted altogether. In principle, the environment assignment function could use the sequence positions of the core residues; in practice, the MRF function does not. -- rgr, 23-Jan-97.]
Hardwired threshold options:
These are compile-time switches for special cases I don't know what to do with; all of them are shut off by default. They are probably obsolete; they were not "de-supported" only because it wasn't a big burden to carry them along. But of course, that could change in the future. They are not implemented as command-line parameters; you need to edit the write-pairwise-environments.h file & recompile if you want to enable these.
Arguments:
These are defined in the mrf-counts.h file.
It is also an error if the specified singleton environments file
contains a number of environments that differs from the number of
positions in the core.
For each core residue position, the singleton count is incremented
based on the singleton environment and the observed amino acid. Then,
if using homolog data, amino acids in the corresponding position are
added to the counts, except that each amino acid is counted at most once
(but see the MARK compile-time option above).
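The singleton counting rule can be sketched roughly as below. This is not the mrf-counts code; it is an illustration under several assumptions: sequences are pre-aligned strings of one-letter codes, "-" marks a gap, the native amino acid takes part in the "at most once" rule together with the homologs, and the function name `count_singletons` is hypothetical.

```python
from collections import Counter

def count_singletons(envs, native, homologs=()):
    """Count (environment, amino acid) observations per core position.

    envs     -- one singleton environment label per core position
    native   -- core sequence, one letter per core position
    homologs -- sequences aligned to the core (assumption: '-' is a gap)
    """
    if len(envs) != len(native):
        raise ValueError("number of environments differs from core length")
    counts = Counter()
    for pos, env in enumerate(envs):
        seen = set()  # amino acids already counted at this position
        for seq in (native, *homologs):
            aa = seq[pos]
            if aa == "-" or aa in seen:
                continue  # skip gaps; count each amino acid at most once
            seen.add(aa)
            counts[(env, aa)] += 1
    return counts
```

For example, with environments `[1, 2]`, native `"AF"` and homologs `"AK"` and `"GF"`, position 1 contributes one count each for A and G, and position 2 one count each for F and K.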
Pairwise interactions are handled in the same way, except using the
<core position 1, core position 2,
environment> triples that are found in the pairwise environment
file. If counting homolog data, ALA/PHE is considered distinct from
ALA/LYS; both are counted even though the ALA is repeated. And for
asymmetric environments, ALA/PHE is distinct from PHE/ALA. However, for
symmetric environments, ALA/PHE pairs are also counted as PHE/ALA pairs,
and the count matrix for each symmetric environment is thereby forced to
be its own transpose. mrf-scores needs to know which
environments are symmetric in order to compensate for the fact that
those off-diagonal elements have effectively been double-counted. But
mrf-counts also needs to know this in order to count pairs in
the homolog data symmetrically. If (e.g.) ALA/PHE appears in the
native sequence, then PHE/ALA in the homolog data must not be counted --
and mrf-scores can't make an after-the-fact correction for
this, as it can for the double counting.
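The difference between symmetric and asymmetric pair counting can be sketched as follows. Again, this is an illustration and not the mrf-counts implementation. It assumes 0-based core positions, one-letter codes, "-" for gaps, and that diagonal pairs (e.g. ALA/ALA) are incremented only once, which is one reading of "off-diagonal elements have effectively been double-counted". The name `count_pairs` is hypothetical.

```python
from collections import Counter

def count_pairs(triples, symmetric_envs, native, homologs=()):
    """Count (environment, aa1, aa2) observations for each pairwise triple.

    triples        -- iterable of (pos1, pos2, env), 0-based positions (assumption)
    symmetric_envs -- set of environment labels to be treated as symmetric
    """
    counts = Counter()
    for pos1, pos2, env in triples:
        symmetric = env in symmetric_envs
        seen = set()  # pairs already counted for this triple
        for seq in (native, *homologs):
            a, b = seq[pos1], seq[pos2]
            if "-" in (a, b):
                continue
            # In a symmetric environment A/F and F/A are the same observation.
            key = tuple(sorted((a, b))) if symmetric else (a, b)
            if key in seen:
                continue
            seen.add(key)
            counts[(env, a, b)] += 1
            if symmetric and a != b:
                counts[(env, b, a)] += 1  # keep the matrix equal to its transpose
    return counts
```

With native `"AF"` and homologs `"FA"` and `"AK"`, an asymmetric environment counts A/F, F/A and A/K once each. A symmetric environment skips the homolog F/A because the native A/F already covers it, and records A/F, F/A, A/K and K/A.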
For purposes of counting, there are exactly three loop environments,
based only on the length of the loop: 0-1, 2-5, and 6 or more.
Although mrf-scores lumps the 0-1 and 2-5 environments before
generating loop scores, they are nevertheless counted separately. (To
change either of these, it is necessary to modify both the
mrf-counts and mrf-scores code.) Loop residues in
homolog data are counted only if they are aligned to actual (non-gap)
residues in the core sequence. In other words, the columns that are
headed by gaps in the core sequence are ignored. Therefore, the only
alignments that matter are the homolog-to-core alignments, and not the
homolog-to-homolog cases. [Note: The program has always treated
loops this way, but Temple considers this a bug. Each loop should be
independently assigned an environment based on its length, and for each
environment, one should count all residues as for singletons (i.e. for
each aligned position individually, a given amino acid is counted at
most once). This means that true multiple alignments are required,
since the exact registration of homolog loops with respect to each other
makes a difference in the counts. -- rgr, 28-Mar-97.]
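As a small illustration of the three length-based loop environments used for counting, and of the lumping that mrf-scores is described as doing, consider the sketch below. The integer labels 0, 1, 2 and both function names are hypothetical; the actual programs may label the environments differently.

```python
def loop_environment(length):
    """Map a loop length to one of the three counting environments."""
    if length < 0:
        raise ValueError("loop length must be non-negative")
    if length <= 1:
        return 0  # lengths 0-1
    if length <= 5:
        return 1  # lengths 2-5
    return 2      # lengths 6 or more

def lumped_environment(length):
    """Scoring-side view: 0-1 and 2-5 merged into one environment."""
    return 0 if loop_environment(length) < 2 else 1
```

Keeping the two mappings separate mirrors the text: counts are accumulated at the finer granularity, and the merge happens only when scores are generated.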
Note: Unlike mrf-envs and mrf-counts,
mrf-scores processes its arguments as it parses them, so their
order is important. In particular, the environment files are needed to
write the GMT score files, so they must be specified before the
appropriate score writing option, which must come after the necessary
count processing options. All of the score file writing options take a
file name (or two); all such consecutive options are collected together
and executed in the order (singleton, loop, pair/GMT) for efficiency.
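The ordering behavior described here (options take effect as they are parsed, while runs of consecutive score-writing options are batched and executed singleton, loop, then pair/GMT) can be sketched abstractly. This does not use real mrf-scores option names. Each option is represented as a hypothetical `(kind, name)` tuple, where `kind` is either `"state"` for any non-output option or one of the three output kinds.

```python
OUTPUT_ORDER = ("singleton", "loop", "pair")

def execution_order(options):
    """Return the order in which (kind, name) options would take effect."""
    plan, batch = [], []

    def flush():
        # Stable sort: outputs of the same kind keep their command-line order.
        batch.sort(key=lambda opt: OUTPUT_ORDER.index(opt[0]))
        plan.extend(batch)
        batch.clear()

    for opt in options:
        if opt[0] in OUTPUT_ORDER:
            batch.append(opt)  # collect consecutive output options
        else:
            flush()            # any other option ends the current batch
            plan.append(opt)
    flush()
    return plan
```

In this model, an output option can be reordered only within its own run of consecutive output options. It never moves ahead of an environment or count-processing option that preceded it, which is why those must appear first on the command line.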
Count manipulation arguments
Score option arguments
These remain in effect until reset, and must be specified before
generating scores. -min-pair-count and the -*-poisson
options only apply when computing scores; their effect does not appear
if the counts are subsequently written.
Score output arguments
Note: All of the score output options take a file name (or
two); all such consecutive options are collected together and executed
in the order (singleton, loop, pair/GMT) for efficiency.
Miscellaneous arguments
When writing pairwise scores, the pairwise environments are needed to
compute the GMT "singleton" terms,
and the singleton environments are needed in the unlikely event that a
given core position has no pairwise interactions.
mrf-scores
The mrf-scores program can be used to produce score files from
count and environment files after massaging them in various ways. It
can also output its results as counts (in which case it doesn't need the
environments). If it is not explicitly asked to write anything, it
writes counts to the standard output.
-sing-poisson 1 -loop-poisson 1
and is supported only for backward compatibility.
[This may change in the near future; we'd like to be able to
experiment with additional loop environments at different
breakpoints. -- rgr, 26-Jun-98.]
Known bugs in the MRF programs
A line of asterisks means that the bug is significant, and still
current.
No error message is generated in any of these cases; strange
behavior may result. -- rgr, 9-May-96. [The line length
requirement has been relaxed somewhat, but the other problems are
still in effect. -- rgr, May-96.]
Limitations
Many data structures are not allocated dynamically by the MRF programs,
so there are a number of built-in limitations, controlled by figurative
constants. (In fact, the order of subscripts used by the code is such
that you can't introduce dynamic allocation of many of the
multidimensional arrays.) Ignoring things like scratch line and string
sizes, which have been made relatively enormous, the important ones are:
Bob Rogers
<rogers@darwin.bu.edu>
Last modified: Fri Nov 26 19:39:39 EST 1999