BMERC : needle tools : Introduction : MRF examples
[need some general intro here. -- rgr, 19-Jan-98.]
Table of contents
Sample script
[***finish***: this is way out of date. better to read the "Automatic generation via makefiles" section
anyway. -- rgr, 5-Dec-97.]
The make-mrf-files script, shown in its entirety below, does the entire job of generating cross-validated score and environment files for a set of cores that live in the same set of directories. (On my browser configuration, the example is much more readable if the browser window is at least 700 pixels wide.)
make-mrf-files is distributed as part of the MRF score generation suite (it appears in the ~thread/bin/scripts directory at BMERC), but does not have execute permission, since it is not expected to run without modification; at the very least, you will probably want to change the file names, if you opt to use it yourself. For reference, the run time is directly proportional to the number of cores in the core set; it takes slightly less than 5 seconds of elapsed time per core (on the old Sparc 20 that sits on my desk), divided equally between environment generation & counting on the one hand, and score file generation on the other.
Hyperlinks in the make-mrf-files code refer to sections below. Or, if you'd rather not read code the first time through, you can skip to the documentation sections and follow the hyperlinks back to the code (your browser should bring the relevant line of code to the top of the screen).
#!/bin/csh -f
#
# Run the new mrf-* programs on the original core set, generating cross-
# validated MRF files. Expects as its first argument the name of a core list
# file that defines the core set.
#
# Modification history:
#
# created (from test-cores script). -- rgr, 29-Jan-97.
#
# Define directory parameters (relative to the current directory, where we will
# place the MRF files) for the for four files we need.
set alignment = ../alignment
set exposure = ../exposure
set sequence = ../models
set models = ../models
# Prefixes for original naming conventions.
set se_prefix = mrf-se-10efa-2ss-
set pe_prefix = pairwise_environments_reference_
set gmte_prefix = singleton_environments_MRF_
set ss_prefix = singleton_scores_x_MRF_
set ps_prefix = pairwise_scores_x_MRF_
set ls_prefix = loop_scores_x_MRF_
set core_list_file = $1
if (! -r "$core_list_file") then
echo test-all-cores: Error: "can't" read core list file $core_list_file
exit -1
endif
shift
echo Checking 'envs & counts' . . .
foreach locus (`cat $core_list_file`)
set cnt = $locus-mrf.cnt
set se = $se_prefix$locus.dat
set gmte = $gmte_prefix$locus.dat
set pe = $pe_prefix$locus.dat
set envs = (-pair-env-file $pe -sing-env-file $se)
set core_seq_env = (-core-file $models/$locus.core \
-sequence-file $sequence/$locus.seq $envs)
if (! -e $pe || ! -e $se) then
echo Working on $locus envs . . .
# Force recomputation of counts.
rm -f $cnt
mrf-envs -sing-ss-p -n-singleton-states 20 $core_seq_env \
-gmt-env-file $gmte -exposure-file $exposure/$locus.nexp
endif
if (! -e $cnt) then
echo Working on $locus counts . . .
mrf-counts $core_seq_env -alignment-file $alignment/$locus.hlg \
-write-counts $cnt
# force recomputation of totals.
rm -f total-mrf.cnt
endif
end
if (! -e total-mrf.cnt) then
echo Generating total counts . . .
# Delete the totals first, so we don't add them too! [But there had better be
# no other .cnt files in this directory . . .]
rm -f total-mrf.cnt
# Now delete *all* score files, since redoing the counts potentially
# invalidates them all.
rm -f *scores*.dat
# Sum all counts.
mrf-scores -add *-mrf.cnt -write-counts total-mrf.cnt
endif
echo Checking scores . . .
foreach locus (`cat $core_list_file`)
# Score file naming conventions.
set ss = $ss_prefix$locus.dat
set ps = $ps_prefix$locus.dat
set ls = $ls_prefix$locus.dat
if (! -e $ps || ! -e $ss || ! -e $ls) then
echo Working on $locus scores . . .
# Copy of environment file naming conventions from above.
set se = $se_prefix$locus.dat
set pe = $pe_prefix$locus.dat
set envs = (-pair-env-file $pe -sing-env-file $se)
# Generate cross-validated pairwise scores.
mrf-scores -poisson -add total-mrf.cnt -sub $locus-mrf.cnt \
$envs -loop-score-file $ls -write-pair-sing $ps $ss
endif
end
make-mrf-files uses options that define 20 singleton states: 10 equally graded exposure levels for residues in helices, and 10 for those in strands. As it turns out, this information is irrelevant, because only pairwise and GMT score files are ever generated; the "true" singleton statistics are never used. [probably need hyperlinkage! -- rgr, 30-Jan-97.]
In the same manner, the counts file is
recomputed if it does not already exist.
And, if we do so, the total counts must
be recomputed, since they are likely to change.
To compute "true" singleton scores instead of pairwise and GMT scores,
replace "-write-pair-sing $ps $ss" with
"-write-singletons $ss" (and consider picking another
value for $ss_prefix to avoid conflict with the GMT scores).
To omit loop scores (in either the singleton or pairwise case), drop the
"-loop-score-file $ls" arguments.
There are two important cases to consider:
Just as for core generation, we will also need a core list
file, but since we wish to generate scores for the same set as the
core example, we are better off using that core file directly, referring
to it as "../test-cores/cores.loci". This file is discussed in
the "Core list
file" section.
In this core set, it happens (by no small coincidence) that both
1bmtA and 1cus have similar folds (1cus has a flavodoxin fold, and 1bmt
chain A has a domain with a flavodoxin fold). To prevent the statistics
from one structure from helping the scoring scheme to "memorize" the
structure of the other, they must be cross-validated together: neither
structure may be used to produce the scores for either. Their score
files will therefore be identical [with the exception of GMT scores,
explained elsewhere. -- rgr, 27-Nov-97]. In order to achieve this
effect, we need a cross-validation set file, the format of which is
described in the
Cross-validation set file format section. The file is
conventionally called xval-sets.clique, and should look like
this:
Details on how (and why) to use a marginal definition file can be
found in the "GMT marginal
sets" section. Here we simply note that they must be specified in
the appropriate
MRF-*-SCORES macros if used. The example makefile puts
this in the PAIR-SCORE-OPTS macro to avoid having to repeat it
for each of the score macros.
Here are the contents of the file we are using here, identical to the
one discussed in the "GMT
marginal sets" section, repeated here for convenience.
Of course, this file (and its mention in the makefile) may be omitted
entirely if GMT scores are not desired.
As mentioned above, the makefile must be named "makefile", and
should look something like what is shown below (see elsewhere for a
discussion of makefile
syntax):
Notes:
Notes:
After doing make depend, the directory will
look something like this:
Once the dependency file is all set up, the final step is to do
"make", which will create the actual score and environment
files (plus counts files along the way). We do not bother to show the
transcript, which just essentially just a copy of the command lines in
the depends.make file.
Compute total counts
The mrf-scores program produces the totals by simply summing
all counts files in the current directory. Needless to say, any stray
counts files that were not produced by the previous steps will introduce
garbage into all subsequent score files.
And we must compute them all, since each and every score file depends on
the count totals.
Compute score files
Once we have ensured that the totals are up to date, we must set up
another loop on locus names in order to use the mrf-scores program to compute any missing score files for
each locus. (Unfortunately, we must also recompute the core's
environment file names; csh is hard to modularize this way.)
The scores are cross-validated for a give core by simply subtracting
that core's contribution from the total counts. (Non-cross-validated
scores can also be produced by removing the
"-sub $locus-mrf.cnt" arguments, though this will produce
identical pairwise or "true" singleton score files for all loci; only
the GMT singletons will differ per-core.)
Automatic generation via makefiles
Because each core requires many data files, the overall file set can
become enormous for even a moderately large library of cores.
Fortunately, the Unix make utility makes the job of creating
the right files at the right times much easier. And with the
make-mrf-depends.pl script, most of the work of generating the
makefile can also be automated.
Both cases are "unified" through a search path mechanism. If you supply
the -search-path option to the
make-mrf-depends.pl program, it will look through other
directories for files that already exist. This also makes it easier to
implicitly specify the location of the required core, sequence, and
exposure files, which may be located in one or more other directories to
minimize clutter.
Generating a complete set of score/env files
For purposes of illustration, we will construct MRF files for the same
set of 15 cores in the core
generation example. Although this is far too small a set to yield
anything resembling reasonable statistics, it serves admirably to
illustrate the mechanics of score/environment file generation. We start
out by creating a directory that is a sibling to the directories that
contain the core and sequence files we need for input. In the new score
directory, we must create at least the first two of the following four
files:
We will discuss all of these files in turn below.
The cross-validation set file
[oops; this doesn't work any more. 1bmtA is no longer part of the
core set. -- rgr, 17-Sep-98.]
flavodoxins: 1bmtA 1cus
Only one line is needed here, since we wish to use the hold-one-out
strategy for the remaining cores. The line starts with a mnemonic name
for the set that ends in a colon, followed by the tab-separated locus
names (do not put any spaces in this file!). The set name is arbitrary,
but it should be acceptable as part of a file name (because spaces or
asterisks will confuse make).
The marginal definition file
# e-bur
5,6,7t,9t,11t,17,18,19t
# e-exp
6t,7,8,10t,12t,18t,19,20
# h-bur
1,2,3t,9,10,13,14,15t
# h-exp
2t,3,4,11,12,14t,15,16
This just reflects the standard WMS
environment definitions.
The makefile
[***here***. continue. -- rgr, 17-Sep-98.]
# Example makefile for MRF score/environment file generation.
[1] MRF-GMT-ENVS = mrf-envs
MRF-SING-ENVS = mrf-envs
MRF-PAIR-ENVS = mrf-envs
MRF-ENVS = mrf-envs
MRF-COUNTS = mrf-counts
ADD-COUNTS = mrf-scores
[2] # this is shorthand; ${PAIR-SCORE-OPTS} is not required per se.
PAIR-SCORE-OPTS = -gmt-marginal-file mrf.msd -min-pair-count 4
MRF-SING-SCORES = mrf-scores -poisson
MRF-LOOP-SCORES = mrf-scores -poisson
MRF-GMT-SCORES = mrf-scores ${PAIR-SCORE-OPTS}
MRF-PAIR-SCORES = mrf-scores ${PAIR-SCORE-OPTS}
MRF-SCORES = mrf-scores -poisson ${PAIR-SCORE-OPTS}
[3] all: gmt-environments gmt-scores pairwise-scores loop-scores
[4] clean: clean-scores
rm -f ${core-counts} total-mrf.cnt.tmp
rm -f total-mrf.cnt ${xval-set-counts}
rm -f ${singleton-environments} ${pairwise-environments}
rm -f ${gmt-environments}
clean-scores:
rm -f ${singleton-scores} ${loop-scores}
rm -f ${gmt-scores} ${pairwise-scores}
# .core and .seq files are in ../cores/, and exposure files
# are in ../exposure/. We have no homolog data for these cores.
search-path = ../cores/ ../exposure/
depend:
make-mrf-depends.pl -search-path ${search-path} \
-xval-sets xval-sets.clique \
-core-list-file cores.loci > depends.make
include depends.make
(If you cut & paste this out of your
browser, be sure to (a) remove any leading whitespace and footnote
numbers; (b) ensure that the three lines after the "depend:"
target start with ASCII tab characters; and (c) ensure that there are no
spaces after the backslash characters at the ends of lines.)
Making the dependency file
To create the depends.make file, simply do
"touch depends.make" (since the file must exist for
make to read the makefile properly), then run
make-mrf-depends.pl by doing "make depend". The
transcript will look something like this:
hinshelwood% touch depends.make
hinshelwood% make depend
make-mrf-depends.pl -search-path ../cores/ ../exposure/ \
-xval-sets xval-sets.clique \
-core-list-file cores.loci > depends.make
hinshelwood%
Note how the command is echoed with the "${search-path}" macro
invocation filled out. As shown in the initial fragment below,
make-mrf-depends.pl also puts the arguments it sees, along with
the date and time, at the head of the file it creates.
# This file was generated automatically by make-mrf-depends.pl; do not edit!
# Created on Sun Nov 30 16:38:54 EST 1997 with the following options:
#
# make-mrf-depends.pl -search-path ../cores/ ../exposure/ -xval-sets \
# xval-sets.clique -core-list-file cores.loci
#
# perl libraries used:
#
# /home2/staff/thread/bin/scripts/make-lib.pm
# /home2/staff/thread/bin/scripts/mrf-make-lib.pm
#
# 5 loci were found in cores.loci.
#
[1] # Cross-validation sets defined in xval-sets.clique:
flavodoxins-count-files = 1bmtA-mrf.cnt 1cus-mrf.cnt
flavodoxins-mrf.cnt: ${flavodoxins-count-files}
${ADD-COUNTS} -add ${flavodoxins-count-files} -write-counts \
flavodoxins-mrf.cnt
# 1bmtA
singleton_environments_MRF_1bmtA.dat: ../cores/1bmtA.core
${MRF-GMT-ENVS} -core-file ../cores/1bmtA.core -gmt-env-file \
singleton_environments_MRF_1bmtA.dat
mrf-se-10efa-2ss-1bmtA.dat: ../cores/1bmtA.core ../exposure/1bmt.nexp \
../cores/1bmtA.seq
${MRF-SING-ENVS} -sing-env-file mrf-se-10efa-2ss-1bmtA.dat -core-file \
../cores/1bmtA.core -sequence-file ../cores/1bmtA.seq \
-exposure-file ../exposure/1bmt.nexp
pairwise_environments_reference_1bmtA.dat: ../cores/1bmtA.core \
../exposure/1bmt.nexp ../cores/1bmtA.seq
${MRF-PAIR-ENVS} -pair-env-file \
pairwise_environments_reference_1bmtA.dat -core-file \
../cores/1bmtA.core -sequence-file ../cores/1bmtA.seq \
-exposure-file ../exposure/1bmt.nexp
1bmtA-mrf.cnt: mrf-se-10efa-2ss-1bmtA.dat \
pairwise_environments_reference_1bmtA.dat ../cores/1bmtA.core
${MRF-COUNTS} -pair-env-file pairwise_environments_reference_1bmtA.dat \
-sing-env-file mrf-se-10efa-2ss-1bmtA.dat -core-file \
../cores/1bmtA.core -sequence-file ../cores/1bmtA.seq \
-write-counts 1bmtA-mrf.cnt
singleton_scores_x_MRF_1bmtA.dat: flavodoxins-mrf.cnt total-mrf.cnt \
pairwise_environments_reference_1bmtA.dat \
mrf-se-10efa-2ss-1bmtA.dat
${MRF-GMT-SCORES} -add total-mrf.cnt -sub flavodoxins-mrf.cnt \
-pair-env-file pairwise_environments_reference_1bmtA.dat \
-sing-env-file mrf-se-10efa-2ss-1bmtA.dat \
${MRF-MASSAGE-COUNTS} -write-gmt \
singleton_scores_x_MRF_1bmtA.dat
pairwise_scores_x_MRF_1bmtA.dat: flavodoxins-mrf.cnt total-mrf.cnt
${MRF-PAIR-SCORES} -add total-mrf.cnt -sub flavodoxins-mrf.cnt \
${MRF-MASSAGE-COUNTS} -write-pair \
pairwise_scores_x_MRF_1bmtA.dat
loop_scores_x_MRF_1bmtA.dat: flavodoxins-mrf.cnt total-mrf.cnt
${MRF-LOOP-SCORES} -add total-mrf.cnt -sub flavodoxins-mrf.cnt \
-write-pair /dev/null -write-loop loop_scores_x_MRF_1bmtA.dat
. . .
hinshelwood% ls -l
total 17
-rw-rw-r-- 1 rogers 27 Nov 27 15:32 cores.loci
-rw-rw-r-- 1 rogers 11610 Nov 30 16:48 depends.make
-rw-rw-r-- 1 rogers 1158 Nov 30 16:48 makefile
-rw-rw-r-- 1 rogers 126 Nov 27 16:53 mrf.msd
-rw-rw-r-- 1 rogers 24 Nov 27 15:36 xval-sets.clique
hinshelwood%
At this point, the directory is initialized. The
make depend step will need to be redone only if either the
make-mrf-depends.pl arguments, the ${MRF-*} command
macros in the makefile, or the contents of the core list file change.
Making the score and environment files
hinshelwood% ls -l
total 697
-rw-rw-r-- 1 rogers 21461 Nov 30 16:52 1bmtA-mrf.cnt
-rw-rw-r-- 1 rogers 13321 Nov 30 16:52 1cus-mrf.cnt
-rw-rw-r-- 1 rogers 18884 Nov 30 16:52 1ra9-mrf.cnt
-rw-rw-r-- 1 rogers 5631 Nov 30 16:52 1vcc-mrf.cnt
-rw-rw-r-- 1 rogers 36577 Nov 30 16:53 2sblB-mrf.cnt
-rw-rw-r-- 1 rogers 27 Nov 27 15:32 cores.loci
-rw-rw-r-- 1 rogers 11610 Nov 30 16:48 depends.make
-rw-rw-r-- 1 rogers 24780 Nov 30 16:52 flavodoxins-mrf.cnt
-rw-rw-r-- 1 rogers 229 Nov 30 16:53 loop_scores_x_MRF_1bmtA.dat
-rw-rw-r-- 1 rogers 229 Nov 30 16:53 loop_scores_x_MRF_1cus.dat
-rw-rw-r-- 1 rogers 229 Nov 30 16:53 loop_scores_x_MRF_1ra9.dat
-rw-rw-r-- 1 rogers 229 Nov 30 16:53 loop_scores_x_MRF_1vcc.dat
-rw-rw-r-- 1 rogers 229 Nov 30 16:53 loop_scores_x_MRF_2sblB.dat
-rw-rw-r-- 1 rogers 1158 Nov 30 16:48 makefile
-rw-rw-r-- 1 rogers 1749 Nov 30 16:52 mrf-se-10efa-2ss-1bmtA.dat
-rw-rw-r-- 1 rogers 990 Nov 30 16:52 mrf-se-10efa-2ss-1cus.dat
-rw-rw-r-- 1 rogers 880 Nov 30 16:52 mrf-se-10efa-2ss-1ra9.dat
-rw-rw-r-- 1 rogers 363 Nov 30 16:52 mrf-se-10efa-2ss-1vcc.dat
-rw-rw-r-- 1 rogers 4103 Nov 30 16:52 mrf-se-10efa-2ss-2sblB.dat
-rw-rw-r-- 1 rogers 126 Nov 27 16:53 mrf.msd
-rw-rw-r-- 1 rogers 6985 Nov 30 16:52 pairwise_environments_reference_1bmtA.dat
-rw-rw-r-- 1 rogers 3835 Nov 30 16:52 pairwise_environments_reference_1cus.dat
-rw-rw-r-- 1 rogers 2785 Nov 30 16:52 pairwise_environments_reference_1ra9.dat
-rw-rw-r-- 1 rogers 940 Nov 30 16:52 pairwise_environments_reference_1vcc.dat
-rw-rw-r-- 1 rogers 14920 Nov 30 16:53 pairwise_environments_reference_2sblB.dat
-rw-rw-r-- 1 rogers 64586 Nov 30 16:53 pairwise_scores_x_MRF_1bmtA.dat
-rw-rw-r-- 1 rogers 64586 Nov 30 16:53 pairwise_scores_x_MRF_1cus.dat
-rw-rw-r-- 1 rogers 64586 Nov 30 16:53 pairwise_scores_x_MRF_1ra9.dat
-rw-rw-r-- 1 rogers 64586 Nov 30 16:53 pairwise_scores_x_MRF_1vcc.dat
-rw-rw-r-- 1 rogers 64586 Nov 30 16:53 pairwise_scores_x_MRF_2sblB.dat
-rw-rw-r-- 1 rogers 1749 Nov 30 16:52 singleton_environments_MRF_1bmtA.dat
-rw-rw-r-- 1 rogers 990 Nov 30 16:52 singleton_environments_MRF_1cus.dat
-rw-rw-r-- 1 rogers 880 Nov 30 16:52 singleton_environments_MRF_1ra9.dat
-rw-rw-r-- 1 rogers 363 Nov 30 16:52 singleton_environments_MRF_1vcc.dat
-rw-rw-r-- 1 rogers 4103 Nov 30 16:52 singleton_environments_MRF_2sblB.dat
-rw-rw-r-- 1 rogers 22994 Nov 30 16:53 singleton_scores_x_MRF_1bmtA.dat
-rw-rw-r-- 1 rogers 12989 Nov 30 16:53 singleton_scores_x_MRF_1cus.dat
-rw-rw-r-- 1 rogers 11539 Nov 30 16:53 singleton_scores_x_MRF_1ra9.dat
-rw-rw-r-- 1 rogers 4724 Nov 30 16:53 singleton_scores_x_MRF_1vcc.dat
-rw-rw-r-- 1 rogers 54024 Nov 30 16:53 singleton_scores_x_MRF_2sblB.dat
-rw-rw-r-- 1 rogers 39285 Nov 30 16:53 total-mrf.cnt
-rw-rw-r-- 1 rogers 39285 Nov 30 16:53 total-mrf.cnt.tmp
-rw-rw-r-- 1 rogers 24 Nov 27 15:36 xval-sets.clique
hinshelwood%
Note how specifying a "flavodoxin" cross-validation set resulted in the
creation of a flavodoxins-mrf.cnt file. You can verify for
yourself that each of the corresponding score files for 1bmtA and 1cus
is identical.
Generating "add-on" score/env files
[***finish***: need another example, probably an add-on set to the first
example. -- rgr, 27-Nov-97.]
Bob Rogers
<rogers@darwin.bu.edu>
Last modified: Fri Nov 26 19:59:58 EST 1999