MRF generation examples

BMERC : needle tools : Introduction : MRF examples


[need some general intro here. -- rgr, 19-Jan-98.]

Table of contents

  1. MRF generation examples
    1. Table of contents
    2. Sample script
      1. Customization of directories
      2. make-mrf-files arguments
      3. Environments and counts
      4. Compute total counts
      5. Compute score files
    3. Automatic generation via makefiles
      1. Generating a complete set of score/env files
        1. The cross-validation set file
        2. The marginal definition file
        3. The makefile
        4. Making the dependency file
        5. Making the score and environment files
      2. Generating "add-on" score/env files


Sample script

[***finish***: this is way out of date. better to read the
"Automatic generation via makefiles" section anyway. -- rgr, 5-Dec-97.]

The make-mrf-files script, shown in its entirety below, does the entire job of generating cross-validated score and environment files for a set of cores that live in the same set of directories. (On my browser configuration, the example is much more readable if the browser window is at least 700 pixels wide.)

make-mrf-files is distributed as part of the MRF score generation suite (it appears in the ~thread/bin/scripts directory at BMERC), but does not have execute permission, since it is not expected to run without modification; at the very least, you will probably want to change the file names, if you opt to use it yourself. For reference, the run time is directly proportional to the number of cores in the core set; it takes slightly less than 5 seconds of elapsed time per core (on the old Sparc 20 that sits on my desk), divided equally between environment generation & counting on the one hand, and score file generation on the other.

Hyperlinks in the make-mrf-files code refer to sections below. Or, if you'd rather not read code the first time through, you can skip to the documentation sections and follow the hyperlinks back to the code (your browser should bring the relevant line of code to the top of the screen).


#!/bin/csh -f
#
# Run the new mrf-* programs on the original core set, generating cross-
# validated MRF files.  Expects as its first argument the name of a core list
# file that defines the core set.
#
#    Modification history:
#
# created (from test-cores script).  -- rgr, 29-Jan-97.
#

# Define directory parameters (relative to the current directory, where we will
# place the MRF files) for the for four files we need.
set alignment = ../alignment
set exposure = ../exposure
set sequence = ../models
set models = ../models

# Prefixes for original naming conventions.
set se_prefix = mrf-se-10efa-2ss-
set pe_prefix = pairwise_environments_reference_
set gmte_prefix = singleton_environments_MRF_
set ss_prefix = singleton_scores_x_MRF_
set ps_prefix = pairwise_scores_x_MRF_
set ls_prefix = loop_scores_x_MRF_

set core_list_file = $1
if (! -r "$core_list_file") then
  echo test-all-cores:  Error: "can't" read core list file $core_list_file
  exit -1
endif
shift

echo Checking 'envs & counts' . . .
foreach locus (`cat $core_list_file`)
  set cnt = $locus-mrf.cnt
  set se = $se_prefix$locus.dat
  set gmte = $gmte_prefix$locus.dat
  set pe = $pe_prefix$locus.dat
  set envs = (-pair-env-file $pe -sing-env-file $se)
  set core_seq_env = (-core-file $models/$locus.core \
		      -sequence-file $sequence/$locus.seq $envs)
  if (! -e $pe || ! -e $se) then
    echo Working on $locus envs . . .
    # Force recomputation of counts.
    rm -f $cnt
    mrf-envs -sing-ss-p -n-singleton-states 20 $core_seq_env \
	  -gmt-env-file $gmte -exposure-file $exposure/$locus.nexp
  endif
  if (! -e $cnt) then
    echo Working on $locus counts . . .
    mrf-counts $core_seq_env -alignment-file $alignment/$locus.hlg \
	-write-counts $cnt
    # force recomputation of totals.
    rm -f total-mrf.cnt
  endif
end

if (! -e total-mrf.cnt) then
  echo Generating total counts . . .
  # Delete the totals first, so we don't add them too!  [But there had better be
  # no other .cnt files in this directory . . .]
  rm -f total-mrf.cnt
  # Now delete *all* score files, since redoing the counts potentially
  # invalidates them all.
  rm -f *scores*.dat
  # Sum all counts.
  mrf-scores -add *-mrf.cnt -write-counts total-mrf.cnt
endif

echo Checking scores . . .
foreach locus (`cat $core_list_file`)
  # Score file naming conventions.
  set ss = $ss_prefix$locus.dat
  set ps = $ps_prefix$locus.dat
  set ls = $ls_prefix$locus.dat
  if (! -e $ps || ! -e $ss || ! -e $ls) then
    echo Working on $locus scores . . .
    # Copy of environment file naming conventions from above.
    set se = $se_prefix$locus.dat
    set pe = $pe_prefix$locus.dat
    set envs = (-pair-env-file $pe -sing-env-file $se)
    # Generate cross-validated pairwise scores.
    mrf-scores -poisson -add total-mrf.cnt -sub $locus-mrf.cnt \
	  $envs -loop-score-file $ls -write-pair-sing $ps $ss
  endif
end

Customization of directories

Before attempting to run this script, one will probably need to change the file and directory names
in the first dozen lines. The file names follow the original BMERC naming convention, and are what the set-up-for-true-mrf function uses [hyperlink?].

make-mrf-files arguments

The sole argument to make-mrf-files is the name of a file that contains all core locus names, one per line.
This section of the script just tests that the file exists and is readable.

Environments and counts

Computing core's environments and counts files are done together, since a core's counts depend on its environments, but neither depend on results for other cores (which is not true of score files). After
printing a message, the next line of the script loops over all locus names. For each locus, we construct some file names, and then check to see if either environment file is missing. If so, we must force the .cnt file to be recomputed by physically removing it. Finally, we compute the environment files for this core by running the mrf-envs program.

make-mrf-files uses options that define 20 singleton states: 10 equally graded exposure levels for residues in helices, and 10 for those in strands. As it turns out, this information is irrelevant, because only pairwise and GMT score files are ever generated; the "true" singleton statistics are never used. [probably need hyperlinkage! -- rgr, 30-Jan-97.]

In the same manner, the counts file is recomputed if it does not already exist. And, if we do so, the total counts must be recomputed, since they are likely to change.

Compute total counts

The mrf-scores program produces the totals by simply summing all counts files in the current directory. Needless to say, any stray counts files that were not produced by the previous steps will introduce garbage into all subsequent score files. And we must compute them all, since each and every score file depends on the count totals.

Compute score files

Once we have ensured that the totals are up to date, we must set up another loop on locus names in order to use the
mrf-scores program to compute any missing score files for each locus. (Unfortunately, we must also recompute the core's environment file names; csh is hard to modularize this way.) The scores are cross-validated for a give core by simply subtracting that core's contribution from the total counts. (Non-cross-validated scores can also be produced by removing the "-sub $locus-mrf.cnt" arguments, though this will produce identical pairwise or "true" singleton score files for all loci; only the GMT singletons will differ per-core.)

To compute "true" singleton scores instead of pairwise and GMT scores, replace "-write-pair-sing $ps $ss" with "-write-singletons $ss" (and consider picking another value for $ss_prefix to avoid conflict with the GMT scores). To omit loop scores (in either the singleton or pairwise case), drop the "-loop-score-file $ls" arguments.


Automatic generation via makefiles

Because each core requires many data files, the overall file set can become enormous for even a moderately large library of cores. Fortunately, the Unix make utility makes the job of creating the right files at the right times much easier. And with the make-mrf-depends.pl script, most of the work of generating the makefile can also be automated.

There are two important cases to consider:

Both cases are "unified" through a search path mechanism. If you supply the -search-path option to the make-mrf-depends.pl program, it will look through other directories for files that already exist. This also makes it easier to implicitly specify the location of the required core, sequence, and exposure files, which may be located in one or more other directories to minimize clutter.

Generating a complete set of score/env files

For purposes of illustration, we will construct MRF files for the same set of 15 cores in the
core generation example. Although this is far too small a set to yield anything resembling reasonable statistics, it serves admirably to illustrate the mechanics of score/environment file generation. We start out by creating a directory that is a sibling to the directories that contain the core and sequence files we need for input. In the new score directory, we must create at least the first two of the following four files: We will discuss all of these files in turn below.

Just as for core generation, we will also need a core list file, but since we wish to generate scores for the same set as the core example, we are better off using that core file directly, referring to it as "../test-cores/cores.loci". This file is discussed in the "Core list file" section.

The cross-validation set file
[oops; this doesn't work any more. 1bmtA is no longer part of the core set. -- rgr, 17-Sep-98.]

In this core set, it happens (by no small coincidence) that both 1bmtA and 1cus have similar folds (1cus has a flavodoxin fold, and 1bmt chain A has a domain with a flavodoxin fold). To prevent the statistics from one structure from helping the scoring scheme to "memorize" the structure of the other, they must be cross-validated together: neither structure may be used to produce the scores for either. Their score files will therefore be identical [with the exception of GMT scores, explained elsewhere. -- rgr, 27-Nov-97]. In order to achieve this effect, we need a cross-validation set file, the format of which is described in the Cross-validation set file format section. The file is conventionally called xval-sets.clique, and should look like this:

    flavodoxins:	1bmtA	1cus
Only one line is needed here, since we wish to use the hold-one-out strategy for the remaining cores. The line starts with a mnemonic name for the set that ends in a colon, followed by the tab-separated locus names (do not put any spaces in this file!). The set name is arbitrary, but it should be acceptable as part of a file name (because spaces or asterisks will confuse make).
The marginal definition file

Details on how (and why) to use a marginal definition file can be found in the "GMT marginal sets" section. Here we simply note that they must be specified in the appropriate MRF-*-SCORES macros if used. The example makefile puts this in the PAIR-SCORE-OPTS macro to avoid having to repeat it for each of the score macros.

Here are the contents of the file we are using here, identical to the one discussed in the "GMT marginal sets" section, repeated here for convenience.

    # e-bur
    5,6,7t,9t,11t,17,18,19t
    # e-exp
    6t,7,8,10t,12t,18t,19,20
    # h-bur
    1,2,3t,9,10,13,14,15t
    # h-exp
    2t,3,4,11,12,14t,15,16
This just reflects the standard WMS environment definitions.

Of course, this file (and its mention in the makefile) may be omitted entirely if GMT scores are not desired.

The makefile
[***here***. continue. -- rgr, 17-Sep-98.]

As mentioned above, the makefile must be named "makefile", and should look something like what is shown below (see elsewhere for a discussion of makefile syntax):


    # Example makefile for MRF score/environment file generation.

[1] MRF-GMT-ENVS = mrf-envs
    MRF-SING-ENVS = mrf-envs
    MRF-PAIR-ENVS = mrf-envs
    MRF-ENVS = mrf-envs
    MRF-COUNTS = mrf-counts
    ADD-COUNTS = mrf-scores
[2] # this is shorthand; ${PAIR-SCORE-OPTS} is not required per se.
    PAIR-SCORE-OPTS = -gmt-marginal-file mrf.msd -min-pair-count 4
    MRF-SING-SCORES = mrf-scores -poisson
    MRF-LOOP-SCORES = mrf-scores -poisson
    MRF-GMT-SCORES = mrf-scores ${PAIR-SCORE-OPTS}
    MRF-PAIR-SCORES = mrf-scores ${PAIR-SCORE-OPTS}
    MRF-SCORES = mrf-scores -poisson ${PAIR-SCORE-OPTS}

[3] all:  gmt-environments gmt-scores pairwise-scores loop-scores

[4] clean:	clean-scores
	    rm -f ${core-counts} total-mrf.cnt.tmp
	    rm -f total-mrf.cnt ${xval-set-counts}
	    rm -f ${singleton-environments} ${pairwise-environments}
	    rm -f ${gmt-environments}
    clean-scores:
	    rm -f ${singleton-scores} ${loop-scores}
	    rm -f ${gmt-scores} ${pairwise-scores}

    # .core and .seq files are in ../cores/, and exposure files
    # are in ../exposure/.  We have no homolog data for these cores.
    search-path = ../cores/ ../exposure/

    depend:
	    make-mrf-depends.pl -search-path ${search-path} \
		    -xval-sets xval-sets.clique \
		    -core-list-file cores.loci > depends.make

    include depends.make

(If you cut & paste this out of your browser, be sure to (a) remove any leading whitespace and footnote numbers; (b) ensure that the three lines after the "depend:" target start with ASCII tab characters; and (c) ensure that there are no spaces after the backslash characters at the ends of lines.)

Notes:

  1. At the top of the makefile, a series of required macros are defined in the file.
  2. Notice how the makefile uses the PAIR-SCORE-OPTS macro to avoid having to repeat the series of options desired for pairwise scores for each of the score macros. Different MRF-*-SCORES macros may need to use different options, but they must at least be consistent, and using auxiliary macros like PAIR-SCORE-OPTS to avoid repetition certainly helps in this regard.
  3. The first target in a makefile is traditionally named "all"; being first, it is what gets made by default. Listing all of the score file targets plus the GMT environments (which are not needed for score construction) will ensure that all files needed for threading get built.
  4. While it is never necessary to have a "clean" target, it can come in handy when rebuilding everything. Using the predefined file name macros ensures that the only things deleted are those that can be rebuilt. The scores are done separately with a "clean-scores" target, but this is a matter of personal preference.
Making the dependency file
To create the depends.make file, simply do "touch depends.make" (since the file must exist for make to read the makefile properly), then run make-mrf-depends.pl by doing "make depend". The transcript will look something like this:

    hinshelwood% touch depends.make
    hinshelwood% make depend
    make-mrf-depends.pl -search-path ../cores/ ../exposure/ \
	    -xval-sets xval-sets.clique \
	    -core-list-file cores.loci > depends.make
    hinshelwood% 
Note how the command is echoed with the "${search-path}" macro invocation filled out. As shown in the initial fragment below, make-mrf-depends.pl also puts the arguments it sees, along with the date and time, at the head of the file it creates.

    # This file was generated automatically by make-mrf-depends.pl; do not edit!
    # Created on Sun Nov 30 16:38:54 EST 1997 with the following options:
    #
    #	make-mrf-depends.pl -search-path ../cores/ ../exposure/ -xval-sets \
    #		xval-sets.clique -core-list-file cores.loci
    #
    # perl libraries used:
    #
    #	/home2/staff/thread/bin/scripts/make-lib.pm
    #	/home2/staff/thread/bin/scripts/mrf-make-lib.pm
    #
    # 5 loci were found in cores.loci.
    #

[1] # Cross-validation sets defined in xval-sets.clique:
    flavodoxins-count-files = 1bmtA-mrf.cnt 1cus-mrf.cnt
    flavodoxins-mrf.cnt: ${flavodoxins-count-files}
	    ${ADD-COUNTS} -add ${flavodoxins-count-files} -write-counts \
		    flavodoxins-mrf.cnt

    # 1bmtA
    singleton_environments_MRF_1bmtA.dat: ../cores/1bmtA.core
	    ${MRF-GMT-ENVS} -core-file ../cores/1bmtA.core -gmt-env-file \
		    singleton_environments_MRF_1bmtA.dat
    mrf-se-10efa-2ss-1bmtA.dat: ../cores/1bmtA.core ../exposure/1bmt.nexp \
		../cores/1bmtA.seq
	    ${MRF-SING-ENVS} -sing-env-file mrf-se-10efa-2ss-1bmtA.dat -core-file \
		    ../cores/1bmtA.core -sequence-file ../cores/1bmtA.seq \
		    -exposure-file ../exposure/1bmt.nexp
    pairwise_environments_reference_1bmtA.dat: ../cores/1bmtA.core \
		../exposure/1bmt.nexp ../cores/1bmtA.seq
	    ${MRF-PAIR-ENVS} -pair-env-file \
		    pairwise_environments_reference_1bmtA.dat -core-file \
		    ../cores/1bmtA.core -sequence-file ../cores/1bmtA.seq \
		    -exposure-file ../exposure/1bmt.nexp
    1bmtA-mrf.cnt: mrf-se-10efa-2ss-1bmtA.dat \
		pairwise_environments_reference_1bmtA.dat ../cores/1bmtA.core
	    ${MRF-COUNTS} -pair-env-file pairwise_environments_reference_1bmtA.dat \
		    -sing-env-file mrf-se-10efa-2ss-1bmtA.dat -core-file \
		    ../cores/1bmtA.core -sequence-file ../cores/1bmtA.seq \
		    -write-counts 1bmtA-mrf.cnt
    singleton_scores_x_MRF_1bmtA.dat: flavodoxins-mrf.cnt total-mrf.cnt \
		pairwise_environments_reference_1bmtA.dat \
		mrf-se-10efa-2ss-1bmtA.dat
	    ${MRF-GMT-SCORES} -add total-mrf.cnt -sub flavodoxins-mrf.cnt \
		    -pair-env-file pairwise_environments_reference_1bmtA.dat \
		    -sing-env-file mrf-se-10efa-2ss-1bmtA.dat \
		    ${MRF-MASSAGE-COUNTS} -write-gmt \
		    singleton_scores_x_MRF_1bmtA.dat
    pairwise_scores_x_MRF_1bmtA.dat: flavodoxins-mrf.cnt total-mrf.cnt
	    ${MRF-PAIR-SCORES} -add total-mrf.cnt -sub flavodoxins-mrf.cnt \
		    ${MRF-MASSAGE-COUNTS} -write-pair \
		    pairwise_scores_x_MRF_1bmtA.dat
    loop_scores_x_MRF_1bmtA.dat: flavodoxins-mrf.cnt total-mrf.cnt
	    ${MRF-LOOP-SCORES} -add total-mrf.cnt -sub flavodoxins-mrf.cnt \
		    -write-pair /dev/null -write-loop loop_scores_x_MRF_1bmtA.dat
    . . .

Notes:

  1. The first set of targets in the buffer is for the counts files for defined cross-validation sets. Note how all of the subsequent score files for 1bmtA are constructed with "-add total-mrf.cnt -sub flavodoxins-mrf.cnt" instead of "-add total-mrf.cnt -sub 1mbtA-mrf.cnt" as for the other cores.

After doing make depend, the directory will look something like this:


    hinshelwood% ls -l
    total 17
    -rw-rw-r--  1 rogers         27 Nov 27 15:32 cores.loci
    -rw-rw-r--  1 rogers      11610 Nov 30 16:48 depends.make
    -rw-rw-r--  1 rogers       1158 Nov 30 16:48 makefile
    -rw-rw-r--  1 rogers        126 Nov 27 16:53 mrf.msd
    -rw-rw-r--  1 rogers         24 Nov 27 15:36 xval-sets.clique
    hinshelwood% 
At this point, the directory is initialized. The make depend step will need to be redone only if either the make-mrf-depends.pl arguments, the ${MRF-*} command macros in the makefile, or the contents of the core list file change.
Making the score and environment files

Once the dependency file is all set up, the final step is to do "make", which will create the actual score and environment files (plus counts files along the way). We do not bother to show the transcript, which just essentially just a copy of the command lines in the depends.make file.


    hinshelwood% ls -l
    total 697
    -rw-rw-r--  1 rogers      21461 Nov 30 16:52 1bmtA-mrf.cnt
    -rw-rw-r--  1 rogers      13321 Nov 30 16:52 1cus-mrf.cnt
    -rw-rw-r--  1 rogers      18884 Nov 30 16:52 1ra9-mrf.cnt
    -rw-rw-r--  1 rogers       5631 Nov 30 16:52 1vcc-mrf.cnt
    -rw-rw-r--  1 rogers      36577 Nov 30 16:53 2sblB-mrf.cnt
    -rw-rw-r--  1 rogers         27 Nov 27 15:32 cores.loci
    -rw-rw-r--  1 rogers      11610 Nov 30 16:48 depends.make
    -rw-rw-r--  1 rogers      24780 Nov 30 16:52 flavodoxins-mrf.cnt
    -rw-rw-r--  1 rogers        229 Nov 30 16:53 loop_scores_x_MRF_1bmtA.dat
    -rw-rw-r--  1 rogers        229 Nov 30 16:53 loop_scores_x_MRF_1cus.dat
    -rw-rw-r--  1 rogers        229 Nov 30 16:53 loop_scores_x_MRF_1ra9.dat
    -rw-rw-r--  1 rogers        229 Nov 30 16:53 loop_scores_x_MRF_1vcc.dat
    -rw-rw-r--  1 rogers        229 Nov 30 16:53 loop_scores_x_MRF_2sblB.dat
    -rw-rw-r--  1 rogers       1158 Nov 30 16:48 makefile
    -rw-rw-r--  1 rogers       1749 Nov 30 16:52 mrf-se-10efa-2ss-1bmtA.dat
    -rw-rw-r--  1 rogers        990 Nov 30 16:52 mrf-se-10efa-2ss-1cus.dat
    -rw-rw-r--  1 rogers        880 Nov 30 16:52 mrf-se-10efa-2ss-1ra9.dat
    -rw-rw-r--  1 rogers        363 Nov 30 16:52 mrf-se-10efa-2ss-1vcc.dat
    -rw-rw-r--  1 rogers       4103 Nov 30 16:52 mrf-se-10efa-2ss-2sblB.dat
    -rw-rw-r--  1 rogers        126 Nov 27 16:53 mrf.msd
    -rw-rw-r--  1 rogers       6985 Nov 30 16:52 pairwise_environments_reference_1bmtA.dat
    -rw-rw-r--  1 rogers       3835 Nov 30 16:52 pairwise_environments_reference_1cus.dat
    -rw-rw-r--  1 rogers       2785 Nov 30 16:52 pairwise_environments_reference_1ra9.dat
    -rw-rw-r--  1 rogers        940 Nov 30 16:52 pairwise_environments_reference_1vcc.dat
    -rw-rw-r--  1 rogers      14920 Nov 30 16:53 pairwise_environments_reference_2sblB.dat
    -rw-rw-r--  1 rogers      64586 Nov 30 16:53 pairwise_scores_x_MRF_1bmtA.dat
    -rw-rw-r--  1 rogers      64586 Nov 30 16:53 pairwise_scores_x_MRF_1cus.dat
    -rw-rw-r--  1 rogers      64586 Nov 30 16:53 pairwise_scores_x_MRF_1ra9.dat
    -rw-rw-r--  1 rogers      64586 Nov 30 16:53 pairwise_scores_x_MRF_1vcc.dat
    -rw-rw-r--  1 rogers      64586 Nov 30 16:53 pairwise_scores_x_MRF_2sblB.dat
    -rw-rw-r--  1 rogers       1749 Nov 30 16:52 singleton_environments_MRF_1bmtA.dat
    -rw-rw-r--  1 rogers        990 Nov 30 16:52 singleton_environments_MRF_1cus.dat
    -rw-rw-r--  1 rogers        880 Nov 30 16:52 singleton_environments_MRF_1ra9.dat
    -rw-rw-r--  1 rogers        363 Nov 30 16:52 singleton_environments_MRF_1vcc.dat
    -rw-rw-r--  1 rogers       4103 Nov 30 16:52 singleton_environments_MRF_2sblB.dat
    -rw-rw-r--  1 rogers      22994 Nov 30 16:53 singleton_scores_x_MRF_1bmtA.dat
    -rw-rw-r--  1 rogers      12989 Nov 30 16:53 singleton_scores_x_MRF_1cus.dat
    -rw-rw-r--  1 rogers      11539 Nov 30 16:53 singleton_scores_x_MRF_1ra9.dat
    -rw-rw-r--  1 rogers       4724 Nov 30 16:53 singleton_scores_x_MRF_1vcc.dat
    -rw-rw-r--  1 rogers      54024 Nov 30 16:53 singleton_scores_x_MRF_2sblB.dat
    -rw-rw-r--  1 rogers      39285 Nov 30 16:53 total-mrf.cnt
    -rw-rw-r--  1 rogers      39285 Nov 30 16:53 total-mrf.cnt.tmp
    -rw-rw-r--  1 rogers         24 Nov 27 15:36 xval-sets.clique
    hinshelwood% 
Note how specifying a "flavodoxin" cross-validation set resulted in the creation of a flavodoxins-mrf.cnt file. You can verify for yourself that each of the corresponding score files for 1bmtA and 1cus is identical.

Generating "add-on" score/env files

[***finish***: need another example, probably an add-on set to the first example. -- rgr, 27-Nov-97.]
Bob Rogers <rogers@darwin.bu.edu>
Last modified: Fri Nov 26 19:59:58 EST 1999