MRF score and environment formats


Table of contents

  1. MRF score and environment formats
    1. Table of contents
    2. Environment file format
    3. Counts file format
    4. Score file format


Environment file format

The environment describes what set of scores to use at a given residue position (or between a pair of residue positions, for pairwise environments). The environment file maps a core total index (CTI), or two CTIs for pairwise, to an environment number. Choice of environment numbers depends on the score function; the environment-generating code for a given score function is free to assign environment numbers arbitrarily (though for practical reasons, they should be restricted to small positive integers).

Note that there is no loop environment file; the definition of loop environments is implicit (and besides, there are no fixed features on which to hang environment assignments).

The first two numbers in an environment file appear on the first line of the file, and are:

  1. the total number of environment entries; and
  2. the "size" of each entry (which is always 2 for singleton files, and 3 for pairwise files).

Each subsequent line has:

  1. the CTI of the first (or only) residue;
  2. for pairwise files only, the CTI of the second residue; and
  3. the environment number.

Traditionally, this is in "%4d %4d %4d\n" format, but any whitespace that separates the numbers is allowed. (In fact, needle doesn't care where you put the newlines, either. But we like to preserve this format because it's easy for humans to read, and it's easier for benighted languages like C to parse.)
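
The layout described above can be read with a few lines of code. The following is a minimal Python sketch, not needle's own reader: the function name parse_environment_file is hypothetical, and the sample text is a shortened three-entry variant of the singleton example (so its count is 3 rather than 47). It treats the file as one whitespace-separated stream of integers, which matches the statement that newlines are not significant.

```python
def parse_environment_file(text):
    """Parse environment-file text into a list of integer tuples.

    Tokens are split on any whitespace, so line breaks are not significant.
    Each returned tuple is (cti, env) or (cti1, cti2, env).
    """
    tokens = [int(tok) for tok in text.split()]
    if len(tokens) < 2:
        raise ValueError("missing header (entry count and entry size)")
    count, size = tokens[0], tokens[1]
    if size not in (2, 3):
        raise ValueError(f"unexpected entry size {size}")
    body = tokens[2:]
    if len(body) != count * size:
        raise ValueError(
            f"expected {count * size} numbers after the header, found {len(body)}"
        )
    # Group the flat token stream into entries of `size` numbers each.
    return [tuple(body[i:i + size]) for i in range(0, len(body), size)]


# Hypothetical shortened sample: header "3 2", then three singleton entries.
sample = """   3    2
   1   14
   2   18
   3   11
"""
entries = parse_environment_file(sample)
for entry in entries:
    print("%4d %4d" % entry)
```

The same function handles pairwise files, since the entry size in the header decides how many numbers make up each entry.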

The following example is taken from the start of a singleton file, mrf-se-10efa-2ss-2hpr.dat to be exact. Note that there are a total of 47 entries in this file; there must be exactly one for each core position in a singleton file, so the CTI in the first column runs through all values from 1 to 47. (There is no requirement that they be in order, though.)

 
      47     2
       1    14
       2    18
       3    11
       4    15
       5     7
       6     9
       7     1
       8     1
       9     5
      10     4
      11     1
      12     1
      13     8
      14     2
      15     1
       . . .
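
The "exactly one entry per core position" rule for singleton files is easy to check mechanically. This is a small Python sketch under the assumption that the entries have already been read into (CTI, environment) pairs; the function name check_singleton_coverage is hypothetical, and the second test list is made-up data chosen to trigger both kinds of error.

```python
def check_singleton_coverage(entries, n_core):
    """Return (missing, duplicated) CTIs for a singleton environment mapping.

    `entries` is a list of (cti, environment) pairs in any order.
    CTIs outside 1..n_core are not reported by this sketch.
    """
    seen = {}
    for cti, _env in entries:
        seen[cti] = seen.get(cti, 0) + 1
    missing = [cti for cti in range(1, n_core + 1) if cti not in seen]
    duplicated = sorted(cti for cti, n in seen.items() if n > 1)
    return missing, duplicated


# Order does not matter, so a shuffled but complete list passes.
print(check_singleton_coverage([(3, 11), (1, 14), (2, 18)], 3))
# Made-up faulty list: CTI 1 appears twice and CTI 2 is absent.
print(check_singleton_coverage([(1, 14), (1, 7), (3, 11)], 3))
```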

The following is an example of a pairwise environment file, pairwise_environments_MRF_4cpa.dat to be exact. Although there are a total of 198 entries in this file, we show only the lines involving the first 5 core positions. Note that there is no constraint on length or order, and a given core position need not appear in the file at all (though that would be suspicious).

 
     198    3
       1    3   17
       1   21    6
       1   23    5
       2    4   17
      11    2    9
      14    2    9
      15    2    9
       2   22    6
       2   27    5
       2   29    5
       3    5   17
       3   23    5
       3   28    5
       7    4    9
       8    4   11
      11    4    9
       4   22    6
       4   24    6
       4   29    5
       5   30    5
      31    5    9

. . .
[explain about symmetry issues. -- rgr, 14-Jan-97.]

Note that the GMT environment file is of a special form. Since each core position gets its own set of GMT values (which may not be distinct, but we don't bother checking, since the chance is small), each core position also needs its own environment. Therefore, the environment file is an identity mapping, as shown in the singleton_environments_MRF_4fxn.dat file below:

 
      74     2
       1     1
       2     2
       3     3
       4     4
       5     5
       6     6
       7     7
       8     8
       9     9
      10    10
      11    11
      . . .
      68    68
      69    69
      70    70
      71    71
      72    72
      73    73
      74    74

The associated singleton score file with the GMT probabilities must therefore have 74 environments.
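
Because the GMT environment file is just an identity mapping, it can be generated from the number of core positions alone. The sketch below is one way to do that in Python; the function name identity_environment_text is hypothetical, and the "%4d %4d" column width follows the traditional format mentioned earlier rather than the exact spacing of the listing above.

```python
def identity_environment_text(n_core):
    """Build singleton environment-file text mapping each CTI to itself."""
    # Header: number of entries, then entry size (2 for singleton files).
    lines = ["%4d %4d" % (n_core, 2)]
    # One entry per core position, with environment number equal to the CTI.
    lines.extend("%4d %4d" % (cti, cti) for cti in range(1, n_core + 1))
    return "\n".join(lines) + "\n"


text = identity_environment_text(74)
print(text, end="")
```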


Counts file format

The "counts" file contains counts of amino acids seen in singleton, pairwise, and loop environments, for a given core and its set of singleton and pairwise environment files, and possibly a homolog file as well. It is used as an intermediate file in the score file production process. [fill this out. -- rgr, 10-Jan-97.]


Score file format

A score file defines the score to assign to a given amino acid in a given singleton or loop environment, or a given pair of amino acids in a given pairwise environment. One set of scores (a 20-vector for singletons and loops, a 20x20 matrix for pairs) must be provided for each defined environment; the score file and the environment file must therefore be consistent.
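
One way to read the consistency requirement is that every environment number used in the environment file must have a score set in the matching score file. The following Python sketch checks just that, assuming both files have already been parsed; the function name undefined_environments is hypothetical, and the score-file environment ranges in the usage lines are illustrative assumptions, not taken from any real file pairing.

```python
def undefined_environments(env_entries, score_envs):
    """Return environment numbers used by entries but absent from the scores.

    `env_entries` holds tuples whose last element is the environment number,
    i.e. (cti, env) or (cti1, cti2, env). `score_envs` is any iterable of
    the environment numbers defined in the score file.
    """
    used = {entry[-1] for entry in env_entries}
    return sorted(used - set(score_envs))


pairwise_entries = [(1, 3, 17), (1, 21, 6), (1, 23, 5)]
# Assumed score file defining environments 1 through 20: nothing is missing.
print(undefined_environments(pairwise_entries, range(1, 21)))
# Assumed score file defining only 1 through 10: environment 17 has no scores.
print(undefined_environments(pairwise_entries, range(1, 11)))
```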

Score values are usually floating point, but may be integers. In either case, larger values are interpreted as unfavorable, and smaller values as favorable. Since needle converts all scores to integers for speed, all values should be in the same range in order to avoid loss of precision. (But needle has run parameters that can be used to control the scaling.)
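
To see why the value ranges matter, consider what happens when floating-point scores are scaled and rounded to integers. The Python sketch below is only an illustration: the scale factors and the round-to-nearest rule are assumptions, and needle's actual conversion and run parameters are not described here. The sample values are taken from the singleton and pairwise examples in this document.

```python
def to_integer_scores(scores, scale):
    """Multiply each score by `scale` and round to the nearest integer.

    This is an assumed conversion for illustration, not needle's own code.
    """
    return [round(s * scale) for s in scores]


singleton = [2.759, 3.548, 3.609]   # values near 3
pairwise = [-0.014, 0.180, 0.358]   # values near 0

# With a small scale, the larger values stay distinct ...
print(to_integer_scores(singleton, 10))
# ... but the small values collapse toward zero and lose most of their detail.
print(to_integer_scores(pairwise, 10))
# A larger scale preserves the small values.
print(to_integer_scores(pairwise, 1000))
```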

Each score file consists of the following elements:

  1. The first line contains the names of all amino acids in the order their scores will appear. Names should be uppercase (but needle accepts any case).
  2. The second line contains the environment count, which is the number of score vectors/tables present in the file. (For some reason, the MRF code adds a blank line before the environment count in the pairwise score file, though needle doesn't care.)
  3. Following lines contain the scores. All numbers are free-format (though the MRF code prints scores in "%6.3f" format for human readability).
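
The three elements above can be read as follows. This is a minimal Python sketch rather than needle's parser: the function name parse_score_file and its pairwise flag are hypothetical, and it assumes each score set is preceded by its environment number, as in the examples in this section. The sample uses a shortened three-name alphabet purely to keep the listing small.

```python
def parse_score_file(text, pairwise=False):
    """Parse score-file text into (names, {environment: scores})."""
    lines = text.splitlines()
    # The first non-blank line names the amino acids, in score order.
    first = next(i for i, line in enumerate(lines) if line.strip())
    names = [name.upper() for name in lines[first].split()]
    n = len(names)

    # Everything after the names line is treated as one free-format
    # stream of numbers, so blank lines and line breaks are ignored.
    tokens = " ".join(lines[first + 1:]).split()
    if not tokens:
        raise ValueError("missing environment count")
    n_env = int(tokens[0])
    per_env = n * n if pairwise else n
    block = 1 + per_env  # environment number plus its scores
    if len(tokens) - 1 != n_env * block:
        raise ValueError("score count does not match the environment count")

    scores = {}
    for k in range(n_env):
        start = 1 + k * block
        env = int(tokens[start])
        values = [float(tok) for tok in tokens[start + 1:start + block]]
        if pairwise:
            # Split the flat list into n rows of n scores each.
            values = [values[r * n:(r + 1) * n] for r in range(n)]
        scores[env] = values
    return names, scores


# Hypothetical shortened sample: three names instead of twenty.
sample = """ALA CYS ASP
 2
  1  2.759  3.548  3.609
  2  2.405  3.635  3.799
"""
names, scores = parse_score_file(sample)
print(names)
print(scores[2])
```

Passing pairwise=True makes the function expect a full square matrix per environment, and the blank line that the MRF code writes before the environment count is skipped automatically.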

The following singleton example is from the start of the singleton_scores_x_MRF_4fxn.dat file. These are GMT scores, which is why there are scores defined for each of 74 environments. The backslashes ("\" characters) do not actually appear in score files -- they are there to indicate where line breaks have been added for readability. (On my browser, these examples will fit in the window when the browser width is at least 700 pixels.)

 
ALA CYS ASP GLU PHE GLY HIS ILE LYS LEU MET ASN PRO GLN ARG SER THR VAL TRP TYR
 74
  1  2.759  3.548  3.609  3.500  2.813  2.958  3.692  2.416  3.416  2.304  \
     3.416  3.653  3.950  3.634  3.366  2.973  2.900  2.088  3.652  2.821 
  2  2.405  3.635  3.799  3.684  2.858  3.015  3.838  2.274  3.722  2.148  \
     3.435  3.792  3.907  3.816  3.650  3.068  2.979  1.965  3.786  2.991 
  3  2.609  3.695  3.824  3.778  2.690  2.869  3.930  2.199  3.692  2.084  \
     3.497  3.872  4.298  3.866  3.595  3.048  2.931  1.864  3.851  2.832 
  4  2.604  3.477  3.553  3.433  2.920  3.028  3.591  2.496  3.440  2.387  \
     3.337  3.558  3.640  3.567  3.401  3.006  2.952  2.226  3.576  2.956 
  5  2.399  3.703  3.859  3.783  2.783  2.944  3.927  2.196  3.791  2.066  \
     3.474  3.869  4.095  3.894  3.697  3.082  2.974  1.877  3.871  2.959 
  6  2.031  3.975  3.713  3.421  2.883  3.084  3.914  2.421  3.556  1.901  \
     3.329  3.700  4.193  3.511  3.307  3.198  3.165  2.297  3.817  3.166 
  . . .

This is the pairwise score file that corresponds to the singleton example (more heavily edited because pairwise scores have vastly more numbers). Each environment number appears on a line by itself, ahead of the rows of its 20x20 matrix.

 
ALA CYS ASP GLU PHE GLY HIS ILE LYS LEU MET ASN PRO GLN ARG SER THR VAL TRP TYR

  20
   1
 -0.014   0.180   0.358  -0.118   0.101  -0.092   0.353  -0.026   0.137  -0.026 \
     0.247   0.224   0.135   0.156   0.254  -0.041  -0.187  -0.228   0.145   0.121 
  0.180  -1.227  -0.356   0.064   0.035   0.020  -0.362  -0.111  -0.278   0.461 \
    -0.152  -0.441  -0.356  -0.187  -0.245  -0.079   0.274   0.403  -0.569  -0.339 
  0.358  -0.356  -0.871   0.242   0.213  -0.159  -0.407  -0.205  -0.100   0.233 \
     0.180  -0.669  -0.178  -0.232  -0.537  -0.307  -0.241   0.581  -0.209  -0.161 
  . . .
  0.145  -0.569  -0.209   0.434  -0.489   0.257   0.009   0.087   0.093   0.256 \
    -0.321  -0.071  -0.997   0.184  -0.211  -0.115   0.057   0.243  -0.487   0.149 
  0.121  -0.339  -0.161   0.153  -0.228   0.138  -0.454   0.170  -0.083  -0.050 \
    -0.091  -0.614   0.650  -0.279  -0.626   0.010  -0.001   0.453   0.149   0.091 
   2
 -0.755   0.226   0.015  -0.362   0.226   0.226   0.226  -0.131  -0.262  -0.040 \
     0.226   0.226   0.226   0.238  -0.131   0.226   0.250   0.238   0.226   0.226 
  0.226  -0.046  -0.034   0.177  -0.046  -0.046  -0.046   0.002   0.026   0.093 \
    -0.046  -0.046  -0.046  -0.034   0.002  -0.046  -0.022  -0.034  -0.046  -0.046 
  0.226  -0.046  -0.034   0.177  -0.046  -0.046  -0.046   0.002   0.026   0.093 \
    -0.046  -0.046  -0.046  -0.034   0.002  -0.046  -0.022  -0.034  -0.046  -0.046 
  . . .
  0.226  -0.046  -0.034   0.177  -0.046  -0.046  -0.046   0.002   0.026   0.093 \
    -0.046  -0.046  -0.046  -0.034   0.002  -0.046  -0.022  -0.034  -0.046  -0.046 
  0.015  -0.034  -0.022   0.189  -0.034  -0.034  -0.034   0.015   0.038   0.106 \
    -0.034  -0.034  -0.034  -0.022   0.015  -0.034  -0.009  -0.022  -0.034  -0.034 
   3
 -0.171   0.052   0.077   0.077   0.052   0.052   0.052   0.113   0.052  -0.378 \
     0.052   0.052   0.052   0.052   0.052   0.052   0.052  -0.053   0.052   0.065 
  0.282  -0.054  -0.029  -0.029  -0.054  -0.054  -0.054   0.006  -0.054   0.326 \
    -0.054  -0.054  -0.054  -0.054  -0.054  -0.054  -0.054   0.064  -0.054  -0.042 
  0.282  -0.054  -0.029  -0.029  -0.054  -0.054  -0.054   0.006  -0.054   0.326 \
    -0.054  -0.054  -0.054  -0.054  -0.054  -0.054  -0.054   0.064  -0.054  -0.042 
  . . .

Loop scores are vastly simpler. Here is the loop_scores_x_MRF_4fxn.dat file in its entirety.

 
ALA CYS ASP GLU PHE GLY HIS ILE LYS LEU MET ASN PRO GLN ARG SER THR VAL TRP TYR
  1
  0  2.636  4.097  2.568  2.961  3.401  2.141  3.759  3.374  2.842  2.826  \
     4.101  2.790  2.595  3.348  3.188  2.588  2.750  3.058  4.423  3.499 

Bob Rogers <rogers@darwin.bu.edu>
Last modified: Fri Nov 26 21:41:17 EST 1999