Dependency file formats

The dependency file formats are the core definition file format and the cross-validation set file format. The dependency file generators use both to define core sets and to specify how to treat them for cross-validation.

Table of contents

  1. Dependency file formats
    1. Table of contents
    2. Core definition file format
      1. Core naming convention
      2. Core chain specification syntax
      3. Core definition file example
      4. Naming of other files
      5. Old prefix naming convention
    3. Cross-validation set file format

Core definition file format

A core definition file contains the following tab-delimited fields:
  1. The core locus, which can be an arbitrary string.
  2. The PDB locus, a string of four digits or letters, starting with a digit.
  3. The chain specification.
  4. SCOP domain locus.
  5. SCOP domain structure identifier (five integers separated by dots, a la EC).
  6. SCOP domain annotation (which is derived from the PDB annotation).
The core locus is always required, and must be valid as a part of a file name (i.e. no slashes or shell metacharacters). The second and third fields (PDB locus and chain specification) may be left empty, in which case they are defined by the core locus; if so, the core locus must conform to the core naming convention described below. If both PDB locus and chain specification are nonempty, then the core locus may be an arbitrary string (provided it is acceptable for constructing file names).

The PDB locus is used by the dependency file generators to find the correct PDB file, so the values must correspond to existing PDB entries.

The last three fields are not used by any needle tools code; they are purely for documentation. Only the SCOP annotation field may contain spaces.
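
As a rough illustration of the field rules above, one line of a core definition file could be split as in the following Python sketch. The names CoreDef and parse_core_def_line are hypothetical and not part of the needle tools, and the character whitelist used to decide whether a core locus is safe in a file name is an assumption (the text above only rules out slashes and shell metacharacters).

```python
import re
from typing import NamedTuple, Optional


class CoreDef(NamedTuple):
    core_locus: str
    pdb_locus: Optional[str]        # None when left empty (defaulted from the core locus)
    chain_spec: Optional[str]       # None when left empty (defaulted from the core locus)
    scop_locus: Optional[str]       # documentation only
    scop_sid: Optional[str]         # documentation only
    scop_annotation: Optional[str]  # documentation only; may contain spaces


def parse_core_def_line(line: str) -> CoreDef:
    """Split one tab-delimited core definition line into its six fields."""
    # Split at most five times, so any extra tabs stay inside the annotation.
    fields = line.rstrip("\n").split("\t", 5)
    fields += [""] * (6 - len(fields))  # trailing fields may be absent
    core, pdb = fields[0], fields[1]
    # ASSUMPTION: a conservative whitelist standing in for
    # "no slashes or shell metacharacters".
    if not re.fullmatch(r"[A-Za-z0-9_.-]+", core):
        raise ValueError(f"core locus {core!r} is missing or unsafe in a file name")
    # A PDB locus is four letters or digits, starting with a digit.
    if pdb and not re.fullmatch(r"[0-9][A-Za-z0-9]{3}", pdb):
        raise ValueError(f"malformed PDB locus {pdb!r}")
    return CoreDef(core, *(f or None for f in fields[1:]))
```

The sketch does not default the empty fields from the core locus; it only records that they were left empty.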

Core naming convention

The original BMERC convention for naming core loci was simply to use the 4-character PDB locus name; if there was more than one chain in the entry, then the first chain in the PDB file was always used. This is usually the chain with a " " (space) for a chain ID, but not always; there are one or two exceptions in the "original 57" cores that were constructed using this convention (e.g. 5cyt has only a chain "R"). This was really a syntax for core definition, rather than a convention, since there was no choice of name for a given core. But it didn't matter anyway, since the core generation process had not been automated by then.

Subsequently, starting in the summer of 1996, the convention was extended to include one or more chain ID characters (letters or digits) after the PDB locus. Although the old convention was still supported (in the sense that if no chain ID characters were specified, then make-core.pl still used the first chain in the PDB file), actual core names used since then have always included an explicit chain ID for non-space chains. (The multichain syntax was only briefly used.) The 1996 naming scheme was still self-defining, in that one needs to know just the core name in order to build the core.

In the spring of 1998, a modified version of the SCOP naming convention was adopted in order to support fractional chain domains. The SCOP convention uses a 7-character format to denote domains:

SCOP domains are thus not self defining, hence the need to introduce a core definition file format that includes explicit PDB locus and chain subrange specifications.

The new convention is truly a convention, rather than a syntax for core definition; it decouples core naming from core definition, while supporting the older prescriptive PDB-locus-and-chain style for compatibility. If the PDB locus and chain specification fields are both filled out in a core definition file, then the core locus names are purely arbitrary. If not, then the following merged SCOP/BMERC naming syntax allows them to be defaulted:

Trailing underscores may always be omitted. In any case, the "." (dot) character is never used to denote multiple chains, since it makes file names look funny.

[not always possible to go unambiguously from BMERC locus to SCOP locus. -- rgr, 24-May-98.]

[always use SCOP domain digit spec to avoid confusion. -- rgr, 24-May-98.]

Core chain specification syntax

This syntax is a subset of that used by SCOP to denote chain subranges. The main difference is that the chain ID letter/digit for the first subrange is always required.
<chains> ::= <chain-subrange>
<chains> ::= <chain-subrange>,<chains>
A chain specification is a comma-separated list of one or more subrange specifications.

<chain-subrange> ::= <chainid>
A single chain ID letter or digit, or "_" (underscore) to denote the space (" ") character, which specifies the entire chain; or
<chain-subrange> ::= <chainid>:<interval>
A chain ID as above, followed by a colon and a residue interval.

<interval> ::= <start-pdbres>-<end-pdbres>
for the interval between <start-pdbres> and <end-pdbres> inclusive; or
<interval> ::= <start-pdbres>-
for the rest of the chain starting from <start-pdbres> through the end. (This is an extension to the SCOP syntax.)

<start-pdbres> ::= <pdbres>
<end-pdbres> ::= <pdbres>
<pdbres> ::= <integer>
<pdbres> ::= <integer><letter>
Both <start-pdbres> and <end-pdbres> are pdbres-style residue identifiers, an integer index followed by an optional uppercase alphabetic insertion code, without the leading and trailing spaces used in the PDB format to pad the field. [***bug***: The parser makes no provision for negative residue indices (which do exist). -- rgr, 13-Jan-00.]
Two further extensions to SCOP syntax are supported:
  1. For backward compatibility, a string of letters and digits without punctuation will be interpreted as if it were a series of comma-separated chains.
  2. The chain ID character (letter or digit) and the following colon may be omitted for the second and subsequent intervals if they use the same chain as the previous interval.
Note that different tools interpret their -chain arguments somewhat differently. make-domain-core.pl and pdb-domain-seq.pl produce output that respects the order and multiplicity of the subranges specified on the command line, but filter-pdb-atoms.pl produces output that preserves the order of the input PDB file. (The meaning of each individual subrange is otherwise the same.)
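
The grammar and its two extensions can be sketched as a small Python parser. This is an illustration under stated assumptions, not the parser used by the tools: Subrange and parse_chain_spec are hypothetical names, the bare-string extension is applied only when the whole specification is alphanumeric, and, like the parser described above, the sketch rejects negative residue indices.

```python
import re
from typing import List, NamedTuple, Optional


class Subrange(NamedTuple):
    chain: str            # one character; " " for the blank chain ID
    start: Optional[str]  # None means the whole chain was requested
    end: Optional[str]    # None means "through the end of the chain"


_PDBRES = r"\d+[A-Z]?"  # integer plus optional uppercase insertion code
_INTERVAL = re.compile(rf"({_PDBRES})-({_PDBRES})?")
_CHAINID = re.compile(r"[A-Za-z0-9_]")


def parse_chain_spec(spec: str) -> List[Subrange]:
    # Extension 1 (ASSUMED to apply to the whole spec only): "BA" means "B,A".
    if re.fullmatch(r"[A-Za-z0-9]{2,}", spec):
        spec = ",".join(spec)
    result: List[Subrange] = []
    prev_chain: Optional[str] = None
    for item in spec.split(","):
        head, colon, tail = item.partition(":")
        if colon:
            chain_text, interval_text = head, tail
        elif _INTERVAL.fullmatch(item):
            # Extension 2: interval that reuses the previous chain ID.
            if prev_chain is None:
                raise ValueError("the first subrange needs a chain ID")
            chain_text, interval_text = None, item
        else:
            chain_text, interval_text = item, None
        if chain_text is None:
            chain = prev_chain
        elif _CHAINID.fullmatch(chain_text):
            chain = " " if chain_text == "_" else chain_text
        else:
            raise ValueError(f"bad chain ID {chain_text!r}")
        start = end = None
        if interval_text is not None:
            m = _INTERVAL.fullmatch(interval_text)
            if m is None:
                raise ValueError(f"bad interval {interval_text!r}")
            start, end = m.group(1), m.group(2)
        result.append(Subrange(chain, start, end))
        prev_chain = chain
    return result
```

The returned list keeps the order and multiplicity of the subranges as written, which matches the behaviour described for make-domain-core.pl and pdb-domain-seq.pl; a tool that follows PDB file order would have to reorder it.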

Core definition file example

For example, here is a very small core definition file using SCOP domains as cores, but with the annotation removed:
    1bmtA1      1bmt      A:1-90
    1bmtA2      1bmt      A:91-246
    1bphBA      1bph      B,A
    1bplAB1     1bpl      A,B:1-181
    1bplB1      1bpl      B:182-290
There are no spaces in this core definition file, though it may appear different in HTML. The fields must be separated via TAB characters.

Note that, in 1bphBA, the chains are specified in an order that is reversed from how they appear in the PDB entry. Locus 1bplAB1 has two chain ID characters followed by a domain digit; it consists of all of chain A followed by the first two-thirds of chain B, and 1bplB1 contains the rest of chain B. Of these five, only 1bphBA is self-defining; both the PDB locus and chain specification could be omitted (though the PDB locus could be omitted for all other cores as well).

Naming of other files

[***finish***: -- rgr, 15-Oct-98.]

For most file types, it is sufficient to construct a particular file name by concatenating the file-type-specific prefix, the core locus name, and the file-type-specific suffix. This is because most files are particular to the core. However, some of the files derived from PDB entries are constructed from whole chains or PDB entries. Therefore, the file names must be constructed from the PDB locus instead of the core locus, or the PDB locus and one or more chain ID letters as in the core naming convention described above. These file types (using the same names as in the "Defined file types" section) are listed below.

    File type         Source of file name
    exposure (EFA)    PDB locus
    vv-data           PDB locus and chain ID(s)
    vv-singleton      PDB locus and chain ID(s)
    hyperenv          PDB locus and chain ID(s)

Please remember that this is not a naming "convention" but rather a constraint that is imposed by the way we expect these files to be generated. There is no way to change this without hacking the dependency file generator internals.
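
The naming constraint can be pictured with a short Python sketch. The function file_name and the NAME_SOURCE table are hypothetical; the four file types come from the table above, while the "counts" type key and the ".henv" suffix in the usage lines are placeholders chosen only for illustration.

```python
# File types whose names come from the PDB entry rather than the core locus.
NAME_SOURCE = {
    "exposure": "pdb",
    "vv-data": "pdb+chains",
    "vv-singleton": "pdb+chains",
    "hyperenv": "pdb+chains",
}


def file_name(file_type: str, prefix: str, suffix: str,
              core_locus: str, pdb_locus: str, chain_ids: str = "") -> str:
    """Concatenate prefix, the appropriate locus, and suffix."""
    source = NAME_SOURCE.get(file_type, "core")  # most files are per-core
    if source == "core":
        stem = core_locus
    elif source == "pdb":
        stem = pdb_locus
    else:
        stem = pdb_locus + chain_ids
    return prefix + stem + suffix


# Usage, for core 1bplB1 (PDB entry 1bpl, chain B):
core = dict(core_locus="1bplB1", pdb_locus="1bpl", chain_ids="B")
print(file_name("counts", "", "-mrf.cnt", **core))   # 1bplB1-mrf.cnt
print(file_name("exposure", "pdb", ".nexp", **core))  # pdb1bpl.nexp
print(file_name("hyperenv", "", ".henv", **core))     # 1bplB.henv
```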

Old prefix naming convention

The old naming convention for exposure files at BMERC was "pdb1foo.nexp" for locus 1foo; they were kept in the "~thread/structure/exposure/" directory. This was accommodated by including "~thread/structure/exposure/pdb" (note the lack of trailing slash) in the search path. Similarly, PDB files were "pdb1foo.ent" and abbreviated DSSP files were "pdb1foo.ent.out". make-mrf-depends.pl used to convert a core locus into a PDB locus by dropping terminal characters one by one until it located the exposure file. This heuristic search is now obsolete, and has been replaced by concrete naming requirements for such files.
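
For readers curious about the obsolete heuristic, a minimal Python sketch of "drop terminal characters until the exposure file is found" follows. It is a reconstruction from the description above, not the original Perl: find_exposure_file is a hypothetical name, the order in which prefixes and shortened loci are tried is an assumption, and the existence test is passed in as a parameter so the sketch can be exercised without a real directory tree.

```python
import os
from typing import Callable, Iterable, Optional, Tuple


def find_exposure_file(
    core_locus: str,
    search_prefixes: Iterable[str],
    suffix: str = ".nexp",
    exists: Callable[[str], bool] = os.path.exists,
) -> Optional[Tuple[str, str]]:
    """Return (pdb_locus, path) for the first exposure file found, else None."""
    prefixes = list(search_prefixes)
    candidate = core_locus
    while candidate:
        for prefix in prefixes:
            # Prefixes are plain string prefixes, e.g. ".../exposure/pdb",
            # which is why the search path entry has no trailing slash.
            path = prefix + candidate + suffix
            if exists(path):
                return candidate, path
        candidate = candidate[:-1]  # drop one terminal character and retry
    return None
```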

Cross-validation set file format

Each cross-validation set is described in a format essentially identical to that used to describe "cliques" at BMERC. Each line describes a set, or clique, and consists of a clique label followed by a colon and a tab, followed by the tab-delimited list of loci, and a final newline. For example,


	a:	1s01	1end	3fxn	1byh
	b:	1apa	1mbd	3chy	5fd1
	c:	1cde	1bgc
	d:	5tmn	7rsa	8dfr
	e:	2mhr
(The clique labels must start in the first column. These loci were taken at random from the set of original 58 cores; there is no reason they should actually be cross-validated this way.) The labels are used to construct the partial counts files from the members; the counts for the first set will be put into the "a-mrf.cnt" file. The presence of a colon is what defines the label; if the "a:" field had been left off, the same core counts would have been collected into the "set-1s01-mrf.cnt" file instead. Note that there will be no "e-mrf.cnt" since it would just be a copy of the 2mhr-mrf.cnt file. In fact, assuming that 2mhr appears in core-name-file, this singleton entry could have been left out of the cross-validation set file altogether without affecting the result.
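
The labelling rules in the paragraph above can be sketched in Python as follows. parse_cv_sets and counts_file_name are hypothetical names. The sketch is more tolerant than the description: it ignores empty fields, so it does not enforce that labels start in the first column.

```python
from typing import Dict, Iterable, List


def parse_cv_sets(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each cross-validation set label to its list of loci."""
    sets: Dict[str, List[str]] = {}
    for line in lines:
        # Tolerant split: empty fields (e.g. from stray tabs) are dropped.
        fields = [f for f in line.rstrip("\n").split("\t") if f]
        if not fields:
            continue  # blank line
        if fields[0].endswith(":"):
            # The colon is what marks the first field as a label.
            label, loci = fields[0][:-1], fields[1:]
        else:
            # No label: the name is derived from the first locus.
            loci = fields
            label = "set-" + loci[0]
        if not loci:
            raise ValueError(f"set {label!r} has no members")
        sets[label] = loci
    return sets


def counts_file_name(label: str, loci: List[str]) -> str:
    """Name of the partial counts file for one set."""
    if len(loci) == 1:
        # A singleton set needs no file of its own; the member's file serves.
        return loci[0] + "-mrf.cnt"
    return label + "-mrf.cnt"
```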

Cross-validation sets are allowed to have members that do not appear on the core list. In that case, make-mrf-depends.pl will insist on finding that core's counts file on the search path. If the file is not found, a warning message is generated, and the locus is ignored.

In order to compute scores, the individual core counts files for each cross-validation set are summed together. Then, a "total counts" file is produced by adding the counts of all of the cross-validation sets to the counts of all cores not belonging to any cross-validation set; the resulting sum is over the union of all cores in the core list file and the cross-validation sets file. Finally, for a given core (or cross-validation set), one computes the score files using the difference between the total counts and that core's (resp., cross-validation set's) counts.
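
The arithmetic can be sketched in Python with collections.Counter standing in for a counts file. What the real counts files contain is not specified here, so the keys are placeholders, and held_out_counts is a hypothetical name. The sketch assumes each core belongs to at most one cross-validation set.

```python
from collections import Counter
from typing import Dict, List


def held_out_counts(core_counts: Dict[str, Counter],
                    cv_sets: Dict[str, List[str]]) -> Dict[str, Counter]:
    """For each set (or ungrouped core), total counts minus its own counts."""
    # Total over the union of all cores.
    total: Counter = Counter()
    for counts in core_counts.values():
        total.update(counts)

    # Sum the member counts of each cross-validation set.
    own: Dict[str, Counter] = {}
    grouped = set()
    for label, loci in cv_sets.items():
        summed: Counter = Counter()
        for locus in loci:
            summed.update(core_counts[locus])
            grouped.add(locus)
        own[label] = summed

    # Cores outside every set are held out individually.
    for locus, counts in core_counts.items():
        if locus not in grouped:
            own[locus] = counts

    result: Dict[str, Counter] = {}
    for name, counts in own.items():
        diff = Counter(total)
        diff.subtract(counts)  # keeps zero entries, unlike the "-" operator
        result[name] = diff
    return result
```

Every member of a set is scored against the same difference, which is why the set's counts only need to be summed once.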

[There is currently no way of specifying the total counts file name -- it is always total-mrf.cnt. -- rgr, 27-Jun-97.]

[What should happen then for cross-validation sets is that the pairwise and loop scores files should be shared. Unfortunately, the code doesn't do this yet; it should make a cross-validation set's pair and loop files once, and create links to them for each member of the set. As it stands, they are all made individually. GMT scores, since they depend on each residue's particular set of pairwise arcs, must still be made individually in any case. -- rgr, 26-Jul-97.]


Bob Rogers <rogers@darwin.bu.edu>
Last modified: Thu Jan 13 22:03:58 EST 2000