Dependency file formats
The dependency file formats are the core definition file format and the cross-validation set file format. The dependency file generators use both to define core sets and to specify how to treat them for cross-validation.
The PDB locus is used by the dependency file generators to find the correct PDB file, so the values must correspond to existing PDB entries.
The last three fields are not used by any needle tools code;
they are purely for documentation. Only the SCOP annotation field may
contain spaces.
Subsequently, starting in the summer of 1996, the convention was
extended to include one or more chain ID characters (letters or digits)
after the PDB locus. Although the old convention was still supported
(in the sense that if no chain ID characters were specified, then make-core.pl still used
the first chain in the PDB file), actual core names used since then have
always included an explicit chain ID for non-space chains. (The
multichain syntax was only briefly used.)
The 1996 naming scheme was still self-defining, in that one needs to
know just the core name in order to build the core.
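As a minimal sketch of why the 1996 scheme is self-defining, the following Python splits a core name into its PDB locus and chain ID characters. The function name is hypothetical and this is not the code used by make-core.pl. It assumes the name carries no SCOP-style domain digit, since a trailing digit could not be told apart from a chain ID under this scheme.

```python
def split_core_name_1996(core_name):
    """Split a 1996-style core name into (pdb_locus, chain_ids).

    Hypothetical helper: the first four characters are taken as the PDB
    locus and every remaining character as one chain ID.  An empty chain
    list means "use the first chain in the PDB file".
    """
    if len(core_name) < 4:
        raise ValueError(f"core name too short: {core_name!r}")
    pdb_locus = core_name[:4]
    chain_ids = list(core_name[4:])
    for c in chain_ids:
        # Chain IDs are letters or digits.
        if not (c.isascii() and c.isalnum()):
            raise ValueError(f"bad chain ID {c!r} in {core_name!r}")
    return pdb_locus, chain_ids


print(split_core_name_1996("1bphBA"))  # ('1bph', ['B', 'A'])
print(split_core_name_1996("5cyt"))    # ('5cyt', [])
```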
In the spring of 1998, a modified version of the SCOP naming
convention was adopted in order to support fractional chain domains.
The SCOP convention uses a 7-character format to denote domains:
The new convention is truly a convention, rather than a syntax for
core definition; it decouples core naming from core definition, while
supporting the older prescriptive PDB-locus-and-chain style for
compatibility. If the PDB locus and chain specification fields are all
filled out in a core definition
file, then the core locus names are purely arbitrary. If not, then
the following merged SCOP/BMERC naming syntax allows them to be
defaulted:
[not always possible to go unambiguously from BMERC locus to SCOP
locus. -- rgr, 24-May-98.]
[always use SCOP domain digit spec to avoid confusion. -- rgr,
24-May-98.]
Note that, in 1bphBA, the chains are specified in an order that is
reversed from how they appear in the PDB entry. Locus 1bplAB1 has two
chain ID characters followed by a domain digit; it consists of all of
chain A followed by the first two-thirds of chain B, and 1bplB1 contains
the rest of chain B. Of these five, only 1bphBA is self-defining: for it,
both the PDB locus and the chain specification could be omitted (the PDB
locus alone could be omitted for all of the other cores as well).
For most file types, it is sufficient to construct a particular file
name by concatenating the file-type-specific prefix, the core locus
name, and the file-type-specific suffix. This is because most files are
particular to the core. However, some of the files derived from PDB
entries are constructed from whole chains or whole PDB entries. Their
file names must therefore be constructed from the PDB locus instead of
the core locus, or from the PDB locus plus one or more chain ID letters,
as in the core naming convention described
above. These file types (using the same names as in the "Defined file types"
section) are listed below.
Please remember that this is not a naming "convention" but
rather a constraint that is imposed by the way we expect these files to
be generated. There is no way to change this without hacking the
dependency file generator internals.
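The naming constraint can be sketched in Python as follows. This is an illustration only, not the dependency file generators' code: the function name, the "counts" file type label, and the ".vv" suffix are hypothetical, while the mapping of file types to name sources follows the file type list in this document and the "pdb"/".nexp" and "-mrf.cnt" pieces come from examples elsewhere in it.

```python
# Which locus form each PDB-derived file type is named from.
# Any file type not listed here is named from the core locus.
NAME_SOURCE = {
    "exposure": "pdb",
    "vv-data": "pdb+chains",
    "vv-singleton": "pdb+chains",
    "hyperenv": "pdb+chains",
}


def file_name(file_type, prefix, suffix, core_locus, pdb_locus, chain_ids=""):
    """Build a file name as prefix + locus + suffix (hypothetical helper)."""
    source = NAME_SOURCE.get(file_type, "core")
    if source == "pdb":
        stem = pdb_locus
    elif source == "pdb+chains":
        stem = pdb_locus + chain_ids
    else:
        stem = core_locus
    return prefix + stem + suffix


# Exposure files are named from the PDB locus alone.
print(file_name("exposure", "pdb", ".nexp", "1bplAB1", "1bpl", "AB"))
# vv-data files use the PDB locus and chain IDs (".vv" is a made-up suffix).
print(file_name("vv-data", "", ".vv", "1bplAB1", "1bpl", "AB"))
# Core-specific files use the core locus ("counts" is a made-up type label).
print(file_name("counts", "", "-mrf.cnt", "1bplAB1", "1bpl", "AB"))
```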
Each cross-validation set is described in a format essentially
identical to that used to describe "cliques" at BMERC. Each line
describes a set, or clique, and consists of a clique label followed by a
colon and a tab, followed by the tab-delimited list of loci, and a final
newline. For example,
Cross-validation sets are allowed to have members that do not appear
on the core list. In that case, make-mrf-depends.pl will
insist on finding that core's counts file on the search path. If the
file is not found, a warning message is generated, and the locus is
ignored.
In order to compute scores, the individual core counts files for each
of the cross-validation sets are summed together. Then, a "total counts"
file is produced by adding all of the cross-validation sets together with
all the counts for cores not belonging to any cross-validation set; the
resulting sum is over the union of all cores in the core list file and
the cross-validation sets file. Finally, for a given core (or
cross-validation set), one computes the score files using the difference
between the total counts and that core's (resp., cross-validation set's)
counts.
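The arithmetic above can be sketched in Python with counts held in memory. This is only an illustration under stated assumptions: the real counts file format is not described here, so each core's counts are modeled as a collections.Counter with placeholder keys "x" and "y", all function names are hypothetical, and each core is assumed to belong to at most one cross-validation set.

```python
from collections import Counter


def add_counts(counters):
    """Sum an iterable of Counters into a new Counter."""
    total = Counter()
    for c in counters:
        total.update(c)
    return total


def held_out_counts(total, part):
    """Return total minus part, i.e. the counts used to score 'part'."""
    result = Counter(total)
    result.subtract(part)
    return result


def summarize(core_counts, cv_sets):
    """Return (per-set counts, total counts).

    core_counts maps locus -> Counter; cv_sets maps label -> list of loci.
    """
    set_counts = {
        label: add_counts(core_counts[m] for m in members)
        for label, members in cv_sets.items()
    }
    in_a_set = {m for members in cv_sets.values() for m in members}
    loose = [c for locus, c in core_counts.items() if locus not in in_a_set]
    total = add_counts(list(set_counts.values()) + loose)
    return set_counts, total


core_counts = {
    "1s01": Counter({"x": 3, "y": 1}),
    "1end": Counter({"x": 2}),
    "2mhr": Counter({"x": 1, "y": 4}),
}
cv_sets = {"a": ["1s01", "1end"]}

set_counts, total = summarize(core_counts, cv_sets)
# Counts for scoring any member of set "a": everything except set "a".
for_set_a = held_out_counts(total, set_counts["a"])
# Counts for scoring 2mhr, which is in no set: everything except 2mhr.
for_2mhr = held_out_counts(total, core_counts["2mhr"])
print(dict(total), dict(for_set_a), dict(for_2mhr))
```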
[There is currently no way of specifying the total counts file name
-- it is always total-mrf.cnt. -- rgr, 27-Jun-97.]
[What should happen then for cross-validation sets is that the
pairwise and loop scores files should be shared. Unfortunately, the
code doesn't do this yet; it should make a cross-validation set's pair
and loop files once, and create links to them for each member of the
set. As it stands, they are all made individually. GMT scores, since
they depend on each residue's particular set of pairwise arcs, must
still be made individually in any case. -- rgr, 26-Jul-97.]
Core naming convention
The original BMERC convention for naming core loci was simply to use the
4-character PDB locus name; if there was more than one chain in the
entry, then the first chain in the PDB file was always used.
This is usually the chain with a " " (space) for a chain
ID, but not always; there are one or two exceptions in the "original 57"
cores that were constructed using this convention (e.g. 5cyt has only a
chain "R"). This was really a syntax for core definition,
rather than a convention, since there was no choice of name for a given
core. But it didn't matter anyway, since the core generation process
had not been automated by then.
SCOP domains are thus not self defining, hence the need to introduce a
core definition file format
that includes explicit PDB locus and chain subrange specifications.
Trailing underscores may always be omitted. In any case, the
"." (dot) character is never used to denote multiple chains,
since it makes file names look funny.
Core chain specification syntax
This syntax is a subset of that used by SCOP to denote chain subranges.
The main difference is that the chain ID letter/digit for the first
subrange is always required.
Two further extensions to SCOP syntax are supported:
Note that different tools interpret their -chain arguments
somewhat differently.
make-domain-core.pl and pdb-domain-seq.pl
produce output that respects the order and multiplicity of the subranges
specified on the command line, but
filter-pdb-atoms.pl produces output that preserves the
order of the input PDB file. (The meaning of each individual subrange
is otherwise the same.)
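A minimal Python sketch of parsing a chain specification follows. It handles only the two forms that appear in this document's examples: a bare chain ID ("A") and a chain ID with a residue range ("B:1-181"), joined by commas. The function name is hypothetical, every subrange is assumed to carry its own chain ID, and the further extensions mentioned above are not covered.

```python
import re

# One subrange: a chain ID letter/digit, optionally followed by ":start-end".
_SUBRANGE = re.compile(r"^([A-Za-z0-9])(?::(\d+)-(\d+))?$")


def parse_chain_spec(spec):
    """Parse a spec such as "A,B:1-181" into (chain, start, end) tuples.

    start and end are None when the whole chain is meant.  The order of
    the subranges is preserved, as in "B,A".
    """
    subranges = []
    for part in spec.split(","):
        m = _SUBRANGE.match(part)
        if m is None:
            raise ValueError(f"unsupported subrange {part!r} in {spec!r}")
        chain, start, end = m.groups()
        if start is None:
            subranges.append((chain, None, None))
        else:
            subranges.append((chain, int(start), int(end)))
    return subranges


print(parse_chain_spec("A:1-90"))     # [('A', 1, 90)]
print(parse_chain_spec("B,A"))        # [('B', None, None), ('A', None, None)]
print(parse_chain_spec("A,B:1-181"))  # [('A', None, None), ('B', 1, 181)]
```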
Core definition file example
For example, here is a very small core definition file using SCOP
domains as cores, but with the annotation removed:
1bmtA1 1bmt A:1-90
1bmtA2 1bmt A:91-246
1bphBA 1bph B,A
1bplAB1 1bpl A,B:1-181
1bplB1 1bpl B:182-290
There are no spaces in this core definition file, though it may appear
different in HTML. The fields must be separated via TAB characters.
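Reading such a file can be sketched in Python as below. This is an assumption-laden illustration, not the parser used by the needle tools: the function name is hypothetical, the field order (core locus, PDB locus, chain specification) is taken from the example above, missing trailing fields are represented as empty strings, and any further fields are kept uninterpreted as documentation.

```python
def parse_core_definitions(text):
    """Parse TAB-separated core definition lines into dictionaries."""
    cores = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skipping blank lines is an assumption of this sketch
        fields = line.split("\t")
        # Pad so that an omitted PDB locus or chain spec becomes "".
        fields += [""] * (3 - len(fields))
        core_locus, pdb_locus, chain_spec = fields[:3]
        cores.append({
            "core": core_locus,
            "pdb": pdb_locus,
            "chains": chain_spec,
            "doc": fields[3:],  # documentation-only fields, left as-is
        })
    return cores


example = "1bmtA1\t1bmt\tA:1-90\n1bphBA\t1bph\tB,A\n1bplB1\t1bpl\tB:182-290\n"
for core in parse_core_definitions(example):
    print(core["core"], core["pdb"], core["chains"])
```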
Naming of other files
[***finish***: -- rgr, 15-Oct-98.]
File type        Source of file name
exposure (EFA)   PDB locus
vv-data          PDB locus and chain ID(s)
vv-singleton     PDB locus and chain ID(s)
hyperenv         PDB locus and chain ID(s)
Old prefix naming convention
The old naming convention for exposure files at BMERC was
"pdb1foo.nexp" for locus 1foo; they were kept in the
"~thread/structure/exposure/" directory. This was accommodated
by including "~thread/structure/exposure/pdb" (note the lack of
trailing slash) in the search path. Similarly, PDB files were
"pdb1foo.ent" and abbreviated DSSP files were
"pdb1foo.ent.out". make-mrf-depends.pl used to
convert a core locus into a PDB locus by dropping terminal characters
one by one until it located the exposure file. This heuristic search is
now obsolete, and has been replaced by concrete naming requirements for such
files.
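For reference, the obsolete heuristic can be sketched in Python as follows. This is a reconstruction from the description above, not the original make-mrf-depends.pl code: the function name is hypothetical, and a set of file names stands in for the search over the actual search path.

```python
def guess_pdb_locus(core_locus, existing_files, prefix="pdb", suffix=".nexp"):
    """Drop trailing characters until an exposure file name matches.

    existing_files is a set of file names standing in for the search path.
    Returns the matching locus, or None if nothing matches.
    """
    candidate = core_locus
    while candidate:
        if prefix + candidate + suffix in existing_files:
            return candidate
        candidate = candidate[:-1]  # drop one terminal character
    return None


files = {"pdb1bpl.nexp", "pdb1bmt.nexp"}
print(guess_pdb_locus("1bplAB1", files))  # 1bpl
print(guess_pdb_locus("2mhr", files))     # None
```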
Cross-validation set file format
a: 1s01 1end 3fxn 1byh
b: 1apa 1mbd 3chy 5fd1
c: 1cde 1bgc
d: 5tmn 7rsa 8dfr
e: 2mhr
(The clique labels must start in the first column. These loci were
taken at random from the set of original 58 cores; there is no reason
they should actually be cross-validated this way.) The labels are used
to construct the partial counts files from the members; the counts for
the first set will be put into the "a-mrf.cnt" file. The
presence of a colon is what defines the label; if the "a:"
field had been left off, the same core counts would have been collected
into the "set-1s01-mrf.cnt" file instead. Note that there will
be no "e-mrf.cnt" since it would just be a copy of the
2mhr-mrf.cnt file. In fact, assuming that 2mhr
appears in core-name-file, this singleton entry could have been
left out of the cross-validation set file altogether without affecting
the result.
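The labeling rules can be sketched in Python as below. This is an illustration under assumptions, not the make-mrf-depends.pl implementation: the function names are hypothetical, fields are assumed to be strictly TAB-delimited, and an unlabeled singleton line is assumed to be treated like a labeled one (no set counts file).

```python
def parse_cv_sets(text):
    """Parse cross-validation set lines into (label, members) pairs."""
    sets = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = [f for f in line.split("\t") if f]
        if fields[0].endswith(":"):
            # The colon is what marks the first field as a label.
            label, members = fields[0][:-1], fields[1:]
        else:
            # No label: name the set after its first member.
            members = fields
            label = "set-" + members[0]
        sets.append((label, members))
    return sets


def counts_file_name(label, members):
    """Return the partial counts file name, or None for a singleton set."""
    if len(members) < 2:
        return None  # would just duplicate the member's own counts file
    return label + "-mrf.cnt"


example = "a:\t1s01\t1end\t3fxn\t1byh\ne:\t2mhr\n1apa\t1mbd\t3chy\n"
for label, members in parse_cv_sets(example):
    print(label, members, counts_file_name(label, members))
```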
Bob Rogers
<rogers@darwin.bu.edu>
Last modified: Thu Jan 13 22:03:58 EST 2000