File formats relating to structure



This includes secondary structure as well as tertiary structure. The former is usually based on the DSSP program, and the latter on PDB ATOM format.

Table of contents

  1. File formats relating to structure
    1. Table of contents
    2. Segment file format
    3. Abbreviated DSSP file format
    4. .nexp (exposure) file format
    5. Core file format
    6. Internal "ss" format
    7. Internal "core abstraction" file format


Segment file format

A segment file describes the secondary structures for a PDB entry (file), one structure per line. Secondary structure elements (segments) must appear in the same order as the ATOM records in the PDB file. Each segment is characterized by the following tab-delimited fields:
  1. chain ID (one character) -- identifies the chain for this segment. This must always be exactly one character; for many single-chain PDB entries, it is a space character (" ").
  2. index (integer) -- segment index, as assigned in the -ss output of the dssp4.pl program. Since this output is generally not saved, these numbers are useful only for debugging.
  3. starting residue (pdbres) -- first inclusive residue of the segment.
  4. ending residue (pdbres) -- last inclusive residue of the segment.
  5. structure type (segment designator) -- alphanumeric segment type designator, e.g. "H", "E2", or (for barrels) "C030204".
  6. segment length (integer) -- length in residues. (This field was added in Release 1.0.)
Segment files have various extensions, generally denoting their origin. Most of the segment files in use at BMERC have a ".dssp" extension because they were derived from DSSP, and some end in ".dssp2" or ".dssp3" because they were hand-edited or postprocessed versions derived from an original ".dssp" file. [Unfortunately, this convention is a little too terse. A ".dssp" file is neither an abbreviated DSSP format file, nor the output of the DSSP program. -- rgr, 29-Jan-97.]
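
As an illustration of the field layout above, here is a minimal Python sketch of a reader for one segment line. The names Segment and parse_segment_line are hypothetical and not part of needle tools. The sketch also assumes that files written before Release 1.0 simply omit the sixth field, which the text above implies but does not state outright.

```python
from typing import NamedTuple, Optional


class Segment(NamedTuple):
    chain_id: str            # exactly one character; may be a space
    index: int               # segment index (mainly useful for debugging)
    start: str               # first residue of the segment (pdbres, kept as text)
    end: str                 # last residue of the segment (pdbres, kept as text)
    seg_type: str            # e.g. "H", "E2", or "C030204"
    length: Optional[int]    # None when the sixth field is absent (assumed pre-1.0)


def parse_segment_line(line: str) -> Segment:
    """Split one tab-delimited segment line into its fields."""
    # Strip only the line terminator: a chain ID of " " is significant.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 tab-delimited fields, got {len(fields)}")
    chain_id, index, start, end, seg_type = fields[:5]
    if len(chain_id) != 1:
        raise ValueError(f"chain ID must be one character, got {chain_id!r}")
    length = int(fields[5]) if len(fields) == 6 else None
    return Segment(chain_id, int(index), start, end, seg_type, length)
```

The residue fields are kept as strings because a pdbres value can carry an insertion code as well as a number.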


Abbreviated DSSP file format

The "abbreviated" DSSP file has a name of the form "pdb-file-name.out" (e.g. "pdb2mhr.ent.out"), and traditionally lives in the /structure/dssp/ directory at BMERC. These files are easier for programs (and humans) to read and more compact than the raw DSSP output; all BMERC programs that read "DSSP files" actually read the abbreviated versions.

See [Kabsch,W., and Sander,C., Biopolymers 22(1983) 2577-2637], or the DSSP web pages for more information on DSSP, and how to obtain a copy. See the Description of the DSSP program page for details of how to run dssp and how to interpret the "raw" DSSP output format.

An abbreviated DSSP file has a single header line, followed by a series of records each with eight tab-delimited fields. There is one record per residue, plus extra records with an "amino acid" of "!" to denote chain breaks or transitions between chains. Some of these fields, described below, have subfields.

  1. the "pdbres" field followed by the chain ID (6 characters total). This serves to completely identify the residue within the PDB file, though normally the chain ID appears immediately before the pdbres (residue index and insertion code) field, as in the PDB ATOM records.
  2. the one-letter amino acid abbreviation (1 character).
  3. the DSSP structure designation codes (9 characters). See the "Output" section of the DSSP: Program description page for details of how to interpret these codes.
  4. the "BP1" field (4 digits) and
  5. "BP2" field (4 digits followed by 2 alphas), which identify the "bridge partner" to which a strand residue is hydrogen bonded. The numbers are 1-based residue indices which include chain breaks and encompass the whole file, allowing inter-chain hydrogen bonding to be described. To find the bridge partner residue in emacs, type "M-x goto-line RET" and then a number one greater than the one in the appropriate field -- one greater because of the header line.
  6. normalized solvent accessibility [explain. -- rgr, 28-Jan-97.]
  7. backbone phi and
  8. psi angles.
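
The bridge partner lookup described for the BP1 and BP2 fields can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that a value of 0 (or a blank field) means "no partner", which is not stated explicitly above. Skipping the header line is what turns the "one greater" line-number rule into a plain 1-based index into the records.

```python
import re
from typing import List


def leading_int(field: str) -> int:
    """Return the leading integer of a field such as '  12' or '  12AB' (0 if none)."""
    match = re.match(r"\s*(\d+)", field)
    return int(match.group(1)) if match else 0


def bridge_partner_records(lines: List[str], record_number: int) -> List[List[str]]:
    """Return the records named by the BP1/BP2 fields of the given record.

    record_number is 1-based and counts every record after the header,
    including the "!" chain-break records.
    """
    records = [ln.rstrip("\r\n").split("\t") for ln in lines[1:]]  # skip header
    fields = records[record_number - 1]
    partners = []
    for raw in (fields[3], fields[4]):  # fields 4 and 5: BP1 and BP2
        index = leading_int(raw)
        if index > 0:  # assumption: 0 means no bridge partner
            partners.append(records[index - 1])
    return partners
```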

DSSP state codes:

These are interpreted in the order given; if more than one applies, the first is chosen. (Based on [Kabsch&Sander].)

| Letter | Name | Definition |
| --- | --- | --- |
| H | Alpha helix (4-12) | Two or more consecutive bridge partners at i and i+4. |
| B | Isolated beta-bridge residue | Must not have a neighbor that qualifies it for H, E, G, or I status. Bridge partner is identified in BP1 or BP2 column. |
| E | Strand ("extended") | Has at least one bridge partner and at least one neighbor bridged in parallel or antiparallel. |
| G | 3-10 helix | Two or more consecutive bridge partners at i and i+3. |
| I | pi helix | Two or more consecutive bridge partners at i and i+5. |
| T | Turn | Bridge partner at i+3, i+4, or i+5, but no bridged neighbor that would qualify them for H, G, or I status. |
| S | Bend | Local curvature greater than 70 degrees, measured as the angle between alpha carbons at i-2, i, and i+2. |
| blank | None | Meets none of the criteria above. |

Note that the helical states (H, G, and I) need not consist of an unbroken series of bridged residues; a residue will qualify for a given helical state if it is flanked by at least two pairs of helically bridged residues. The first and last bridged residues are not considered part of the helix, however, which is why the minimum alpha helix length is four (residues i through i+3, with one bridge between i-1 and i+3, and another between i and i+4).
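
The "first applicable code wins" rule can be written down in a few lines of Python. This is only a sketch of the precedence rule: it assumes that some other step has already decided which states a residue qualifies for, and the names DSSP_PRIORITY and choose_state are hypothetical.

```python
# State codes in the order in which they are interpreted.
DSSP_PRIORITY = ["H", "B", "E", "G", "I", "T", "S"]


def choose_state(candidates):
    """Return the first code in priority order that the residue qualifies for."""
    for code in DSSP_PRIORITY:
        if code in candidates:
            return code
    return " "  # blank: meets none of the criteria


# A residue that qualifies both as a turn and as a bend is reported as a turn.
state = choose_state({"S", "T"})
```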

Backbone phi and psi angles, chirality, disulphide bonds, and solvent exposure are also computed, but do not affect the state code.

Known bugs:

  1. There is no room in this format to accommodate NMR structures. These typically use the PDB MODEL/ENDMDL construct to delimit two or more variations on the same structure, sometimes as many as 50 of them. [Should discuss this more fully under the generate-dssp program.] -- rgr, 28-Apr-97.


.nexp (exposure) file format

A ".nexp" file contains exposure values for all residues in a PDB model; it is a mapping of residues to non-negative real numbers. These numbers are usually interpreted as square Ångstroms, but the programs that create/use these files are free to agree on any interpretation they like. (The name ".nexp" probably originally meant "new exposure" format, since only DSSP format had been accepted before.)

The file consists of 50-character lines, one per residue, with fields as described below. Columns not mentioned must be spaces (though columns 10 through 43 are universally ignored).

  1. residue index (columns 1-5) -- the PDB sequence number and insertion code for the residue (blanks are significant).
  2. AA code (column 7) -- the standard one-letter amino acid abbreviation.
  3. chain ID (column 9) -- the PDB chain identifier.
  4. exposure value (columns 44-50) -- exposure value, usually in Fortran F7.3 format.

Old files may have more fields, not documented here; the extra whitespace is for compatibility. Note that old files will NOT have the chain ID, though the space that will be in that position should serve as such.
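
Because the layout is column-based, a reader and writer are short. The following Python sketch uses hypothetical function names and assumes every line is at least 50 characters long. It ignores columns 10 through 43 on input and writes them as spaces on output.

```python
def parse_nexp_line(line):
    """Return (residue index, AA code, chain ID, exposure) from one .nexp line."""
    line = line.rstrip("\r\n")
    if len(line) < 50:
        raise ValueError(f"expected a 50-character line, got {len(line)} characters")
    residue = line[0:5]              # columns 1-5; blanks are significant
    aa = line[6]                     # column 7
    chain_id = line[8]               # column 9 (a space in old files)
    exposure = float(line[43:50])    # columns 44-50
    return residue, aa, chain_id, exposure


def format_nexp_line(residue, aa, chain_id, exposure):
    """Build a 50-character line, writing the exposure in F7.3 style."""
    return f"{residue:5} {aa} {chain_id}" + " " * 34 + f"{exposure:7.3f}"
```

The residue index is returned unstripped so that it can be compared directly with the sequence number and insertion code columns of other files.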


Core file format

Core files are line-oriented, with fixed fields (Fortran style), as are the PDB files from which they are excerpted. Lines are of two formats:
  1. A secondary structure designator, which occupies the entire line and denotes the start of a new core segment; and
  2. A series of PDB-format ATOM records that describe the backbone of the core segment, plus beta carbons (including beta carbons "hallucinated" for glycine residues).
ATOM records must appear exactly as they did in the original PDB entry. In particular, the names of residues and their chain ID's, sequence numbers, and insertion codes are preserved in order to identify them properly with respect to the full PDB file, or other files derived from the PDB entry. This is important for correlation with files that may contain all chains, such as .nexp files, and hence require more than just the sequence number and insertion code in order to be uniquely identified. (needle also uses the residue names to reconstruct the native threading, by searching for the core sequence strings in the native sequence.)

Multiple chains may appear in a core file, but it is assumed that it makes sense to thread the entire model with a single sequence in the order in which the segments appear.

Residues within each core segment and the segments themselves within the file appear in amino to carboxyl order. The backbone must be contiguous, and each residue will have exactly five atoms that appear in the order N, CA, C, O, and CB. It is recommended that core files that are not generated automatically by make-core.pl be passed through filter-pdb-atoms.pl in order to guarantee these conditions. Use

 
	filter-pdb-atoms.pl -output cb -hcb -pass-through all \
		handmade.core > checked.core
to perform these checks, eliminate extra variants, and generate missing glycine beta carbons. (The -pass-through all option copies the segment designators to the output, but does not check them for validity.)
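
For readers who want to see what the five-atom condition amounts to, here is a rough Python sketch. It is not a replacement for filter-pdb-atoms.pl and says nothing about how that script works. It checks only the atom order within each residue, not backbone contiguity, and it treats any non-blank line that is not an ATOM record as a segment designator, which is an assumption. Column positions follow the standard PDB ATOM record layout; check_core_lines is a hypothetical name.

```python
from itertools import groupby

EXPECTED_ATOMS = ["N", "CA", "C", "O", "CB"]


def check_core_lines(lines):
    """Return (segment, residue key, atom names) for each residue that fails the check."""
    problems = []
    segment = None   # designator of the segment currently being read
    atoms = []       # (atom name, residue key) pairs for the current segment

    def check_segment():
        # Group consecutive ATOM records that belong to the same residue.
        for key, group in groupby(atoms, key=lambda a: a[1]):
            names = [name for name, _ in group]
            if names != EXPECTED_ATOMS:
                problems.append((segment, key, names))
        atoms.clear()

    for line in lines:
        if line.startswith("ATOM"):
            name = line[12:16].strip()                   # atom name, columns 13-16
            key = (line[17:20], line[21], line[22:27])   # residue name, chain ID, number + insertion code
            atoms.append((name, key))
        elif line.strip():
            check_segment()          # finish the previous segment
            segment = line.strip()   # start a new one
    check_segment()
    return problems
```

An empty result means every residue had exactly the atoms N, CA, C, O, and CB in that order.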


Internal "ss" format

This is a "masticated" version of the DSSP data, and is produced by dssp4.pl for the benefit of make-dssp-segs. One line is generated per H or E residue, with the following tab-delimited fields:
  1. the chain ID.
  2. the "pdbres" field.
  3. the 1-char amino acid abbreviation.
  4. the 1-based secondary structure index, assigned in chain order, with all chains and all secondary structures (both helices and strands) on the same numbering system.
  5. the first character of the smoothed DSSP secondary structure assignment (H or E).
  6. the residue indices of H-bonded strands when available, otherwise "" or 0 (two values with a tab in between). These indices are 1-based, and count all residues in all chains, including the "!" entries for chain breaks and terminations. (0 means a B or short E strand that was eliminated in the smoothing process.)
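
To show how the per-residue lines relate to segments, the following Python sketch groups consecutive lines that share a secondary structure index and reports the chain, index, first and last residue, type, and length of each group. This is an illustration of the data layout only. It is not a description of how make-dssp-segs works, and the function name is hypothetical.

```python
from itertools import groupby


def group_ss_residues(lines):
    """Collapse per-residue "ss" lines into one tuple per secondary structure."""
    rows = [ln.rstrip("\r\n").split("\t") for ln in lines if ln.strip()]
    segments = []
    # Field 4 (rows[n][3]) is the 1-based secondary structure index.
    for index, group in groupby(rows, key=lambda row: int(row[3])):
        group = list(group)
        first, last = group[0], group[-1]
        # (chain ID, index, first pdbres, last pdbres, type, length in residues)
        segments.append((first[0], index, first[1], last[1], first[4], len(group)))
    return segments
```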


Internal "core abstraction" file format

A "core abstraction" or .cab file consists of a series of records that describe a core with its residue exposure values, and optionally the loop lengths as well. This format was agreed upon by Lihua, Jadwiga, and myself on 19-Nov-98, and subsequently extended. The only program that produces this format is abstract-core.pl, and the only programs that use it are the various versions of the DSM generator, none of which are part of needle tools as such.

A core abstraction file is a series of lines, each of which contains one or more tab-delimited fields, with the first field being the record tag. All tags and string values are in lower case.

seg type length x1 y1 z1 x2 y2 z2
Starts a new segment. After the seg record, there must be exactly length res records before the next seg or loop record.
res exposure [ vv-exposure ]
Denotes the exposure for the residue. exposure may be the empty string if no EFA exposure was specified when the file was created. If no visible volume "exposure" was specified, then there will be no vv-exposure field.
loop loop-type loop-length
Describes a loop. Loops may appear at the beginning of the file (in which case the loop-type is "a" for the amino leader), after the last res record of one segment but before the next seg (in which case the loop-type is "i" for "internal"), or at the end of the file (in which case the loop-type is "c" for the carboxy trailer). Loop lengths of 0 are possible. Loop records are entirely optional (but all should be present if any are included).

As an example, here is the entire contents of 1puc.cab, a core of minimal (if not sub-minimum) size:


loop	a	10
seg	H	5	13.361	13.423	57.722	21.622	13.524	57.136
res	121.997
res	103.892
res	21.241
res	41.551
res	81.506
loop	i	51
seg	H	6	11.361	27.337	46.732	16.827	21.339	52.382
res	72.848
res	80.841
res	1.468
res	15.439
res	66.237
res	29.616
loop	c	33
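
A reader for this format can be sketched in Python as below. The function name and the dictionary keys are hypothetical. The six numeric fields of a seg record are stored as given, without interpretation. The sketch enforces the rule that each seg record is followed by exactly length res records, and it treats an empty exposure field as "not specified".

```python
def parse_cab(lines):
    """Parse .cab records into a list of ("loop", info) and ("seg", info) items."""
    items = []
    current = None  # the segment still collecting res records, if any

    def close_segment():
        if current is not None and len(current["res"]) != current["length"]:
            raise ValueError(
                f"segment declares {current['length']} residues "
                f"but has {len(current['res'])} res records"
            )

    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        tag = fields[0]
        if tag == "res":
            if current is None:
                raise ValueError("res record outside of a segment")
            exposure = float(fields[1]) if len(fields) > 1 and fields[1] else None
            vv_exposure = float(fields[2]) if len(fields) > 2 and fields[2] else None
            current["res"].append((exposure, vv_exposure))
        elif tag in ("seg", "loop"):
            close_segment()
            current = None
            if tag == "seg":
                current = {
                    "type": fields[1],
                    "length": int(fields[2]),
                    "xyz": [float(x) for x in fields[3:9]],  # x1 y1 z1 x2 y2 z2
                    "res": [],
                }
                items.append(("seg", current))
            else:
                items.append(("loop", {"type": fields[1], "length": int(fields[2])}))
        else:
            raise ValueError(f"unknown record tag {tag!r}")
    close_segment()
    return items
```

Applied to the 1puc.cab example above, this would yield five items: the amino leader loop, a five-residue H segment, an internal loop, a six-residue H segment, and the carboxy trailer loop.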


Bob Rogers <rogers@darwin.bu.edu>
Last modified: Thu Apr 5 17:37:03 EDT 2001