DSSP-related programs


These include programs for generating and using DSSP secondary structure definitions.

Table of contents

  1. DSSP-related programs
    1. Table of contents
    2. generate-dssp
    3. dssp4.pl
    4. make-dssp-segs
    5. dssp-ss-states.pl


generate-dssp

generate-dssp runs the DSSP program of Kabsch and Sander on a given PDB file and then postprocesses the result to produce abbreviated DSSP format output. Usually, it is not necessary to run this program at BMERC, since the output for each PDB locus is kept online in the /structure/dssp/pdblocus.ent.out file (with two important exceptions noted below).

Usage:

    generate-dssp locus < pdb-file > abbreviated-dssp-file

Arguments:

locus
arbitrary string that is used to generate file names, and by filter-pdb-atoms.pl for error messages.
pdb-file
PDB format input file.
abbreviated-dssp-file
abbreviated DSSP format output file.
The PDB input is first checked for overall "reasonableness." If no ATOM records for backbone N or C atoms are found, then either (a) the input is a CA trace, or (b) the input contains only nucleic acids, or (c) the input is not PDB format at all. In none of these cases can dssp produce meaningful results, so generate-dssp exits with the following diagnostic message (printed on the standard output, unfortunately):
    generate-dssp:  locus has no carbonyl atoms; can't run DSSP.
If these cases were not specifically detected and excluded by generate-dssp, then filter-pdb-atoms.pl would generate thousands of repetitive error messages, because its focus is much more limited.
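The "reasonableness" check described above can be sketched in a few lines of Python. This is only an illustration, not the code generate-dssp actually uses: the function name has_backbone_n_or_c and the sample records are hypothetical, and the sketch assumes standard fixed-column PDB ATOM records (record name in columns 1-6, atom name in columns 13-16).

```python
def has_backbone_n_or_c(pdb_lines):
    """Return True if any ATOM record names a backbone N or C atom."""
    for line in pdb_lines:
        # PDB fixed columns: record name in 1-6, atom name in 13-16.
        if line.startswith("ATOM  ") and line[12:16].strip() in ("N", "C"):
            return True
    return False


# Hypothetical sample records, for illustration only.
ca_trace = [
    "ATOM      1  CA  ALA A   1      11.104   6.134  -6.504  1.00  0.00           C",
    "ATOM      2  CA  GLY A   2      12.560   6.000  -6.300  1.00  0.00           C",
]
full_backbone = [
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N",
    "ATOM      2  CA  ALA A   1      12.560   6.000  -6.300  1.00  0.00           C",
    "ATOM      3  C   ALA A   1      13.000   4.600  -6.000  1.00  0.00           C",
]

print(has_backbone_n_or_c(ca_trace))       # False: a CA trace would be rejected
print(has_backbone_n_or_c(full_backbone))  # True
```

A CA trace, a nucleic-acid-only entry, and non-PDB input all fail this one test, which is why a single diagnostic covers all three cases.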

Then, filter-pdb-atoms.pl is used to preprocess the PDB input, possibly giving rise to warnings about missing backbone atoms and such; the DSSP error messages of that nature are suppressed. In fact, as long as the dssp program produces output, all messages are ignored, since they are not very useful.

On the other hand, if dssp fails to produce output (which usually indicates a bug), the working files are renamed to locus.filt for the filtered PDB input and locus.raw-dssp for the unprocessed output from the dssp program, the transcript is left in the locus.dssp.text file, and the following diagnostic message is printed (on the standard output, unfortunately):

    generate-dssp:  DSSP generated no output for locus

(If the output exists but is incorrect for some reason, then it will be necessary to run DSSP by hand in order to find out what went wrong.)

Finally, if all went well, the output is postprocessed by the util-parse-dssp.pl script (which is not documented) into abbreviated DSSP format. Most of the postprocessing consists of selecting the desired subset of fields from the raw DSSP output; the only nontrivial transformation is the normalization of exposure values. DSSP states each residue's exposure in terms of square Ångstroms, but the abbreviated format requires that these be normalized to a fraction of the effective maximum exposure, a dimensionless number between 0.0 and (nominally) 1.0. The maximum values used are shown below.

AA Maximum exposure Emax (square Ångstroms)
A 124
B 157.5 (average of N and D)
C 94
D 154
E 187
F 221
G 89
H 201
I 193
K 214
L 199
M 216
N 161
Q 192
R 244
S 113
T 151
V 169
W 264
Y 237
X 179
Z 189.5 (average of Q and E)

These values were computed by placing the indicated amino acid at the center of a 5-residue chain, flanked by two glycines on either side, then running molecular dynamics and sampling ten arbitrary conformations, computing the exposure, and picking the largest value. There are therefore two ways in which an exposure value observed for a PDB entry might be larger than expected:

  1. The PDB residue might just happen to have a conformation with slightly greater exposure than any of those sampled; or
  2. If at the extreme end of the chain, the exposed backbone atoms will contribute more than expected to the exposure.
This explains why some of the "normalized" exposure values can be larger than 1.0.
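The normalization described above amounts to a table lookup and a division. The following Python sketch uses the maximum values from the table; the name normalize_exposure is hypothetical, and falling back to the X value for letters not in the table is an assumption of this sketch, not documented behavior of util-parse-dssp.pl.

```python
# Maximum exposure values (square Angstroms) copied from the table above.
EMAX = {
    "A": 124.0, "B": 157.5, "C": 94.0, "D": 154.0, "E": 187.0, "F": 221.0,
    "G": 89.0, "H": 201.0, "I": 193.0, "K": 214.0, "L": 199.0, "M": 216.0,
    "N": 161.0, "Q": 192.0, "R": 244.0, "S": 113.0, "T": 151.0, "V": 169.0,
    "W": 264.0, "Y": 237.0, "X": 179.0, "Z": 189.5,
}


def normalize_exposure(aa, area):
    """Convert a raw DSSP exposure (square Angstroms) to a fraction of Emax."""
    # Assumption: letters missing from the table are treated like X.
    emax = EMAX.get(aa, EMAX["X"])
    return area / emax


print(normalize_exposure("A", 62.0))   # 0.5
print(normalize_exposure("G", 89.0))   # 1.0
# A residue more exposed than any sampled conformation exceeds 1.0.
print(normalize_exposure("W", 290.4) > 1.0)  # True
```

No clamping is applied, so values above 1.0 pass through unchanged, consistent with the two cases listed above.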

There are two classes of PDB entries for which abbreviated DSSP files do not already exist online at BMERC.

Not much can be done about the first class (entries that lack complete backbone atoms), since DSSP requires all backbone atoms in order to determine the pattern of hydrogen bonding. The second class (entries that contain multiple models) does not affect generate-dssp, since it passes the PDB entry through filter-pdb-atoms.pl anyway, and filter-pdb-atoms.pl arbitrarily picks model 1 to pass to the dssp program. This works, though different models need not give rise to the exact same secondary structure patterns, and it is not necessarily true that model 1 is the best representative of the ensemble. But that's the best that can be done for now, since the abbreviated DSSP format (and the output format of the dssp program on which it is based) has no room for specifying multiple models.

Known bugs:

  1. *** generate-dssp diagnostics appear on the standard output. This results in a one-line "abbreviated DSSP" file, which fakes dssp-ss-states.pl into believing that the structure is all loop. But I haven't bothered to fix this because (I confess) it is useful in structure mask generation. -- rgr, 28-Apr-97.
  2. Exposure for cysteine residues is normalized incorrectly. Instead of using the cysteine value in the table above, it (in effect) normalizes to an arbitrary amino acid, resulting in values that are usually too low. [Fixed in Release 1.3. -- rgr, 2-Nov-98.]


dssp4.pl

dssp4.pl converts an abbreviated DSSP format file to a "smoothed DSSP" file, using the scheme developed by Jim White and Temple Smith. Several output formats are available.

(Originally, this was the dssp3.m Matlab script written by Jim White, recoded into perl (in 1995?) by R. Mark Adams, and revised as dssp4.pl in June and July 1996 by Bob Rogers. [With a great number of options added subsequently. -- rgr, 28-Apr-97.])

Usage:


    dssp4.pl [-locus name] [-chain L] [-min-strand-length slen]
	[-min-helix-length hlen] [-keep-short-strands] [ -t | -pdb | -ss ]
	[-dont-fill-h-gaps] [-dont-fill-e-gaps] [filename]

Processing arguments:

-locus name
name by which to identify the structure. Used primarily for warning messages.
-chain L (a single character)
output results only for the PDB chain identifier L. If not specified, all chains found in the DSSP file are processed; to get only the "main" chain, you must explicitly say "-chain _".
-turns-are-loops
specifies that all turns are to be converted to loops. This step is done first.
-dont-fill-h-gaps
specifies that one-residue loops between neighboring helices are not to be altered. The default is to convert these residues to H's.
-dont-fill-e-gaps
specifies that one-residue loops between neighboring strands are not to be altered. The default is to convert these residues to E's.
-min-helix-length hlen (integer)
specifies the minimum length for helices; default is 5. After other "smoothing" operations, if any helices are shorter than this, their secondary structure assignments are changed to 'L'. [The default is BMERC policy for core segments, but is not compatible with Release 0 of the needle tools suite. -- rgr, 28-Apr-97.]
-min-strand-length slen (integer)
specifies the minimum length for strands; default is 3. After other "smoothing" operations, if any strands are shorter than this, their secondary structure assignments are changed to 'L'. Note that this can leave degenerate sheets.
-keep-short-strands
used for backward compatibility; equivalent to -min-strand-length 2, which is the preferred idiom. (DSSP does not produce strands shorter than 2.)
-clean
specifies that an extra "cleanup" postpass should be done; see below.

Input/output arguments:

The -t, -pdb, and -ss options are mutually exclusive.

-t
indicates to output the data in table (.tbl) format with three name-and-value entries (one per line):
  1. locus.aa and the sequence string;
  2. locus.ss and the smoothed secondary structure as a string of DSSP secondary structure characters; and
  3. locus.ac and solvent accessibility as a string of digits from 0-9.
All three values have identical length (which may be shorter than the actual sequence due to unresolved residues). Chain breaks/terminations are represented explicitly as "!" residues in each string (which means it is possible for the strings to be longer than the sequence).
-pdb
requests output as fake PDB HELIX and SHEET records. Does not do a very thorough job of faking it, though; see the bug description below. [This feature may be removed in the future. -- rgr, 26-Aug-97.]
-ss
generate "ss" format for use by make-dssp-segs in making a table of secondary structures. Because the hydrogen bonding information is meaningless without processing the entire DSSP file, it is an error to specify both -chain and -ss. Instead, grep through the full output.
filename
name of an abbreviated DSSP format input file. If not specified (or specified as "-"), the standard input is read. (Specifying more than one DSSP file at a time will not work.)
-verbose
if specified, an information message is printed to the standard error stream notifying of changes to secondary structure assignments (i.e. H or E to something else, and vice versa).
In the default output format, one tab-delimited line is generated per residue, each of which consists of the residue identifier in DSSP format (the "pdbres" field followed immediately by the chain ID character), the amino acid one-letter code, the (more or less complete) secondary structure field, and finally the exposure, phi, and psi fields as they appeared in the DSSP input.
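A reader for the default output format might look like the Python sketch below. It assumes exactly six tab-separated fields in the order given above and keeps every field as a string, since the exact numeric formatting is not documented here. The names Residue and parse_residue_line, and the sample line, are hypothetical.

```python
from typing import NamedTuple


class Residue(NamedTuple):
    resid: str     # "pdbres" field followed by the chain ID character
    aa: str        # amino acid one-letter code
    ss: str        # secondary structure field
    exposure: str  # exposure, as it appeared in the DSSP input
    phi: str
    psi: str


def parse_residue_line(line):
    """Split one tab-delimited output line into its six fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError("expected 6 tab-separated fields, got %d" % len(fields))
    return Residue(*fields)


# Hypothetical sample line, for illustration only.
res = parse_residue_line("  12A\tK\tH\t0.42\t-57.0\t-47.0\n")
print(res.aa, res.ss)  # K H
```

Splitting on tabs only (rather than on any whitespace) matters because the secondary structure field can itself contain blanks.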

In order to generate "smoothed DSSP," these steps are performed in the following order:

  1. All S's, B's, and I's are changed to loops. If -turns-are-loops was specified, T's are also changed to loops (which makes steps 3, 4, and 7 moot).
  2. Up to three consecutive G's (3-10 helix residues) at the end of a helix are changed to H's (i.e. the helix is extended in the carboxy direction). If the -verbose option was specified, a message is printed for each such helix that is extended.
  3. Two consecutive residues labelled "T >4" at the start or end of a helix are also turned into H's. If the -verbose option was specified, a message is printed for each such helix that is extended.
  4. If two consecutive residues are labelled "T 3", the first is relabelled "T 2", and flanking residues are labelled "T 1" and "T 4" (but only if they are not helix or strand residues).
  5. Loops of length 1 between two helix residues (resp. two strand residues) are filled in so as to join the two segments, unless the -dont-fill-h-gaps (resp. -dont-fill-e-gaps) option was specified. If -verbose was specified, a message is printed for each pair of secondary structures joined.
  6. Strands and helices with less than the specified minimum length are turned into loops. If the -verbose option was specified, a message is printed for each such secondary structure that is eliminated.
  7. Turns labelled "T 1" through "T 4" are relabelled V, W, X, and Y, respectively.
  8. If the -clean option was specified, all residues that are not labelled one of "HELVWXY" (i.e. helix, strand, loop, or turn), and residues in turns other than those consecutively labelled "VWXY" or "WX", are all turned into loops. In particular, all T and G residues ("random" turns and 3-10 helices, respectively) are turned into loops.
All such steps are effectively done in separate passes, so they don't interfere.
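Some of the steps above can be sketched as successive passes over a string of one-letter states. The Python below is a simplified illustration, not a port of dssp4.pl: it covers only steps 1, 2, 5, and 6, skips the turn handling of steps 3, 4, and 7 and the -clean pass of step 8 (which need more than the summary letter), and assumes loops are already written as 'L'. The function names are hypothetical.

```python
import re


def _drop_short(ss, letter, min_len):
    """Turn runs of `letter` shorter than min_len into loops (step 6)."""
    def fix(match):
        run = match.group()
        return run if len(run) >= min_len else "L" * len(run)
    return re.sub(letter + "+", fix, ss)


def smooth(ss, min_helix=5, min_strand=3, fill_h=True, fill_e=True,
           turns_are_loops=False):
    """Apply a simplified subset of the smoothing passes to a state string."""
    # Step 1: S, B, and I become loops (T too, if requested).
    ss = re.sub("[SBIT]" if turns_are_loops else "[SBI]", "L", ss)
    # Step 2: up to three G's at the carboxy end of a helix become H.
    ss = re.sub("(?<=H)G{1,3}", lambda m: "H" * len(m.group()), ss)
    # Step 5: fill one-residue loops between helix (resp. strand) residues.
    if fill_h:
        ss = re.sub("(?<=H)L(?=H)", "H", ss)
    if fill_e:
        ss = re.sub("(?<=E)L(?=E)", "E", ss)
    # Step 6: eliminate helices and strands below the minimum lengths.
    ss = _drop_short(ss, "H", min_helix)
    ss = _drop_short(ss, "E", min_strand)
    return ss


raw = "LHHHHGGLLEELEEESLLHHL"
print(smooth(raw))                # LHHHHHHLLEEEEEELLLLLL
print(smooth(raw, fill_e=False))  # LHHHHHHLLLLLEEELLLLLL
```

Because each pass runs over the result of the previous one, the order matters: in the second call the two-residue strand is left unjoined by step 5 and is therefore eliminated by step 6.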

The following table in the perl script describes the resulting single-letter secondary structure states in the output. [The numbers apparently refer to the internal encoding used by the original Matlab ".m" file.]

    STATE         ID NUMBER   CODE   -clean CODE

    helix             12       H          H
    strand            14       E          E
    3-10 helix        17       G          L
    T1                 4       V          V
    T2                 5       W          W
    T3                 6       X          X
    T4                 7       Y          Y
    loop               8       L          L
    random turn                T          L
[The "random turn" T states are leftovers from the processing algorithm; I have left them in for fidelity with the original. Use the -clean option to get rid of them. -- rgr, 18-Jul-96.]

Note: DSSP includes N-1 chain terminus "residues" with an amino acid code of "!" for PDB entries with N chains. These "residues" have " " for a secondary structure assignment, and they also appear in cases where the chain is broken by missing residues (e.g. between residues 10 and 20 on chain 1 of 2plv).  [The documentation for the "new" DSSP output format (July 1995) says that interchain discontinuities are denoted "!*", but I have yet to see this. -- rgr, 2-Nov-98.]  dssp4.pl treats these "residues" as loops in the processing steps described above, but it also changes the secondary structure and exposure to "!" to avoid recognizing them as real loop residues. This "!" (or "!*") is output, so those who use the output of this program should be aware.

Known bugs:

  1. If a specific chain is specified, all chain break delimiters ("!") are dropped, including those that identify chain breaks within the selected chain. [Fixed in Release 1.0. -- rgr, 9-Apr-97.]
  2. *** When short strands are eliminated, the resulting output can contain degenerate sheets consisting of only a single strand each. This happens when a strand of length 3 or greater is H-bonded only to eliminated strands.
  3. *** When asked to do -pdb format output, the only "sheets" that are generated consist of single strands. This is not sufficient for (e.g.) make-dssp-segs (though make-dssp-segs does not actually accept -pdb format input). [-- rgr, 18-Jun-96.] [This feature may be removed in the future. -- rgr, 26-Aug-97.]
  4. Some operations, particularly the -clean option, trash information in the secondary structure field other than the summary first letter. -- rgr, 26-Jul-97. [Fixed in Release 1.0. -- rgr, 26-Aug-97.]
  5. If told to select "-chain ' '", dssp4.pl will include irrelevant chain break "residues" (see above). [Fixed in Release 1.3. -- rgr, 2-Nov-98.]
  6. dssp4.pl does not complain about missing chains (i.e. if told to process a chain that does not exist). [Fixed in Release 1.3. -- rgr, 2-Nov-98.]


make-dssp-segs

Runs dssp4.pl and make-ss-designations.lisp to turn the abbreviated DSSP file named as the first argument into the table of secondary structures named as the second argument.

Usage:

    make-dssp-segs dssp-file seg-file

Note: This is equivalent to the following:

    dssp4.pl -dont-fill-e-gaps -ss dssp-file \
	| make-ss-designations seg-file
This idiom is effectively what make-core-depends.pl produces; it exposes the internals of make-dssp-segs in order to give control over the options passed to the dssp4.pl program. make-dssp-segs is now deprecated on that account. -- rgr, 17-Sep-97.

Arguments:

dssp-file
name of an abbreviated DSSP format file. At BMERC, such files for most PDB entries are available as /structure/dssp/pdbxxxx.ent.out for locus xxxx. Others can be made via generate-dssp (q.v.).
seg-file
name of the segment definition format file on which to write output. [The current, confusing convention is to use the extension ".dssp" because that is the source of the SS data. -- rgr, 12-Aug-96.]
Note that there is no chain argument. This is because one cannot generate the correct designations for sheets versus barrels without looking at the entire structure; not only is it necessary to have the whole DSSP file to make sense of the strand-strand bonding information, but sheets and even barrels can cross two or more chains.

make-dssp-segs passes the -dont-fill-e-gaps option to dssp4.pl, defaulting all others. There is no way to alter these defaults, other than using the make-dssp-segs alternative described above.


dssp-ss-states.pl

dssp-ss-states.pl uses 'smoothed DSSP' output from the -t option to the dssp4.pl program together with the full protein sequence specified on the command line to produce a string of secondary structure assignment letters, one for each residue, with gaps in the abbreviated DSSP file assigned to 'L' states.

Usage:


    dssp-ss-states.pl [-locus name] [-chain L] [-min-strand-length slen]
	[-min-helix-length hlen] [-dont-fill-h-gaps] [-dont-fill-e-gaps]
        full-sequence [dssp-filename]

Arguments:

-locus name
name by which to identify the structure. Passed directly to dssp4.pl, which uses it primarily for warning messages.
-chain chain-letter (a single character)
letter (or sometimes a digit) identifying the chain to select; defaults to "_" (equivalent to " " in PDB terms).
-dont-fill-h-gaps
specifies that one-residue loops between neighboring helices are not to be altered. The default is to convert these residues to H's. Passed directly to dssp4.pl.
-dont-fill-e-gaps
specifies that one-residue loops between neighboring strands are not to be altered. The default is to convert these residues to E's. Passed directly to dssp4.pl.
-min-helix-length hlen (integer)
specifies the minimum length for helices; default is 1. Passed directly to dssp4.pl.
-min-strand-length slen (integer)
specifies the minimum length for strands; default is 1. Passed directly to dssp4.pl.
-clean
specifies that an extra "cleanup" postpass should be done. Passed directly to dssp4.pl.
-3
produces three-state output. This is done by turning everything that is not an H or E into an L.
-no-gap-p
produces a binary map showing where gaps are not permitted when doing a structure-aware alignment. All L's and the first and last H and E of every segment are turned into 0 (the digit zero), and all other H and E states are turned into 1 (the digit one).
-verbose
enables debugging output (none available at present).
full-sequence
the full SEQRES sequence (not the name of a sequence file!) for the PDB entry, as a string of uppercase AA letters with no embedded blanks.
dssp-filename
the name of an abbreviated DSSP format file to pass to dssp4.pl; defaults to the standard input. "-" may also be used to specify the standard input explicitly.
Note that full-sequence and dssp-filename need not come last; they can be mixed with the other options as long as full-sequence is present, and, if both are present, full-sequence comes before dssp-filename.
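The -3 and -no-gap-p transformations described above are simple string rewrites. The Python sketch below illustrates them; the function names are hypothetical, and treating every state other than H or E as a loop in the gap map is an assumption (the description above speaks only of L, H, and E).

```python
import re


def three_state(ss):
    """-3: everything that is not H or E becomes L."""
    return re.sub("[^HE]", "L", ss)


def no_gap_map(ss):
    """-no-gap-p: 1 marks interior H/E residues, 0 marks everything else."""
    out = []
    # Walk the string one segment (run of H, run of E, or other) at a time.
    for match in re.finditer("H+|E+|[^HE]+", ss):
        seg = match.group()
        if seg[0] in "HE" and len(seg) > 2:
            # First and last residue of the segment are 0, the interior is 1.
            out.append("0" + "1" * (len(seg) - 2) + "0")
        else:
            out.append("0" * len(seg))
    return "".join(out)


states = "LHHHHLEEEVWXYL"
print(three_state(states))  # LHHHHLEEELLLLL
print(no_gap_map(states))   # 00110001000000
```

Segments of one or two residues have no interior, so they map entirely to zeros.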

dssp-ss-states.pl produces a string on the standard output that is exactly as long as the full-sequence argument, but with each amino acid letter replaced with a letter that describes its "smoothed DSSP" secondary structure assignment. (See above for a description of the letters used.) Portions of the full sequence that are missing from the DSSP description (usually because they are not resolved in the crystal structure) are filled in with loop ('L') characters.

The comparison of full-sequence to the sequence in the DSSP file is done by the globalS program. For most purposes, this is akin to swatting a fly with a sledgehammer, but given the problems with some of the PDB entries, having a sledgehammer comes in quite handy. Aside from inconsistencies in the PDB, DSSP will drop residues with incomplete backbones that other tools will consider present in terms of ATOM records, or include residues that pdb-domain-seq.pl doesn't see by default. And DSSP seems to make chain continuity errors on occasion, inserting gaps for stretched peptide bonds, and leaving them out when the start and end of the gap happen to be close. Some of these cases lead to mismatches that are hard to correct with less general sequence comparison tools.
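The gap filling itself is easy to illustrate once the two sequences are aligned. The Python sketch below does not reproduce globalS; it substitutes the standard library's difflib.SequenceMatcher as a much weaker stand-in, which handles only clean deletions like the one in the example. The function name fill_gaps and the sample sequences are hypothetical.

```python
from difflib import SequenceMatcher


def fill_gaps(full_seq, dssp_seq, dssp_ss):
    """Spread dssp_ss over full_seq, assigning 'L' to unmatched residues."""
    if len(dssp_seq) != len(dssp_ss):
        raise ValueError("DSSP sequence and structure strings differ in length")
    out = ["L"] * len(full_seq)
    # Stand-in for globalS: exact matching blocks between the two sequences.
    matcher = SequenceMatcher(None, full_seq, dssp_seq, autojunk=False)
    for a, b, size in matcher.get_matching_blocks():
        out[a:a + size] = dssp_ss[b:b + size]
    return "".join(out)


# Residues YIP of the full sequence are missing from the DSSP sequence.
print(fill_gaps("MKTWYIPGQR", "MKTWGQR", "LHHHEEL"))  # LHHHLLLEEL
```

The output is always as long as the full sequence, whatever the quality of the match; a real alignment program earns its keep on the mismatched and misindexed entries described above.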

Generally speaking, the result of this process may not be "correct" in a biological sense, but at least it is consistent (in the database sense) with both the structure and the full protein sequence, which is what the needle tools care about.

(The version of dssp-ss-states.pl prior to Release 1.5 was completely dependent on having the SEQRES sequence match up with the DSSP sequence in order to fill in gaps in the chain. As a result, it was very sensitive to mismatches, usually errors in the PDB entry's SEQRES records.)

Known bugs:

  1. 1bll has a chain break between residues "LYS E 11 " and "GLU E 15 ", but since it is only 1.84Å, DSSP doesn't flag it as such, and dssp-ss-states.pl doesn't realize there needs to be an insertion at that point. -- rgr, 14-Apr-97. [Fixed in Release 1.5 by the globalS implementation. -- rgr, 29-Oct-99.]
  2. 1tre chain A has an error in PDB residue indexing that confuses DSSP, causing it to put a gap in the wrong place, which confuses dssp-ss-states.pl in turn. -- rgr, 24-Apr-97. [Fixed in Release 1.5 by the globalS implementation. -- rgr, 29-Oct-99.]
  3. *** Use of positional arguments for full-sequence and dssp-filename is stupid. -- rgr, 18-May-98.


Bob Rogers <rogers@darwin.bu.edu>
Last modified: Fri Mar 9 10:47:37 EST 2001