DSSP-related programs
These include programs for generating and using DSSP secondary structure
definitions.
Table of contents
- DSSP-related programs
- Table of contents
- generate-dssp
- dssp4.pl
- make-dssp-segs
- dssp-ss-states.pl
generate-dssp
generate-dssp runs the
DSSP program of Kabsch and Sander on a given PDB file and then
postprocesses the result to produce abbreviated DSSP
format output. Usually, it is not necessary to run this program at
BMERC, since the output for each PDB entry (identified by its locus) is kept online
in the /structure/dssp/pdblocus.ent.out file (with two
important exceptions noted
below).
Usage:
generate-dssp locus < pdb-file > abbreviated-dssp-file
Arguments:
- locus
- arbitrary string that is used to generate file names, and by
filter-pdb-atoms.pl for error messages.
- pdb-file
- PDB format input file.
- abbreviated-dssp-file
- abbreviated
DSSP format output file.
The PDB input is first checked for overall "reasonableness." If no
ATOM records for backbone N or C atoms are found, then
either (a) the input is a CA trace, or (b) the input contains only
nucleic acids, or (c) the input is not PDB format at all. In none of
these cases can dssp produce meaningful results, so
generate-dssp exits with the following diagnostic message
(printed on the standard output, unfortunately):
generate-dssp: locus has no carbonyl atoms; can't run DSSP.
If these cases were not specifically detected and excluded by
generate-dssp, then filter-pdb-atoms.pl would generate
thousands of repetitive error messages, because its focus is much more
limited.
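The "reasonableness" check described above can be approximated as follows. This is a minimal Python sketch, not the code that generate-dssp actually uses; the function name has_backbone_atoms is hypothetical, and the sketch assumes standard fixed-column PDB ATOM records with the atom name in columns 13-16. It treats the input as reasonable if at least one ATOM record names a backbone N or C atom.

```python
def has_backbone_atoms(pdb_lines):
    """Return True if any ATOM record names a backbone N or C atom.

    Assumes fixed-column PDB format: record name in columns 1-6 and
    atom name in columns 13-16 (0-based slices [0:6] and [12:16]).
    """
    for line in pdb_lines:
        if line[0:6] != "ATOM  ":
            continue  # not an ATOM record (HETATM, REMARK, garbage, ...)
        if line[12:16].strip() in ("N", "C"):
            return True
    return False


# Made-up records for illustration: a CA trace has ATOM records,
# but none of them is for a backbone N or C atom.
ca_trace = [
    "ATOM      1  CA  ALA A   1      11.104  13.207   2.100  1.00 20.00",
    "ATOM      2  CA  GLY A   2      14.500  12.000   3.300  1.00 20.00",
]
full_backbone = [
    "ATOM      1  N   ALA A   1      10.000  13.000   2.000  1.00 20.00",
    "ATOM      2  CA  ALA A   1      11.104  13.207   2.100  1.00 20.00",
    "ATOM      3  C   ALA A   1      12.000  14.000   2.500  1.00 20.00",
]
print(has_backbone_atoms(ca_trace))       # False
print(has_backbone_atoms(full_backbone))  # True
```

Input that contains only nucleic acids, or that is not PDB format at all, fails the same test, because neither contains an ATOM record with a backbone N or C atom name.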
Then, filter-pdb-atoms.pl
is used to preprocess the PDB input, possibly giving rise to warnings
about missing backbone atoms and such; the DSSP error messages of that
nature are suppressed. In fact, as long as the dssp program
produces output, all messages are ignored, since they are not very
useful.
On the other hand, if dssp fails to produce output (which
usually indicates a bug), the working files are renamed:
the filtered PDB input to locus.filt, and
the unprocessed output from the dssp program to
locus.raw-dssp. The transcript is left in the
locus.dssp.text file, and the diagnostic message:
generate-dssp: DSSP generated no output for locus
is printed (on the standard output, unfortunately). (If the output
exists but is incorrect for some reason, then it will be necessary to
run DSSP by hand in order to find out what went wrong.)
Finally, if all went well, the output is postprocessed by the
util-parse-dssp.pl script (which is not documented) into abbreviated DSSP
format. Most of the postprocessing consists of selecting the
desired subset of fields from the raw DSSP output; the only nontrivial
transformation is the normalization of exposure values. DSSP states
each residue's exposure in terms of square Ångstroms, but the
abbreviated format requires that these be normalized to a fraction of
the effective maximum exposure, a dimensionless number between 0.0 and
(nominally) 1.0. The maximum values used are shown below.
| AA | Maximum exposure Emax (square Ångstroms) |
| A | 124 |
| B | 157.5 (average of N and D) |
| C | 94 |
| D | 154 |
| E | 187 |
| F | 221 |
| G | 89 |
| H | 201 |
| I | 193 |
| K | 214 |
| L | 199 |
| M | 216 |
| N | 161 |
| Q | 192 |
| R | 244 |
| S | 113 |
| T | 151 |
| V | 169 |
| W | 264 |
| Y | 237 |
| X | 179 |
| Z | 189.5 (average of Q and E) |
These values were computed by placing the indicated amino acid at the
center of a 5-residue chain, flanked by two glycines on either side,
then running molecular dynamics and sampling ten arbitrary
conformations, computing the exposure, and picking the largest value.
There are therefore two ways in which an exposure value observed for a
PDB entry might be larger than expected:
- The PDB residue might just happen to have a conformation with
slightly greater exposure than any of those sampled; or
- If at the extreme end of the chain, the exposed backbone atoms
will contribute more than expected to the exposure.
This explains why some of the "normalized" exposure values can be larger
than 1.0.
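The normalization itself is a single division. The following Python sketch copies the Emax values from the table above; the function name normalize_exposure is hypothetical, and falling back to the X value for codes that do not appear in the table is an assumption of this sketch, not documented behavior of util-parse-dssp.pl.

```python
# Maximum exposure values (square Angstroms), copied from the table above.
E_MAX = {
    "A": 124, "B": 157.5, "C": 94, "D": 154, "E": 187, "F": 221,
    "G": 89, "H": 201, "I": 193, "K": 214, "L": 199, "M": 216,
    "N": 161, "Q": 192, "R": 244, "S": 113, "T": 151, "V": 169,
    "W": 264, "Y": 237, "X": 179, "Z": 189.5,
}


def normalize_exposure(aa, exposure):
    """Return exposure as a fraction of the maximum for amino acid aa.

    Codes missing from the table use the 'X' value; that fallback is
    an assumption made for this sketch.
    """
    return exposure / E_MAX.get(aa.upper(), E_MAX["X"])


print(normalize_exposure("A", 62.0))          # 0.5
print(normalize_exposure("G", 100.0) > 1.0)   # True: larger than the sampled maximum
```

The second call shows how a residue whose observed exposure exceeds the sampled maximum ends up with a "normalized" value above 1.0.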
There are two classes of PDB entries for which abbreviated DSSP files
do not already exist online at BMERC.
- The entry has an incomplete set of
ATOM records for backbone atoms, as for a CA trace.
- The entry contains multiple models bracketed between
MODEL and
ENDMDL records. The dssp program by itself
does not recognize these records, and superimposes ALL atoms from
ALL models.
Not much can be done about the first case, since DSSP requires all
backbone atoms in order to determine the pattern of hydrogen bonding.
The second class does not affect generate-dssp, since it passes
the PDB entry through
filter-pdb-atoms.pl anyway, and
filter-pdb-atoms.pl arbitrarily picks model 1 to pass to the
dssp program. This works, though different models need not
give rise to the exact same secondary structure patterns, and it is not
necessarily true that model 1 is the best representative of the
ensemble. But that's the best that can be done for now, since the abbreviated DSSP
format (and the output format of the
dssp program on which it is based) has no room for
specifying multiple models.
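Model selection of this kind can be sketched in a few lines of Python. This is not the filter-pdb-atoms.pl implementation, which is not shown here; the function name keep_first_model is hypothetical, and the sketch assumes that "model 1" is simply the first MODEL/ENDMDL block encountered. It drops the MODEL and ENDMDL records themselves and passes through everything outside the later blocks.

```python
def keep_first_model(pdb_lines):
    """Yield lines outside any MODEL/ENDMDL block, plus the first block's lines."""
    models_seen = 0
    in_model = False
    for line in pdb_lines:
        record = line[0:6].strip()
        if record == "MODEL":
            models_seen += 1
            in_model = True
            continue  # drop the MODEL record itself
        if record == "ENDMDL":
            in_model = False
            continue  # drop the ENDMDL record itself
        if in_model and models_seen > 1:
            continue  # skip atoms belonging to later models
        yield line


# Abbreviated, made-up records for illustration only.
entry = [
    "HEADER", "MODEL        1", "ATOM a", "ENDMDL",
    "MODEL        2", "ATOM b", "ENDMDL", "END",
]
print(list(keep_first_model(entry)))  # ['HEADER', 'ATOM a', 'END']
```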
Known bugs:
- *** generate-dssp diagnostics appear on the standard
output. This results in a one-line "abbreviated DSSP" file,
which fakes dssp-ss-states.pl into
believing that the structure is all loop. But I haven't bothered
to fix this because (I confess) it is useful in structure mask
generation. -- rgr, 28-Apr-97.
- Exposure for cysteine residues is normalized incorrectly.
Instead of using the cysteine value in the table above, it (in
effect) normalizes to an arbitrary amino acid, resulting in
values that are usually too low. [Fixed in Release 1.3. -- rgr,
2-Nov-98.]
dssp4.pl
dssp4.pl converts an abbreviated DSSP
format file to a "smoothed DSSP" file, using the scheme developed by
Jim White and Temple Smith. Several output formats are available.
(Originally, this was the dssp3.m Matlab script written by Jim White,
recoded into perl (in 1995?) by
R. Mark Adams, and revised as dssp4.pl in June and July 1996 by
Bob Rogers. [With a great
number of options added subsequently. -- rgr, 28-Apr-97.])
Usage:
dssp4.pl [-locus name] [-chain L] [-min-strand-length slen]
[-min-helix-length hlen] [-keep-short-strands] [ -t | -pdb | -ss ]
[-dont-fill-h-gaps] [-dont-fill-e-gaps] [-turns-are-loops] [-clean]
[-verbose] [filename]
Processing arguments:
- -locus name
- name by which to identify the structure. Used primarily for
warning messages.
- -chain L (a single character)
- output results only for the PDB chain identifier
L. If not specified, all chains found in the DSSP file
are processed; to get only the "main" chain, you must explicitly
say "-chain _".
- -turns-are-loops
- specifies that all turns are to be converted to loops. This step
is done first.
- -dont-fill-h-gaps
- specifies that one-residue loops between neighboring helices are
not to be altered. The default is to convert these residues to
H's.
- -dont-fill-e-gaps
- specifies that one-residue loops between neighboring strands are
not to be altered. The default is to convert these residues to
E's.
- -min-helix-length hlen (integer)
- specifies the minimum length for helices; default is 5. After
other "smoothing" operations, if any helices are shorter than
this, their secondary structure assignments are changed to 'L'.
[The default is BMERC policy for core segments, but is not
compatible with Release 0 of the needle tools suite. --
rgr, 28-Apr-97.]
- -min-strand-length slen (integer)
- specifies the minimum length for strands; default is 3. After
other "smoothing" operations, if any strands are shorter than
this, their secondary structure assignments are changed to 'L'.
Note that this can leave
degenerate sheets.
- -keep-short-strands
- used for backward compatibility; equivalent to
-min-strand-length 2, which is the preferred idiom.
(DSSP does not produce strands shorter than 2.)
- -clean
- specifies that an extra "cleanup" postpass should be done; see below.
Input/output arguments:
The -t, -pdb, and -ss options are mutually
exclusive.
- -t
- indicates to output the data in table (.tbl) format
with three name-and-value entries (one per line):
- locus.aa and the sequence string;
- locus.ss and the smoothed secondary
structure as a string of DSSP secondary structure
characters; and
- locus.ac and solvent accessibility as a
string of digits from 0-9.
All three values have identical length (which may be shorter than
the actual sequence due to unresolved residues). Chain
breaks/terminations are represented explicitly as "!" residues in
each string (which means it is possible for the strings to be
longer than the sequence).
- -pdb
- requests output as fake PDB HELIX and SHEET records. Does not do
a very thorough job of faking it, though; see the bug description below. [This
feature may be removed in the future. -- rgr, 26-Aug-97.]
- -ss
- generates "ss" format for
use by make-dssp-segs in making a
table of secondary structures. Because the hydrogen bonding
information is meaningless without processing the entire DSSP
file, it is an error to specify both -chain and
-ss. Instead, grep through the full output.
- filename
- name of an
abbreviated DSSP format input file. If not specified (or
specified as "-"), the standard input is read.
(Specifying more than one DSSP file at a time will not work.)
- -verbose
- if specified, an informational message is printed to the standard
error stream notifying of changes to secondary structure
assignments (i.e. H or E to something else, and vice versa).
In the default output format, one
tab-delimited line is generated per residue, each of which consists of
the residue identifier in DSSP format (the "pdbres" field followed immediately by
the chain ID character), the amino acid one-letter code, the (more or less complete) secondary
structure field, and finally the exposure, phi, and psi fields as they
appeared in the DSSP input.
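A consumer of the default output might split each line as follows. This is a hedged Python sketch: the names ResidueLine and parse_residue_line are hypothetical, the sample line is made up, and the only facts taken from the description above are that the six fields are tab-delimited and that the chain ID is the last character of the residue identifier.

```python
from typing import NamedTuple


class ResidueLine(NamedTuple):
    residue_id: str   # "pdbres" field followed immediately by the chain ID
    chain: str        # last character of residue_id
    amino_acid: str   # one-letter code
    ss_field: str     # secondary structure field, kept as-is
    exposure: float
    phi: float
    psi: float


def parse_residue_line(line):
    """Split one tab-delimited residue line into its six fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError("expected 6 tab-delimited fields, got %d" % len(fields))
    res_id, aa, ss, exposure, phi, psi = fields
    return ResidueLine(res_id, res_id[-1], aa, ss,
                       float(exposure), float(phi), float(psi))


# Made-up line for illustration; real residue identifiers may be padded differently.
sample = "12A\tK\tH\t0.35\t-57.0\t-47.0\n"
record = parse_residue_line(sample)
print(record.chain, record.amino_acid, record.ss_field)  # A K H
```

Chain-break lines, whose exposure field is "!" rather than a number, would need to be recognized before the float conversion.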
In order to generate "smoothed DSSP," these steps are performed in
the following order:
1. All S's, B's, and I's are changed to loops. If
-turns-are-loops was specified, T's are also changed to
loops (which makes steps 3, 4, and 7 moot).
2. Up to three consecutive G's (3-10 helix residues) at the end of a
helix are changed to H's (i.e. the helix is extended in the
carboxy direction). If the -verbose option was
specified, a message is printed for each such helix that is
extended.
3. Two consecutive residues labelled "T >4" at the
start or end of a helix are also turned into H's. If the
-verbose option was specified, a message is printed for
each such helix that is extended.
4. If two consecutive residues are labelled "T 3", the
first is relabelled "T 2", and flanking residues
are labelled "T 1" and "T 4" (but
only if they are not helix or strand residues).
5. Loops of length 1 between two helix residues (resp. two strand
residues) are filled in so as to join the two segments, unless
the -dont-fill-h-gaps (resp. -dont-fill-e-gaps)
option was specified. If -verbose was specified, a
message is printed for each pair of secondary structures joined.
6. Strands and helices with less than the specified minimum length
are turned into loops. If the -verbose option was
specified, a message is printed for each such secondary structure
that is eliminated.
7. Turns labelled "T 1" through "T 4"
are relabelled V, W, X, and Y, respectively.
8. If the -clean option was specified, all residues that
are not labelled one of "HELVWXY" (i.e. helix, strand, loop, or
turn), and residues in turns other than those consecutively
labelled "VWXY" or "WX", are all turned into loops. In
particular, all T and G residues ("random" turns and 3-10
helices, respectively) are turned into loops.
All such steps are effectively done in separate passes, so they don't
interfere.
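To make the pass structure concrete, here is a simplified Python sketch of three of the passes: changing S, B, and I to loops, filling one-residue gaps, and eliminating short segments. It is not a port of dssp4.pl. It works on a bare string of summary letters, ignores the G and T handling and the rest of the secondary structure field, and the function name smooth is hypothetical. The default minimum lengths (5 for helices, 3 for strands) are the ones documented for dssp4.pl above.

```python
import re


def smooth(ss, min_helix=5, min_strand=3):
    """Apply three of the smoothing passes to a string of summary letters."""
    # Pass: S, B, and I residues become loops.
    ss = re.sub(r"[SBI]", "L", ss)

    # Pass: a one-residue loop between two helix residues (or between
    # two strand residues) is filled in, joining the segments.
    ss = re.sub(r"(?<=H)L(?=H)", "H", ss)
    ss = re.sub(r"(?<=E)L(?=E)", "E", ss)

    # Pass: helices and strands shorter than the minimum become loops.
    def drop_short(match):
        run = match.group(0)
        minimum = min_helix if run[0] == "H" else min_strand
        return run if len(run) >= minimum else "L" * len(run)

    return re.sub(r"H+|E+", drop_short, ss)


print(smooth("LHHHLHHHLLEESEELLHHL"))  # LHHHHHHHLLEEEEELLLLL
```

In the example, the two three-residue helices are joined into one of length 7, the S between the two strand pairs becomes a loop and is then filled in, and the trailing two-residue helix is eliminated. Because each pass runs over the whole string before the next one starts, the order of the passes matters: gap filling happens before the length test, so segments that are individually too short can survive once they are joined.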
The following table in the perl script describes the resulting
single-letter secondary structure states in the output. [The numbers
apparently refer to the internal encoding used by the original Matlab
".m" file.]
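| State | ID number | Code | -clean code |
| helix | 12 | H | H |
| strand | 14 | E | E |
| 3-10 helix | 17 | G | L |
| T1 | 4 | V | V |
| T2 | 5 | W | W |
| T3 | 6 | X | X |
| T4 | 7 | Y | Y |
| loop | 8 | L | L |
| random turn | | T | L |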
[The "random turn" T states are leftovers from the processing algorithm;
I have left them in for fidelity with the original. Use the
-clean option to get rid of them. -- rgr, 18-Jul-96.]
Note: DSSP includes N-1 chain terminus "residues" with an
amino acid code of "!" for PDB entries with N chains. These
"residues" have " " for a secondary structure assignment, and they
also appear in cases where the chain is broken by missing residues
(e.g. between residues 10 and 20 on chain 1 of 2plv). [The
documentation for the "new" DSSP output format (July 1995) says that
interchain discontinuities are denoted "!*", but I have yet to see this.
-- rgr, 2-Nov-98.] dssp4.pl treats these "residues"
as loops in the processing steps described above, but it also changes
the secondary structure and exposure to "!" to avoid recognizing them as
real loop residues. This "!" (or "!*") appears in the output, so programs
that consume the output must be prepared to handle it.
Known bugs:
- If a specific chain is specified, all chain break delimiters
("!") are dropped, including those that identify chain breaks
within the selected chain. [Fixed in Release 1.0. -- rgr,
9-Apr-97.]
- *** When short strands are eliminated, the resulting output can
contain degenerate sheets consisting of only a single strand
each. This happens when a strand of length 3 or greater is
H-bonded only to eliminated strands.
- *** When asked to do -pdb format output, the only
"sheets" that are generated consist of single strands. This is
not sufficient for (e.g.)
make-dssp-segs (though make-dssp-segs does
not actually accept -pdb format input). [-- rgr,
18-Jun-96.] [This feature may be removed in the future. -- rgr,
26-Aug-97.]
- Some operations, particularly the -clean option, trash
information in the secondary structure field other than the
summary first letter. -- rgr, 26-Jul-97. [Fixed in Release 1.0.
-- rgr, 26-Aug-97.]
- If told to select "-chain ' '",
dssp4.pl will include irrelevant chain break "residues"
(see above). [Fixed in Release
1.3. -- rgr, 2-Nov-98.]
- dssp4.pl does not complain about missing chains (i.e. if
told to process a chain that does not exist). [Fixed in Release
1.3. -- rgr, 2-Nov-98.]
make-dssp-segs
Runs dssp4.pl and
make-ss-designations.lisp to turn the abbreviated DSSP file
named as the first argument into the table of secondary structures named
as the second argument.
Usage:
make-dssp-segs dssp-file seg-file
Note: This is equivalent to the following:
dssp4.pl -dont-fill-e-gaps -ss dssp-file \
| make-ss-designations seg-file
This idiom is effectively what
make-core-depends.pl produces; it exposes the internals of
make-dssp-segs in order to give control over the options passed
to the dssp4.pl program. make-dssp-segs is now
deprecated on that account. -- rgr, 17-Sep-97.
Arguments:
- dssp-file
- name of an
abbreviated DSSP format file. At BMERC, such files for most
PDB entries are available as
/structure/dssp/pdbxxxx.ent.out for locus
xxxx. Others can be made via generate-dssp (q.v.).
- seg-file
- name of the segment
definition format file on which to write output. [The
current, confusing convention is to use the extension
".dssp" because that is the source of the SS data. --
rgr, 12-Aug-96.]
Note that there is no chain argument. This is because one cannot
generate the correct designations for sheets versus barrels without
looking at the entire structure; not only is it necessary to have the
whole DSSP file to make sense of the strand-strand bonding information,
but sheets and even barrels can cross two or more chains.
make-dssp-segs passes the -dont-fill-e-gaps option
to dssp4.pl, defaulting all others. There
is no way to alter these defaults, other than using the make-dssp-segs
alternative described above.
dssp-ss-states.pl
dssp-ss-states.pl uses 'smoothed DSSP' output from the
-t option to the dssp4.pl program
together with the full protein sequence specified on the command line to
produce a string of secondary structure assignment letters, one for each
residue, with gaps in the abbreviated DSSP file assigned to 'L' states.
Usage:
dssp-ss-states.pl [-locus name] [-chain L] [-min-strand-length slen]
[-min-helix-length hlen] [-dont-fill-h-gaps] [-dont-fill-e-gaps]
[-clean] [-3] [-no-gap-p] [-verbose]
full-sequence [dssp-filename]
Arguments:
- -locus name
- name by which to identify the structure. Passed directly to dssp4.pl, which uses it primarily for
warning messages.
- -chain chain-letter (a single character)
- letter (or sometimes a digit) identifying the chain to select;
defaults to "_" (equivalent to " " in PDB
terms).
- -dont-fill-h-gaps
- specifies that one-residue loops between neighboring helices are
not to be altered. The default is to convert these residues to
H's. Passed directly to dssp4.pl.
- -dont-fill-e-gaps
- specifies that one-residue loops between neighboring strands are
not to be altered. The default is to convert these residues to
E's. Passed directly to dssp4.pl.
- -min-helix-length hlen (integer)
- specifies the minimum length for helices; default is 1. Passed
directly to dssp4.pl.
- -min-strand-length slen (integer)
- specifies the minimum length for strands; default is 1. Passed
directly to dssp4.pl.
- -clean
- specifies that an extra "cleanup" postpass should be done.
Passed directly to dssp4.pl.
- -3
- produces three-state output. This is done by turning everything
that is not an H or E into an L.
- -no-gap-p
- produces a binary map showing where gaps are not permitted when
doing a structure-aware alignment. All L's and the first and
last H and E of every segment are turned into 0 (the digit zero),
and all other H and E states are turned into 1 (the digit one).
- -verbose
- enables debugging output (none available at present).
- full-sequence
- the full SEQRES sequence (not the name of a sequence
file!) for the PDB entry, as a string of uppercase AA letters
with no embedded blanks.
- dssp-filename
- the name of an
abbreviated DSSP format file to pass to dssp4.pl;
defaults to the standard input. "-" may also be used to
specify the standard input explicitly.
Note that full-sequence and
dssp-filename need not come last; they can be mixed with
the other options as long as full-sequence is present,
and, if both are present, full-sequence comes before
dssp-filename.
dssp-ss-states.pl produces a string on the standard output
that is exactly as long as the full-sequence argument,
but with each amino acid letter replaced with a letter that describes
its "smoothed DSSP" secondary structure assignment. (See above for a description of the letters used.)
Portions of the full sequence that are missing from the DSSP description
(usually because they are not resolved in the crystal structure) are
filled in with loop ('L') characters.
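The -3 and -no-gap-p transformations described in the argument list are simple string rewrites, sketched here in Python. The function names three_state and no_gap_map are hypothetical, and this is not the dssp-ss-states.pl code. One assumption is labeled in the code: the description of -no-gap-p mentions only L, H, and E, so this sketch reduces the string to three states first, which treats every other letter as a loop.

```python
import re


def three_state(ss):
    """-3: everything that is not an H or E becomes an L."""
    return re.sub(r"[^HE]", "L", ss)


def no_gap_map(ss):
    """-no-gap-p: 1 for interior H/E residues, 0 everywhere else."""
    def mark(match):
        run = match.group(0)
        if len(run) <= 2:
            return "0" * len(run)  # only first and last residues, no interior
        return "0" + "1" * (len(run) - 2) + "0"

    # Assumption: non-H/E letters (turn codes and so on) count as loops.
    marked = re.sub(r"H+|E+", mark, three_state(ss))
    return marked.replace("L", "0")


states = "LHHHHHVWEEELL"
print(three_state(states))  # LHHHHHLLEEELL
print(no_gap_map(states))   # 0011100001000
```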
The comparison of full-sequence to the sequence in
the DSSP file is done by the globalS program. For most
purposes, this is akin to swatting a fly with a sledgehammer, but given
the problems with some of the PDB entries, having a sledgehammer comes
in quite handy. Aside from inconsistencies in the PDB, DSSP will drop
residues with incomplete backbones that other tools will consider
present in terms of ATOM records, or include residues that pdb-domain-seq.pl
doesn't see by default. And DSSP seems to make chain continuity errors
on occasion, inserting gaps for stretched peptide bonds, and leaving
them out when the start and end of the gap happen to be close. Some of
these cases lead to mismatches that are hard to correct with less
general sequence comparison tools.
Generally speaking, the result of this process may not be "correct"
in a biological sense, but at least it is consistent (in the database
sense) with both the structure and the full protein sequence, which is
what the needle tools care about.
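The gap-filling idea can be illustrated with a short Python sketch. It does not use globalS. As a stand-in it uses difflib.SequenceMatcher from the Python standard library, which finds matching blocks rather than computing a true global alignment, so it is considerably less robust than the approach described above. The function name fill_gaps and the example sequences are hypothetical.

```python
from difflib import SequenceMatcher


def fill_gaps(full_sequence, dssp_sequence, dssp_states):
    """Map per-residue DSSP states onto the full sequence, using 'L' for gaps.

    dssp_sequence and dssp_states must have the same length, with chain
    break markers already removed.
    """
    if len(dssp_sequence) != len(dssp_states):
        raise ValueError("sequence and state strings differ in length")
    result = ["L"] * len(full_sequence)
    # Stand-in for a real alignment: copy states across matching blocks.
    matcher = SequenceMatcher(None, full_sequence, dssp_sequence, autojunk=False)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            result[block.a + offset] = dssp_states[block.b + offset]
    return "".join(result)


# Made-up example: residues M, I, and A of the full sequence are unresolved.
full = "MKTAYIAKQR"
resolved = "KTAYKQR"
states = "LHHHEEL"
print(fill_gaps(full, resolved, states))  # LLHHHLLEEL
```

The output always has the same length as the full sequence, which is the property the needle tools depend on.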
(The version of dssp-ss-states.pl prior to Release 1.5 was
completely dependent on having the SEQRES sequence match up
with the DSSP sequence in order to fill in gaps in the chain. As a
result, it was very sensitive to mismatches, usually errors in the PDB
entry's SEQRES records.)
Known bugs:
- 1bll has a chain break between residues
"LYS E 11 " and
"GLU E 15 ", but since it is only
1.84Å, DSSP doesn't flag it as such, and
dssp-ss-states.pl doesn't realize there needs to be an
insertion at that point. -- rgr, 14-Apr-97. [Fixed in Release
1.5 by the globalS implementation. -- rgr, 29-Oct-99.]
- 1tre chain A has an error in PDB residue
indexing that confuses DSSP, causing it to put a gap in the
wrong place, which confuses dssp-ss-states.pl in turn.
-- rgr, 24-Apr-97. [Fixed in Release 1.5 by the globalS
implementation. -- rgr, 29-Oct-99.]
- *** Use of positional arguments for full-sequence
and dssp-filename is stupid. -- rgr, 18-May-98.
Bob Rogers
<rogers@darwin.bu.edu>
Last modified: Fri Mar 9 10:47:37 EST 2001