PDB ATOM programs



Presently, the only "general" PDB ATOM tool is filter-pdb-atoms.pl, which is used by many tools to do preprocessing and consistency checking of PDB ATOM records.

Table of contents

  1. PDB ATOM programs
    1. Table of contents
    2. filter-pdb-atoms.pl


filter-pdb-atoms.pl

filter-pdb-atoms.pl extracts the ATOM records that meet certain criteria from a PDB file. It also fixes and/or identifies some common screw cases that can disrupt other programs.

Usage:

 
    filter-pdb-atoms.pl [-locus string] [-chains chain-spec] [-new-chain-id letter]
            [-output [all | backbone | cb]] [-pass-through [chain | all]]
            [-hcb | -hcb-all] [-ala] [-residue-index-file rif-file-name]
            [-include-hetatms HET ...] [-exclude-hetatms HET ...]
            [-print-peptide-length-histogram] [ pdb-file-name ]

If pdb-file-name is not specified, or is specified as "-", then the standard input is used. The selected ATOM, TER, and possibly other records (depending on options) are sent to the standard output.

Arguments:

-locus string
optional locus name, used only for warnings at present. If not specified, and an explicit PDB file name is specified, the warning includes the PDB file name instead.
-chain chain-spec
-chains chain-spec
specifies the chain or chains to select, or "all" to get all chains (the default). If not the token "all" (in lower case), the chain-spec must be a comma-separated list of chain IDs in uppercase, followed by optional residue subranges (see the Core chain specification syntax section for details). Regardless of the order or number of times they are specified, multiple chains or chain subranges are always output exactly once in the order in which they appear in the input PDB entry. No complaint is made if a specified chain or subrange does not actually appear. See elsewhere for a hack that finds all chain IDs in a PDB entry. (-chain and -chains are synonyms; at most one may be specified, as they are not cumulative.)
-new-chain-id chain-letter
specifies a new chain ID letter for the selected chain(s). If specified, chain-letter must be a single character.
-model N
for a PDB file that contains a series of models bounded by MODEL and ENDMDL records, selects the model number given by the integer N. By default, the first model is output. In no case is more than one model ever output. For a file without models, this argument is ignored. [probably should take the implicit model 1 if N=1, and skip otherwise, for consistency. -- rgr, 23-Apr-99.]
-output token
specifies the set of atoms to output; token must be one of all, backbone, or cb (the values listed in the usage summary above). In no case are hydrogen atoms emitted, so "-output all" is a bit misleading.
-hcb
"hallucinate" beta carbons for glycines; see below for details. The -hcb option is ignored if -output backbone is specified.
-hcb-all
like -hcb, but "hallucinates" beta carbons for all residues with missing CB atoms, in addition to glycines.
-ala
replace all residue names with ALA.
-pass-through token
indicates which records, other than ATOM, HETATM, and TER, are to be passed through. The only defined values for token at present are chain and all (the values listed in the usage summary above).
-include-hetatms [ HET . . . ]
specifies that some or all heterogen atoms (on HETATM records) are to be included in the output. By default, all heterogen atoms are omitted. If -include-hetatms is specified without HET "residue" names, then all HETATM records are included. If one or more specific HET names are given, then only HETATM records for those heterogen "residues" are included. In order to be recognized, HET names must be uppercase alphanumerics exactly three characters long (the "hetID" of the HET record specification of the PDB Contents Guide Version 2.2).
-exclude-hetatms [ HET . . . ]
specifies that heterogen atoms for all but the specified HET groups are to be included in the output. By default, all heterogen atoms are omitted.
-include-unk
specifies that UNK residues are to be included in the output. Otherwise, they are ignored.
-no-missing-atoms
if specified, the message that is normally printed enumerating the atoms missing from each incomplete residue is suppressed. The total number of atoms missing from all residues is still printed in a warning at the end, if it is not zero.
-residue-index-file rif-file-name
write a list of residue indices, one for each selected residue, onto the named file, one per line. See below for details.
-print-peptide-length-histogram
debugging hack that prints a histogram of peptide bond lengths to stderr, binned by 0.01 Ångstroms. (The distributions can vary markedly from protein to protein; I haven't a clue why.)
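
The -model behavior described above can be pictured with a short Python sketch. This is an illustration only, not the Perl code: the function name select_model is hypothetical, and dropping the MODEL and ENDMDL records themselves while keeping records outside any model is an assumption that the text does not confirm.

```python
def select_model(lines, n=None):
    """Keep records outside any model plus the records of one model.

    n is the requested model number; None means "the first model seen".
    A file without MODEL records passes through unchanged.
    (Hypothetical helper; not the tool's actual implementation.)
    """
    kept = []
    in_model = False   # currently inside a MODEL ... ENDMDL block?
    wanted = False     # is the current block the selected model?
    emitted = False    # has the selected model already been output?
    chosen = n         # model number to keep, fixed once known
    for line in lines:
        record = line[:6].rstrip()
        if record == "MODEL":
            number = int(line[6:].split()[0])
            if chosen is None:
                chosen = number            # default: first model in the file
            in_model = True
            wanted = (number == chosen) and not emitted
        elif record == "ENDMDL":
            if wanted:
                emitted = True             # never output more than one model
            in_model = False
            wanted = False
        elif not in_model or wanted:
            kept.append(line)
    return kept
```

Calling select_model(lines) mirrors the default (first model), and select_model(lines, 2) mirrors "-model 2".
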
Based on its keyword arguments, filter-pdb-atoms.pl processes each residue's atoms sequentially, in the order in which they appear in the input file. Each residue is processed in the following way:
  1. Residues outside the desired set (residues not within selected chains and/or chain subranges, UNK residues unless requested, HETATM residues unless requested) are skipped entirely; no warning messages are printed for such residues.
  2. Illegal residues (i.e. those having ATOM records with names outside the standard set of 20 amino acids) are reported and ignored.
  3. Nucleotides are always ignored but never reported.
  4. Illegal or inappropriate atoms (e.g. an NZ atom on a tyrosine residue) are also reported and ignored. Since filter-pdb-atoms.pl has no built-in knowledge of heterogen groups, it learns the appropriate atom names as it goes (and therefore must assume that an "unknown atom" really corresponds to a "missing atom" in a previous residue). (Beta carbons on glycines are not rejected if -hcb is specified; they are silently used.)
  5. Hydrogens are always ignored, and never reported. This includes deuterium atoms and hydrogens represented as "Q" atoms. [Including these is a possible enhancement idea. -- rgr, 8-Aug-96.] [But I've had no call for hydrogens. -- rgr, 24-May-98.]
  6. Multiple occurrences of an atom with variants are replaced by the one with the highest occupancy, or the first if there are ties. The "atom variant" letter (1-based column 17) is always reset to a space (" "). (The original BMERC technique for dealing with multiple atoms was to pick the first, but George Maalouf suggested this eminently sensible change. -- rgr, 7-Aug-96.)
  7. Beta carbons are hallucinated, for glycines (if -hcb was specified) or for all residues (if -hcb-all was specified), but only if the residue has no beta carbon, and if the N, CA, and C backbone atoms are present. The new CB atom record is emitted in the usual place (after the backbone and O atom, if present, but before the OXT, if present). The atom number, occupancy, and temperature factor are all zero (in the appropriate formats), and columns 77-80 are " C  " (the element symbol and implicit zero charge, in compliance with PDB version 2.1). Columns 73-76 are copied from the CA atom if its columns 77-80 are " C  " (a heuristic for detecting version 2.1 compliance), else they are left blank.
  8. Missing backbone and sidechain atoms are detected and reported. At most one "Missing atom" message will be generated per residue. Note that filter-pdb-atoms.pl expects a residue's ATOM records to be consecutive. ("Incomplete" residues are still output, though, so the output may still contain broken backbones. Is this a bug? Perhaps it should at least be an option. -- rgr, 12-Aug-96.)
  9. The residue name is changed to ALA if the -ala option was specified.
  10. Regardless of the order in which a residue's ATOM records appear, they are always output in canonical order (backbone from amino to carboxyl ends, followed by sidechain atoms, proximal to distal, usually followed by OXT for the terminal residue's other oxygen).
  11. If the -residue-index-file option was specified, a line is written to this file with the three-character residue name, a space, the chain ID, the four-character index number, and the one-character insertion code. This is identical in format to columns 18 through 27 of an ATOM record.
  12. Chain breaks are detected by checking peptide bond lengths; if a bond is longer than 2.0 Ångstroms, the offending residues are flagged. This is only done where the ATOM records are consecutive; filter-pdb-atoms.pl won't complain about apparent chain breaks where other lines are interpolated between residues. (The peptide bond values I've seen for continuous chains cluster around 1.33Å, with a maximum of 1.47Å, except for 4rcr, which has a 1.64Å peptide bond, and 3dfr, which seems to have 8 larger values, ranging up to 1.64Å. Skipping a residue seems to result in values from 2.48Å to more than 4Å.)
  13. [Missing TER records should be flagged. -- rgr, 8-Aug-96.]
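
The variant selection in step 6 can be sketched in a few lines of Python. This is a minimal illustration, not the Perl code: the function name pick_variants is hypothetical, and it assumes well-formed fixed-column records with the atom name in columns 13-16, the variant letter in column 17, and the occupancy in columns 55-60.

```python
def pick_variants(residue_atoms):
    """Reduce one residue's ATOM lines to a single record per atom name.

    The record with the highest occupancy wins; on a tie the first one
    seen is kept. The variant letter (column 17) is reset to a space.
    (Hypothetical helper; not the tool's actual implementation.)
    """
    best = {}  # atom name -> (occupancy, line), in first-seen order
    for line in residue_atoms:
        name = line[12:16]
        occupancy = float(line[54:60])
        # Strictly greater, so the first record wins ties.
        if name not in best or occupancy > best[name][0]:
            best[name] = (occupancy, line)
    return [line[:16] + " " + line[17:] for _, line in best.values()]
```

The sketch keeps atoms in first-seen order; the canonical reordering described in step 10 would be a separate pass.
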
Residues are always output in the order in which they appear in the input file, regardless of the order they are mentioned in the -chains option. Furthermore, residues are output at most once, even if implicitly "requested" more than once by using repeated or overlapping chain subrange specifications.
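
Step 7 does not say how the CB position is computed from N, CA, and C, so the Python sketch below substitutes a widely used ideal-geometry approximation with fixed coefficients. That formula is an assumption and may not match what filter-pdb-atoms.pl or the old hcb.f computes. The record layout (zero atom number, occupancy, and temperature factor; " C  " in columns 77-80; columns 73-76 copied from the CA record only when it carries " C  " itself) follows the description in step 7. The function names are hypothetical.

```python
def hallucinate_cb(n, ca, c):
    """Estimate CB coordinates from backbone N, CA, and C coordinates.

    Uses a common ideal-geometry approximation (an assumption here,
    not necessarily the tool's own method).
    """
    b = [ca[i] - n[i] for i in range(3)]    # N -> CA
    d = [c[i] - ca[i] for i in range(3)]    # CA -> C
    a = [b[1] * d[2] - b[2] * d[1],         # cross product b x d
         b[2] * d[0] - b[0] * d[2],
         b[0] * d[1] - b[1] * d[0]]
    return tuple(-0.58273431 * a[i] + 0.56802827 * b[i]
                 - 0.54067466 * d[i] + ca[i] for i in range(3))


def cb_record(ca_line, xyz):
    """Format an 80-column CB ATOM record next to the given CA record."""
    # Copy columns 73-76 only if the CA record looks version 2.1 compliant.
    segment = ca_line[72:76] if ca_line[76:80] == " C  " else "    "
    return ("ATOM  %5d  CB  %s   %8.3f%8.3f%8.3f%6.2f%6.2f      %-4s C  "
            % (0, ca_line[17:27], xyz[0], xyz[1], xyz[2], 0.0, 0.0, segment))
```

With ideal backbone geometry this places CB roughly 1.5Å from CA. Different placement schemes can disagree by tenths of an Ångstrom, which is the kind of discrepancy recorded for GLY 1 92 of 1bbt in the bug list.
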

If you just want to find out what's going on with a given PDB file, here's a quick hack that does the trick:

 
	filter-pdb-atoms.pl 3dfr.ent > /dev/null
This gives you all the warnings with none of the atoms (since the warnings go to stderr, and the atoms to stdout). This is also the easiest way to use the -print-peptide-length-histogram option.
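
The chain break check in step 12 and the histogram printed by -print-peptide-length-histogram can be approximated in Python as follows. This is a hedged sketch rather than the tool's logic: all function names are hypothetical, residues are identified by columns 18 through 27 (the same format as the residue index file), and pairs that span different chain IDs are skipped as an assumption. The 2.0Å limit and the 0.01Å bin width come from the text.

```python
import math
from collections import Counter


def backbone_atoms(atom_lines):
    """Map each residue id (columns 18-27) to its N and C coordinates."""
    residues = {}  # dicts keep insertion order, so file order is preserved
    for line in atom_lines:
        if not line.startswith("ATOM"):
            continue
        name = line[12:16].strip()
        if name in ("N", "C"):
            xyz = tuple(float(line[i:i + 8]) for i in (30, 38, 46))
            residues.setdefault(line[17:27], {})[name] = xyz
    return residues


def peptide_bonds(atom_lines):
    """Yield (residue_id, next_residue_id, C-to-N distance) in file order."""
    items = list(backbone_atoms(atom_lines).items())
    for (id1, a1), (id2, a2) in zip(items, items[1:]):
        same_chain = id1[4] == id2[4]   # chain ID is column 22
        if same_chain and "C" in a1 and "N" in a2:
            yield id1, id2, math.dist(a1["C"], a2["N"])


def chain_breaks(atom_lines, limit=2.0):
    """Return the residue pairs whose peptide bond exceeds the limit."""
    return [(i, j, d) for i, j, d in peptide_bonds(atom_lines) if d > limit]


def length_histogram(atom_lines, bin_width=0.01):
    """Count peptide bond lengths per bin; keys are integer bin indices."""
    return Counter(int(d / bin_width) for _, _, d in peptide_bonds(atom_lines))
```

Unlike the real program, the sketch does not check that the ATOM records of neighbouring residues are consecutive in the file, so it would also report gaps that filter-pdb-atoms.pl deliberately ignores.
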

Bugs/Problems:

  1. The hallucinate-cbeta program is used internally to implement the -hcb option. [Fixed in Release 1.0. -- rgr, 22-Aug-97.]
  2. The synthesized beta carbon for GLY 1 92 of 1bbt seems to have slightly different coords than those generated by the old hcb.f -- but only that CB, and only Y and Z. First the original version and then the new is shown in diff format:
     
    460c460
    < ATOM    719  CB  GLY 1  92      34.872  -7.279 128.363  1.00  8.88      1BBT 999
    ---
    > ATOM      0  CB  GLY 1  92      34.872  -7.083 128.812  0.00  0.00              
    
    [The all-perl version in Release 1.0 comes closer to the original, but is still significantly off. Here we have the first filter-pdb-atoms.pl version versus the current version:
     
    725c725
    < ATOM      0  CB  GLY 1  92      34.872  -7.083 128.812  0.00  0.00           C  
    ---
    > ATOM      0  CB  GLY 1  92      34.872  -7.233 128.468  0.00  0.00           C  
    
    1pgs also has a large CB GLY variation. Probably not worth worrying about. -- rgr, 22-Aug-97.]
  3. 2hmz seems to have a break of 2.43Å between VAL A 21 and ILE A 21A (and similar breaks at the same spot in the other three chains). [Never mind; this is the 2hmz anomaly, and is expected. -- rgr, 26-Aug-97.]
  4. Polynucleotide chains are not dealt with at all. filter-pdb-atoms.pl complains that the various bases are undefined amino acids. [All "Undefined AA" messages are now suppressed unconditionally for nucleotides. There is still no way to get nucleotides through the filter. -- rgr, 23-Apr-99.]
  5. Deal with MODEL and ENDMDL records, and the new element and segment fields in columns 73-80. [Columns 73-80 are now version 2.1 compatible. -- rgr, 22-Aug-97.] [MODEL/ENDMDL records are now handled fully. -- rgr, 23-Apr-99.]
  6. Should we hallucinate beta carbons for non-glycines when they are not seen in the structure? -- rgr, 12-Aug-96. [Now an option; caveat emptor. -- rgr, 22-Aug-97.]
  7. *** No complaint is made if a specified chain does not actually appear in the PDB entry. This is easy for the user to detect if requesting only one chain (the output will be empty), but more difficult if only one of a set is missing. -- rgr, 16-Aug-96.
  8. The "atom variant" letter (1-based column 17) was not always reset to " ", e.g. 1hms CB THR 85. [Fixed in Release 1.0. -- rgr, 26-Aug-97.]
  9. The following sidechain atoms are emitted out of order: [Fixed in Release 1.0. -- rgr, 22-Aug-97.]
  10. 2eng (and maybe 1bco) has a GLY with missing backbone atoms; filter-pdb-atoms.pl doesn't check this, and as a result, the CB is placed wildly. [Fixed in Release 1.0. -- rgr, 22-Aug-97.]
  11. *** Specifying -pass-through all gets SIGATM, SIGUIJ, and ANISOU records for atoms and chains that are not selected. These records should come through only if their associated atom (and/or residue for HETATM) fits the criteria for being output. (See the coordinate section of the PDB guide for more info.) -- rgr, 10-Sep-97.


Bob Rogers <rogers@darwin.bu.edu>
Last modified: Thu Apr 6 16:51:54 EDT 2000