Notes on PDB ATOM record format



These are just my summary notes with real-world examples (and counter-examples!). For more information, see the detailed ATOM record specification from the PDB Contents Guide Version 2.2 (draft) of 16-Dec-1996 at the Protein Data Bank.

Table of contents

  1. Notes on PDB ATOM record format
    1. Table of contents
    2. ATOM record format
    3. Notes on atom serial numbers
    4. Notes on atom naming
    5. Notes on atom variants
    6. Notes on amino acid abbreviations
    7. Notes on chains
    8. Notes on sequence number and insertion code
    9. Notes on atom occupancies

ATOM record format

Note: Column numbers are 0-based, because so is the emacs C-x = (what-cursor-position) command, and so are C and perl subscripts!

 
0         1         2         3         4         5         6         7
01234567890123456789012345678901234567890123456789012345678901234567890123456789
ATOM     86  CG  ARG    11      -2.455   1.706  24.211  1.00 17.72      1AAK 146
Name        Start  End  Format  Description
recname         0    5  A6      a literal "ATOM  " (note two trailing spaces).
serial          6   10  I5      atom serial number, e.g. "   86". See below for details.
(none)         11   11  1X      space
atom           12   15  A4      Atom role name, e.g. " CG1". See below for details.
altLoc         16   16  A1      atom variant, officially called the "alternate location indicator". This is usually " " for atoms with well-defined positions, as in this case, but sometimes "A", "B", etc. See below for details.
resName        17   19  A3      amino acid abbreviation, e.g. "ARG". See below for details.
(none)         20   20  1X      space
chainID        21   21  A1      chain ID, usually " ", but often "A", "B", etc, for multichain entries. See below for details.
Seqno          22   26  A5      residue sequence number (I4) and insertion code (A1), e.g. "  11 " or " 256C". See below for details.
(none)         27   29  3X      three spaces
x              30   37  F8.3    atom X coordinate
y              38   45  F8.3    atom Y coordinate
z              46   53  F8.3    atom Z coordinate
occupancy      54   59  F6.2    atom occupancy, usually "  1.00". The sum of atom occupancies for all variants in field 4 generally adds to 1.0.
tempFactor     60   65  F6.2    B value or temperature factor, e.g. " 17.72". (I don't use this value, so have nothing to add; see the ATOM record specification discussion of B factors, etc. -- rgr, 8-Oct-96.)
(none)         66   71  A6      ignored. [Some older PDB files have footnote numbers here, but this field is not described in the Format 2.1 specification. -- rgr, 22-Jan-99.]
recID          72   79  A8      [prior to format version 2.0.] record identification field, e.g. "1AAK 146" (tres FORTRAN, n'est-ce pas?).
segID          72   75  A4      segment identifier, left-justified. [format version 2.0 and later.]
element        76   77  A2      element symbol, right-justified. [format version 2.0 and later.]
charge         78   79  A2      charge on the atom. [format version 2.0 and later.]
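
Since the columns are 0-based and inclusive, the table translates almost directly into string slices. Here is a minimal Python sketch of that idea; the ATOM_FIELDS list and the split_atom_record name are my own inventions for illustration (not part of any BMERC tool), and only the fields up to tempFactor are included.

```python
# Hypothetical field table: (name, start, end), 0-based inclusive columns,
# copied from the ATOM record table. The names are illustrative only.
ATOM_FIELDS = [
    ("recname", 0, 5),
    ("serial", 6, 10),
    ("atom", 12, 15),
    ("altLoc", 16, 16),
    ("resName", 17, 19),
    ("chainID", 21, 21),
    ("seqno", 22, 26),
    ("x", 30, 37),
    ("y", 38, 45),
    ("z", 46, 53),
    ("occupancy", 54, 59),
    ("tempFactor", 60, 65),
]


def split_atom_record(line):
    """Slice one ATOM/HETATM line into raw (unstripped) field strings."""
    # Pad short lines to 80 columns so every slice has its full width.
    line = line.rstrip("\n").ljust(80)
    return {name: line[start:end + 1] for name, start, end in ATOM_FIELDS}


example = ("ATOM     86  CG  ARG    11      -2.455   1.706  24.211"
           "  1.00 17.72      1AAK 146")
fields = split_atom_record(example)
print(repr(fields["atom"]), repr(fields["seqno"]), float(fields["x"]))
```

The fields are left unstripped on purpose, since the padding is significant in the atom name and seqno fields.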

Notes on atom serial numbers

Each ATOM and HETATM record (and TER, for some reason) gets a unique serial number, assigned consecutively from 1. (The fact that a single numbering scheme applies to both atoms and heterogen atoms is not explicitly stated in the PDB documentation, but appears to be universally followed.) However, the -hcb option to filter-pdb-atoms.pl violates this rule by generating all its hallucinated beta carbons with serial number 0.
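
A quick way to check the consecutive-numbering rule might look like the following Python sketch. The serial_gaps function is hypothetical, and it assumes a single-model entry in which every ATOM, HETATM, and TER record actually carries a serial number in columns 6-10.

```python
def serial_gaps(lines):
    """Return (expected, found) pairs where serial numbers are not consecutive."""
    gaps = []
    expected = 1
    for line in lines:
        if line.startswith(("ATOM", "HETATM", "TER")):
            found = int(line[6:11])  # assumes the serial field is not blank
            if found != expected:
                gaps.append((expected, found))
            expected = found + 1
    return gaps


# Truncated demo records: record name padded to 6 columns, then an I5 serial.
demo = ["%-6s%5d" % (name, serial)
        for name, serial in [("ATOM", 1), ("ATOM", 2), ("TER", 3),
                             ("HETATM", 5), ("ATOM", 0)]]
print(serial_gaps(demo))
```

The last demo record imitates a serial number of 0, which shows up as a break in the sequence.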

Notes on atom naming

The atom name is a four-character field that may be subdivided into atomic symbol (A2), "remoteness indicator" (A1), and "branch designator" (A1) subfields, as detailed below:

Name                  Format  Description
atomic symbol         A2      Right-justified atomic symbol, e.g. " C".
remoteness indicator  A1      Greek letter distance abbreviation. In order of increasing distance, these are "A" for alpha, "B" for beta, "G" for gamma, "D" for delta, "E" for epsilon, "Z" for zeta, and "H" for eta.
branch designator     A1      digit designating the branch direction; left blank if the sidechain is unbranched.

For the name " CG1", " C" denotes the species (carbon), "G" identifies it as a gamma atom, and "1" denotes the branch of a beta-branched amino acid. Note that the atomic symbol is right-justified (i.e. " C", "ZN"), so " CA " is an alpha carbon, and "CA  " is a calcium atom.
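
The subfield layout can be sketched in a few lines of Python. Both function names are hypothetical, and the split assumes a heavy-atom name in the A2/A1/A1 layout; digit-prefixed hydrogen names such as "1HB " do not fit it.

```python
def split_atom_name(name):
    """Split a 4-character atom name into (symbol, remoteness, branch)."""
    name = name.ljust(4)[:4]  # force exactly four characters
    return name[0:2], name[2], name[3]


def is_alpha_carbon(name):
    """True for " CA " (alpha carbon), false for "CA  " (calcium)."""
    return split_atom_name(name) == (" C", "A", " ")


print(split_atom_name(" CG1"))
print(is_alpha_carbon(" CA "), is_alpha_carbon("CA  "))
```

Comparing the raw four-character name, rather than a stripped one, is what keeps the alpha carbon and the calcium atom apart.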

For amino acids, the traditional order of atom records within a residue is N, CA, C, O, followed by the sidechain atoms (CB, CG1, CG2 . . . ) in order first of increasing remoteness, and then branch. The extra oxygen at the carboxyl terminal is called " OXT", and appears after all sidechain atoms.

Note: The amino-to-carboxyl backbone order of polypeptides is required by the PDB, but the described ordering of sidechain atoms is nowhere specified (that I can find). However, I have not found any exceptions. Some BMERC programs may break if it is not strictly followed, though this can be avoided by preprocessing the ATOM records with filter-pdb-atoms.pl.

Hydrogens, if they appear for a given residue, come after all that residue's heavy atoms, in the same order and with the same nomenclature, e.g. " HA " is the hydrogen on the alpha carbon (the fourth stereospecific bond). However, if there are multiple hydrogens on a given heavy atom, they are given prefix digits to make them unique, e.g. "1HB " and "2HB " for a non-beta-branched residue (so valine has a plain " HB ", and alanine has an additional "3HB "). [I don't know whether the prefixes are specific to the rotamer directions, though. -- rgr, 8-Oct-96.]

Anomalies:

Missing OXT atom.
The OXT atom may not be present in rare cases. For example, in 1aar, a ubiquitin dimer, there is an "isopeptide" bond between the final carbon of the A chain and the NZ nitrogen of the sidechain of a lysine on the other peptide chain.

Hydrogens represented as "Q" atoms.
In at least one case, the NMR structure 1acp, some of the hydrogens are represented as "Q" atoms. Here is REMARK 10 from the 1acp PDB entry:
Pseudo-atoms designated as Q are dimensionless reference points representing a group of hydrogen atoms. They are placed in the center of the positions of the hydrogen atoms they represent. QA represents the two methylene hydrogen atoms of GLY. QB, QG, ... represent beta, gamma, ... methylene or methyl groups in the side chains. In case of branches in the side chains the numbers of the pseudo-atoms are the same as the numbers of the carbons to which the hydrogen atoms are attached. QQG and QQD denote the pseudo-atoms for the 6 hydrogen atoms of the isopropyl methyl groups of VAL and LEU. QR is the pseudo-atom for the delta and epsilon hydrogens of the aromatic rings of TYR and PHE. (K. Wuthrich, M. Billeter and W. Braun, J. Mol. Biol. (1983) vol. 169, 949-961)

CB GLY in the PDB entry.
Out of 43 glycine residues in 1vnc, one of them (GLY 126) has a beta carbon! The occupancy is 0.00, however, and REMARK 6 has this to say:
. . . Residues GLU 119 to ASN 128 were modelled with no side chains. There was some density in the primary MIR-map but the residues did not show up in the refined model. The occupancies of these residues were set to zero.
It appears that the software that generated the backbone neglected to consider that glycines do not normally have beta carbons.

Missing branch designator.
1gai ILE 87 has a CD atom (in each of two alternate locations); all other ILE's have a CD1, as is normal. Also, none of the other ILE's have alternate locations, either.

Water with O1 and O2 atoms
1mbd has two water molecules (HOH 243 and HOH 245) that each have O1 and O2 atoms, instead of a single O. None of the other waters are affected. There is no indication that the author really meant "hydrogen peroxide" instead of water.

HET groups with common indices
HET groups SO4 1 and HOH 1 share a chain/index/insertion code in the 1ctf structure. The Heterogen section of the PDB Contents Guide Version 2.1 is not explicit on whether or not this is an error.

OH instead of OXT
The last residue of 2hmx, TYR 133, apparently has a second OH atom instead of an OXT atom. It is not a duplication of a single OH oxygen, and the coordinates appear to be consistent with OXT. The error is repeated on all 20 models in this NMR entry.

Missing remoteness indicator
CYS 50 of 2bb2 has its CB atom mislabelled as C.

Notes on atom variants

We expect the atom variant fields to always be " " in the .core file; if several variants are present in the PDB entry, filter-pdb-atoms.pl picks the one with the highest occupancy. This is not usually an issue; core segments, especially backbone atoms, usually are heavily constrained in the crystal. See also the discussion of atom occupancy (field 12).
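
The "pick the variant with the highest occupancy" rule can be sketched in Python as follows. This is not the filter-pdb-atoms.pl code, just an illustration under my own assumptions: the choice is made independently for each atom, ties go to the variant that appears first, and the function and helper names are invented.

```python
def pick_best_variants(atom_lines):
    """Keep one record per atom: the alternate location with the highest occupancy."""
    best = {}  # atom key -> (occupancy, line); dicts preserve first-seen order
    for line in atom_lines:
        # Identify the atom by name, residue, chain, and seqno + insertion
        # code, deliberately leaving out the altLoc column (16).
        key = (line[12:16], line[17:20], line[21:22], line[22:27])
        occupancy = float(line[54:60])
        if key not in best or occupancy > best[key][0]:
            best[key] = (occupancy, line)
    # Blank out altLoc in the survivors.
    return [line[:16] + " " + line[17:] for _, line in best.values()]


def demo_line(serial, name, alt, occupancy):
    # Hypothetical helper: a THR 85 record with dummy coordinates and B value.
    return "ATOM  %5d %-4s%1sTHR  %4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        serial, name, alt, 85, 0.0, 0.0, 0.0, occupancy, 20.0)


demo = [demo_line(1, " CG2", "A", 0.29),
        demo_line(2, " CG2", "B", 0.71),
        demo_line(3, " CA ", " ", 1.00)]
kept = pick_best_variants(demo)
for line in kept:
    print(line)
```

Choosing per atom ignores any coupling between variants of neighbouring atoms, so a more careful version would probably pick one letter for a whole residue.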

Anomalies:

Digits in the variant field.
[Loredana reports finding digits in this field, but I have not seen any characters other than A and B. -- rgr, 25-Jul-96.]

Missing alternative.
1hms CB THR 85 has a solitary A variant (of occupancy 0.71), with no apparent alternative. The gamma atoms have both A and B variants, with corresponding occupancies of 0.71 and 0.29, so a CG would appear to have been left out, except that the atom numbering is consecutive.

Mislabelled variants.
1hms LEU 108 has paired A and B variants for the delta carbons, but variant A is on CD2 and variant B is on CD1. These are apparently mislabelled.

Variants labelled by the SEGID field
1tgj (a recent TGF-beta structure) has 8 variant atoms that are labelled as variants solely by the SEGID field of the atom record. Most atoms have "TGF " for their SEGID, but the variants have "TGF2" instead and the alternate location indicator is erroneously left blank. Otherwise the atoms appear normal (all with 0.50 occupancy and in the expected order). The following paragraph, which appears as part of REMARK 500, is the only thing that mentions alternate locations:
The following atoms that are related by crystallographic symmetry are in close contact. Some of these may be atoms located on special positions in the cell. Atoms with non-blank alternate location indicators are not included in the calculations.
Ironically, there are no atoms so labelled. It is therefore likely that the text was generated automatically.

On the other hand, there is no REMARK 550, which is supposed to describe the meaning of the SEGID field identifiers.

Variant amino acid types.
2hmz attempts to implement these in the following way (from REMARK 5):
Residue 21 has been modeled both as VAL and as ILE. It is presented in the ATOM records below as VAL 21 followed by ILE 21A. All atoms of this residue have been assigned occupancies of 0.5.
Thus, none of these atoms have the expected non-blank atom variant field. filter-pdb-atoms.pl sees chain breaks between VAL 21 and ILE 21A.

Notes on amino acid abbreviations

Virtually all of the time, this is one of the 20 standard amino acid abbreviations. The two principal exceptions are:
  1. Polynucleotide chains. Yes, the PDB is a protein data bank, but for structures of proteins bound to short DNA or RNA segments, the most sensible way to render the polynucleotide chain is . . . as a chain.
  2. Modified amino termini. The most common case is acetylation, which involves an initial "residue" called ACE, with atoms C, O, and CH3; but other amino modifications are possible (1lec has a "PCA" residue, whatever that is). [I think these should be HETATM's, strictly speaking, but the PDB documentation is unclear on the point. And the convention for acetylation, at least, seems to be well established. -- rgr, 28-Aug-96.]

Notes on chains

Each chain represents a contiguous polypeptide (or polynucleotide), though there may be breaks that are not explicitly recorded (except in REMARK records) due to crystallographic ambiguity. If there is only one chain, it is usually denoted with a " " in the chain ID field; multichain entries are usually labelled "A", "B", etc. Proteases and their peptide-containing inhibitors are often labelled "E" and "I", respectively.

And sometimes it is weirder.

Anomalies:

Single chain with non-space chain ID.
Take, for example, Remark 4 from the entry for 5cyt:
Standard Protein Data Bank procedure is to use a null (blank) character for the chain indicator in structures comprising only one chain. In this entry, however, the chain identifier R (for reduced) was used to conform to the depositor's publications and to emphasize the relationship between this structure and the oxidized form which is available from the Protein Data Bank as a separate entry.
(By the way, REMARK 4 is supposed to be a compliance note, of the form "XXXX COMPLIES WITH FORMAT V. N.M, DD-MMM-YYYY". Clearly, 5cyt does not, though its choice of chain ID is not out of line.)

Non-alphabetic chain IDs
1bbt uses the digits 1 through 4 for its four chains. This is legitimate.

Chains not collated.
Chains do NOT necessarily appear in alphabetic (or other) order within a PDB entry. 1lts has chains D, E, F, G, and H (a homopentamer) followed by chains A and C. This is legitimate, and therefore not really anomalous, but will break programs that assume alphabetical order.

To find all chain ID's and the lengths of their associated sequences (as represented by atom records), do the following:


    grep '^ATOM........ CA [ A]' "$pdb_file_name" \
        | cut -c22 | uniq -c
Note that this does not work for PDB entries with multiple models (i.e. that contain MODEL/ENDMDL records), but you can put filter-pdb-atoms.pl at the start of the pipe to correct this.

For the case of 2plv, the output looks like this:

 
    288 1
    268 2
    235 3
     62 4
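
For comparison, the same count can be sketched in Python. The chain_lengths function and the demo_ca helper are hypothetical; the filter mirrors the grep pattern (alpha carbons with a blank or "A" alternate location), and itertools.groupby counts consecutive runs the way uniq -c does, so uncollated chains are reported in file order.

```python
from itertools import groupby


def chain_lengths(lines):
    """Return [(chain_id, ca_count), ...] for consecutive runs, like `uniq -c`."""
    chain_ids = [
        line[21:22]
        for line in lines
        # Same filter as the grep: ATOM records for alpha carbons whose
        # alternate location is blank or "A".
        if line.startswith("ATOM") and line[12:16] == " CA "
        and line[16:17] in (" ", "A")
    ]
    return [(chain, len(list(run))) for chain, run in groupby(chain_ids)]


def demo_ca(chain, seqno, alt=" "):
    # Hypothetical helper: a CA record truncated after the insertion code.
    return "ATOM  %5d  CA %sGLY %s%4d " % (seqno, alt, chain, seqno)


demo = [demo_ca("D", 1), demo_ca("D", 2, "A"), demo_ca("D", 2, "B"),
        demo_ca("A", 1)]
print(chain_lengths(demo))
```

Like the shell version, this sketch does not handle entries with multiple models.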

Notes on sequence number and insertion code

Together, the sequence number and insertion code uniquely name the residue within the chain. Since neither is any use without the other, I have taken to calling these collectively the PDBRES field, though the official ATOM record specification considers them separate fields.

The insertion code is usually blank, but values of A, B, C, etc., are used to preserve historical numbering schemes in the presence of insertion mutations. (In rare cases, the insertion code can even be a digit, though the use of digits is now officially deprecated. Personally, I haven't seen any of these. Yet.) The integer sequence number can be zero or negative, and consecutive residues can be numbered nonconsecutively, again due to insertions or deletions with respect to some consensus sequence. Finally, residues can be unresolved in the crystal, which also leaves gaps in the numbering. Therefore, treating this field as an index is a big mistake -- the classic error of PDB parsing, in fact. A fuller explanation, including why they permit this crock, is available as part of the ATOM record format specification.

Note: Sorting by this field in alphabetic order does not result in correct ordering of residues; even ignoring the chain ID issue, the minus sign on negative residues causes them to sort before 0, but then they appear in ascending order by the following digits. Not to mention the following ambiguity: Which is closer to 0: -4A or -4C? I haven't seen any examples of insertion codes below 0, so I couldn't guess.
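
To make the sorting problem concrete, here is a small Python sketch that compares a plain string sort with a sort on the parsed field. The pdbres_sort_key name is hypothetical, and the key simply assumes that a blank insertion code comes before "A", "B", etc.; it does nothing to settle the -4A versus -4C question, so the order of residues in the file remains the safer guide.

```python
def pdbres_sort_key(pdbres):
    """Sort key for a 5-character seqno + insertion code string such as ' 256C'."""
    seqno = int(pdbres[:4])          # I4 sequence number, may be <= 0
    icode = pdbres[4:5].strip()      # A1 insertion code, "" when blank
    return (seqno, icode)


residues = ["  10 ", "  -2 ", "   9 ", "   9A", "  -1 ", "   0 "]
print(sorted(residues))                       # plain string sort
print(sorted(residues, key=pdbres_sort_key))  # numeric sort with insertion code
```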

Anomalies:

Duplicate index.
The last three residues of 1tre's chain A are "LYS A 255 ", "GLN A 255 ", and "ALA A 257 "; the glutamine is clearly mislabelled. (Although the B chain is identical, it does not share this problem because the last two residues are not resolved.) filter-pdb-atoms.pl now catches this, saying "'LYS A 255 ' and 'GLN A 255 ' share an index/insertion code."

Similarly, 1hpl has "TRP A  30 " and "SER A  30 "; the same problem appears on the B chain as well. The numbering is otherwise contiguous, making it less obvious what went wrong. This particular case can be kludged by doing the following:


       perl -pe 's/  30 /  30A/ if /^ATOM  .*SER .  30 /;' 1hpl.ent \
           | efa.pl > 1hpl.nexp
[Unfortunately, there's no general approach. -- rgr, 8-Jun-98.]
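
Detecting the problem, at least, is straightforward. The following Python sketch is my own illustration of the idea and not the filter-pdb-atoms.pl implementation; find_shared_indices and demo_res are hypothetical names, and only ATOM records are examined.

```python
def find_shared_indices(lines):
    """Return pairs of residue labels that share a chain/index/insertion code."""
    seen = {}      # (chain, pdbres) -> resName of the first residue seen there
    clashes = []
    for line in lines:
        if not line.startswith("ATOM"):
            continue
        res_name, chain, pdbres = line[17:20], line[21:22], line[22:27]
        first = seen.setdefault((chain, pdbres), res_name)
        clash = ("%s %s%s" % (first, chain, pdbres),
                 "%s %s%s" % (res_name, chain, pdbres))
        # Report each clashing pair once, however many atoms it has.
        if first != res_name and clash not in clashes:
            clashes.append(clash)
    return clashes


def demo_res(serial, res_name, seqno):
    # Hypothetical helper: a chain A CA record truncated after the insertion code.
    return "ATOM  %5d  CA  %s A%4d " % (serial, res_name, seqno)


demo = [demo_res(1, "LYS", 255), demo_res(2, "GLN", 255),
        demo_res(3, "GLN", 255), demo_res(4, "ALA", 257)]
print(find_shared_indices(demo))
```

The labels use the same "LYS A 255 " layout as the messages quoted above.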

Aversion to zero?
The initial residues of 2bb2 are numbered as follows:
-2, -1, 1, 2, . . .
This is perfectly legitimate, but does it encode an annotation history of an insertion of length three (including a zero residue) followed by a deletion, or is it a simple aversion to using a zero as a residue index?

Notes on atom occupancies

Normally, each atom is "visible" in exactly one location, and therefore has one atom record with an occupancy of 1.00. Sometimes part of the chain will crystallize in two or more stable sets of locations, leading to pairs (usually) of atom records for the same atom, differentiated by a letter in the alternate location field (field 4); the sum of atom occupancies for all such variants generally adds to 1.00. There are a number of legitimate reasons why this may not be the case.

There is no requirement that the higher-occupancy variants appear first. In fact, 7rsa (our old friend) has some sets of variants in the order 0.66/0.33 and some in the order 0.33/0.66. (There is, however, a requirement that related variant sets appear with the same letter and in the same order, so these may in fact be coupled somehow. -- rgr, 8-Oct-96.)
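
Checking whether the variants of each atom sum to 1.00 can be sketched in Python like this. The function names, the demo helper, and the 0.015 tolerance (chosen so that 0.66 + 0.33 passes) are all my own assumptions, and the sketch expects a single-model entry.

```python
from collections import defaultdict


def occupancy_sums(lines):
    """Map each atom (ignoring altLoc) to the total occupancy of its variants."""
    totals = defaultdict(float)
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            key = (line[12:16], line[17:20], line[21:22], line[22:27])
            totals[key] += float(line[54:60])  # float() copes with "   .80"
    return dict(totals)


def off_unity(totals, tolerance=0.015):
    """Return the atoms whose summed occupancy is not within tolerance of 1.00."""
    return {key: total for key, total in totals.items()
            if abs(total - 1.0) > tolerance}


def demo_line(name, alt, occ_text):
    # Hypothetical helper: a LEU A 108 record; occ_text is the raw 6-column field.
    return "ATOM  %5d %-4s%1sLEU A%4d    %8.3f%8.3f%8.3f%6s%6.2f" % (
        1, name, alt, 108, 0.0, 0.0, 0.0, occ_text, 20.0)


demo = [demo_line(" CD1", "A", "  0.66"), demo_line(" CD1", "B", "  0.33"),
        demo_line(" CD2", "L", "  0.51"), demo_line(" CD2", "U", "  0.52"),
        demo_line(" CA ", " ", "   .80")]
flagged = off_unity(occupancy_sums(demo))
print(sorted(key[0] for key in flagged))
```

A flagged atom is not necessarily an error, since a lone atom may quite reasonably have an occupancy below 1.00.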

Anomalies:

Occupancy field with leading space.
If the value of the occupancy field is less than 1.00, the field will not necessarily start with a leading zero, as in "  .80".  2aza has many examples of this.

Occupancies that sum to more than 1.00.
1mng has 50 atoms in each of two chains with alternate locations. The letters used are "L" and "U", for the liganded and unliganded state respectively (this use of alternate locations may be unconventional, but is not explicitly forbidden by the PDB specification). For some reason, the alternatives in the first chain have occupancies of 0.51 and 0.52 (resp.), and 0.50 and 0.55 (resp.) in the second chain. While the two chains need not have the same binding affinity, and there is even less reason for the bound and unbound states to be equally common, it makes no sense to me for the occupancies to sum to more than one.

In 1ifc, every atom has a pair of alternate locations, very few of which have occupancies that total 1.00. Remark 4 from the PDB entry says the following:

This entry contains two superposed structures. The two structures were used in the last cycle of refinement to determine the occupancies of atoms that displayed two alternate positions. Coordinates for the two structures including solvent molecules have been assigned the alternate location indicators A and B.
Presumably, the author relaxed the constraint that the occupancies of alternate locations sum to one. But since they seem to be mostly equal to each other, displaying a range of values around 0.50, it's not clear what the real constraints were.

Electron counts in the occupancy field
1mbd multiplies the occupancy by the atom's electron count, yielding numbers like 8.00 for oxygens and 6.00 for carbons. Remark 5 says this, almost:
The quantity given in the occupancy field of the ATOM and HETATM records is the electron count.
But the HETATM records have strange numbers like 5.84 and (stranger still) 11.43 for HOH oxygens, so that appears not to be the case.

Bob Rogers <rogers@darwin.bu.edu>
Last modified: Thu Apr 6 16:55:56 EDT 2000