PDB ATOM format
These are just my summary notes with real-world examples (and counter-examples!). For more information, see the detailed ATOM record specification from the PDB Contents Guide Version 2.2 (draft) of 16-Dec-1996 at the Protein Data Bank.
Note: Column numbers are 0-based, because so is the emacs C-x = (what-cursor-position) command, and so are C and perl subscripts!

    0         1         2         3         4         5         6         7
    01234567890123456789012345678901234567890123456789012345678901234567890123456789
    ATOM     86  CG  ARG    11      -2.455   1.706  24.211  1.00 17.72      1AAK 146

| Name | Start | End | Format | Description |
|---|---|---|---|---|
| recname | 0 | 5 | A6 | a literal "ATOM  " (note two trailing spaces). |
| serial | 6 | 10 | I5 | atom serial number, e.g. "   86". See below for details. |
| | 11 | 11 | 1X | space |
| atom | 12 | 15 | A4 | Atom role name, e.g. " CG1". See below for details. |
| altLoc | 16 | 16 | A1 | atom variant, officially called the "alternate location indicator". This is usually " " for atoms with well-defined positions, as in this case, but sometimes "A", "B", etc. See below for details. |
| resName | 17 | 19 | A3 | amino acid abbreviation, e.g. "ARG". See below for details. |
| | 20 | 20 | 1X | space |
| chainID | 21 | 21 | A1 | chain ID, usually " ", but often "A", "B", etc, for multichain entries. See below for details. |
| Seqno | 22 | 26 | A5 | residue sequence number (I4) and insertion code (A1), e.g. "  11 " or " 256C". See below for details. |
| | 27 | 29 | 3X | three spaces |
| x | 30 | 37 | F8.3 | atom X coordinate |
| y | 38 | 45 | F8.3 | atom Y coordinate |
| z | 46 | 53 | F8.3 | atom Z coordinate |
| occupancy | 54 | 59 | F6.2 | atom occupancy, usually "  1.00". The occupancies of all variants of an atom (the altLoc field) generally sum to 1.0. |
| tempFactor | 60 | 65 | F6.2 | B value or temperature factor, e.g. " 17.72". (I don't use this value, so have nothing to add; see the ATOM record specification discussion of B factors, etc. -- rgr, 8-Oct-96.) |
| | 66 | 71 | A6 | ignored. [Some older PDB files have footnote numbers here, but this field is not described in the Format 2.1 specification. -- rgr, 22-Jan-99.] |
| recID | 72 | 79 | A8 | [prior to format version 2.0.] record identification field, e.g. "1AAK 146" (tres FORTRAN, n'est-ce pas?). |
| segID | 72 | 75 | A4 | segment identifier, left-justified. [format version 2.0 and later.] |
| element | 76 | 77 | A2 | element symbol, right-justified. [format version 2.0 and later.] |
| charge | 78 | 79 | A2 | charge on the atom. [format version 2.0 and later.] |
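
As a rough illustration of the column table above, here is a minimal Python sketch that slices one ATOM line at those 0-based offsets. The names `Atom` and `parse_atom_line` are made up for this example and do not come from any PDB library. The sketch ignores columns 66-79 entirely, since their meaning depends on the format version.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str         # kept as 4 characters; the padding is significant
    alt_loc: str
    res_name: str
    chain_id: str
    seqno: str        # 5 characters: residue number plus insertion code
    x: float
    y: float
    z: float
    occupancy: float
    temp_factor: float


def parse_atom_line(line: str) -> Atom:
    """Slice one ATOM record using the 0-based columns from the table."""
    if not line.startswith("ATOM  "):
        raise ValueError("not an ATOM record")
    line = line.rstrip("\n").ljust(80)  # pad short lines so slices are safe
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16],
        alt_loc=line[16],
        res_name=line[17:20],
        chain_id=line[21],
        seqno=line[22:27],
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
        occupancy=float(line[54:60]),
        temp_factor=float(line[60:66]),
    )


example = "ATOM     86  CG  ARG    11      -2.455   1.706  24.211  1.00 17.72      1AAK 146"
atom = parse_atom_line(example)
print(repr(atom.name), atom.serial, atom.x)  # ' CG ' 86 -2.455
```

Python slices are half-open, so the table's inclusive range 12 to 15 becomes `line[12:16]`. Splitting the line on whitespace instead would fail as soon as two fields run together or a field is blank.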

The atom name field (columns 12-15) is itself made up of three parts:

| Name | Format | Description |
|---|---|---|
| atomic symbol | A2 | Right-justified atomic symbol, e.g. " C". |
| remoteness indicator | A1 | Greek letter distance abbreviation. In order of increasing distance, these are "A" for alpha, "B" for beta, "G" for gamma, "D" for delta, "E" for epsilon, "Z" for zeta, and "H" for eta. |
| branch designator | A1 | digit designating the branch direction; left blank if the sidechain is unbranched. |
For the name " CG1", " C" denotes the species (carbon), "G" identifies it as a gamma atom, and "1" denotes the branch of a beta-branched amino acid. Note that the atomic symbol is right-justified (i.e. " C", "ZN"), so " CA " is an alpha carbon, and "CA  " is a calcium atom.
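
The three-part layout of the atom name can be sketched in a few lines of Python. `split_atom_name` is a hypothetical helper written for this note. It assumes a heavy-atom name that follows the layout exactly, so hydrogens with digit prefixes and special names such as " OXT" are out of scope.

```python
def split_atom_name(name: str):
    """Split a 4-character heavy-atom name into (element, remoteness, branch)."""
    name = name.ljust(4)[:4]
    element = name[0:2].strip()    # right-justified symbol, e.g. " C" or "CA"
    remoteness = name[2].strip()   # "A", "B", "G", ... or "" if absent
    branch = name[3].strip()       # branch digit, or "" if unbranched
    return element, remoteness, branch


print(split_atom_name(" CG1"))  # ('C', 'G', '1')
print(split_atom_name(" CA "))  # ('C', 'A', '')   alpha carbon
print(split_atom_name("CA  "))  # ('CA', '', '')   calcium
```

The last two calls show why the name must never be stripped before it is interpreted: the position of the blank is the only thing that separates an alpha carbon from a calcium.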
For amino acids, the traditional order of atom records within a residue is N, CA, C, O, followed by the sidechain atoms (CB, CG1, CG2 . . . ) in order first of increasing remoteness, and then branch. The extra oxygen at the carboxyl terminal is called " OXT", and appears after all sidechain atoms.
Note: The amino-to-carboxyl backbone order of polypeptides is required by the PDB, but the described ordering of sidechain atoms is nowhere specified (that I can find). However, I have not found any exceptions. Some BMERC programs may break if it is not strictly followed, though this can be avoided by preprocessing the ATOM records with filter-pdb-atoms.pl.
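
For what it's worth, the traditional heavy-atom order described above can be written as a sort key. This is only a sketch of that convention in Python, not what filter-pdb-atoms.pl actually does. `heavy_atom_sort_key` and the two constants are names invented here, and hydrogens are deliberately not handled.

```python
BACKBONE = {" N  ": 0, " CA ": 1, " C  ": 2, " O  ": 3}
REMOTENESS = "ABGDEZH"  # alpha, beta, gamma, delta, epsilon, zeta, eta


def heavy_atom_sort_key(name: str):
    """Order: N, CA, C, O, then sidechain by remoteness and branch, then OXT."""
    if name in BACKBONE:
        return (0, BACKBONE[name], "")
    if name == " OXT":
        return (2, 0, "")
    # Raises ValueError for names with an unknown remoteness letter.
    return (1, REMOTENESS.index(name[2]), name[3])


names = [" CD1", " O  ", " CG2", " N  ", " OXT", " CB ", " C  ", " CG1", " CA "]
print(sorted(names, key=heavy_atom_sort_key))
```

With the shuffled isoleucine-style names above, the result runs N, CA, C, O, CB, CG1, CG2, CD1, OXT.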
Hydrogens, if they appear for a given residue, come after all that residue's heavy atoms, in the same order and with the same nomenclature, e.g. " HA " is the hydrogen on the alpha carbon (the fourth stereospecific bond). However, if there are multiple hydrogens on a given heavy atom, they are given prefix digits to make them unique, e.g. "1HB " and "2HB " for a non-beta-branched residue (so valine has a plain " HB ", and alanine has an additional "3HB "). [I don't know whether the prefixes are specific to the rotamer directions, though. -- rgr, 8-Oct-96.]
Anomalies:
Pseudo-atoms designated as Q are dimensionless reference points
representing a group of hydrogen atoms. They are placed in the
center of the positions of the hydrogen atoms they represent. QA
represents the two methylene hydrogen atoms of GLY. QB, QG,
... represent beta, gamma, ... methylene or methyl groups in the
side chains. In case of branches in the side chains the numbers
of the pseudo-atoms are the same as the numbers of the carbons to
which the hydrogen atoms are attached. QQG and QQD denote the
pseudo-atoms for the 6 hydrogen atoms of the isopropyl methyl
groups of VAL and LEU. QR is the pseudo-atom for the delta and
epsilon hydrogens of the aromatic rings of TYR and PHE.
(K. Wuthrich, M. Billeter and W. Braun, J. Mol. Biol. (1983)
vol. 169, 949-961)
. . . Residues GLU 119 to ASN 128 were modelled with no side
chains. There was some density in the primary MIR-map but the
residues did not show up in the refined model. The occupancies
of these residues were set to zero.
It appears that the software that generated the backbone neglected
to consider that glycines do not normally have beta carbons.
Anomalies:
On the other hand, there is no REMARK 550, which is supposed to describe the meaning of the SEGID field identifiers.
The following atoms that are related by crystallographic symmetry
are in close contact. Some of these may be atoms located on
special positions in the cell. Atoms with non-blank alternate
location indicators are not included in the calculations.
Ironically, there are no atoms so labelled. It is therefore
likely that the text was generated automatically.
Residue 21 has been modeled both as VAL and as ILE. It is
presented in the ATOM records below as VAL 21 followed by ILE
21A. All atoms of this residue have been assigned occupancies of
0.5.
Thus, none of these atoms have the expected non-blank atom
variant field.
filter-pdb-atoms.pl sees chain breaks between VAL 21
and ILE 21A.
And sometimes it is weirder.
Anomalies:
Standard Protein Data Bank procedure is to use a null (blank) character for the chain indicator in structures comprising only one chain. In this entry, however, the chain identifier R (for reduced) was used to conform to the depositor's publications and to emphasize the relationship between this structure and the oxidized form which is available from the Protein Data Bank as a separate entry.

(By the way, REMARK 4 is supposed to be a compliance note, of the form "XXXX COMPLIES WITH FORMAT V. N.M, DD-MMM-YYYY". Clearly, 5cyt does not, though its choice of chain ID is not out of line.)
To find all chain IDs and the lengths of their associated sequences (as represented by atom records), do the following:

    # cut counts columns from 1, so -c22 is the chainID (0-based column 21)
    grep '^ATOM........ CA [ A]' "$pdb_file_name" \
      | cut -c22 | uniq -c

Note that this does not work for PDB entries with multiple models
(i.e. that contain MODEL/ENDMDL records), but you can put
filter-pdb-atoms.pl at the start of the pipe to correct
this.
For the case of 2plv, the output looks like this:
288 1
268 2
235 3
62 4
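
The same count can be sketched in Python, which makes it easy to stop at the first ENDMDL record. This is an illustration only, not a replacement for filter-pdb-atoms.pl. `chain_lengths` and the demo helper `ca_line` are hypothetical names, and keeping only the first model is an assumption of this sketch.

```python
from collections import Counter


def chain_lengths(lines):
    """Count alpha carbons per chain ID, looking at the first model only."""
    counts = Counter()
    for line in lines:
        if line.startswith("ENDMDL"):
            break  # assumption: only the first model is wanted
        if (line.startswith("ATOM  ") and line[12:16] == " CA "
                and line[16:17] in (" ", "A")):
            counts[line[21]] += 1
    return counts


def ca_line(chain, resseq, alt=" "):
    # Hypothetical helper: builds a truncated CA line for the demo below.
    return "ATOM  %5d  CA %sGLY %s%4d " % (1, alt, chain, resseq)


lines = [ca_line("1", 1), ca_line("1", 2, "A"), ca_line("1", 2, "B"),
         ca_line("1", 3), ca_line("2", 1), ca_line("2", 2),
         "ENDMDL", ca_line("1", 1)]
print(dict(chain_lengths(lines)))  # {'1': 3, '2': 2}
```

One difference from the shell pipeline: `uniq -c` counts runs of adjacent lines, so a chain ID that reappears later in the file shows up twice there, while `Counter` gives one total per chain ID.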
The insertion code is usually blank, but values of A, B, C, etc., are used to preserve historical numbering schemes in the presence of insertion mutations. (In rare cases, the insertion code can even be a digit, though the use of digits is now officially deprecated. Personally, I haven't seen any of these. Yet.) The integer sequence number can be zero or negative, and consecutive residues can be numbered nonconsecutively, again due to insertions or deletions with respect to some consensus sequence. Finally, residues can be unresolved in the crystal, which also leaves gaps in the numbering. Therefore, treating this field as an index is a big mistake -- the classic error of PDB parsing, in fact. A fuller explanation, including why they permit this crock, is available as part of the ATOM record format specification.
Note: Sorting by this field in alphabetic order does not result in correct ordering of residues; even ignoring the chain ID issue, the minus sign on negative residues causes them to sort before 0, but then they appear in ascending order by the following digits. Not to mention the following ambiguity: Which is closer to 0: -4A or -4C? I haven't seen any examples of insertion codes below 0, so I couldn't guess.
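
One way to avoid both mistakes is to treat the five-character field as an opaque label and take the residue's position from the order in which it first appears in the file. The Python sketch below shows the idea. `split_seqno`, `residue_index` and `demo_line` are names made up for this note, and the approach assumes the file lists residues in chain order.

```python
def split_seqno(field: str):
    """Split the 5-character Seqno field into (number, insertion code)."""
    return int(field[0:4]), field[4]


def residue_index(lines):
    """Map (chain ID, Seqno field) to a 0-based position in file order."""
    index = {}
    for line in lines:
        if line.startswith("ATOM  "):
            key = (line[21], line[22:27])
            index.setdefault(key, len(index))  # keep the first position seen
    return index


def demo_line(serial, number, icode=" "):
    # Hypothetical helper: a truncated ATOM line on chain A for the demo.
    return "ATOM  %5d  CA  ALA A%4d%s" % (serial, number, icode)


lines = [demo_line(1, -1), demo_line(2, -1), demo_line(3, 1),
         demo_line(4, 256), demo_line(5, 256, "C")]
index = residue_index(lines)
print(split_seqno(" 256C"))      # (256, 'C')
print(index[("A", " 256C")])     # 3
```

Note that this quietly merges two residues that share the same chain ID and Seqno field, so it is worth checking for such duplicates first.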
Anomalies:
Similarly, 1hpl has "TRP A  30 " and "SER A  30 "; the same problem appears on the B chain as well. The numbering is otherwise contiguous, making it less obvious what went wrong. This particular case can be kludged by doing the following:

    perl -pe 's/ 30 / 30A/ if /^ATOM .*SER .  30 /;' 1hpl.ent \
      | efa.pl > 1hpl.nexp

[Unfortunately, there's no general approach. -- rgr, 8-Jun-98.]
-2, -1, 1, 2, . . .

This is perfectly legitimate, but does it encode an annotation history of an insertion of length three (including a zero residue) followed by a deletion, or is it a simple aversion to using a zero as a residue index?
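
Duplicated residue labels of the 1hpl kind can at least be detected automatically, even if fixing them cannot. The Python sketch below reports every (chain ID, Seqno) pair that carries more than one residue name. `conflicting_residue_ids` and `demo_line` are hypothetical names, and the check will also flag residues deliberately modelled as two residue types under the same number.

```python
def conflicting_residue_ids(lines):
    """Return {(chain ID, Seqno field): [residue names]} for reused labels."""
    names = {}
    for line in lines:
        if line.startswith("ATOM  "):
            key = (line[21], line[22:27])
            names.setdefault(key, set()).add(line[17:20])
    return {key: sorted(found) for key, found in names.items() if len(found) > 1}


def demo_line(serial, res_name, chain, number, icode=" "):
    # Hypothetical helper: a truncated ATOM line for the demo below.
    return "ATOM  %5d  CA  %s %s%4d%s" % (serial, res_name, chain, number, icode)


lines = [demo_line(1, "GLY", "A", 29), demo_line(2, "TRP", "A", 30),
         demo_line(3, "SER", "A", 30), demo_line(4, "ALA", "A", 31)]
print(conflicting_residue_ids(lines))  # {('A', '  30 '): ['SER', 'TRP']}
```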
Some atoms in residues 7 and 31 have an occupancy less than 1.0. These atoms were poorly defined in the electron density and their occupancy was lowered.

(By the way, REMARK 4 is supposed to be a compliance note, of the form "XXXX COMPLIES WITH FORMAT V. N.M, DD-MMM-YYYY". Clearly, 7rsa does not.)
Anomalies:
In 1ifc, every atom has a pair of alternate locations, very few of which have occupancies that total 1.00. Remark 4 from the PDB entry says the following:
This entry contains two superposed structures. The two structures were used in the last cycle of refinement to determine the occupancies of atoms that displayed two alternate positions. Coordinates for the two structures including solvent molecules have been assigned the alternate location indicators A and B.

Presumably, the author relaxed the constraint that the occupancies of alternate locations sum to one. But since they seem to be mostly equal to each other, displaying a range of values around 0.50, it's not clear what the real constraints were.
The quantity given in the occupancy field of the ATOM and HETATM records is the electron count.

But the HETATM records have strange numbers like 5.84 and (stranger still) 11.43 for HOH oxygens, so that appears not to be the case.
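
Checking the occupancy convention is mechanical. The Python sketch below adds up the occupancies of all alternate locations of each atom and reports those whose total is not close to 1.0. `occupancy_sums`, `suspicious` and `demo_line` are names invented for this note, and the tolerance is an arbitrary choice that allows for two-decimal rounding.

```python
from collections import defaultdict


def occupancy_sums(lines):
    """Sum occupancies over altLoc variants of each atom (non-blank altLoc only)."""
    sums = defaultdict(float)
    for line in lines:
        if line.startswith(("ATOM  ", "HETATM")) and line[16:17].strip():
            key = (line[21], line[22:27], line[17:20], line[12:16])
            sums[key] += float(line[54:60])
    return dict(sums)


def suspicious(lines, tolerance=0.011):
    """Atoms whose alternate-location occupancies do not total about 1.0."""
    return {key: total for key, total in occupancy_sums(lines).items()
            if abs(total - 1.0) > tolerance}


def demo_line(serial, name, alt, occupancy):
    # Hypothetical helper: a SER 7 atom on chain A with dummy coordinates.
    return ("ATOM  %5d %-4s%sSER A%4d    %8.3f%8.3f%8.3f%6.2f%6.2f"
            % (serial, name, alt, 7, 0.0, 0.0, 0.0, occupancy, 20.0))


lines = [demo_line(1, " CA ", "A", 0.60), demo_line(2, " CA ", "B", 0.40),
         demo_line(3, " CB ", "A", 0.48), demo_line(4, " CB ", "B", 0.47)]
print(list(suspicious(lines)))  # [('A', '   7 ', 'SER', ' CB ')]
```

Run against an entry like 1ifc, a check of this kind would presumably flag most atoms, which is a sign that the authors used a different convention rather than that the file is corrupt.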