[First major modification: I have decided to store amino acids and PDB residue indices as internal numbers, rather than 3-letter abbreviations and PDB strings respectively. This should make it easier to compute environments and counts, without overly complicating report generation. -- rgr, 17-Jun-96.]
[I have now moved the scripts to ~thread/code/bin, where the other BMERC software tools under my care live, and have started to keep a separate development directory so I don't step on people's toes. -- rgr, 30-Jul-96.]
[The segment tables and related information are now in place, though not quite complete. -- rgr, 28-Oct-96.]
At the same time, I have no wish to reinvent a whizzy English relational database interface. (I don't even want to reinvent SQL!) Users will have to write snippets of perl in order to do anything useful; my job is just to make it so we don't have to rewrite all our scripts if we want to add a few new fields.
As an example of what we are trying to accomplish, a simple query could be phrased in English as:
Count (i.e. build a histogram of) all residues that see a neighboring tryptophan on the same piece of secondary structure.[I still can't quite phrase this in an executable form; it depends on having the segment stuff in place. -- rgr, 11-Jul-96.] A refinement might be:
Do the same, but build three separate histograms, one each for all alpha, all beta, and mixed alpha-beta proteins.(The definition of "neighboring" may be qualitative, as in "has a pairwise entry in the database"; if quantitative, as in "has an interaction energy of at least X", then the threshold can be adjusted, as long as it is at least as stringent as the minimum value used to decide whether to put a pairwise entry in the database.)
It would probably be useful to treat the information stored as ATOM records in the PDB files as an additional relation keyed on locus, residue, and atom type (e.g. "CD1"). We could then do interesting things with coordinate information as well.
[I think it may be necessary to expose a little perl here. In
particular, the names of computed fields should probably start with "&",
and noncomputed fields with "$", in order to simplify implementation.
-- rgr, 12-Jun-96.]
Note that 0 is not a valid window, and so is used to indicate "none."
It is possible to get 0 as a window for amino acids that are normally
beta-branched if the side chain was not resolved crystallographically.
In other words, no CG1 atom means that the $sing_gamma1_window
will be zero, regardless of whether the residue is alanine or valine.
[Note: Other code calls this the "segment designator," which
may be preferable; the terminology should be standardized. -- rgr,
21-Aug-96.]
Segment types are strings that convey the nature of the segment. All
helices are of type "H"; all other segments are strands of various
sorts. The labels fall into three classes:
[And many score functions care only whether a segment is a helix or a
strand. (What code uses this, anyway? Crippen? Jernigan? MRF?). The
code that generates these labels is also under development; the
assignment is not always unique, so our heuristics may change. -- rgr,
31-Jul-96.]
[Note: The protein table is not implemented yet;
waiting on segment definition. -- rgr, 2-Jul-96.]
[May need special fields for PDB locus vs chain id. -- rgr,
19-Jul-96.]
[Note: The segment table is not implemented yet;
waiting on segment definition. -- rgr, 19-Jul-96.]
[Right now I generate segment files from Sophia's .core files. If we
want to drive core generation from csdb, rather than the reverse at
present, we should build .seg files from .sing DSSP information, then
build the core files from either (or both). Until I get this figured
out, I won't be able to fill in the segment-related fields. -- rgr,
28-Jun-96.]
Loredana's tetrahedral data (based on her descriptions):
Segment data. If available, the
$segment_data_available_p variable will be true, and the
following additional fields will be defined. [not yet implemented. --
rgr, 28-Oct-96.]
. . . [and so on for counts & visible volume. -- rgr, 17-Jun-96.]
. . .
Pairwise energy terms.
Pairwise segment terms. Note that $pair_res*_cti is
really a singleton value, rather than a segment value. For loop
residues, all of these values will be null strings, which is the boolean
value for 'false' in perl.
In a script, only the &evaluate_expression subroutine needs
to be defined. [Should perhaps be called &process_record or
something similar. -- rgr, 12-Jul-96.]
If no loci are supplied on the command line, they are read from the
standard input, one per line. It is also possible to supply filenames,
though the extension is ignored; this is convenient for shell wildcards.
At present, the data files live in the
~thread/structure/data/test/ directory (or wherever the
$CSDB_DATA_DIR environment variable happens to point). This is
assumed to be the current directory for the examples below (but please
don't write anything there!). [bug: this location is obsolete. --
rgr, 20-Aug-97.]
For example,
If we only wanted data for DSSP helices, we could use:
Here's an example that just counts the number of ALA-ALA pairs within
8.5 Ångstroms that see each other in 1aaj. We need two
expressions here, one to count and one to print the result. Note that
we are using the internal aa_index
values for the amino acids.
So the answer is 4, but why should we really believe that? Let's
list the PDB residue numbers for each case (since we know that the
number of them is manageable). The command and resulting output now
looks like this:
One way to build a histogram in perl is to increment the elements of
an array, instead of a single count. (perl takes care of extending the
array to the right size, and treats unitialized elements of the array as
zeros or null strings, depending on context, just as for scalars.) The
following example counts same-residue pairs (regardless of distance or
any measure of interaction) by amino acid. All such pairs are
accumulated into the @count array (which is distinct from the
scalar variable $count, though one must use $count[$i]
to access the $i'th element of @count).
[An example where I turn some random counts into probabilities by
normalizing would be nice to have, too. Probably easier to do this with
a singleton example. On the other hand, I'd like to see what these
numbers mean; why so many LEU-LEU, and so few CYS-CYS? -- rgr,
10-Jul-96.]
The aa_index datatype
Amino acids are assigned an internal code based on the alphabetical
position of their standard single-letter abbreviation, starting with 0
for Alanine. This numbering was chosen to be identical with the
internal encoding used by the stat and needle programs. Even though
neither of these programs rely on sharing a common internal encoding, it
will probably help to avoid human error to use the same convention.
The aa_abbrev datatype
This is the standard 3-letter amino acid abbreviation, in upper case to
conform to PDB usage.
The core total index
The core total index, or CTI, is the 1-based index of a residue
in a core model, counting only those residues that appear in the core.
In other words, if the last residue of a segment has CTI = n,
then the first residue of the next segment has CTI = n+1. The
CTI numbering scheme is used in environment files and
other places that need to concisely refer to specific residues in a core
file.
The pdb_index datatype
The residues that appear in the PDB ATOM records are numbered
sequentially from 0. See the pdbres datatype
for a discussion of why this is necessary.
The pdbres datatype
This is an alphanumeric sequence number, exactly as if taken from
(1-based) columns 23 through 27 of the ATOM field of the PDB file. In
general, this will not be an integer, since there are sequence numbers
like "235A". The first four characters are digits, with blank
padding on the left, and the fifth and last character is either a letter
suffix or blank. (This format definition should make it easier to
extract the correct residues from the PDB entry. Note also that
alphanumeric sorting will reproduce the order the residues appear in the
PDB.)
The window datatype
[Need some explanation of Loredana's window numbering convention. I
know that window 1 points back to the alpha carbon, leaving 2, 3, and 4
as the windows of interest, but that's all I know. -- rgr, 28-Jun-96.]
The segment type
Note that sheet indices and the strand indices used to name barrel
segments are numbered for the PDB entry as a whole, and not
strand-by-strand. This should work fine for most code, which only cares
whether strand residues are part of the same sheet or not. (In
principle, barrels are trickier, since one cannot assume that a core
model includes only one barrel; one must trace the barrel around to find
if two strands are on the same barrel. In practice, 1tie seems to be
the only model with more than one barrel, and it appears to be
pathological in any case. -- rgr, 21-Aug-96.)
Protein table
This contains data for whole protein structures, lives in the
all.prot file, and is keyed only on $prot_locus.
Core segment table
These live in the locus.seg files, with $seg_locus and
$seg_segment constituting the key. There are special-purpose scripts to
extract subsets of the core segments from singleton and pairwise tables.
(And probably other scripts to extract loops . . . )
Singleton table
These are stored in locus.sing files, with sing_locus and
sing_pdbres making up the key. [Right now the key is implicit, composed
of the first two fields; do I need an explicit key so I can more easily
use join? Should I have the extraction script manufacture such a key on
output? Or maybe that should just be a computed field that a user can
ask for if needed? -- rgr, 11-Jun-96.]
Pairwise tables
These are kept in locus.pair files, with $pair_locus,
$pair_res1_pdb_index, and $pair_res2_pdb_index making up the key. The
extract-pairwise-data.pl script
should be used to access values in these files.
In addition, the following "pairwise" fields beginning "pair_res1_*" and
"pair_res2_*" are defined (via implicit joins) as the values for the
corresponding singleton records. All singleton fields are defined,
using the convention that "sing" in the singleton field name is replaced
by "pair_res1" for the first residue, and "pair_res2" for the second.
Scripts
Setting up
Before doing anything else, it is necessary to do
source ~thread/code/bin/csdb-setup
in order to establish environment variables used by the other scripts.
It puts the ~thread/code/bin/ directory on your PATH if it is not
already there, which also makes the core
generation tools available.
perl expressions
[useful references? perl expressions can be tough to figure out for the
uninitiated . . . -- rgr, 9-Jul-96.]
Expression keyword arguments
These keyword arguments are accepted by both the extract-singleton-data.pl and extract-pairwise-data.pl scripts.
They are listed in the order in which they are first evaluated. Any of
these arguments can be repeated any number of times; all expressions are
evaluated at the appropriate time in the order specified. At least one
-e parameter must be supplied.
perl scripts
[notes on writing your own scripts -- skip on first reading. -- rgr,
12-Jul-96.]
Extracting singleton data
extract-singleton-data.pl takes one or more perl expressions
that are supplied as initial keyword arguments, and which can include
conditionals, print statements, random computation, etc.
extract-singleton-data.pl [ keyword . . . ] [ locus . . . ]
So far, the only keywords defined are the expression keywords, which are evaluated at the
appropriate times while processing the singleton records for the
specified loci; the principal one of interest is the
-e keyword, which is evaluated for each singleton record.
extract-singleton-data.pl -e 'print(join("\t", $sing_locus, $sing_aa_index, \
$secondary_structure_codes{$sing_dssp_ss}), "\n");' *.sing
produces a tab-delimited table of locus, amino acid internal index, and
numeric code for the smoothed DSSP secondary structure assignments, for
all residues in all proteins, on the standard output. The expression at
the start of the second line is interesting:
"$secondary_structure_codes{$sing_dssp_ss}" maps
$sing_dssp_ss (a string containing a single character) to a
numeric value.
extract-singleton-data.pl \
-e 'if ($sing_dssp_ss eq "H") { \
print(join("\t", $sing_locus, $sing_aa_index, \
$secondary_structure_codes{$sing_dssp_ss}), \
"\n"); }' \
*.sing
The only real difference between this and the last example (besides the
improved indentation) is the presence of the conditional "if
($sing_dssp_ss eq "H") {" at the start of the expression -- and of
course the matching } at the end. Note that perl uses "eq" for
string comparison (as distinguished from "==" for numeric comparison).
Extracting pairwise data
To examine or extract pairwise records from the database, use the
extract-pairwise-data.pl script. It's usage is similar to that
of the
extract-singleton-data.pl script:
extract-pairwise-data.pl [ keyword . . . ] [ locus . . . ]
So far, the only keywords defined are expression
keywords. The locus arguments
are as for extract-singleton-data.pl.
Pairwise extraction examples
gamow% extract-pairwise-data.pl \
-e 'if ($pair_visible_p && $pair_cb_distance <= 8.5 \
&& $pair_res1_aa_index == 0 && $pair_res2_aa_index == 0) \
{ $count++ }' \
-finally 'print "$count\n";' 1aaj.pair
4
gamow%
Two peculiarities of perl are immediately apparent. The nicer quirk is
that we did not need a -initially expression to initialize
$count. Somewhat less convenient is that fact that although
the perl if statement resembles the C if statement,
even the simple $count++ expression that we wanted evaluated
when the test is true must be enclosed in curly braces. Also, to get
the count value printed on a line of its own, it was necessary to
include an explicit newline -- the "\n" inside the double
quotes in the -finally expression.
gamow% extract-pairwise-data.pl \
-e 'if ($pair_visible_p && $pair_cb_distance <= 8.5 \
&& $pair_res1_aa_index == 0 && $pair_res2_aa_index == 0) \
{ print "$pair_res1_pdbres $pair_res2_pdbres\n"; }' \
1aaj.pair
12 14
17 20
20 77
59 66
gamow%
Looking in the PDB file for 1aaj, we see that the indicated residues are
all alanines. (The 'extra' spaces appear in the output because they are
part of the pdbres fields.)
gamow% extract-pairwise-data.pl \
-e 'if ($pair_res1_aa_index == $pair_res2_aa_index) \
{ $count[$pair_res1_aa_index]++; }' \
-finally 'print join(" ", @count), "\n";' 1aaj.pair
26 1 9 1 8 6 3 7 5 10 1 11 3 13 44 6
gamow%
Notice that the unitialized elements are still uninitialized, and show
up as null strings rather than zero. To correct this, we can explicitly
initialize them using the -initially keyword:
gamow% extract-pairwise-data.pl \
-initially 'foreach $i (0..19) { $count[$i] = 0; }' \
-e 'if ($pair_res1_aa_index == $pair_res2_aa_index) \
{ $count[$pair_res1_aa_index]++; }' \
-finally 'print join(" ", @count), "\n";' 1aaj.pair
26 0 1 9 1 8 6 3 7 5 10 1 11 0 0 3 13 44 0 6
gamow%
Alternatively, we could leave the @count array uninitialized,
and then print a more readable table of only those values that are
defined:
gamow% extract-pairwise-data.pl \
-e 'if ($pair_res1_aa_index == $pair_res2_aa_index) \
{ $count[$pair_res1_aa_index]++; }' \
-finally 'foreach $i (0..19) { \
if (defined($count[$i])) { \
print "$aa_3_letter_codes[$i] $count[$i]\n"; \
} \
}' \
1aaj.pair
ALA 26
ASP 1
GLU 9
PHE 1
GLY 8
HIS 6
ILE 3
LYS 7
LEU 5
MET 10
ASN 1
PRO 11
SER 3
THR 13
VAL 44
TYR 6
gamow%
(But since unitialized values are treated as zero, "if ($count[$i])
{ . . . }" would have worked in the conditional as well.) Here's
what it looks like for the whole database:
ALA 1161
CYS 365
ASP 315
GLU 235
PHE 601
GLY 728
HIS 112
ILE 991
LYS 215
LEU 2208
MET 157
ASN 308
PRO 286
GLN 177
ARG 268
SER 567
THR 494
VAL 1318
TRP 92
TYR 384
Extracting segment data
Like the
extract-singleton-data.pl script,
extract-segment-data.pl takes one or more expression keywords supplied as the initial
arguments, followed optionally by the loci of interest:
extract-segment-data.pl [ keyword . . . ] [ locus . . . ]
The segment table fields are defined as
described above for each successive evaluation of the
-e keyword.
For example, the following invocation prints a histogram of segment lengths:
extract-segment-data.pl -e '$hist[$seg_length]++;' \
-finally 'for ($len = 0; $len <= $#hist; $len++) { \
if ($hist[$len]) { \
print "$len\t$hist[$len]\n"; } }' \
*.seg
[Note that $#hist returns the index of the last element of the
$hist[] array.]
[should also present the ~thread/code/csdb/scripts/example-3.pl
script. -- rgr, 23-Oct-96.]
So let us recast our pairwise example as a pure perl script.
Note that only evaluate_expression and finally
subroutines are defined; the others (initially,
before_protein, and after_protein) are not called if
not defined (though &do_pairwise_records will refuse to proceed
if evaluate_expression is not defined).
The final line, a call to the &do_pairwise_records
subroutine, results in the same treatment of locus arguments as for
extract-pairwise-data.pl. Executing the script is simple, and
happily produces the same result for 1aaj:
Pairwise script examples
When the amount of perl code passed to extract-pairwise-data.pl
grows past a certain limit, the script interface becomes cumbersome. It
turned out to be somewhat self-defeating to break the last example over
9 lines in order to make it more readable; the added backslashes didn't
help. Finally, keeping shell versus perl quoting conventions straight
in order to make sure that the dollar signs are interpreted in the right
place can be difficult and hard-to-read as well.
#!/usr/local/bin/perl
# Sample pairwise perl script.
push(@INC, $ENV{"CSDB_SCRIPTS"});
require "CSDB-pairwise.pl";
sub evaluate_expression {
if ($pair_visible_p && $pair_cb_distance <= 8.5
&& $pair_res1_aa_index == $pair_res2_aa_index) {
$count[$pair_res1_aa_index]++;
}
}
sub finally {
foreach $i (0..19) {
if (defined($count[$i])) {
print "$aa_3_letter_codes[$i] $count[$i]\n";
}
}
}
&do_pairwise_records();
If we put this in a file called example-1.pl and change its
permissions so that it is executable, the initial #! line will
allow us to execute it directly. Blank lines have been inserted for
readability, and a certain amount of "boilerplate" has been added at the
top to load the supporting CSDB pairwise code. (If you have forgotten
to source the csdb-setup script before invoking one of these scripts,
perl will complain that it "Can't locate CSDB-pairwise.pl in @INC
. . .".) The -e and -finally expressions have
been turned into evaluate_expression and finally
subroutines, respectively. And the backslashes in front of the newlines
have been eliminated altogether.
gamow% example-1.pl 1aaj
ALA 4
HIS 2
LYS 1
LEU 3
MET 2
PRO 2
SER 2
THR 5
VAL 9
TYR 2
gamow%
We can easily extend this example to print per-protein subtotals of the
same-AA pairwise counts by defining the before_protein and
after_protein subroutines. Using the tabular format of an earlier example, we need a loop in
before_protein to reset a distinct per-protein array to zero,
and an after_protein that prints the result:
sub before_protein {
foreach $i (0..19) { $protein_count[$i] = 0; }
}
sub after_protein {
print $pair_locus, "\t", join(" ", @protein_count), "\n";
}
Note that we've added $pair_locus followed by a tab character
to the print statement so that each line gets a label.
Finally, in addition to this new code, we must modify evaluate_expression to update @protein_count:
sub evaluate_expression {
if ($pair_res1_aa_index == $pair_res2_aa_index) {
$count[$pair_res1_aa_index]++;
$protein_count[$pair_res1_aa_index]++;
}
}
If we let the &finally subroutine stand as is, this is a
truncated version of what we get by running this example on the full
pairwise database:
gamow% example-2.pl *.pair
1aaj 4 0 0 0 1 0 2 0 1 3 2 0 2 0 0 2 5 9 0 2
1abmA 3 0 0 1 2 4 8 5 1 14 0 4 2 1 0 0 1 1 3 3
1add 2 0 2 6 5 1 4 8 3 12 1 2 4 1 1 2 2 15 0 1
1aep 12 0 0 2 0 0 0 2 1 19 0 0 0 3 0 2 1 1 0 0
1ahc 9 0 0 2 2 1 0 10 1 23 0 4 1 0 1 3 2 4 0 3
1arb 6 3 1 0 0 9 2 8 0 8 0 3 1 1 1 11 8 2 1 1
1ast 1 2 2 0 1 2 3 9 0 2 2 2 0 0 1 2 1 2 0 5
1atr 16 0 3 2 10 8 0 8 3 14 0 3 2 1 2 3 4 23 0 0
1avhA 2 0 2 3 5 2 0 3 2 31 1 0 0 2 4 3 3 3 0 3
1babB 2 0 0 1 3 4 0 0 1 18 0 0 1 1 0 0 0 10 0 0
1bbpA 0 2 2 0 0 1 1 0 0 1 0 2 1 0 0 3 1 13 1 2
1bet 1 12 0 0 1 0 0 1 1 1 0 0 0 0 1 5 8 4 0 0
1bfg 2 0 0 2 0 4 0 1 3 10 0 0 0 1 2 0 0 0 0 1
1bllE 27 0 2 4 6 12 1 10 5 21 7 2 9 2 2 3 5 22 0 1
1bmdA 28 0 2 2 4 5 0 4 0 23 1 4 2 0 3 2 3 14 0 0
. . .
7pti 2 2 0 0 1 3 0 0 0 0 0 0 1 0 0 0 0 0 0 0
8acn 15 3 5 4 4 13 2 24 6 40 4 7 5 2 8 11 10 17 0 4
8atcA 4 0 1 0 3 1 0 4 1 43 2 1 1 2 3 5 5 8 0 0
8i1b 0 0 0 1 2 0 0 0 1 19 1 4 0 2 1 1 1 1 0 1
8rxnA 0 6 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2
8tlnE 11 0 3 1 3 6 2 8 0 5 0 1 1 0 0 4 6 11 0 9
9ldtA 3 0 1 3 0 4 0 17 2 20 1 1 1 1 0 6 2 26 1 1
ALA 1207
CYS 366
ASP 320
GLU 241
PHE 607
GLY 737
HIS 112
ILE 1007
LYS 216
LEU 2236
MET 158
ASN 313
PRO 288
GLN 177
ARG 272
SER 572
THR 501
VAL 1348
TRP 104
TYR 385
gamow%
In order to add a database field, the following steps must be taken. You should also search for "CHANGE-SINGLETON-FIELDS" or "CHANGE-PAIRWISE-FIELDS" or "CHANGE-SEGMENT-FIELDS" in all source files; either the code or the documentation may not be completely up-to-date.
Might it become necessary to support multiple PDB versions of a single locus? In that case we might want to always qualify the locus with the PDB version number . . . and might want our own "captive" copy of the PDB entry in any case. -- rgr, 11-Jun-96.