PIMA Profile Summary

    We have developed a diagnostic profile representation of protein families that can be obtained from only a few defining members. The defining members were drawn from the widest available taxonomic spread. These profiles exploit the known positional variation within the conserved regions of a large set of homologous proteins of previously determined structure. This information is used to calculate prior conditional probability densities, which are then used to estimate the posterior expected positional variation within the conserved regions represented by these profiles.

    An iterative local (Smith-Waterman) dynamic programming procedure was used to construct profiles with maximum profile information content. A similar dynamic programming method is used to search the library of profiles against a query sequence, but with a different objective function: in this case, maximum likelihood.
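
    As a rough illustration of the local dynamic programming recurrence mentioned above, the following is a minimal sketch of Smith-Waterman scoring between two plain sequences. It is not the PIMA procedure itself: the match, mismatch, and linear gap values are assumptions chosen for readability, and the profile-based objective functions (information content for construction, likelihood for search) are not reproduced here.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    The scoring parameters are illustrative assumptions, not values
    taken from the PIMA method.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(
                0,
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

    Because each cell is floored at zero, unrelated flanking residues do not reduce the score of a shared conserved region, which is the property that makes the local form suitable for locating conserved regions within longer sequences.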

    The high sensitivity and specificity of these profiles were validated on fifty-seven diverse functional families by drawing small random samples of representative sequences, constructing profiles from them, and examining the distribution of scores on positive and negative controls. When compared with other methods of profile construction, including Hidden Markov Models with Dirichlet priors, the method yielded profiles near full functional domain length and of higher average specificity and sensitivity.
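
    For readers unfamiliar with the two measures, the sketch below shows how sensitivity and specificity could be computed from score distributions on positive and negative controls. The function name, the score cutoff, and the convention that a score at or above the cutoff counts as a match are all assumptions made for illustration; the source does not state how the cutoff was chosen.

```python
def sensitivity_specificity(positive_scores, negative_scores, threshold):
    """Return (sensitivity, specificity) at a given score threshold.

    Scores at or above the threshold are treated as matches; this
    convention is an assumption for the example.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both control sets must be non-empty")
    true_pos = sum(s >= threshold for s in positive_scores)
    true_neg = sum(s < threshold for s in negative_scores)
    sensitivity = true_pos / len(positive_scores)  # fraction of positives detected
    specificity = true_neg / len(negative_scores)  # fraction of negatives rejected
    return sensitivity, specificity
```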

    We have constructed a library of profiles starting with two sources: 1) the Pfam database of defining sequences and 2) a set of sequences displaying significant similarity to yeast mitochondrial proteins.

    The results of a match to a given query are returned in terms of a z-score relative to a carefully defined set of negative control sequences. These negative controls were chosen to be representative of all major structural, enzymatic and length classes. The annotation associated with each profile has been obtained directly from the sequence databases for the defining sequences and, as such, may contain errors and/or inconsistencies.
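
    The z-score reported for a match can be read as the number of standard deviations by which the query's score exceeds the mean score of the negative controls. A minimal sketch follows, assuming the sample standard deviation is used; the source does not specify the estimator, and the function name is hypothetical.

```python
from statistics import mean, stdev


def z_score(query_score, negative_control_scores):
    """Return the z-score of query_score relative to negative-control scores.

    Uses the sample standard deviation, which is an assumption; at least
    two control scores are required.
    """
    mu = mean(negative_control_scores)
    sigma = stdev(negative_control_scores)
    if sigma == 0:
        raise ValueError("negative-control scores have zero variance")
    return (query_score - mu) / sigma
```

    Under these assumptions, a query scoring 14.0 against controls scoring 8.0, 10.0, and 12.0 has a z-score of 2.0, since the control mean is 10.0 and the sample standard deviation is 2.0.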

    In general, these profiles match one or more functional domains. Thus any region match may be assumed, to a first approximation, to be a local region assignable to the annotated function.

Amino Acid Classes used in constructing profiles.

PRIOR-BASED PROFILES

    The BMERC is in the process of generating profiles for each of the protein domains that can be recognized in a genome from each of the three kingdoms (archaea, eubacteria, and eukaryotes). For a large number of these, a careful dissection of the many multidomain proteins is required. The following figure demonstrates the combined profile/domain dissection results, which will soon be available for the majority of ORF families found in the set of completely sequenced genomes.


Excerpt from Smith, T.F. and Zhang, X. (1997) Nature Biotechnology Vol. 15(12):1222-1223.

    Figure 1. An example of genes having the potential for annotation inheritance transitivity. The three two-domain proteins, b1262, YKL211C, and b1263, share no single domain in common. Domains are labeled by colors: red, N-phosphoribosyl anthranilate isomerase; green, indole-3-glycerol phosphate synthase; yellow, anthranilate synthase, subunit II; blue, anthranilate phosphoribosyltransferase.