The psa-request server refused to process my request, saying something about a limit for non-academic organizations. But I am affiliated with a university [or other nonprofit research organization]. Could you please tell me how I can obtain access to your service?
The problem is that you need to use a university return address; the psa-request server cannot tell which "aol.com" (or "hotmail.com", or "yahoo.com", etc.) users are academic, and which are commercial. If your department or institute doesn't have computers with e-mail accounts, then you should be able to get an e-mail account from your university's information technology group; if you like, they can probably also help you set it up to forward to your free web e-mail account.
Failing that, we can set up a "commercial license" that provides short-term access for an arbitrary e-mail account for academic research. There is no fee for this service, but we do require a written request on the appropriate institutional letterhead that describes the research in one sentence (as if for a paper title), and specifies the email account, the name of the researcher who will be submitting the requests (normally the owner of the account), and the duration of the proposed research.
We apologize for the bureaucratic hassle, but the terms of the
agreement that permits us to run this server do not allow us to provide
unlimited access to non-academic users.
Web summary page Perl error
I am getting the following error when trying to look at the page that
summarizes all of my results:
No, this is an intermittent bug somewhere in the page and/or Web server.
It seems to go away eventually if you click 'Reload' once or twice. To
our great embarrassment, we must confess that we have been unable to
eliminate this nuisance, partly because we don't have the resources to
make an all-out assault on a bug that only happens occasionally. (If
you happen to be a mod_perl wizard and think these symptoms
sound familiar, please feel free to clue us in!)
Error in Perl code: Undefined subroutine
&psa_request_summary::make_summary_table called . . .
Am I doing something wrong?
Further analysis/search with PSA results?
. . . I was wondering whether any programs exist that can search for
protein homologues using the secondary structure data derived from the
PSA server as an input.
We are not aware of any such programs. However, the macroclass
predicted by Type-1 analysis should
suggest a starting point for browsing a structural classification
database (e.g. SCOP, CATH, DALI, FSSP).
Unfortunately, our present crop of DSMs are designed to model structures
rather generically, and therefore are not close enough to real
structures to provide a starting point for more detailed modeling. In
other words, it is not possible in general to provide the PDB ID of the
"best" known structure corresponding to the DSM predicted for a given
sequence. But that will change in the near future. We are presently
working on a set of models constructed directly from PDB entries, so the
secondary structure prediction will also imply an alignment to a
specific tertiary structure. Unfortunately, we can't give an estimate
of when this enhancement might become generally available; the best we
can say is to keep an eye on the psa-request server
home page.
Analyzing long or short sequences
How should I analyze my long sequence?
If your sequence is longer than the limit for the desired type of
analysis (350 amino acids for Type-1 analysis and 1000 amino acids for
all others), then you will need to break it into smaller subsequences in
order to submit them. Since psa-request is geared toward
domains, these subsequences should each correspond to a single
structural domain insofar as that is possible;
psa-request insists on this for Type-1 analysis. For help in
chopping your sequence up into probable domains, you can try our
profile library search tool; many of these profiles match entire
domains.
The other possibility is to send overlapping 1000-mers for Type-2 analysis, and try to identify plausible domains based on the resulting secondary structure predictions. You should then resubmit the resulting domain candidates for Type-1 analysis, since Type-1 DSMs are sensitive to variations in sequence length (more on that below).
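As a rough illustration of the overlapping-1000-mer approach, here is a minimal Python sketch that cuts a long sequence into windows no longer than the 1000 AA limit. The function name and the 200-residue overlap are assumptions made for this example; the server does not prescribe an overlap, so choose one large enough that a domain boundary is unlikely to be lost between windows.

```python
def overlapping_windows(seq, size=1000, overlap=200):
    """Yield (start, subsequence) pairs covering seq with overlapping windows.

    size is the maximum window length (1000 AA for Type-2 analysis);
    overlap is a hypothetical default, not a value required by the server.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not seq:
        return
    step = size - overlap
    start = 0
    while True:
        yield start, seq[start:start + size]
        if start + size >= len(seq):
            break  # this window already reaches the end of the sequence
        start += step


if __name__ == "__main__":
    long_seq = "A" * 2500  # placeholder sequence, for illustration only
    for start, chunk in overlapping_windows(long_seq):
        print(start + 1, start + len(chunk), len(chunk))
```

Each window can then be submitted separately, with the start offset used to map predictions back onto the full-length sequence.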
In any case, if there are more than five such subsequences, we would
appreciate it if you would refrain from sending them between 9 a.m.
and 6 p.m. Eastern time, Monday through Friday, so that they do
not delay results from other researchers. Thank you for your
consideration.
The minimum length limit for Type-1 analysis, 35 AA, was chosen
because we don't have enough models that can generate sequences that
short, which in turn is because there aren't any short single-domain
examples in the PDB on which to base a model.
Type-2 generalizes this restriction to multidomain sequences, but it
still uses statistics from single-domain proteins, since we don't have
any 20 AA "domains." For that reason, results for short sequences
should be taken with a big grain of salt. Even so, analyses of very
short sequences may be quite reliable, or (more likely) they may be
completely worthless; we have no way of testing them.
One reason such results might be less than reliable concerns the
underlying data. The probability that an amino acid might appear in a
given structural state is based on statistics gathered from
single-domain structures exhibiting a range of sizes, but all of them at
least 40 AA or so in length. Even assuming that the same set of
structural states were applicable to short sequences, it is easy to
believe that comparable statistics gathered from a set of very small
structures might be significantly different.
Finally, before you send a very short sequence to the
psa-request server, you would do well to ask yourself the
following question: Do you really believe that such a short sequence
has a well-defined secondary or tertiary structure at all?
A loop is defined as anything that is neither helix nor turn nor strand;
you'll notice that all four probabilities for any given residue always
add up to 1.
These secondary structure probabilities reflect the likelihood that a
given amino acid could have a particular secondary structure state,
according to the generic model. They are computed by summing over all
possible assignments of consecutive residues in the sequence to
consecutive states in the model, weighted by the probability that the
model could generate that assignment. The states for strands and
helices always occur in series in the models, with some alternate paths
to allow for length variation. The probability for an amino acid to be
assigned to one of these states is therefore influenced by the assignment of its
neighbors. Below a certain threshold, a short sequence of "helix-like"
residues would have lower helix probabilities than a longer sequence
that is otherwise similar. Since there are minimum strand and helix
lengths, a sufficiently short sequence won't find any strands or helices
in the model short enough to fit, so the server will predict zero strand
and helix probability for all residues.
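The effect of a minimum helix length can be sketched with a toy hidden Markov model and the standard forward-backward calculation. Everything in this example is an assumption made for illustration: the six-state layout, the transition and emission values, and the set of "helix-forming" residues are invented and are not the server's actual DSMs. The toy model requires a path to start and end in a loop state and to pass through at least four helix states in series, so any sequence shorter than six residues gets a helix probability of exactly zero.

```python
HELIX_FORMERS = set("AELKMQ")  # illustrative choice, not taken from the PSA models

# State 0 = leading loop, states 1-4 = helix positions in series, 5 = trailing loop.
KIND = ["loop", "helix", "helix", "helix", "helix", "loop"]

# TRANS[s] maps each successor state to its (made-up) transition probability.
TRANS = [
    {0: 0.6, 1: 0.3, 5: 0.1},  # leading loop: stay, enter the helix, or skip it
    {2: 1.0},
    {3: 1.0},
    {4: 1.0},
    {4: 0.7, 5: 0.3},          # last helix state repeats to allow longer helices
    {5: 1.0},
]


def emit(state, aa):
    """Made-up emission probabilities: loops are uniform, helices prefer some residues."""
    if KIND[state] == "loop":
        return 0.05
    return 0.12 if aa in HELIX_FORMERS else 0.02


def helix_posteriors(seq):
    """Return P(residue is in a helix state | seq) for each residue of seq."""
    n, m = len(seq), len(KIND)
    if n < 2:
        raise ValueError("the toy model needs at least two residues")
    # Forward pass: fwd[t][s] = P(seq[:t+1], state s at position t), starting in state 0.
    fwd = [[0.0] * m for _ in range(n)]
    fwd[0][0] = emit(0, seq[0])
    for t in range(1, n):
        for s in range(m):
            for nxt, p in TRANS[s].items():
                fwd[t][nxt] += fwd[t - 1][s] * p * emit(nxt, seq[t])
    # Backward pass: bwd[t][s] = P(seq[t+1:], path ends in state 5 | state s at t).
    bwd = [[0.0] * m for _ in range(n)]
    bwd[n - 1][5] = 1.0
    for t in range(n - 2, -1, -1):
        for s in range(m):
            bwd[t][s] = sum(p * emit(nxt, seq[t + 1]) * bwd[t + 1][nxt]
                            for nxt, p in TRANS[s].items())
    total = fwd[n - 1][5]  # probability of the whole sequence under the model
    return [sum(fwd[t][s] * bwd[t][s] for s in range(m) if KIND[s] == "helix") / total
            for t in range(n)]


if __name__ == "__main__":
    print(helix_posteriors("AELKA"))        # too short for any helix path
    print(helix_posteriors("GAELKAELKMG"))  # long enough for helix paths to contribute
```

Summing `fwd[t][s] * bwd[t][s]` over the helix states is the "sum over all possible assignments" for one residue; because helix states occur in series, a residue's value depends on whether its neighbors can also be placed in the helix.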
So that should explain why Type-2 analysis believes you have a loop;
it doesn't have enough plausible alternative ways to explain a sequence
that short.
The probable reason for being more "decisive" about smaller fragments
is the underlying assumption that each whole sequence folds into a
single domain in its entirety (water-soluble for Type-1).
So the length-dependence of DSMs in general is an expected property
of the system. In fact, if you were to add (or subtract) a tail with
arbitrary secondary structure to a sequence you had previously analyzed,
you should worry if the result doesn't change. Put another way,
your confidence in the PSA results should be a function of your
confidence that the sequence you submitted comprises a domain, the whole
domain, and nothing but the domain.
The PSA Sequence Analysis web server cannot always produce a 3D
structure prediction for the beta propeller region of a WD-repeat
sequence. This is because we only constructed models for 4 through 10
repeats -- we doubt that an eleven-bladed beta propeller is even
possible. Of course, nature will probably prove us wrong, eventually.
Until it does, and we have a structure to look at, we hesitate to
attempt to construct an eleven-bladed beta propeller model.
However, there is another possible interpretation for a WD-repeat
structure with a high number of predicted repeats. It is possible that
the eleven repeats are for two smaller beta propeller domains, perhaps
one of five repeats and one of six. Of course, it is difficult to be
certain where to break the sequence in half -- and if it could be two
propellers, why not three? If the repeats cluster into two groups
separated by a long (i.e. domain-sized) loop, then the hypothesis of two
WD-repeat domains separated by an intermediate domain seems more likely.
Although the intermediate domain could have been inserted into a loop of
a single WD-repeat domain, the long loop would be the likeliest place to
divide the sequence. (And you'd still have to believe in eleven-bladed
beta propellers.)
One (relatively) simple test for multiple WD-repeat domains would be
to split the sequence in half at an appropriate point, and use the
dynamic programming local alignment algorithm of your choice to align
the sequences. If you find significant similarity, then the likeliest
hypothesis is a genetically recent duplication of a single WD-repeat
domain. Unfortunately, lack of self-similarity can't rule out other
evolutionary scenarios, such as concatenating two evolutionarily very
distant WD-repeat domains.
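The split-and-align test can be sketched in Python with a minimal Smith-Waterman local alignment. The scoring values (match 2, mismatch -1, linear gap -2) and the function names are assumptions chosen to keep the example short; a real analysis would normally use a substitution matrix such as BLOSUM62 with affine gap penalties, plus some assessment of statistical significance, none of which is shown here.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment of a and b.

    Uses a simple match/mismatch score and a linear gap penalty (illustrative values).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


def self_similarity(seq, split):
    """Align the part of seq before position split against the part after it."""
    return smith_waterman(seq[:split], seq[split:])


if __name__ == "__main__":
    # Made-up sequence with one shared segment in each half, for illustration only.
    print(self_similarity("AAGHTGSVCC" + "DDGHTGSVEE", 10))
```

The split position should be chosen by inspection (for example, inside a long loop between clusters of repeats) rather than blindly at the midpoint.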
[also need to say something about inconsistency between most likely
model and number of reported repeats. -- rgr, 21-Dec-99.]
But here are the probabilities assigned to the individual amino
acids, given the alignment:
Each value is selected from the profile according to the residue's
aligned position. Note that the two G residues have different
probabilities as a result of being aligned to different positions.
These probabilities are then multiplied together to get the overall
probability of 2.4e-07. This should be compared to the "best possible"
score of 8.6e-04 for sequence GHTGSV, the "worst possible"
score of 1.1e-18 for WWWWMD (among others), and the median
score of 5.7e-11 for sequence HQAHKF (among others). On a
logarithmic scale, the score for QGSGAI is slightly better than
halfway between the best score and the median score, which
psa-request may well find plausible in the right context.
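The arithmetic can be checked with a few lines of Python. The per-residue probabilities and the best and median scores are the values quoted in this FAQ; the variable names are hypothetical, and the snippet reproduces only the multiplication and the log-scale comparison, not the server's alignment procedure.

```python
import math

# Per-position probabilities for QGSGAI, as quoted in this FAQ.
aligned = [("Q", 0.0207), ("G", 0.0193), ("S", 0.1186),
           ("G", 0.1715), ("A", 0.1186), ("I", 0.2443)]

# Overall probability is the product of the per-position values.
score = math.prod(p for _, p in aligned)

best, median = 8.6e-04, 5.7e-11  # scores for GHTGSV and HQAHKF quoted above

# Position of the score between median (0.0) and best (1.0) on a log scale.
fraction = ((math.log10(score) - math.log10(median))
            / (math.log10(best) - math.log10(median)))

if __name__ == "__main__":
    print(f"score = {score:.1e}")
    print(f"fraction of the way from median to best = {fraction:.3f}")
```

A fraction just above 0.5 corresponds to "slightly better than halfway" between the median and the best score.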
The next question is which repeat(s) to throw out in the case of
overprediction. If a homolog of known structure exists on the WD repeat home page,
and the homology is good, then finding the answer should be
straightforward.
If that is not the case, then the following suggestions from
Dr. Lihua Yu, author of the WD repeat predictor, may help.
References
[1] Neer EJ, Schmidt CJ, Nambudripad R & Smith TF: "The
ancient regulatory-protein family of WD-repeat proteins," Nature
371, 297-300 (1994)
PMID: 8090199
Please direct your questions and comments about these Web pages and
the PSA e-mail server to:
What about very short sequences?
How reliable are the results using Type-2 analysis for sequences
between 10 and 20 AA?
Why are very short sequences predicted as mostly loops?
I have analysed several 10 to 20 AA sequences using Type-2
analysis, and I always get that the probability of being in a loop is
the highest. I am unsure what "loop" means as a secondary structure,
and what this implies for my peptides.
Different results for different lengths?
. . . Depending on the window of a large protein that I am studying, the
structure would be predicted strongly as a beta propeller or as a mixed
alpha helical/beta structure. The program was much more decisive about
small segments of the protein (~300aa) than with large pieces (1000aa)
. . .
Except for Type-2 DSMs, which are quite generic, each DSM is designed to
cover a specific range of sequence lengths, in order to control the
composition and length distributions of component secondary structures.
In particular, DSMs are not constructed for lengths that do not make
sense for the structure. For example, the DSMs in the WD4 macroclass cover
a domain length range of 187 to 279 amino acids, reflecting constraints
on internal loop length. (External loops are handled with leader and
trailer models, so longer sequences can be considered as WD4 candidates.
The same is not true for Type-1 models, some of which have strict length
limits.)
WD repeat questions
Why no 3D structure prediction?
Why didn't I get a 3D structure prediction for my WD repeat sequence of
11 repeats?
Why does it find so many WD repeats?
I notice that the yeast WD repeat sequence YPL183C is listed on the
"YPL183C aligned repeats" page as having eight aligned repeats. However, when
I submit this particular sequence for WD-repeat analysis, I get eleven
predicted repeats. Why are so many repeats predicted . . . ?
The psa-request server's WD repeat prediction program is known
to overpredict; that is, it errs in the direction of sensitivity to
marginal repeat sequences at the cost of specificity. So it often finds
more repeats in a sequence than human experts would. And the alignments
shown on the WD repeat Web pages were
constructed manually by experts, namely Dr. Chrysanthe Gaitatzes and
Dr. Eva Neer. This means that if the server says that a given sequence
is not a WD repeat, then you can trust its answer. Unfortunately, if it
says that it is a WD repeat, you are faced with the problem of
determining which repeats are "real." More on that below.
Why does it find this WD repeat?
Sometimes, WD-repeat analysis finds potential repeats that do not match
the consensus sequence published in [1]. This is
because the server uses probabilities computed from the profile found on
the WD repeat home page,
and not the consensus sequence. For instance, suppose
psa-request reports that the protein subsequence
"QGSGAI" matches the first portion of the first repeat. This
snippet doesn't match the consensus pattern at all:
GHxxxV
QGSGAI
Q       G       S       G       A       I
0.0207  0.0193  0.1186  0.1715  0.1186  0.2443
So which predicted repeats are real?
As explained above, the
psa-request server's WD repeat prediction program is known to
overpredict. Therefore, if homology leads you to believe that one or
more repeats predicted by the server are chimerical, then you should
probably trust the homology data over the server.
In terms of how to interpret the smoothing results and how
to make judgement of the predicted repeats, I used the following rules
when I looked at the smoothing results myself:
To this I would add:
Bob Rogers
<rogers@darwin.bu.edu>
Last modified: Mon Mar 12 13:41:10 EST 2001
BioMolecular Engineering Research
Center
Boston University, Boston Massachusetts