BLAST programs
These programs are utilities for running BLAST and parsing
and processing its output.
Table of contents
- cross-blast.pl
- parse-blast.pl
- filter-blast.pl
- blast-to-clique.pl
cross-blast.pl
The cross-blast.pl script performs a series of BLAST searches
against a BLAST sequence database, one per sequence supplied in
sequence table (.tbl) file format, and writes the results to the
standard output in BLAST summary format. This amounts to a Cartesian
product of two sets of sequences; the two sets may be identical,
disjoint, or anything in between.
Usage:
cross-blast.pl [-blast program-name] [-parse parse-blast-args]
blast-database-name [ tbl-database-name . . . ]
Arguments:
- -blast program-name
- specifies the name of the blast program to use,
e.g. "blastn"; default is "blastp".
- -parse parse-blast-args
- specifies an option to be passed to the
parse-blast.pl script.
- blast-database-name
- name of an FA format
sequence database for BLAST to search. Required.
- tbl-database-name
- name of one or more sequence table (.tbl)
files to blast against blast-database-name.
If omitted, sequences are read from the standard input.
BLAST results are sent to the standard output, with one line of BLAST summary
results per input sequence from the .tbl files.
parse-blast.pl
The parse-blast.pl script turns the output of the BLAST program
into a single-line summary, essentially a one-line BLAST summary
file. This program is invoked internally by the cross-blast.pl script.
Usage:
parse-blast.pl [-verbose]
Arguments:
- -verbose
- include all results in summary lines. Normally each summary line is
truncated at the first hit whose P value does not require exponential
notation (a P value of exactly 0.0 does not trigger truncation).
filter-blast.pl
Given a set of BLAST summary
results on the standard input, the filter-blast.pl program
eliminates hits that fail the specified criteria.
Usage:
filter-blast.pl [-max-p-score P] [-min-homologs k]
[-include-self]
Arguments:
- -max-p-score P
- specifies the maximum P value (a floating point number) for a hit
to be included. By default, this is 1.0, which causes all
hits to be included.
- -include-self
- whether to include any self-hits in the result, regardless of P
value; by default, these are omitted. A self-hit is one where
both sequence names (target and database hit) are identical.
- -min-homologs k
- specifies the minimum number of homologs for a target sequence to
be considered interesting; otherwise, the target is filtered out.
The default value is 1.
filter-blast.pl treats each target (line of input)
independently. First, all hits with P values greater than the value of
the -max-p-score argument are eliminated. Then, unless
-include-self was specified, any hit with the same name as the
target is eliminated. Finally, if at least -min-homologs hits
remain, the target line is emitted on the standard output in its reduced
form, using the same format.
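
The three filtering steps above can be sketched as follows. This is a minimal illustration, not the script's actual code: the BLAST summary format is not described on this page, so the sketch works on an in-memory list of (hit name, P value) pairs, and the function name filter_target and its parameters are hypothetical. It also assumes the -min-homologs threshold is inclusive, in keeping with the word "minimum", and that self-hits are subject to the P value cut like any other hit, following the order of the steps.

```python
from typing import List, Optional, Tuple

# Hypothetical in-memory form of one summary line's hits:
# (database hit name, P value).
Hit = Tuple[str, float]


def filter_target(
    target: str,
    hits: List[Hit],
    max_p: float = 1.0,
    min_homologs: int = 1,
    include_self: bool = False,
) -> Optional[List[Hit]]:
    """Return the reduced hit list for one target, or None if filtered out."""
    # Step 1: drop hits whose P value exceeds the maximum.
    kept = [(name, p) for name, p in hits if p <= max_p]

    # Step 2: drop self-hits (hit name equals target name) unless requested.
    if not include_self:
        kept = [(name, p) for name, p in kept if name != target]

    # Step 3: keep the target only if enough hits remain.
    # Assumption: the threshold is inclusive ("at least min_homologs").
    if len(kept) >= min_homologs:
        return kept
    return None


example_hits = [("seqA", 0.0), ("seqB", 1e-10), ("seqC", 0.5)]
print(filter_target("seqA", example_hits, max_p=1e-3))
```

Because each target is treated independently, a function like this could be applied to the input one line at a time without holding the whole file in memory.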
blast-to-clique.pl
Given a set of BLAST summary
results on the standard input, the blast-to-clique.pl program
produces a clique format file on the standard output,
where each clique represents an equivalence class of related sequences.
Usage:
blast-to-clique.pl [-max-p-score P] [-prefix string]
Arguments:
- -max-p-score P
- specifies the maximum P value (a floating point number) for a hit
to be included. By default, this is 1.0, which causes all
hits in the input to be included.
- -prefix string
- specifies a prefix string for naming cliques. The default is
"c", i.e. cliques are named "c1",
"c2", etc.
blast-to-clique.pl reads all input before producing any output.
All hits with P values greater than the value of the
-max-p-score argument are eliminated. Then, the remaining hits
are used to define equivalence classes. If sequence A hits sequence B,
then these two sequences must be put into the same clique. (If "A hits
B", then we assume that "B hits A", whether the database reflects this
or not.) Therefore, if B hits C, then A, B, and C will all wind up in
the same clique, regardless of whether A hits C or not. If the
-max-p-score threshold is too permissive, then pairs of sequences may
wind up in the same clique that have close to zero similarity (though
this may happen anyway for multidomain sequences). [This would make a
good enhancement: a -min-neighbor-p-score that is used to generate
warnings if two sequences in the same clique are less related than
this. -- rgr, 30-Jun-98.]
Once the equivalence classes are generated, they are output in
arbitrary order, using names with indices assigned from one,
i.e. "c1", "c2", etc. The loci within each clique are
output alphabetically.
If a sequence has no hits, then no equivalence class is generated for
it.
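
The grouping described above amounts to finding connected components of an undirected graph whose edges are the surviving hits. One common way to do that is a union-find (disjoint-set) structure, sketched below. This is an illustration under stated assumptions, not the script's actual implementation: the function name build_cliques and the (target, hit, P value) triples are hypothetical stand-ins for the BLAST summary input, whose format is not described on this page.

```python
from typing import Dict, Iterable, List, Tuple


def build_cliques(
    hits: Iterable[Tuple[str, str, float]],
    max_p: float = 1.0,
    prefix: str = "c",
) -> Dict[str, List[str]]:
    """Group sequences into equivalence classes from (target, hit, P) triples."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        # Locate the representative of x's class, shortening paths as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, p in hits:
        if p > max_p:
            continue  # hit fails the P value cut; it defines no relation
        # "A hits B" is treated as symmetric, so simply merge the two classes.
        # Note: in this sketch a surviving self-hit creates a singleton class;
        # the page does not say how the real script treats that case.
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect members per class. Sequences with no surviving hits never
    # entered `parent`, so no class is generated for them.
    groups: Dict[str, List[str]] = {}
    for name in list(parent):
        groups.setdefault(find(name), []).append(name)

    # Name classes prefix1, prefix2, ... and sort members alphabetically.
    return {
        f"{prefix}{i}": sorted(members)
        for i, members in enumerate(groups.values(), start=1)
    }


example = [("A", "B", 1e-20), ("B", "C", 1e-8), ("D", "E", 0.9)]
for name, members in build_cliques(example, max_p=1e-3).items():
    print(name, " ".join(members))
```

In the example, A, B, and C land in one class even though A never hits C directly, while D and E are dropped because their only hit fails the P value cut.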
Bob Rogers
<rogers@darwin.bu.edu>
Last modified: Fri Nov 26 19:46:02 EST 1999