BLAST programs


These programs are utilities for running BLAST and for parsing and processing its output.

Table of contents

  1. cross-blast.pl
  2. parse-blast.pl
  3. filter-blast.pl
  4. blast-to-clique.pl


cross-blast.pl

The cross-blast.pl script performs a series of BLAST searches against a BLAST sequence database, one search per sequence supplied in sequence table (.tbl) file format, and writes the results to the standard output in BLAST summary format. This amounts to a Cartesian product of two sets of sequences; the two sets may be identical, disjoint, or anything in between.

Usage:


    cross-blast.pl [-blast program-name] [-parse parse-blast-args]
		blast-database-name [ tbl-database-name . . . ]

Arguments:

-blast program-name
specifies the name of the blast program to use, e.g. "blastn"; default is "blastp".
-parse parse-blast-args
specifies an option to be passed to the parse-blast.pl script.
blast-database-name
name of an FA format sequence database for BLAST to search. This argument is required.
tbl-database-name
name of one or more sequence table (.tbl) files to blast against blast-database-name. If omitted, sequences are read from the standard input.

BLAST results are sent to the standard output, with one line of BLAST summary results per input sequence from the .tbl files.


parse-blast.pl

The parse-blast.pl script turns the output of the BLAST program into a single-line summary, essentially a one-line BLAST summary file. This program is invoked internally by the cross-blast.pl script.

Usage:

	parse-blast.pl [-verbose]

Arguments:

-verbose
include all results in summary lines. Normally, a summary line is cut off at the first P value that does not need exponential notation (a value of exactly 0.0 is exempt).


filter-blast.pl

Given a set of BLAST summary results on the standard input, the filter-blast.pl program eliminates hits that fail the specified criteria.

Usage:

	filter-blast.pl [-max-p-score P] [-min-homologs k]
                [-include-self]

Arguments:

-max-p-score P
specifies the maximum P value (a floating point number) for a hit to be included. By default, this is 1.0, which causes all hits to be included.
-include-self
whether to include any self-hits in the result, regardless of P value; by default, these are omitted. A self-hit is one where both sequence names (target and database hit) are identical.
-min-homologs k
specifies the minimum number of homologs for a target sequence to be considered interesting; otherwise, the target is filtered out. The default value is 1.

filter-blast.pl treats each target (line of input) independently. First, all hits with P values greater than the value of the -max-p-score argument are eliminated. Then, unless -include-self was specified, any hit with the same name as the target is eliminated. Finally, if at least -min-homologs hits remain, the target line is emitted on the standard output in its reduced form, using the same format.
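
The filtering steps described above can be sketched in Python. This is only an illustration, not the script's Perl source. The BLAST summary format is not parsed here; the function name filter_target and the (hit name, P value) tuple representation are hypothetical. The sketch also assumes that -min-homologs is a true minimum, so a target is kept when at least that many hits remain.

```python
from typing import List, Optional, Tuple

# Hypothetical in-memory form of one hit: (database hit name, P value).
Hit = Tuple[str, float]


def filter_target(target: str, hits: List[Hit], max_p: float = 1.0,
                  min_homologs: int = 1,
                  include_self: bool = False) -> Optional[List[Hit]]:
    """Return the reduced hit list for one target, or None if it is dropped."""
    # Step 1: eliminate hits whose P value exceeds the threshold.
    kept = [(name, p) for name, p in hits if p <= max_p]
    # Step 2: eliminate self-hits unless the caller asked to keep them.
    if not include_self:
        kept = [(name, p) for name, p in kept if name != target]
    # Step 3 (assumption: "minimum" means "at least"): keep the target
    # only if enough hits remain.
    return kept if len(kept) >= min_homologs else None
```

For a target seqA with hits seqA (P = 0.0), seqB (P = 1e-20), and seqC (P = 0.5), a threshold of 1e-5 leaves only seqB, so the target survives with the default minimum of 1 but is dropped when the minimum is 2.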


blast-to-clique.pl

Given a set of BLAST summary results on the standard input, the blast-to-clique.pl program produces a clique format file on the standard output, where each clique represents an equivalence class of related sequences.

Usage:

    blast-to-clique.pl [-max-p-score P] [-prefix string]

Arguments:

-max-p-score P
specifies the maximum P value (a floating point number) for a hit to be included. By default, this is 1.0, which causes all hits in the input to be included.
-prefix string
specifies a prefix string for naming cliques. The default is "c", i.e. cliques are named "c1", "c2", etc.

blast-to-clique.pl reads all input before producing any output. All hits with P values greater than the value of the -max-p-score argument are eliminated. Then, the remaining hits are used to define equivalence classes. If sequence A hits sequence B, then these two sequences must be put into the same clique. (If "A hits B", then we assume that "B hits A", whether the database reflects this or not.) Therefore, if B also hits C, then A, B, and C will all wind up in the same clique, regardless of whether A hits C or not. If the -max-p-score threshold is too permissive (too high), then pairs of sequences with close to zero similarity may wind up in the same clique (though this may happen anyway for multidomain sequences).

[This would make a good enhancement: a -min-neighbor-p-score option that is used to generate warnings if two sequences in the same clique are less related than this. -- rgr, 30-Jun-98.]

Once the equivalence classes are generated, they are output in arbitrary order, using names with indices assigned from one, i.e. "c1", "c2", etc. The loci within each clique are output alphabetically.

If a sequence has no hits, then no equivalence class is generated for it.
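
The clique construction described above amounts to finding connected components of the "hits" relation, which can be sketched in Python with a union-find structure. This is an illustration under stated assumptions, not the script's Perl source. The function name build_cliques and the (target, hit, P value) triples are hypothetical, the BLAST summary and clique file formats are not handled, and self-hits are ignored so that a sequence whose only hit is itself gets no clique (the text above does not say how the script treats that case).

```python
from typing import Dict, Iterable, List, Tuple


def build_cliques(hits: Iterable[Tuple[str, str, float]], max_p: float = 1.0,
                  prefix: str = "c") -> Dict[str, List[str]]:
    """Group sequences into equivalence classes linked by retained hits."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        # Locate the representative of x's class, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for target, hit, p in hits:
        if p > max_p:
            continue  # hit fails the P value threshold
        if target == hit:
            continue  # assumption: a self-hit alone does not create a clique
        # "A hits B" is treated as symmetric: merge the two classes.
        root_t, root_h = find(target), find(hit)
        if root_t != root_h:
            parent[root_t] = root_h

    groups: Dict[str, List[str]] = {}
    for seq in list(parent):
        groups.setdefault(find(seq), []).append(seq)

    # Number cliques from 1 and list members alphabetically.
    return {f"{prefix}{i}": sorted(members)
            for i, members in enumerate(groups.values(), start=1)}
```

With hits A-B and B-C below the threshold, A, B, and C land in one clique even though A never hits C directly. The order of cliques in this sketch follows the order in which sequences are first seen, which is one possible choice since the output order is arbitrary.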


Bob Rogers <rogers@darwin.bu.edu>
Last modified: Fri Nov 26 19:46:02 EST 1999