The following is a short tutorial on how to fetch a sequence and run a blast search within the GCG package. Then we show you alternate ways of running blast outside of GCG. Finally, an example of a fasta search is shown. All output files in this tutorial will be hot-linked so that you can compare your examples with ours.
cyrus% gcg
Welcome to the WISCONSIN PACKAGE
Version 9.0-UNIX, December 1996
Installed on solaris
Copyright 1982-1996, Genetics Computer Group, Inc. All rights reserved.
Published research assisted by this software should cite:
Wisconsin Package Version 9.0, Genetics Computer Group (GCG), Madison, Wisc.
Databases available:
GenBank Release 98.0 (12/96)
GenPept Release 98.0 (12/96)
EMBL (Abridged) Release 48.0 ( 9/96)
PIR-Protein Release 50.0 ( 9/96)
SWISS-PROT Release 34.0 ( 9/96)
PROSITE Release 14.0 ( 9/96)
Restriction Enzymes (REBASE) (10/96)
EPD Release 45.0 (12/95)
Kabat (kab_dna) Release 6.0 (08/96)
Kabat (kab_aa) Release 6.0 (08/96)
Technical support: phone (608) 231-5200 or e-mail Help@GCG.Com
Online help: % genhelp or http://www.gcg.com/genhelp/
First, Fetch a sequence to use as a query sequence.
cyrus% fetch graf_mouse
Fetch copies GCG sequences or data files from the GCG database
into your directory or displays them on your terminal screen.
graf_mouse.swissprot
cyrus% blast
BLAST searches for sequences similar to a query sequence. The query and the
database searched can be either peptide or nucleic acid in any combination.
BLAST can search databases on your own computer or databases maintained at
the National Center for Biotechnology Information (NCBI) in Bethesda,
Maryland, USA.
BLAST search with what query sequence? graf_mouse.swissprot
Search for query in what sequence database:
1) nr p Non-redundant GenBank CDS translations+PDB+SwissProt+PIR
2) pdb p PDB protein sequences
3) swissprot p SwissProt sequences
4) yeast p Saccharomyces cerevisiae protein sequences
5) kabat p Kabat Sequences of Proteins of Immunological Interest
6) alu p Translations of Select Alu Repeats from REPBASE
7) month p All new or revised GenBank CDS translation+PDB+SwissProt+PI
8) nr n Non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST's
9) pdb n PDB nucleotide sequences
10) vector n Vector subset of GenBank
11) yeast n Saccharomyces cerevisiae genomic nucleotide sequences
12) est n Non-redundant Database of GenBank+EMBL+DDBJ EST Division
13) sts n Non-redundant Database of GenBank+EMBL+DDBJ STS Division
14) gss n Genome Survey Sequences
15) mito n Database of mitochondrial sequences, Rel. 1.0, July 1995
16) kabat n Kabat Sequences of Nucleic Acid of Immunological Interest
17) epd n Eukaryotic Promotor Database
18) alu n Select Alu Repeats from REPBASE
19) month n All new or revised GenBank+EMBL+DDBJ+PDB sequences released
Please choose one (* 1 *): 1
Ignore hits expected to occur by chance more than (* 10.0 *) times?
Limit the number of sequences in my output to (* 250 *) ?
What should I call the output file (* graf_mouse.blastp *) ?
Trying cruncher.nlm.nih.gov (130.14.25.175)
Connected to cruncher.nlm.nih.gov
Waiting for 7 of 8 other BLAST requests on the system to finish.
Still waiting
Search in progress on the network server.
............................................
Retrieving results.
....
WARNING: Descriptions of 668 database sequences were not reported due to the
limiting value of parameter V = 250.
..........................................
WARNING: HSPs involving 818 database sequences were not reported due to the
limiting value of parameter B = 100.
.
Example 2: Using the Network Blast client program.
To use the network blast client to search databases at NCBI, your query sequence should be in fasta format. You can use the readseq program to convert your sequence to fasta format. Advantage: You can search any of the NCBI databases listed at the top of a blast output file from NCBI. This program runs on all of our machines.
Disadvantage: You must know the name of the database you wish to search--there is no menu of choices as in the above example. However, as mentioned previously, this list of databases is at the top of every NCBI blast output file. The most common database searched is "nr", which contains sequences from all of the major databases. Also, your query sequence must be in fasta format, if it is not, you can use readseq as mentioned below.
cyrus% readseq graf_mouse.swiss
ld.so.1: readseq: warning: /usr/4lib/libc.so.1.8 has older revision than expected 9
Name of output file (?=help, defaults to display):
graf_Mouse mouse.fa
1. IG/Stanford 10. Olsen (in-only)
2. GenBank/GB 11. Phylip3.2
3. NBRF 12. Phylip
4. EMBL 13. Plain/Raw
5. GCG 14. PIR/CODATA
6. DNAStrider 15. MSF
7. Fitch 16. ASN.1
8. Pearson/Fasta 17. PAUP/NEXUS
9. Zuker (in-only) 18. Pretty (out-only)
Choose an output format (name or #):
8
Name an input sequence or -option:
graf_mouse.swissprot
Name an input sequence or -option:
cyrus% cat graf_mouse.fa
>GRAF_MOUSE, 248 bases, F9242F12 checksum.
MPPILILLTLLLPLRAGAEEIIGGHEVKPHSRPYMARVRFVKDNGKRHSC
GGFLVQDYFVLTAAHCTGSSMRVILGAHNIRAKEETQQIIPVAKAIPHPA
YDDKDNTSDIMLLKLESKAKRTKAVRPLKLPRPNARVKPGHVCSVAGWGR
TSINATQRSSCLREAQLIIQKDKECKKYFYKYFKTMQICAGDPKKIQSTY
SGDSGGPLVCNNKAYGVLTYGLNRTIGPGVFTKVVHYLPWISRNMKLL
To use the network blast client to search databases at NCBI, your query sequence should be in fasta format. You can use the readseq program to convert your sequence to fasta format. In this example, "nr" is the database I am searching, "graf_mouse.fa" is the name of my protein sequence in fasta format, and "graf_mouse.netblast" is my output file. If I wanted to search with a DNA sequence, I would use "blastn.remote" instead of "blastp.remote".
cyrus% blastp.remote nr graf_mouse.fa > graf_mouse.netblast
Waiting for 2 of 3 other BLAST requests on the system to finish.
Still waiting
Still waiting
WARNING: Descriptions of 418 database sequences were not reported due to the
limiting value of parameter V = 500.
WARNING: HSPs involving 668 database sequences were not reported due to the
limiting value of parameter B = 250.
cyrus%
Example 3: Doing a Local Blast search on our local BMERC cluster. ADVANTAGE: You do not have to wait in the blast queue at the NCBI.
Disadvantage: If you want to search Genbank locally, you will have to search each subdivision separately. We maintain the latest copy of Genbank, but we do not have the daily updates in blastable format. If you want to search against the Genbank updates since the last full release of Genbank, try doing a local fasta search in GCG against the gb_new:* division, or do a network blast search against the "month" database.
First, use the db-aliases command to see the name and location of our local database.
cyrus% db-aliases
DATABASE SEARCHES:
1 - GenBank-DNA-98
2 - GenBank-Translated-98
3 - NBRF-PIR-50
4 - SWISSPROT-34
5 - HIV98-DNA
6 - HIV98-Protein
7 - Other
Choice (0 = exit):
Enter Choice[1-7]: 2
GenBank-Translated-98
TRANSLATED_GENBANK_Sequences: /seq/gb-98.pep/all.pep
TRANSLATED_GENBANK_Short_Dir: /seq/gb-98.pep/all.sdir
TRANSLATED_GENBANK_BLAST_Format: /seq/gb-98.pep.blast/gb-98
To run BLAST with Translated-GenBank use:
$GB_pep
To specify the IG format of the Translated-GenBank use:
$GB_ig
Print Main Menu? [Y]n
Enter Choice[1-7]: 0
exit
Now, do the search locally.
cyrus% blastp $GB_pep graf_mouse.fa > graf_mouse.blastlocal
or
cyrus% blastp /seq/gb-98.pep.blast/gb-98 graf_mouse.fa > graf_mouse.blastlocal
WARNING: Descriptions of 242 database sequences were not reported due to the
limiting value of parameter V = 500.
WARNING: HSPs involving 492 database sequences were not reported due to the
limiting value of parameter B = 250.
Example 4: Doing a blast search by Email. ADVANTAGE: Very easy to use Accepts many sequence formats. Frees up your terminal window. Allows you to easily see your many choices and select the database of your choice. The results are returned by email very quickly, within 5 to 10 minutes for a blast search. Disadvantage: You must be comfortable with reading email to view your search. An example is at the end of this example.
cyrus% mailfasta graf_mouse.swissprot
ld.so.1: readseq: warning: /usr/4lib/libc.so.1.8 has older revision than expected 9
ld.so.1: readseq: warning: /usr/4lib/libc.so.1.8 has older revision than expected 9
ld.so.1: cid: warning: /usr/4lib/libc.so.1.8 has older revision than expected 9
>GRAF_MOUSE, 248 bases, F9242F12 checksum.
MPPILILLTLLLPLRAGAEEIIGGHEVKPHSRPYMARVRFVKDNGKRHSC
GGFLVQDYFVLTAAHCTGSSMRVILGAHNIRAKEETQQIIPVAKAIPHPA
YDDKDNTSDIMLLKLESKAKRTKAVRPLKLPRPNARVKPGHVCSVAGWGR
TSINATQRSSCLREAQLIIQKDKECKKYFYKYFKTMQICAGDPKKIQSTY
SGDSGGPLVCNNKAYGVLTYGLNRTIGPGVFTKVVHYLPWISRNMKLL
Sequence looks like PROTEIN. OK? ([yes]/no):
***** Now using PROTEIN file: graf_mouse.swissprot
>GRAF_MOUSE, 248 bases, F9242F12 checksum.
Which service(s) do you want to use ?
FASTA search
31 Sw-Prot All 32 Sw-Prot New 33 PIR 34 PIR Only
35 Sw-Prot + PIR 53 PDB (3D) 54 NRL (3D)
BLAST search DNA (protein query vs. 6 frames translated database)
18 All db's 19 GB 20 New GB 21 GB vector
22 EMBL 23 New EMBL 24 EST 41 Kabat
45 EPD 47 PDB
BLAST search PROTEIN
25 all db's 26 SWISS-PROT 46 SWISS-PROT New 27 PIR
28 GenPept 29 New GenPept 30 TFD 42 Alu
43 Kabat 48 PDB
Other services
36 BLOCKS 44 Predict 2D 49 BLITZ (Sw-Prot)
Enter the number(s) of your choice (separated by a) : 25
Enter parameters for the BLAST search
Enter the expectation cutoff value (0 - 1000) [10]:
Enter the cutoff value (optional):
Enter the number of matched sequences to be listed [50]:
Enter the number of aligned sequences to be listed [20]:
Do you want the histogram ? yes/[no]:
|
+--=> You have selected PROTEIN file 'graf_mouse.swissprot'
to be used with the following parameters:
BLAST search:
Expectation cutoff: 10
Maximum number of matched sequences: 50
Maximum number of aligned sequences: 20
Listing of histogram: no
Databank sections to be searched :
Non-redundant-PROTEIN
Is this OK? [yes]/no/exit:
Now mailing graf_mouse.swissprot for BLAST search in section 'Non-redundant-PROTEIN' to NCBI
Fri Jan 24 13:34:50 EST 1997
Search results will be returned by e-mail shortly.
To retrieve search results, read your email. It should take less than 5 minutes to get your search back.
cyrus% mail
mailx version 5.0 Fri Jul 15 21:21:05 PDT 1994 Type ? for help.
"/var/mail/tom": 1 message 1 unread
>U 1 tom Fri Jan 24 13:17 25/582 HP LaserJet III
& q
Held 1 message in /var/mail/tom
Search not back yet, try again:
cyrus% mail
mailx version 5.0 Fri Jul 15 21:21:05 PDT 1994 Type ? for help.
"/var/mail/tom": 2 messages 1 new 2 unread
U 1 tom Fri Jan 24 13:17 25/582 HP LaserJet III
>N 2 blastsvc@ncbi.nlm. Fri Jan 24 13:39 803/39387 BLAST E-mail: GRAF_MOUSE,
& s2 graf_mouse.mailfasta
"graf_mouse.mailfasta" [New file] 804/39409
& q
Held 1 message in /var/mail/tom
cyrus% more graf_mouse.mailfasta *** To read your search ****
Example 5: Doing a FASTA search locally with the GCG package. Advantage: The fasta algorithm will insert gaps in the alignment. You will see the alignment of your sequence against the target sequence in the database as one alignment with gaps inserted rather than several separate alignment areas. You can do a fasta search locally against our daily updated Genbank database by selecting the database gb_new:*
Disadvantages: The fasta program is *much* slower than blast, and uses much more CPU resources than blast. A blast search done locally will take only a minute or two, while a fasta search done locally will take a half hour to an hour.
cyrus% fasta
FastA does a Pearson and Lipman search for similarity between a query
sequence and a group of sequences of the same type (nucleic acid or
protein). For nucleotide searches, FastA may be more sensitive than BLAST.
FASTA with what query sequence ? [?1h=graf_mouse.swissprot
Begin (* 1 *) ?
End (* 248 *) ?
Search for query in what sequence(s) (* SwissProt:* *) ?
What word size (* 2 *) ?
Don't show scores whose E() value exceeds: (* 10.0 *):
What should I call the output file (* graf_mouse.fasta *) ?
1 Sequences 924 aa searched SW:104K_THEPA
101 Sequences 31,577 aa searched SW:1A12_LYCES
Note, you will see lines like this printed to your screen for about 30 minutes or so.....
...
58,801 Sequences 21,121,532 aa searched SW:YZ_SHEEP
Show more scores (* yes *) ? no
The list contains 120 entries.
How many alignments would you like to see (* 120 *) ?
Aligning...
CPU time used:
Database scan: 0:06:36.1
Post-scan processing: 0:00:58.6
Total CPU time: 0:07:35.1
Output File: graf_mouse.fasta
cyrus% more graf_mouse.fasta
script done on Fri Jan 24 14:07:14 1997