File formats related to protein sequence



Table of contents

  1. File formats related to protein sequence
    1. Table of contents
    2. IG sequence format
    3. Table (.tbl) file format
    4. FA sequence format
    5. BLAST database format
    6. BLAST summary file format


IG sequence format

At BMERC, protein sequences are traditionally kept in what is called "IG format" (for Intelligenetics, I think -- rgr, 13-Jan-97), usually one sequence per file, using the naming convention "locus.seq". The structure is quite similar to that of FA sequence format (q.v.).

The format is vaguely line-oriented, requiring at least three lines per sequence, in the following order:

  1. One or more comment lines, each of which starts with a semicolon (";") character. The content of these is arbitrary.
  2. Exactly one label line, which is usually the same as the locus that appears in the file name for single-sequence files. Spaces are significant in the label, though the psa-request server at least strips off leading and trailing whitespace. (See the psa-request server discussion of email message syntax for details.)
  3. One or more sequence lines, consisting of single-letter amino acid abbreviations with optional embedded whitespace.

IG files with multiple sequences presumably just repeat this structure, but since table (.tbl) file format is preferred for this purpose, I have not actually seen a multisequence IG file.
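
The three-part structure described above can be sketched as a small Python reader. This is only an illustration, not the code used by any BMERC program: the function name parse_ig is hypothetical, and the handling of multisequence files (a comment line after sequence data starts a new record) is an assumption based on the "presumably" above. It strips leading and trailing whitespace from the label, as the psa-request server does, and does not check that a record actually has sequence lines.

```python
def parse_ig(text):
    """Parse IG-format text into a list of (comments, label, sequence) tuples."""
    records = []
    comments, label, chunks = [], None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(";"):
            if label is not None:
                # Assumption: a comment after sequence data begins a new record.
                records.append((comments, label, "".join(chunks)))
                comments, label, chunks = [], None, []
            comments.append(line[1:].strip())
        elif label is None:
            if not comments:
                raise ValueError("label line must follow at least one comment line")
            label = line  # exactly one label line per record
        else:
            # Sequence line: drop any embedded whitespace.
            chunks.append("".join(line.split()))
    if label is not None:
        records.append((comments, label, "".join(chunks)))
    return records
```

For a single-sequence "locus.seq" file, the result is a one-element list whose label normally matches the locus in the file name.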

Note that there are bugs in the IG file reading code used by the mrf-envs and mrf-counts programs; see the "Known bugs in the MRF programs" section for details.


Table (.tbl) file format

Table (or .tbl) file format is used for many things at BMERC. The name reflects the fact that it is just the UNIX-style tab-delimited representation of a database table. Usually the table just holds a single value for each unique key.

Where table files are used to hold sequence information, there are just two fields: the sequence name (locus), and the sequence string. The sequence has no embedded whitespace, and uses uppercase letters exclusively; the exact legal alphabet depends on which programs are intended to use the values.

Here is a sample consisting of the first 10 lines of the table file for the E. coli genome; the complete file has 4285 sequences. The sequences have been truncated after 50 amino acids for brevity. Note that there are no whitespace characters other than the tab between the locus and sequence, and the newline at the end of each line. (The leading indentation in the listing, and the trailing "..." that marks truncation, are not part of the file.)


	EC0001	MKRISTTITTTITITTGNGAG
	EC0002	MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAM...
	EC0003	MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLG...
	EC0004	MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLTEIDEMLKLD...
	EC0005	MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGH...
	EC0006	MLILISPAKTLDYQSPLTTTRYTLPELLDNSQQLIHEARKLTPPQISTLM...
	EC0007	MPDFFSFINSVLWGSVMIYLLFGAGCWFTFRTGFVQFRYIRQFGKSLKNS...
	EC0008	MTDKLTSLRQYTTVVADTGDIAAMKLYQPQDATTNPSLILNAAQIPEYRK...
	EC0009	MNTLRIGLVSISDRASSGVYQDKGIPALEEWLTSALTTPFELETRLIPDE...
	EC0010	MGNTKLANPAPLGLMGFGMTTILLNLHNVGYFALDGIILAMGIFYGGIAQ...
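
Reading a sequence table is simple enough to sketch in a few lines of Python. The function name read_tbl is hypothetical, and the check for ASCII uppercase letters is only the loosest reading of the rule above, since the exact legal alphabet depends on the consuming program. The sketch assumes each line starts directly with the locus, so the truncated sample lines above would be rejected because of their "..." suffix.

```python
def read_tbl(lines):
    """Yield (locus, sequence) pairs from the lines of a sequence .tbl file."""
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected locus<TAB>sequence")
        locus, seq = fields
        # Sequences are uppercase letters only, with no embedded whitespace.
        if not (seq.isascii() and seq.isalpha() and seq.isupper()):
            raise ValueError(f"line {lineno}: bad sequence for {locus}")
        yield locus, seq
```

Because it accepts any iterable of lines, the same function works on an open file object or on a list of strings.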


FA sequence format

Another format for protein sequences that is in wide use is called "FA format," after the FASTA sequence search program. Its popularity is due at least in part to the fact that it is used for preparing BLAST databases. The conventional file suffix is ".fa"; when storing one sequence per file, the naming convention "locus.fa" is used.

Here's a sample:


    > 1egdA
    KANRQREPGLGFSFEFTEQQKEFQATARKFAREEIIPVAAEYDKTGEYPVPLIRRAWELGLMNTHIPENC
    GGLGLGTFDACLISEELAYGCTGVQTAIEGNSLGQMPIIIAGNDQQKKKYLGRMTEEPLMCAYCVTEPGA
    GSDVAGIKTKAEKKGDEYIINGQKMWITNGGKANWYFLLARSDPDPKAPANKAFTGFIVEADTPGIQIGR
    KELNMGQRCSDTRGIVFEDVKVPKENVLIGDGAGFKVAMGAFDKERPVVAAGAVGLAQRALDEATKYALE
    RKTFGKLLVEHQAISFMLAEMAMKVELARMSYQRAAWEVDSGRRNTYYASIAKAFAGDIANQLATDAVQI
    LGGNGFNTEYPVEKLMRDAKIYQIYGGTSQIQRLIVAREHIDKYKN

The locus may be followed by a list of other attributes (e.g. sequence length, checksum, etc.), separated by commas, spaces, or both. We neither use nor generate any of these.
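
A minimal Python sketch of a reader for this format follows. The function name parse_fa is hypothetical, and the code assumes only what is described above: a ">" header line whose first token is the locus, followed by sequence lines. Any attributes after the locus are discarded, since we do not use them.

```python
import re

def parse_fa(text):
    """Parse FA-format text into a list of (locus, sequence) tuples."""
    records = []
    locus, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if locus is not None:
                records.append((locus, "".join(chunks)))
            header = line[1:].strip()
            if not header:
                raise ValueError("header line has no locus")
            # The locus ends at the first comma or whitespace; the rest
            # of the header (optional attributes) is ignored.
            locus = re.split(r"[,\s]+", header, maxsplit=1)[0]
            chunks = []
        elif locus is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            chunks.append("".join(line.split()))
    if locus is not None:
        records.append((locus, "".join(chunks)))
    return records
```

The same function handles both single-sequence "locus.fa" files and multisequence database files.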


BLAST database format

BLAST [need hyperlink] also uses FA format for its sequence databases (and search targets). The sequence database is a single .fa file that includes all sequences, plus some extra files created by the setdb program from the original FA database. For example, if a set of sequences exists in the database.tbl file, then one must do the following in order to set up for running BLAST against this database:

        tbl2fa.pl < database.tbl > database.fa
        setdb database.fa

The first step runs tbl2fa.pl to convert the database, creating the database.fa file, and the second step runs setdb, which creates the database.fa.ahd, database.fa.atb, and database.fa.bsq files for use by the BLAST program. (Note that setdb is part of the BLAST suite; neither setdb nor the blast program itself is documented here.)
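
The conversion performed in the first step can be sketched in Python. This is not the tbl2fa.pl script itself, whose source is not shown here; the function name tbl_to_fa, the 70-column line width (taken from the FA sample in the "FA sequence format" section), and the "> locus" header spelling are all assumptions.

```python
def tbl_to_fa(tbl_lines, width=70):
    """Convert sequence .tbl lines (locus<TAB>sequence) to FA-format text."""
    out = []
    for raw in tbl_lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        locus, seq = line.split("\t")
        out.append("> " + locus)  # header spelling follows the FA sample (assumption)
        # Wrap the sequence into fixed-width lines.
        for start in range(0, len(seq), width):
            out.append(seq[start:start + width])
    return "".join(item + "\n" for item in out)
```

Used as a filter, something like sys.stdout.write(tbl_to_fa(sys.stdin)) would play the role of the first command above; the setdb step is still required afterwards.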


BLAST summary file format

BLAST summary files don't actually contain any sequences; they contain summary results (loci and scores) of hits reported by the BLAST program [need hyperlink]. They are produced from BLAST output by the parse-blast.pl script; other related utilities may be used to manipulate these files. The conventional file suffix is ".blast". (This format was developed at BMERC by Jim Freeman.)

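Each line represents a single invocation of BLAST, wherein a target sequence is "BLASTed" against a database, resulting in zero or more hits. The first field is always the locus of the target sequence, and it is followed by zero or more triples (groups of three fields), one triple per hit. The three subfields of the triple that describe the hit are, in the order they appear in the example below:

  1. The locus of the database sequence that was hit.
  2. The score of the hit, an integer.
  3. A significance value for the hit (a probability or expectation), written in scientific notation, e.g. 7.4e-79.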

Regardless of how many HSPs ("High-scoring Segment Pairs") are found for a given database sequence, at most one "hit" triple appears in the output.

Here is a small example of 12 sequences "BLASTed" against a database using the cross-blast.pl script:


	1bfg	4fgf	765	7.4e-79
	1hnf
	1lkkA	1lkkA	562	2.4e-57	1mil	89	3.2e-07
	1plc	1aac	80	2.9e-06
	1rcb
	1thx	1erv	103	1.0e-08
	1ubi	1ubi	381	3.6e-38
	1ubsA
	2fal	1hbiA	109	2.4e-09	1hlb	97	4.8e-07
	2mhr	2hmqA	272	1.3e-26
	2pgd_2
	5nul	1rcf	102	1.3e-08

Note that 1lkkA and 1ubi have self-hits, but none of the others do, because those sequences were not present in the BLAST database. 1hnf, 1rcb, 1ubsA, and 2pgd_2 did not hit anything, but cross-blast.pl still generates a line for them in the output.
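
As an illustration of the line layout, here is a minimal Python sketch that splits one summary line into the target locus and its hit triples. The function name parse_blast_line is hypothetical, and splitting on arbitrary whitespace (rather than strictly on tabs) is an assumption; the interpretation of each triple as (locus, integer score, floating-point value) is read off the example above.

```python
def parse_blast_line(line):
    """Split one .blast summary line into (target_locus, hits).

    hits is a list of (hit_locus, score, value) tuples, one per triple.
    """
    fields = line.split()  # assumption: fields are whitespace-separated
    if not fields:
        raise ValueError("empty line")
    target, rest = fields[0], fields[1:]
    if len(rest) % 3 != 0:
        raise ValueError(f"{target}: hit fields do not form complete triples")
    hits = [
        (rest[i], int(rest[i + 1]), float(rest[i + 2]))
        for i in range(0, len(rest), 3)
    ]
    return target, hits
```

A target with no hits, such as 1hnf above, yields an empty hit list, and a self-hit can be detected by comparing the target locus with the locus in each triple.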


Bob Rogers <rogers@darwin.bu.edu>
Last modified: Fri Nov 26 21:38:34 EST 1999