BMERC : needle tools : File formats : Sequence formats
The format is vaguely line-oriented, requiring at least three lines per sequence, in the following order:
Note that there are bugs in the IG file reading code used by the mrf-envs and mrf-counts programs; see the "Known bugs in the MRF programs" section for details.
Where table files are used to hold sequence information, there are
just two fields: the sequence name (locus), and the sequence string.
The sequence has no embedded whitespace, and uses uppercase letters
exclusively; the exact legal alphabet depends on which programs are
intended to use the values.
Here are the first 10 lines of the table file for the
E. coli genome; the complete file has 4285 sequences. The
sequences have been truncated after 50 amino acids for brevity. Note
that the only whitespace characters are the tab between the
locus and sequence, and the newline at the end of each line.
EC0001 MKRISTTITTTITITTGNGAG
EC0002 MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAM...
EC0003 MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLG...
EC0004 MKLYNLKDHNEQVSFAQAVTQGLGKNQGLFFPHDLPEFSLTEIDEMLKLD...
EC0005 MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGH...
EC0006 MLILISPAKTLDYQSPLTTTRYTLPELLDNSQQLIHEARKLTPPQISTLM...
EC0007 MPDFFSFINSVLWGSVMIYLLFGAGCWFTFRTGFVQFRYIRQFGKSLKNS...
EC0008 MTDKLTSLRQYTTVVADTGDIAAMKLYQPQDATTNPSLILNAAQIPEYRK...
EC0009 MNTLRIGLVSISDRASSGVYQDKGIPALEEWLTSALTTPFELETRLIPDE...
EC0010 MGNTKLANPAPLGLMGFGMTTILLNLHNVGYFALDGIILAMGIFYGGIAQ...
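The two-field layout described above can be read with a few lines of Python. This is only a minimal sketch, not part of the needle tools: the function names `parse_table_line` and `read_table` are hypothetical, and the check accepts any uppercase letters because the exact legal alphabet depends on the consuming program.

```python
def parse_table_line(line):
    """Split one table line into (locus, sequence).

    Assumes exactly one tab between the two fields, as described above.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        raise ValueError(f"expected 2 tab-separated fields, got {len(fields)}")
    locus, sequence = fields
    # Uppercase letters only; no embedded whitespace or other characters.
    if not (sequence.isalpha() and sequence.isupper()):
        raise ValueError(f"sequence for {locus!r} is not all uppercase letters")
    return locus, sequence


def read_table(lines):
    """Parse every non-blank line of a table file into (locus, sequence) pairs."""
    return [parse_table_line(line) for line in lines if line.strip()]
```

For example, `read_table(open("database.tbl"))` would return a list of (locus, sequence) tuples, raising `ValueError` at the first malformed line.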
FA sequence format
Another format for protein sequences that is in wide use is called "FA
format," after the FASTA sequence search program. Its popularity is due
at least in part to its use in preparing BLAST databases. The
conventional file suffix is ".fa"; when storing one sequence per file,
the naming convention "locus.fa" is used.
Here's a sample:
> 1egdA
KANRQREPGLGFSFEFTEQQKEFQATARKFAREEIIPVAAEYDKTGEYPVPLIRRAWELGLMNTHIPENC
GGLGLGTFDACLISEELAYGCTGVQTAIEGNSLGQMPIIIAGNDQQKKKYLGRMTEEPLMCAYCVTEPGA
GSDVAGIKTKAEKKGDEYIINGQKMWITNGGKANWYFLLARSDPDPKAPANKAFTGFIVEADTPGIQIGR
KELNMGQRCSDTRGIVFEDVKVPKENVLIGDGAGFKVAMGAFDKERPVVAAGAVGLAQRALDEATKYALE
RKTFGKLLVEHQAISFMLAEMAMKVELARMSYQRAAWEVDSGRRNTYYASIAKAFAGDIANQLATDAVQI
LGGNGFNTEYPVEKLMRDAKIYQIYGGTSQIQRLIVAREHIDKYKN
The locus may be followed by a list of other attributes (e.g. sequence
length, checksum, etc.), separated by a comma or a space (or both). We
do not use or generate any of these.
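As an illustration of the layout just described, the following Python sketch reads FA-format lines and returns the locus and the joined sequence for each entry. The name `read_fa` is hypothetical, and the sketch assumes that the locus is the first token after the ">" and that anything after the first comma or space on the header line is an attribute to be ignored.

```python
import re


def read_fa(lines):
    """Yield (locus, sequence) pairs from an iterable of FA-format lines."""
    locus, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if locus is not None:
                yield locus, "".join(chunks)
            header = line[1:].strip()
            # Keep only the locus; drop any comma- or space-separated attributes.
            locus = re.split(r"[,\s]+", header, maxsplit=1)[0]
            chunks = []
        elif locus is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if locus is not None:
        yield locus, "".join(chunks)
```

Applied to the sample above, `dict(read_fa(open("1egdA.fa")))` would map "1egdA" to the full, unwrapped sequence string.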
tbl2fa.pl < database.tbl > database.fa
setdb database.fa
The first step runs tbl2fa.pl to convert the
database, creating the database.fa file, and the second step
runs setdb to create the database.fa.ahd,
database.fa.atb, and database.fa.bsq files for use by
the BLAST program. (Note that setdb is part of the BLAST
suite; neither setdb nor the BLAST program itself is
documented here.)
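For readers who want to see what a table-to-FA conversion involves, here is a rough Python sketch. It is not the tbl2fa.pl script and makes no claim about how that script is written: the function name `table_to_fa` is hypothetical, the "> locus" header form and the 70-character line width are simply copied from the FA sample shown earlier, and the input is assumed to be tab-separated table lines.

```python
def table_to_fa(lines, width=70):
    """Convert table lines (locus<TAB>sequence) into a list of FA-format lines."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        locus, sequence = line.split("\t")
        # Header line: ">" followed by the locus, as in the sample above.
        out.append(f"> {locus}")
        # Wrap the sequence into fixed-width lines.
        for start in range(0, len(sequence), width):
            out.append(sequence[start:start + width])
    return out
```

Writing `"\n".join(table_to_fa(open("database.tbl")))` to a file would give one FA entry per table line.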
Each line represents a single invocation of BLAST, wherein a target sequence is "BLASTed" against a database, resulting in zero or more hits. The first field is always the locus of the target sequence, and it is followed by zero or more triples (groups of three fields), one triple per hit. The three subfields of the triple that describe the hit are:
Here is a small example of 12 sequences "BLASTed" against a database using the cross-blast.pl script:
1bfg 4fgf 765 7.4e-79
1hnf
1lkkA 1lkkA 562 2.4e-57 1mil 89 3.2e-07
1plc 1aac 80 2.9e-06
1rcb
1thx 1erv 103 1.0e-08
1ubi 1ubi 381 3.6e-38
1ubsA
2fal 1hbiA 109 2.4e-09 1hlb 97 4.8e-07
2mhr 2hmqA 272 1.3e-26
2pgd_2
5nul 1rcf 102 1.3e-08
Note that 1lkkA and 1ubi have self-hits, but none of the others do, because those sequences were not present in the blast database. 1hnf, 1rcb, 1ubsA, and 2pgd_2 did not hit anything, but cross-blast.pl still generates a line for them in the output.
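The "target locus followed by zero or more triples" layout can be unpacked as in the Python sketch below. The function names `parse_blast_line` and `has_self_hit` are hypothetical, the sketch assumes one target per line with whitespace-separated fields, and the three subfields of each triple are kept as uninterpreted strings rather than given names or numeric types.

```python
def parse_blast_line(line):
    """Split one result line into (target_locus, list_of_triples)."""
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    target, rest = fields[0], fields[1:]
    if len(rest) % 3 != 0:
        raise ValueError(f"fields after {target!r} do not form whole triples")
    # Group the remaining fields three at a time, one triple per hit.
    hits = [tuple(rest[i:i + 3]) for i in range(0, len(rest), 3)]
    return target, hits


def has_self_hit(target, hits):
    """True if the first subfield of any triple equals the target locus."""
    return any(hit[0] == target for hit in hits)
```

A target with no hits, such as 1hnf in the example, parses to an empty hit list, which matches the statement that a line is still generated for it.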