Sequence format conversion tools
These tools do simple conversions either from one sequence format to another, or produce a
sequence format from some other format.
ig2tbl.pl and tbl2ig.pl convert sequence data between IG format and
sequence table (.tbl) format. pdb-to-seq.pl converts sequence
information in a PDB format file to IG format.
Table of contents
- Sequence format conversion tools
- Table of contents
- pdb-to-seq.pl
- pdb-domain-seq.pl
- make-seq-file.pl
- ig2tbl.pl
- tbl2ig.pl
- fa2tbl.pl
- tbl2fa.pl
pdb-to-seq.pl
pdb-to-seq.pl converts sequence information in a
PDB format file on the standard input (or from a file named on the
command line) to one of several sequence output formats on the standard
output.
[Note: This is an updated version of make-seq-file.pl, with which it
is not fully compatible. -- rgr, 14-May-98.]
Usage:
pdb-to-seq.pl [-locus locus] [-chain [first|all|string]]
[-include-unk] [-format [ig|fa|tbl|just-seq]]
[-line-length num] [-conc]
[-use-seqres] [-use-atoms] [-use-both]
< pdb-file > sequence-file
Arguments:
- -locus locus
- supplies a string that is used as the locus name. The
-locus argument is required for all output formats
except just-seq. But if the PDB file is named on the
command line (rather than being supplied through the standard
input), then this file name is used as the default.
- -chain chain-spec
- -chains chain-spec
- specify the chain or chains to select. (-chain and
-chains are synonyms; at most one may be specified.)
The following values are legal for chain-spec:
- first to get only the first chain (the default);
- all to get all chains; or
- a string of the PDB chain
identifier letters of the desired chains (must be in
uppercase; may be separated by commas).
- -include-unk
- include all non-standard amino acids in the sequence as "X"
residues. By default, they are left out of the sequence, and a
warning is printed.
- -format [ig|fa|tbl|just-seq]
- select the output format, as defined below. The default is IG format, which is
what the needle and MRF programs expect.
- -line-length integer (default 70)
- specifies the number of sequence characters to output per line.
The value of this option is ignored if tbl format is
selected. (The default is the historically preferred number for
IG format at BMERC.)
- -conc
- if specified, all requested chains are concatenated together.
- -use-seqres
- take sequence information from the
PDB SEQRES records; this is the default.
- -use-atoms
- take sequence information from the
PDB ATOM records instead of the
SEQRES records. Normally, the SEQRES data
are more complete, since they include unresolved loop residues,
but sometimes they have errors.
- -use-both
- produce sequences from both the
ATOM records and
SEQRES records. These are disambiguated in the
output by appending "C-atom" and
"C-seqres" (respectively) to the locus name,
where "C" is the chain ID.
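The output format options are as follows:
- ig for IG format (the default);
- fa for Fasta (or "FA") format;
- tbl for sequence table (.tbl) format; or
- just-seq for the sequence alone (the only format that does not
require a locus).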
By default, only the first chain is output, using the specified locus
as the sequence identifier. If more than one chain is requested (by
name or by "-chains all") or -use-both was
specified, and if the -conc option was not specified, then
pdb-to-seq.pl must expect to emit multiple chains. (Sometimes
there might only be a single chain if "-chains all" were
requested, but the program can't know that without reading the entire
file first.) When pdb-to-seq.pl expects multiple chains, it
disambiguates them by constructing new locus names in the following
manner.
- If multiple chains are expected but the -use-both option
was not specified, then each chain is output using the specified
locus with the chain ID immediately appended. Any chain IDs of
" " are turned into "_" first.
- If -use-both was specified, but only a single chain was
requested (by name or by "-chain first"), then each
sequence is output using a locus of the form "locus-atom"
or "locus-seqres", depending on the source of the
sequence.
- If multiple chains are expected and -use-both was
specified, then each chain is output using a locus of the form
"locuschain-atom" or
"locuschain-seqres", depending on the
source of the sequence. Note that the chain ID immediately
follows the locus. Any chain IDs of " " are turned
into "_" first.
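The three naming rules above can be condensed into one small function. This is only an illustrative Python sketch, not code from pdb-to-seq.pl; the function name and its parameters are hypothetical.

```python
def output_locus(locus, chain_id, source, multiple_chains, use_both):
    """Build the locus name for one output sequence.

    source is "atom" or "seqres"; multiple_chains is True when more
    than one chain is expected in the output.
    """
    name = locus
    if multiple_chains:
        # A blank chain ID is written as "_" before it is appended.
        name += "_" if chain_id == " " else chain_id
    if use_both:
        # The source of the sequence is appended after a hyphen.
        name += "-" + source
    return name


print(output_locus("1abc", "A", "seqres", True, False))  # 1abcA
print(output_locus("1abc", " ", "atom", True, True))     # 1abc_-atom
```

When neither condition holds (a single chain from a single source), the locus is returned unchanged, which matches the default behavior described above.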
In any case, multiple chains are always output in the order encountered
in the PDB file -- regardless of the order specified in the
-chains argument. See the "PDB ATOM
format" page for a hack
that finds all chains in a PDB entry.
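As a rough illustration of why chains come out in file order, here is a minimal Python sketch of chain selection from SEQRES records. It is not the pdb-to-seq.pl implementation: the function names are hypothetical, a chain-spec naming the blank chain is not handled, and the column positions are those of the standard PDB SEQRES record (chain ID in column 12, residue names from column 20).

```python
def parse_chain_spec(spec):
    """Turn a chain-spec into "first", "all", or a set of chain letters."""
    if spec in ("first", "all"):
        return spec
    return set(spec.replace(",", ""))   # commas between letters are optional


def seqres_chains(pdb_lines, spec="first"):
    """Return [(chain_id, [residue names])] in the order found in the file."""
    wanted = parse_chain_spec(spec)
    chains = {}                          # dicts keep insertion (file) order
    for line in pdb_lines:
        if not line.startswith("SEQRES"):
            continue
        chain_id = line[11]              # column 12 holds the chain ID
        names = line[19:70].split()      # residue names start at column 20
        if wanted == "first" and chains and chain_id not in chains:
            break                        # a second chain has started; stop
        if isinstance(wanted, set) and chain_id not in wanted:
            continue
        chains.setdefault(chain_id, []).extend(names)
    return list(chains.items())
```

Because the requested letters are held in a set and the records are scanned once from top to bottom, a request for "B,A" yields the same order as "A,B".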
If a residue name (ACE, for instance) is not one of the 20 standard
amino acid names, pdb-to-seq.pl normally prints a
warning and omits the offending residue from the resulting string. Use
the -include-unk option if you wish to include non-standard
amino acids in the sequence as "X" characters.
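The handling of non-standard residues might look like the following Python sketch. The table is the usual three-letter to one-letter amino acid code; the function name and the wording of the warning are hypothetical, and the include_unk parameter merely mirrors the -include-unk option.

```python
import sys

# The 20 standard amino acids, three-letter name to one-letter code.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def residues_to_sequence(names, include_unk=False):
    """Convert residue names to a one-letter sequence string."""
    letters = []
    for name in names:
        letter = THREE_TO_ONE.get(name)
        if letter is None:
            if include_unk:
                letter = "X"             # keep the residue as an unknown
            else:
                print("warning: omitting non-standard residue " + name,
                      file=sys.stderr)
                continue                 # leave it out of the sequence
        letters.append(letter)
    return "".join(letters)


print(residues_to_sequence(["ACE", "MET", "LYS"]))                    # MK
print(residues_to_sequence(["ACE", "MET", "LYS"], include_unk=True))  # XMK
```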
If a chain is missing (i.e. no residues are found) for any chain that
was explicitly requested, pdb-to-seq.pl dies with an error
message; this is often due to an incorrect chain designator.
Known bugs:
- pdb-to-seq.pl should recognize GLX and ASX residues,
though perhaps there's nothing intelligent we can do with them.
-- rgr, 18-Feb-97. [at least in release 1.3 the
-include-unk arg can be used to get them in the sequence
as 'X' residues, though that's not usually what's desired. --
rgr, 7-Dec-98.]
- *** pdb-to-seq.pl can't handle nucleotide sequences.
-- rgr, 18-Feb-97.
- *** If -use-both is specified and multiple chains are
explicitly requested, but some chains are not present (in either
form), this is not detected. -- rgr, 14-May-98.
- If -use-atoms is requested for a PDB file with multiple
models, all models are considered, resulting in duplicates (if
each model has two or more chains) or concatenation (if only a
single chain exists). [kludged in release 1.3 by exiting when an
ENDMDL record is encountered. -- rgr, 6-Jan-99.]
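The ENDMDL workaround in the last item can be pictured with a short Python sketch that reads residues from ATOM records and stops at the end of the first model. This is an assumption-laden illustration rather than the actual code: the function name is hypothetical, and the column positions are those of the standard PDB ATOM record (residue name in columns 18-20, chain ID in column 22, residue number and insertion code in columns 23-27).

```python
def atom_residues(pdb_lines):
    """Return [(chain_id, residue_name)] for the first model only."""
    residues = []
    last_key = None
    for line in pdb_lines:
        if line.startswith("ENDMDL"):
            break                        # ignore every model after the first
        if not line.startswith("ATOM"):
            continue
        chain_id = line[21]              # column 22
        key = (chain_id, line[22:27])    # residue number plus insertion code
        if key != last_key:              # first atom of a new residue
            residues.append((chain_id, line[17:20]))  # columns 18-20
            last_key = key
    return residues
```

Without the break, a second model would repeat the same residues, giving the duplication or concatenation described above.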
pdb-domain-seq.pl
pdb-domain-seq.pl converts sequence information in a
PDB format file on the standard input (or from a file named on the
command line) to one of several sequence output formats on the standard
output. pdb-domain-seq.pl does some of the same things as pdb-to-seq.pl (which it uses internally), but
always outputs only a single sequence. Its contribution is that it also
supports the extended core chain specification
syntax for chain subranges.
Usage:
pdb-domain-seq.pl [-locus locus] [-format { ig | fa | tbl | just-seq } ]
[-chains chain-spec] [-force-align]
[-line-length num] < pdb-file > sequence-file
Arguments:
- -chain chain-spec
- -chains chain-spec
- defines which chain or chains or portions thereof to use; see the
"Core chain
specification syntax" section for chain-spec
details. The default is "_", which selects the chain
with a chain ID of space (" "). Note that this is
different from the
pdb-to-seq.pl default, which selects the first
chain.
- -format
[ ig | fa | tbl | just-seq ]
- select output format, as defined below. The default is IG format, which is
what is expected by the
needle and MRF programs.
- -locus locus
- supplies a string that is used as the locus name. The
-locus argument is required for all output formats
except just-seq. But if the PDB file is supplied on the
command line (rather than being provided through the standard
input), then this file name is used as the default.
- -line-length integer (default 70)
- specifies the number of sequence characters to output per line.
The value of this option is ignored if tbl format is
selected. (The default is the historically preferred number for
IG format at BMERC.)
- -force-align
- if specified, forces a global Smith-Waterman alignment to be done
even for chains selected in their entirety. This can be used to
correct SEQRES record errors; see
below.
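The output format options are as follows:
- ig for IG format (the default);
- fa for Fasta (or "FA") format;
- tbl for sequence table (.tbl) format; or
- just-seq for the sequence alone (the only format that does not
require a locus).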
The specified chain or chains are taken from the
SEQRES records and output concatenated into a single
sequence, using the specified locus as the sequence ID. Multiple chains
are always output in the order specified in the -chain
argument, regardless of the order in which they are encountered in the
PDB file. Since partial chains are numbered according to the PDB
ATOM record indices, pdb-domain-seq.pl must also
extract the ATOM sequence and use globalS to align them globally
so that the ATOM indices can be applied to the SEQRES
sequence. Since residues not present as PDB ATOM records have
undefined indices, these SEQRES residues are apportioned as
follows:
- Residues that lie entirely within a subrange naturally belong to
that subrange;
- Residues that come before the first subrange or after the last one
(i.e. a subrange starting with 1 or ending with the last ATOM
index) belong to that subrange; and
- Residues that fall between subranges (i.e. the subrange ends just
before a gap or starts just after one) are divided equally
between them.
In short, each ATOM-less residue belongs to whatever subrange
contains the nearest residue that has ATOM records.
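The nearest-residue rule can be sketched in Python as follows. This is not how pdb-domain-seq.pl is written; it assumes the alignment has already produced, for each SEQRES position, either an integer ATOM index or None. The function name and data layout are hypothetical, and breaking ties toward the earlier residue is an arbitrary choice made for this sketch.

```python
def assign_subranges(atom_index, subranges):
    """Assign each SEQRES position to a subrange.

    atom_index[i] is the ATOM index aligned to SEQRES position i, or
    None if that residue has no ATOM records.  subranges is a list of
    (first, last) ATOM index pairs, inclusive.  Returns one entry per
    position: the number of the owning subrange, or None.
    """
    resolved = [i for i, idx in enumerate(atom_index) if idx is not None]
    if not resolved:
        return [None] * len(atom_index)
    owners = []
    for pos in range(len(atom_index)):
        # Nearest position with ATOM records; ties go to the earlier one.
        nearest = min(resolved, key=lambda r: (abs(r - pos), r))
        idx = atom_index[nearest]
        owner = None
        for n, (first, last) in enumerate(subranges):
            if first <= idx <= last:
                owner = n
                break
        owners.append(owner)
    return owners


# Two unresolved residues between subranges are split one to each side.
print(assign_subranges([None, 1, 2, None, None, 5, 6, None],
                       [(1, 2), (5, 6)]))
# [0, 0, 0, 0, 1, 1, 1, 1]
```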
If there is a discrepancy between the SEQRES and
ATOM records, this will appear as a mismatch in the global
alignment. In this case, pdb-domain-seq.pl always silently
uses the ATOM version, producing a sequence that is consistent
with the structure. For this reason, using the -force-align
option will fix such errors for chains selected in their entirety; these
chains are not normally aligned, since it is usually sufficient to take
the whole SEQRES sequence as-is.
If a residue name (ACE, for instance) is not one of the 20 standard
amino acid names, pdb-domain-seq.pl prints a
warning and omits the offending residue from the resulting string. If a
chain is missing (i.e. no residues are found),
pdb-domain-seq.pl dies with an error message; this is often due
to an incorrect chain designator (but see bug 3 below).
Known bugs:
- *** pdb-domain-seq.pl should recognize GLX and ASX
residues, though perhaps there's nothing intelligent we can do
with them. -- rgr, 18-Feb-97.
- *** pdb-domain-seq.pl can't handle nucleotide sequences.
-- rgr, 18-Feb-97.
- *** Missing chains are only detected when alignment is required.
-- rgr, 19-Sep-98.
- For the -chains option, only old-style (sequential) SCOP
residue indices are supported. SCOP stopped using these sometime
between July 1998 and March 1999. [fixed in release 1.6. --
rgr, 10-Jan-00.]
make-seq-file.pl
make-seq-file.pl converts sequence information in a
PDB format file on the standard input (or from a file named on the
command line) to one of several sequence output formats on the standard
output. [Note: This is semi-obsolete; see
pdb-to-seq.pl, which is a more robust version of
essentially the same thing.]
Usage:
make-seq-file.pl [-locus locus] [-chain L]
[-format [ig|fa|tbl|just-seq]] [-line-length num]
[-use-atoms] < pdb-file > sequence-file
Arguments:
- -locus locus
- supplies a string that is used as the locus name. If the PDB
file is supplied on the command line (rather than being provided
through the standard input), then this file name is used as the
default. The -locus argument is required for all output
formats except just-seq.
- -chain string
- -chains string
- specify the chain or chains to select, or "all" to get all chains
(the default). If not the token "all" (in lower case),
the string contains the chain IDs of the desired chains,
normally in uppercase. (-chain and -chains are
synonyms; at most one may be specified.)
- -format
[ ig | fa | tbl | just-seq ]
- select output format, as defined below. The default is IG format, which is
what is expected by the needle
and MRF programs.
- -line-length integer (default 70)
- specifies the number of sequence characters to output per line.
The value of this option is ignored if tbl format is
selected. (The default is the historically preferred number for
IG format at BMERC.)
- -use-atoms
- take sequence information from the
ATOM records instead of the
SEQRES records (the default). Normally, the
SEQRES data are more complete, since they include
unresolved loop residues, but sometimes they have errors. See
also the -use-both option of pdb-to-seq.pl, above.
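The output format options are as follows:
- ig for IG format (the default);
- fa for Fasta (or "FA") format;
- tbl for sequence table (.tbl) format; or
- just-seq for the sequence alone (the only format that does not
require a locus).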
Multiple chains are always output in the order encountered in the PDB
file, regardless of the order specified. See the "PDB ATOM format" page for a hack that finds all chains
in a PDB entry.
If a residue name (ACE, for instance) is not one of the 20 standard
amino acid names, make-seq-file.pl prints a warning
and omits the offending residue from the resulting string. If no
residues are found, make-seq-file.pl dies with an error
message; this is often due to an incorrect chain designator.
Known bugs:
- *** make-seq-file.pl should recognize GLX and ASX
residues, though perhaps there's nothing intelligent we can do
with them. -- rgr, 18-Feb-97.
- *** make-seq-file.pl can't handle nucleotide sequences.
-- rgr, 18-Feb-97.
- *** The -chains option is misleading when selecting more
than one chain: instead of getting each chain separately,
the result is all chains concatenated together. -- rgr,
14-Apr-97. [if this is not what you want, use pdb-to-seq.pl instead. -- rgr,
14-May-98.]
ig2tbl.pl
ig2tbl.pl converts one or more IG
format sequences to sequence table (.tbl)
file format.
Usage:
ig2tbl.pl [ file-name . . . ]
One can supply one or more file name arguments on the command line, in
which case the files are implicitly concatenated. With no file names,
the standard input is read. If one of the file names is a "-",
the standard input is read at that point in the list of files. The .tbl
format sequences are sent to the standard output.
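For readers who want to reproduce this input convention in their own scripts, Python's standard fileinput module behaves the same way: named files are read in order, "-" stands for the standard input, and an empty list falls back to the standard input. The sketch below is only an analogy to the Perl tool; the helper name is hypothetical.

```python
import fileinput


def read_all_lines(file_names):
    """Yield lines from the named files in order, as one stream.

    A name of "-" (or an empty list of names) reads the standard input.
    """
    with fileinput.input(files=file_names) as stream:
        for line in stream:
            yield line.rstrip("\n")
```

A script would typically call read_all_lines(sys.argv[1:]) and pass each line on to its converter.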
Known bugs:
- ig2tbl.pl will not complain if the sequences have bogus
characters in them, or if loci are repeated.
tbl2ig.pl
tbl2ig.pl converts sequences in one or more sequence table (.tbl) format files to IG format sequences in separate files, using
the "locus.seq" convention to name the files.
Usage:
tbl2ig.pl [-line-length num] [-write-stdout]
[ file-name . . . ]
Arguments:
- -line-length num
- specifies the maximum number of amino acid (or nucleotide)
characters that should appear on each line in the output. The
default is 70 characters.
- -write-stdout
- specifies that all sequences found should be written together on
the standard output, instead of being put into individual files.
One can supply one or more file name arguments on the command line,
in which case the files are implicitly concatenated. With no file
names, the standard input is read. If one of the file names is a
"-", the standard input is read at that point in the list of
files. Unless the -write-stdout option is specified, one
"locus.seq" file is written in the current directory for
each sequence found. With -write-stdout, all sequences are
written to the standard output instead.
Known bugs:
- tbl2ig.pl will not complain if the sequences have bogus
characters in them, or if loci are repeated.
- It is possible to specify absurd values for
-line-length.
fa2tbl.pl
fa2tbl.pl converts one or more FA format sequences to sequence table (.tbl) file
format.
Usage:
fa2tbl.pl [ file-name . . . ]
One can supply one or more file name arguments on the command line, in
which case the files are implicitly concatenated. With no file names,
the standard input is read. If one of the file names is a "-",
the standard input is read at that point in the list of files. The .tbl
format sequences are sent to the standard output.
Known bugs:
- fa2tbl.pl will not complain if the sequences have bogus
characters in them, or if loci are repeated.
tbl2fa.pl
tbl2fa.pl converts sequences in one or more sequence table (.tbl) format
files (or on the standard input) to Fasta (or "FA") format
sequences on the standard output.
Usage:
tbl2fa.pl [ -line-length num ] [ tbl-file . . . ]
> fa-file
One can supply one or more file name arguments on the command line, in
which case the files are implicitly concatenated. With no file names,
the standard input is read. If one of the file names is a "-",
the standard input is read at that point in the list of files. Each
sequence encountered is written on the standard output in FA format. Note that
tbl2fa.pl does not care whether it is given peptide or
nucleotide sequences.
Arguments:
- -line-length num
- specifies the maximum number of amino acid (or nucleotide)
characters that should appear on each line in the output. The
default is 70 characters.
Since the input format has no information other than locus and sequence,
the output is similarly bare. Here is a sample from the E. coli
genome; the input file is used elsewhere as an example of .tbl file
format. Only a fragment is shown; the complete FA format output
would be 1.4MB.
gamow% tbl2fa.pl -line-length 50 /seq/genome/ecoli/ecoli.tbl
>EC0001
MKRISTTITTTITITTGNGAG
>EC0002
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAM
IEKTISGQDALPNISDAERIFAELLTGLAAAQPGFPLAQLKTFVDQEFAQ
IKHVLHGISLLGQCPDSINAALICRGEKMSIAIMAGVLEARGHNVTVIDP
VEKLLAVGHYLESTVDIAESTRRIAASRIPADHMVLMAGFTAGNEKGELV
VLGRNGSDYSAAVLAACLRADCCEIWTDVDGVYTCDPRQVPDARLLKSMS
YQEAMELSYFGAKVLHPRTITPIAQFQIPCLIKNTGNPQAPGTLIGASRD
EDELPVKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLIT
QSSSEYSISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIIS
VVGDGMRTLRGISAKFFAALARANINIVAIAQGSSERSISVVVNNDDATT
GVRVTHQMLFNTDQVIEVFVIGVGGVGGALLEQLKRQQSWLKNKHIDLRV
CGVANSKALLTNVHGLNLENWQEELAQAKEPFNLGRLIRLVKEYHLLNPV
IVDCTSSQAVADQYADFLREGFHVVTPNKKANTSSMDYYHQLRYAAEKSR
RKFLYDTNVGAGLPVIENLQNLLNAGDELMKFSGILSGSLSYIFGKLDEG
MSFSEATTLAREMGYTEPDPRDDLSGMDVARKLLILARETGRELELADIE
IEPVLPAEFNAEGDVAAFMANLSQLDDLFAARVAKARDEGKVLRYVGNID
EDGVCRVKIAEVDGNDPLFKVKNGENALAFYSHYYQPLPLVLRGYGAGND
VTAAGVFADLLRTLSWKLGV
>EC0003
MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLG
RFADKLPSEPRENIVYQCWERFCQELGKQIPVAMTLEKNMPIGSGLGSSA
CSVVAALMAMNEHCGKPLNDTRLLALMGELEGRISGSIHYDNVAPCFLGG
MQLMIEENDIISQQVPGFDEWLWVLAYPGIKVSTAEARAILPAQYRRQDC
IAHGRHLAGFIHACYSRQPELAAKLMKDVIAEPYRERLLPGFRQARQAVA
EIGAVASGISGSGPTLFALCDKPETAQRVADWLGKNYLQNQEGFVHICRL
DTAGARVLEN
. . .
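The conversion shown in the sample amounts to writing a ">locus" header and then wrapping the sequence at the requested line length. The Python sketch below illustrates this under one stated assumption: each .tbl line is taken to hold a locus, some whitespace, and then the sequence. The function name is hypothetical and this is not the tbl2fa.pl source.

```python
def tbl_to_fasta(tbl_lines, line_length=70):
    """Convert .tbl lines (assumed "locus sequence") to FA format lines."""
    out = []
    for line in tbl_lines:
        fields = line.split()
        if not fields:
            continue                     # skip blank lines
        locus, sequence = fields[0], "".join(fields[1:])
        out.append(">" + locus)
        # Emit the sequence in chunks of at most line_length characters.
        for start in range(0, len(sequence), line_length):
            out.append(sequence[start:start + line_length])
    return out


for text in tbl_to_fasta(["EC0001 MKRISTTITTTITITTGNGAG"], line_length=50):
    print(text)
# >EC0001
# MKRISTTITTTITITTGNGAG
```

Like tbl2fa.pl, the sketch does not check the sequence characters, so it works equally for peptide and nucleotide input.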
Known bugs:
- tbl2fa.pl will not complain if the sequences have bogus
characters in them, or if loci are repeated.
Bob Rogers
<rogers@darwin.bu.edu>
Last modified: Tue Apr 4 22:38:29 EDT 2000