Sequence format conversion tools



These tools perform simple conversions, either from one sequence format to another or from some other format into a sequence format. ig2tbl.pl and tbl2ig.pl convert sequence data between IG format and sequence table (.tbl) format; fa2tbl.pl and tbl2fa.pl do the same for FA format. pdb-to-seq.pl converts sequence information in a PDB format file to IG format (the default) or to one of the other sequence formats.

Table of contents

  1. pdb-to-seq.pl
  2. pdb-domain-seq.pl
  3. make-seq-file.pl
  4. ig2tbl.pl
  5. tbl2ig.pl
  6. fa2tbl.pl
  7. tbl2fa.pl


pdb-to-seq.pl

pdb-to-seq.pl converts sequence information in a PDB format file on the standard input (or from a file named on the command line) to one of several sequence output formats on the standard output.

[Note: This is an updated version of make-seq-file.pl, with which it is not fully compatible. -- rgr, 14-May-98.]

Usage:


	pdb-to-seq.pl [-locus locus] [-chain [first|all|string]]
		[-include-unk] [-format [ig|fa|tbl|just-seq]]
		[-line-length num] [-conc]
		[-use-seqres] [-use-atoms] [-use-both]
		< pdb-file > sequence-file

Arguments:

-locus locus
supplies a string that is used as the locus name. The -locus argument is required for all output formats except just-seq, unless the PDB file is named on the command line (rather than being supplied through the standard input), in which case the file name is used as the default.
-chain chain-spec
-chains chain-spec
specify the chain or chains to select. (-chain and -chains are synonyms; at most one may be specified.) As the usage line shows, chain-spec may be "first" (the default; see below), "all", or a string naming the desired chains.
-include-unk
include all non-standard amino acids in the sequence as "X" residues. By default, they are left out of the sequence, and a warning is printed.
-format [ig|fa|tbl|just-seq]
select the output format, as defined below. The default is IG format, which is what the needle and MRF programs expect.
-line-length integer (default 70)
specifies the number of sequence characters to output per line. The value of this option is ignored if tbl format is selected. (The default is the historically preferred number for IG format at BMERC.)
-conc
if specified, all requested chains are concatenated together.
-use-seqres
take sequence information from the PDB SEQRES records; this is the default.
-use-atoms
take sequence information from the PDB ATOM records instead of the SEQRES records. Normally, the SEQRES data are more complete, since they include unresolved loop residues, but sometimes they have errors.
-use-both
produce sequences from both the ATOM records and SEQRES records. These are disambiguated in the output by appending "C-atom" and "C-seqres" (respectively) to the locus name, where "C" is the chain ID.
The output format options are as follows:

-format option   Meaning
ig               IG format
fa               FASTA format
just-seq         just the raw sequence (on a line by itself, without the locus)
tbl              table file format

By default, only the first chain is output, using the specified locus as the sequence identifier. If more than one chain is requested (by name or by "-chains all") or -use-both was specified, and if the -conc option was not specified, then pdb-to-seq.pl must expect to emit multiple chains. (Sometimes there might only be a single chain if "-chains all" were requested, but the program can't know that without reading the entire file first.) When pdb-to-seq.pl expects multiple chains, it disambiguates them by constructing new locus names in the following manner.

In any case, multiple chains are always output in the order encountered in the PDB file -- regardless of the order specified in the -chains argument. See the "PDB ATOM format" page for a hack that finds all chains in a PDB entry.
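
The file-order behavior can be illustrated with a minimal Python sketch that collects SEQRES residue names per chain in the order the chains first appear. This is not the Perl source. It assumes the standard PDB SEQRES column layout (chain ID in column 12, residue names starting in column 20), and the function name seqres_chains and its wanted parameter are hypothetical.

```python
def seqres_chains(pdb_lines, wanted=None):
    """Return [(chain_id, [residue names])] in order of first appearance.

    wanted is None for all chains, or a string of chain IDs (hypothetical
    stand-in for a chain-spec).  The order of IDs in wanted is ignored:
    chains come out in the order they occur in the file.
    """
    chains = {}  # dicts preserve insertion order, i.e. file order
    for line in pdb_lines:
        # Skip anything that is not a SEQRES record with at least one residue.
        if not line.startswith("SEQRES") or len(line) < 20:
            continue
        chain_id = line[11]            # column 12 (1-based) holds the chain ID
        if wanted is not None and chain_id not in wanted:
            continue
        # Residue names occupy columns 20-70, separated by blanks.
        chains.setdefault(chain_id, []).extend(line[19:70].split())
    return list(chains.items())
```

Because the dictionary is filled while scanning the file once, asking for "AB" and asking for "BA" give the same result, which matches the behavior described above.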

If a residue name is not one of the 20 standard amino acid names (ACE, for instance), pdb-to-seq.pl normally prints a warning and omits the offending residue from the resulting string. Use the -include-unk option if you wish to include non-standard amino acids in the sequence as "X" characters.
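
The treatment of non-standard residues can be sketched in Python as follows. This is an illustration only, not the Perl source: the function name residues_to_sequence and its include_unk parameter are hypothetical stand-ins for the -include-unk option, and whether the real program still prints a warning when -include-unk is given is not stated here (this sketch stays silent in that case).

```python
import sys

# The 20 standard amino acids, three-letter name to one-letter code.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def residues_to_sequence(residue_names, include_unk=False):
    """Translate residue names to a one-letter sequence string."""
    letters = []
    for name in residue_names:
        letter = THREE_TO_ONE.get(name.upper())
        if letter is not None:
            letters.append(letter)
        elif include_unk:
            # Non-standard residue kept as "X".
            letters.append("X")
        else:
            # Non-standard residue dropped, with a warning on stderr.
            print(f"warning: omitting non-standard residue {name}",
                  file=sys.stderr)
    return "".join(letters)
```

Under this sketch, GLX and ASX are treated like any other non-standard name, which is consistent with known bug 1 below.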

If any explicitly requested chain is missing (i.e. no residues are found for it), pdb-to-seq.pl dies with an error message; this is often due to an incorrect chain designator.

Known bugs:

  1. pdb-to-seq.pl should recognize GLX and ASX residues, though perhaps there's nothing intelligent we can do with them. -- rgr, 18-Feb-97. [at least in release 1.3 the -include-unk arg can be used to get them in the sequence as 'X' residues, though that's not usually what's desired. -- rgr, 7-Dec-98.]
  2. *** pdb-to-seq.pl can't handle nucleotide sequences. -- rgr, 18-Feb-97.
  3. *** If -use-both is specified and multiple chains are explicitly requested, but some chains are not present (in either form), this is not detected. -- rgr, 14-May-98.
  4. If -use-atoms is requested for a PDB file with multiple models, all models are considered, resulting in duplicates (if each model has two or more chains) or concatenation (if only a single chain exists). [kludged in release 1.3 by exiting when an ENDMDL record is encountered. -- rgr, 6-Jan-99.]


pdb-domain-seq.pl

pdb-domain-seq.pl converts sequence information in a PDB format file on the standard input (or from a file named on the command line) to one of several sequence output formats on the standard output. pdb-domain-seq.pl does some of the same things as pdb-to-seq.pl (which it uses internally), but always outputs only a single sequence. Its contribution is that it also supports the extended core chain specification syntax for chain subranges.

Usage:


	pdb-domain-seq.pl [-locus locus] [-format { ig | fa | tbl | just-seq } ]
		[-chains chain-spec] [-force-align]
		[-line-length num] < pdb-file > sequence-file

Arguments:

-chain chain-spec
-chains chain-spec
defines which chain or chains or portions thereof to use; see the "Core chain specification syntax" section for chain-spec details. The default is "_", which selects the chain with a chain ID of space (" "). Note that this is different from the pdb-to-seq.pl default, which selects the first chain.
-format [ ig | fa | tbl | just-seq ]
select output format, as defined below. The default is IG format, which is what is expected by the needle and MRF programs.
-locus locus
supplies a string that is used as the locus name. The -locus argument is required for all output formats except just-seq, unless the PDB file is named on the command line (rather than being supplied through the standard input), in which case the file name is used as the default.
-line-length integer (default 70)
specifies the number of sequence characters to output per line. The value of this option is ignored if tbl format is selected. (The default is the historically preferred number for IG format at BMERC.)
-force-align
if specified, forces a global Smith-Waterman alignment to be done even for chains selected in their entirety. This can be used to correct SEQRES record errors; see below.
The output format options are as follows:

-format option   Meaning
ig               IG format
fa               FASTA format
just-seq         just the raw sequence (on a line by itself, without the locus)
tbl              table file format

The specified chain or chains are taken from the SEQRES records and output concatenated into a single sequence, using the specified locus as the sequence ID. Multiple chains are always output in the order specified in the -chain argument, regardless of the order in which they are encountered in the PDB file. Since partial chains are numbered according to the PDB ATOM record indices, pdb-domain-seq.pl must also extract the ATOM sequence and use globalS to align them globally so that the ATOM indices can be applied to the SEQRES sequence. Since residues not present as PDB ATOM records have undefined indices, these SEQRES residues are apportioned as follows:

In short, each ATOM-less residue belongs to whatever subrange contains the nearest residue that has ATOM records.
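
The "nearest residue" rule can be sketched in Python. This is only an illustration under stated assumptions, not the program's algorithm: the data layout (a list holding the ATOM index of each SEQRES residue, or None where there are no ATOM records), the function names, and in particular the tie-break (an equidistant ATOM-less residue goes with the earlier neighbor) are all hypothetical.

```python
def nearest_resolved(atom_index, pos):
    """Position of the nearest residue that has an ATOM index, or None.

    Ties go to the earlier residue; this tie-break is an assumption.
    """
    n = len(atom_index)
    for dist in range(n):
        for cand in (pos - dist, pos + dist):
            if 0 <= cand < n and atom_index[cand] is not None:
                return cand
    return None


def select_subrange(seqres, atom_index, lo, hi):
    """Return the SEQRES residues belonging to ATOM index range lo..hi.

    Residues without ATOM records inherit membership from the nearest
    residue that has them.
    """
    selected = []
    for pos, letter in enumerate(seqres):
        anchor = nearest_resolved(atom_index, pos)
        if anchor is not None and lo <= atom_index[anchor] <= hi:
            selected.append(letter)
    return "".join(selected)
```

For example, with seqres "MKVLAGT" and atom_index [None, 10, 11, None, None, None, 15], the leading M and the first two unresolved loop residues join the 10..11 subrange, while the third loop residue joins the subrange containing index 15.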

If there is a discrepancy between the SEQRES and ATOM records, this will appear as a mismatch in the global alignment. In this case, pdb-domain-seq.pl always silently uses the ATOM version, producing a sequence that is consistent with the structure. For this reason, using the -force-align option will fix such errors for chains selected in their entirety; these chains are not normally aligned, since it is usually sufficient to take the whole SEQRES sequence as-is.
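
A minimal Python sketch of the "ATOM version wins" rule follows. It assumes the global alignment is available as two equal-length strings with "-" marking gaps; that representation and the function name merge_alignment are hypothetical, and the sketch says nothing about how globalS itself reports alignments.

```python
def merge_alignment(seqres_aln, atom_aln, gap="-"):
    """Merge two aligned sequences, preferring the ATOM residue.

    Where the ATOM sequence has a residue, it is used (so mismatches
    resolve in favor of the structure).  Where the ATOM sequence has a
    gap, e.g. an unresolved loop, the SEQRES residue is kept.
    """
    if len(seqres_aln) != len(atom_aln):
        raise ValueError("aligned strings must have equal length")
    merged = []
    for seqres_char, atom_char in zip(seqres_aln, atom_aln):
        merged.append(atom_char if atom_char != gap else seqres_char)
    # Drop columns where both sequences had a gap.
    return "".join(ch for ch in merged if ch != gap)
```

Here merge_alignment("MKVLAG", "-KILA-") yields "MKILAG": the V/I mismatch takes the ATOM residue, and the terminal residues missing from the ATOM records are retained from SEQRES.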

If a residue name is not one of the 20 standard amino acid names (ACE, for instance), pdb-domain-seq.pl prints a warning and omits the offending residue from the resulting string. If a chain is missing (i.e. no residues are found), pdb-domain-seq.pl dies with an error message; this is often due to an incorrect chain designator (but see bug 3 below).

Known bugs:

  1. *** pdb-domain-seq.pl should recognize GLX and ASX residues, though perhaps there's nothing intelligent we can do with them. -- rgr, 18-Feb-97.
  2. *** pdb-domain-seq.pl can't handle nucleotide sequences. -- rgr, 18-Feb-97.
  3. *** Missing chains are only detected when alignment is required. -- rgr, 19-Sep-98.
  4. For the -chains option, only old-style (sequential) SCOP residue indices are supported. SCOP stopped using these sometime between July 1998 and March 1999. [fixed in release 1.6. -- rgr, 10-Jan-00.]


make-seq-file.pl

make-seq-file.pl converts sequence information in a PDB format file on the standard input (or from a file named on the command line) to one of several sequence output formats on the standard output. [Note: This is semi-obsolete; see pdb-to-seq.pl, which is a more robust version of essentially the same thing.]

Usage:

 
	make-seq-file.pl [-locus locus] [-chain L]
		[-format ig | fa | tbl | just-seq ] [-line-length num]
		[-use-atoms] < pdb-file > ig-file

Arguments:

-locus locus
supplies a string that is used as the locus name. The -locus argument is required for all output formats except just-seq, unless the PDB file is named on the command line (rather than being supplied through the standard input), in which case the file name is used as the default.
-chain string
-chains string
specify the chain or chains to select, or "all" to get all chains (the default). Unless it is the token "all" (in lower case), the string lists the chain IDs of the desired chains, normally in uppercase. (-chain and -chains are synonyms; at most one may be specified.)
-format [ ig | fa | tbl | just-seq ]
select output format, as defined below. The default is IG format, which is what is expected by the needle and MRF programs.
-line-length integer (default 70)
specifies the number of sequence characters to output per line. The value of this option is ignored if tbl format is selected. (The default is the historically preferred number for IG format at BMERC.)
-use-atoms
take sequence information from the ATOM records instead of the SEQRES records, which are the default source. Normally, the SEQRES data are more complete, since they include unresolved loop residues, but sometimes they have errors. See also the -use-both option of pdb-to-seq.pl, above.

The output format options are as follows:

-format option   Meaning
ig               IG format
fa               FASTA format
just-seq         just the raw sequence (on a line by itself, without the locus)
tbl              table file format

Multiple chains are always output in the order encountered in the PDB file, regardless of the order specified. See the "PDB ATOM format" page for a hack that finds all chains in a PDB entry.

If a residue name is not one of the 20 standard amino acid names (ACE, for instance), make-seq-file.pl prints a warning and omits the offending residue from the resulting string. If no residues are found, make-seq-file.pl dies with an error message; this is often due to an incorrect chain designator.

Known bugs:

  1. *** make-seq-file.pl should recognize GLX and ASX residues, though perhaps there's nothing intelligent we can do with them. -- rgr, 18-Feb-97.
  2. *** make-seq-file.pl can't handle nucleotide sequences. -- rgr, 18-Feb-97.
  3. *** The -chains option is misleading when selecting more than one chain: instead of getting each chain separately, the result is all chains concatenated together. -- rgr, 14-Apr-97. [if this is not what you want, use pdb-to-seq.pl instead. -- rgr, 14-May-98.]


ig2tbl.pl

ig2tbl.pl converts one or more IG format sequences to sequence table (.tbl) file format.

Usage:

 
	ig2tbl.pl [ file-name . . . ]

One can supply one or more file name arguments on the command line, in which case the files are implicitly concatenated. With no file names, the standard input is read. If one of the file names is a "-", the standard input is read at that point in the list of files. The .tbl format sequences are sent to the standard output.
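
As a rough illustration of the conversion, here is a Python sketch. Both format details are assumptions rather than facts taken from this page: IG records are taken to be optional ";" comment lines, a locus line, and sequence lines ending with a "1" or "2" terminator, and .tbl lines are taken to be a locus and a sequence separated by a tab. The function name ig_to_tbl is hypothetical.

```python
def ig_to_tbl(ig_lines):
    """Convert lines of (assumed) IG format to (assumed) .tbl lines."""
    records = []
    locus, parts = None, []
    for raw in ig_lines:
        line = raw.strip()
        if not line or line.startswith(";"):
            continue                      # blank line or comment
        if locus is None:
            locus = line.split()[0]       # first non-comment line is the locus
            continue
        terminated = line[-1] in "12"     # assumed end-of-sequence marker
        if terminated:
            line = line[:-1]
        parts.append(line.replace(" ", ""))
        if terminated:
            records.append(f"{locus}\t{''.join(parts)}")
            locus, parts = None, []
    if locus is not None and parts:
        # Tolerate a final record that lacks a terminator.
        records.append(f"{locus}\t{''.join(parts)}")
    return records
```

Like ig2tbl.pl as described under known bugs, this sketch does not check for bogus sequence characters or repeated loci.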

Known bugs:

  1. ig2tbl.pl will not complain if the sequences have bogus characters in them, or if loci are repeated.


tbl2ig.pl

tbl2ig.pl converts sequences in one or more sequence table (.tbl) format files to IG format sequences in separate files, using the "locus.seq" convention to name the files.

Usage:

 
	tbl2ig.pl [-line-length num] [-write-stdout]
		[ file-name . . . ]

Arguments:

-line-length num
specifies the maximum number of amino acid (or nucleotide) characters that should appear on each line in the output. The default is 70 characters.
-write-stdout
specifies that all sequences found should be written together on the standard output, instead of being put into individual files.

One can supply one or more file name arguments on the command line, in which case the files are implicitly concatenated. With no file names, the standard input is read. If one of the file names is a "-", the standard input is read at that point in the list of files. Unless the -write-stdout option is specified, one "locus.seq" file is written in the current directory for each sequence found. With -write-stdout, all sequences are written to the standard output instead.

Known bugs:

  1. tbl2ig.pl will not complain if the sequences have bogus characters in them, or if loci are repeated.
  2. It is possible to specify absurd values for -line-length.


fa2tbl.pl

fa2tbl.pl converts one or more FA format sequences to sequence table (.tbl) file format.

Usage:

	fa2tbl.pl [ file-name . . . ]

One can supply one or more file name arguments on the command line, in which case the files are implicitly concatenated. With no file names, the standard input is read. If one of the file names is a "-", the standard input is read at that point in the list of files. The .tbl format sequences are sent to the standard output.
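
The FA to .tbl direction can be sketched in Python as follows. The sketch assumes that the locus is the first word after ">" on the header line and that a .tbl line is a locus and a sequence separated by a tab; the .tbl layout and the function name fasta_to_tbl are assumptions, not taken from fa2tbl.pl.

```python
def fasta_to_tbl(fa_lines):
    """Convert FASTA lines to (assumed) .tbl lines, one per sequence."""
    records = []
    locus, parts = None, []
    for raw in fa_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if locus is not None:
                # Flush the previous sequence before starting a new one.
                records.append(f"{locus}\t{''.join(parts)}")
            header = line[1:].split()
            locus = header[0] if header else ""
            parts = []
        elif locus is not None:
            parts.append(line)            # sequence line for current locus
    if locus is not None:
        records.append(f"{locus}\t{''.join(parts)}")
    return records
```

Concatenating several input files, as the tool does, amounts to passing their lines through this function one after another.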

Known bugs:

  1. fa2tbl.pl will not complain if the sequences have bogus characters in them, or if loci are repeated.


tbl2fa.pl

tbl2fa.pl converts sequences in one or more sequence table (.tbl) format files (or on the standard input) to FASTA (or "FA") format sequences on the standard output.

Usage:

	tbl2fa.pl [ -line-length num ] [ tbl-file . . . ]
		> fa-file

One can supply one or more file name arguments on the command line, in which case the files are implicitly concatenated. With no file names, the standard input is read. If one of the file names is a "-", the standard input is read at that point in the list of files. Each sequence encountered is written on the standard output in FA format. Note that tbl2fa.pl does not care whether it is given peptide or nucleotide sequences.

Arguments:

-line-length num
specifies the maximum number of amino acid (or nucleotide) characters that should appear on each line in the output. The default is 70 characters.

Since the input format has no information other than locus and sequence, the output is similarly bare. Here is a sample from the E. coli genome; the input file is used elsewhere as an example of .tbl file format. Only a fragment is shown; the complete FA format output would be 1.4MB.
 
	gamow% tbl2fa.pl -line-length 50 /seq/genome/ecoli/ecoli.tbl
	>EC0001
	MKRISTTITTTITITTGNGAG
	>EC0002
	MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAM
	IEKTISGQDALPNISDAERIFAELLTGLAAAQPGFPLAQLKTFVDQEFAQ
	IKHVLHGISLLGQCPDSINAALICRGEKMSIAIMAGVLEARGHNVTVIDP
	VEKLLAVGHYLESTVDIAESTRRIAASRIPADHMVLMAGFTAGNEKGELV
	VLGRNGSDYSAAVLAACLRADCCEIWTDVDGVYTCDPRQVPDARLLKSMS
	YQEAMELSYFGAKVLHPRTITPIAQFQIPCLIKNTGNPQAPGTLIGASRD
	EDELPVKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLIT
	QSSSEYSISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIIS
	VVGDGMRTLRGISAKFFAALARANINIVAIAQGSSERSISVVVNNDDATT
	GVRVTHQMLFNTDQVIEVFVIGVGGVGGALLEQLKRQQSWLKNKHIDLRV
	CGVANSKALLTNVHGLNLENWQEELAQAKEPFNLGRLIRLVKEYHLLNPV
	IVDCTSSQAVADQYADFLREGFHVVTPNKKANTSSMDYYHQLRYAAEKSR
	RKFLYDTNVGAGLPVIENLQNLLNAGDELMKFSGILSGSLSYIFGKLDEG
	MSFSEATTLAREMGYTEPDPRDDLSGMDVARKLLILARETGRELELADIE
	IEPVLPAEFNAEGDVAAFMANLSQLDDLFAARVAKARDEGKVLRYVGNID
	EDGVCRVKIAEVDGNDPLFKVKNGENALAFYSHYYQPLPLVLRGYGAGND
	VTAAGVFADLLRTLSWKLGV
	>EC0003
	MVKVYAPASSANMSVGFDVLGAAVTPVDGALLGDVVTVEAAETFSLNNLG
	RFADKLPSEPRENIVYQCWERFCQELGKQIPVAMTLEKNMPIGSGLGSSA
	CSVVAALMAMNEHCGKPLNDTRLLALMGELEGRISGSIHYDNVAPCFLGG
	MQLMIEENDIISQQVPGFDEWLWVLAYPGIKVSTAEARAILPAQYRRQDC
	IAHGRHLAGFIHACYSRQPELAAKLMKDVIAEPYRERLLPGFRQARQAVA
	EIGAVASGISGSGPTLFALCDKPETAQRVADWLGKNYLQNQEGFVHICRL
	DTAGARVLEN
	. . .
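
The wrapping shown in the sample can be approximated with a short Python sketch. It assumes each .tbl line holds a locus and a sequence separated by whitespace, which is an assumption about the .tbl format; the function names wrap_sequence and tbl_to_fasta are hypothetical.

```python
def wrap_sequence(sequence, line_length=70):
    """Split a sequence into chunks of at most line_length characters."""
    return [sequence[i:i + line_length]
            for i in range(0, len(sequence), line_length)]


def tbl_to_fasta(tbl_lines, line_length=70):
    """Convert (assumed) .tbl lines to FASTA output lines."""
    out = []
    for line in tbl_lines:
        fields = line.split()
        if len(fields) < 2:
            continue                      # skip blank or malformed lines
        locus, sequence = fields[0], fields[1]
        out.append(">" + locus)           # bare header: locus only
        out.extend(wrap_sequence(sequence, line_length))
    return out
```

Nothing in the sketch depends on the alphabet, so it handles peptide and nucleotide sequences alike, as tbl2fa.pl is described as doing.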

Known bugs:

  1. tbl2fa.pl will not complain if the sequences have bogus characters in them, or if loci are repeated.


Bob Rogers <rogers@darwin.bu.edu>
Last modified: Tue Apr 4 22:38:29 EDT 2000