sequence

(c) Copyright 1986-2005, 2009 by The University of Washington. Written by Joseph Felsenstein. Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed.

These programs estimate phylogenies from protein sequence or nucleic acid sequence data. Protpars uses a parsimony method intermediate between Eck and Dayhoff's method (1966) of allowing transitions between all amino acids and counting those, and Fitch's (1971) method of counting the number of nucleotide changes that would be needed to evolve the protein sequence. Dnapars uses the parsimony method allowing changes between all bases and counting the number of those. Dnamove is an interactive parsimony program allowing the user to rearrange trees by hand and see where character states change. Dnapenny uses the branch-and-bound method to search for all most parsimonious trees in the nucleic acid sequence case. Dnacomp adapts to nucleotide sequences the compatibility (largest clique) approach. Dnainvar does not directly estimate a phylogeny, but computes Lake's (1987) and Cavender's (Cavender and Felsenstein, 1987) phylogenetic invariants, which are quantities whose values depend on the phylogeny. Dnaml does a maximum likelihood estimate of the phylogeny (Felsenstein, 1981a). Dnamlk is similar to Dnaml but assumes a molecular clock. Dnadist computes distance measures between pairs of species from nucleotide sequences, distances that can then be used by the distance matrix programs Fitch and Kitsch. Restml does a maximum likelihood estimate from restriction sites data. Seqboot allows you to read in a data set and then produce multiple data sets from it by bootstrapping, delete-half jackknifing, or by permuting within sites. This then allows most of these methods to be bootstrapped or jackknifed, and for the Permutation Tail Probability Test of Archie (1989) and Faith and Cranston (1991) to be carried out.

The input and output format for Restml is described in its documentation files. In general its input format is similar to those described here, except that the one-letter codes for restriction sites is specific to that program and is described in that documentation file. Since the input formats for the eight DNA sequence and two protein sequence programs apply to more than one program, they are described here. Their input formats are standard, making use of the IUPAC standards.

INTERLEAVED AND SEQUENTIAL FORMATS

The sequences can continue over multiple lines; when this is done the sequences must be either in "interleaved" format, similar to the output of alignment programs, or "sequential" format. These are described in the main documentation file. In sequential format all of one sequence is given, possibly on multiple lines, before the next starts. In interleaved format the first part of the file should contain the first part of each of the sequences, then optionally a line containing nothing but a carriage-return character, then the second part of each sequence, and so on. Only the first parts of the sequences should be preceded by names. Here is a hypothetical example of interleaved format:

  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
Salmo gairAAGCCTTGGC AGTGCAGGGT
H. SapiensACCGGTTGGC CGTTCAGGGT
Chimp     AAACCCTTGC CGTTACGCTT
Gorilla   AAACCCTTGC CGGTACGCTT

GAGCCCGGGC AATACAGGGT AT
GAGCCGTGGC CGGGCACGGT AT
ACAGGTTGGC CGTTCAGGGT AA
AAACCGAGGC CGGGACACTC AT
AAACCATTGC CGGTACGCTT AA

  5    42
Turkey    AAGCTNGGGC ATTTCAGGGT
GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT
GAGCCGTGGC CGGGCACGGT AT
H. SapiensACCGGTTGGC CGTTCAGGGT
ACAGGTTGGC CGTTCAGGGT AA
Chimp     AAACCCTTGC CGTTACGCTT
AAACCGAGGC CGGGACACTC AT
Gorilla   AAACCCTTGC CGGTACGCTT
AAACCATTGC CGGTACGCTT AA

is perfectly legal, assuming that the species name has gone before, and is filled out to full length by blanks. The above digits and blanks will be ignored, the sequence being taken as starting at the first base symbol (in this case an A). This should enable you to use output from many multiple-sequence alignment programs with only minimal editing.

In interleaved format the present versions of the programs may sometimes have difficulties with the blank lines between groups of lines, and if so you might want to retype those lines, making sure that they have only a carriage-return and no blank characters on them, or you may perhaps have to eliminate them. The symptoms of this problem are that the programs complain that the sequences are not properly aligned, and you can find no other cause for this complaint.

INPUT FOR THE DNA SEQUENCE PROGRAMS

The input format for the DNA sequence programs is standard: the data have A's, G's, C's and T's (or U's). The first line of the input file contains the number of species and the number of sites. As with the other programs, options information may follow this. Following this, each species starts on a new line. The first 10 characters of that line are the species name. There then follows the base sequence of that species, each character being one of the letters A, B, C, D, G, H, K, M, N, O, R, S, T, U, V, W, X, Y, ?, or - (a period was also previously allowed but it is no longer allowed, because it sometimes is used in different senses in other programs). Blanks will be ignored, and so will numerical digits. This allows GENBANK and EMBL sequence entries to be read with minimum editing.

Gaps are indicated by "-", unknown nucleotides by N or "?". Alignment programs sometimes create sequence files that have periods (dots) for gaps. This is not allowed in PHYLIP, because some programs also use dots in another sense, having them indicate that the sequence is the same as the first sequence. Because of this ambuguity, we have banned dots, and the programs will produce an error message if they encounter them. If they truly are gaps you should replace them by "-", if they are unknown sequence you should replace them by "?", and if they mean "the same as the first sequence" you should replace each dot by the nucleotide that is implied by that.

These characters can be either upper or lower case. The algorithms convert all input characters to upper case (which is how they are treated). The characters constitute the IUPAC (IUB) nucleic acid code plus some slight extensions. They enable input of nucleic acid sequences taking full account of any ambiguities in the sequence.

Symbol	Meaning

A	Adenine
G	Guanine
C	Cytosine
T	Thymine
U	Uracil
Y	pYrimidine	(C or T)
R	puRine	(A or G)
W	"Weak"	(A or T)
S	"Strong"	(C or G)
K	"Keto"	(T or G)
M	"aMino"	(C or A)
B	not A	(C or G or T)
D	not C	(A or G or T)
H	not G	(A or C or T)
V	not T	(A or C or G)
X,N,?	unknown	(A or C or G or T)
O	deletion
-	deletion

INPUT FOR THE PROTEIN SEQUENCE PROGRAMS

The input for the protein sequence programs is fairly standard. The first line contains the number of species and the number of amino acid positions (counting any stop codons that you want to include). These are followed on the same line by the options. The only options which need information in the input file are U (User Tree) and W (Weights). They are as described in the main documentation file. If the W (Weights) option is used there must be a W in the first line of the input file.

Next come the species data. Each sequence starts on a new line, has a ten-character species name that must be blank-filled to be of that length, followed immediately by the species data in the one-letter code. The sequences must either be in the "interleaved" or "sequential" formats. The I option selects between them. The sequences can have internal blanks in the sequence but there must be no extra blanks at the end of the terminated line. Note that a blank is not a valid symbol for a deletion.

Gaps are indicated by "-", unknown amino acids by "X" or "?". For some programs such as Dnapars, "?" implies that it also might be a gap. For others, it doesn't matter whether "X" or "?" are used. Alignment programs sometimes create sequence files that have periods (dots) for gaps. This is not allowed in PHYLIP, because some programs also use dots in another sense, having them indicate that the sequence is the same as the first amino acid sequence. Because of this ambuguity, we have banned dots, and the programs will produce an error message if they encounter them. If they truly are gaps you should replace them by "-", if they are unknown amino acids you should replace them by "?" or "X", and if they mean "the same as the first sequence" you should replace each dot by the amino acid that is implied by that.

The protein sequences are given by the one-letter code used by the late Margaret Dayhoff's group in the Atlas of Protein Sequences, and consistent with the IUB standard abbreviations. Our programs can recognize and produce these one-letter codes:

Symbol	Stands for

A	ala
B	asx
C	cys
D	asp
E	glu
F	phe
G	gly
H	his
I	ileu
J	(not used)
K	lys
L	leu
M	met
N	asn
O	(not used)
P	pro
Q	gln
R	arg
S	ser
T	thr
U	(not used)
V	val
W	trp
X	unknown amino acid
Y	tyr
Z	glx
*	nonsense (stop)
?	unknown amino acid or deletion
-	deletion

where "nonsense", and "unknown" mean respectively a nonsense (chain termination) codon and an amino acid whose identity has not been determined. The state "asx" means "either asn or asp", and the state "glx" means "either gln or glu" and the state "deletion" means that alignment studies indicate a deletion has happened in the ancestry of this position, so that it is no longer present. Note that if two polypeptide chains are being used that are of different length owing to one terminating before the other, they can be coded as (say)

The one-letter codes above are standard ones, except for the "*" (stop) and "?" (unknown or gap) codes, which we have invented. These may not be interpreted correctly by other software. The unused codes "J", "O", and "U" are used in some other versions of the one-letter code to mean "leucine or isoleucine", "pyrrollysine" and "selenocysteine", but we have not allowed this.

Amino acid	One-letter code

ala	A
arg	R
asn	N
asp	D
asx	B
cys	C
gln	Q
glu	E
gly	G
glx	Z
his	H
ileu	I
leu	L
lys	K
met	M
phe	F
pro	P
ser	S
thr	T
trp	W
tyr	Y
val	V
deletion	-
nonsense (stop)	*
unknown amino acid	X
unknown (incl. deletion)	?

THE OPTIONS

The programs allow options chosen from their menus. Many of these are as described in the main documentation file, particularly the options J, O, U, T, W, and Y. (Although T has a different meaning in the programs Dnaml and Dnadist than in the others).

The U option indicates that user-defined trees are provided at the end of the input file. This happens in the usual way, except that for Protpars, Dnapars, Dnacomp, and Dnamlk, the trees must be strictly bifurcating, containing only two-way splits, e. g.: ((A,B),(C,(D,E)));. For Dnaml and Restml it must have a trifurcation at its base, e. g.: ((A,B),C,(D,E));. The root of the tree may in those cases be placed arbitrarily, since the trees needed are actually unrooted, though they look different when printed out. The program Retree should enable you to reroot the trees without having to hand-edit or retype them. For Dnamove the U option is not available (although there is an equivalent feature which uses rooted user trees).

A feature of the nucleotide sequence programs other than Dnamove is that they save time and computer memory space by recognizing sites at which the pattern of bases is the same, and doing their computation only once. Thus if we have only four species but a large number of sites, there are (ignoring ambiguous bases) only about 256 different patterns of nucleotides (4 x 4 x 4 x 4) that can occur. The programs automatically count how many occurrences there are of each and then only needs to do as much computation as would be needed with 256 sites, even though the number of sites is actually much larger. If there are ambiguities (such as Y or R nucleotides), these are also handled correctly, and do not cause trouble. The programs store the full sequences but reserve other space for bookkeeping only for the distinct patterns. This saves space. Thus the programs will run very effectively with few species and many sites. On larger numbers of species, if rates of evolution are small, many of the sites will be invariant (such as having all A's) and thus will mostly have one of four patterns. The programs will in this way automatically avoid doing duplicate computations for such sites.

Ambiguities, missing data, and gaps

One issue that needs careful explanation is what happens when there are ambiguities in a sequence. For example, what if the sequence has a state such as R (purine), or N (unknown nucleotide)? And what if it has a gap ("-") or a state such as "?" that could be either a gap or an unknown nucleotide? It can be shown that the parsimony, distance, and maximum likelihood methods handle these cases correctly. Two species that have state N do not tend to be placed close together on the tree -- instead the underlying base or amino acid sequence is (in effect) assumed to be whatever will fit best in the part of the tree where the species are currently located. There is no tendency for two species that have N in the same site to move toward each other in any of these methods. Moreover this does not mean that the site is dropped in all species. The program will use all other information at that site. Thus if species 8 and 10 out of 20 species have "N" as the state in a particular site, the states in all other species at that site are used. So you need not delete such sites -- to do so is an overreaction. They are used in the calculations where possible.

For many of the PHYLIP programs if there is a gap ("-") the site is considered as if the nucleotide were there but unknown. The presence or absence of gaps is not used as information -- it is instead considered equivalent to failure to record the nucleotide or amino acid. Partial ambiguities, such as the base R, are taken into account in a sensible way, and part of their information used.

However two programs are different in this respect. Dnapars and Protpars treat the gap state as if it were (respectively) a fifth kind of nucleotide, or an amino acid that is three steps away from all other amino acids. In those programs gap states will have some tendency to be considered similar.

Molecular Sequence Programs

INTERLEAVED AND SEQUENTIAL FORMATS

INPUT FOR THE DNA SEQUENCE PROGRAMS

INPUT FOR THE PROTEIN SEQUENCE PROGRAMS

THE OPTIONS

Ambiguities, missing data, and gaps