version 3.7a

Contrast -- Computes contrasts for comparative method

© Copyright 1991-2010 by the University of Washington. Written by Joseph Felsenstein. Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed.

This program implements the contrasts calculation described in my 1985 paper on the comparative method (Felsenstein, 1985d). It reads in a data set of the standard quantitative characters sort, and also a tree from the treefile. It then forms the contrasts between species that, according to that tree, are statistically independent. This is done for each character. The contrasts are all standardized by branch lengths (actually, square roots of branch lengths).
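As an illustration only (this is not Contrast's own source code), the calculation can be sketched in a few lines of Python. The sketch assumes a strictly bifurcating tree written as nested pairs of (subtree, branch length); the function name `contrasts` and this tree representation are hypothetical choices made for the example. At each fork it takes the difference of the two descendant values divided by the square root of the sum of their branch lengths, then passes a weighted average up the tree and lengthens the branch below the fork by v1*v2/(v1+v2).

```python
import math

def contrasts(tree, values):
    """Return the standardized contrasts for one character.

    tree   -- a tip name (str), or a pair ((left, left_len), (right, right_len))
    values -- dict mapping each tip name to its character value
    """
    out = []

    def visit(node):
        # A tip contributes its observed value and no extra branch length.
        if isinstance(node, str):
            return values[node], 0.0
        (left, vl), (right, vr) = node
        xl, extra_l = visit(left)
        xr, extra_r = visit(right)
        vl += extra_l          # branch lengths lengthened by the forks above them
        vr += extra_r
        # Contrast standardized by the square root of the summed branch lengths.
        out.append((xl - xr) / math.sqrt(vl + vr))
        # Weighted average for this fork, and the extra length it passes down.
        x = (xl / vl + xr / vr) / (1.0 / vl + 1.0 / vr)
        return x, vl * vr / (vl + vr)

    visit(tree)
    return out

# The tree and first character of the test set at the end of this document.
hp = (("Homo", 0.21), ("Pongo", 0.21))
hpm = ((hp, 0.28), ("Macaca", 0.49))
hpma = ((hpm, 0.13), ("Ateles", 0.62))
root = ((hpma, 0.38), ("Galago", 1.00))
char1 = {"Homo": 4.09434, "Pongo": 3.61092, "Macaca": 2.37024,
         "Ateles": 2.02815, "Galago": -1.46968}
print([round(c, 5) for c in contrasts(root, char1)])
```

The sign of each contrast depends on which descendant is listed first, which is arbitrary. Multifurcating trees are not handled by this sketch.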

The method is explained in the 1985 paper. It assumes a Brownian motion model. This model was introduced by Edwards and Cavalli-Sforza (1964; Cavalli-Sforza and Edwards, 1967) as an approximation to the evolution of gene frequencies. I have discussed (Felsenstein, 1973b, 1981c, 1985d, 1988b) the difficulties inherent in using it as a model for the evolution of quantitative characters. Chief among these is that the characters do not necessarily evolve independently or at equal rates. This program allows one to evaluate this, if there is independent information on the phylogeny. You can compute the variance of the contrasts for each character, as a measure of the variance accumulating per unit branch length. You can also test covariances of characters.

Java Interface

[No Java Interface has yet been created for Contrast]

INPUT, OUTPUT, AND OPTIONS

The input file is as described in the continuous characters documentation file above, for the case of continuous quantitative characters (not gene frequencies). Options are selected using a menu:

Continuous character comparative analysis, version 3.7a

Settings for this run:
  W  Within-population variation in data?     No, species values are means
  R  Print out correlations and regressions?  Yes
  C  Print out contrasts?                     No
  T  LRT test of no correlation?              No
  X  Get Reduced Major Axes?                  No
  D  Analyze multiple data sets?              No
  M  Analyze multiple trees?                  No
  0  Terminal type (IBM PC, ANSI, none)?      ANSI
  1  Print out the data at start of run       No
  2  Print indications of progress of run     Yes

  Y to accept these or type the letter for one to change

Option W makes the program expect not means of the phenotypes in each species, but phenotypes of individual specimens. The details of the input file format in that case are given below. In that case the program estimates the covariances of the phenotypic change, as well as covariances of within-species phenotypic variation. The model used is similar to (but not identical to) that of Lynch (1990). The algorithms used differ from the ones he gives in that paper. They are described in a recent paper (Felsenstein, 2008). When there are within-species samples, the program still uses contrasts internally, but it does not make sense to write them out to an output file for direct analysis. They are of two kinds, contrasts within species and contrasts between species. The former are affected only by the within-species phenotypic covariation, but the latter are affected by both within- and between-species covariation. Contrast infers these two kinds of covariances and writes the estimates out.

The M option allows you to analyze either multiple data sets, or multiple trees, or both. It is very flexible. Here is a detailed discussion (if you are not going to use these options you might want to skip it). As you toggle the choices of the M option you encounter these settings:

One data set, one tree: This is the default setting and analyzes one data set with one input tree.
One data set, multiple trees: This analyzes the one data set a number of times, once with each of the trees in the input tree file. Thus you could analyze the same data set with each of 100 trees that were produced by bootstrapping (or by sampling from a Bayesian posterior). You will be asked for the number of trees when the program runs.
Multiple data sets, same tree: This takes an input file containing a number of data sets and analyzes each of them with the same tree. You will be asked for the number of data sets when the program runs.
Data sets x trees: For a number d of data sets and a number t of trees, the program will do d x t analyses, analyzing each of the d data sets with each of the t trees. The analyses will be all t trees with the first data set, then all of the same t trees with the second data set, and so on. You will be asked for the number of data sets and the number of trees.
Trees x data sets: For a number d of data sets and a number t of trees, the program will do d x t analyses, analyzing each of the d data sets with each of the t trees. The analyses will be all d data sets with the first tree, then all of the same d data sets with the second tree, and so on. You will be asked for the number of data sets and the number of trees.
Multiple trees per data set: This requires a number of trees in the input tree file which is a multiple (say n) of the number of data sets. The first data set is analyzed with the first n trees, the second data set is analyzed with the second n trees, and so on. Note that this can be used to analyze each data set with a different single tree, by having exactly as many trees as there are data sets. You will be asked for the number of data sets and the number of trees, and the program will complain if the number of trees is not a whole multiple of the number of data sets.
Multiple data sets per tree: This requires a number of data sets that is a multiple (say n) of the number of trees. The first n data sets are analyzed with the first tree, the next n data sets are analyzed with the second tree, and so on. This too can be used to analyze each data set with a different single tree, and in fact in that case there is no difference in the set of analyses that will be done with this option and with the previous one. You will be asked for the number of data sets and the number of trees, and the program will complain if the number of data sets is not a whole multiple of the number of trees.
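The order in which the analyses are carried out under each setting can be made concrete with a short Python sketch. It is an illustration only: the function name `analysis_order` and the mode strings are hypothetical labels for the settings listed above, not anything Contrast itself accepts. Data sets and trees are numbered from 1 in the order they appear in their files.

```python
def analysis_order(mode, d, t):
    """Return the (data set, tree) pairs in run order for d data sets and t trees."""
    if mode == "data sets x trees":
        # All t trees with the first data set, then with the second, and so on.
        return [(i, j) for i in range(1, d + 1) for j in range(1, t + 1)]
    if mode == "trees x data sets":
        # All d data sets with the first tree, then with the second, and so on.
        return [(i, j) for j in range(1, t + 1) for i in range(1, d + 1)]
    if mode == "multiple trees per data set":
        if t % d != 0:
            raise ValueError("number of trees is not a multiple of data sets")
        n = t // d
        return [(j // n + 1, j + 1) for j in range(t)]
    if mode == "multiple data sets per tree":
        if d % t != 0:
            raise ValueError("number of data sets is not a multiple of trees")
        n = d // t
        return [(i + 1, i // n + 1) for i in range(d)]
    raise ValueError("unknown mode: " + mode)

print(analysis_order("data sets x trees", 2, 3))
print(analysis_order("multiple trees per data set", 2, 4))
```

In this sketch "One data set, multiple trees" is the "data sets x trees" case with d = 1, and "Multiple data sets, same tree" is the same case with t = 1.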

These options give you enormous flexibility. Not all of them are going to be of interest. Note that in all these cases the different data sets must all be in the same input data file, end-to-end. The multiple input trees are also in a single input tree file.

One important limitation is that, when there are multiple data sets being read, all of them must have the same number of species and the same number of characters.

The R option allows you to turn off or on the printing out of the statistics. If it is off only the contrasts will be printed out (unless option 1 is selected). With only the contrasts printed out, they are in a simple array that is in a form that many statistics packages should be able to read. The contrasts are rows, and each row has one contrast for each character. Any multivariate statistics package should be able to analyze these (but keep in mind that the contrasts have, by virtue of the way they are generated, expectation zero, so all regressions must pass through the origin). If the W option has been set to analyze within-species as well as between-species variation, the R option does not appear in the menu as the regression and correlation statistics should always be computed in that case.

As usual, the tree file has the default name intree. It should contain the desired tree or trees. These can be either in bifurcating form, or may have the bottommost fork be a trifurcation (it should not matter which of these ways you present the tree). The tree can also contain multifurcations.

The tree must, of course, have branch lengths. These cannot be negative. Trees from some distance methods, particularly Neighbor-Joining, are sometimes inferred to have negative branch lengths, so be sure to choose options in those programs that prevent negative branch lengths.
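Before running Contrast it can be convenient to scan a tree file for negative branch lengths. The following Python sketch is one way to do so; it is not part of PHYLIP, the name `negative_branch_lengths` is hypothetical, and it assumes that every number following a colon in the Newick string is a branch length (so quoted species names containing colons would confuse it).

```python
import re

# A colon followed by an optionally signed decimal number, with optional exponent.
BRANCH_LENGTH = re.compile(r":\s*([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)")

def negative_branch_lengths(newick):
    """Return the list of negative branch lengths found in a Newick string."""
    lengths = [float(m) for m in BRANCH_LENGTH.findall(newick)]
    return [v for v in lengths if v < 0.0]

tree = "((((Homo:0.21,Pongo:0.21):0.28,Macaca:0.49):0.13,Ateles:0.62):0.38,Galago:1.00);"
print(negative_branch_lengths(tree))                          # an empty list
print(negative_branch_lengths("((A:0.1,B:-0.02):0.3,C:0.5);"))
```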

If you have a molecular data set (for example) and also, on the same species, quantitative measurements, here is how you can allow for the uncertainty of your estimate of the tree. Use Seqboot to generate multiple data sets from your molecular data. Then, whichever method you use to analyze it (the relevant ones are those that produce estimates of the branch lengths: Dnaml, Dnamlk, Fitch, Kitsch, and Neighbor -- the latter three require you to use Dnadist to turn the bootstrap data sets into multiple distance matrices), you should use the Multiple Data Sets option of that program. This will result in a tree file with many trees on it. Then use this tree file with the input file containing your continuous quantitative characters, choosing the “One data set, multiple trees” choice in the M menu option. You will get one set of contrasts and statistics for each tree in the tree file. At the moment there is no overall summary: you will have to tabulate these yourself. A similar process can be followed if you have restriction sites data (using Restml) or gene frequencies data.

The statistics that are printed out include the covariances between all pairs of characters, the regressions of each character on each other (column j is regressed on row i), and the correlations between all pairs of characters. In assessing degrees of freedom it is important to realize that each contrast was taken to have expectation zero, which is known because each contrast could as easily have been computed x_i-x_j instead of x_j-x_i. Thus there is no loss of a degree of freedom for estimation of a mean. The degrees of freedom are thus the same as the number of contrasts, namely one less than the number of species (tips). If you feed these contrasts into a multivariate statistics program make sure that it knows that each variable has expectation exactly zero.
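For readers who want to compute these statistics from the contrasts themselves, here is a minimal Python sketch. It is an illustration rather than the program's code, and the name `contrast_statistics` is hypothetical. Because each contrast has expectation zero, no mean is subtracted and the sums of products are divided by the number of contrasts. The regression in row i, column j is the covariance of characters i and j divided by the variance of character i.

```python
import math

def contrast_statistics(rows):
    """rows: one list per contrast, holding one value per character."""
    n = len(rows)        # number of contrasts, which is also the degrees of freedom
    p = len(rows[0])     # number of characters
    # Sums of products about zero (not about the sample mean), divided by n.
    cov = [[sum(r[i] * r[j] for r in rows) / n for j in range(p)]
           for i in range(p)]
    # Regression of column j on row i, forced through the origin.
    reg = [[cov[i][j] / cov[i][i] for j in range(p)] for i in range(p)]
    cor = [[cov[i][j] / math.sqrt(cov[i][i] * cov[j][j]) for j in range(p)]
           for i in range(p)]
    return cov, reg, cor

# The four contrasts printed in the test set output at the end of this document.
rows = [[0.74593, 2.17989],
        [1.58474, 0.71761],
        [1.19293, 0.86790],
        [3.35832, 0.89706]]
cov, reg, cor = contrast_statistics(rows)
for name, matrix in (("Covariance", cov), ("Regression", reg), ("Correlation", cor)):
    print(name, [[round(v, 4) for v in row] for row in matrix])
```

Applied to those contrasts, the sketch agrees with the covariance, regression, and correlation matrices in the test set output to the precision printed there.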

The X menu item enables the Reduced Major Axis (RMA) analysis. Currently it is only available in the between-species case (i.e. not when within-species analysis is enabled). It takes the covariance matrix and carries out a spectral decomposition of it (obtaining its eigenvectors and eigenvalues). The eigenvectors show the coefficients of the reduced major axes of the covariance matrix (each a linear combination of the original characters). These axes vary independently along the tree. The eigenvalues estimate the variance of each major axis. RMA analysis is usually more appropriate than regressions when one has characters that each evolve. It is the correct statistical analysis for allometry of characters, solving the problem that regressions of characters have because one is regressing onto a variable that itself has error, and the problem that one does not know which variable to regress on which. It has been enabled as the default when the regressions and correlations are being reported.
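The spectral decomposition described above can be illustrated for the simplest case of two characters, where the eigenvalues and eigenvectors of the covariance matrix have a closed form. This Python sketch is an illustration only: the name `eigen_2x2` is hypothetical, it handles only a 2 x 2 symmetric matrix, and it does not claim to reproduce the numerical method or the output layout that Contrast uses.

```python
import math

def eigen_2x2(a, b, d):
    """Eigenvalues and unit eigenvectors of the symmetric matrix [[a, b], [b, d]].

    Returns ((larger, smaller), (vector_for_larger, vector_for_smaller)).
    """
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    values = (mean + radius, mean - radius)
    # Angle of the axis with the larger variance, measured from character 1.
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    major = (math.cos(theta), math.sin(theta))
    minor = (-math.sin(theta), math.cos(theta))   # perpendicular to the major axis
    return values, (major, minor)

# Covariance matrix from the test set output at the end of this document.
values, vectors = eigen_2x2(3.9423, 1.7028, 1.7062)
for lam, vec in zip(values, vectors):
    print("variance", lam, "coefficients", vec)
```

Each eigenvector gives the coefficients of one axis as a linear combination of the two characters, and the matching eigenvalue estimates the variance along that axis. For more than two characters a general symmetric eigenvalue routine would be needed.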

Within-species variation

With the W option selected, Contrast analyzes data sets with variation within species, using a model like that proposed by Michael Lynch (1990). The method is described in vague terms in my book (Felsenstein, 2004, pp. 441), and more completely in a recent paper (Felsenstein, 2008). If you select the W option for within-species variation, the data set should have this structure (on the left are the data, on the right my comments):

   10    5                          number of species, number of characters
Alpha        2                      name of 1st species, # of individuals
 2.01 5.3 1.5 -3.41 0.3             data for individual #1
 1.98 4.3 2.1 -2.98 0.45            data for individual #2
Gammarus     3                      name of 2nd species, # of individuals
 6.57 3.1 2.0 -1.89 0.6             data for individual #1
 7.62 3.4 1.9 -2.01 0.7             data for individual #2
 6.02 3.0 1.9 -2.03 0.6             data for individual #3
...                                 (and so on)
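To make the layout concrete, here is a small Python sketch that reads a file of this structure into a dictionary of species. It is not the reader used by Contrast. The name `read_within_species` is hypothetical, and the sketch assumes that all items are separated by white space and that species names contain no blanks, which is more restrictive than the usual PHYLIP name field. The header in the example is changed to 2 species because only two species of the sample above are shown.

```python
def read_within_species(text):
    """Parse within-species data; return (number of characters, {name: rows})."""
    tokens = text.split()
    n_species, n_chars = int(tokens[0]), int(tokens[1])
    pos = 2
    data = {}
    for _ in range(n_species):
        # Each species starts with its name and its number of individuals.
        name, n_individuals = tokens[pos], int(tokens[pos + 1])
        pos += 2
        rows = []
        for _ in range(n_individuals):
            row = [float(x) for x in tokens[pos:pos + n_chars]]
            if len(row) != n_chars:
                raise ValueError("input ended early in species " + name)
            rows.append(row)
            pos += n_chars
        data[name] = rows
    return n_chars, data

sample = """
    2    5
Alpha        2
 2.01 5.3 1.5 -3.41 0.3
 1.98 4.3 2.1 -2.98 0.45
Gammarus     3
 6.57 3.1 2.0 -1.89 0.6
 7.62 3.4 1.9 -2.01 0.7
 6.02 3.0 1.9 -2.03 0.6
"""
n_chars, data = read_within_species(sample)
for name, rows in data.items():
    print(name, len(rows), "individuals")
```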

The covariances, correlations, and regressions for the "additive" (between-species evolutionary variation) and "environmental" (within-species phenotypic variation) are printed out (the maximum likelihood estimates of each). The program also estimates the within-species phenotypic variation in the case where the between-species evolutionary covariances are forced to be zero. The log-likelihoods of these two cases are compared and a likelihood ratio test (LRT) is carried out. The program prints the result of this test as a chi-square variate, and gives the number of degrees of freedom of the LRT. You have to look up the chi-square value on a table of the chi-square distribution. The A option is available (if the W option is invoked) to allow you to turn this test off if you want to.
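As an alternative to a printed table, the tail probability for a chi-square value can be computed directly. The Python sketch below is not part of Contrast, and the names `lrt_statistic` and `chi_square_tail` are hypothetical. It uses the standard likelihood ratio statistic, twice the difference of the two log-likelihoods, and the standard finite series for the upper tail of a chi-square distribution with a whole number of degrees of freedom. The degrees of freedom should be taken from the program's output; the sketch does not work them out.

```python
import math

def lrt_statistic(loglik_full, loglik_restricted):
    """Twice the difference in log-likelihood between the two models."""
    return 2.0 * (loglik_full - loglik_restricted)

def chi_square_tail(x, df):
    """P(X > x) for a chi-square variate with integer df >= 1."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^k / k!.
        term = total = 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: a normal tail term plus a finite series in odd powers of sqrt(x).
    term = math.sqrt(x)
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    return math.erfc(math.sqrt(half)) + 2.0 * density * total

# Familiar 5% critical values give tail probabilities close to 0.05.
for df, critical in ((1, 3.841459), (2, 5.991465), (3, 7.814728)):
    print(df, chi_square_tail(critical, df))
```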

The program prints out the log-likelihood of the data under the models with and without between-species variation. It shows the degrees of freedom and chi-square value for a likelihood ratio test of the absence of between-species variation. For the moment the program cannot handle the case where within-species variation is to be taken into account but where only species means are available. (It can handle cases where some species have only one member in their sample).

We hope to fix this soon. We are also on our way to incorporating full-sib, half-sib, or clonal groups within species, so as to do one analysis for within-species genetic and between-species phylogenetic variation.

The data set used as an example below is the example from a paper by Michael Lynch (1990), his characters having been log-transformed. In the case where there is only one specimen per species, Lynch's model is identical to our model of within-species variation (for multiple individuals per species it is not a subcase of his model).

TEST SET INPUT

    5   2
Homo        4.09434   4.74493
Pongo       3.61092   3.33220
Macaca      2.37024   3.36730
Ateles      2.02815   2.89037
Galago     -1.46968   2.30259

TEST SET INPUT TREEFILE

((((Homo:0.21,Pongo:0.21):0.28,Macaca:0.49):0.13,Ateles:0.62):0.38,Galago:1.00);

TEST SET OUTPUT (with all numerical options and option C on)

Continuous character contrasts analysis, version 3.69

   5 Populations,    2 Characters

Name                       Phenotypes
----                       ----------

Homo         4.09434   4.74493
Pongo        3.61092   3.33220
Macaca       2.37024   3.36730
Ateles       2.02815   2.89037
Galago      -1.46968   2.30259


Contrasts (columns are different characters)
--------- -------- --- --------- -----------

     0.74593     2.17989
     1.58474     0.71761
     1.19293     0.86790
     3.35832     0.89706

Covariance matrix
---------- ------

    3.9423    1.7028
    1.7028    1.7062

Regressions (columns on rows)
----------- -------- -- -----

    1.0000    0.4319
    0.9980    1.0000

Correlations
------------

    1.0000    0.6566
    0.6566    1.0000