Evolutionary rate tests

Purpose of this lab

  1. Calculate rate tests for sets of taxa
  2. Conclude: Discuss finding in terms of rates of evolution for your gene

Can’t wait to get started? Scroll or click The assignment: apply Tajima rate test to your gene tree

Background

relativeRateTest03182019.mp4

These videos do not have audio.

Evolution is change in the heritable characteristics of biological populations over successive generations. Speciation, from one interbreeding population to two separate populations, leads to lineages isolated from each other. Isolation means that the two populations do not exchange genes, and, therefore, any changes that may occur over generations for the sequence of a gene in one species, whether by genetic drift or natural selection, are independent of changes that may have occurred in the sister species.

The forces* that create genetic variation (mutation, recombination) and filter genetic variation (natural selection, genetic drift) in populations may occur at the same rates in the two distinct lineages, but by no means is this assumption likely to be true for all genes or DNA elements. For example, if one population undergoes a bottleneck, a severe reduction in population size, then this population would see significant reduction in genetic variation because of random genetic drift on the kinds of alleles present in the population. In the previous, related exercise, Tree reconciliation, you compared the topology of your gene tree is compared to a species consensus phylogeny. This approach highlights any inconsistencies between genes, and the the evolutionary histories of species.

In this exercise you are going to apply a simple statistical test, Tajima’s rate test, on the numbers of amino acid differences between pairs of taxa compared to an outgroup. The null hypothesis is we expect the amount of differences since two taxa last shared a common ancestor to be the same (i.e., equal branch lengths). This is derived from the Molecular Clock hypothesis, which is that changes in sequences occur more or less at a constant rate.

*Note: The use of the term “force” is an analogy, not to be confused with the forces of Newtonian mechanics.

A rate test

So far we have descriptions, e.g., tree difference estimates Tree Reconciliation, of whether or not your gene shows lineage differences for rates of evolution. We need to test whether or not the differences are large enough to be statistically different from chance. Remember, we expect differences, but differences proportional to time. Evidence of differential rates are suggestive of natural selection, not chance (genetic drift). We have one way to test the assumption of equal rates, Tajima’s rate test (Fig. 1). Compared to an outgroup, test for accumulation of amino acid changes between two species; if the rates are equal, then the differences between the outgroup and species should be equal. This is Tajima’s relative rate test (1993).

It’s straight-forward to represent how the test works. We have two species, Species 1 and Species 2. Set the rates for evolution for your gene as a for Species 1, and b for Species 2. Species 1 and 2 last shared a common ancestor some time ago, which means that prior to speciation, we can assume that their was an ancestral protein sequence, perhaps represented by an outgroup; this rate is represented by c . After speciation, what happens to the gene is lineage specific leading to Species 1, and conversely, independent of changes leading to Species 2.

Tajima's rate test

Figure 1. Tajima’s rate test statistic is distributed as a chi square.

Thus, we have two rates to compare: m1 = a + c accounts for all of the changes since the outgroup and Species 1 last shared a common ancestor, plus m2 = b + c, which accounts for all of the changes since the outgroup and Species 2 last shared a common ancestor.

The rate test statistic behaves like a chi-square test, the equation shown in Fig. 2 should look familiar.

goodness of fit chi-square equation

Figure 2. Equation of the chi-square, where O is Observed and E is expected.

The differences in the equations are just how Observed and Expected are calculated. Otherwise, the concept is the same. The null hypothesis for the rate test is that m1=m2; for the familiar chi-square test, the null hypothesis is that O = E.

First, you need to create the data set. You will be working with three taxa at a time; your two ingroup taxa vs one outgroup (e.g., Fig. 3).

trees

Figure 3. Examples of how to set up the comparisons for collecting amino acid differences. The ingroup for (A) Human vs Chimpanzee, for (B) Human vs Macaque, and for (C) Human vs. Mouse. Same outgroup, “Alligator”, for all comparisons.

How did I make the plot in Figure 3? In R, with ape library installed. Here’s the code. You can run this code at rdrr.io “Run R in your browser”

Copy/paste the following code into rdrr.io (the code works, but a bunch of warnings pop up — these can be ignored)

library(ape)
par(mfrow=c(1,3))
tre1 <- read.tree(text="((Human,Chimpanzee),Alligator);")
addEdgeText1 <- c("c","a","b")
plot(tre1,type="cladogram", edge.width=2, direction="upwards", cex=1.7, srt=-40, adj=0.05, x.lim=5)
title("A", cex.main= 4, adj = 0.3, line=-2)
edgelabels(addEdgeText1, cex=2)
tre2 <- read.tree(text="((Human,Macaque),Alligator);")
addEdgeText2 <- c("c","a","b")
plot(tre2,type="cladogram", edge.width=2, direction="upwards", cex=1.7, srt=-40, adj=0.05, x.lim=5)
title("B", cex.main= 4, adj=0.3, line=-2)
edgelabels(addEdgeText2, cex=2)
tre3 <- read.tree(text="((Human,Mouse),Alligator);")
addEdgeText3 <- c("c","a","b")
plot(tre3,type="cladogram", edge.width=2, direction="upwards", cex=1.7, srt=-40, adj=0.05, x.lim=5)
title("C", cex.main= 4, adj=0.3, line=-2)
edgelabels(addEdgeText3, cex=2)
#closes the graph window
dev.off()

The assignment: apply Tajima rate test to your gene tree

Since we need the numbers of amino acid differences among the taxa, we can collect this information from UGENE. Open the multiple sequence alignment for your protein in UGENE. Click on the statistics tab (the little bar chart, Fig. 4). Choose a Reference sequence (= your outgroup), e.g., click on the line for Alligator, then click on the > under reference sequence to select Alligator sequence for outgroup.

After setting the reference sequence, select also the following

  • Show distances column
  • Set to Hamming dissimilarity
  • Count
  • Exclude gaps
  • Automatic updating

UGENE Select statistics

Figure 4. Select statistics by clicking on the “bar graph” tab (red arrow), then select Reference group (blue arrow), then s.

After updating the selections, go back to the alignment window and visually scan the distances in the “score” column (left hand side of Fig.4). Collect amino acid differences between Species 1 (e.g., Chimpanzee) against an outgroup (e.g., Alligator). Repeat for each species, one at a time (Pig vs Alligator, Lizard vs Alligator, etc.) and collect the results in a table (

For our comparisons shown in Figure 4, we have the following results in Table 1.

Table 1. Summary of counts of pairwise differences for the protein

Fig. 4 Compare Count
A Human - Alligator m1 1769
Chimpanzee - Alligator m2 = 1769
B Human - Alligator m1 = 1769
Macaque - Alligator m2 = 1769
C Human - Alligator m1 = 1769
Mouse - Alligator m2 = 1769

Question. If you change the outgroup, will you get the same results for the ingroup comparison?

Now, complete the calculation of the rate test by solving for the test statistic = (m1-m2)2/(m1+m2). Note that for this example, it’s going to be a boring test! For this gene, DOCK2, none of the comparison with human and pig vs outgroup turned out to be statistically significant.

Question. Since the rate tests all failed to reject the null hypothesis of equal rates, what do we conclude about the DOCK2 gene family?

A. The molecular clock does not apply

B. The molecular clock applies

Here’s another example, this time with amino acid differences

Table 2. Summary of counts of pairwise differences for the protein DSCAM

A B C D
1 Outgroup difference
2 A Human Alligator 136
3 B1 Chimpanzee Alligator 119
4 B2 Macaque Alligator 120
5 B3 Mouse Alligator 134

Calculate and get p-value of test statistic

Below, you’ll find comprehensive R scripts for carrying out these tests. Here, I’ll share spreadsheet code; works with Microsoft Excel, LibreOffice Calc, and Google Sheets.

Assume your spreadsheet is set up like Table 2. To get the chi-square test statistic for A vs B1, and A vs B2, and A vs B3, we enter into our spreadsheet two formulas, the first calculates the chi-square value (the test statistic), and the second calculates the p-value (degrees of freedom are one for this type of problem). I extended our worksheet in Table 3 to reflect how you would enter the formulas. Note use of $ makes the reference to the cell absolute and not relative (for more, see Microsoft help).

Table 3. Summary of counts of pairwise differences for the protein DSCAM

A B C D chi.sqr P-value
1 Outgroup difference
2 A Human Alligator 136
3 B1 Chimpanzee Alligator 119 =((D$2-D3)^2)/(D$2+D3) =CHIDIST(E3,1)
4 B2 Macaque Alligator 120 =((D$2-D4)^2)/(D$2+D4) =CHIDIST(E4,1)
5 B3 Mouse Alligator 134 =((D$2-D5)^2)/(D$2+D5) =CHIDIST(E5,1)

Confirm that your numbers are the same as mine (Table 4).

Table 4. Calculated chi-square and associated p-values for Table 3.

chi.sqr P-value
1.13 0.287
1.00 0.317
0.01 0.903

Optional

There are other ways to get this data. One indirect way would be to use the branch lengths (=distance, see How to get the distances from a distance tree), and apply the equation = 500 X (branch length)/100 = number of amino acid substitutions (the two methods won’t necessarily agree, so which ever method you choose, be consistent and report what method you used!

Here’s another example

Our HBB example from Sequence alignment — The basics explained .

The sequences, in FASTA format

>Alligator XP_014460905.1
MVQWSAEEKQLITSIWGRINVEEVGGDALARLLIVYPWTQRFFSNFGNLSSPTAIINNPKVRAHGKKVLTSFGDAVKNLDNVKGTFAKLSELHCDKLHVDPENFRLLGDILNIVLAANLGKEFTPSCQATWQKLVGVVAHALARKYH
>Chicken NP_990820.1
MVHWTAEEKQLITGLWGKVNVAECGAEALARLLIVYPWTQRFFASFGNLSSPTAILGNPMVRAHGKKVLTSFGDAVKNLDNIKNTFSQLSELHCDKLHVDPENFRLLGDILIIVLAAHFSKDFTPECQAAWQKLVRVVAHALARKYH
>Human NP_000509.1
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
>Mouse AAB59638.1
MVHFTAEEKAAITSIWDKVDLEKVGGETLGRLLIVYPWTQRFFDKFGNLSSALAIMGNPRIRAHGKKVLTSLGLGVKNMDNLKETFAHLSELHCDKLHVDPENFKLLGNMLVIVLSTHFAKEFTPEVQAAWQKLVIGVANALSHKYH
>Xenopus NP_001091375.1
MVHWTAEEKAAITSVWQEVNQEQDGHDALTRLLVVYPWTQRYFSSFGNLGNATAIAGNVKVRAHGKKVLSAVGDAIAHLDNVKGTLHDLSVVHAFKLYVDPENFKRLGEVLVIVLASKLGSAFTPQVQGAWEKFVAVLVDALSQGYN

Load the HBB fasta sequences into UGENE (see How to import FASTA files into UGENE for help), export the sequences as alignment, then perform calculations as above (Fig. 4, Table 1), and obtain the chi-square statistics and p-value Tajima Rate test for

  1. Human vs mouse, with Alligator as outgroup
  2. Human vs mouse, with Chicken as outgroup
  3. Human vs mouse, with Xenopus as outgroup
  4. Alligator vs Chicken, with Xenopus as outgroup

Answers

  1. chi-square = 0.061, p-value = 0.805
  2. chi-square = 0.263, p-value = 0.608
  3. chi-square = 0.007, p-value = 0.932
  4. chi-square = 1.889, p-value = 0.169

Test assumption of equal rates

The example I used did not generate large differences, so applying a statistical test was overkill.  Nevertheless, you’ll want to be know how to able to apply the statistical test. Your rate test is distributed as a chi-square function. That means we can consider all possible outcomes of the test against the probability of that estimate given only chance. That’s what the chi-square distribution is, a collection of all possible outcomes of

Using R, we can get the probability value for our rate test statistic. The functions are part of the base installation for R, so you don’t need to install libraries. After calculating your Tajima rate test statistic (Fig. 1), assign the result to x and proceed to test the statistical significance of the result. If x is large, then the rate of evolution is relatively large; if x is small, then we conclude little evolution has occurred.

To calculate these rates tests one at a time, the commands you submit to R are

#Calculate your rate test statistics and the probability of your chi-square. Replace m1 and m2 with your values
m1 = 130
m2 = 159
x =(m1-m2)^2/(m1+m2)
pchisq(c(x),df=1,lower.tail=FALSE)

After submitting the command, R returns with what we want, the p-value of the test.

[1] 0.08802999

or 8%. Again, you can run this code below or at the https://rdrr.io/snippets/ website.

To calculate the rates for a set, use the following R code

#Calculate your rate test statistics and the probability of your chi-square. 
#Replace a and b with your values
#This example, a = human v. opossum, and b = opossum v cat, chimp, cow, dog, macaque, mouse, pig, rat, rabbit
m1 = 130
m2 = c(127, 129, 168, 126, 126, 159, 134, 203, 139)
x =(m1-m2)^2/(m1+m2)
results<-pchisq(c(x),df=1,lower.tail=FALSE)
#prints out the p-values in order
cat(results, sep = "\n")

which returns a column of p-values for each comparison

Thus, p-values for rate tests for human vs cow and human vs. rat are less than 5% and we could conclude the rates are not equal. Because the p-values were greater for the other comparisons, we would conclude rates equal.

Why 5% for decision criteria to reject/accept the null hypothesis? For those of you who remember your basic statistics, we set statistical significance at 5%. If the result of your rate test yields a p-value less than 5%, then we say our results are likely different from the null hypothesis. If the p-value is greater than 5%, we say our results are not likely different than the null hypotheses.

So, completing our example, we have the following probability calculations given our estimates of the rate tests for each comparison.

Table 5. A portion of the rate test results.

Fig. 4 Compare (m1-m2)2/(m1+m2) p-value P < 0.05?
A Human – Alligator
Chimpanzee – Alligator 0 1 No
B Human – Alligator
Macaque – Alligator 0 1 No
C Human – Alligator No
Mouse – Alligator 0 1 No

Note: We typically judge statistical significance if the p-value is less than or equal to 5%.

Troubleshooting

If you get results like my Table 6, change your outgroup Reference sequence to the Marsupial, e.g., opossum. You then make rate tests for the Eutherian mammals (humans, chimps, rats, mice, etc.) against the Marsupial outgroup, instead of against the alligator.

Table 6. A portion of the rate test results using Outgroup=opossum.

Fig. 4 Compare (m1-m2)2/(m1+m2) X2 p-value P < 0.05?
A Human - Opossum  
Chimpanzee - Opossum (130-129)2/(130+129) 0.00386  No
B Human - Opossum  
Macaque - Opossum (130-126)2/(130+126)  0.0625 No
C Human - Opossum
Mouse - Opossum (130-159)2/(130+159) 2.910  No

No

Proteins? Why not look directly at CDS?

The best approach follows work by comparing aligned CDS sequences for synonymous vs nonsynonymous changes.  Instead of NP_ sequences, download the appropriate NM_ sequences for your GOI for the different species in your study. After alignment, translate the sequences and count the missense mutations. The idea is that if, as we suspect, the substitutions in a DNA sequence follow a Poisson distribution, then the variance of substitutions and the mean of substitutions will be equal to one (Gillespie 1989), called the Index of Dispersion, R (variance to mean ratio).

What to turn in

Create  a table of results like Table 2 above, but for your own protein. Copy/paste the table at Submit results: Rate tests

References

Pybus, O. G. (2006). Model selection and the molecular clock. PLoS Biology4 (5).

About R

You have two options for using R.

  1. Install R onto your computer.
  2. As an alternative, you can take advantage of rdrr.io, which allows you to run small amounts of R code online.
  3. Or you can run your code snippets from this page, see Run your R code!

Let’s discuss the second option first. You can run small amounts of code at https://rdrr.io/snippets/ (or from the homepage, click on “Run R code online”), or even from this page (scroll to the bottom, or click here, and you’ll find an embedded R snippet). Simply copy/paste or enter your commands in the box, replacing the example code there. An example of one of our code snippets is shown in Fig 3. Another option — You may want to explore using Jupyter Notebooks, which allow you to mix R code snippets with other text or graphs (in other words, a nice way to do bioinformatics! A way to combine your code methods with results for your lab notebook).

Screenshot rdrr.io

Figure 6. Screenshot from rdrr.io with R code snippet.

The first option is the best option: go ahead and get for yourself an up-to-date copy of R installed on your computer. Once you install R on your own computer, you’ll also need to download and install two additional packages, ape and phytools.

Regardless of the option you choose, proceed by copy/paste the code I provide below, one line at a time at the R prompt. For , or better, create a script file (from the R GUI, File → New document). Run each command one a time: simply hit the Enter key to submit the command; if using a script file, then Ctrl+R for windows PC, Command + Enter for macOS.

R works line by line; lines that begin with # are ignored by R and are provided as comments or notes about the code. Text in blue represents the code you enter; text in red represents the output R provides in response to the commands.

You can run your R code directly from this page, or go to rdrr.io


Run your R code!

Copy and paste R code here. Edit as needed.

/MD