Molecular clock

Molecular clock workflow: data > plot > analysis > interpret > submit

The purpose of this exercise is for you to construct a molecular clock model for the gene you have been working with (i.e., the one you selected in the PheGenI exercise)

  1. Collect the data for the molecular clock: divergence times and pairwise genetic distance among the 14 taxa.
  2. Produce scatter plot of genetic distance by divergence time.
  3. Estimate the rate of evolutionary change for the gene by estimating the slope of a linear regression (through the origin) from the data set used to construct the molecular clock.
  4. Relate the rates of change to the gene ontologies obtained in the earlier labs for these genes.
  5. Share your slope estimates with the class to compile rate estimates for the different genes.

Quick to do

  1. Setup a spreadsheet file
    • I recommend use of desktop Microsoft Excel or LibreOffice Calc; Although Google Sheets and Numbers can do most of the work needed, getting the regression equation is trickier.
  2. Run a simple program on your ML newick tree (or NJ tree, if that’s the only one you made), to extract pairwise distances from the branch lengths (How to get the distances from a distance tree). Copy/paste results to your spreadsheet file.
  3. Obtain pairwise divergence times (Finding divergence times, this page); merge or copy/paste results to your molecular clock spreadsheet file.
  4. Make a scatterplot (Distances by divergence in MYA)
  5. Report the regression equation (regression through the origin, i.e., Y-intercept equal to zero)
  6. You turn in your spreadsheet with the data, plot, and equation — see protocol instructions below in To do.

Recourses, this page

In this handout you will find

  • brief background about why you are producing a molecular clock analysis (see for more in depth discussion)
  • how to obtain the data for the molecular clock
  • protocol to help make the spreadsheet file
  • protocol for calculating the regression equation (through the origin) by either Microsoft Excel, LibreOffice Calc, Google Sheets, Apple Numbers, or via R Statistical programming language (which, by the way, is the easiest option!).

Background

OK. Now you should have a gene tree for the product of your GOI and based on the best multiple sequence alignment you can get. What now?

Although you can construct phylogenetic trees for a set of taxa (Links to an external site.) without times assigned to the evolutionary events to accompany the divergence from a common ancestor to new taxa, adding this information permits us to ask fundamental questions.  We want to know rates of evolution, the amount of genetic change in a species or a gene compared among species over a unit of time. Evolutionary change results from evolutionary processes at play in shaping the species: genetic drift? Natural selection?  Because existing species today are the descendants of processes acting in the distant past (millions of years ago for speciation) and some of those processes (e.g., genetic drift, selection, mutation) are ongoing processes that affect genetic differences, assigning time to the branching events in a tree is an important, but challenging task.

How do we get divergence times?  Divergence times are for the events of speciation or cladogenesis (Links to an external site.). In the absence of fossil evidence, divergence times can be estimated for species by assuming a molecular clock (Kumar 2005, Yi 2013).  When fossils are available, these can be used to calibrate the clock, although at best, fossils provide a minimum time since divergence and are therefore a means of relative and not absolute divergence calibration. The concept that differences in numbers of substitutions for a sequence between two species may be a function of the time since the divergence of those species is not a new idea (Zuckerlandl and Pauling 1965), not is it an uncontroversial idea (Takahasha 2007), but it has been applied in many disciplines of biology, not just in evolutionary biology.  For example, use of molecular clock has been applied to determining the origins of HIV pandemic (Korber et al 2000 (Links to an external site.)).

I could write a lot more here, but instead will point you to these additional resources listed below (see References)

Finding divergence times

In order to calibrate the molecular clock we need to use evidence for the speciation event independent from the genetic sequences themselves.  Fossils provide that evidence.  Fossils are found in layers in the ground, and the layers themselves are assigned dates by methods independent of biology.  This idea predates the idea of molecular clocks and, divergence times are now typically a combination of fossil and molecular evidence.

We will use the best estimates available to calibrate our molecular clocks.  This information has been collated as part of a very large review on rates of evolution and I provide two graphics for you from this compilation (the two images are from Chapter 4 in an edited book by Hedges and Kumar 2009).  In addition, the divergence estimates are now available in a searchable database.

BI308L students — I have provided you with the complete set of divergence times for all pairwise combinations of your 14 species in a spreadsheet file available at Pairwise divergence times: TimeTree results. Use this file, you do not have to gather all of these times yourself. You must, however, make sure that rows match by species names  between your distances and the divergence times.

What follows is a the description of how the divergence times were acquired and what they mean (how to interpret them).

Vertebrates (FIGURE 3A, p. 44)

vertebrates_divergenceTimes.png

Invertebrates (FIGURE 3B, p. 45)

Invertebrates_divergenceTimes.png

Figure 3. Vertebrate (A) and invertebrate (B) phylogenies with divergence times in MYA.

How to use the image to get divergence times?

Simple 🙂

I’ve provided you with a portion of the file you need to construct your molecular clock. Go to Pairwise divergence times: TimeTree results, which has a link to updated pairwise divergence times, in millions of years ago (MYA or Ma), for our 14 species.

I do expect to see in your notebook evidence that you investigated how to get divergence times between pairs of species, so do read on. You can expect one or more exam questions!

But here’s how you’d do the work yourself. Let’s say I want to get the divergence time between crocodiles and humans — that is, I want the estimate of the last common ancestor between the two species.  First, identify the two species at the tips (leaves).  Second, trace the branches back to the intersection between the two leaves of the phylogenetic tree.  Here’s an animation of the process.

Play media comment.divergence_screen.mp4

New resource for obtaining divergence times!!  To find divergence times between two taxa (e.g., humans and mice) I would typically work by looking for a recent review via a Google Scholar search, painstakingly searching for original papers on fossil evidence, etc.  However, there is a fantastic resource at http://www.timetree.org, that allows you to collect divergence times.  Using their “TimeTree Search” form you can find best estimate for divergence time, collated for all such estimates.  For example,  I searched for  divergence between Homo sapiens and Mus musculus, and the results are shown below (I would use the median divergence time = 88 Ma).

Note: I created this page back in 2018, the median divergence time between mice and men was estimated at 94.5 Ma; estimates of divergence times may change based on new evidence.

TimeTree_search01.png

Figure 4. Screenshot of results for divergence times between mice and men, TimeTree.org

Important!  If you have already created your divergence time list, you do not have to redo your work!  Do record in your notebook the source(s) of your divergence times.

Note: I’ve provided you with an Microsoft Excel file with all 91 divergence times at Pairwise divergence times: TimeTree results

To do:

Protocol to complete your molecular clock file. You should now have two spreadsheet files. One file contains the extracted distances from your best gene tree (How to get the distances from a distance tree). The second, created in this exercise, contains the divergence times. Now, merge the two into a single data file, your molecular clock data set.

See below for an example of a completed spreadsheet

  1. Open a new Excel spreadsheet file and name four columns
    Column A = First Species
    Column B = Second Species
    Column C = Time since divergence (Ma)
    Column D = Total branch length
    Column E = log10-transformed Ma
  2. For each pairwise comparison obtain divergence estimates from your UGENE Maximum Likelihood tree diagram (e.g., our best tree); extract branch lengths for pairwise comparisons = Total branch length
  3. From your research on the Internet*, enter the time in millions of years ago (MYA) fossil the divergence time

Table 1. The spreadsheet looks like (made up example)

A B C D E
1 Compare1 Compare2 MYA Distance MYA, log10
-transformed
2 Sp1 Sp2 10 3 1.0
3 Sp1 Sp3 30 3.5 1.477
4 Sp1 Sp4 52 12 1.716
5 Sp2 Sp3 12 1.1 1.0792
6 Sp2 Sp4 210 10 2.322
7 Sp3 Sp4 190 12 2.279

Where Sp1 refers to the first taxa in your list (e.g., human), and sp2, sp3, sp4 are the next three taxa from your data set.  Preferred distances are found from your Maximum Likelihood tree  (see How to get the distances from a distance tree).

Tip: Be advised that creating an X-Y scatterplot within a spreadsheet works best if the “X” is in a column that precedes “Y”. It’s best to rearrange your data set so that MYA or logMYA (X-values) is in column C and Distance (Y-values) is in column D. This will make chart-making easier with a spreadsheet app.

Here’s a video that provides an overview of how to get the data for the Molecular Clock spreadsheet

Play media comment.molecular_clock_protocol.mp4

Make at least one plot.  The chart of the raw data set (i.e., not log-transformed) looks like

Example molecular clock plot with regression through the origin

Figure 5. Example “molecular Clock” plot with regression through the origin line (red).

Notes.  You can make the graph in Microsoft Excel or other spreadsheet program (I recommend the free LibreOffice Calc.  However, For Figure 5, I used a scatterplot graphing tool from a website to make the graph and saved it as a screenshot. The trendline through the origin was calculated by creating a second group with two points (x1=0, y1=0 and x2=210, y2=12.873; multiply 210 by the slope). Many of you choose to use Google Sheets. That’s fine, but note that with the exception of making a plot, Google Sheet option is a harder choice (e.g., you’ll have to use the LINEST() function to get the regression without the Y-intercept). Like my example, you get the trendline by adding a second group.

Obtain slope of the regression, Desktop Microsoft Excel or LibreOffice Calc

Next, you will need to calculate the linear regression slope, through the origin. If you recall from your algebra class or your (bio)statistics class, the linear regression equation is Y = a + bX, where a is the Y-intercept (the value of Y when X = 0), and b is the slope or rate of change in Y given change in X. Thus, the equation for the regression through the origin is Y = bX.  If you have the desktop version of Microsoft Excel (or Calc, LibreOffice ), you can do this simply by applying a linear trend line. For other spreadsheets you cannot use the trendline function.

Assuming you have desktop Excel or Calc, the following instructions work. Note however that we want regression through the origin (a = 0), so the equation is Y = bX . After all, if divergence time was zero years, what genetic distance should there be between two taxa?

  1. For the untransformed “raw” data set
  2. For the log-transformed data set (semi-log — transform only the “Years” or X-variable);
  • use the log-10 transform. In Excel, in an empty cell (e.g., E2 in my spreadsheet example), type “=log10(D2)” (without the quotes).  This would return
    the log10 value for value in cell D2 (in this case,  would

3. Insert Chart, scatter plot option.

Desktop Microsoft Excel, Trendline options menu

My results: for raw data set, slope (+ standard error) = 0.06130 (0.01501) and for transformed data set, slope (+ SE) = 4.5040 (0.7314)

What about LibreOffice Calc?

LibreOffice also provides a simple trend line approach, and you can force the linear regression through the origin, just like Microsoft Excel.  I’ve recommended before you consider the free, open source office package called LibreOffice, available for Linux distros, macOS, and Windows 10/11 computers. This is one example in which LibreOffice makes your work simpler than more familiar options. For example, Google Sheets and Numbers don’t have a simple way to get trend lines through the origin.

What if your are using Google Sheets?

You’re correct, can’t get a line through the origin using only the trend line option.ad, use LINEST(). Click on an empty cell (e.g., D2), and type in “=linest(B2:B7,A2:A7,FALSE,FALSE)“, everything except the quotes.

Table 2. Same example data set from Table 1

A B
1 X Y
2 10 3
3 30 3.5
4 52 12
5 12 1.1
6 210 10
7 190 12

The output will be two numbers, the estimate of the slope (D2), and “0” (E2). Zero, because the Y-intercept was set to go through the origin by adding “FALSE” at the third element of the LINEST() function call.

Table 3. LINEST() result, Google Sheets

D E
1
2 0.006279580901 0

D2, that’s the slope estimate

E2, that’s the Y-intercept estimate, which we set to zero

While we’re at it, If you set that last option to TRUE, forces Google to return an array of results (Table 4).

Table 4. Full LINEST() results, Google Sheets

D E
1
2 0.006279580901 0
3 0.01500575823 #N/A
4 0.7694620448 4.350324335
5 16.68840266 5
6 315.8333909 94.62660908

Unfortunately, Google doesn’t label these results. So here goes:

D2, E2: that’s the slope estimate, Y-intercept estimate, which we set to zero

D3, E3: Standard error (SE) slope, intercept

D4, E4: Coefficient of determination, SE Y estimate

D5, E5: F-statistic, degrees of freedom

D6, E6: regression sum of squares, residual sum of squares

Now, we need to revise our plot. Create a new variable titled, for example, trend, then multiply each element of the X variable (MYA in our case), by our new slope calculation. For example, to add the new value to the C column in our worksheet, type in the cell C2

=a2*d$2

which will return the predicted value we want. Carry this out for the remaining rows (Table 5).

Table 5. Calculate data for trend line through the origin

A B C D E
1 X Y trend
2 10 3 0.6130068532 0.06130068532 0
3 30 3.5 1.83902056
4 52 12 3.187635637
5 12 1.1 0.7356082239
6 210 10 12.87314392
7 190 12 11.64713021

then add “trends” to the plot you already created. You can then use the Trendline function again on your regression through the origin data. I would remove the trend line from the regression that included a Y-intercept. The plot is shown in the graph below (Fig. 6)

Screenshot of Google Sheets scatterplot with trend line plot through the origin (red line) and the default trend line (blue line)

Figure 6. Scattershot of Google sheets scatter plot with trend line fitted through the origin (red) and the default trend line (blue).

And finally, Numbers?

I don’t talk much about Apple’s Numbers spreadsheet app. It’s a perfectly good app, but it’s look and feel is distinctive. It also, like Google Sheets, lacks a simple way to get the trend line for a regression through the origin problem. However, like Google sheets, you can obtain one by following the same outline as in Google Sheets, including use of the =linest() function call. The only difference, enter a value of zero (0) instead of “FALSE” to force the regression through the origin.

Run the statistics yourself in R

We have also been working with R. Use your installation of R, or your other options, or scroll to bottom of this page where I have embedded the rdrr.io (Links to) script engine for your use.

Regression in R is simple with the lm() function.

x <-c(10,30,52,12,210,190)
y <- c(3,3.5,12,1.1,10,12)
#With intercept
resulta <- lm(y~x)
summary(resulta)
#without intercept
result <- lm(y~0+x)
summary(result)

Output from R, in red

And what about the plot? In R, the command for the plot is simply “plot.” But we also want to set the axis range (xlim, ylim), add axis labels (xlab, ylab), increase the size of the data points (cex=1.5), color (col="blue"), and select the data point type (pch=19 selects filled in circle). Last, add our trend line (use lm() function; don’t forget: through the origin for molecular clocks) (Fig. 7).

x <-c(10,30,52,12,210,190)
y <- c(3,3.5,12,1.1,10,12)
plot(x,y, ylim=c(0,15), xlim=c(0,240), xlab="MYA", ylab="distance", cex=1.5, col="blue", pch=19)
#draw the trend line
abline(lm(y~0+x), col="red")

Which returns the following plot (Fig. 7)

mclock_example_plot_R.png

Fig. 7. Scatterplot from R with trendline through the origin.

How to run R code

You can install R onto your computer. Go to

Or, you can run it from within your browser by going to rdrr.io

Or, you can run the code by updating code in this embedded window

And for completion, what if you want to grab the data from your spreadsheet and get it into R? If you are running in browser, then you have to paste the data into the window

myData <- read.table(header=TRUE, sep="\t", text="
Color    Var1    Var2
red    0.3    1.2
red    0.33    1.4
red    0.23    0.9
blue    0.6    2.2
blue    0.8    2.1
blue    0.7    2
")

Replace my data with your own. If you copy/paste from your spreadsheet, columns are separated by tabs (hence the “\t” in the R code).

If you have installed R onto your computer, then the straightforward way is to copy your X and Y data (MYA, Distance) to a new text file, then use

myData <- read.table(file, header=TRUE, sep="\t")

and replace “file” with the name of your data file (including path if the file is not in yoru working directory).

Another option is to install the package clipr, and use the following commands

#This script describes a way to copy data from your spreadsheet to R via your computer's clipboard. It requires that you install the clipr package. This works for two columns, for example, to make a scatterplot or to calculate a correlation, or run a simple regression. I'll use it to demonstrate how to make a plot of your "molecular clock" plus trend line through the origin. Assume that the first column contains your X (millions of years), and second column contains your Y (genetic distances)
#Works on local R installation, but not with rdrr.io
library(clipr)
#go to spreadsheet, copy your two columns of data, including the header row, then execute the following command.
myData <- read_clip_tbl()
#check to see that all 91 rows plus the first header row of data loaded
myData
#This next command tells R to store the data so you can just refer to the variable names; R will know that the current dataset is "myData"
attach(myData)

#Now, as before, make the plot and force the regression through the origin, add a red line
plot(MYA,Distance)
abline(lm(Distance~0+MYA),col="red")

You can, of course, use other statistical tools.

Microsoft Excel and Google Sheets both have regression tools, or any of a number of online regression software.  I used Regression tools (LR) at http://www.xuru.org/.

In Microsoft Excel, you have a couple of options, including using the regression function if you have the Data Analysis add-in package installed, but the easiest is to simply call the SLOPE function.  Here’s how to use the function.  Find an empty cell, type exactly as follows (without the quotes): “=slope(C2:C7, D2:D7)”, where D2:D7 is an array corresponding to my “X” values — the X values were in Column D rows 2 through 7, and where C2:C7 is an array corresponding to my “Y” values — the Y values were in Column C rows 2 – 7 from my spreadsheet (click here to see the sheet again).

Lastly, most spreadsheets allow you to apply a linear trend line from a plot. See above for notes about Google Sheets.

Reading and Resource links

http://evolution.berkeley.edu/evosite/evo101/IIE1cMolecularclocks.shtml (Links to an external site.)

http://en.wikipedia.org/wiki/Molecular_clock (Links to an external site.)

Ho, S. (2008) The Molecular Clock and Estimating Species Divergence. Nature Education 1(1) (Links to an external site.).

Benton, M. J., Donoghue, P. C. J., & Asher, R. J. (2009). Calibrating and constraining molecular clocks. The timetree of life35, 86.

Kumar. S. (2005) Molecular clocks: four decades of evolution. Nat, Rev, Genet, 6:654-662.

Margoliash, E. (1963) Primary structure and evolution of cytochrome c. Proc. Natl. Acad. Sci. USA 50:672-679

Takahasha 2007

Yi, S. (2013) Neutrality and Molecular Clocks. Nature Education Knowledge 4(2):3

Zuckerkandl, E., & Pauling, L. (1965). Evolutionary divergence and convergence in proteins. In Evolving genes and proteins (pp. 97-166). Academic Press.

 

/MD