GWAS and essential genes

Overview of the project

You will work with the gene product (protein) from your PheGenI project. By now, you will have collected fourteen homologous (orthologous) sequences, including the human sequence from your PheGenI project (see Report 2: Bioinformatics project, Overview). We will use protein sequences only. Why? The 13 additional species should be suitable for testing hypotheses about the “origins” of your gene. This means we select species that represent “before” and “after” the period in which you think the gene may have appeared. Clearly this is a guess on our part, but it can be informed by what is known of the gene's evolutionary history. For example, myoglobin, now one of several members of the globin family, traces back to the beginnings of vertebrates over 500 million years ago. The data set needs to provide diversity of form and divergence time without including species that are drastically different, e.g., from different phyla or kingdoms.

Project tasks

Step 1 Select a phenotype of interest (POI) from PheGenI database

Step 2 Identify an associated SNP and a gene of interest (GOI); note the context (i.e., location) of the SNP

Step 3 Identify protein product of your GOI; collect accession numbers for your GOI protein product plus 13 ortholog sequences

Step 4 Load all accession numbers into UGENE, import and save sequences from online database to your UGENE project.

Step 5 Conduct a preliminary sequence alignment

Sequence alignment — The basics explained

Step 6 Generate an NJ (neighbor-joining) gene tree and a Bayesian gene tree

Phylogenetics

Building the gene tree — The basics explained

Step 7 Conduct evolutionary tests of sequences

Evolutionary rate tests

Tree Reconciliation

Step 8 Estimate molecular clock for your gene

How to get the distances from a distance tree

Build your Molecular Clock

Background and significance

The goal of the discipline of human genetics is to identify genetic variants (e.g., SNPs) that lead to changes in phenotype. We either begin with phenotype differences and drill down (top → down) to see if genetic variation is associated with the differences, or we start with genetic variation and work our way up (bottom → up) to see if the genetic differences are associated with phenotype differences. Note the word “associated.” That is a general way to say that two variables go together; “correlation” is a statistical term with a more precise meaning (two variables are linearly associated with each other). Once an association is found between genetic differences and phenotype differences, additional analysis is needed to verify that the effect is real (a true positive), and therefore points to a candidate gene worth studying further, as opposed to a chance association (a false positive, probably not worth further investigation).

What kinds of genetic variants cause disease phenotypes? For simple Mendelian traits, a mutant allele (or two copies of the mutant allele if the condition is recessive) at a single gene has a large effect (i.e., high penetrance), resulting in phenotype variation in the population. This is not the case for multifactorial, complex disorders, where many genes are involved and each allele contributes only a fraction towards the phenotype (i.e., shows incomplete penetrance). Thus, what kinds of alleles at these numerous genes affect disorders like Type 2 diabetes? Are the alleles found in many individuals in many populations, the common disease – common variant model (Reich & Lander 2001), or are the alleles actually rare in populations, the rare functional variant model (Pritchard & Cox 2002)? After nearly 20 years, the answer is that neither model describes the relationship well.

Genome Wide Association Studies (GWAS) may be used to identify candidate genes, but because of the nature of the assay, positive results are consistent with, but not sufficient evidence for, the conclusion that the genetic differences cause the phenotype differences (Visscher et al 2017 and references therein). Problems with hybridization-based assays like those used in GWAS include target sequences that occur at multiple sites in the genome and cross-hybridization with non-target sequences. Additionally, because thousands of sequences are compared simultaneously, the method requires attention to what is known as the multiple comparison problem. In other words, GWAS yields correlations between a disease or condition and a set of genes. These limitations are procedural or statistical; linkage disequilibrium (LD), by contrast, is key to why GWAS works, but it is also the chief biological limitation of the GWAS approach for identifying cause and effect. According to Visscher et al (2017), the ability to detect associations between DNA variants (alleles) and a trait “… depends on the

  • experimental sample size
  • distribution of effect sizes of (unknown) causal genetic variants that are segregating in the population
  • frequency of those variants
  • LD between observed genotyped DNA variants and the unknown causal variants.”
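The multiple comparison problem mentioned above can be made concrete with a Bonferroni correction, one common (and conservative) way to adjust a significance threshold when many tests are run. What follows is a minimal sketch, not the procedure used by any particular GWAS: the function names, the SNP labels, and the p-values are all hypothetical, and the choice of one million tests is an illustrative assumption.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that holds the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def significant_snps(p_values, alpha=0.05):
    """Return SNP labels whose p-value is below the Bonferroni cutoff.

    p_values: dict mapping a SNP label to its association p-value.
    The number of tests is taken to be the number of SNPs supplied.
    """
    cutoff = bonferroni_threshold(alpha, len(p_values))
    return sorted(snp for snp, p in p_values.items() if p < cutoff)


# Assumed scale of a GWAS: one million SNPs tested at alpha = 0.05.
print(bonferroni_threshold(0.05, 1_000_000))

# Made-up p-values for three hypothetical SNPs (cutoff is 0.05 / 3).
example = {"snp_A": 1e-9, "snp_B": 0.01, "snp_C": 0.04}
print(significant_snps(example))
```

With a million tests, a p-value that would look convincing in a single experiment (say, 0.001) falls far short of the corrected cutoff, which is why many nominally significant SNPs are treated as likely false positives.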

Linkage disequilibrium across large regions of chromosomes means that many SNPs in genes or intergenic regions may show a statistically significant association with the trait, not because they themselves cause the difference, but because they are linked to one or more genes that do cause the difference (Sanchez et al 2019). Thus, researchers refer to these regions as QTL, or quantitative trait loci, in recognition that large stretches of the genome “go together,” and ascribing cause and effect within these large regions is challenging.
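To see how LD between a genotyped marker and a causal variant can be quantified, here is a minimal sketch that computes the standard coefficient D and the squared correlation r² from phased haplotypes. The function name and the 0/1 allele coding are assumptions made for illustration, and the haplotype samples are made up.

```python
def ld_stats(haplotypes):
    """Compute (D, r_squared) for two biallelic loci.

    haplotypes: list of (marker_allele, causal_allele) pairs, each coded
    0 or 1, one pair per sampled chromosome (phase assumed known).
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    p_a = sum(a for a, _ in haplotypes) / n           # freq of allele 1 at marker
    p_b = sum(b for _, b in haplotypes) / n           # freq of allele 1 at causal site
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                              # departure from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic in the sample")
    return d, d * d / denom


# Made-up sample in complete LD: the marker allele always travels with
# the causal allele, so the marker "tags" the causal variant perfectly.
linked = [(1, 1)] * 5 + [(0, 0)] * 5
print(ld_stats(linked))

# Made-up sample with no LD: all four haplotypes equally frequent.
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]
print(ld_stats(unlinked))
```

When r² is near 1, an association signal at the marker cannot be distinguished from one at the causal variant, which is both why tagging works and why pinpointing the causal site is difficult.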

Despite the shortcomings of GWAS, the technique can be used to identify candidate loci (genes) that should be followed up (Visscher et al 2017). There exists a large, public database of studies using the technique, which allows us to call up data and run different analyses to test hypotheses about cause and effect in complex diseases. (The alternative to GWAS, and now the more popular approach, is RNASeq, which directly sequences all RNAs present in tissues from individuals with and without the condition; it also has limitations.) We are all interested in the genetic basis of diseases, and for diseases like Type 2 diabetes, which have multifactorial causes, genes of large effect are of particular interest but challenging to find. Recent advances in sequencing have been coupled with basic surveys of who does and does not have a condition, with the hope that this will identify candidate loci deserving further analysis.

Our project begins with a search of PheGenI, a GWAS database that links conditions to single nucleotide polymorphisms (SNPs). PheGenI reports on hundreds of GWAS studies, which will permit you to use the frequency and location of a mutation to judge whether the candidate gene deserves further scrutiny and testing. Thus, instead of a classic experiment, we are taking a bioinformatics approach. Let’s continue to set up the necessary background to understand how this approach can be used to test whether the SNP identified is more or less likely to point to genes worth further consideration.

Instead of looking for common or rare alleles, an alternate approach is to look for genetic variants in essential genes to see if alleles at these genes influence complex phenotypes like Type 2 diabetes. Essential genes are those that are required for survival or reproduction of an organism; in other words, essential genes are expected to be under purifying selection. Because the genes are essential, we would not expect many alleles at these genes: carriers of functional variants would have reduced survival or reproduction, and the variants would tend to be removed from the population. Nonessential genes, on the other hand, are expected to have more variation and to be less subject to natural selection, i.e., change is expected to be due more to chance (genetic drift). This is an overly simplistic view, as it ignores linkage, i.e., the possibility that differences associated with a SNP are not due to the SNP itself, but arise because the SNP is close to a DNA element that is causally related. However, it does provide us a way to take advantage of the comparative method. While studies have looked at the potential relationship between essential genes and diseases by comparing orthologous sequences in mice (e.g., Dickerson et al 2011), we are extending this approach to include other species. We can identify essential genes by comparing variation among different species and plotting genetic variation against divergence time, the time since species last shared a common ancestor. Essential genes are conserved over evolutionary time, so the slope of the relationship between genetic variation and divergence time will be smaller for essential genes than for nonessential genes.
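The slope comparison described above can be sketched as an ordinary least-squares fit of genetic distance on divergence time. This is only an illustration under stated assumptions: the function name is hypothetical, the divergence times and distances are made-up numbers rather than real measurements, and a real analysis would use the distances from your own gene tree.

```python
def least_squares_slope(x, y):
    """Slope of the ordinary least-squares line of y regressed on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Made-up divergence times (millions of years) between human and four
# other species, and made-up protein distances for two hypothetical genes.
divergence_mya = [10, 100, 300, 450]
conserved_gene = [0.01, 0.05, 0.15, 0.22]   # candidate essential gene
variable_gene = [0.05, 0.40, 1.10, 1.70]    # candidate nonessential gene

slope_conserved = least_squares_slope(divergence_mya, conserved_gene)
slope_variable = least_squares_slope(divergence_mya, variable_gene)

# A smaller slope means fewer changes accumulated per unit time.
print("conserved gene slope:", slope_conserved)
print("variable gene slope:", slope_variable)
print("conserved evolves more slowly:", slope_conserved < slope_variable)
```

The slope has units of distance per million years, so it can also be read as a rough rate of change for the gene; comparing slopes between genes only makes sense when the same species set and the same distance measure are used for both.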

References

Alhuzimi, E., Leal, L. G., Sternberg, M. J., & David, A. (2018). Properties of human genes guided by their enrichment in rare and common variants. Human mutation, 39(3), 365-370.

Blekhman, R., Man, O., Herrmann, L., Boyko, A. R., Indap, A., Kosiol, C., … & Przeworski, M. (2008). Natural selection on genes that underlie human disease susceptibility. Current biology, 18(12), 883-889.

Dickerson, J. E., Zhu, A., Robertson, D. L., & Hentges, K. E. (2011). Defining the role of essential genes in human disease. PloS one, 6(11).

Goldstein, D. B., Allen, A., Keebler, J., Margulies, E. H., Petrou, S., Petrovski, S., & Sunyaev, S. (2013). Sequencing studies in human genetics: design and interpretation. Nature Reviews Genetics, 14(7), 460.

Kachroo, A. H., Laurent, J. M., Yellman, C. M., Meyer, A. G., Wilke, C. O., & Marcotte, E. M. (2015). Systematic humanization of yeast genes reveals conserved functions and genetic modularity. Science, 348(6237), 921-925.

Pritchard, J. K., & Cox, N. J. (2002). The allelic architecture of human disease genes: common disease–common variant… or not? Human molecular genetics, 11(20), 2417-2423.

Reich, D. E., & Lander, E. S. (2001). On the allelic spectrum of human disease. TRENDS in Genetics, 17(9), 502-510.

Sanchez, M. P., Ramayo-Caldas, Y., Wolf, V., et al. (2019). Sequence-based GWAS, network and pathway analyses reveal genes co-associated with milk cheese-making properties and milk composition in Montbéliarde cows. Genetics Selection Evolution, 51, 34. https://doi.org/10.1186/s12711-019-0473-7

Schork, N. J., Murray, S. S., Frazer, K. A., & Topol, E. J. (2009). Common vs. rare allele hypotheses for complex diseases. Current opinion in genetics & development, 19(3), 212-219.

Visscher, P. M., Wray, N. R., Zhang, Q., Sklar, P., McCarthy, M. I., Brown, M. A., & Yang, J. (2017). 10 years of GWAS discovery: biology, function, and translation. American journal of human genetics, 101(1), 5-22. doi: 10.1016/j.ajhg.2017.06.005. PMID: 28686856; PMCID: PMC5501872.

Wang, T., Birsoy, K., Hughes, N. W., Krupczak, K. M., Post, Y., Wei, J. J., … & Sabatini, D. M. (2015). Identification and characterization of essential genes in the human genome. Science, 350(6264), 1096-1101.

Zhang, L., & Li, W. H. (2004). Mammalian housekeeping genes evolve more slowly than tissue-specific genes. Molecular biology and evolution, 21(2), 236-239.

Zhang, Z., & Ren, Q. (2015). Why are essential genes essential?-The essentiality of Saccharomyces genes. Microbial Cell, 2(8), 280.