Find all SNP for your GOI

Update: January 2023. NCBI has changed how we access the Function Class filters, which changes how I instruct you to collect the information we need. Here are new instructions for completing this exercise. The short version? Use the Advanced search features to construct queries.

This is the next step in the Report (see the work flow at Project essential bioinformatics project work flow). At the end of the work for this assignment, you’ll create two slides for your Project Report (SNP Brief); as evidence of work done, you submit a table and a bar chart. See Product at end of page for information about what to turn in.

Objective

  1. Identify the connection between function class and DNA element.
  2. Relate levels of genetic diversity to the DNA elements of a protein coding gene.

Background

There is no such thing as “wild type allele.” Variation is normal. We all differ. Given this fact, a naïve conclusion is that our genetic differences necessarily explain why we look different, sound different, behave differently, and, ultimately, have different risk for cancer and other health concerns. The purpose of this exercise is to develop an understanding of where in the gene variants occur, and how many occur within each of the different DNA elements present in a gene.

Recall our definition of a gene: any DNA sequence that is transcribed and yields a functional product. The central premise of the course is that we apply definitions (algorithms) to any given DNA sequence to identify what type of DNA element we have. In gene finding, one such algorithm is the concept of an ORF, an open reading frame. At its most basic, an ORF is defined as the DNA sequence between start and stop codons. However, this definition is not sufficient: we know that most human genes have intron sequences. Given our gene definition, we must also include the untranslated regions (UTR). Therefore, at a minimum, we find genes based on whether or not the DNA sequence contains the following four DNA elements: exons, introns, 5-prime UTR, and 3-prime UTR.

Note: When applying the Function class filter in the Advanced query, “exons” is not an available option. Instead, use “coding sequence variant.” Exons and CDS are not strictly the same. Exons are the regions of a gene that remain after RNA splicing, while CDS refers specifically to the protein-coding part of the exon sequence.

By now you should appreciate that genes consist of functional elements (regulatory sequences, exons) and nonfunctional elements (introns). A basic premise derived from evolutionary biology is that traits of organisms subject to natural selection should be less variable than traits not constrained by natural selection. Nonfunctional sequences may accumulate differences among organisms at a rate proportional to the mutation rate (a principle established by the Neutral Theory of Evolution). We will test this premise by counting SNP, by DNA element, for each of our genes of interest (selected in the first steps of this project).

Single nucleotide polymorphism, or SNP, is a common type of genetic variation. Along with the other major classes of sequence variation (insertions and deletions [indels], short tandem repeats [STR]), humans differ at millions of single nucleotides. More formally, an SNP is a single nucleotide variant, or SNV, for which the rare allele occurs in the population at frequencies greater than 1%. In contrast, an SNV may have been found in only one person. SNV may differ by base substitution (C for T), or as an insertion (ins) or deletion (del), or even as both a deletion of one base and insertion of another single base (delins). Note: the abbreviations in parentheses (ins, del, delins) are the SNP class designations used by NCBI.

As your knowledge of genetics grows, you are aware that in many cases a single nucleotide change can make the difference between health and disease (e.g., Deng et al 2017). But very few of these differences are likely to be clinically important. In order to put into context the importance of the SNP found in your GWAS study, we need to investigate how common SNP are in your gene. Our objective? We’ve been emphasizing the concept of how to recognize function in DNA elements; an important concept for you to come away with from this part of the project is that we expect to find lots of SNP in some DNA elements, but not in others. Why do we expect differences? Because the various DNA elements of the gene are not equally important to the functioning of the gene or its product. Thus, even though a GWAS study may find a statistically significant difference between groups, say, with and without a particular kind of cancer, that does not necessarily mean a functionally significant difference between cancer and no cancer has been identified (false positive).

Chaminade University students — What to turn in

Follow the instructions on this page to complete the three objectives. Submit your total number of SNV, table of SNV by DNA elements, and bar chart to Submit: SNP by functional class.

What to do

With your GOI name in hand (Submit PheGenI results: Gene of interest), or better yet, the gene ID number for your gene (which you should have in your Digital Notebook!), go to the SNP database at NCBI and enter the name of your gene. Alternatively, from the NCBI Gene database entry for your GOI (e.g., HIF1A), select SNP from the links along the right-hand side of the browser screen. You’ll need to scroll down to find the link.

I’ve organized the rest of the worksheet into three objectives.

First objective

Find the total number of SNP and SNV for your gene.

For example, in this screen image I entered the gene name HIF1A (Fig. 1). The total number of SNP known for the gene is displayed as Search results: in this case, 12506, displayed 20 per page (Fig. 1).

Figure 1. Screenshot of SNP in dbSNP for HIF1A.

Note: These numbers will change over time. Figure 1 was made 16 April 2019. Revisiting the gene 23 February 2022, the total number is now at 19,793. To make matters more complicated, NCBI has reworked the available filters again. So, pay attention and think about the purpose of the assignment.

But the total number is misleading; we’re not done. Although the database is called SNP, it contains a number of classes of sequence differences. Thus, we need to use the Advanced search features to drill down to the data set we are seeking. That is, what we are trying to find is the number of single nucleotide variants (SNV) known for our genes. We accomplish this by using the Entrez search query syntax.

Question 1. What is the most frequent class of SNP for HIF1A?

Work the problem by building a query (answer posted below).

From the SNP Search results page, click on the “Advanced” link that appears just below the SNP box (i.e., where you entered the gene name, Fig. 1). The Advanced Search options are shown in a screenshot image (Figure 2). Here, you build your search query, one request at a time, to create a search string. Although awkward at first, the logic grows on you: Select the field, then click on Show Index list to select from among the options within that field.

Note: Once you have your search query, you can type the query directly into the search window, bypassing the Advanced Search form builder.

Figure 2. Screenshot of Query builder form

To get the SNV we select Gene name (or gene id if you know it) from the first All fields box (Fig. 3). Next, we select SNP Class and from the Show Index list, we select SNV (Fig. 3) to complete our query.

Figure 3. Screenshot of completed query builder to get total SNV for HIF1A (23 Feb 2022).

Click on the “Search” button to submit your query.
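As a rough sketch of what the finished query string looks like, the Python below assembles the HIF1A SNV query from the two fields selected above. The `"value"[Field]` pattern is standard Entrez syntax, but the exact string the Query builder produces (quoting, parentheses) may differ, so treat the helper names (`entrez_term`, `snv_query`, `esearch_url`) and the output format as assumptions. The last function only builds an E-utilities URL that requests a hit count; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_term(value, field):
    """Format one search term in Entrez style: "value"[Field]."""
    return f'"{value}"[{field}]'


def snv_query(gene_name):
    """Query for all SNV annotated to a gene.

    Field names ("Gene Name", "SNP Class") are the ones shown in the
    Query builder; the quoting style is an assumption.
    """
    return " AND ".join([
        entrez_term(gene_name, "Gene Name"),
        entrez_term("snv", "SNP Class"),
    ])


def esearch_url(query):
    """Build (but do not fetch) a URL asking the SNP database for a count."""
    params = {"db": "snp", "term": query, "rettype": "count", "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


query = snv_query("HIF1A")
print(query)
print(esearch_url(query))
```

You can paste the printed query into the SNP search box, or open the printed URL in a browser, and compare the count with what the Query builder returns.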

Answer 1.  SNV (17,119) outnumbered the other types (del [440], delins [2064], ins [167], and mnv [3]).
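To check an answer like this for your own gene, it helps to put the class counts side by side. The sketch below uses the HIF1A counts from Answer 1; the variable names are mine, and you would substitute the counts for your GOI. Note that the five classes sum to 19,793, the total reported for HIF1A on 23 February 2022.

```python
# SNP class counts for HIF1A, copied from Answer 1 (23 Feb 2022).
# Replace these with the counts for your own gene.
class_counts = {"snv": 17119, "del": 440, "delins": 2064, "ins": 167, "mnv": 3}

total = sum(class_counts.values())
most_frequent = max(class_counts, key=class_counts.get)

# Print classes from most to least frequent, with percent of total.
for snp_class, count in sorted(class_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{snp_class:<7}{count:>7}{100 * count / total:>7.1f}%")
print(f"{'total':<7}{total:>7}")
print("most frequent class:", most_frequent)
```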

Second objective

We want to know the location of SNV by DNA element, or as NCBI calls it: “Function Class.” Prior to April 2019 you could select from a number of filters available at the right of the frame; now, only function classes for which sequences are available are displayed. While some of these function class filters have returned as of 2022, their names have changed since before 2019 and some of the instructions on this page do not reflect that change.

Important hint: For this assignment, you need to recognize that function class is not the same thing as DNA element. By function class, NCBI (actually the ENCODE people) imply that we evaluate variation by effects on the phenotype. This is not consistent with our objective, which is to identify diversity in the DNA elements of a gene. While in some cases you may be able to piece together all SNV for a DNA element (e.g., exons) by selecting each filter, you are advised to use the Advanced query option instead.

Table 1. As of 2022, list of Function Class filters available for HIF1A. Your gene may or may not have the same filters.

  1. inframe deletion
  2. inframe indel
  3. inframe insertion
  4. initiator codon variant
  5. intron
  6. missense
  7. non coding transcript variant
  8. synonymous

Question 2. Note that these “Function classes” do not necessarily map to the DNA elements we want. From that list, which are “exon” SNVs?

Answer 2. Only items 6 (missense) and 8 (synonymous) refer to SNV clearly contained in an exon (or CDS).

Thus, you may need to conduct an advanced search in order to obtain a meaningful accounting of SNV by location within our GOI. Relevant DNA element Function Class options are listed in Table 2.

Table 2. As of 2022, list of Function Class filters that unambiguously refer to DNA elements of a gene.

  1. 3 prime utr variant
  2. 5 prime utr variant
  3. frameshift variant
  4. intron variant
  5. missense variant
  6. stop gained variant
  7. stop lost
  8. synonymous variant
  9. terminator codon variant

Question 3. Review Table 2 and repeat Question 2.

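Answer 3. Items 5 (missense variant), 6 (stop gained variant), 7 (stop lost), 8 (synonymous variant), and 9 (terminator codon variant) are “exon” related.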
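One way to keep Table 2 and Answer 3 straight in your notebook is a lookup table from function class to DNA element. This is only a sketch: the function class names are copied from Table 2, while the element labels and the helper `classes_for` are my own. I left “frameshift variant” unassigned because Answer 3 does not list it as exon related and, as the HIF1A example later on this page shows, NCBI reports frameshift variants within introns too.

```python
# Function class names are from Table 2; DNA element labels are assumptions
# that follow Answer 3. None means "not assigned to a single DNA element".
FUNCTION_CLASS_TO_ELEMENT = {
    "3 prime utr variant": "3' UTR",
    "5 prime utr variant": "5' UTR",
    "frameshift variant": None,
    "intron variant": "intron",
    "missense variant": "CDS",
    "stop gained variant": "CDS",
    "stop lost": "CDS",
    "synonymous variant": "CDS",
    "terminator codon variant": "CDS",
}


def classes_for(element):
    """Return the function classes mapped to one DNA element, sorted by name."""
    return sorted(
        function_class
        for function_class, mapped in FUNCTION_CLASS_TO_ELEMENT.items()
        if mapped == element
    )


for element in ("5' UTR", "CDS", "intron", "3' UTR"):
    print(element, "->", classes_for(element))
```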

Third objective

Construct a table containing the number of SNV that fall within DNA elements within your GOI. From the table, make a bar chart which clearly identifies the distribution of SNV located in the DNA elements (i.e., tell a story).

The following series of images take you through use of the Query Builder to build the data you need to accomplish the third objective.

Figure 4. Query builder, select Gene Name to start.

Figure 5. Query builder, Gene Name now included, proceed to next “All fields”.

Figure 6. Function Class in All fields selection made. Showing Function Class from Show list options. Selected “intron variant” for this example.

Click on the “Search” button (not shown on most screenshots on this page, see Figure 1 above) to begin the search. To complete the required table of SNV counts in DNA elements of your GOI, you will need to repeat the search request several times, each time altering the Function Class request. Figure 7 shows building a nested query, e.g., how many frameshift variants occur within an intron? The answer for HIF1A was 18 (delins [14], del [3], ins [1]).

Figure 7. Screenshot of query builder, a nested or conditional search: for the gene HIF1A, how many frameshift variants are located in introns?
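Because the same search is repeated once per function class, it can save typing to generate the query strings in one pass and then paste each into the search box. The sketch below is one way to do that under the same assumptions as before: the field names come from the Query builder, the quoting style and the helper `build_query` are mine, and the strings may not match exactly what the builder writes. The last query mirrors the nested search in Figure 7, which is not restricted to the SNV class.

```python
# Function classes from Table 2 (as listed for 2022; names may change).
FUNCTION_CLASSES = [
    "3 prime utr variant",
    "5 prime utr variant",
    "frameshift variant",
    "intron variant",
    "missense variant",
    "stop gained variant",
    "stop lost",
    "synonymous variant",
    "terminator codon variant",
]


def build_query(gene_name, function_classes, snp_class=None):
    """Join gene, optional SNP class, and function classes with AND."""
    parts = [f'"{gene_name}"[Gene Name]']
    if snp_class is not None:
        parts.append(f'"{snp_class}"[SNP Class]')
    parts += [f'"{fc}"[Function Class]' for fc in function_classes]
    return " AND ".join(parts)


# One query per row of the table you need to fill in (SNV only).
table_queries = {
    fc: build_query("HIF1A", [fc], snp_class="snv") for fc in FUNCTION_CLASSES
}
for function_class, query in table_queries.items():
    print(f"{function_class}: {query}")

# Nested query, as in Figure 7: frameshift variants located in introns.
nested = build_query("HIF1A", ["frameshift variant", "intron variant"])
print(nested)
```

Record the count returned for each query next to its function class; those counts are the raw material for your table.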

You are free to vary from the standard theme, but the objective is to

Ten functional classes are available as filters (listed in Table 3); many more are available via the Advanced Query builder.

To get these numbers, simply click on the function class filter name (e.g., 3′ utr), and note the number of SNP items under the “Search results” header. After recording the number, click on the “clear” button to remove the previous filter and then click on the next category (e.g., 5′ splice site).

Construct a table (better yet, create a spreadsheet). Think about how best to represent the SNP count in relation to our objective, then create a bar chart with your results. See Table 3 for the results of SNP by functional class for HIF1A and note I’ve made no attempt to “think about how best to represent the SNP count”. See the Background above for hints about how to “think about…”

Table 3. SNP by function class for human HIF1A (gene ID: 3091)*

Functional Class      SNP
Splice site, 3′         2
UTR, 3′               235
Splice site, 5′         4
UTR, 5′               172
Synonymous            214
Frameshift              0
Intron              10695
Missense              428
Nonsense                8
Stop gained             6

*Note: Not all of the available functional classes are automatically listed; for example, no frameshift SNP were found for HIF1A, therefore on the website you won’t see an option to select frameshift. Results were obtained 30 January 2017.
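To illustrate what “by DNA elements” could look like, here is a sketch that regroups the Table 3 counts and prints a rough text bar chart. The grouping is my assumption, not an NCBI definition: splice sites are kept as their own category, frameshift is left unassigned, and the CDS total simply adds synonymous, missense, nonsense, and stop gained. Because one SNP can carry more than one function class annotation (nonsense and stop gained may well overlap), treat the CDS sum as an upper bound rather than a count of unique SNP.

```python
from collections import defaultdict

# Counts copied from Table 3 (HIF1A, 30 January 2017).
table3 = {
    "splice site, 3'": 2,
    "UTR, 3'": 235,
    "splice site, 5'": 4,
    "UTR, 5'": 172,
    "synonymous": 214,
    "frameshift": 0,
    "intron": 10695,
    "missense": 428,
    "nonsense": 8,
    "stop gained": 6,
}

# Assumed grouping of function classes into DNA elements.
element_of = {
    "UTR, 5'": "5' UTR",
    "UTR, 3'": "3' UTR",
    "intron": "intron",
    "splice site, 5'": "splice site",
    "splice site, 3'": "splice site",
    "synonymous": "CDS",
    "missense": "CDS",
    "nonsense": "CDS",
    "stop gained": "CDS",
}

by_element = defaultdict(int)
for function_class, count in table3.items():
    by_element[element_of.get(function_class, "unassigned")] += count

# Text bar chart: bar length is proportional to the largest count.
largest = max(by_element.values())
for element, count in sorted(by_element.items(), key=lambda kv: kv[1], reverse=True):
    bar = "#" * max(1 if count else 0, round(40 * count / largest))
    print(f"{element:<12}{count:>7} {bar}")
```

Raw counts make introns dominate the chart. Think about whether counts, proportions, or counts scaled by the length of each DNA element tell the story better for your gene.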

Table 4. Updated results for SNV (total 17119) functional class diversity for human HIF1A as of 22 Feb 2022. Note that this table does not address the objective of our assignment.

Functional Class                   SNP
initiator codon variant              1
intron                           17509
missense                           585
non coding transcript variant      619
synonymous                         260

These tables should go into your notebook.

We need to get the SNV by DNA elements. Note that we can extract this information from Table 3, but again, it’s much easier to use the Advanced query builder to get the counts you need to complete the assignment.

Product

The Report 2 project will have several steps. For each step, you submit PowerPoint files. By the end of the semester, the report is completed by merging (and editing as needed) all of the project steps.

This step, SNP Report.

Slides from this assignment include:

  • Table of SNP by functional class for your gene
  • A bar chart that displays the counts of SNP in a “think about” way (hint: by DNA elements)

However, for the assignment, submit your total number of SNV, table of SNV by DNA elements, and bar chart to Submit: SNP by functional class.

Next up

Work on the classification of your SNP and how common it is in humans.

Submit: SNP variants and mutation type

Work on identifying whether or not your gene is a member of one or more gene (protein) families

Project Essential: Gene family?

Work on identifying whether or not your gene contains one or more transposon sequences

Project Essential: Transposable elements in your gene?

Questions

See Week 5 questions.

References

Deng, N., Zhou, H., Fan, H. & Yuan, Y. (2017) Single nucleotide polymorphisms and cancer susceptibility. Oncotarget 8(66): 110635–110649. doi: 10.18632/oncotarget.22372

Guo, Y., & Jamison, D. C. (2005). The distribution of SNPs in human gene regulatory regions. BMC Genomics, 6(1), 140.

Lemos, B., Meiklejohn, C. D., Cáceres, M., & Hartl, D. L. (2005). Rates of divergence in gene expression profiles of primates, mice, and flies: stabilizing selection and variability among functional categories. Evolution, 59(1), 126-137.

/MD