Bioinformatics II: Detecting designed sequences
Designed sequences?
Background and setting up the problem
Place yourself in the following scenario. You work in a DNA lab for the government. It’s late, most everyone else has gone home, but you are still in your office working to get a report done. Your supervisor sees you and hands you a thumbdrive. The data are appended at the bottom of this page.
“These are the results of DNA sequencing on what we believe to be a new form of bioterrorism,” she says. “We think someone has made a new gene, one that codes for a potentially harmful protein, and that this gene may one day be inserted into an airborne pathogen. Can you tell us whether we have evidence for a designed gene, or is this a natural gene?
I can give you two weeks, but we really need to know ASAP.”
As she speaks, her voice fades from your attention. You contemplate the possibilities, you panic a bit. Just a bit. This is what you’ve trained for.
Is there reason to be concerned about these sequences? You’re well aware that used DNA synthesizers like the Bio XP 3200 can be purchased on eBay and that public databases contain sequences of some nasty pathogenic substances. So with a little knowledge, just about anyone could make some DNA of concern.
Your supervisor leaves and suddenly you feel the weight of the world on your shoulders.
You’ll need to work out a protocol. You need to develop a plan for identifying not only design, but also potential for harm.
You start to think: where to start? Someone must have worked on a similar problem before. You remember the rumors and conspiracy theories centering on the COVID-19 pandemic of 2020. A significant charge about SARS-CoV-2, the causative agent of COVID-19, was that the virus had been genetically modified by lab workers and released to the public, either accidentally or deliberately. This accusation was addressed by Andersen et al. in Nature Medicine, 2020. As of late 2021, no convincing evidence has been provided that the virus was derived in, or escaped from, a lab. Nor has anyone identified a natural source.
For guidance, you quickly read the Nature Medicine paper (see Reference list at end of this page), then return to your task.
You begin by writing down tools and the concepts you may need.
Item 1. Describe the sequences. Are they long or short? Do they show an unusual GC ratio?
Item 2. Search the databases for a DNA match.
Item 3. Search the sequences for ORFs.
Item 4. Translate the DNA to protein.
Item 5. Decide if there is evidence for “design.”
Item 6. Write a hypothesis that relates what you read in the sequence to evidence for design of the sequence.
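Item 1 can be sketched in a few lines of Python. This is a minimal illustration, not a required tool: the function name `describe` and the toy sequence are assumptions made for the example, and the toy sequence is not one of the supervisor's six.

```python
def describe(seq: str) -> dict:
    """Return the length and GC fraction of a DNA sequence."""
    seq = seq.upper()
    # Count G and C nucleotides; everything else is treated as non-GC.
    gc = sum(1 for base in seq if base in "GC")
    return {"length": len(seq), "gc_fraction": gc / len(seq) if seq else 0.0}


# Hypothetical toy sequence, used only to show the call.
example = "ATGCGCTTAA"
print(describe(example))  # {'length': 10, 'gc_fraction': 0.4}
```

Whether a given GC fraction counts as "unusual" depends on what you compare it to, so record the reference you used in your protocol.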
What about the design concern? In brief, the idea is that we can conceive of sequences that may enhance the infectiveness of a virus, or the damage it causes, in ways that cannot appear (or have not appeared) in nature. If the sequences are altered from nature or designed de novo, they could be assembled along with a vector. For example, the several varieties of GMO rice use a set of known plasmids. If all or part of one of these plasmids is present in a suspect DNA sequence, then we have evidence of a designed sequence.
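One way to picture the "part of a known plasmid is present" test is to ask how many short subsequences (k-mers) of a vector also occur in the suspect sequence. The sketch below is only an illustration under stated assumptions: the function names, the choice of k = 12, and all three sequences are hypothetical, and only the given strand is checked (a real search would also consider the reverse complement and would normally use a database search tool).

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of all length-k subsequences of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_fraction(suspect: str, vector: str, k: int = 12) -> float:
    """Fraction of the vector's k-mers that also occur in the suspect sequence."""
    vector_kmers = kmers(vector, k)
    if not vector_kmers:
        return 0.0
    return len(vector_kmers & kmers(suspect, k)) / len(vector_kmers)


# Hypothetical sequences for illustration only.
vector = "GAATTCGAGCTCGGTACCCGGGGATCC"
suspect = "ATGAAA" + vector + "TTTTAA"   # vector fragment embedded in the suspect
unrelated = "ATGAAACCCGGGTTTTAA"          # no vector fragment

print(shared_fraction(suspect, vector))    # 1.0
print(shared_fraction(unrelated, vector))  # 0.0
```

A high shared fraction would suggest that vector sequence is present; a low one does not rule out design, because a designer need not leave vector sequence behind.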
In general, we look to see if the structure (combination of nucleotides) is unknown in nature and not likely the result of natural selection. A designed sequence, as opposed to one derived by natural selection, may require multiple, implausible steps leading to significant efficiency in humans. SARS-CoV-2 uses ACE2 receptors to get inside cells, and it attacks human ACE2 very well, but it can get into lots of other mammals, too, because it is not precisely matched to human ACE2. Thus, we can assess naturally occurring vs. designed by looking at permutations of DNA elements in our suspect DNA sequences and comparing
What Bioinformatics II is about:
The challenge: Develop and produce a protocol to distinguish between engineered (designed) sequences and naturally occurring sequences. Students will apply the genetic code and reading frames, the pathway of information, and the sequence hypothesis, DNA → RNA → Protein. Students will demonstrate use of two bioinformatics tools.
The successful student will write for two audiences: an executive summary that explains to the supervisor and other members of the bioterrorism task force how to distinguish between engineered and natural DNA sequences, and a protocol that lays out the steps a technician would need to follow in order to process new sequences.
Instructions
Read the entire document before beginning. Do the Practice work first (scroll down to Practice first).
What to turn in:
Two items are due by midnight 29 October.
- Submit your protocol that describes tools used and steps taken to evaluate the sequences.
- Click here to submit Bioinformatics II: Student protocol
- A 500-word Executive summary, which describes your findings, how to apply the protocol to any suspect samples, and your conclusions. Your report must also include a clear statement about your model for recognizing designed sequences.
- Click here to submit Bioinformatics II: Student Executive Summary
- Answers to any questions on the page not related to the protocol or the executive summary should go in your notebook.
Your protocol report and Executive summary must conform to the format and content listed on your CANVAS site.
What about the four questions in this handout? These belong in your notebook, along with evidence of your work in support of this assignment.
When is it due?
Your conclusions are due by end of Week 8.
Some background before you begin
What this exercise works on is use of the Standard Genetic Code and the information pathway of DNA → RNA → Protein. The key question in this assignment is whether these DNA sequences code for a protein that you should worry about. Naked DNA by itself is of no harm (DNA packaged in protein and carbohydrate? That’s a virus, and that’s another story!). We eat DNA all of the time in the form of meat and produce. It is only if the end product is harmful that we need to worry.
The Genetic Code is a set of nearly universal instructions for translating from RNA (or DNA) to proteins, three nucleotides at a time (= codon). Each codon codes for one amino acid. Because there are 20 standard amino acids in biological organisms, but 64 codons, many amino acids are coded for by more than one codon (codon redundancy). The “universal” part of the Genetic Code implies that most organisms use the same set of instructions; this has largely been confirmed, and provides the single best evidence that all life alive today shares a common ancestor (evolution). We no longer refer to the code as “universal,” however, but rather, we refer to it as the “Standard Code” (Fig. 1).
Figure 1. Standard genetic code.
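Codon redundancy is easy to check for yourself. The sketch below builds the Standard Code as a Python dictionary and counts how many codons map to each amino acid. The compact 64-letter string lists the amino acids (one-letter symbols, `*` for stop) with the codons ordered T, C, A, G at each position; the variable names are assumptions made for this example.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# One-letter amino acid symbols for codons TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each DNA codon (e.g. "ATG") to its one-letter amino acid symbol.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

codons_per_residue = Counter(CODON_TABLE.values())

print(len(CODON_TABLE))              # 64 codons
print(len(codons_per_residue) - 1)   # 20 amino acids (excluding stop)
print(codons_per_residue["L"])       # 6 codons for leucine
print(codons_per_residue["M"])       # 1 codon for methionine
print(codons_per_residue["*"])       # 3 stop codons
```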
Question 1. True or False. Humans use the standard genetic code for all genes in the genome.
Click here to get a table to tell you what amino acid corresponds to the three letter symbol.
Question 2. For example, what is the full name for the amino acid “Phe”?
Feeling lost about the Report? The sequences you need for the report are listed at the end of this page. What we are covering now is background to do the exercise and to answer the questions for the report.
There are hints throughout this handout, but in addition, to translate the sequences, I suggest one of these websites may be helpful.
From your textbook and from lecture, you may already have the Mendelian definition of a gene as the functional unit of inheritance, and we know that genes consist of DNA sequence. Our operational definition of a gene is called the nominal gene, which is the current bioinformatics definition of a gene: any DNA sequence that contains an ORF, or Open Reading Frame.
An ORF is defined as a DNA sequence that contains a start codon (ATG) and a stop codon (TAA, TAG, or TGA in the Standard Code), with enough DNA in between to code for dozens or more amino acids. ORFs may or may not be “real” genes; genes are part of organisms, but DNA sequences can be genes or noncoding sequences.
Another point you’ll need to understand is that a single strand of DNA has 3 reading frames. Reading frames come from the bioinformatic definition of a gene or ORF. An ORF is a sequence of DNA that contains a start codon, a stop codon, and enough sequence in between that a typical protein could be coded for. Median protein length in humans is 375 amino acids, but proteins range down to about 100 amino acids. Thus, a minimum ORF would be about 3 + 300 + 3 = 306 nucleotides long (start codon, 100 codons, stop codon).
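The ORF definition above translates directly into a small scan. This is a minimal sketch, not how ORF Finder itself is implemented: the function name `find_orfs` and the parameter `min_internal_codons` are assumptions, the default of 100 mirrors the 3 + 300 + 3 = 306 arithmetic, and only the strand you supply is scanned, starting at the first, second, and third nucleotide in turn.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_internal_codons: int = 100) -> list:
    """Return (frame, start, end) for each ORF on the given strand.

    frame is 1, 2, or 3; start and end are 0-based positions, end exclusive,
    and the span includes both the ATG and the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for offset in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(offset, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                internal = (i - start - 3) // 3  # codons between ATG and stop
                if internal >= min_internal_codons:
                    orfs.append((offset + 1, start, i + 3))
                start = None
    return orfs


# Hypothetical test sequence: ATG + 100 alanine codons + stop, padded on both ends.
toy = "CC" + "ATG" + "GCT" * 100 + "TAA" + "GG"
print(find_orfs(toy))  # [(3, 2, 308)]
```

The reported ORF spans 308 - 2 = 306 nucleotides, the minimum length worked out above. Lowering `min_internal_codons` lets you explore shorter candidates, at the cost of more chance hits.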
Now, back to the “three reading frames” part. When we feed a computer a sequence, we need to tell it where to begin translating. It may be the case that the very first nucleotide indeed corresponds to the beginning of the sequence in nature, but it is not necessary for us to make that assumption. We can start the sequence in 3 different modes. The good news is that we know one of these three is correct. I demonstrate the “3 reading frames” next.
Operationally, then, we take a DNA sequence and look to see if it contains an ORF. In general, we evaluate a sequence three times. We call these reading frames. A codon has three nucleotides, so the first reading frame (RF1) of a sequence begins at the first nucleotide, the second reading frame (RF2) begins at the second nucleotide, and RF3 begins at the third nucleotide. Consider an example. The DNA sequence is ATTGCGG. Thus, the three possible reading frames are
RF1 = ATTGCGG
RF2 = TTGCGG
RF3 = TGCGG
RF1 has codons ATT, GCG, and that’s it. Thus, RF1 would translate to a polypeptide consisting of (in order): Ile+Ala.
RF2 has codons TTG, CGG. RF2 would give Leu+Arg.
RF3 has only one codon, TGC. And RF3 would be the amino acid Cys.
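The worked example can be reproduced with a short script. This is a sketch under stated assumptions: the function names are made up for the example, and the Standard Code is written as a compact string with codons ordered T, C, A, G at each position. Leftover nucleotides that do not fill a codon are dropped, as in the example above.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys", "Q": "Gln",
    "E": "Glu", "G": "Gly", "H": "His", "I": "Ile", "L": "Leu", "K": "Lys",
    "M": "Met", "F": "Phe", "P": "Pro", "S": "Ser", "T": "Thr", "W": "Trp",
    "Y": "Tyr", "V": "Val", "*": "Stop",
}


def reading_frames(seq: str) -> list:
    """Split seq into complete codons for RF1, RF2, and RF3."""
    seq = seq.upper()
    return [
        [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        for offset in range(3)
    ]


def translate(codons: list) -> str:
    """Translate a list of codons into three-letter symbols joined by '+'."""
    return "+".join(THREE_LETTER[CODON_TABLE[c]] for c in codons)


for number, codons in enumerate(reading_frames("ATTGCGG"), start=1):
    print(f"RF{number}: {codons} -> {translate(codons)}")
# RF1: ['ATT', 'GCG'] -> Ile+Ala
# RF2: ['TTG', 'CGG'] -> Leu+Arg
# RF3: ['TGC'] -> Cys
```

Note that this sketch translates every codon in the frame, including any stop codon, rather than halting at the first stop.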
Question 2. Applying the Genetic Code (e.g., image above or see figure in your genetics text book), what are the resulting polypeptides from the three reading frames of the DNA sequence, ATCCGCGTGCAA?
Bioterrorism? Some things to think about
What about the Bioterrorism angle? Here are four points you should ponder.
First, DNA is harmless by itself. It is information: to have an effect, DNA requires transcription, then translation of the resulting transcript(s) by cellular machinery (i.e., the Pathway Hypothesis: DNA → RNA → Product).
Second, the end product of translation is generally a nonfunctional protein, requiring modification by the cell to gain function. Nevertheless, it is the protein one needs to worry about. Your boss’s concerns seem to require an answer.
What would happen IF the DNA sequences you have just received DID get introduced into a cell (us), and the sequence DID get translated? This is the problem of Gene Ontology, the mapping of all things known about a gene (its product, the product’s function, and what biological role the product is part of). What proteins are out there that we might be concerned with? Well, anything that is a toxin, especially anything that harms at low doses. But I can imagine all kinds of proteins that could potentially do harm:
Question 3. What would be the possible harm if a protein that blocks p53 is introduced into us?
Third, your boss mentioned pathogen. What if the DNA sequence codes for a protein that modifies transmission of pathogens?
Question 4: What are the different kinds of pathogens and which kinds might be more attractive as a vector for Bioterrorism?
Fourth, if these sequences are the work of humans (designed), what exactly would you expect/hope to see in the sequence? What constitutes EVIDENCE? Begin with the following assumption: even if the gene is designed, it is unlikely that a completely novel protein, never before known to science, suddenly shows up AND can interfere with human physiology or alter pathogen transmission. What seems more likely is that someone figures out how to deliver a known protein in a new way or has modified an existing protein. These points raise the possibility that we may be able to rely on existing databases to help us recognize if the sequences code for a protein and whether it is likely to do harm.
Question 5: What are the names and websites for some appropriate database(s) that contain gene sequences? Protein sequences?
Practice first
Use of ORF Finder: Practice and Worked example
Use of Virtual Ribosome: Practice and worked example
References, suggested reading
Allen, J. E., Gardner, S. N., & Slezak, T. R. (2008). DNA signatures for detecting genetic engineering in bacteria. Genome Biology, 9(3), 1-10.
Andersen, K. G., Rambaut, A., Lipkin, W. I., Holmes, E. C., & Garry, R. F. (2020). The proximal origin of SARS-CoV-2. Nature Medicine, 26(4), 450-452.
Budowle, B., Schutzer, S. E., Einseln, A., Kelley, L. C., Walsh, A. C., Smith, J. A., … & Campos, J. (2003). Building microbial forensics as a response to bioterrorism. Science, 301(5641), 1852-1853.
Das, S., & Kataria, V. K. (2010). Bioterrorism: A public health perspective. Medical Journal Armed Forces India, 66(3), 255-260.
Need more help about reading frames?
You can read more about reading frames at Wikipedia
You can read more about open reading frame at Wikipedia
Data from supervisor, six DNA sequences
Here are the sequences provided to you by your supervisor. Reading left to right, assume 5′ to 3′. Sequences are in FASTA format.
>Sequence 1
ATTATTACTAAATGGTGTGATCTTATTAATTCTACTACTTGTCTTACTTCTATGATTAAAGAATGGGCTTCTCATGAACGTGAAACTTCTTTTTTTCTTGATGGTAATTGTTGGAATTCTCAACTTTTTTCTTGGAATTTTTCTTCTTCTAATACTAATAAAAAAGTTTGTTATTTTCGTTTTGATTTTTCTAATTATGGTATTAATAAATATTCTTGT
>Sequence 2
GATAGTAGTGGGTGGAATAGTGAAGAAAACGAAGCTAAAAGTGATGCGCCCCTAAGTACAGGAGGGGGTGCTTCTTCTGGAACATTTAATAAATACCTCAACACCAAGCAAGCGTTAGAGAGCATCGGCATCTTGTTTGATGGGGATGGAATGAGGAATGTGGTTACCCAACTCTATTATGCTTCTACCAGCAAGCTAGCAGTCACCAACAACCACATTGTCGTGATGGGTAACAGCTTT
>Sequence 3
ATTATCACAAAATGGTGTGATCTTATCAATAGCACTACTTGCTTTTTTTAGTTGGATGGCAATTGTTGGAATTCACAGCTTTTTAGTTGGAATTTTAGTTAATCATCAAACACTTAAAATAAGTAAAAAGTATGTTATTTTAGGTTCGATTTTTCCAATTATGGCATTAACAAATACTCTTGTAATTAGAAAAAAATTAAAAGCTTTATTAGGAGAGGGTAAGGTTCAAAAAGGACTCAA
>Sequence 4
AGTTGTTGGTGGTATTGTTCTTACTTCTTGTCATGCTATGATTAATGCTGATGAAGGTGAAAATGAAACTATTTGTTCTAAAAAAACTAAACTTAAAGTTATGCGTCCTGTTCAAGAAGGTGTTCTTCTTCTTGAACATCTTATTAATACTTCTACTCCTTCTAAACGTCGTGCTTCTGCTTCTTGTCTTATGGGTATGGAAGGTATGTGGCTTCCTAAT
>Sequence 5
CTTTTATGTTTAATGACTTAAAAAAGCTCTTACTTCTTTAGAGCAAATATGATGAACGCTAGATTTCTTAATTTGCTACTTAGGAATAGCTTACTTCTTGGGTTGATGTTTGTTATGGTGCTGCTACTATGCTTCGTTCTCGTGCTGTTGCTCTTAAACAATCTACTTCTTGTTATGTTCTTTCTATTCTTCTTGTTAAAAAAGTTCTTAATTTTTATCTTTTTCTTGTTAATAAATCT
>Sequence 6
GGAGGGAGATCATCAGATCAAAGTAATAAATTCACCAAGTACCTCAACACCAAGCAAGCATTGGAAAGGATCGGCATCTTGTTTGATGGGGATGGAATGAGGAATGTGGTTACCCAACTCTACCAACCCAACAAGGTGAAAAGTGGTCAATATCAACAAAATAACACCTACAACAGGTTAATTGAGCCTGACAATGCAACAAGTGCAGCGAGCAGCATGACCAGCTTGTTAAAGCTGTTG