Bioinformatics I: BLAST

As always, evidence is expected in your Lab Notebook.

Objectives:

Demonstrate how to retrieve results from a BLAST search.
Describe elements of sequence as FASTA format.
Define key bioinformatic terms.
Demonstrate use of text only data sets.
Examine and interpret results from BLAST search.
Demonstrate and interpret changes to software settings.
Collect proper evidence from work, documented in Notebook.
Use Batch Entrez to retrieve sequences.

Overview:

This Worksheet has seven (7) self-study questions.

Prepare your responses to the questions in your Digital NoteBook. Follow all general instructions about how to provide lab notebook evidence. For example, show your work in your notebook, include intermediate steps (e.g., screenshots of settings, results).

Submit completed worksheet:

You will need to create a docx or pdf file with your work. This is easy to do in OneNote. Once your edits are complete,

select to print the notebook page(s) (Windows: Ctrl+P, macOS: ⌘+P)
rename the file appropriately (name course task)
submit your pdf file to this Canvas page.

BLAST work

Billions of protein and nucleic acid (DNA and RNA) sequences are stored in publicly accessible databases. For researchers with an unknown sequence, the first step is to query these databases to see if their unknown sequence “matches” known sequences.

The Basic Local Alignment Search Tool (BLAST) is a program used to find regions of local similarity between protein or nucleotide sequences. BLAST compares nucleotide or protein sequences to sequences stored in a database; BLAST then calculates statistical significance of the matches.

For background, here’s a link to the BLAST glossary: http://www.ncbi.nlm.nih.gov/books/NBK62051/. Don’t forget Wikipedia!

Question 1. Provide a definition IN YOUR OWN WORDS for each of the following terms.

accession
refseq

Question 2. What is meant by the term “Local alignment”?

BLAST

To access BLAST, go to http://blast.ncbi.nlm.nih.gov/ (or simply search “BLAST” on Google). Figure 1 is a screenshot of the BLAST home page from Spring 2022; the Standalone and Specialized search BLAST icons are not shown.

Figure 1. Screenshot, upper portion of NCBI BLAST homepage (Spring 2022)

BLAST front-page options

As you can see from the front page, BLAST now has many options. If the goal is to compare sequences from one of the available published genomes, then you would select BLAST Assembled RefSeq Genomes. Of particular note is the addition of many specialized BLAST tools, including Primer-BLAST, which helps you design or check PCR primers.

However, in general, you will be selecting from the Basic BLAST programs. For example, if working with DNA, then select blastn (nucleotide blast). If working with protein sequences, then use blastp (protein blast).
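The choice between blastn and blastp depends only on the alphabet of the query. As a rough sketch, the hypothetical helper below guesses the program from the letters in a sequence. The 0.9 cutoff is an assumption made for illustration and is not an NCBI rule; both test sequences are arbitrary short examples.

```python
def guess_blast_program(sequence: str) -> str:
    """Guess 'blastn' or 'blastp' from the letters in a raw sequence."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    # Fraction of letters that are nucleotide codes (A, C, G, T, U, or N).
    nucleotide_fraction = sum(c in "ACGTUN" for c in letters) / len(letters)
    # Assumed cutoff for illustration only.
    return "blastn" if nucleotide_fraction >= 0.9 else "blastp"


print(guess_blast_program("TTGGCCGAGAGACCT"))
print(guess_blast_program("MKWVTFISLLLLFSSAYS"))
```

A short protein made up mostly of A, C, G, and T residues would fool this check, so treat the result as a hint and look at the sequence yourself.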

Regardless of the BLAST selected, the basics are similar enough. Let’s try a worked example, mystery sequence 1.

First, BLAST expects the query sequence in a specific format, FASTA. Here’s one for you:

>mystery sequence 1
TAATGACCCGCTGGTCCTGAGGAAGAGGTGCTGACGACCAAGGAGATCTTCCCACAGACCC
AGCACCAGGGAAATGGTCCGGAAATTGCAGCCTCAGCCCCCAGCCATCTGCCGACCCCCCC
ACCCCAGGCCCTAATGGGCCAGGCGGCAGGGGTTGAGAGGTAGGGGAGATGGGCTCTGAG
ACTATAAAGCCAGCGGGGGCCCAGCAGCCCTCAGCCCTCCAGGACAGGCTGCATCAGAAGA
GGCCATCAAGCAGGTCTGTTCCAAGGGCCTTTGCGTCAGGTGGGCTCAGGATTCCAGGGTG
GCTGGACCCCAGGCCCCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTCGTGAAGCATGT
GGGGGTGAGCCCAGGGGCCCCAAGGCAGGGCACCTGGCCTTCAGCCTGCCTCAGCCCTGC
CTGTCTCCCAGATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGG
CGCTGCTGGCCCTCTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGG
CTCACACCTGGTGGAAGCTCTCTACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCC
AAGACCCGCCGGGAGGCAGAGGACCTGCAGGGTGAGCCAACTGCCCATTGCTGCCCCTGG
CCGCCCCCAGCCACCCCCTGCTCCTGGCGCTC

Note: I’ve confirmed that the mystery sequence runs correctly in BLAST. However, if you copy and paste this sequence into the “Enter Query Sequence” window, run BLAST, and get an error message like “Message ID#33 Error: Query contains no data: Query contains no sequence data,” the cause is either a missing EOL (end-of-line separator, or hard return) after the first line, or EOL characters or spaces introduced at the ends of the lines containing the sequence. BLAST expects a single comment line, with the next and all subsequent lines consisting of one uninterrupted sequence. If BLAST does not run after you copy and paste within your browser, suspect one or both errors. Use a text editor to check for hidden “non-printable” characters (spaces, EOL, etc.) and fix your FASTA file. Using a text editor to find and replace non-printable characters was presented in Text files are data files.
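If you would rather script the cleanup than use a text editor, here is a minimal sketch. It assumes a single FASTA record, and the function name clean_fasta is hypothetical. It strips stray spaces and line endings, checks for the comment line, and returns the sequence as one unbroken line.

```python
def clean_fasta(text: str) -> str:
    """Return one FASTA record as a header line plus one unbroken sequence line."""
    # splitlines() handles Windows (\r\n) and Unix (\n) line endings alike.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("first non-empty line must start with '>'")
    header = lines[0]
    # Join the remaining lines and drop any whitespace inside them.
    sequence = "".join("".join(line.split()) for line in lines[1:])
    if not sequence:
        raise ValueError("no sequence data after the header line")
    return f"{header}\n{sequence}\n"


# A short made-up fragment with trailing spaces, an internal space, and blank lines.
raw = ">mystery sequence 1\r\nTAATGACC \r\nCTCACACCTGGTG GAAGC\r\n\r\n"
print(clean_fasta(raw))
```

Because the sketch assumes one record, a file with several “>” lines would need to be split first.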

Question 3. Does this “mystery sequence” call for use of blastn or blastp?

You can see that FASTA format allows a comment line, marked by “>” at the start; the sequence data follow, beginning on the second line. You can copy and paste this sequence into the BLAST window.

Like most searches, start with the “Default” options. In other words, paste the sequence into BLAST, then press the “BLAST” button to start the search (Fig. 2). Here’s what we get for this sequence:

Figure 2. Screenshot of our mystery sequence basic search, humans selected as target organism.

If you run into trouble (e.g., BLAST returns no results), then go back and modify the algorithm’s parameters (i.e., click on the link +Algorithm parameters).

While BLAST is running, this screen updates in your browser window (Fig. 3).

Figure 3. Screenshot BLAST running status page

When BLAST is done you get this page (Fig. 4, partial screenshot, the actual page continues down the screen). The default returns up to 100 possible matches. I’d like you to record the first and the last match.

Figure 4. Partial screenshot BLAST results page, top portion; the actual page continues down the screen

Scroll down or click on the Description tab (Fig. 5).

Figure 5. Screenshot portion of BLAST results page, description

Click on the Graphic Summary tab to view the graphical summary (Fig. 6).

Figure 6. Screenshot of graphical summary tab, results page BLAST

Each of the screen panels provides information, but the DESCRIPTION page is the one we want — we want to know, is there a sequence that matches our mystery sequence?

In Figure 7, I enlarged the first line from Description tab shown in Figure 5. (Remember, the output here is just an example, it is not your example.)

Figure 7. Screenshot, portion of Description tab displayed in Figure 5.

We see that BLAST has returned as its first, and therefore most likely, hit human Insulin. Continuing off to the right of this line shown in Figure 5, the supporting evidence for the match is visible (Fig. 8).

Figure 8. Screenshot supporting evidence for the match, visible at right-hand portion of image displayed in Figure 7.

The E-value (Expect value) is analogous, though not identical, to the p-value from a statistical test: it is the number of matches with a score at least this good that you would expect to find by chance alone in a database of this size. As with statistical tests, we interpret a low E-value to mean that our sequence is highly unlikely to match Insulin by chance alone. In our example BLAST reported an E-value of “0.0”, which is misleading. As you know, an empirical probability cannot be exactly zero (or, for that matter, exactly one; there are always chance deviations). BLAST reports “0.0” because it rounds the value: the actual E-value is extremely small, but not zero.
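To see how an E-value relates to a probability, and why a tiny value can display as zero, here is a short sketch. It uses the standard relationship P = 1 - exp(-E) for the chance of at least one match that good occurring at random. The E-values in the loop are made up for illustration, and the three-decimal display is an assumption standing in for whatever rounding BLAST applies.

```python
import math


def p_from_e(e_value: float) -> float:
    """Probability of at least one chance hit, given the expected number E."""
    # -expm1(-E) equals 1 - exp(-E) but keeps precision when E is tiny.
    return -math.expm1(-e_value)


# Made-up E-values, from poor matches to extremely good ones.
for e in (10.0, 1.0, 0.05, 1e-50, 1e-200):
    print(f"E = {e:g}   P = {p_from_e(e):.3g}   shown to 3 decimals: {e:.3f}")
```

For small E the two numbers are nearly equal, which is why a small E-value can be read much like a p-value. The last rows print as 0.000 even though the underlying value is greater than zero.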

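To help us interpret our results, BLAST also reports “Identity,” the percentage of aligned positions at which the query and the matched sequence have the same letter. Identity is not derived from the E-value; it is a separate measure of the match. We see that our mystery sequence matches at 100% over the range of sequence aligned by BLAST.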
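Percent identity can be computed by hand from any pairwise alignment. The sketch below assumes two already-aligned strings of equal length, with “-” marking a gap, and counts the columns where the letters agree. The short sequences are toy examples, not BLAST output.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of alignment columns where both sequences carry the same letter."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    # A column counts as a match only if the letters agree and are not gaps.
    matches = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


print(percent_identity("TTGGCCGAGA", "TTGGCCGAGA"))  # 100.0
print(percent_identity("TTGGCCGAGA", "TTGACC-AGA"))  # 80.0
```

The second pair has one mismatch and one gap in ten columns, so eight of ten columns match.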

Question 4. What is the percent identity for our mystery sequence with Pan troglodytes gene for insulin precursor?

Question 4a. What is a Pan troglodytes?

Question 5. Return to the BLAST search page (back arrow in browser will do) then select +Algorithm parameters. Locate the “Word size” value (report it), then change it from the default to

a) a much larger value

b) a much smaller value

Then, for (a) and (b), re-run the BLAST search for mystery sequence 1 and report the first match, if any, and the last match, if any. Do you get the same answer as before?

In Question 5 I had you explore the effects of changing the “word size” on the search. The word size is the number of letters in the short exact matches (“words”) that the algorithm uses to start the search. For example, if the word size were “3”, then BLAST would search a sequence of TTGGCCGAGAGACCT in blocks of three: TTG GCC GAG AGA CCT. (In practice BLAST considers every overlapping word, so TTG, TGG, GGC, and so on.) It would find matches in the database for each of these three-letter combinations (a lot of sequences!), then try to extend each match into a longer alignment.
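As a rough illustration of the word idea, the sketch below splits the example sequence into blocks of three, as in the text, and also lists every overlapping word, which is closer to how seed-and-extend searches typically build their word lists. The function names are hypothetical, and this is not BLAST’s actual code.

```python
def block_words(sequence: str, word_size: int) -> list[str]:
    """Non-overlapping words: step through the sequence word_size letters at a time."""
    last_start = len(sequence) - word_size
    return [sequence[i:i + word_size] for i in range(0, last_start + 1, word_size)]


def overlapping_words(sequence: str, word_size: int) -> list[str]:
    """Every word of the given size, sliding one letter at a time."""
    last_start = len(sequence) - word_size
    return [sequence[i:i + word_size] for i in range(last_start + 1)]


query = "TTGGCCGAGAGACCT"
print(block_words(query, 3))
print(overlapping_words(query, 3))
# A larger word size yields fewer, longer, and therefore more specific words.
print(len(overlapping_words(query, 3)), len(overlapping_words(query, 11)))
```

Longer words are rarer in a database, so a large word size gives a faster but less sensitive search, while a small word size finds more candidate matches at the cost of time.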

Question 6. What effect does word size have on search results?

What else can you do from the BLAST results screen?

Our goal was to check for a match between our mystery sequence and a known sequence in a database. We received matches with 100% identity and conclude that our mystery sequence was human insulin. So, we’re done. However, a number of things could follow. A useful visual display can be had by first selecting some of the sequences (click in the little box next to each line you are interested in), then choosing “Distance tree of results”. In addition to our human insulin sequence I selected several other primate sequences to generate the following distance tree (Fig. 9).

Figure 9. Screenshot of distance tree made from BLAST aligned sequences.

Now, distance has a precise definition, but for now, think of it like this. If two things are identical, then the distance between them is small (zero if truly identical). If two things are different, then the distance between them is large, proportional to how different they are. In the image our “mystery sequence” is marked with a yellow tag, and it shares a branch with the human insulin gene. The branch length is short, which is in keeping with our 100% sequence match.
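One simple way to put a number on “distance” is the proportion of aligned positions at which two sequences differ. The sketch below uses that measure on toy sequences; it is an assumption chosen for illustration, and the BLAST tree view offers its own distance options, which may differ.

```python
def p_distance(aligned_a: str, aligned_b: str) -> float:
    """Proportion of aligned positions at which two sequences differ."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    differences = sum(a != b for a, b in zip(aligned_a, aligned_b))
    return differences / len(aligned_a)


# Toy aligned sequences, made up for illustration.
sequences = {
    "mystery": "TTGGCCGAGA",
    "copy": "TTGGCCGAGA",
    "variant": "TTGACCGAGT",
}
for name, seq in sequences.items():
    print(name, p_distance(sequences["mystery"], seq))
```

Identical sequences give a distance of zero, and the distance grows as more positions differ, which is the behavior the branch lengths in the tree reflect.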

Question 7. Try this mystery sequence yourself.

>mystery sequence 2
MEGAGGANDKKKISSERRKEKSRDAARSRRSKESEVFYELAHQLPLPHNVSSHLDKASVMRLTISYLRVRKLLDAV

(a) Which BLAST search algorithm should you use, blastn or blastp?

(b) What is the accession number of the top match?

(d) Generate a distance tree for the top ten matches.

(e) Based on your experience and results from Question 5, select a new word size, re-run the BLAST search on mystery sequence 2, and repeat question 7b-d.

Use Batch Entrez to retrieve sequences

Once you’ve built up a list of accession numbers, you can retrieve the sequences via Batch Entrez. See Download sequences with BATCH Entrez for help.
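Batch Entrez works from a plain text file of identifiers. As a minimal sketch, assuming one accession per line is the layout you want, the hypothetical helper below tidies a list you have collected by trimming spaces and dropping blanks and duplicates. The accession strings are placeholders, not real records.

```python
def accession_list_text(accessions: list[str]) -> str:
    """Return the accessions as text, one per line, without blanks or duplicates."""
    seen = set()
    unique = []
    for accession in accessions:
        accession = accession.strip()
        if accession and accession not in seen:
            seen.add(accession)
            unique.append(accession)
    return "".join(f"{accession}\n" for accession in unique)


# Placeholder identifiers for illustration only.
collected = ["EXAMPLE_000001.1", " EXAMPLE_000002.1", "", "EXAMPLE_000001.1"]
print(accession_list_text(collected), end="")
```

Save the printed text as a plain text file (not docx) before uploading it.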

/MD