BLAST Part I
11 Apr 2023Background reading
There is good background information on sequence analysis at Additional Resources
Scenario
Let’s turn the clock back to December, 2019. There is a strange new infectious disease causing flu-like symptoms and respiratory distress in humans and it is spreading rapidly. Based on the disease etiology it seems most likely to be caused by a virus. Your collaborators have isolated viral particles from an infected patient, purified the viral genomes and sequenced them. They are asking for your help in analyzing the sequences.
Of the sequenced viral genomes from the patient there are two immediate questions:
- Which of the viral sequences from the patient is most likely to be the one responsible for the disease?
- What is the origin of this virus?
We will focus on those questions over the next few days, explore how BLAST parameters affect search outcomes, and practice our Linux skills along the way.
Part 1: Getting Organized
cd
into the assignment_2...
directory.
I like to organize my projects so that they contain at least these three directories:
input
this will contain any input dataoutput
this will contain any results coming from analysisscripts
scripts or markdown files used for the analysis
These should have been created when you cloned your repository.
Please find a template for answering the exercises in the scripts
folder. Open Assignment_2_BLAST_template.md
in the editor of your choice (RStudio or nano, for example)
There are two sequence files that you will be working with in this lab, you will need to download them both (see instructions below).
ncbi_virus_110119_2.txt.gz
This contains all complete genomic viral sequences available at NCBI on November 01, 2019. I will refer to this file as the “refseq” file below.- The second file contains the sequences obtained from the patient. I will refer to this as the “patientseq” file below. Download them into your
Assignment_2
directory with the following command:
Now let’s download the files.
What directory should you place these in? Please decide and then cd
there
Download the files with wget
as follow. wget
gets things from the Web…
wget https://bis180ldata.s3.amazonaws.com/downloads/ncbi_virus_110119_2.txt.gz
wget https://bis180ldata.s3.amazonaws.com/downloads/patient_viral.txt.gz
Let’s begin by practicing our Linux skills by summarizing the file content. Start by taking a look at each file using zless
or zmore
.
zless
and zmore
are shortcuts for typing gunzip -c 'your_file.gz' | less
and gunzip -c 'your_file.gz' | more
Start of Exercises
When in doubt always provide the code you used to obtain your answers. This will help in both remembering what you did and troubleshooting problems
Exercise 1: What is the sequence format for these sequences? Do the files contain RNA or DNA sequences? How do you know?
Exercise 2: For the refseq file, the header line contains a number of fields, each separated by a “|”.
Exercise 2a: Use zgrep
and head
to display the first set of header lines (without printing the sequences themselves). What command did you use?
Exercise 2b: For the first entry in the refseq file (starts with >MH781015
), try to figure out what each item in the header is. List each field and what you think it is. You should have fields for
- sequence completeness
- viral species
- country where the virus was isolated
- sequence title
- accession number (a unique identifier for the sequence).
- mystery (figure out what the unnamed field is)
Exercise 3: How many of the viruses come from a domesticated cat (Felis catus) host? How many come from a human host? How many were isolated in the United States? Show the commands used to answer these questions.
Hint 1: Look at the man
entry for grep
to figure out how to count things
Hint 2: You can do the whole thing with grep, or you might want to use cut
. cut
will give you “fields” from a text file. For example: (try this in the terminal to see what happens)
# create a simple example to illustrate cut
## first create a text file
echo "julin,maloof,jnmaloof@ucdavis.edu" > email.txt
echo "john,davis,jtdavis@ucdavis.edu" >> email.txt
## look at the file
cat email.txt
## extract the third field
cut -f 3 -d "," email.txt
Stop and Think: what do the -f
and -d
options do? You can look at man cut
if you aren’t sure
Hint 3: Be aware when searching for viruses isolated in the United States, the United States can appear in an entry even when it was not isolated there. Make sure you are using the “country where virus isolated” field when obtaining your count
Exercise 4: Create a simple (markdown formatted) table with the following information for the refseq and patientseq files:
- Size of the file (in kilobytes, megabytes, or gigabytes. Indicate your units).
- Number of sequences in the file.
- Total number of basepairs or amino acids in the file (see hint 2 below!)
- Average sequence length.
Hint 1: Review the “Just enough Unix” tutorial to determine the correct commands needed to answer these questions. You may need to use the man
pages to learn about options to give the commands.
Hint 2: The total number of basepairs will be the number of nucleotide letters. This includes letters other than the usual “ACGT”. If you are curious in the meaning of these new letters you can look up their definitions here. Finding the size of the genome is a little complicated because it’s not the size of the file. The files also have newline characters and FASTA headers. You need to subtract these. Fortunately, you can do all of this with the Unix skills you already have. Use the man
pages!
Part 3: Build BLAST database
Now that we have explored the files some, let’s get BLASTING!
In order for BLAST to search efficiently if needs to build a database of words for each sequence in the reference that we want to search. This only needs to be done once.
You only need to buld a database for ncbi_virus_110119_2.txt
The command is makeblastdb
. Take a look at the help file to see how to run it:
makeblastdb -help
Can you figure out how to specify the input file and the sequence type? If you get stuck, use your cursor to select the hidden text underneath this sentence for a hint.
makeblastdb -in ncbi_virus_110119_2.txt -dbtype nuclGetting an error? Maybe the file needs to be gunzipped
?
Part 4: Run BLAST, explore output parameters, find disease agent
First, cd
to the output directory in your project folder.
cd ../output
At the command line you can run blast as follows:
time blastn -db ../input/ncbi_virus_110119_2.txt -query ../input/patient_viral.txt -evalue 1e-3 > blastout.default
Where the evalue
option limits the returned results to those with an evalue < 1e-3
Note: time
is optional, it just tells bash to report the amount of time that a command takes.
Note: the command above is written assuming that you are in the output
directory when you issue the command and that your sequence files and blast database are in input
. You may need to modify the paths depending on how you organized things.
Exercise 5: How long did it take BLAST to run?
Exercise 6: Take a look at the beginning of the output file. What type of virus did “Seq_A” come from?
Change format of the BLAST output
If you scroll through the blast output you will see that for each query sequence there are a list of hits followed by alignments for the top hits. This might be useful if you have a single query but is not so great for summarizing the results of multiple query sequences as we have here.
One thing that would be helpful would be to have each “hit” summarized on a single line. We can do this by specifying -outfmt 7
. We’ll modify that table by adding subject title (stitle
) to the standard (std
) fields. (Look at blastn help to see all options).
time blastn -db ../input/ncbi_virus_110119_2.txt -query ../input/patient_viral.txt -evalue 1e-3 -outfmt '7 std stitle' > blastout.default.tsv
Limit the number of results returned
Later in the quarter we will learn how to easily load and analyze these sequences in R. But in the meantime it would be helpful to have only the single or a few hits reported for each query.
We can do this with -max_target_seqs
. -max_target_seqs
limits the number of subject sequences returned. Due to some quirks with the BLAST algorithm these may not be the absolute best hits but they should be pretty similar. For more info, click here.
Exercise 7: Run a new BLAST but limit the results to 2 subjects per query sequence. (Show your code). Which patient sequence do you think comes from the virus that causes the respiratory disease? If you are unsure, do some literature/web searches on the different viruses returned.
Note: You may get more than 2 hits per a query sequence, if you look closely at columns 2, 7, and 8 of the output you may figure out why.
Exercise 8: Consider the host species listed for the hits. Remembering that our samples came from a human patient, which hosts are most evolutionarily distant? Could the viruses generating these hits still have come from the patient sample or does this represent contamination from another source? (Again some web or literature searches will be helpful).
OK that is it for part 1 of BLAST. In the next lab session we will continue with this data set and learn how to use for loops to run multiple BLAST searches.