Viewing Illumina reads in IGV; finding SNPs

Goals

In the last lab period we learned about Illumina reads and fastq files. We performed quality trimming and then mapped those reads to the B. rapa reference genome.

Today we will pick up where we left off. Our goals are to:

  1. Learn about sequence alignment/mapping (SAM/BAM) files. More info on SAM
  2. Examine mapped reads in IGV–the integrative genomics viewer. This will allow us to see how reads are placed on the genome during the mapping process.
  3. Find polymorphic positions in our sequencing data.

Preliminaries

OK to cut and paste today’s code

Data files

For better viewing and SNP calling I compiled all of the IMB211 internode files and all of the R500 internode files and ran tophat on those. Then to keep the download to a somewhat reasonable size I subset the bam file to chromosome A03. cd into your Assignment_8/output directory and download the files as listed below:

wget https://bis180ldata.s3.amazonaws.com/downloads/Illumina_Assignment/tophat_out-IMB211_A03_INTERNODE.tar.gz
tar -xvzf tophat_out-IMB211_A03_INTERNODE.tar.gz

wget https://bis180ldata.s3.amazonaws.com/downloads/Illumina_Assignment/tophat_out-R500_A03_INTERNODE.tar.gz
tar -xvzf tophat_out-R500_A03_INTERNODE.tar.gz

Examine tophat output

cd into the tophat_out-IMB211_A03_INTERNODE/ output directory that you downloaded above

You will see several files there. Some of these are listed below

  • accepted_hits.bam – A bam file for reads that were successfully mapped (this is called accepted_hits_A03.bam in the tophat output that I created for you since I only retained chromosome A03)
  • unmapped.bam A bam file for reads that were not able to be mapped missing from the downloaded directory because I deleted it to save space
  • deletions.bed and insertions.bed bed files giving insertions and deletions
  • junctions.bed A bed file giving introns
  • align_summary.txt Summarizes the mapping

Exercise 7: Take a look at the align_summary.txt file.
You only need to answer for the IMB211 alignment
a. What percentage of reads mapped to the reference?
b. Think about the process of read mapping and that the reference genome is from a different cultivar (B3) than the cultivar we sequenced (IMB211). Give 2 reasons why reads might not map to the reference.

Look at a bam file

Bam files contain the information about where each read maps. They are in a binary, compressed format so we can not use less on its own. Thankfully there is a collection of programs called samtools that allow us to view and manipulate these files.

Let’s take a look at accepted_hits_A03.bam. For this we use the samtools view command

samtools view -h accepted_hits_A03.bam | less

The file is structured as follows

Field Value
01 Read Name (just like in the fastq)
02 Code providing info about the alignment (bitwise FLAG)
03 Template Name (Chromosome in this case)
04 Position on the template where the read starts
05 Phred based mapping quality for the read
06 CIGAR string providing information about the mapping
07 - 09 These give information about the “mate pair” if paired end sequencing was done
10 Sequence
11 Phred+33 quality of sequence
Additional fields Varies; see SAM page for more info

samtools has many additional functions. These include

samtools merge – merge BAM files
samtools sort – sort BAM files; required by many downstream programs
samtools index – create an index for quick access; required by many downstream programs
samtools idxstats – summarize reads mapping to each reference sequence
samtools mpileup – count the number of matches and mismatches at each position
samtools addreplacerg – add/replace read group identifiers in a BAM file

And more

Look at a bam file with IGV

While samtools view is nice, it would be nicer to actually see our reads in context. We can do this with IGV, the Integrative Genome Viewer

To use IGV, first create an index of the bam file

samtools index accepted_hits_A03.bam

Do the above step for both the IMB211 and the R500 files

Then to start IGV, type igv at the Linux command line.

Prepare the genome reference files for IGV to use

By default IGV starts with the human genome. It has a number of built-in genomes, but does not include B. rapa. We must upload it ourselves.

In IGV, click on Genomes > Load Genome from File
(it may take a few seconds for the file select window to open)

Select BrapaV1.5_chrom_only.fa located in input/Brapa_reference of your Assignment 8 repository

In IGV, click on File > Load from File

Select Brapa_gene_v1.5.gff located in input/Brapa_reference of your Assignment 8 repository

Load some tracks

Now to load our mapped reads:

Click on File > Load From File ; then select the accepted_hits_A03.bam file from both IMB211 and R500

I recommend right-clicking on each track after it is loaded and renaming it for the cultivar it came from (IMB211 or R500)

Take a look

Click on the “ALL” pull-down menu and select chromosome A03. Then zoom in until you can see the reads. If are having trouble with your cursor and the pulldown menu, try resizing your VNC window while IGV is open

  • Grey vertical bars are a histogram of coverage
  • Grey horizontal bars represent reads.
    • Thin lines connect read segments that were split across introns.
    • Colored vertical lines show places where a read differs from the reference sequence
  • In the lower panel, the blue blocks and lines show the reference annotation. Blocks are exons and lines are introns. The arrowheads show the direction of transcription.
  • The orange lines show junctions inferred by Tophat from our reads

Exercise 8:
Can you distinguish likely SNPs from sequencing/alignment errors? How?

Exercise 9: View each gene listed below in IGV. (You can type the gene ID into the IGV search bar). For each gene, does the computationaly predicted annotation (blue bars) appear to be correct or incorrect given our RNA-seq data? If incorrect, describe what is wrong.
a Bra001706
b Bra001700
c Bra001702
d Bra012917

Exercise 10: Comment on the trust-worthiness of computational annotation.

Calling SNPs

The goal of this section is to find polymorphisms between IMB211 and R500. There are many tools available. We will use FreeBayes
(For more info click on the link above. Once you are on the FreeBayes page, scroll down to the README for more info on FreeBayes)

Make a new directory for this analysis inside the Assignment_8/output directory

mkdir SNP_analysis
cd SNP_analysis

In the library preparation step there is a PCR amplification. This can cause duplication of DNA fragments. We want to remove these duplicate reads because they can skew our SNP analysis (essentially they represent pseudo-replication, giving us artificially high sample numbers). We will use samtools rmdup to remove the duplicate reads

samtools rmdup -s ../tophat_out-IMB211_A03_INTERNODE/accepted_hits_A03.bam IMB211_rmdup.bam

samtools rmdup -s ../tophat_out-R500_A03_INTERNODE/accepted_hits_A03.bam  R500_rmdup.bam

Next create an index for each of the new files

samtools index IMB211_rmdup.bam
samtools index R500_rmdup.bam

Now we use freebayes to look for SNPs. freebayes calculates the number of reference and alternate alleles at each position in the genome and genotype likelihoods.

We first call samtools addreplacerg to add Read Groups to our bam files, enabling us to call SNPs separately for the two genotypes. The output from samtools addreplacerg is a bam file with the Read Groups, we will then merge the two bam files, index the merged file and use it as input for freebayes.

samtools addreplacerg -o rg_IMB211_rmdup.bam -r "ID:IMB211" -r "SM:IMB211" IMB211_rmdup.bam

samtools addreplacerg -o rg_R500_rmdup.bam -r "ID:R500" -r "SM:R500" R500_rmdup.bam

samtools merge combined_IMB211_R500.bam rg_IMB211_rmdup.bam rg_R500_rmdup.bam

samtools index combined_IMB211_R500.bam

freebayes --fasta-reference ../../input/Brapa_reference/BrapaV1.5_chrom_only.fa combined_IMB211_R500.bam  > IMB211_R500.vcf 

freebayes will take ~ 3 minutes to run. If you would prefer not to wait, you can go ahead to the second part of today’s lab where you can download my copy of the VCF file.