Viewing Illumina reads in IGV; finding SNPs

18 May 2023

Goals

In the last lab period we learned about Illumina reads and fastq files. We performed quality trimming and then mapped those reads to the B. rapa reference genome.

Today we will pick up where we left off. Our goals are to:

Learn about sequence alignment/mapping (SAM/BAM) files. More info on SAM
Examine mapped reads in IGV–the integrative genomics viewer. This will allow us to see how reads are placed on the genome during the mapping process.
Find polymorphic positions in our sequencing data.

Preliminaries

OK to cut and paste today’s code

Data files

For better viewing and SNP calling I compiled all of the IMB211 internode files and all of the R500 internode files and ran tophat on those. Then to keep the download to a somewhat reasonable size I subset the bam file to chromosome A03. cd into your Assignment_8/output directory and download the files as listed below:

wget https://bis180ldata.s3.amazonaws.com/downloads/Illumina_Assignment/tophat_out-IMB211_A03_INTERNODE.tar.gz
tar -xvzf tophat_out-IMB211_A03_INTERNODE.tar.gz

wget https://bis180ldata.s3.amazonaws.com/downloads/Illumina_Assignment/tophat_out-R500_A03_INTERNODE.tar.gz
tar -xvzf tophat_out-R500_A03_INTERNODE.tar.gz

Examine tophat output

cd into the tophat_out-IMB211_A03_INTERNODE/ output directory that you downloaded above

You will see several files there. Some of these are listed below

accepted_hits.bam – A bam file for reads that were successfully mapped (this is called accepted_hits_A03.bam in the tophat output that I created for you since I only retained chromosome A03)
unmapped.bam A bam file for reads that were not able to be mapped missing from the downloaded directory because I deleted it to save space
deletions.bed and insertions.bed bed files giving insertions and deletions
junctions.bed A bed file giving introns
align_summary.txt Summarizes the mapping

Exercise 7: Take a look at the align_summary.txt file.
You only need to answer for the IMB211 alignment
a. What percentage of reads mapped to the reference?
b. Think about the process of read mapping and that the reference genome is from a different cultivar (B3) than the cultivar we sequenced (IMB211). Give 2 reasons why reads might not map to the reference.

Look at a bam file

Bam files contain the information about where each read maps. They are in a binary, compressed format so we can not use less on its own. Thankfully there is a collection of programs called samtools that allow us to view and manipulate these files.

Let’s take a look at accepted_hits_A03.bam. For this we use the samtools view command

samtools view -h accepted_hits_A03.bam | less

The file is structured as follows

Field	Value
01	Read Name (just like in the fastq)
02	Code providing info about the alignment (bitwise FLAG)
03	Template Name (Chromosome in this case)
04	Position on the template where the read starts
05	Phred based mapping quality for the read
06	CIGAR string providing information about the mapping
07 - 09	These give information about the “mate pair” if paired end sequencing was done
10	Sequence
11	Phred+33 quality of sequence
Additional fields	Varies; see SAM page for more info

samtools has many additional functions. These include

samtools merge – merge BAM files
samtools sort – sort BAM files; required by many downstream programs
samtools index – create an index for quick access; required by many downstream programs
samtools idxstats – summarize reads mapping to each reference sequence
samtools mpileup – count the number of matches and mismatches at each position
samtools addreplacerg – add/replace read group identifiers in a BAM file

And more

Look at a bam file with IGV

While samtools view is nice, it would be nicer to actually see our reads in context. We can do this with IGV, the Integrative Genome Viewer

To use IGV, first create an index of the bam file

samtools index accepted_hits_A03.bam

Do the above step for both the IMB211 and the R500 files

Then to start IGV, type igv at the Linux command line.

Prepare the genome reference files for IGV to use

By default IGV starts with the human genome. It has a number of built-in genomes, but does not include B. rapa. We must upload it ourselves.

In IGV, click on Genomes > Load Genome from File
(it may take a few seconds for the file select window to open)

Select BrapaV1.5_chrom_only.fa located in input/Brapa_reference of your Assignment 8 repository

In IGV, click on File > Load from File

Select Brapa_gene_v1.5.gff located in input/Brapa_reference of your Assignment 8 repository

Load some tracks

Now to load our mapped reads:

Click on File > Load From File ; then select the accepted_hits_A03.bam file from both IMB211 and R500

I recommend right-clicking on each track after it is loaded and renaming it for the cultivar it came from (IMB211 or R500)

Take a look

Click on the “ALL” pull-down menu and select chromosome A03. Then zoom in until you can see the reads. If are having trouble with your cursor and the pulldown menu, try resizing your VNC window while IGV is open

Grey vertical bars are a histogram of coverage
Grey horizontal bars represent reads.
- Thin lines connect read segments that were split across introns.
- Colored vertical lines show places where a read differs from the reference sequence
In the lower panel, the blue blocks and lines show the reference annotation. Blocks are exons and lines are introns. The arrowheads show the direction of transcription.
The orange lines show junctions inferred by Tophat from our reads

Exercise 8:
Can you distinguish likely SNPs from sequencing/alignment errors? How?

Exercise 9: View each gene listed below in IGV. (You can type the gene ID into the IGV search bar). For each gene, does the computationaly predicted annotation (blue bars) appear to be correct or incorrect given our RNA-seq data? If incorrect, describe what is wrong.
a Bra001706
b Bra001700
c Bra001702
d Bra012917

Exercise 10: Comment on the trust-worthiness of computational annotation.

Calling SNPs

The goal of this section is to find polymorphisms between IMB211 and R500. There are many tools available. We will use FreeBayes
(For more info click on the link above. Once you are on the FreeBayes page, scroll down to the README for more info on FreeBayes)

Make a new directory for this analysis inside the Assignment_8/output directory

mkdir SNP_analysis
cd SNP_analysis

In the library preparation step there is a PCR amplification. This can cause duplication of DNA fragments. We want to remove these duplicate reads because they can skew our SNP analysis (essentially they represent pseudo-replication, giving us artificially high sample numbers). We will use samtools rmdup to remove the duplicate reads

samtools rmdup -s ../tophat_out-IMB211_A03_INTERNODE/accepted_hits_A03.bam IMB211_rmdup.bam

samtools rmdup -s ../tophat_out-R500_A03_INTERNODE/accepted_hits_A03.bam  R500_rmdup.bam

Next create an index for each of the new files

samtools index IMB211_rmdup.bam
samtools index R500_rmdup.bam

Now we use freebayes to look for SNPs. freebayes calculates the number of reference and alternate alleles at each position in the genome and genotype likelihoods.

We first call samtools addreplacerg to add Read Groups to our bam files, enabling us to call SNPs separately for the two genotypes. The output from samtools addreplacerg is a bam file with the Read Groups, we will then merge the two bam files, index the merged file and use it as input for freebayes.

samtools addreplacerg -o rg_IMB211_rmdup.bam -r "ID:IMB211" -r "SM:IMB211" IMB211_rmdup.bam

samtools addreplacerg -o rg_R500_rmdup.bam -r "ID:R500" -r "SM:R500" R500_rmdup.bam

samtools merge combined_IMB211_R500.bam rg_IMB211_rmdup.bam rg_R500_rmdup.bam

samtools index combined_IMB211_R500.bam

freebayes --fasta-reference ../../input/Brapa_reference/BrapaV1.5_chrom_only.fa combined_IMB211_R500.bam  > IMB211_R500.vcf

freebayes will take ~ 3 minutes to run. If you would prefer not to wait, you can go ahead to the second part of today’s lab where you can download my copy of the VCF file.

Two other popular SNP callers:

samtools mpileup.
GATK offers an alternative and popular genotyping pipeline and caller

BIS180L ARCHIVE

Viewing Illumina reads in IGV; finding SNPs

Goals

Preliminaries

Data files

Examine tophat output

Look at a bam file

Look at a bam file with IGV

Prepare the genome reference files for IGV to use

Load some tracks

Take a look

Calling SNPs

Two other popular SNP callers:

Related Pages

Working with Illumina Sequence data 16 May 2023

Illumina continued: analyze a VCF file in R 18 May 2023

Differential Gene Expression from Illumina RNAseq 23 May 2023

Differential Gene Expression: Now What? 25 May 2023

Metagenomics 08 Jun 2023