Intro to Illumina Sequence Analysis

Julin Maloof

Illumina sequencing allows the rapid and inexpensive sequencing of billions of base pairs of DNA or RNA in a single reaction.
It has revolutionized many aspects of biology over the last decade.
Analyzing Illumina data is a critical skill for any bioinformaticist.
We will spend the next six labs working with an Illumina data set.

For this series of labs:

Learn about Illumina reads, how to map them, and how to perform quality control (Tuesday)
Learn how to view reads in a genome browser and how to find single nucleotide polymorphisms (Thursday)
Find genes that are differentially expressed between genotypes or treatments (Next week)
Ask whether differentially expressed genes share any common functionality (gene ontologies) or promoter motifs
Build a gene regulatory network to determine how genes connect to one another

Each flow cell of an Illumina machine produces ~350 million reads of DNA, each of which is 50 to 150 bp long.

These data are returned to the user in a FASTQ file.

FASTQ files have 4 lines of information for each read

@HWUSI-EAS100R:6:73:941:1973#0/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCC
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**
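In the FASTQ format, the four lines of a record are: an identifier line beginning with `@`, the base sequence, a separator line beginning with `+`, and a quality string with one character per base. As a minimal sketch (in Python, with a hypothetical helper name `parse_fastq_record`), the example record above can be split into those fields like this:

```python
# The example record from above, one list element per line of the file.
record = [
    "@HWUSI-EAS100R:6:73:941:1973#0/1",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCC",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**",
]


def parse_fastq_record(lines):
    """Split the four lines of one FASTQ record into named fields.

    Hypothetical helper for illustration only; real FASTQ parsers
    handle many more edge cases.
    """
    header, sequence, separator, quality = lines
    if not header.startswith("@"):
        raise ValueError("line 1 should start with '@'")
    if not separator.startswith("+"):
        raise ValueError("line 3 should start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Drop the leading '@' from the identifier.
    return {"id": header[1:], "sequence": sequence, "quality": quality}


read = parse_fastq_record(record)
print(read["id"])
print(len(read["sequence"]), "bases,", len(read["quality"]), "quality characters")
```

The length check reflects the one-to-one pairing between bases and quality characters.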

Each base has a quality score representing how confident the machine is that the base is correct.
These are called PHRED scores and range from 0 to ~40, where

\[ \mathrm{PHRED} = -10 \log_{10}(p) \]

and \( p \) is the probability that the reported base is wrong.
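The formula can be tried out numerically. The sketch below uses Python (the same arithmetic works in R); the function names are made up for illustration, and the error probabilities are arbitrary example values.

```python
import math


def phred_from_error_prob(p):
    """PHRED score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def error_prob_from_phred(q):
    """Inverse of the formula: error probability for PHRED score q."""
    return 10 ** (-q / 10)


# Each tenfold drop in error probability adds 10 to the PHRED score.
for p in (0.1, 0.001, 0.0001):
    print(p, round(phred_from_error_prob(p), 1))
```

The inverse function is handy when reading quality reports, which usually show PHRED scores rather than probabilities.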

QUESTION: If there is a 1 in 100 chance that the base is wrong, what is the PHRED score? (Try this in R)

But how can the following encode PHRED qualities?

!''*((((***+))%%%++)(%%%%).1***-+*''))**

In computerese each character is represented internally as a number. This is called the ASCII code.

For example ! has an ASCII code of 33, * has an ASCII code of 42, etc.
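These codes can be checked directly. A small sketch in Python, where the built-in `ord()` returns a character's code and `chr()` goes the other way:

```python
# Look up the ASCII code of individual characters.
for ch in "!*":
    print(repr(ch), ord(ch))

# Convert the first few characters of the quality string to their codes.
codes = [ord(ch) for ch in "!''*"]
print(codes)

# chr() converts a code back into its character.
print(chr(33), chr(42))
```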

Thus, these characters represent numbers, and numbers can represent quality.
Why use characters instead of numbers?

To add an additional wrinkle, the ASCII codes must be converted to the actual PHRED scores.

Why? ASCII characters 0 - 32 are invisible so they can't be used.

Additionally, different starting points have been used:

[Figure: PHRED score encodings]
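As a sketch of the conversion, the quality string shown earlier can be decoded by subtracting the starting point (offset) from each ASCII code. This example assumes an offset of 33 (the Sanger convention, also used by Illumina 1.8 and later); some older Illumina pipelines used an offset of 64 instead, so the right value depends on where the file came from. The function name is hypothetical.

```python
def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of PHRED scores.

    offset=33 is an assumption about the encoding; pass offset=64
    for files written by older Illumina pipelines.
    """
    return [ord(ch) - offset for ch in quality_string]


quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**"
scores = decode_quality(quality)
print(scores)
print("lowest:", min(scores), "highest:", max(scores))
```

If the wrong offset is used, the scores come out shifted by 31, which often shows up as negative or implausibly high values.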

Multiplexing allows multiple samples to be sequenced in a single lane.
Tag each DNA fragment with a sequence that is unique for each sample.
“Indexes”
- Tag or index is internal in the adapter and is sequenced in a separate reaction
- Reads are automatically separated for the different samples
“Barcodes”
- Tag or barcode is at the end of the adapter
- The barcode is sequenced in the same reaction used to sequence the insert DNA
- The reads must be sorted and barcodes must be trimmed by the end user.
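The sorting and trimming step for barcoded reads can be sketched as follows. This is a toy illustration, not a production demultiplexer: it assumes the barcode is a fixed-length prefix at the start of each read, requires an exact match, and uses made-up barcodes, sample names, and reads.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample table (assumption for illustration).
BARCODES = {"ACGT": "sample_A", "TGCA": "sample_B"}
BARCODE_LEN = 4


def demultiplex(reads):
    """Sort (sequence, quality) pairs by barcode and trim the barcode.

    Reads whose prefix matches no known barcode are kept untrimmed
    under the key "unassigned".
    """
    by_sample = defaultdict(list)
    for sequence, quality in reads:
        sample = BARCODES.get(sequence[:BARCODE_LEN])
        if sample is None:
            by_sample["unassigned"].append((sequence, quality))
        else:
            # Trim the barcode from both the bases and the qualities
            # so the two strings stay the same length.
            by_sample[sample].append(
                (sequence[BARCODE_LEN:], quality[BARCODE_LEN:])
            )
    return dict(by_sample)


reads = [
    ("ACGTGATTTGGG", "IIIIHHHHGGGG"),
    ("TGCACCAAATAG", "IIIIFFFFEEEE"),
    ("NNNNGATTACAG", "!!!!DDDDCCCC"),
]
sorted_reads = demultiplex(reads)
for sample, sample_reads in sorted_reads.items():
    print(sample, sample_reads)
```

Real tools typically also tolerate a mismatch or two in the barcode, which this sketch leaves out.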

If the sequences come from an organism with an already sequenced genome, then you will want to map them to the reference sequence so that you know where they came from.
- Look for polymorphisms and structural changes
- If RNA, examine differences in expression levels
There are many mapping programs. Some popular ones:
- BWA. Non-splicing. Use for mapping genomic reads to a genomic reference or mRNA reads to a cDNA reference
- TopHat / Bowtie. Splicing. Use for mapping mRNA reads to a genomic reference.
- STAR. Splicing. Use for mapping mRNA reads to a genomic reference.
- kallisto. Non-splicing. Use for mapping mRNA reads to a cDNA reference.

If the sequences come from an organism without a reference, then you will need to perform a de novo assembly (not covered in this class).