Intro to Illumina Sequence Analysis

Julin Maloof

Illumina or Short-Read Sequencing

  • Allows the rapid and inexpensive sequencing of billions of base pairs of DNA or RNA in a single reaction.
  • Revolutionized many aspects of biology over the last decade.
  • Analyzing Illumina data is a critical skill for any bioinformaticist.
  • We will spend the next six labs working with an Illumina data set.

Three Videos for more info on Illumina sequencing

The Data Set

  • Illumina technology can be used to sequence RNA or DNA.
  • In this experiment we purified mRNA from:
    • 2 varieties of Brassica rapa
    • Multiple growth conditions:
      • Growth chamber: simulated sun and shade
      • Greenhouse: crowded and uncrowded plantings
      • Field
    • Multiple tissues: (see lab manual).
  • What can we learn from sequencing RNA?

What can we learn from sequencing RNA?

  • Transcript abundance (gene expression levels)
  • Intron/exon junctions (gene structure)
  • Transcript start and stop sites (gene structure)
  • Genetic variants (SNP and in/del discovery and genotyping)

Goals

For this series of labs:

  1. Learn about Illumina reads, how to map them, and quality control (Tuesday)
  2. How to view reads in a genome browser and how to find single nucleotide polymorphisms (Thursday)
  3. Find genes that are differentially expressed between genotypes or treatments (Next week)
  4. Ask if differentially expressed genes have any common functionality (gene ontologies) or promoter motifs
  5. Build a gene regulatory network to determine how genes connect to one another.

Illumina Data

Each flow cell of a Illumina machine produces ~ 350 million reads of DNA, each of which is 50 - 150 bp long.

This data is returned to the user in a FASTQ file

FASTQ

FASTQ files have 4 lines of information for each read

@HWUSI-EAS100R:6:73:941:1973#0/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCC
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**
  1. @SEQID
    1: Machine name
    2: Flow cell lane
    3: Tile
    4: X-position
    5: Y-position
    6: #index number
    7: read pair

    (Details vary for different software versions; see wiki )

FASTQ

FASTQ files have 4 lines of information for each read

@HWUSI-EAS100R:6:73:941:1973#0/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCC
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**
  1. @SEQID
  2. Sequence
  3. Starts with “+” and then usually blank
  4. Quality information

PHRED QUALITY

  • Each base has a quality score representing how confident the machine is that the base is correct.

  • These are called PHRED scores are range from 0 to ~ 40, where

\[ PHRED = -10 * log_{10}(p) \] and \( p \) is the probability that the reported base is wrong.

  • QUESTION: If there is a 1 in 100 chance that the base is wrong, what is the PHRED score? (Try this in R)

PHRED Qualities, part 2

But how can the following encode PHRED qualities?

!''*((((***+))%%%++)(%%%%).1***-+*''))**

In computerese each character is represented internally as a number. This is called the ASCII code.

For example ! has an ASCII code of 33, * has an ASCII code of 42, etc.

Thus, these characters represent numbers, and numbers can represent quality.
Why use characters instead of numbers?

PHRED Qualities, part 3

To add an additional wrinkle, the ASCII codes must be converted to the actual PHRED scores.

Why? ASCII characters 0 - 32 are invisible so they can't be used.

Additionally, different starting points have been used:

phred

Barcodes and sample indexing

  • For RNAseq one typically needs 10 - 20 Million reads per sample.
  • However the sequencer gives 350 Million reads per flow cell.
  • “Barcodes” or “Indexes” are used to uniquely associate reads with samples.

Summary: Barcodes and sample indexing

  • Allow multiple samples to be sequenced in a single lane.
  • Tag each DNA fragment with a sequence that is unique for each sample
  • “Indexes”
    • Tag or index is internal in the adapter and is sequenced in a separate reacion
    • Reads are automatically separated for the different samples
  • “Barcodes”
    • Tag or barcode is at the end of the adapter
    • The barcode is sequenced in the same reaction used to sequence the insert DNA
    • The reads must be sorted and barcodes must be trimmed by the end user.

What to do with your sequences

  • If the sequences come from an organism with an already sequenced genome, then you will want to map them to the reference sequence so that you know where they came from.
    • Look for polymorphisms and structural changes
    • If RNA, examine expression levels differences
  • There are many mapping programs. Some popular ones:
    • BWA. Non-splicing. Use for mapping genomic reads to a genomic reference or mRNA reads to a cDNA reference
    • Tophat / Bowtie. Splicing. Use for mapping mRNA reads to a genomic reference.
    • STAR. Splicing. Use for mapping mRNA reads to a genomic reference.
    • kallisto. Non-splicing. Use for mapping mRNA reads to a cDNA reference.

What to do with your sequences

  • If the sequences come from an organism without a reference, then you will need to perform a de novo assembly. (not covered in this class)

Workflow for tomorrow's lab

  1. Check sequence quality with fastqc
  2. Filter reads based on quality with Trimmomatic
  3. Split into samples based on barcodes with auto_barcode
  4. Map reads to find where the came from in the genome

File types

  • .fastq – file of short read data
  • .fa – fasta files for reference genome
  • .sam – sequence alignment/map file for mapped reads
  • .bam – the binary version of a sam file
  • .bai – index for bam files
  • .gff – genome annotation: information about where the genes are in the genome