Julin Maloof
UCSF Professor Eric Chow https://www.youtube.com/watch?v=mI0Fo9kaWqo (32 minutes)
SDSU Professor Rob Edwards https://www.youtube.com/watch?v=WneZp3fSJIk&t=13s (9 minutes)
Slick Illumina Video https://www.youtube.com/watch?v=fCd6B5HRaZ8 (5 minutes)
For this series of labs:
Each flow cell of an Illumina machine produces ~350 million reads of DNA, each of which is 50 - 150 bp long.
This data is returned to the user in a FASTQ file.
FASTQ files have 4 lines of information for each read:
@HWUSI-EAS100R:6:73:941:1973#0/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCC
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**
The first line (@SEQID) identifies the read. Its fields are:
1: Machine name
2: Flow cell lane
3: Tile
4: X-position
5: Y-position
6: #index number
7: read pair
(Details vary for different software versions; see wiki)
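The record layout and the ID fields above can be sketched in Python. This is a minimal illustration, not a full FASTQ parser: the function names `read_fastq` and `parse_seqid` are hypothetical, and the regular expression assumes the older ID style shown in the example (newer software versions format the ID differently).

```python
import re

# Pattern for the older Illumina ID style shown above (an assumption;
# other software versions use a different layout).
SEQID_PATTERN = re.compile(
    r"^(?P<machine>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<pair>\d+)$"
)

def read_fastq(lines):
    """Yield (seqid, sequence, quality) for each 4-line FASTQ record."""
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        yield header[1:], seq, qual  # drop the leading "@"

def parse_seqid(seqid):
    """Split a sequence ID into the seven fields listed above."""
    match = SEQID_PATTERN.match(seqid)
    if match is None:
        raise ValueError(f"unrecognized sequence ID: {seqid}")
    fields = match.groupdict()
    for key in ("lane", "tile", "x", "y", "pair"):
        fields[key] = int(fields[key])
    return fields

record = [
    "@HWUSI-EAS100R:6:73:941:1973#0/1",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCC",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**",
]

for seqid, seq, qual in read_fastq(record):
    print(parse_seqid(seqid))
    print(len(seq), len(qual))  # sequence and quality lines have equal length
```

In practice an established parser (for example, one from a bioinformatics library) is a safer choice than a hand-written one; the sketch is only meant to show how the four lines and seven fields fit together.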
Each base has a quality score representing how confident the machine is that the base is correct.
These are called PHRED scores and range from 0 to ~40, where
\[ \mathrm{PHRED} = -10 \log_{10}(p) \] and \( p \) is the probability that the reported base is wrong.
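To get a feel for the scale, the formula can be evaluated in both directions. The sketch below is only an illustration; the function names `prob_to_phred` and `phred_to_prob` are hypothetical.

```python
import math

def prob_to_phred(p):
    """PHRED score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

def phred_to_prob(q):
    """Error probability for a PHRED score q (inverse of the formula)."""
    return 10 ** (-q / 10)

for q in (10, 20, 30, 40):
    print(f"PHRED {q}: error probability {phred_to_prob(q):.4f}")
```

Every increase of 10 in the PHRED score corresponds to a 10-fold drop in the error probability, so a score of 20 means a 1 in 100 chance that the base is wrong and a score of 30 means 1 in 1000.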
But how can the following encode PHRED qualities?
!''*((((***+))%%%++)(%%%%).1***-+*''))**
In computerese each character is represented internally as a number. This is called the ASCII code.
For example, ! has an ASCII code of 33, * has an ASCII code of 42, etc.
Thus, these characters represent numbers, and numbers can represent quality.
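The character-to-number mapping can be checked directly. This short Python sketch uses the built-in `ord` and `chr` functions; the variable name `codes` is just for illustration.

```python
# ord() returns the numeric code of a character; chr() goes the other way.
print(ord("!"))  # 33
print(ord("*"))  # 42
print(chr(42))   # *

# The ASCII codes of the first four quality characters from the example read.
codes = [ord(ch) for ch in "!''*"]
print(codes)     # [33, 39, 39, 42]
```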
Why use characters instead of numbers?
To add an additional wrinkle, the ASCII codes must be converted to the actual PHRED scores.
Why? ASCII characters 0 - 32 are control characters or whitespace, which are invisible, so they can't be used.
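Putting the pieces together, decoding a quality string means subtracting a fixed starting point (offset) from each ASCII code. The sketch below assumes an offset of 33, so that ! maps to a PHRED score of 0; that choice is an assumption about how this particular file was encoded, and the function name `decode_quality` is hypothetical.

```python
def decode_quality(quality, offset=33):
    """Convert a FASTQ quality string to a list of PHRED scores.

    offset=33 is an assumption; files written with a different
    starting point need a different offset.
    """
    scores = [ord(ch) - offset for ch in quality]
    if min(scores) < 0:
        raise ValueError("negative score: the offset is probably wrong")
    return scores

quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**"
scores = decode_quality(quality)
print(scores[:4])               # [0, 6, 6, 9]
print(min(scores), max(scores)) # 0 16
```

A negative score is a quick sign that the wrong offset was used, which is why the sketch checks for it.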
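Additionally, different starting points (offsets) have been used: 33 in Sanger-style files and Illumina 1.8 and later, and 64 in some older Illumina pipelines.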
fastqc
Trimmomatic
auto_barcode