QQPlots in GWAS

Updated April 26, 2022

QQ Plots and GWAS

Purpose: Determine if there are a likely a large number of false positive in the GWAS
Method: Compare the p-values from the GWAS to those expected from doing the same number of SNP tests if there were no true associations.
Logic: Only a very, very small number of SNPs should be associated with our trait.
Therefore, the distribution of p-values obtained from the true GWAS should mostly match those from random data.

What is QQ anyway?

Q-Q stands for quantile-quantile
OK so what is a quantile?
You are probably familiar with percentiles.
Quantiles are similar but are cut points that divide a set of observations (or a distribution) into evenly sized groups.
For example, quartiles divide a set of observations into 4 groups, split at the 25th, 50th, and 75th percentile.
An example may help:

Quantiles example

somedata <- rnorm(n=20, mean=10, sd=3) %>% sort()
somedata

##  [1]  5.384034  5.446049  6.233185  7.132912  7.564635  7.855319  8.083410
##  [8]  8.874314  8.969542  9.111669  9.173461 10.308713 10.498773 10.874464
## [15] 11.463326 11.605816 12.002330 12.171091 12.489762 15.247534

quantile(somedata, c(.25,.5,.75))

##       25%       50%       75% 
##  7.782648  9.142565 11.498949

the 25% quantile is between the 5th and 6th data point
the 50% is between the 10th and 11th
and the 75% is between 15th and 16th

What are expected p-values?

Expected p-values are those we would expect to see in a randomized data set.
These are simple to calculate. If we had a 100 randomized data sets (or SNPs), then:
We expect 1 test (1%) to have a p-value of <= 0.01
We expect 5 tests (5%) to have a p-value of <= 0.05
We expect 75 tests (75%) to have a p-value of <= 0.75
Thus if you know the quantile, you know the expected p-value!

But how does this work in practice?

GWAS with 100 SNPs:

First make a matrix of RANDOM SNPs

set.seed(123)
nsnps <- 100
nplants <- 100
snpmatrix <- matrix(nrow=nplants, ncol = nsnps)
for (s in 1:nsnps) {
  snpmatrix[,s] <- sample(sample(c("A","C","G","T"),2), 100, replace = TRUE)
}
colnames(snpmatrix) <- str_c("SNP", 1:nsnps)

SNP matrix

datatable(head(snpmatrix[,1:10]), rownames = FALSE, options=list(searching=FALSE) )

Add pheno data

set.seed(456)
gwasdata <- tibble(
  plantID=1:100,
  height=rnorm(100, mean=c(9,10), sd=1.5),
  as_tibble(snpmatrix)) %>%
  mutate(SNP10=rep(c("C", "G"), length.out=nplants)) # a true positive!

GWAS table

gwasdata %>% datatable(rownames = FALSE, options=list(searching=FALSE) ) %>% formatSignif("height", 4)

Test association for each SNP and get pvalue

First: pivot to longer

gwaslong <- gwasdata %>%
  pivot_longer(c(-plantID, -height), names_to="SNP_name", values_to = "genotype") %>%
  arrange(SNP_name)
gwaslong

## # A tibble: 10,000 × 4
##    plantID height SNP_name genotype
##      <int>  <dbl> <chr>    <chr>   
##  1       1   6.98 SNP1     G       
##  2       2  10.9  SNP1     T       
##  3       3  10.2  SNP1     G       
##  4       4   7.92 SNP1     T       
##  5       5   7.93 SNP1     T       
##  6       6   9.51 SNP1     T       
##  7       7  10.0  SNP1     G       
##  8       8  10.4  SNP1     G       
##  9       9  10.5  SNP1     T       
## 10      10  10.9  SNP1     T       
## # … with 9,990 more rows

Next do a t-test on each SNP and extract p value

pvals <- gwaslong %>% group_by(SNP_name) %>%
  summarize(gwas_pval=t.test(height ~ genotype)$p.value)

observed pvals table

These are the observed pvalues

sort based on p-value, add quantile

For these type of plots often we calculate a quantile for every test.

pvals <- pvals %>% 
  arrange(gwas_pval) %>%
  mutate(quantile=1:nrow(pvals)/nrow(pvals))

observed pvals by quantile

calculate expected p values, convert to -log10(p)

pvals <- pvals %>% 
  mutate(exp_pval = quantile,
         neglog10_gwas = -log10(gwas_pval),
         neglog10_expected = -log10(exp_pval))

observed and expected pvals

plot it

pvals %>% ggplot(aes(x=neglog10_expected, y=neglog10_gwas)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0)