Updated May 17, 2022


What is a Distribution?

  • A distribution is a mathematical function that describes the probability of observations

  • It turns out that most observational data can be described by one of several common distributions

    • Continuous data often follows a Gaussian or Normal distribution
    • Count data often follows a Poisson distribution
    • When one of two events can occur (coin toss), the data often follows a binomial distribution

Gaussian or “Normal” Distribution

Defined by mean \(\mu\) and standard deviation \(\sigma\).

Probability density (relative probability) of x: \(f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}\)
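The density formula can be evaluated directly. The sketch below is only an illustration; the function name `gaussian_pdf` and the example values are assumptions, not part of the original slides.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of x under a Gaussian with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# For a standard normal (mu = 0, sigma = 1) the density peaks at the mean
# and is symmetric around it.
peak = gaussian_pdf(0.0)
left = gaussian_pdf(-1.0)
right = gaussian_pdf(1.0)
print(peak, left, right)
```

Changing `mu` shifts the curve left or right; changing `sigma` makes it narrower or wider.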

Gaussian Distribution: Vary mean and std. dev


Gaussian Distribution

If you go out and measure something, there is a good chance that your observations can be described by a mean and standard deviation and will look more or less like this.

Poisson Distribution

Defined by a single parameter \(\lambda\) that is both the mean and variance of the distribution.

Probability of observing x counts: \(p(x) = \frac{\lambda^xe^{-\lambda}}{x!}\)
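As a quick numerical check that \(\lambda\) is both the mean and the variance, the probabilities can be summed directly. This is a minimal sketch; `poisson_pmf` and the choice \(\lambda = 3\) are illustrative assumptions.

```python
import math

def poisson_pmf(x, lam):
    """Probability of observing exactly x counts when the expected count is lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

lam = 3.0
xs = range(0, 60)  # counts beyond 60 have negligible probability for lam = 3
probs = [poisson_pmf(x, lam) for x in xs]

total = sum(probs)
mean = sum(x * p for x, p in zip(xs, probs))
variance = sum((x - mean) ** 2 * p for x, p in zip(xs, probs))
print(total, mean, variance)  # all three agree with 1, lam, lam up to rounding
```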

Poisson Distribution

  • If you are counting things that are randomly dispersed in space or time, there is a good chance your data will look like this.
  • Example: go to a forest.
    • Mark a grid of 10 m x 10 m squares.
    • Count the number of Ponderosa pines per square.
    • Expect the counts to be Poisson distributed (if nothing else is influencing the distribution of trees).


Overdispersion is when there is more variance in the counts (number of trees per grid square, or number of reads per gene per sample) than you would expect from a Poisson distribution.
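The forest example can be simulated. The sketch below scatters trees over a grid and computes the variance-to-mean ratio of the counts per square, which is close to 1 for Poisson-like data and well above 1 for overdispersed data. The grid size, tree number, and the "clumped" scenario (80% of trees landing in one quarter of the forest) are all made-up assumptions for illustration.

```python
import random
import statistics

random.seed(1)
SIDE = 20        # hypothetical forest of 20 x 20 grid squares
N_TREES = 1200   # hypothetical number of trees

def counts_per_square(points):
    """Count how many (x, y) points fall in each grid square."""
    counts = [0] * (SIDE * SIDE)
    for x, y in points:
        col = min(int(x), SIDE - 1)
        row = min(int(y), SIDE - 1)
        counts[row * SIDE + col] += 1
    return counts

def dispersion_index(counts):
    """Variance divided by mean; about 1 for Poisson counts."""
    return statistics.variance(counts) / statistics.mean(counts)

# Trees placed completely at random: nothing influences where they grow.
random_trees = [(random.uniform(0, SIDE), random.uniform(0, SIDE))
                for _ in range(N_TREES)]

def clumped_x():
    """80% of trees land in the left quarter of the forest (made-up scenario)."""
    if random.random() < 0.8:
        return random.uniform(0, SIDE / 4)
    return random.uniform(0, SIDE)

clumped_trees = [(clumped_x(), random.uniform(0, SIDE)) for _ in range(N_TREES)]

random_index = dispersion_index(counts_per_square(random_trees))
clumped_index = dispersion_index(counts_per_square(clumped_trees))
print(random_index, clumped_index)
```

Both forests have the same mean count per square; only the clumped one has much more variance than the mean.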

Negative Binomial Distribution

  • The negative binomial distribution is parameterized by a mean and a dispersion parameter
  • This allows it to fit the “overdispersed” distributions shown here.

The dispersion-estimation functions in edgeR estimate this dispersion parameter.
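With mean \(\mu\) and dispersion \(\phi\), the negative binomial variance is \(\mu + \phi\mu^2\), so \(\phi = 0\) would recover the Poisson case. The sketch below checks this numerically. The function name `neg_binomial_pmf` and the values of `mu` and `phi` are illustrative assumptions; this is not edgeR code.

```python
import math

def neg_binomial_pmf(x, mu, phi):
    """Negative binomial probability of x counts, given mean mu and dispersion phi (> 0)."""
    r = 1.0 / phi  # "size" parameter of the usual parameterization
    log_p = (math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
             + r * math.log(r / (r + mu))
             + x * math.log(mu / (r + mu)))
    return math.exp(log_p)

mu, phi = 10.0, 0.5
xs = range(0, 2000)  # far enough into the tail that the remainder is negligible
probs = [neg_binomial_pmf(x, mu, phi) for x in xs]

total = sum(probs)
mean = sum(x * p for x, p in zip(xs, probs))
variance = sum((x - mean) ** 2 * p for x, p in zip(xs, probs))
print(total, mean, variance)  # variance is close to mu + phi * mu**2 = 60
```

A Poisson distribution with the same mean would have variance 10; the extra 50 comes from the dispersion term.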

Why the distribution matters

  • Statistical tests make assumptions about the underlying distribution of data.
  • A t-test asks how probable data like yours would be if they came from a single Gaussian distribution (the null hypothesis) rather than from two separate Gaussian groups (with different means).
  • If the process that generates the data isn’t Gaussian in the first place, then the p-values are not valid.
  • You have to use a test appropriate to your data.

Overview of Statistical Models

Statistical models

  • Mathematical representation of our observations
  • Allows us to test whether various experimental factors (genotype, treatment) are important predictors of our observations.
  • One way to do this is to compare how well model predictions “fit” the observations.
    • Does a model with more predictors fit better than a simpler model?
    • Which model matches the observations better?
    • Which model explains more of the variance?


  • You measure the height of 500 plants grown in a dense planting (DP) or non-dense planting (NDP). You want to know if the treatment influences height.
    • Null hypothesis: treatment (DP vs NDP) does not influence height. Your observations come from a single Gaussian distribution.
    • Alternative hypothesis: treatment does influence height. Your data come from two Gaussian distributions with different means.
  • Which of these matches the observed data better?
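One way to ask which hypothesis matches the data better is to compare the Gaussian log-likelihood of the observations under each model. The sketch below uses simulated heights; the means, standard deviation, and effect size are invented for illustration and do not come from a real experiment.

```python
import math
import random
import statistics

random.seed(42)
# Simulated heights for 250 NDP and 250 DP plants (all numbers are made up).
ndp = [random.gauss(100, 10) for _ in range(250)]
dp = [random.gauss(110, 10) for _ in range(250)]
heights = ndp + dp

def fit_sigma(values, means):
    """Maximum-likelihood standard deviation of the residuals."""
    return math.sqrt(sum((v - m) ** 2 for v, m in zip(values, means)) / len(values))

def gaussian_loglik(values, means, sigma):
    """Log-likelihood of the values given a predicted mean for each one."""
    return sum(
        -math.log(sigma * math.sqrt(2 * math.pi)) - 0.5 * ((v - m) / sigma) ** 2
        for v, m in zip(values, means)
    )

# Null hypothesis: one mean for every plant.
null_means = [statistics.mean(heights)] * len(heights)
loglik_null = gaussian_loglik(heights, null_means, fit_sigma(heights, null_means))

# Alternative hypothesis: one mean per treatment.
alt_means = [statistics.mean(ndp)] * len(ndp) + [statistics.mean(dp)] * len(dp)
loglik_alt = gaussian_loglik(heights, alt_means, fit_sigma(heights, alt_means))

print(loglik_null, loglik_alt)
```

The two-mean model can never fit worse than the single-mean model; the statistical question is whether it fits better by more than chance alone would explain.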


model = single mean


compare models

50 observations

This also works with more realistic sample sizes

Model Matrix

We need a mathematical way of describing the data. First, let’s replot it.

Model Matrix

We can draw a best fit line through the data. This is a model of the data.

Model Matrix

  • The line can be described with the equation \(height = intercept + trt\_DP*slope\_DP\)
    • slope_DP is the difference in mean height between DP and NDP plants.
    • trt_DP is an indicator for whether or not the plant was grown in DP.
    • trt_DP can be 0 (NDP) or 1 (DP)
  • NDP plants can be described with the intercept value
  • DP plants can be described with the intercept + slope_DP

Model Matrix

The model matrix describes how each plant can be plugged in to this equation:

\(height = intercept + trt\_DP*slope\_DP\)

Plant  Trt  Intercept  trt_DP
1      NDP  1          0
2      NDP  1          0
3      NDP  1          0
4      DP   1          1
5      DP   1          1
6      DP   1          1
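The model matrix above can be built and fitted by least squares in a few lines. In this sketch the six height values are hypothetical, and the 2 x 2 normal equations are solved by hand to avoid needing any external library.

```python
treatments = ["NDP", "NDP", "NDP", "DP", "DP", "DP"]
heights = [10.0, 11.0, 12.0, 14.0, 15.0, 16.0]  # hypothetical measurements

# One row per plant: [Intercept, trt_DP]
model_matrix = [[1, 1 if trt == "DP" else 0] for trt in treatments]

# Least squares solves (X'X) b = X'y. For this matrix,
# X'X = [[n, n_dp], [n_dp, n_dp]] and X'y = [sum_y, sum_dp_y].
n = len(heights)
n_dp = sum(row[1] for row in model_matrix)
sum_y = sum(heights)
sum_dp_y = sum(row[1] * y for row, y in zip(model_matrix, heights))

det = n * n_dp - n_dp * n_dp
intercept = (n_dp * sum_y - n_dp * sum_dp_y) / det
slope_DP = (n * sum_dp_y - n_dp * sum_y) / det

print(intercept, slope_DP)  # 11.0 4.0
```

The fitted intercept equals the mean NDP height (11) and the fitted slope_DP equals the DP mean minus the NDP mean (15 - 11 = 4).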

Model Comparison

Full Model: \(height = intercept + trt\_DP*slope\_DP\)
Reduced Model: \(height = intercept\)

Which fits the data better?
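A common way to answer this is to compare the residual sum of squares (RSS) of the two models with an F statistic. The sketch below uses six hypothetical heights and stops at the F statistic; turning it into a p-value requires the F distribution, which is not included here.

```python
ndp = [10.0, 11.0, 12.0]  # hypothetical NDP heights
dp = [14.0, 15.0, 16.0]   # hypothetical DP heights
heights = ndp + dp

def mean(values):
    return sum(values) / len(values)

def rss(values, predictions):
    """Residual sum of squares between observations and model predictions."""
    return sum((v - p) ** 2 for v, p in zip(values, predictions))

# Reduced model: height = intercept (one mean for all plants)
reduced_pred = [mean(heights)] * len(heights)
# Full model: height = intercept + trt_DP * slope_DP (one mean per treatment)
full_pred = [mean(ndp)] * len(ndp) + [mean(dp)] * len(dp)

rss_reduced = rss(heights, reduced_pred)
rss_full = rss(heights, full_pred)

extra_params = 1               # the full model adds slope_DP
resid_df = len(heights) - 2    # observations minus parameters in the full model
f_stat = ((rss_reduced - rss_full) / extra_params) / (rss_full / resid_df)

print(rss_reduced, rss_full, f_stat)  # 28.0 4.0 24.0
```

A large F statistic means the extra term reduces the unexplained variation by much more than would be expected by chance.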


Additive gt and treatment effects

Interactions between gt and treatment effects

Model Matrix: More levels

What if there are three levels to the treatment? For example:

  • NDP (Not Dense Planting)
  • DP (Dense Planting)
  • UDP (Ultra Dense Planting)

Because these are categories, each level other than the reference (NDP) gets its own indicator variable and term in the equation:

\(height = intercept + trt\_DP*slope\_DP + trt\_UDP*slope\_UDP\)

Plant  Trt  Intercept  trt_DP  trt_UDP
1      NDP  1          0       0
2      NDP  1          0       0
3      NDP  1          0       0
4      DP   1          1       0
5      DP   1          1       0
6      DP   1          1       0
7      UDP  1          0       1
8      UDP  1          0       1
9      UDP  1          0       1
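The indicator coding in this table generalizes to any number of levels. The helper below is a hypothetical sketch (the name `indicator_matrix` is an assumption) that treats one level as the reference, absorbed into the intercept, and gives every other level its own 0/1 column.

```python
def indicator_matrix(levels, reference, prefix="trt_"):
    """Build an intercept column plus one indicator column per non-reference level."""
    # dict.fromkeys keeps the levels in order of first appearance.
    others = [lv for lv in dict.fromkeys(levels) if lv != reference]
    header = ["Intercept"] + [prefix + lv for lv in others]
    rows = [[1] + [1 if lv == other else 0 for other in others] for lv in levels]
    return header, rows

treatments = ["NDP"] * 3 + ["DP"] * 3 + ["UDP"] * 3
header, rows = indicator_matrix(treatments, reference="NDP")

print(header)
for plant, row in enumerate(rows, start=1):
    print(plant, treatments[plant - 1], row)
```

Each plant has at most one indicator set to 1; NDP plants have none, so they are described by the intercept alone.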

Model Matrix: Two factors

What if there are two different factors (treatment and genotype)?
\(height = intercept + trt\_DP*slope\_DP + gt\_R500*slope\_R500\)
Intercept is IMB211, NDP

Plant  Trt  GT      Intercept  trt_DP  gt_R500
1      NDP  IMB211  1          0       0
2      NDP  IMB211  1          0       0
3      NDP  IMB211  1          0       0
4      DP   IMB211  1          1       0
5      DP   IMB211  1          1       0
6      DP   IMB211  1          1       0
7      NDP  R500    1          0       1
8      NDP  R500    1          0       1
9      NDP  R500    1          0       1
10     DP   R500    1          1       1
11     DP   R500    1          1       1
12     DP   R500    1          1       1

Brief thoughts on edgeR empirical Bayes


  • You want to make the best estimates possible
  • You have limited information on each individual gene
  • But you have a lot of information about genes in total
  • Use understanding of how genes behave in general to improve your estimate of each gene
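The idea of borrowing information across genes can be illustrated with a toy shrinkage estimator: each gene's own estimate is pulled toward a value shared by all genes, with the pull controlled by how much weight the shared value gets. This is only a sketch of the general idea and is not edgeR's actual procedure; the gene names, estimates, and weights are all hypothetical.

```python
import statistics

# Hypothetical per-gene estimates (e.g. of dispersion), each based on few samples.
gene_estimates = {"geneA": 0.05, "geneB": 0.40, "geneC": 0.10, "geneD": 0.90}

# What genes look like "in general": here simply the average across genes.
shared = statistics.mean(gene_estimates.values())

def shrink(estimate, shared_value, own_weight, prior_weight):
    """Weighted average of a gene's own estimate and the shared value."""
    return (own_weight * estimate + prior_weight * shared_value) / (own_weight + prior_weight)

# own_weight and prior_weight are made-up numbers for illustration.
shrunk = {
    gene: shrink(est, shared, own_weight=3, prior_weight=10)
    for gene, est in gene_estimates.items()
}

for gene in gene_estimates:
    print(gene, gene_estimates[gene], round(shrunk[gene], 4))
```

Extreme per-gene values move the most, which is the point: with little data per gene, an extreme estimate is more likely to be noise than signal.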