Updated May 17, 2022

Distributions

What is a Distribution?

  • A distribution is a mathematical function that describes the probability of observations

  • It turns out that most observational data can be described by one of several common distributions

    • Continuous data often follows a Gaussian or Normal distribution
    • Count data often follows a Poisson distribution
    • When one of two events can occur (coin toss), the data often follows a binomial distribution

Gaussian or “Normal” Distribution

Defined by mean \(\mu\) and standard deviation \(\sigma\).

Probability density (relative probability) of x: \(f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}\)
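
As a minimal sketch, the density formula above can be evaluated directly. The function name `gaussian_pdf` and the default values are illustrative choices, not part of the original slides.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of x under a Gaussian with mean mu and std. dev. sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Density at the mean of a standard normal: 1 / sqrt(2 * pi)
peak = gaussian_pdf(0.0)

# Crude Riemann sum over [-8, 8]; the total area should be very close to 1
step = 0.001
area = sum(gaussian_pdf(-8 + i * step) * step for i in range(16000))
```

Changing `mu` shifts the curve left or right; changing `sigma` makes it wider or narrower, but the area under it stays at 1.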

Gaussian Distribution: Vary mean and std. dev

Defined by mean \(\mu\) and standard deviation \(\sigma\).

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2} \]

Gaussian Distribution

If you go out and measure something, there is a good chance your observations can be described by a mean and standard deviation and will look more or less like this.

Poisson Distribution

Defined by a single parameter \(\lambda\) that is both the mean and variance of the distribution.

Probability of observing x counts: \(p(x) = \frac{\lambda^xe^{-\lambda}}{x!}\)
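
A short sketch of the formula above, which also checks numerically that the mean and variance both equal \(\lambda\). The function name `poisson_pmf` and the value \(\lambda = 4\) are arbitrary choices for illustration.

```python
import math

def poisson_pmf(x, lam):
    """Probability of observing exactly x counts when the expected count is lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

lam = 4.0
# Truncate the infinite sum at 100 counts; the tail beyond that is negligible for lam = 4
probs = [poisson_pmf(x, lam) for x in range(100)]

total = sum(probs)
mean = sum(x * p for x, p in enumerate(probs))
variance = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
```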

Poisson Distribution

  • If you are counting things that are randomly dispersed in space or time there is a good chance your data will look like this.
  • Example: go to a forest.
    • Mark out a grid of 10 m x 10 m squares.
    • Count the number of Ponderosa Pines per square.
    • Expect the counts to follow a Poisson distribution (if nothing else is influencing the distribution of trees).
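
The forest example can be mimicked with a small simulation. This is only a sketch: the number of squares and trees are made-up values, and trees are assumed to land in squares completely at random. If the counts behave like a Poisson distribution, the variance of the counts should be close to their mean.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

n_squares = 2000   # hypothetical number of grid squares
n_trees = 20000    # hypothetical number of trees, placed uniformly at random

# Drop each tree into a randomly chosen square and count trees per square
counts = [0] * n_squares
for _ in range(n_trees):
    counts[int(random.random() * n_squares)] += 1

mean = sum(counts) / n_squares
variance = sum((c - mean) ** 2 for c in counts) / (n_squares - 1)
ratio = variance / mean  # close to 1 for Poisson-like counts
```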

Overdispersion

Overdispersion is when there is more variance in the counts (number of trees per grid square, or number of reads per gene per sample) than you would expect from a Poisson distribution.

Negative Binomial Distribution

  • The negative binomial distribution is parameterized by a mean and a dispersion parameter
  • This allows it to fit the “overdispersed” distributions shown here.

The dispersion-estimating functions in edgeR estimate this dispersion parameter.
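
To see how a dispersion parameter adds variance, here is a sketch that assumes the mean-and-dispersion parameterization in which variance = mean + dispersion × mean² (the form used in the edgeR documentation). The function name `nb_pmf` and the example values are illustrative only.

```python
import math

def nb_pmf(x, mean, dispersion):
    """Negative binomial probability of x counts, parameterized by mean and dispersion."""
    size = 1.0 / dispersion          # often called "size" or r
    p = size / (size + mean)
    log_prob = (math.lgamma(x + size) - math.lgamma(size) - math.lgamma(x + 1)
                + size * math.log(p) + x * math.log1p(-p))
    return math.exp(log_prob)

mean, dispersion = 10.0, 0.5
# Truncate the infinite sum at 2000 counts; the remaining tail is negligible here
probs = [nb_pmf(x, mean, dispersion) for x in range(2000)]

nb_mean = sum(x * q for x, q in enumerate(probs))
nb_var = sum((x - nb_mean) ** 2 * q for x, q in enumerate(probs))
expected_var = mean + dispersion * mean ** 2   # 10 + 0.5 * 100 = 60
```

A Poisson distribution with the same mean would have a variance of only 10, so a dispersion of 0.5 makes the counts six times more variable in this example.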

Why the distribution matters

  • Statistical tests make assumptions about the underlying distribution of data.
  • A t-test asks how likely data like yours would be if they all came from a single Gaussian distribution (the null hypothesis) rather than from two separate Gaussian groups (with different means).
  • If the process that generates the data isn’t Gaussian in the first place then the p-values are not valid.
  • You have to use a test appropriate to your data.

Overview of Statistical Models

Statistical models

  • Mathematical representation of our observations
  • Allows us to test whether various experimental factors (genotype, treatment) are important predictors of our observations.
  • One way to do this is to compare how well model predictions “fit” the observations.
    • Does a model with more predictors fit better than a simpler model?
    • Which model matches the observations better?
    • Which model explains more of the variance?

Example

  • You measure the height of 500 plants grown in either a dense planting (DP) or a non-dense planting (NDP). You want to know if the treatment influences height.
    • Null hypothesis: treatment (DP vs NDP) does not influence height. Your observations come from a single Gaussian distribution.
    • Alternative hypothesis: treatment does influence height. Your data comes from two Gaussian distributions with different means.
  • Which of these hypotheses matches the observed data better?

Example

model = single mean

Example

compare models

50 observations

This also works with more realistic sample sizes

Model Matrix

We need a mathematical way of describing the data. First, let’s replot it.

Model Matrix

We can draw a best fit line through the data. This is a model of the data.

Model Matrix

  • The line can be described with the equation \(height = intercept + trt\_DP*slope\_DP\)
    • slope_DP is the difference in mean height between DP and NDP plants (DP minus NDP).
    • trt_DP is an indicator for whether or not the plant was grown in DP.
    • trt_DP can be 0 (NDP) or 1 (DP)
  • NDP plants can be described with the intercept value
  • DP plants can be described with the intercept + slope_DP
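
For a single two-level treatment, the least-squares values of the intercept and slope_DP are simply the NDP mean and the difference between the group means. The sketch below uses made-up heights purely for illustration.

```python
from statistics import mean

# Hypothetical heights (arbitrary units), for illustration only
heights = {"NDP": [10.0, 12.0, 11.0], "DP": [15.0, 16.0, 17.0]}

intercept = mean(heights["NDP"])             # predicted height when trt_DP = 0
slope_DP = mean(heights["DP"]) - intercept   # change in height when trt_DP = 1

def predict(trt_DP):
    """Predicted height for trt_DP = 0 (NDP) or 1 (DP)."""
    return intercept + trt_DP * slope_DP
```

Here `predict(0)` returns the NDP mean and `predict(1)` returns the DP mean.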

Model Matrix

The model matrix describes how each plant can be plugged in to this equation:

\(height = intercept + trt\_DP*slope\_DP\)

Plant Trt Intercept trt_DP
1 NDP 1 0
2 NDP 1 0
3 NDP 1 0
4 DP 1 1
5 DP 1 1
6 DP 1 1

Model Comparison

Full Model: \(height = intercept + trt\_DP*slope\_DP\)
vs
Reduced Model: \(height = intercept\)

Which fits the data better?
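
One common way to answer this is to compare how much unexplained variation (the residual sum of squares) each model leaves behind. The sketch below uses made-up heights and computes an F statistic by hand; it illustrates the idea and is not the procedure from any particular package.

```python
from statistics import mean

# Hypothetical heights (arbitrary units), for illustration only
ndp = [10.0, 12.0, 11.0]
dp = [15.0, 16.0, 17.0]
all_heights = ndp + dp

def rss(values, prediction):
    """Residual sum of squares around a single predicted value."""
    return sum((v - prediction) ** 2 for v in values)

# Reduced model: one mean for every plant
rss_reduced = rss(all_heights, mean(all_heights))
# Full model: a separate mean for each treatment
rss_full = rss(ndp, mean(ndp)) + rss(dp, mean(dp))

# F statistic: improvement per extra parameter, relative to the leftover noise
extra_params = 1                       # the full model adds slope_DP
df_residual = len(all_heights) - 2     # observations minus full-model parameters
f_stat = ((rss_reduced - rss_full) / extra_params) / (rss_full / df_residual)
```

A large F statistic means the extra term removes far more variation than would be expected from noise alone, which favors the full model.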

Interactions

Additive gt and treatment effects

Interactions between gt and treatment effects
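
Written as equations, the two cases might look like the following. This assumes two genotypes with an indicator gt_R500 for the non-reference genotype; the coefficient names slope_R500 and slope_DP:R500 are labels chosen here for illustration.

Additive effects only:

\[ height = intercept + trt\_DP*slope\_DP + gt\_R500*slope\_R500 \]

With an interaction term:

\[ height = intercept + trt\_DP*slope\_DP + gt\_R500*slope\_R500 + trt\_DP*gt\_R500*slope\_{DP:R500} \]

The interaction coefficient is only added for plants where both indicators are 1, so it measures how much the treatment effect differs between the two genotypes.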

Model Matrix More levels

What if there are three levels to the treatment, e.g.

  • NDP (Not Dense Planting)
  • DP (Dense Planting)
  • UDP (Ultra Dense Planting)

Because these are categories, each level other than the reference (NDP) gets its own indicator variable and term in the equation:

\(height = intercept + trt\_DP*slope\_DP + trt\_UDP*slope\_UDP\)

Plant Trt Intercept trt_DP trt_UDP
1 NDP 1 0 0
2 NDP 1 0 0
3 NDP 1 0 0
4 DP 1 1 0
5 DP 1 1 0
6 DP 1 1 0
7 UDP 1 0 1
8 UDP 1 0 1
9 UDP 1 0 1
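
A table like this can be built mechanically from the treatment labels. The helper below is a hypothetical sketch (the name `model_matrix` is chosen here for illustration): it treats one level as the reference and creates a 0/1 indicator column for every other level.

```python
def model_matrix(treatments, reference):
    """Return (column_names, rows): an intercept plus one 0/1 indicator per non-reference level."""
    levels = sorted(set(treatments) - {reference})
    names = ["Intercept"] + [f"trt_{level}" for level in levels]
    rows = [[1] + [int(t == level) for level in levels] for t in treatments]
    return names, rows

treatments = ["NDP"] * 3 + ["DP"] * 3 + ["UDP"] * 3
names, rows = model_matrix(treatments, reference="NDP")

for plant, (trt, row) in enumerate(zip(treatments, rows), start=1):
    print(plant, trt, *row)
```

With only NDP and DP in the input, the same function yields the two-column matrix (Intercept, trt_DP) from the earlier slide.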

Model Matrix More levels

What if there are two different factors (treatment and genotype)?
\(height = intercept + trt\_DP*slope\_DP + gt\_R500*slope\_R500\)
Intercept is IMB211, NDP

Plant Trt GT Intercept trt_DP gt_R500
1 NDP IMB 1 0 0
2 NDP IMB 1 0 0
3 NDP IMB 1 0 0
4 DP IMB 1 1 0
5 DP IMB 1 1 0
6 DP IMB 1 1 0
7 NDP R500 1 0 1
8 NDP R500 1 0 1
9 NDP R500 1 0 1
10 DP R500 1 1 1
11 DP R500 1 1 1
12 DP R500 1 1 1
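
Each row of the model matrix is multiplied by the coefficients to give that plant's predicted height. The sketch below assumes an additive model with columns Intercept, trt_DP and gt_R500; the coefficient values (and the name slope_R500 for the genotype coefficient) are hypothetical and chosen only for illustration.

```python
def design_row(trt, gt):
    """Model-matrix row [Intercept, trt_DP, gt_R500] for one plant."""
    return [1, int(trt == "DP"), int(gt == "R500")]

# Hypothetical coefficients: [intercept, slope_DP, slope_R500]
coefficients = [10.0, 5.0, -2.0]

def predict(trt, gt):
    """Predicted height: each indicator times its coefficient, summed."""
    return sum(x * b for x, b in zip(design_row(trt, gt), coefficients))

predictions = {(trt, gt): predict(trt, gt)
               for gt in ("IMB211", "R500")
               for trt in ("NDP", "DP")}
```

The reference group (IMB211, NDP) gets the intercept alone, and every other group adds the coefficients for whichever indicators are 1.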

Brief thoughts on edgeR empirical Bayes

Bayes

  • You want to make the best estimates possible
  • You have limited information on each individual gene
  • But you have a lot of information about genes in total
  • Use understanding of how genes behave in general to improve your estimate of each gene
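
The general idea of borrowing information can be sketched as a weighted average between a gene's own estimate and an estimate shared across all genes. This is a deliberately simplified illustration and not edgeR's actual procedure; the function name `shrink`, the `prior_weight` parameter, and all numbers are hypothetical.

```python
def shrink(gene_estimate, shared_estimate, n_samples, prior_weight):
    """Weighted average of a gene's own estimate and the estimate shared across genes."""
    w = n_samples / (n_samples + prior_weight)   # trust in the gene's own data
    return w * gene_estimate + (1 - w) * shared_estimate

# Hypothetical dispersion values, for illustration only
shared = 0.10
few_samples = shrink(0.50, shared, n_samples=3, prior_weight=10)
many_samples = shrink(0.50, shared, n_samples=100, prior_weight=10)
```

With only a few samples the gene's estimate is pulled strongly toward the shared value; with many samples it stays close to what the gene's own data say.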