Updated May 17, 2022
A distribution is a mathematical function that describes the probability of observations.
It turns out that most observational data can be described by one of several common distributions.
The normal distribution is defined by its mean \(\mu\) and standard deviation \(\sigma\).
Probability density (relative probability) of x: \(f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}\)
If you go out and measure something, there is a good chance your observations can be described by a mean and standard deviation and will follow, more or less, this bell-shaped curve.
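The density formula can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function name `normal_pdf` and the values \(\mu = 10\), \(\sigma = 2\) are arbitrary choices for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Probability density of x under a normal distribution
    with mean mu and standard deviation sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Density at the mean and one standard deviation to either side.
# mu = 10 and sigma = 2 are arbitrary illustrative values.
for x in (8, 10, 12):
    print(x, normal_pdf(x, mu=10, sigma=2))
```

The density is highest at the mean and symmetric around it, so the values at 8 and 12 are equal.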
The Poisson distribution is defined by a single parameter \(\lambda\) that is both the mean and the variance of the distribution.
Probability of observing x counts: \(p(x) = \frac{\lambda^xe^{-\lambda}}{x!}\)
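As a quick check that \(\lambda\) really is both the mean and the variance, the probabilities can be summed numerically. This is a sketch using only the Python standard library; the name `poisson_pmf` and the value \(\lambda = 3\) are illustrative assumptions.

```python
import math

def poisson_pmf(x, lam):
    """Probability of observing exactly x counts when the expected count is lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

lam = 3.0          # arbitrary illustrative value
xs = range(0, 60)  # far enough into the tail that the leftover probability is negligible

# Mean and variance computed from the probabilities themselves.
mean = sum(x * poisson_pmf(x, lam) for x in xs)
variance = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in xs)
print(mean, variance)  # both come out approximately equal to lam
```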
Overdispersion is when there is more variance in the counts (number of trees per grid, or number of reads per gene per sample) than you would expect from a Poisson distribution.
The dispersion-estimating functions in edgeR (for example `estimateDisp()`) estimate the dispersion parameter, which quantifies this extra variance.
This also works with more realistic sample sizes.
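To make the idea of a dispersion parameter concrete, here is a rough sketch that compares the variance of some counts to their mean. It assumes the negative binomial relationship \(variance = mean + dispersion \cdot mean^2\) and uses a simple method-of-moments estimate. This is not how edgeR computes dispersion (its estimators are likelihood-based and share information across genes), and the counts are invented purely for illustration.

```python
import statistics

def moment_dispersion(counts):
    """Method-of-moments dispersion, assuming variance = mean + dispersion * mean**2.

    A value near 0 is consistent with Poisson; a clearly positive value
    indicates overdispersion.
    """
    mean = statistics.mean(counts)
    variance = statistics.variance(counts)  # sample variance (n - 1 denominator)
    return (variance - mean) / mean ** 2

# Hypothetical counts, invented for illustration only.
counts = [0, 1, 1, 2, 3, 5, 8, 12]
mean = statistics.mean(counts)
variance = statistics.variance(counts)
dispersion = moment_dispersion(counts)
print(mean, variance, dispersion)
```

Under a Poisson model the variance would be close to the mean; here it is several times larger, so the estimated dispersion is positive.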
We need a mathematical way of describing the data. First, let’s replot it.
We can draw a best fit line through the data. This is a model of the data.
* `slope_DP` is the difference between NDP and DP.
* `trt_DP` is an indicator for whether or not the plant was grown in DP.
* `trt_DP` can be 0 (NDP) or 1 (DP).
* When `trt_DP` is 1, the predicted height is `intercept + slope_DP`.
The model matrix describes how each plant can be plugged in to this equation:
\(height = intercept + trt\_DP*slope\_DP\)
Plant | Trt | Intercept | trt_DP |
---|---|---|---|
1 | NDP | 1 | 0 |
2 | NDP | 1 | 0 |
3 | NDP | 1 | 0 |
4 | DP | 1 | 1 |
5 | DP | 1 | 1 |
6 | DP | 1 | 1 |
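
Plugging each plant into the equation amounts to multiplying each row of the model matrix by the coefficients and summing. A minimal sketch follows; the coefficient values are hypothetical, chosen only to make the arithmetic easy to follow, and `predict` is an illustrative helper name.

```python
# Hypothetical coefficients, chosen only to illustrate the arithmetic.
intercept = 20.0   # predicted height for NDP plants
slope_DP = -5.0    # difference between DP and NDP

# Model matrix rows from the table: (Intercept, trt_DP) for plants 1-6.
model_matrix = [
    (1, 0), (1, 0), (1, 0),  # plants 1-3, NDP
    (1, 1), (1, 1), (1, 1),  # plants 4-6, DP
]

def predict(row, coefficients):
    """Multiply each model-matrix entry by its coefficient and sum the terms."""
    return sum(entry * coef for entry, coef in zip(row, coefficients))

predicted = [predict(row, (intercept, slope_DP)) for row in model_matrix]
print(predicted)  # NDP plants get intercept; DP plants get intercept + slope_DP
```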
Full Model: \(height = intercept + trt\_DP*slope\_DP\)
vs
Reduced Model: \(height = intercept\)
Which fits the data better?
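One common way to compare the two models is to look at how much the residual sum of squares (RSS) drops when the extra term is added, relative to the noise that remains. The sketch below assumes ordinary least squares and an F statistic; the source does not say which test is intended, and the heights are invented for illustration.

```python
import statistics

# Hypothetical heights, invented for illustration (plants 1-3 NDP, plants 4-6 DP).
heights = [21.0, 19.0, 20.0, 16.0, 14.0, 15.0]
trt_DP = [0, 0, 0, 1, 1, 1]

def rss(observed, fitted):
    """Residual sum of squares between observed and fitted values."""
    return sum((o - f) ** 2 for o, f in zip(observed, fitted))

# Full model: with a single 0/1 indicator, least squares reduces to group means.
ndp_mean = statistics.mean([h for h, t in zip(heights, trt_DP) if t == 0])
dp_mean = statistics.mean([h for h, t in zip(heights, trt_DP) if t == 1])
intercept = ndp_mean
slope_DP = dp_mean - ndp_mean
full_fitted = [intercept + t * slope_DP for t in trt_DP]

# Reduced model: a single intercept, which least squares sets to the overall mean.
reduced_fitted = [statistics.mean(heights)] * len(heights)

rss_full = rss(heights, full_fitted)
rss_reduced = rss(heights, reduced_fitted)

# F statistic: improvement per extra parameter, relative to remaining noise.
n, p_full, p_reduced = len(heights), 2, 1
f_stat = ((rss_reduced - rss_full) / (p_full - p_reduced)) / (rss_full / (n - p_full))
print(rss_full, rss_reduced, f_stat)
```

The full model can never have a larger RSS than the reduced one, so the real question is whether the improvement is bigger than chance alone would produce. A large F statistic suggests that it is.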
What if there are three levels to the treatment? For example:

* NDP (Not Dense Planting)
* DP (Dense Planting)
* UDP (Ultra Dense Planting)
Because these are categories, each level other than the reference (NDP) gets its own indicator variable and term in the equation.
\(height = intercept + trt\_DP*slope\_DP + trt\_UDP*slope\_UDP\)
Plant | Trt | Intercept | trt_DP | trt_UDP |
---|---|---|---|---|
1 | NDP | 1 | 0 | 0 |
2 | NDP | 1 | 0 | 0 |
3 | NDP | 1 | 0 | 0 |
4 | DP | 1 | 1 | 0 |
5 | DP | 1 | 1 | 0 |
6 | DP | 1 | 1 | 0 |
7 | UDP | 1 | 0 | 1 |
8 | UDP | 1 | 0 | 1 |
9 | UDP | 1 | 0 | 1 |
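
The indicator columns above can be generated mechanically from the treatment labels. This is a small sketch of that encoding, in which the first level is treated as the reference and absorbed into the intercept; the function name `treatment_coding` is an illustrative assumption, not part of any library.

```python
def treatment_coding(labels, levels):
    """Build model-matrix rows: an intercept column plus one 0/1 indicator
    per non-reference level. The first entry of `levels` is the reference."""
    non_reference = levels[1:]
    header = ["Intercept"] + ["trt_" + level for level in non_reference]
    rows = [
        [1] + [1 if label == level else 0 for level in non_reference]
        for label in labels
    ]
    return header, rows

# Nine plants, three per treatment, as in the table.
labels = ["NDP"] * 3 + ["DP"] * 3 + ["UDP"] * 3
header, rows = treatment_coding(labels, levels=["NDP", "DP", "UDP"])

print(header)
for plant, row in enumerate(rows, start=1):
    print(plant, row)
```

A factor with three levels needs only two indicator columns, because a plant with both indicators at 0 must belong to the reference level.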
What if there are two different factors (treatment and genotype)?
\(height = intercept + trt\_DP*slope\_DP + gt\_R500*slope\_R500\)
The intercept corresponds to genotype IMB211 grown in NDP.
Plant | Trt | GT | Intercept | trt_DP | gt_R500 |
---|---|---|---|---|---|
1 | NDP | IMB | 1 | 0 | 0 |
2 | NDP | IMB | 1 | 0 | 0 |
3 | NDP | IMB | 1 | 0 | 0 |
4 | DP | IMB | 1 | 1 | 0 |
5 | DP | IMB | 1 | 1 | 0 |
6 | DP | IMB | 1 | 1 | 0 |
7 | NDP | R500 | 1 | 0 | 1 |
8 | NDP | R500 | 1 | 0 | 1 |
9 | NDP | R500 | 1 | 0 | 1 |
10 | DP | R500 | 1 | 1 | 1 |
11 | DP | R500 | 1 | 1 | 1 |
12 | DP | R500 | 1 | 1 | 1 |
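
Reading the indicator columns row by row gives one predicted height for each combination of genotype and treatment. The equations below write `slope_R500` for the coefficient on the `gt_R500` indicator, a name assumed here by analogy with `slope_DP`.

\[
\begin{aligned}
\text{IMB211, NDP } (trt\_DP = 0,\ gt\_R500 = 0):\quad & height = intercept \\
\text{IMB211, DP } (trt\_DP = 1,\ gt\_R500 = 0):\quad & height = intercept + slope\_DP \\
\text{R500, NDP } (trt\_DP = 0,\ gt\_R500 = 1):\quad & height = intercept + slope\_R500 \\
\text{R500, DP } (trt\_DP = 1,\ gt\_R500 = 1):\quad & height = intercept + slope\_DP + slope\_R500
\end{aligned}
\]

Because the model is additive, the effect of DP is the same `slope_DP` in both genotypes. Letting the treatment effect differ by genotype would require an additional interaction term.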