Julin Maloof
An alternative to doing differential gene expression analysis for RNAseq data is to look for patterns in the data.
This can be particularly useful as the data sets get larger.
The 12 samples you worked with last week are part of a much larger data set. Today we will work with 48 samples from the same experiment.
We will explore three types of clustering:
Basic idea:
Let's try it on some cities, using the pairwise distances between them (in miles):
| | BOS | NY | DC | MIA | CHI |
|---|---|---|---|---|---|
| BOS | 0 | 206 | 429 | 1504 | 963 |
| NY | 206 | 0 | 233 | 1308 | 802 |
| DC | 429 | 233 | 0 | 1075 | 671 |
| MIA | 1504 | 1308 | 1075 | 0 | 1329 |
| CHI | 963 | 802 | 671 | 1329 | 0 |
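The first step of hierarchical clustering on this table is to find the closest pair of cities. A minimal sketch in base R, assuming the table is entered by hand as a matrix (the names `cities`, `d`, and `closest_pair` are illustrative, not from the original material):

```r
# Pairwise city distances from the table above
cities <- c("BOS", "NY", "DC", "MIA", "CHI")
d <- matrix(
  c(   0,  206,  429, 1504,  963,
     206,    0,  233, 1308,  802,
     429,  233,    0, 1075,  671,
    1504, 1308, 1075,    0, 1329,
     963,  802,  671, 1329,    0),
  nrow = 5, byrow = TRUE, dimnames = list(cities, cities)
)

# Keep only the upper triangle so the zero diagonal and the
# mirrored entries are ignored when searching for the minimum
d_upper <- d
d_upper[lower.tri(d_upper, diag = TRUE)] <- NA

smallest <- min(d_upper, na.rm = TRUE)
idx <- which(d_upper == smallest, arr.ind = TRUE)[1, ]
closest_pair <- unname(cities[idx])

closest_pair  # the two cities to merge first
smallest      # their distance
```

Here the smallest off-diagonal entry is 206, so BOS and NY are merged into the first cluster.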
| | BOS_NY | DC | MIA | CHI |
|---|---|---|---|---|
| BOS_NY | 0 | NA | NA | NA |
| DC | NA | 0 | 1075 | 671 |
| MIA | NA | 1075 | 0 | 1329 |
| CHI | NA | 671 | 1329 | 0 |
After merging to create a cluster, we must recompute the distances from the new node to all other nodes.
But what value should we use?
| | BOS | NY | DC | MIA | CHI |
|---|---|---|---|---|---|
| BOS | 0 | 206 | 429 | 1504 | 963 |
| NY | 206 | 0 | 233 | 1308 | 802 |
| DC | 429 | 233 | 0 | 1075 | 671 |
| MIA | 1504 | 1308 | 1075 | 0 | 1329 |
| CHI | 963 | 802 | 671 | 1329 | 0 |
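We could use the minimum, maximum, or average distance. The default in R's `hclust()` is the maximum (complete linkage). For example, with the maximum, the distance from BOS_NY to DC is max(429, 233) = 429.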
| | BOS_NY | DC | MIA | CHI |
|---|---|---|---|---|
| BOS_NY | 0 | 429 | 1504 | 963 |
| DC | 429 | 0 | 1075 | 671 |
| MIA | 1504 | 1075 | 0 | 1329 |
| CHI | 963 | 671 | 1329 | 0 |
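To see how the choice of minimum, maximum, or average changes the new BOS_NY row, the three rules can be compared side by side. This is a sketch in base R with illustrative names (`from_bos`, `from_ny`, `linkage`); the plain mean used for "average" matches average linkage only because both merged nodes are single cities here.

```r
# Pairwise city distances from the original table
cities <- c("BOS", "NY", "DC", "MIA", "CHI")
d <- matrix(
  c(   0,  206,  429, 1504,  963,
     206,    0,  233, 1308,  802,
     429,  233,    0, 1075,  671,
    1504, 1308, 1075,    0, 1329,
     963,  802,  671, 1329,    0),
  nrow = 5, byrow = TRUE, dimnames = list(cities, cities)
)

# Distances from each merged city to the remaining cities
others   <- c("DC", "MIA", "CHI")
from_bos <- d["BOS", others]
from_ny  <- d["NY", others]

# Candidate distances from the new BOS_NY node, one row per rule
linkage <- rbind(
  single   = pmin(from_bos, from_ny),    # minimum
  complete = pmax(from_bos, from_ny),    # maximum (the hclust default)
  average  = (from_bos + from_ny) / 2    # mean of the two distances
)
linkage
```

The `complete` row reproduces the BOS_NY row of the table above (429, 1504, 963).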
| | BOS_NY_DC | MIA | CHI |
|---|---|---|---|
| BOS_NY_DC | 0 | 1504 | 963 |
| MIA | 1504 | 0 | 1329 |
| CHI | 963 | 1329 | 0 |
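The merging steps worked through by hand above can be checked against `hclust()`. A minimal sketch, assuming the city table is entered as a matrix and converted with `as.dist()` (the object names `d` and `hc` are illustrative):

```r
# Pairwise city distances from the original table
cities <- c("BOS", "NY", "DC", "MIA", "CHI")
d <- matrix(
  c(   0,  206,  429, 1504,  963,
     206,    0,  233, 1308,  802,
     429,  233,    0, 1075,  671,
    1504, 1308, 1075,    0, 1329,
     963,  802,  671, 1329,    0),
  nrow = 5, byrow = TRUE, dimnames = list(cities, cities)
)

# Complete linkage: the distance between clusters is the maximum
# pairwise distance between their members
hc <- hclust(as.dist(d), method = "complete")

hc$height                   # distance at which each merge happens
groups <- cutree(hc, k = 2) # cut the tree into two clusters
groups
```

The merge heights are 206, 429, 963, and 1504, matching the tables: BOS and NY join first, then DC, then CHI, and MIA last. Calling `plot(hc)` draws the corresponding dendrogram.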
Basic idea:
```r
library(animation)

# Animate k-means on 50 random 2-D points with 3 centers
kmeans.ani(x = cbind(X1 = runif(50), X2 = runif(50)), centers = 3,
           hints = c("Move centers!", "Find cluster?"), pch = 1:3, col = 1:3)

# Same data-generating setup with 10 centers
kmeans.ani(x = cbind(X1 = runif(50), X2 = runif(50)), centers = 10,
           hints = c("Move centers!", "Find cluster?"), pch = 1:10, col = 1:10)

# Same data-generating setup with 5 centers
kmeans.ani(x = cbind(X1 = runif(50), X2 = runif(50)), centers = 5,
           hints = c("Move centers!", "Find cluster?"), pch = 1:5, col = 1:5)
```
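The two steps named in the animation hints ("Find cluster?" and "Move centers!") can also be written out by hand. The following is a rough base-R sketch of that loop on the same kind of random data, not the code used by `kmeans.ani()` or `kmeans()`; the variable names and the cap of 50 iterations are assumptions for illustration.

```r
set.seed(1)
x <- cbind(X1 = runif(50), X2 = runif(50))
k <- 3

# Start from k randomly chosen data points as the initial centers
centers <- x[sample(nrow(x), k), , drop = FALSE]

for (iter in 1:50) {
  # "Find cluster": squared distance from every point to every center,
  # then assign each point to its nearest center
  sq_dist <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  cluster <- max.col(-sq_dist, ties.method = "first")

  # "Move centers": each center moves to the mean of its members
  new_centers <- centers
  for (j in seq_len(k)) {
    members <- x[cluster == j, , drop = FALSE]
    if (nrow(members) > 0) new_centers[j, ] <- colMeans(members)
  }

  # Stop once the centers no longer move
  if (isTRUE(all.equal(new_centers, centers))) break
  centers <- new_centers
}

table(cluster)  # number of points in each cluster
centers         # final center coordinates
```

In practice `kmeans(x, centers = 3)` does this work (with a more refined algorithm by default), so the loop is only meant to make the two alternating steps visible.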
How many clusters are enough?
Cluster variance = average squared distance from cluster center to each member.
Calculate within-cluster variance for N random clusters = “Expected Random”
Calculate within-cluster variance for calculated K-means clusters = “Observed”
Choose the smallest number of clusters that maximizes the “gap” between observed and expected random within-cluster variance.
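The observed-versus-random comparison above can be sketched in base R. Several things here are assumptions rather than part of the original material: the toy data, the helper names `total_wss` and `random_like`, drawing the random reference data uniformly over the range of each variable, and comparing on the log scale as in the gap statistic. Total within-cluster sum of squares is used in place of variance; the two differ only by a constant factor for a fixed number of points.

```r
set.seed(42)

# Toy data (hypothetical): three well-separated groups of 30 points in 2-D
x <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  cbind(rnorm(30, mean = 0, sd = 0.3), rnorm(30, mean = 3, sd = 0.3))
)

# Total within-cluster sum of squares for k clusters
total_wss <- function(dat, k) {
  if (k == 1) return(sum(scale(dat, scale = FALSE)^2))
  kmeans(dat, centers = k, nstart = 10)$tot.withinss
}

# Reference data with no cluster structure: uniform over each column's range
random_like <- function(dat) {
  apply(dat, 2, function(col) runif(length(col), min(col), max(col)))
}

k_values <- 1:6
n_random <- 20  # number of random reference data sets per k

observed_log <- sapply(k_values, function(k) log(total_wss(x, k)))
expected_log <- sapply(k_values, function(k) {
  mean(replicate(n_random, log(total_wss(random_like(x), k))))
})

# Gap between "Expected Random" and "Observed" for each k
gap <- expected_log - observed_log
data.frame(k = k_values, observed_log, expected_log, gap)
```

The `cluster` package provides `clusGap()`, which implements the full gap statistic including standard errors, and is probably the better choice for real data.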