Genetic Networks 1: Clustering

Julin Maloof

Clustering

An alternative to doing differential gene expression analysis for RNAseq data is to look for patterns in the data.

This can be particularly useful as the data sets get larger.

The 12 samples your worked with last week are part of a much larger data set. Today we will work with 48 samples from the same experiment.

Clustering Goals

  1. Identify groups of samples with similar expression patterns (why?)
  2. Identify groups of genes with similar expression patterns (why?)

Clustering methods

We will explore three types of clustering:

  1. Hierarchical Clustering (today)
  2. K-means clustering (today)
  3. Co-expression (Tuesday)

Hierarchical Clustering

Basic idea:

  1. Calculate distances between each pair of samples (or genes)
  2. Pair the closest samples together
  3. Recalculate distances
  4. pair the closest samples
  5. repeat 3 and 4

let's try it on some cities

Cities

BOS NY DC MIA CHI
BOS 0 206 429 1504 963
NY 206 0 233 1308 802
DC 429 233 0 1075 671
MIA 1504 1308 1075 0 1329
CHI 963 802 671 1329 0

Cities

BOS_NY DC MIA CHI
BOS_NY 0 NA NA NA
DC NA 0 1075 671
MIA NA 1075 0 1329
CHI NA 671 1329 0

After merging to create a cluster, must re-compute distances from the new mode to all other nodes.

But what value should we use?

BOS NY DC MIA CHI
BOS 0 206 429 1504 963
NY 206 0 233 1308 802
DC 429 233 0 1075 671
MIA 1504 1308 1075 0 1329
CHI 963 802 671 1329 0

Could use minimum, maximum, or average distance. The default in r hclust is maximum

Cities

BOS NY DC MIA CHI
BOS 0 206 429 1504 963
NY 206 0 233 1308 802
DC 429 233 0 1075 671
MIA 1504 1308 1075 0 1329
CHI 963 802 671 1329 0
BOS_NY DC MIA CHI
BOS_NY 0 429 1504 963
DC 429 0 1075 671
MIA 1504 1075 0 1329
CHI 963 671 1329 0

Cities

BOS_NY DC MIA CHI
BOS_NY 0 429 1504 963
DC 429 0 1075 671
MIA 1504 1075 0 1329
CHI 963 671 1329 0
BOS_NY_DC MIA CHI
BOS_NY_DC 0 1504 963
MIA 1504 0 1329
CHI 963 1329 0

K-means clustering

Basic idea:

  1. Define the number of clusters desired
  2. Randomly assign each gene to a cluster
  3. Calculate the mean position of each cluster (aka cluster centroid)
  4. Assign each gene to the cluster whose centroid it is closest to
  5. Repeat until stable.

K-means animation

library(animation) 

kmeans.ani(x = cbind(X1 = runif(50), X2 = runif(50)), centers = 3,
hints = c("Move centers!", "Find cluster?"), pch = 1:3, col = 1:3)

K-means animation

kmeans.ani(x = cbind(X1 = runif(50), X2 = runif(50)), centers = 10,
hints = c("Move centers!", "Find cluster?"), pch = 1:10, col = 1:10)

kmeans.ani(x = cbind(X1 = runif(50), X2 = runif(50)), centers = 5,
hints = c("Move centers!", "Find cluster?"), pch = 1:5, col = 1:5)

Gap-statistic

How many clusters are enough?

Cluster variance = average squared distance from cluster center to each member.

Calculate within-cluster variance for N random clusters = “Expected Random”

Calculate within-cluster variance for calculated K-means clusters = “Observed”

Choose the smallest number of clusters that maximizes the “gap” between observed and expected random within-cluster variance.