Genetic Networks 1: Clustering

Julin Maloof

Clustering

An alternative to doing differential gene expression analysis for RNAseq data is to look for patterns in the data.

This can be particularly useful as the data sets get larger.

The 12 samples your worked with last week are part of a much larger data set. Today we will work with 48 samples from the same experiment.

Clustering Goals

Identify groups of samples with similar expression patterns (why?)
Identify groups of genes with similar expression patterns (why?)

Clustering methods

We will explore three types of clustering:

Hierarchical Clustering (today)
K-means clustering (today)
Co-expression (Tuesday)

Hierarchical Clustering

Basic idea:

Calculate distances between each pair of samples (or genes)
Pair the closest samples together
Recalculate distances
pair the closest samples
repeat 3 and 4

let's try it on some cities

Cities

	BOS	NY	DC	MIA	CHI
BOS	0	206	429	1504	963
NY	206	0	233	1308	802
DC	429	233	0	1075	671
MIA	1504	1308	1075	0	1329
CHI	963	802	671	1329	0

Cities

	BOS_NY	DC	MIA	CHI
BOS_NY	0	NA	NA	NA
DC	NA	0	1075	671
MIA	NA	1075	0	1329
CHI	NA	671	1329	0

After merging to create a cluster, must re-compute distances from the new mode to all other nodes.

But what value should we use?

	BOS	NY	DC	MIA	CHI
BOS	0	206	429	1504	963
NY	206	0	233	1308	802
DC	429	233	0	1075	671
MIA	1504	1308	1075	0	1329
CHI	963	802	671	1329	0

Could use minimum, maximum, or average distance. The default in r hclust is maximum

Cities

	BOS	NY	DC	MIA	CHI
BOS	0	206	429	1504	963
NY	206	0	233	1308	802
DC	429	233	0	1075	671
MIA	1504	1308	1075	0	1329
CHI	963	802	671	1329	0

	BOS_NY	DC	MIA	CHI
BOS_NY	0	429	1504	963
DC	429	0	1075	671
MIA	1504	1075	0	1329
CHI	963	671	1329	0

Cities

	BOS_NY	DC	MIA	CHI
BOS_NY	0	429	1504	963
DC	429	0	1075	671
MIA	1504	1075	0	1329
CHI	963	671	1329	0

	BOS_NY_DC	MIA	CHI
BOS_NY_DC	0	1504	963
MIA	1504	0	1329
CHI	963	1329	0

K-means clustering

Basic idea:

Define the number of clusters desired
Randomly assign each gene to a cluster
Calculate the mean position of each cluster (aka cluster centroid)
Assign each gene to the cluster whose centroid it is closest to
Repeat until stable.

K-means animation

library(animation) 

kmeans.ani(x = cbind(X1 = runif(50), X2 = runif(50)), centers = 3,
hints = c("Move centers!", "Find cluster?"), pch = 1:3, col = 1:3)

K-means animation

kmeans.ani(x = cbind(X1 = runif(50), X2 = runif(50)), centers = 10,
hints = c("Move centers!", "Find cluster?"), pch = 1:10, col = 1:10)

kmeans.ani(x = cbind(X1 = runif(50), X2 = runif(50)), centers = 5,
hints = c("Move centers!", "Find cluster?"), pch = 1:5, col = 1:5)

Gap-statistic

How many clusters are enough?

Cluster variance = average squared distance from cluster center to each member.

Calculate within-cluster variance for N random clusters = “Expected Random”

Calculate within-cluster variance for calculated K-means clusters = “Observed”

Choose the smallest number of clusters that maximizes the “gap” between observed and expected random within-cluster variance.