Genetic Networks 1: Clustering

Julin Maloof

Clustering

An alternative to doing differential gene expression analysis for RNAseq data is to look for patterns in the data.

This can be particularly useful as the data sets get larger.

The 12 samples you worked with last week are part of a much larger data set. Today we will work with 48 samples from the same experiment.

Clustering Goals

  1. Identify groups of samples with similar expression patterns (why?)
  2. Identify groups of genes with similar expression patterns (why?)

Clustering methods

We will explore three types of clustering:

  1. Hierarchical Clustering (today)
  2. K-means clustering (today)
  3. Co-expression (Tuesday)

K-means animation

library(animation)

# Animate k-means on 50 random 2-D points, grouped into 3 clusters
kmeans.ani(x = cbind(X1 = runif(50), X2 = runif(50)), centers = 3,
           hints = c("Move centers!", "Find cluster?"), pch = 1:3, col = 1:3)

K-means animation

# Same kind of random data, now asking for 10 clusters
kmeans.ani(x = cbind(X1 = runif(50), X2 = runif(50)), centers = 10,
           hints = c("Move centers!", "Find cluster?"), pch = 1:10, col = 1:10)

# And again with 5 clusters
kmeans.ani(x = cbind(X1 = runif(50), X2 = runif(50)), centers = 5,
           hints = c("Move centers!", "Find cluster?"), pch = 1:5, col = 1:5)
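
The animations show the algorithm step by step. Without the animation, the same kind of clustering can be run with base R's kmeans(). This is a minimal sketch on the same type of random data; the seed and nstart = 10 are illustrative choices, not values from the course material.

```r
set.seed(1)  # make the random points reproducible

# 50 random 2-D points, as in the animations
x <- cbind(X1 = runif(50), X2 = runif(50))

# Fit 3 clusters; nstart = 10 tries 10 random starts and keeps the best fit
fit <- kmeans(x, centers = 3, nstart = 10)

table(fit$cluster)  # number of points assigned to each cluster
fit$centers         # final coordinates of the 3 cluster centers
```

Because the starting centers are random, different seeds can give different clusters, which is one reason to use several starts.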

Gap-statistic

How many clusters are enough?

Cluster variance = average squared distance from cluster center to each member.
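
As a rough illustration of this definition, the cluster variance can be computed by hand from a kmeans() fit. The random data, seed, and variable names (sq_dist, cluster_var) are assumptions made for this sketch.

```r
set.seed(1)
x <- cbind(X1 = runif(50), X2 = runif(50))
fit <- kmeans(x, centers = 3, nstart = 10)

# Squared distance from each point to the center of its own cluster
sq_dist <- rowSums((x - fit$centers[fit$cluster, ])^2)

# Cluster variance: average squared distance within each cluster
cluster_var <- tapply(sq_dist, fit$cluster, mean)
cluster_var

# Sanity check: the summed squared distances match what kmeans() reports
all.equal(sum(sq_dist), fit$tot.withinss)
```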

Calculate within-cluster variance for clusters fit to N randomized data sets = “Expected Random”

Calculate within-cluster variance for calculated K-means clusters = “Observed”

Choose the smallest number of clusters that maximizes the “gap” between observed and expected random within-cluster variance.
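
The comparison above might be sketched as follows. This is a simplified illustration, not the method used in the lab. It assumes three simulated groups of points, uniform random reference data drawn over the range of each column, and the log of the total within-cluster sum of squares as the measure of spread (the usual convention for the gap statistic). The helper names within_ss and random_like are hypothetical.

```r
set.seed(42)

# Simulated data with three well-separated groups (illustrative only)
x <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k clusters
within_ss <- function(dat, k) {
  kmeans(dat, centers = k, nstart = 10)$tot.withinss
}

# One random reference data set: uniform draws over the range of each column
random_like <- function(dat) {
  apply(dat, 2, function(col) runif(length(col), min(col), max(col)))
}

ks <- 1:6        # numbers of clusters to try
n_random <- 20   # number of random reference data sets per k

# "Observed": spread of the k-means clusters on the real data
observed <- sapply(ks, function(k) log(within_ss(x, k)))

# "Expected Random": average spread when clustering random data
expected <- sapply(ks, function(k) {
  mean(replicate(n_random, log(within_ss(random_like(x), k))))
})

gap <- expected - observed
data.frame(k = ks, observed, expected, gap)
```

A larger gap means the real data cluster more tightly than random data would for that number of clusters. In practice, the clusGap() function in the cluster package implements the full procedure, including standard errors for the gap.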