Clustering vs Networks

  • In the last lab you learned how to group genes into clusters based on similar expression patterns.
  • In this lab we extend this concept to build gene networks.
  • Gene networks are graphs that show connections between genes with similar expression.

Co-expression Network

  • The goal is to connect the genes whose expression patterns are most similar.
  • One simple way to do this is to use correlation as a measure of expression similarity.
  • Why might we want to do this?

Correlation Matrix

First, calculate the correlation between every pair of genes, using each gene’s expression across samples.

       GeneA  GeneB  GeneC  GeneD  GeneE  GeneF
GeneA   1.00   0.12   0.75   0.86   0.49   0.32
GeneB   0.12   1.00   0.92   0.08   0.88   0.08
GeneC   0.75   0.92   1.00   0.81   0.78   0.02
GeneD   0.86   0.08   0.81   1.00   0.28   0.59
GeneE   0.49   0.88   0.78   0.28   1.00   0.78
GeneF   0.32   0.08   0.02   0.59   0.78   1.00
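
As a rough sketch of this step, the Python below computes a correlation matrix from scratch. It assumes Pearson correlation (the slides do not say which correlation is used), and the expression values and the names GeneX, GeneY, and GeneZ are made up for illustration; they are not the data behind the table above.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical expression values: one list per gene, one value per sample.
expression = {
    "GeneX": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GeneY": [2.1, 3.9, 6.2, 8.0, 9.8],
    "GeneZ": [5.0, 3.0, 4.0, 1.0, 2.0],
}

genes = list(expression)

# corr[g][h] is the correlation between gene g and gene h across samples.
corr = {g: {h: pearson(expression[g], expression[h]) for h in genes} for g in genes}

for g in genes:
    print(g, " ".join(f"{corr[g][h]:6.2f}" for h in genes))
```

The result is symmetric with 1.00 on the diagonal, like the matrix above.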

Adjacency Matrix

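Then create an adjacency matrix, with “1” indicating gene pairs whose correlation is above a threshold and “0” indicating pairs below the threshold. The diagonal is set to “0” so that genes are not connected to themselves. In this example, any threshold between 0.59 and 0.75 gives the matrix below.

Connect genes with a “1”.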

       GeneA  GeneB  GeneC  GeneD  GeneE  GeneF
GeneA      0      0      1      1      0      0
GeneB      0      0      1      0      1      0
GeneC      1      1      0      1      1      0
GeneD      1      0      1      0      0      0
GeneE      0      1      1      0      0      1
GeneF      0      0      0      0      1      0
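
A minimal sketch of the thresholding step, using the correlation matrix from the example. The threshold of 0.7 is an assumption: the slides do not state the cutoff, but any value between 0.59 and 0.75 reproduces the adjacency matrix shown.

```python
genes = ["GeneA", "GeneB", "GeneC", "GeneD", "GeneE", "GeneF"]

# Correlation matrix from the example (rows and columns in the order of `genes`).
corr = [
    [1.00, 0.12, 0.75, 0.86, 0.49, 0.32],
    [0.12, 1.00, 0.92, 0.08, 0.88, 0.08],
    [0.75, 0.92, 1.00, 0.81, 0.78, 0.02],
    [0.86, 0.08, 0.81, 1.00, 0.28, 0.59],
    [0.49, 0.88, 0.78, 0.28, 1.00, 0.78],
    [0.32, 0.08, 0.02, 0.59, 0.78, 1.00],
]

threshold = 0.7  # assumed cutoff, not given in the slides

n = len(genes)

# 1 if two different genes are correlated above the threshold, otherwise 0.
adjacency = [
    [1 if i != j and corr[i][j] > threshold else 0 for j in range(n)]
    for i in range(n)
]

for gene, row in zip(genes, adjacency):
    print(gene, row)
```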

Terminology

  • Genes are nodes
  • Connections between genes are edges

Mutual Rank Networks

One problem with correlation networks is that it is hard to know what threshold to pick. Further, correlation values can be affected by “noise” in the experiment that may not be relevant.

An alternative (and in my hands better) approach is to use Mutual Ranks. We connect each gene to the genes it is most highly correlated with, regardless of the precise correlation values.

More info

Here, we:

  1. Create a pairwise correlation matrix, as above
  2. For each gene, rank its correlations with the other genes from strongest to weakest
  3. For each pair of genes, compute the geometric mean of their two ranks (the mutual rank)
  4. Choose a rank-based cutoff to create the adjacency matrix

Correlation Matrix

First, calculate the correlation between every pair of genes, using each gene’s expression across samples.

       GeneA  GeneB  GeneC  GeneD  GeneE  GeneF
GeneA   1.00   0.12   0.75   0.86   0.49   0.32
GeneB   0.12   1.00   0.92   0.08   0.88   0.08
GeneC   0.75   0.92   1.00   0.81   0.78   0.02
GeneD   0.86   0.08   0.81   1.00   0.28   0.59
GeneE   0.49   0.88   0.78   0.28   1.00   0.78
GeneF   0.32   0.08   0.02   0.59   0.78   1.00

Rank Matrix

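       GeneA  GeneB  GeneC  GeneD  GeneE  GeneF
GeneA     NA      3      4      1      4      3
GeneB      5     NA      1      5      1      4
GeneC      2      1     NA      2    2.5      5
GeneD      1    4.5      2     NA      5      2
GeneE      3      2      3      4     NA      1
GeneF      4    4.5      5      3    2.5     NA

  • Rankings are in columns: each column ranks the other genes by their correlation with that column’s gene (1 = strongest).
  • Tied correlations share the average of their ranks (e.g. GeneD and GeneF both get 4.5 in the GeneB column).
  • Note that the ranks are not necessarily symmetrical.
  • A-D and D-A are symmetrical (both 1).
  • E-B and B-E are not symmetrical (2 vs 1).
  • To get a single value per gene pair, use the (geometric) mean of the two rankings.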
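
The ranking step can be sketched as follows. This assumes that tied correlations share the average of their ranks, which is what the mutual rank values later in the example imply; the helper name rank_descending is made up for this sketch.

```python
genes = ["GeneA", "GeneB", "GeneC", "GeneD", "GeneE", "GeneF"]

# Correlation matrix from the example (rows and columns in the order of `genes`).
corr = [
    [1.00, 0.12, 0.75, 0.86, 0.49, 0.32],
    [0.12, 1.00, 0.92, 0.08, 0.88, 0.08],
    [0.75, 0.92, 1.00, 0.81, 0.78, 0.02],
    [0.86, 0.08, 0.81, 1.00, 0.28, 0.59],
    [0.49, 0.88, 0.78, 0.28, 1.00, 0.78],
    [0.32, 0.08, 0.02, 0.59, 0.78, 1.00],
]


def rank_descending(values):
    """Rank values from largest (rank 1) to smallest; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda k: -values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of values tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in order[pos:end + 1]:
            ranks[k] = average
        pos = end + 1
    return ranks


n = len(genes)

# rank[i][j]: rank of gene i among the genes correlated with gene j (column j).
rank = [[None] * n for _ in range(n)]
for j in range(n):
    others = [i for i in range(n) if i != j]  # skip the self-correlation
    column_ranks = rank_descending([corr[i][j] for i in others])
    for i, r in zip(others, column_ranks):
        rank[i][j] = r

for gene, row in zip(genes, rank):
    print(gene, row)
```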

Geometric mean

Geometric mean of \(x\) and \(y\): \(\sqrt{x \cdot y}\)

x y arith_mean geom_mean
1 1 1.0 1.00
1 10 5.5 3.16
3 2 2.5 2.45
3 10 6.5 5.48
5 3 4.0 3.87
5 20 12.5 10.00
100 1 50.5 10.00
100 2 51.0 14.14
100 20 60.0 44.72

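When \(x\) and \(y\) are different, the geometric mean is smaller than the arithmetic mean, so it weights the smaller number more heavily. For example, ranks of 100 and 1 have an arithmetic mean of 50.5 but a geometric mean of only 10.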
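
A short sketch comparing the two means on a few of the pairs from the table. The function names are chosen for this example only.

```python
from math import sqrt


def arithmetic_mean(x, y):
    """Ordinary average of two numbers."""
    return (x + y) / 2


def geometric_mean(x, y):
    """Square root of the product of two non-negative numbers."""
    return sqrt(x * y)


# A few (x, y) pairs taken from the table above.
pairs = [(1, 1), (1, 10), (100, 1), (100, 20)]

for x, y in pairs:
    print(x, y, arithmetic_mean(x, y), round(geometric_mean(x, y), 2))
```

For mutual ranks this means a gene pair scores well only if at least one of its two ranks is very good or both are reasonably good.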

Average Ranks

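Each entry is the geometric mean of the two ranks for that gene pair. For example, GeneA-GeneB has ranks 3 and 5, giving \(\sqrt{3 \cdot 5} = 3.87\).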

       GeneA  GeneB  GeneC  GeneD  GeneE  GeneF
GeneA     NA   3.87   2.83   1.00   3.46   3.46
GeneB   3.87     NA   1.00   4.74   1.41   4.24
GeneC   2.83   1.00     NA   2.00   2.74   5.00
GeneD   1.00   4.74   2.00     NA   4.47   2.45
GeneE   3.46   1.41   2.74   4.47     NA   1.58
GeneF   3.46   4.24   5.00   2.45   1.58     NA
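
A sketch of how the mutual ranks follow from the rank matrix. It assumes tied correlations were given the average of their ranks (4.5 and 2.5 below), since that is what reproduces entries such as 4.74 and 2.74 in the table above.

```python
from math import sqrt

genes = ["GeneA", "GeneB", "GeneC", "GeneD", "GeneE", "GeneF"]

# Rank matrix from the example, with ties shown as average ranks (assumption).
# rank[i][j] is the rank of gene i in the column of gene j.
rank = [
    [None, 3, 4, 1, 4, 3],
    [5, None, 1, 5, 1, 4],
    [2, 1, None, 2, 2.5, 5],
    [1, 4.5, 2, None, 5, 2],
    [3, 2, 3, 4, None, 1],
    [4, 4.5, 5, 3, 2.5, None],
]

n = len(genes)

# Mutual rank of genes i and j: geometric mean of rank[i][j] and rank[j][i].
mutual_rank = [
    [None if i == j else sqrt(rank[i][j] * rank[j][i]) for j in range(n)]
    for i in range(n)
]

for gene, row in zip(genes, mutual_rank):
    print(gene, ["NA" if v is None else f"{v:.2f}" for v in row])
```

Unlike the rank matrix, the mutual rank matrix is symmetric, so it can be thresholded directly.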

Adjacency Matrix

Connect genes whose mutual rank is 3 or less (Mutual Rank <= 3).

       GeneA  GeneB  GeneC  GeneD  GeneE  GeneF
GeneA      0      0      1      1      0      0
GeneB      0      0      1      0      1      0
GeneC      1      1      0      1      1      0
GeneD      1      0      1      0      0      1
GeneE      0      1      1      0      0      1
GeneF      0      0      0      1      1      0
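
The rank-based cutoff can be sketched like this, starting from the mutual rank values in the example and the cutoff of 3 used above.

```python
genes = ["GeneA", "GeneB", "GeneC", "GeneD", "GeneE", "GeneF"]

# Mutual rank matrix from the example (None on the diagonal).
mutual_rank = [
    [None, 3.87, 2.83, 1.00, 3.46, 3.46],
    [3.87, None, 1.00, 4.74, 1.41, 4.24],
    [2.83, 1.00, None, 2.00, 2.74, 5.00],
    [1.00, 4.74, 2.00, None, 4.47, 2.45],
    [3.46, 1.41, 2.74, 4.47, None, 1.58],
    [3.46, 4.24, 5.00, 2.45, 1.58, None],
]

cutoff = 3  # connect gene pairs with mutual rank <= 3

# 1 if the pair's mutual rank is at or below the cutoff, otherwise 0.
adjacency = [
    [1 if mr is not None and mr <= cutoff else 0 for mr in row]
    for row in mutual_rank
]

for gene, row in zip(genes, adjacency):
    print(gene, row)
```

Compared with the correlation-threshold network, this one gains a GeneD-GeneF edge: their correlation (0.59) is modest, but each is among the other’s top-ranked partners.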

Network

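Each “1” in the mutual rank adjacency matrix becomes an edge in the network:

       GeneA  GeneB  GeneC  GeneD  GeneE  GeneF
GeneA      0      0      1      1      0      0
GeneB      0      0      1      0      1      0
GeneC      1      1      0      1      1      0
GeneD      1      0      1      0      0      1
GeneE      0      1      1      0      0      1
GeneF      0      0      0      1      1      0

This gives 8 edges: A-C, A-D, B-C, B-E, C-D, C-E, D-F, and E-F.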
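
One way to turn the adjacency matrix into a network structure is an edge list plus a neighbor list per gene. This is a plain-Python sketch; in practice a graph library would normally handle this step and the plotting.

```python
genes = ["GeneA", "GeneB", "GeneC", "GeneD", "GeneE", "GeneF"]

# Mutual rank adjacency matrix from the example.
adjacency = [
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 0],
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 0],
]

n = len(genes)

# The matrix is symmetric, so reading the upper triangle lists each edge once.
edges = [
    (genes[i], genes[j])
    for i in range(n)
    for j in range(i + 1, n)
    if adjacency[i][j] == 1
]

# For every gene (node), the genes it is connected to.
neighbors = {gene: [] for gene in genes}
for a, b in edges:
    neighbors[a].append(b)
    neighbors[b].append(a)

print(edges)
for gene in genes:
    print(gene, "->", neighbors[gene])
```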

Network Measures

  • Density: how highly connected are the nodes?
    • Total edges in the network / All possible edges
  • What are the most “important” or “central” nodes?
    • Degree centrality: which node has the most connections?
    • Betweenness centrality: which node has the most shortest paths going through it?
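
These measures can be sketched in plain Python for the mutual rank network from the example. The betweenness function below is an unnormalized implementation of Brandes’ algorithm for unweighted, undirected graphs, written for this illustration; graph libraries may scale the values differently.

```python
from collections import deque

genes = ["GeneA", "GeneB", "GeneC", "GeneD", "GeneE", "GeneF"]

# Edges of the mutual rank network (Mutual Rank <= 3) from the example.
edges = [
    ("GeneA", "GeneC"), ("GeneA", "GeneD"), ("GeneB", "GeneC"), ("GeneB", "GeneE"),
    ("GeneC", "GeneD"), ("GeneC", "GeneE"), ("GeneD", "GeneF"), ("GeneE", "GeneF"),
]

neighbors = {gene: [] for gene in genes}
for a, b in edges:
    neighbors[a].append(b)
    neighbors[b].append(a)

# Density: edges present / edges possible, where possible = n * (n - 1) / 2.
n = len(genes)
density = len(edges) / (n * (n - 1) / 2)

# Degree centrality: number of connections per node.
degree = {gene: len(neighbors[gene]) for gene in genes}


def betweenness(neighbors):
    """Number of shortest paths through each node (fractional when paths tie)."""
    score = {v: 0.0 for v in neighbors}
    for source in neighbors:
        # Breadth-first search, counting shortest paths from `source`.
        visited_order = []
        preds = {v: [] for v in neighbors}
        sigma = {v: 0 for v in neighbors}  # number of shortest paths to v
        dist = {v: -1 for v in neighbors}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            visited_order.append(v)
            for w in neighbors[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path shares.
        delta = {v: 0.0 for v in neighbors}
        while visited_order:
            w = visited_order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each pair is counted once from each end in an undirected graph.
    return {v: s / 2 for v, s in score.items()}


between = betweenness(neighbors)

print("density:", round(density, 3))
print("degree:", degree)
print("betweenness:", between)
```

In this small network GeneC comes out as the most central node by both degree and betweenness.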

Limitations

Correlation and mutual rank networks are easy to make and easy to understand, but they have some limitations:

  • Is a hard threshold appropriate?
  • What is the right threshold?
  • Are correlations even the right measure?
  • What about directionality? Correlation does not tell us which gene affects which.

Additional method:

  • Weighted Gene Co-expression Network Analysis (WGCNA), which uses a “soft” threshold. Nice tutorials and YouTube videos are available.