For this, we will use the profiles of expression of genes differentially expressed between two cancer types (AML and ALL). These genes have previously been selected on the basis of a Student test, described in the tutorial on Student test.
dir.golub.data <- file.path(dir.course, 'data', 'gene_expression','golub_1999') golub.expr.E.1 <- read.table(file.path(dir.golub.data, 'golub_standardized_filtered_student_E1.tab')) ## Check the dimension of the table containing ## the selected expression profiles dim(golub.expr.E.1)This should give the following result:
> dim(golub.expr.E.1) [1] 243 38
We have thus selected 243 genes, showing a significantly different level of expression between the 27 patients suffering from ALL (acute lymphoblastic leukemia) and the 11 patients suffering from AML (acute myeloblastic leukemia).
In principle, these libraries should be installed on your machine before the beginning of the practicals. Assuming this is the case, you can load the libraries with the following instructions.
library(combinat) library(maptree) library(e1071)
If the libraries are loaded properly, you do not need to install them, and you can skip the next section.
To install the required libraries, log in as system administrator, open R and type the following command.
install.packages("combinat")
install.packages("maptree")
install.packages("e1071")
As usual, you need to load the configuration file before starting this tutorial. See the configuration page if you forgot how to do this.
Hierarchical clustering is performed in two steps:
## Read the help for the function dist() help(dist) ## Calculate Euclidian distances d.euclidian <- dist(golub.expr.E.1,method="euclidian") ## Have a look to the attributes of the distance matrix attributes(d.euclidian)
for the coefficient of correlation, the maximal value is 1. We can thus calculate the "correlation distance" accordingly.
d.correlation <- as.dist(1 - cor(golub.expr.E.1))
The method cor() returns an object of class matrix. For the hierarchical clustering, we need an object of class dist. The method as.dist() is used to cast (reformat) a matrix into a dist object
## Read the help on the hierarchical clustering method. help(hclust) ## Build a tree with the Euclidian distance tree.euclidian <- hclust(d.euclidian) ## Plot the tree plot(tree.euclidian)The plot s barely readable, due to the large number of genes. We can slightly decrease the overlap between labels by decreasing the font size.
x11(width=16,height=8) plot(tree.euclidian,cex=0.7)
The package stats contains another useful function, cutree(), which prunes the tree, i.e. cuts the branches at a certain level.
## Cut the tree to the level of 7 branches ## Each object (gene) is assigned to one of the 7 clusters clusters.euclidian <- cutree(tree.euclidian,k=7) ## Count the number of genes per cluster table(clusters.euclidian)
Another library, maptree, contains the method clip.clust(), which allows to plot the main branches of the tree.
library(maptree)
pr <- clip.clust(tree.euclidian,k=7)
plot(pr, main=paste("pruned tree, k=7"))
clusters.kmeans <- kmeans(d.euclidian, 7)Berware: each time you run the kmeans clustering, you can obtain a different result, since the initial centers are placed randomly in the variable space.
The simplest way to compare two clustering results is with a contingency table. For instance, we could count the number of common genes between each cluster obtained by kmeans with each cluster obtained by hierarchical clustering.
cont.table <- table(clusters.kmeans$cluster,as.vector(clusters.euclidian)) print(cont.table)The package e1071 contains a useful function, matchClasses(), which allows you to finds the optimal matches between two clustering results. This is done by permuting columns and rows of the contingency table in order to maximize the values on the diagonal.
library(e1071) ## Find optimal match between the two classifications class.match <- matchClasses(as.matrix(cont.table),method="exact") ## Print the confusion table, with rows permuted to maximize the diagonal print(cont.table[,class.match])