Tutorial 11: Machine Learning

April 4, 2023 (11th week of classes)
Machine leanring: K-means & Classification trees

Remember: The tutorials were produced in such a way that you should be able to easily adapt the codes to your own data!


We will use K-means to estimate the number of groups that differ in their environmental features (i.e., estimate the “number of environment types”) using a data set on environmental variation across 83 sites from two Brazilian rivers (Peres-Neto 2004; Patterns in the co-occurrence of stream fish metacommunties: the role of site suitability, morphology and phylogeny versus species interactions. Oecologia 140:352-360).

Download the environment data

river.data <- read.csv("env.river.csv",header=TRUE)

We need to first standardize the data so that each variable has a mean = 0 and a sd = 1. As such, all variables have the same influence in the analysis.

river.std.data <- scale(river.data[,3:14]) 

Let’s run the K-means setting the number of groups (clusters of sites) as 5 (you could had set other number of groups as well); the function kmeans is a standard function in R and does not need a special package.

kM.result.5G<-kmeans(river.std.data,centers = 5) 

Next, we can list river sites according to how the algorithm classifed them into groups:

clustered.data <- cbind(kM.result.5G$cluster,river.data)

The column kM.result.5G$cluster indicates which site belongs to which group (cluster). Often we want to put the data in the same order as the clusters. This operation can be done using the function order as follows:

clustered.data.ordered <- clustered.data[order(-kM.result.5G$cluster,decreasing=TRUE),]

This was just a very simple example as the best number of groups that classifies the data may well differ from 5 clusters. Let’s estimate what is the number of groups that best fit these data. First, we need to determine an appropriate statistic to measure the quality of clustering (fit) given a particular number of groups. As we discussed in our lecture on K-means, there are many metrics that can be used for this goal. Here we will use the package NbClust that calculates a large number of these metrics and tell us by a majority rule (consensus) how many clusters best describe the data. The majority rule is basically the number of clusters for which the majority of metrics agreed on.

Let’s install the package:


Let’s run the analysis. In the function NbClust we need to select type of distance to use among objects (obects here are stream sites; we will use the Euclidean distance among observations (here sites) as discussed in lecture). We also need to provide the function with the minimum and the maximum number of clusters to consider for the data (arguments min.nc and max.nc, respectively), and finally the method of cluster (here K-means). Setting the argument index to all tells the function to calculate all possible indexex for each of the number of clusters considered for the data. If you type ? NbClust you will find a list of all indexes calculated by the function NbClust:

? NbClust

Scroll down in the help window and you will see the list.

Let’s run the function. The majority rule is given at the end of the output Conclusion. The function also generates two graphs. The graphs provide values only for two indexes SDindex and the Dindex:

result.km <- NbClust(data=river.std.data, distance = "euclidean", min.nc = 2, max.nc = 12, method = "kmeans", index = "all")

For these data, the majority rule (i.e., the number of groups proposed by the majority of indexes measuring goodness-of-fit) estimates that the best number of clusters is 3. A list of values for each index for each number of clusters is given by:


The columns of this matrix contain the values for the different indexes.

The optimal number of clusters (groups) suggested by each index can be found in:


Note that the majority of indexes (10 of them to be exact) estimated the optimal number as 3 clusters. 5 metrics suggested 2 clusters to be the best solution. Note that there is large variation among indexes (which depends on the data) and some indicated even 12 groups as the optimal solution.

Following the majority rule, we will consider 3 clusters. Let’s list river sites according to their classification into the three clusters:

list.sites.km <- cbind(result.km$Best.partition,river.data)

You could use the function order as earlier to sort the sites according to their cluster membership.

Often, it is nice to produce a graphical output to help out distinguishing the groups (clusters) visually. One way is to plot the results in a reduced (PCA) space given that the data is multivariate (i.e., several variables/dimensions). We will see more on PCA later on in the course but this is a good opportunity to see how it can be helpful in describing the features of these data.

The legends represent the names of the sites in the original data. The function dev.off() resets the graph properties given that they seem to be changed by the function NbClust

pca.river.scores<-princomp(river.std.data,cor = TRUE)$scores
plot(pca.river.scores[,1],pca.river.scores[,2], col =(kmeans.result$cluster+1), main="K-Means result with 3 clusters", pch=20, cex=2,xlab="PC-1",ylab="PC-2")

You can observe that river Macacu (sites starting by M) has two distinct groups of environments and that sites in river Macae (sites starting by MA) are very environmentally distinct from river Macacu.

To see how the different clusters differ in their the environmental features:

pca.river.scores<-princomp(river.std.data,cor = TRUE)$scores
plot(pca.river.scores[,1],pca.river.scores[,2], col =(kmeans.result$cluster+1), main="K-Means result with 6 clusters", pch=20, cex=2,xlab="PC-1",ylab="PC-2")
biplot(princomp(river.data[,3:14],cor = TRUE),xlabs=rep("", nrow(river.data[,3:14])),main="correlation of PCs with variables")

Macacu sites (M) in green color are bigger (greater area), have more sand and higher water velocity. Macacu sites (M) in blue color contain more rubble and more variation in sediment within sites. Macae sites (MA) in red color are in higher altitudes, have more boulder and vary more in depth within sites.

Calculate the mean values for each environmental variable per cluster

mean.per.cluster <- aggregate(river.data[,3:14],list(kmeans.result$cluster),mean)

Classification Trees

Let’s start with a simple simulated example to generate a classification tree; the same way of simulating data was used to generate the simulated example used in our lecture.

Let’s assume 200 observations and two predictor variables X1 and X2. We will generate data assuming an uniform distribution (i.e., runif):

n <- 200
X1 <- runif(n)
X2 <- runif(n)

Now, let’s create a response variables (say called Probabilities) that depends on complex combinations of the two variables X1 and X2:


The values in Y represent probabilities of say a species being absent (0.2) or present (0.8) according to two environmental predictors. Let’s now transform these probabilities into real presences and absences adding some random error using a binomial random generator:

Y <- rbinom(n,size=1,Probabilities)

Let’s plot the distribution of values as seen in class (blue=present/red=absence):


Let’s build the classification tree model for the data (in machine learning jargon we say “train the tree”). There are different packages and we will use the package rpart to produce models and the package rattle to produce better graphs representing the regression tree.

ctree.result<-rpart(Y~X1+X2, method="class")

The tree graphics produced by the package rpart has very low quality. We will use another package to produce a m uch improved graph based on the classification tree built above:


Let’s now use the Den Boer’s (2009) genomic data set (gene expression) to classify subtypes of lymphoblastic leukaemia (ALL) already discussed in our early lectures. The file GSE13425_Norm_Whole.txt contains the gene expression data; 22283 genes in rows and 190 patients in columns. The pheno.txt file is 190 patients in rows and 4 columns having different types of information for each patient, including the 6 cancer subtypes.

Download the leukaemia data
Download the types of leukemia for the data
The data is too large for saving it into a csv file, so we used a text file. The file GSE13425_Norm_Whole.txt contains the gene expressions and the file pheno.txt contains the types of ALL for each patient:

expression <- read.table("GSE13425_Norm_Whole.txt", sep = "\t", head = TRUE, row = 1)
dim(expression) # each row contains gene expression a gene (out of 22283) and each column represents a patient.
pheno <- read.table("pheno.txt", sep = "\t", head = TRUE, row = 1)
subtypes <- as.vector(pheno$Sample.title)

Build the regression tree model and plot it; because the classification tree function classifies rows and patients are in columns, we need to transpose the expression data so that patients instead are classified, i.e., expression<-t(expression). The data are quite large and will take a few seconds to run it.

expression <- t(expression)
tree.expression <- rpart(subtypes ~ expression, method="class")

Note that the gene names were cut because of their sizes and the tree is cut when plotted. So, we will need to change the names of the genes for numbers and then based on their numbers, find the appropriate genes:

expression.tmp <- expression
colnames(expression.tmp) <- 1:22283

Retain the tree with the new names:

tree.expression <- rpart(subtypes ~ expression.tmp, method="class")

The important genes are in columns 592, 575, 208, 212 and 200 (see numbers at the end of gene names in the tree); their names can be retrieved as follows:

colnames(expression[,c(592, 575, 208, 212, 200)])

Note that only 5 genes are needed for producing the optimal tree and, as a result, a really large number of genes were not important (bad predictors) in separating the different leukemia subtypes. So, only a few genes are necessary for maximizing diagnostics on the basis of these data.

One can always use export this tree into a different piece of software with better graphic capabilities. We often to that for “publication” purposes.

Let’s look into how well the classification predicts across the different subtypes; these data can also be placed on the tree. Note that many subtypes are included in the pheno.txt data with subtype names, but there is no gene expression for these types; as such, they are listed here anyway, but their sums per row in the classification table is zero.

predictedclass <- predict(tree.expression, type="class")
table(predictedclass, subtypes)

The observed values are in columns whereas the predicted values are in rows. A couple of examples of predicted errors are:

  1. Two BCR-ABL (cancer subtype) patients were predicted as E2A-rearranged (EP).
  2. One BCR-ABL + hyperdiploidy patient was classified (predicted) as pre-B ALL.

We can transform the predictive success into percentages:

table(predictedclass, subtypes)/apply(table(predictedclass, subtypes),1,sum)*100

The cancer types that have the greater chances of correct diagnostics (correct classifications) are pre-B ALL (88.37% correctly classified), T-ALL (100%) and TEL-AML1 (93.18%).