Tutorial 12: PCA and RDA

April 11, 2023 (12th week of classes)
Unconstrained and Constrained ordinations
The term unconstrained refers to ordination methods in which no external information is considered when analysing the data. The most commonly used method is Principal Component Analysis (PCA). Conversely, constrained ordination uses external information: the response variables of interest are first predicted by a set of predictors, and the predictions are then analysed using PCA or another (unconstrained) ordination method. The most common constrained method is Redundancy Analysis (RDA).
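
To make the distinction concrete, here is a minimal sketch of the idea behind RDA: regress the responses on the predictors, then run a PCA on the fitted values. The matrices Y and X below are simulated stand-ins, not the tutorial data.

# Sketch only: Y and X are made-up matrices used to illustrate the logic
set.seed(1)
Y <- matrix(rnorm(50), nrow = 10)                 # 10 sites x 5 response variables
X <- data.frame(x1 = rnorm(10), x2 = rnorm(10))   # 2 predictors
fit <- lm(Y ~ x1 + x2, data = X)                  # multivariate regression of Y on X
Y.hat <- fitted(fit)                              # predicted (constrained) responses
constrained.pca <- prcomp(Y.hat)                  # PCA of the predictions = the RDA axes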


Remember: The tutorials were produced in such a way that you should be able to easily adapt the codes to your own data!

Principal Component Analysis (PCA)

The default function in R for conducting a PCA is prcomp, but it has limited capabilities. Here we will also install the package InPosition, which has very nice graphical capabilities.

install.packages("InPosition")
library(InPosition)

It may say that the package was built under a different R version, but this should not cause any issues when running the package functions.

For an empirical application, let’s use the environmental data from the two Brazilian streams used in tutorial 11.

Download the environment data

river.data <- read.csv("env.river.csv",header=TRUE)
View(river.data)

Let’s start by using the standard PCA function princomp in R. The argument cor needs to be set to TRUE so that all variables are standardized and have equal weight before the PCA is computed.

PCA.result<-princomp(river.data[,3:14],cor=TRUE)
biplot(PCA.result,xlabs=river.data[,2])

Note that we only use the variables from columns 3 to 14, as these contain the environmental variables. The first two columns are river and site identifiers.
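
Before moving to nicer graphics, it is worth inspecting the standard princomp output numerically (these are generic accessors, nothing specific to this data set):

summary(PCA.result)                        # proportion of variance explained by each component
round(loadings(PCA.result)[, 1:2], 2)      # loadings of each variable on the first two components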

Now let’s use the package factoextra instead. It uses ggplot2 and has much greater graphical output capabilities than the standard R functions. More information on this package can be found at http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/

install.packages("factoextra")
library("factoextra")

The ggplot2 package is required to run factoextra.

Compute the PCA:

res.pca <- princomp(river.data[,3:14], cor = TRUE)

Visualize eigenvalues (scree plot). Show the percentage of variation explained by each principal component:

fviz_eig(res.pca)

We will not discuss here the tools for assessing the number of PCA axes that should be interpreted (i.e., number of significant eigenvalues). The package InPosition can be used for this purpose.
The series of functions below can be easily adapted to any type of data.
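
As a quick rule of thumb (not the inferential test available in InPosition), the Kaiser-Guttman criterion keeps the components whose eigenvalue exceeds the mean eigenvalue:

eigenvalues <- res.pca$sdev^2
eigenvalues[eigenvalues > mean(eigenvalues)]   # components retained under the Kaiser-Guttman rule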

Graph of observations. Observations (here, river sites) with a similar profile (i.e., similar scores on PC-1 and PC-2) are plotted closer to each other. I chose the gradient colours here because I find they represent variation in two dimensions well. The repel argument draws a line connecting a site’s point to its label when labels would otherwise overlap too much. This will take a couple of seconds to plot.

fviz_pca_ind(res.pca,col.ind = "cos2",gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE)

Plot the contribution of each environmental variable to the two first principal components:

fviz_pca_var(res.pca,col.var = "contrib",gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),repel = TRUE)

The plot reveals that river sites (multivariate scores) towards the right of the graph tend to have greater amounts of sand, whereas those towards the left tend to be at higher altitudes and have greater variance in depth. Sites towards the top of the graph have greater amounts of rubble, and those towards the bottom have greater depth.
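
These interpretations can be checked numerically with the factoextra helper get_pca_var():

var.info <- get_pca_var(res.pca)
round(var.info$contrib[, 1:2], 1)   # % contribution of each variable to PC-1 and PC-2
round(var.info$cor[, 1:2], 2)       # correlation of each variable with PC-1 and PC-2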

Putting all this information into one graph: a biplot of individual river sites and environmental variables. This will take a couple of seconds to plot.

fviz_pca_biplot(res.pca, repel = TRUE,col.var = "#2E9FDF",col.ind = "#696969")

Groups can be individually coloured to improve pattern recognition. We will use here the groups identified by the K-means analysis performed earlier; the code below recomputes them from the standardized data.

river.std.data <- scale(river.data[,3:14]) 
kmeans.result<-kmeans(river.std.data,3)
groups <- as.factor(kmeans.result$cluster)
fviz_pca_ind(res.pca,col.ind = groups,palette = c("#00AFBB",  "#FC4E07","#E7B800"),addEllipses = TRUE, ellipse.type = "convex", legend.title = "Groups",repel = TRUE)

As can be observed, the three groups are well separated along the first two PCA dimensions. The different graphs can be exported and organized into single figures using Inkscape or PowerPoint.
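
Because the fviz_ functions return ggplot objects, plots can also be saved directly from R before being assembled in Inkscape or PowerPoint (the file name below is just an example):

group.plot <- fviz_pca_ind(res.pca, col.ind = groups, addEllipses = TRUE, ellipse.type = "convex", repel = TRUE)
ggsave("pca_groups.png", group.plot, width = 7, height = 5, dpi = 300)   # example file name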

Let’s finalize our last BIOL4222 tutorial with an example of “planetary” dimensions (literally). As you should understand by now, PCA is often applied to represent (summarize) complex multivariate information in a few dimensions that capture common variation (correlations or covariances) among variables. How about mapping the PCA scores so that we can visualize how sites vary environmentally across a map (i.e., spatially)?

We will use a data set containing environmental information for the whole terrestrial part of the planet. Each geographic cell is 110 km by 110 km, and the values for each cell represent the average values of the environmental features within that cell:

  1. Latitude (Lat) and Longitude (Long) at the centre of each geographic cell.
  2. Average precipitation (last 40 years; avg_prec)
  3. Average actual evapotranspiration (avg_ET, a proxy of productivity)
  4. Average vegetation index (avg_VI)
  5. Mean altitude (avg_Alt)
  6. Maximum altitude minus minimum altitude (altitudinal range; range_Alt)
  7. Average temperature (avg_temp)
  8. Seasonal temperature (annual range in temperature; seas_temp)
  9. Seasonal precipitation (annual range in precipitation; seas_prec)

Download the global climate data

world.climate <- read.csv("Climate.csv")
dim(world.climate)
head(world.climate)

There are 14,909 geographic cells of 110 km by 110 km covering the terrestrial part of the planet (note that some cells may be missing due to lack of data).

Run the PCA; the environmental variables are in columns 3 to 10. Lat and Long are in columns 1 and 2.

world.pca <- princomp(world.climate[,3:10], cor = TRUE)

Analyse the percentage of variation explained by each principal component:

fviz_eig(world.pca)

The exact values for each axis are:

world.pca$sdev^2/sum(world.pca$sdev^2)*100   # eigenvalues expressed as a percentage of the total variance

The first three axes summarize 35.8%, 25.5% and 20.0% of the total variation across all variables, respectively.
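
The cumulative percentage (useful when deciding how many axes to keep) can be obtained in the same way:

cumsum(world.pca$sdev^2)/sum(world.pca$sdev^2)*100   # cumulative % of variation explained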

Let’s plot the variables. There are thousands of PCA scores, each representing a geographic cell, so it would be too difficult to plot them all; instead, we will contrast the eigenvector (variable) structure with the maps produced later.

fviz_pca_var(world.pca,col.var = "contrib",gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),repel = TRUE)

Note that average temperature, precipitation, vegetation index and evapotranspiration are all strongly positively correlated with PC-1 (Dim 1), whereas seasonality of temperature and precipitation are negatively correlated with PC-2. This indicates that seasonality (the difference between minimum and maximum) is independent of average temperature, precipitation, vegetation index and evapotranspiration, given that PC-1 and PC-2 are uncorrelated.
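
These relationships can be verified by correlating the original variables with the scores on the first two axes:

round(cor(world.climate[,3:10], world.pca$scores[,1:2]), 2)   # correlation of each variable with PC-1 and PC-2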

Let’s produce the map using the package lattice:

install.packages("lattice")
library(lattice)

Let’s produce the map based on PC-1 (dimension 1):

lat <- world.climate[,1]
long <- world.climate[,2]
tcol=colorRampPalette(c("blue","lightblue","yellow","red"))(100) 
levelplot(world.pca$scores[,1]~lat*long, aspect="iso", cuts=99, col.regions=tcol) 

High scores (red) indicate areas where temperature, precipitation, vegetation index and evapotranspiration are highest, and low scores (blue) where they are lowest, given that these four variables are all positively correlated with PC-1.

Let’s map PC-2:

levelplot(world.pca$scores[,2]~lat*long, aspect="iso", cuts=99, col.regions=tcol) 

Blue areas have negative scores on PC-2; given that seasonality of temperature and precipitation are negatively correlated with PC-2, blue areas have the largest seasonality and red areas the smallest (remember that PC-2 is negatively correlated with seasonality, as seen in class).

Redundancy Analysis (RDA)

Download the fictional data

Let’s start with a fictional example. This example was built to represent a very clear pattern so that you can understand the main features of an RDA. It has only 4 species (response variables) and 4 environmental variables across 10 observations (sites).

data.fictional <- read.csv("RDA_fictional_example.csv")
# the data need to be in matrix form to use the rda function in the vegan package
data.fictional <- as.matrix(data.fictional) 
View(data.fictional)

We will use the library vegan to run RDA:

install.packages("vegan")
library(vegan)

Let’s put the species in a separate matrix (easier to run the rda function) and give them shorter names so that the plot is easier to read.

species <- data.fictional[,2:5]
colnames(species) <- c("s1","s2","s3","s4")
env <- data.frame(data.fictional[,6:9])

Run the RDA model:

RDA.result <- rda(species~Env1+Env2+Env3+Env4,data=env)
result.anova <- anova(RDA.result)
result.anova
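
The anova() call on a vegan rda object runs a permutation test of the overall model. The significance of individual predictors can also be tested with the by argument:

anova(RDA.result, by = "terms", permutations = 999)   # permutation test for each predictor in turn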

The R2 for the RDA model can be found within the RDA.result object, or calculated simply as:

RDA.result
R2 <- result.anova$Variance[1]/sum(result.anova$Variance)
R2

Because the data were built to be very close to ‘perfect’ for pedagogical demonstration, the R2 is quite high.

The plot using the plot function is quite rough:

plot(RDA.result)

A better graph (though still rough) can be produced by:

plot(RDA.result,type="n")
text(RDA.result, "sites", col="red", cex=0.8)
points(RDA.result, pch=21, col="red", bg="yellow", cex=1.2)
text(RDA.result, "species", col="blue", cex=1.8)

Producing an RDA based on standardized response variables; the default is based on centered variables (as seen in class):

RDA.result.std <- rda(scale(species)~Env1+Env2+Env3+Env4,data=env)
plot(RDA.result.std,type="n")
text(RDA.result, "sites", col="red", cex=0.8)
points(RDA.result.std, pch=21, col="red", bg="yellow", cex=1.2)
text(RDA.result.std, "species", col="blue", cex=1.8)

Now let’s look at a real example using data included in the vegan package. It is very common for packages to include real data sets that can be used as examples.

data(varechem)
data(varespec)
? varespec # look at the data structure and origin

Run the model:

vegetation.RDA <- rda(as.matrix(varespec) ~ as.matrix(varechem))
anova.rda.vegetation <- anova(vegetation.RDA)
anova.rda.vegetation

Calculate the R2 (total amount of species variation explained by the environmental variables):

R2 <- anova.rda.vegetation$Variance[1]/sum(anova.rda.vegetation$Variance)
R2

Because of the large number of predictors, the R2 is likely quite inflated and needs to be adjusted:

n <- dim(varechem)[1]
p <- dim(varechem)[2]
R2.adjusted <- 1-(1-R2)*(n-1)/(n-p-1)
R2.adjusted
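
vegan also provides the RsquareAdj() function, which should give values consistent with the manual calculation above:

RsquareAdj(vegetation.RDA)   # returns both the raw and the adjusted R2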

Let’s produce the RDA triplot:

plot(vegetation.RDA,type="n")
text(RDA.result, "sites", col="red", cex=0.8)
points(vegetation.RDA, pch=21, col="red", bg="yellow", cex=1.2)
text(vegetation.RDA, "species", col="blue", cex=0.8,scaling = 1)

There are not many packages that conduct RDAs, and their graphical output is not of great quality. Often, we extract the values from the analysis and use ggplot to build more sophisticated triplots. Some code on how to do this can be found here:

https://stackoverflow.com/questions/32194193/plotting-rda-vegan-in-ggplot
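
As a starting point, here is a minimal sketch (one of many possible layouts) that extracts the scores with vegan’s scores() function and plots them with ggplot2:

library(ggplot2)
# extract site and species scores from the vegan object (default scaling)
site.sc <- as.data.frame(scores(vegetation.RDA, display = "sites"))
spec.sc <- as.data.frame(scores(vegetation.RDA, display = "species"))
ggplot(site.sc, aes(RDA1, RDA2)) +
  geom_point(colour = "red") +
  geom_text(data = spec.sc, aes(label = rownames(spec.sc)), colour = "blue") +
  theme_bw()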