Tutorial 3: Factorial ANOVA

January 24, 2023 (3rd week of classes)

Factorial Analysis of Variance

REMEMBER: There are no reports or grades for tutorials. Note, though, that reports and midterm 2 are heavily based on tutorials and your knowledge of R. Therefore you should attend tutorials and/or work on them weekly on your own time. Also consider using the Forum for tutorials if you have any specific questions along the way.

General setup for the tutorial
A good way to work through the tutorial is to have it open in our WebBook with RStudio open beside it.


*NOTE: Tutorials are produced in such a way that you should be able to easily adapt the commands to your own data! And that’s the way that a lot of practitioners of statistics use R!


Revisiting one-factor ANOVA

Let’s do a quick review of how to perform a one-factor ANOVA (or one-way ANOVA) in R. Here we will use the data from the “The knees who say night” problem, as seen in class; the data come from Whitlock and Schluter (2009), The Analysis of Biological Data. Note that in most cases I use data in the plural form (data are), but many also use it in the singular form, even though the proper singular form is datum. Here is a discussion for the curious: https://en.wikipedia.org/wiki/Data_(word)

Download the Knees data

Let’s load the data:

circadian <- read.csv("chap15e1KneesWhoSayNight.csv")

It is worth understanding how the data are structured so that you learn how to structure your own data. The command View allows you to see the data as they appear in the csv file. Just as importantly, it allows you to see the names of the variables (in columns) in case you forgot them (your own data) or are working with new data (from others):

View(circadian)
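If you prefer to inspect the data in the console instead (an optional alternative to View), the base functions head and str also work:

head(circadian)  # shows the first six rows
str(circadian)   # shows the variable names and their types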

Now, let’s conduct the ANOVA. We first save the output from the function aov (short for analysis of variance) into the object result.aov.circardian. To generate the ANOVA output (table), use the function summary with the aov object as input:

result.aov.circardian <- aov(shift ~ treatment, data=circadian)
summary(result.aov.circardian)

Assumptions

There are three assumptions underlying ANOVA that observed data need to meet (jargon: we say assumptions are met or not met) to make sure that the observed F statistic is properly estimated and that the type I error of the test equals the pre-established significance level (alpha). Here we will cover the first two assumptions (homoscedasticity and normality among treatments). The third assumption, random sampling (i.e., independence of observations), is more complex and will be discussed later in the course.

Homoscedasticity: The first assumption (as seen in Intro BioStats courses) is equality of variances (i.e., homoscedasticity) among groups (treatments). Here, the null hypothesis H0 is that groups or treatments come from populations with equal variances. If this H0 is rejected, you cannot use a standard ANOVA. To test this H0, we can use Levene’s test. Let’s proceed with the test and find out whether the assumption holds for the circadian data. Base R does not contain Levene’s test, so a special package called car needs to be installed. To do that, simply:

install.packages("car")

Then simply call the package by:

library(car)

Now we are ready to run Levene’s test. Note that when using the leveneTest function, you need to “tell” the function that treatment is a factor for analytical purposes. That’s not the case for the function aov used to conduct the actual ANOVA.

leveneTest(shift ~ as.factor(treatment), data=circadian)

Given that the estimated P-value is 0.8545, we don’t have evidence to reject the H0 that the treatments (groups) come from statistical populations with equal variances. So, the assumption of homoscedasticity is met. Note (to reinforce this concept) that I said “estimated P-value”, as P-values are estimated from samples; if another sample had been used, a different (but perhaps not too different) P-value would likely have been estimated.

Normality: The second important assumption underlying the analysis of variance (ANOVA) is normality. The assumption is that the variable of interest (circadian shift) is normally distributed within each treatment (or within combinations of treatments in the case of factorial ANOVA). This assumption can be checked by assessing whether the ANOVA residuals fall (more or less) on a straight line relative to what would be expected if they came from a normal distribution. If you don’t remember Q-Q plots or they were not covered in your Intro Stats course (they were covered in BIOL322), watch this fantastic video:

For that, we use the widely used Q-Q plot (quantile-quantile plot):

plot(result.aov.circardian, which=2)

Note: applying the function plot to an ANOVA object produces several types of diagnostic plots, which can be accessed using the argument which. The residual Q-Q plot is the second of these plots.
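For instance, the first diagnostic plot (residuals versus fitted values) is another common check; a roughly even vertical spread of the residuals across groups is consistent with homoscedasticity:

plot(result.aov.circardian, which=1)  # residuals vs fitted values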

Most of the residuals fall on a straight line, so we will assume that the data come from normally distributed populations. Remember, however, that ANOVA is robust against lack of normality.


Two-factorial ANOVA

We will use a couple of examples here to show the different types of outcomes from a two-factorial ANOVA (e.g., main effects versus interactions).

Example 1: We will start with the fictional example covered in lecture 4 regarding the effects of exercise and diet on weight loss.

Download the diet fictional data

Let’s load the data:

diet.data <- read.csv("DietExerciseA.csv")

Let’s see how the data are structured. Again, this is particularly important when you want to use your own data and adapt it to the way it should be organized in a csv file:

View(diet.data)

Let’s conduct the ANOVA:

result.aov.diet <- aov(WeightLoss ~ factor(Diet) * factor(Exercise), data=diet.data)
summary(result.aov.diet)

Take a look at the results!

One thing that you will learn is that there are different ways to run the same analysis in R. Another way to run ANOVAs is to use the function lm (fitting Linear Models, hence lm). The lm function can be used to run a number of other analyses besides ANOVA, so it is worth getting acquainted with it. Note that the results are exactly the same as with the function aov:

result.ANOVA.lm <- lm(WeightLoss ~ Diet * Exercise, data=diet.data)
anova(result.ANOVA.lm)

Let’s use the function aggregate to calculate the mean and the variance per combination of exercise and diet:

aggregate(WeightLoss ~ Diet * Exercise, FUN=mean,data=diet.data)
aggregate(WeightLoss ~ Diet * Exercise, FUN=var,data=diet.data)

Let’s verify if the assumptions are met. First, homoscedasticity (i.e., if treatment combinations come from populations with the same variance):

leveneTest(WeightLoss ~ factor(Diet) * factor(Exercise), data=diet.data)

Given that the P-value is 0.2044 (large enough), we have no evidence to reject the H0 that the treatments (groups) come from populations with equal variances. So, homoscedasticity is assumed to be met.

Next, let’s assess normality:

plot(result.aov.diet, which=2)

Most of the residuals fall on a straight line, so we will assume that the data come from normally distributed populations (i.e., that the statistical populations from which the data within combinations of treatments (or groups) were drawn are normally distributed).

The next usual step in two-way ANOVAs is to look at interaction plots to understand how the group means vary across treatment combinations (i.e., how the combinations of diet and exercise treatments compare).
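As a quick first look, you can sketch an interaction plot with the base R function interaction.plot (one line per exercise level, diet on the horizontal axis, mean weight loss on the vertical axis):

with(diet.data, interaction.plot(x.factor=Diet, trace.factor=Exercise, response=WeightLoss))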

We can also use the powerful package ggplot2 to produce plots. We will also need a couple of additional packages to facilitate coding. The package tidyverse is designed to facilitate data science and makes data manipulation much easier with consistent data structures. This package also includes ggplot2; so, in a way, you wouldn’t need to install ggplot2 separately, but you should know that ggplot2 is a standalone package that is really great.
The package psych will be used to calculate standard errors for the group means.

Let’s install them:

install.packages("ggplot2")
install.packages("tidyverse") 
install.packages("psych")

NOTE for macOS Big Sur users (M1 chips) - You will need to force the installation of package dependencies (i.e., install the packages on which the package of interest depends) using install.packages("package name", dependencies = TRUE), as follows:

install.packages("ggplot2",  dependencies = TRUE) 
install.packages("tidyverse", dependencies = TRUE) 
install.packages("psych", dependencies = TRUE) 

Let’s load them:

library(ggplot2)
library(tidyverse) 
library(psych)

A good additional resource for two-factorial anova plots can be found here: https://philippmasur.de/2018/11/26/visualizing-interaction-effects/

Let’s start by summarizing the data. Note that tidyverse and other packages use the operator %>%, which is called a pipe; this operator originally comes from the package magrittr and is now widely used in many R packages to simplify data manipulation.

The operator %>% passes the left-hand side of the operator to the first argument of the right-hand side. For instance, below, diet.data (our data) is passed to the function group_by, which converts the data to a grouped table format that is then compatible with other functions in tidyverse; the results from group_by are then passed to the function summarise. Inside summarise we then use the package psych to estimate standard errors (se). This is a very efficient way to conduct analyses in R, and it takes a while to get used to. And, again, the majority of statistical practitioners adapt code to their own data with ease but without an in-depth understanding of what each function is doing.
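A toy example (not part of our analysis) may make the pipe clearer; the two lines below are equivalent:

mean(c(1, 2, 3))      # classic function call
c(1, 2, 3) %>% mean() # the pipe passes the vector as the first argument of mean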

data.summary <- diet.data %>% group_by(Diet,Exercise) %>% 
  summarise(y_mean = mean(WeightLoss),
            y_se = psych::describe(WeightLoss)$se)
data.summary

Now we can produce the plot (ggplot). It also takes a while to get used to the ggplot “grammar” and, again, adapting existing code is the best way to start getting used to it. The package has a lot of functions for graphic and plot manipulation. In ggplot we use the symbol + to add elements to the plot. Note that the data.summary produced just above is passed to the function ggplot:

data.summary %>% 
      ggplot(aes(x = Diet, y = y_mean, color = Exercise)) +
      geom_line(aes(group = Exercise)) +
      geom_point() +
      geom_errorbar(aes(ymin = y_mean-1.96*y_se, 
                    ymax = y_mean+1.96*y_se),
                    width = .1) +
      labs(x = "diet",color  = "exercise", y = "Weight loss") +
      theme_classic() 
      

The error bars around mean values represent 95% confidence intervals. Instead of the traditional t-value, I have used 1.96 * the standard error (y_se) to avoid having to find the appropriate t-value. It’s easy to do, but I want to focus on the graph here.
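If you wanted the exact multiplier, it would come from the t distribution via the function qt; here is a small sketch with a hypothetical group size of n = 10 (not our actual sample size):

qt(0.975, df=10-1)  # about 2.26; larger than 1.96 because the sample is small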

As I said in one of the lectures, I could spend between 10 minutes and 10 hours producing the same graph depending on how detailed and stylistic I want the graph to appear. Even then, sometimes we give up on trying to find the “perfect” way to make a graph look really great and end up resorting to other software. I sometimes use InkScape (https://inkscape.org/en/), which is freely available for Linux, Mac and Windows! Or, as some people do (not me :), you can also export to pdf and then use PowerPoint to change certain aspects of the graph aesthetics.

Example 2: Here we will use data on normalized change in gene expression for 1000 genes in two (2) mouse strains and six (6) brain regions; each combination of strain and region has two (2) replicates, as seen in class; i.e., a total of 24 (2 x 6 x 2) rows in the file; genes are in columns.

In this example, we are interested in whether change in gene expression (response variable) differs with respect to two factors: mouse strain and brain region. However, here we have 1000 genes, so we could potentially run 1000 ANOVAs.

Download the Sandberg gene data

Let’s load the data:

gene.Sandberg <- read.csv("SandbergGeneData.csv")

Let’s look at the data. The brain regions are coded to keep names short: “ag” (Amygdala), “cb” (Cerebellum), “cx” (Cortex), “ec” (EntorhinalCortex), “hp” (Hippocampus) and “mb” (Midbrain). X129 and B6 are mouse strains (I often wonder how these names were invented). If you are curious about mouse strains, look into it: https://www.jax.org/jax-mice-and-services/customer-support/technical-support/genetics-and-nomenclature/hybrid-mice

View(gene.Sandberg)
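As a quick sanity check of the structure described above (24 rows; 2 grouping columns plus 1000 gene columns), you can inspect the dimensions and the design:

dim(gene.Sandberg)  # expect 24 rows and 1002 columns
table(gene.Sandberg$strain, gene.Sandberg$brain.region)  # expect 2 replicates per combination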

Let’s conduct a two-way ANOVA for the first gene, i.e., the first column in the data (aa000148.s.at):

result.ANOVA.gene <- lm(aa000148.s.at ~ strain * brain.region, data=gene.Sandberg)
anova(result.ANOVA.gene)

Neither the main effects nor their interaction is significant.

We can certainly run all the other analyses that were done earlier for the diet-and-exercise example (Levene’s test, normality, interaction plot). In gene expression studies, however, we are often interested in “bulk” analyses, such as identifying genes for which differences relate only to strain, only to brain.region, or for which only the interaction was significant. For that, we need to run all 1000 two-way ANOVAs. This is the kind of analysis that really demonstrates the power of R as a statistical environment. Imagine clicking through a menu-driven statistical software, one gene at a time, 1000 times!!!

There are different ways in which we can run a bulk two-way ANOVA for multiple responses (here genes). Here we will use one that is not too complex to understand and at the same time very efficient.

The first thing we will do is create a vector containing the names of all genes. These are response variables, so we will call the vector responseList, but any other name of your choosing would work:

responseList <- colnames(gene.Sandberg)[3:length(gene.Sandberg)]
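A quick check that the vector contains what we expect:

length(responseList)  # should be 1000, one name per gene
head(responseList)    # the first few gene names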

We will then use the function lapply to apply a function that we will write, allowing us to run the multiple ANOVAs. To make it general, we will create a formula that goes into the function lm (linear model). I called that formula my.formula, but any name would work as well. ANOVAs are linear models (we will see that later in the course), so we can use the function lm to fit each model and then conduct an ANOVA on the results of the linear model:

modelList <- lapply(responseList, function(resp) {
  my.formula  <- formula(paste(resp, " ~ strain*brain.region"))
  lm(my.formula, data = gene.Sandberg)
})

It should take less than 30 seconds to run 1000 ANOVAs, one for each gene. The code creates a list, which I called modelList. Elements of lists are accessed as “[[number]]” and not simply as “[number]”, which is used for accessing cells in vectors and matrices. Let’s take a look at the model for the fifth gene:

modelList[[5]]

We could also run the ANOVA for this gene as follows:

anova(modelList[[5]])

Let’s now run an ANOVA for all genes by simply using lapply again:

ANOVA.genes <- lapply(modelList, anova)

Another few seconds and done! 1000 ANOVAs! Wow!

We can pull out the anova results for any individual gene. Say gene 145:

ANOVA.genes[[145]]

Now, take a look at the list of ANOVA results:

ANOVA.genes

All 1000 ANOVA results are in the list ANOVA.genes. Now, let’s say that we want to identify all the genes for which change in gene expression differed only between strains. For that, we need to isolate from the ANOVA table the probabilities for the two main factors (strain and brain.region) as well as for their interaction. The first thing to learn is that all the information kept in an object (here an ANOVA table) can be listed by using the function names:

names(ANOVA.genes[[1]])

In there, you find “Pr(>F)”, which stands for the P-values we are looking for. Note that the P-values are in the 5th column of the ANOVA object. So we can access the three P-values (the two main effects, strain and brain region, as well as the P-value for the interaction) as follows:

ANOVA.genes[[1]][1:3,5]

When you look at the table, you can see that the 1st probability refers to strain, the 2nd to brain.region and the 3rd to their interaction.

Now we need to get all the p-values from all the 1000 ANOVA objects in the list. This can be done as follows:

P.values <- sapply(ANOVA.genes, function(x) {x[1:3,5]})
dim(P.values)

Again, this code may seem a bit “cryptic” to you, but it can be easily modified for your own data. With time and practice, one gets used to the logic of functions and commands in R. Again, most practitioners adapt existing code or do internet searches to find solutions to their problems in R. R also has a huge community, and there are lots of blogs and forums where you can ask questions - members of the community of R users usually answer very quickly. R has millions of users!

P.values is a 3 (rows: main effects and interaction) by 1000 (columns: genes) matrix. Let’s give names to the rows and columns of the matrix. This will make it easier to identify which genes and effects (main and interaction) are significant (we will see that below). The genes appear in the order in which they were analyzed, which is the order recorded in the vector responseList we created earlier.

rownames(P.values) <- c("strain","brain.region","interaction")
colnames(P.values) <- responseList

Let’s take a look into the P.values matrix:

View(P.values)

Note that only the first 50 genes are shown; you can use the arrows at the top to view the other genes.

Now, let’s identify which genes differ in their change in gene expression with respect to strain only, considering a significance level (alpha) equal to 0.05.

The function which allows us to determine the columns (genes) in the matrix P.values that satisfy particular conditions. Let’s find the genes for which only strain is significant (i.e., strain significant, but not brain.region or the interaction):

genes.strain.only <- which(P.values["strain",] <= 0.05 & P.values["brain.region",] > 0.05 & P.values["interaction",] > 0.05) 
genes.strain.only

The results come with the column names of the matrix P.values. To list just the names of the genes:

names(genes.strain.only)
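To count how many genes fall into this category:

length(genes.strain.only)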

Let’s find the genes for which only the interaction was significant:

genes.interaction.only <- which(P.values["strain",] > 0.05 & P.values["brain.region",] > 0.05 & P.values["interaction",] < 0.05) 
names(genes.interaction.only)

You can certainly modify the code to determine other combinations of effects (e.g., only brain.region significant). You can also modify the code to test whether the ANOVA assumptions are met for each gene, or only for the genes of interest (e.g., the ones that rejected the H0 for the desired conditions).
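For example, here is a minimal sketch of a bulk normality check using the base R function shapiro.test on the residuals of every fitted model (note: this numerical test was not covered above and is just one possible complement to the Q-Q plot):

shapiro.P <- sapply(modelList, function(m) shapiro.test(residuals(m))$p.value)
sum(shapiro.P < 0.05)  # number of genes whose residuals deviate from normality at alpha = 0.05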

We can also generate interaction plots for any gene of interest. Let’s consider the first gene for which only the interaction was significant:

gene.of.interest <- names(genes.interaction.only[1])
gene.of.interest

The change in gene expression values for this gene are:

gene.Sandberg[gene.of.interest]

Because we have lots of genes, the easiest approach is to create a small data frame containing the groups (which are in the first two columns of the complete data set gene.Sandberg) and the change in gene expression values for the gene of interest:

data.gene <- cbind(gene.Sandberg[,1:2],gene.Sandberg[gene.of.interest])
View(data.gene)

Now, we can easily adapt the code we learned earlier for the fictional diet study to the gene data:

data.summary <- data.gene %>% group_by(strain,brain.region) %>% 
  summarise(y_mean = mean(aa014563.s.at),
            y_se = psych::describe(aa014563.s.at)$se)
data.summary 
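Note that the gene name (aa014563.s.at) was typed directly into the code above, so it needs to match the name stored in gene.of.interest. A more general version (a sketch using dplyr’s .data pronoun) would work for whichever gene is stored there:

data.summary <- data.gene %>% group_by(strain,brain.region) %>% 
  summarise(y_mean = mean(.data[[gene.of.interest]]),
            y_se = psych::describe(.data[[gene.of.interest]])$se)
data.summary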

We can decide whether we want brain.region or strain on the horizontal axis, with the other shown as lines. Because we have six brain regions and only two strains, let’s put brain regions on the horizontal axis:

data.summary %>% 
      ggplot(aes(x = brain.region, y = y_mean, color = strain)) +
      geom_line(aes(group = strain)) +
      geom_point() +
      geom_errorbar(aes(ymin = y_mean-1.96*y_se, 
                    ymax = y_mean+1.96*y_se),
                    width = .1) +
      labs(x = "brain region",color  = "strain", y = "Normalized change in gene expression") +
      theme_classic() 

One may find the error bars very distracting; we can simply remove them and re-plot as follows:

data.summary %>% 
      ggplot(aes(x = brain.region, y = y_mean, color = strain)) +
      geom_line(aes(group = strain)) +
      geom_point() +
        labs(x = "brain region",color  = "strain", y = "Normalized change in gene expression") +
      theme_classic() 

After the lectures and this tutorial, we hope that you:

  1. Understood how data are structured for a two-way ANOVA.
  2. Understood how to interpret two-way ANOVAs.
  3. Learned how to test the assumptions of ANOVAs and multi-factorial ANOVAs (two-way and beyond).
  4. Started understanding how to create conditions to find patterns in bulk analyses involving multiple response variables.
  5. Improved your overall R skills.
  6. Learned the ggplot environment. This is a widely used package for graphing and we will make use of it from time to time in our course.