Tutorial 4: Multiple Testing

January 31, 2023 (4th week of classes)

The Multiple Testing Problem & Adjustment Procedures

REMEMBER: There are no reports or grades for tutorials. Note though that reports and midterm 2 are heavily based on tutorials and your knowledge of R. Therefore you should attend tutorials and/or work on them weekly on your own time. Also consider using the Forum for tutorials with you have any specific questions along the way.

General setup for the tutorial
A good way to work through is to have the tutorial opened in our WebBook and RStudio opened beside each other.

*NOTE: Tutorials are produced in such a way that you should be able to easily adapt the commands to your own data! And that’s the way that a lot of practitioners of statistics use R!

Understanding the issues of inflated Type I error when conducting multiple tests

Let’s assume a one-way ANOVA design with 10 groups (treatments) in which the null hypothesis H0 is true; in other words, all samples come from populations with the same mean. Assume that the statistical population mean is 10 and its standard deviation is 3. Let’s then sample observations at random from that population and divide them into 10 groups (with 7 observations each). We can then to run an ANOVA in the simulated data:

Data.simul <- matrix(rnorm(70,mean=10,sd=3))
Groups <- rep(1:10,each=7)

Let’s run an ANOVA on the simulated data and extract the P-value from the ANOVA table:

result.lm <- lm(Data.simul~Groups)

As expected, given that all samples came from the same population, likely (assuming an alpha = 0.05), the null hypothesis wasn’t rejected (i.e., P-value > alpha). Now let’s repeat the sampling multiple times and count how many times the null hypothesis H0 was rejected (i.e., how many times the ANOVA P-value was smaller than alpha). Here we will use again the function lapply and sapply to run the same analysis across multiple data. Let’s consider 1000 data sets based on the population we described above.

n.tests <- 1000
data.simul <- data.frame(replicate(n.tests,rnorm(n=70, mean=10, sd=3)))
lm.results <- lapply(data.simul, function(x) {
  lm(x~as.factor(Groups), data = data.simul)
ANOVA.results <- lapply(lm.results, anova)
P.values <- sapply(ANOVA.results, function(x) {x$"Pr(>F)"[1]})
TypeIerror <- length(which(P.values<=0.05))/n.tests

Pretty close to alpha (0.05), right? If you had run the simulation infinite times, the value in TypeIerror (i,e,m false positives) would had been exactly equal to 0.05, i.e, 5% of tests rejected H0 when H0 was correct.

The issue here is that we ended up with about 50 tests that were significant when the null hypothesis H0 was true. Let’s use the Bonferroni and the FDR adjustments for multiple tests to adjusted these p-values:

p.bonf <- p.adjust(P.values,method="bonferroni")
p.fdr<- p.adjust(P.values,method="fdr")

Let’s view the original and the adjusted p-values:


Go through these p.values, and you will notice that none of them were considered significant after adjustment; even though 50 original p.values were. We can also count them as follows:

length(which(p.bonf <= 0.05))
length(which(p.fdr <= 0.05))

In real situations, most likely, there will be a mix of true positives and true negatives, and p-values are adjustments to optimize their detection, i.e., avoid type I errors (false positives) and type II errors (false negatives).

Multiple correction procedures to post-hoc constrasts between group means in a one-factorial design

Let’s use the “The knees who say night” data from Whitlock and Schluter (2009) as used in tutorial 3; The Analysis of Biological Data.

Download the Knees data

Let’s load the data:

circadian <- read.csv("chap15e1KneesWhoSayNight.csv")

We know from tutorial that the variances of the three groups (treatments) were not significantly different. And that the ANOVA was significant, indicating that at least two groups differ in their means. Which ones? BH below makes the pairwise t test function to use the FDR procedure (coded as BH because of the original authors, Benjamini & Hochberg). g stands for group.

pairwise.t.test(circadian$shift, g = circadian$treatment,p.adjust.method="none")
pairwise.t.test(circadian$shift, g = circadian$treatment,p.adjust.method="bonferroni")
pairwise.t.test(circadian$shift, g = circadian$treatment,p.adjust.method="BH")

Note that the function produces P-values (corrected) for pairwise (i.e., two at the time) comparisons of means. Ideally, you would report these values in a table where the observed t-values for each contrast, their original P-values (uncorrected) and corrected are presented (as seen in our lecture). To my knowledge, there is no standard function in R that produces such a table. To do that, I wrote two specialized functions (included in the tutorial) for one and two-way ANOVAS.

Download the pairwise t table functions
The two functions are included the file (pairwise.t.table.R) and can be loaded into R using the command source as follows:


There are two functions in the file pairwise.t.table. If you want to look into the code of each function, you can simply type these names:


For the FDR correction (BH stands for “Benjamini and Hochberg”, the authors of the procedure)


For the Bonferroni correction:


Note that, as expected, the adjusted p-values (AdjP) are always larger than the unadjusted ones (unAdjP).
We usually don’t report the adjusted p-values for different correction procedures, so choose between Bonferroni and FDR, though I suggest the latter in almost all cases.

Multiple correction procedures to post-hoc contrasts between group means in a two-factorial design

Here we will use another study (so that you get even more used to multifactorial designs) by Harley, C.D.G. 2003. Ecology 84: 1477-1488. The main question of the paper was: “Do abiotic stress and herbivory interact to set range limits across a two-Dimensional stress gradient.” Abiotic stress here is the position in the intertidal zone (see below).

Harley conducted an experiment to assess the influence of herbivory (presence or absence of herbivores), position of the habitat (low and intermediate position in the intertidal zone) on the surface area (cm2) covered by algae (response variable). Presences or absences of herbivores is coded as plus or minus, respectively, as herbivores were removed or allowed in experimental plots with algae. Low position in the intertidal zone is likely stressful for the algae of interest (Mazzaella parksii) whereas intermediate position is optimal. Algae turfs of the same size were transplanted into plots combinining the levels of the two factors (4 combinations in total). The surface area (cm2) covered by algae was measured at the end of the experiment to assess how algae was affected by different experimental combinations.

Download the Harley data

Let’s upload the data:

algae <- read.csv("IntertidalAlgae.csv")

Let’s run the two-way ANOVA:

aov.result<-aov(sqrtArea ~ height * herbivores, data=algae)

Herbivory and interaction were significant. As such, we need to contrast pairwise differences among treatment levels. There are 6 possible pairwise mean tests based on the 4 combinations (levels and factors):

low.minus - mid.minus  
low.minus - low.plus   
low.minus - mid.plus    
mid.minus - low.plus   
mid.minus - mid.plus   
low.plus - mid.plus   

One way to calculate adjusted P-values in a two-way design in R is to first calculate all pairwise combinations between the two factors (height level and herbivory amount) and their levels (height = low/mid and herbivory = minus/plus). Each cell in the data will be now represented by the combination of their levels.

interactions.out <- interaction(algae$height,algae$herbivores)

Now we can run the correction procedures:

pairwise.t.test(algae$sqrtArea, g = interactions.out,p.adjust.method="none")
pairwise.t.test(algae$sqrtArea, g = interactions.out,p.adjust.method="bonferroni")
pairwise.t.test(algae$sqrtArea, g = interactions.out,p.adjust.method="BH")

But it still looks a bit unpolished. Let’s use the specialized function I made for the case of two-way ANOVAs:

pairwise.t.table.twoWay(algae$sqrtArea,algae$height,algae$herbivores,method = "BH")

Only significant values based on the adjusted P-values should be interpreted.