Tutorial 10: ANOVAs

Analysis of variance (ANOVA) and important related statistical frameworks: Levene’s test of homogeneity of variances and the Tukey-Kramer tests between all pairs of means

Week of November 14, 2022
(10th week of classes)


How the tutorials work

The DEADLINE for your report is always at the end of the tutorial. Problems for this report are spread out throughout this tutorial.

The INSTRUCTIONS for this report are found at the end of the tutorial.

After a while, students may be able to complete their reports without attending the synchronous lab/tutorial sessions. That said, we highly recommend that you attend the tutorials, as your TA is not responsible for assisting you with lab reports outside of the synchronous lab sessions.

The REPORT INSTRUCTIONS (what you need to do to get the marks for this report) are found at the end of this tutorial.

Your TAs

Section 101 (Wednesday): 10:15-13:00 - John Williams
Section 102 (Wednesday): 13:15-16:00 - Hammed Akande
Section 103 (Friday): 10:15-13:00 - Michael Paulauskas
Section 104 (Friday): 13:15-16:00 - Alexandra Engler


Levene’s test of homogeneity of variances

When conducting an Analysis of Variance (ANOVA), we assume that the samples from all groups were drawn from populations with the same variance. This is because the F-distribution used in ANOVA assumes that the variances within groups do not differ (i.e., they can be assumed to be the same). As seen previously, the F-distribution can be understood as the sampling distribution of the ratio of two sample variances drawn from normally distributed populations with the same variance. ANOVA results are known to be affected by differences in variances among groups in the same way as results based on the standard two-sample t-test.
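The link between the F-distribution and ratios of sample variances can be illustrated with a small simulation. This is only a sketch: the sample sizes, the number of replicates and the seed are arbitrary choices made for illustration, not values taken from the circadian data.

```r
# Sketch: ratios of sample variances from two normal populations
# that share the same variance (all settings below are hypothetical).
set.seed(1)                 # arbitrary seed, for reproducibility only
n1 <- 8                     # hypothetical size of sample 1
n2 <- 7                     # hypothetical size of sample 2

ratios <- replicate(10000, {
  x <- rnorm(n1, mean = 0, sd = 1)
  y <- rnorm(n2, mean = 0, sd = 1)
  var(x) / var(y)           # ratio of the two sample variances
})

# Theoretical 95th percentile of the F-distribution with (n1 - 1, n2 - 1) df
theo95 <- qf(0.95, df1 = n1 - 1, df2 = n2 - 1)

# Proportion of simulated ratios above that percentile (should be close to 0.05)
propAbove <- mean(ratios > theo95)
print(propAbove)
```

If the two populations had different variances, the simulated ratios would no longer follow this F-distribution, which is why the assumption matters.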

As such, before conducting an ANOVA, we first need to check whether this assumption is met. So, the first step when conducting an ANOVA is to test for the equality (homogeneity) of variances (commonly referred to as homoscedasticity).

Here the H0 is that the samples (groups) come from populations with the same variance. If this H0 is rejected, we cannot use the standard ANOVA. To test this H0, we will use Levene’s test. There are other tests to assess homoscedasticity, but Levene’s test is commonly used. Let’s proceed with Levene’s test and determine whether we should use ANOVA to analyse the circadian data seen in lecture 17.

The standard R installation (called base R) does not contain Levene's test, so a special package called car needs to be installed. There are other packages that contain Levene's test, but car is commonly used. To install it, simply run:

install.packages("car",dependencies=TRUE)

During the installation process, R may ask you "Do you want to install from sources the package which needs compilation? (Yes/no/cancel)"; if that happens (it doesn’t always), simply type no and press enter.

You may also need to install the package openxlsx before you can run car. It seems that car depends on it, but it is not always installed together with car, even when setting dependencies=TRUE:

install.packages("openxlsx")

Now, call the package car:

library(car)

Download the circadian data file

Now load and inspect the data:

circadian <- read.csv("chap15e1KneesWhoSayNight.csv")
View(circadian)

Now we can run Levene’s test as follows:

leveneTest(shift ~ factor(treatment), data=circadian)
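To see what the test is doing, here is a minimal base R sketch on simulated data (not the circadian data). It assumes the default setting of leveneTest, center = median, under which the test amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The data frame sim and its columns are hypothetical names used only for this illustration.

```r
# Sketch: a Levene-type test computed by hand on simulated data.
set.seed(42)                                  # arbitrary seed
sim <- data.frame(
  group = factor(rep(c("A", "B", "C"), times = c(8, 7, 7))),
  y     = rnorm(22)                           # same variance in every group
)

# Absolute deviation of each observation from the median of its own group
sim$absDev <- abs(sim$y - ave(sim$y, sim$group, FUN = median))

# One-way ANOVA on the absolute deviations
leveneByHand <- anova(aov(absDev ~ group, data = sim))
print(leveneByHand)
```

A large P-value in this table means there is no evidence that the spread differs among groups. On the same data, leveneTest(y ~ group, data = sim) should report the same F and P values if the assumption about its default holds.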

Problem 1:
1a) Using a significance level (alpha) equal to 0.05, should we reject the null hypothesis of homoscedasticity or not?

1b) Can we apply ANOVA to analyse these data?

In your file identify problem 1:
# Problem 1: write your answer.
# continue your answer


Graphical and table representation of group means

Let’s produce a stripchart of the data to observe the differences among groups in a graphical format. The argument pch=1 sets the data points to be plotted as open circles.

stripchart(shift ~ treatment, data = circadian, vertical = TRUE,pch=1,col="firebrick",xlab="light treatment",ylab="shift in circadian rhythm (h)")

The groups (treatments) appear in alphabetical order. As discussed in class, because the eyes treatment has the most negative shift in circadian rhythm (i.e., highest production of melatonin), one may prefer to display it in the third column of the graph so that the contrast among groups (treatments) is more obvious. This can be done by changing the order of the levels of the column treatment in the data:

circadian$treatment <- factor(circadian$treatment, levels = c("control", "knee", "eyes")) 

Now plot the stripchart again:

stripchart(shift ~ treatment, data = circadian, vertical = TRUE,pch=1,col="firebrick",xlab="light treatment",ylab="shift in circadian rhythm (h)")
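The effect of setting the levels explicitly can be checked on a tiny made-up example. The data frame toy below is hypothetical and only reuses the three treatment names.

```r
# Sketch: how factor() orders levels by default and with an explicit order.
toy <- data.frame(treatment = c("knee", "eyes", "control", "eyes"))

# Default: levels are sorted alphabetically
defaultOrder <- levels(factor(toy$treatment))
print(defaultOrder)

# Explicit order: levels appear exactly as listed in the levels argument
toy$treatment <- factor(toy$treatment, levels = c("control", "knee", "eyes"))
customOrder <- levels(toy$treatment)
print(customOrder)
```

Plotting functions such as stripchart follow the order of the levels, which is why the columns of the graph move once the levels are reordered.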

Let’s produce a table of descriptive statistics (mean and standard deviation) and sample size for each treatment. This kind of table is commonly used to report data from multiple groups:

meanShift <- tapply(circadian$shift, circadian$treatment, mean)
sdevShift <- tapply(circadian$shift, circadian$treatment, sd)
n         <- tapply(circadian$shift, circadian$treatment, length)
data.frame(mean = meanShift, std.dev = sdevShift, n = n)

The function data.frame (used above) is extremely useful but won’t be covered in detail in BIOL322. It suffices to say that data.frame allows us to combine different variables into a common data structure. In the above case these variables were the mean, standard deviation and sample size (n).
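As a small illustration of how tapply and data.frame work together, consider the following sketch on made-up numbers. The data frame toy and its columns group and value are hypothetical.

```r
# Sketch: tapply() applies a function to each group and returns one
# value per group, named after the group; data.frame() then binds
# those results together as columns.
toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 2, 4, 6)
)

groupMean <- tapply(toy$value, toy$group, mean)    # mean of each group
groupN    <- tapply(toy$value, toy$group, length)  # sample size of each group

summaryTable <- data.frame(mean = groupMean, n = groupN)
print(summaryTable)
```

Each row of the resulting table corresponds to one group, and each column to one summary statistic.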

Problem 2:
Create code to generate a similar table to the one above but that also includes the median of each group (just after the mean).

In your file identify problem 2:
# Problem 2: enter your code here.
# continue your answer


Analysis of variance (ANOVA)

To conduct an Analysis of Variance (ANOVA), we use the function aov, which stands for analysis of variance. The function aov analyzes a response variable (dependent variable) as a function (hence the symbol ~) of a categorical predictor. Here we want to analyse the variation in shifts in circadian rhythm (response variable) as a function of light treatment (predictor). The function anova is then used to create the ANOVA table, including the F-value for the analysis and its associated P-value.

circadianANOVA <- aov(shift ~ treatment, data = circadian)
anova(circadianANOVA)
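To connect the ANOVA table to the calculations seen in lecture, the sketch below computes the F-value and P-value by hand on simulated data (not the circadian data) and compares them with the output of anova. The data frame sim, the group means used to simulate it and the seed are all hypothetical choices.

```r
# Sketch: one-way ANOVA F-value computed by hand on simulated data.
set.seed(7)                                        # arbitrary seed
sizes <- c(8, 7, 7)                                # hypothetical group sizes
sim <- data.frame(
  group = factor(rep(c("A", "B", "C"), times = sizes)),
  y     = rnorm(sum(sizes), mean = rep(c(0, 0, -1), times = sizes))
)

k <- nlevels(sim$group)                            # number of groups
N <- nrow(sim)                                     # total sample size
grandMean <- mean(sim$y)
groupMean <- tapply(sim$y, sim$group, mean)
groupN    <- tapply(sim$y, sim$group, length)

# Sums of squares among groups and within groups (error)
SSgroups <- sum(groupN * (groupMean - grandMean)^2)
SSerror  <- sum((sim$y - ave(sim$y, sim$group))^2)

# Mean squares, F ratio and its P-value
MSgroups <- SSgroups / (k - 1)
MSerror  <- SSerror / (N - k)
Fhand <- MSgroups / MSerror
Phand <- pf(Fhand, df1 = k - 1, df2 = N - k, lower.tail = FALSE)

# The same quantities from aov() and anova()
tab <- anova(aov(y ~ group, data = sim))
print(c(F.by.hand = Fhand, F.from.anova = tab[["F value"]][1]))
print(c(P.by.hand = Phand, P.from.anova = tab[["Pr(>F)"]][1]))
```

The two pairs of values should agree, showing that the F-value is simply the ratio of the mean square among groups to the mean square within groups.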

Problem 3:
3a) Using a significance level (alpha) equal to 0.05, should we reject the null hypothesis or not?

3b) Provide an alpha value for which the null hypothesis would not be rejected.

In your file identify problem 3:
# Problem 3: write your answer.
# continue your answer


Post-hoc test: Tukey-Kramer tests between all pairs of means

Post-hoc means “performed after the event”, where the event here is the ANOVA. So, we conduct the ANOVA first and then the Tukey-Kramer post-hoc test. Post-hoc is the jargon used in stats books!

Now we will conduct the Tukey-Kramer tests between all pairs of means to evaluate which contrasts (a contrast is the difference between a given pair of means) are statistically significant. The function TukeyHSD uses the object created by the function aov earlier to compare all possible mean differences (contrasts) between groups (treatments):

posthoc <- TukeyHSD(circadianANOVA, conf.level=0.95)
posthoc 

The adjusted probability (p adj) is the P-value adjusted to account for the fact that 3 different statistical tests (one per pair of means) were performed on the same data.
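As a rough illustration of where p adj comes from, the sketch below reproduces it by hand for one contrast on simulated data (not the circadian data). It assumes the usual Tukey-Kramer form, in which the difference between two means is divided by a standard error based on the error mean square and the two sample sizes, and the result is compared against the studentized range distribution (ptukey). The data frame sim and the seed are hypothetical.

```r
# Sketch: adjusted P-value for one Tukey-Kramer contrast, by hand.
set.seed(7)                                        # arbitrary seed
sizes <- c(8, 7, 7)                                # hypothetical group sizes
sim <- data.frame(
  group = factor(rep(c("A", "B", "C"), times = sizes)),
  y     = rnorm(sum(sizes), mean = rep(c(0, 0, -1), times = sizes))
)

fit   <- aov(y ~ group, data = sim)
tukey <- TukeyHSD(fit, conf.level = 0.95)$group    # matrix: diff, lwr, upr, p adj
print(tukey)

groupMean <- tapply(sim$y, sim$group, mean)
groupN    <- tapply(sim$y, sim$group, length)
dfError   <- df.residual(fit)
MSerror   <- sum(residuals(fit)^2) / dfError       # error mean square

# Contrast between groups B and A
diffBA <- as.numeric(groupMean["B"] - groupMean["A"])
seBA   <- sqrt(MSerror / 2 * as.numeric(1 / groupN["A"] + 1 / groupN["B"]))
qBA    <- abs(diffBA) / seBA                       # studentized range statistic
pBA    <- ptukey(qBA, nmeans = nlevels(sim$group), df = dfError,
                 lower.tail = FALSE)

print(c(p.by.hand = pBA, p.from.TukeyHSD = unname(tukey["B-A", "p adj"])))
```

With 3 groups there are choose(3, 2) = 3 pairs of means, which is why the adjustment accounts for 3 tests.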

Problem 4:
Using a significance level (alpha) equal to 0.05, which pairs of means should be considered significantly different? Provide an explanation of the results in the context of the research problem.

In your file identify problem 4:
# Problem 4: write your answer.
# continue your answer

Submit the RStudio file containing the report via Moodle.