Tutorial 10: ANOVAs

Analysis of variance (ANOVA) and important related statistical frameworks: Levene’s test of homogeneity of variances and the Tukey-Kramer tests between all pairs of means

Week of November 11, 2024
(10th week of classes)

How the tutorials work

The DEADLINE for your report is always at the end of the tutorial. The problems for this report are spread throughout this tutorial.

While students may eventually be able to complete their reports independently, we strongly recommend attending the synchronous lab/tutorial sessions. Please note that your TA is not responsible for providing assistance with lab reports outside of these scheduled sessions.

The REPORT INSTRUCTIONS (what you need to do to get the marks for this report) are found at the end of this tutorial.

Your TAs

Section 0101 (Tuesday): 13:15-16:00 - Aliénor Stahl (a.stahl67@gmail.com)
Section 0103 (Thursday): 13:15-16:00 - Alexandra Engler (alexandra.engler@hotmail.fr)

Levene’s test of homogeneity of variances

When conducting an Analysis of Variance (ANOVA), we assume that the samples from all groups are drawn from populations with the same variance. This assumption is crucial because the F-distribution, used in ANOVA, relies on the premise that the within-group variances are equal across groups (i.e., the variances are homogeneous).

As previously discussed, the F-distribution represents the sampling distribution of the ratios of sample variances taken from normally distributed populations with equal variances. ANOVA results can be influenced by differences in variances among groups, much like results from the standard two-sample t-test. Therefore, before performing an ANOVA, it is essential to ensure that this assumption is satisfied.
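The idea that the F-distribution is the sampling distribution of variance ratios can be checked with a short simulation. The sketch below is only an illustration: the sample sizes, the number of replicates and the seed are arbitrary assumptions and have nothing to do with the circadian data.

```r
# Simulate ratios of sample variances from two normal populations
# with equal variances (all values below are arbitrary choices).
set.seed(322)
n1 <- 8    # hypothetical size of sample 1
n2 <- 7    # hypothetical size of sample 2

ratios <- replicate(10000, var(rnorm(n1)) / var(rnorm(n2)))

# Critical value of the F-distribution with (n1 - 1, n2 - 1) degrees of freedom
Fcrit <- qf(0.95, df1 = n1 - 1, df2 = n2 - 1)

# Proportion of simulated ratios above the critical value (should be near 0.05)
propAbove <- mean(ratios > Fcrit)
propAbove
```

If the two populations had different variances, this proportion would drift away from 0.05, which is why the assumption matters.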

0.1 Testing for Homogeneity of Variances

The first step in an ANOVA is to test for equality (homogeneity) of variances, commonly referred to as homoscedasticity. The null hypothesis (H₀) for this test states that the samples (groups) come from populations with the same variances. If this H₀ is rejected, the standard ANOVA should not be used.

0.1.1 Levene’s Test for Homoscedasticity

To test for homoscedasticity, we will use Levene’s test, a widely used method for assessing the equality of variances. Although other tests for homoscedasticity exist, Levene’s test is commonly preferred because it is relatively robust to departures from normality.
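To see what Levene’s test does under the hood, it can be sketched in base R as a one-way ANOVA on the absolute deviations of each observation from its group centre. The example below uses the group median as the centre (which, to our understanding, is the default in the car function used later in this tutorial) and a made-up data set called toy; neither the data nor the object names come from the circadian study.

```r
# Hypothetical data: three groups of 8 observations with the same variance
set.seed(1)
toy <- data.frame(
  y     = rnorm(24, mean = rep(c(0, 1, 2), each = 8), sd = 1),
  group = factor(rep(c("a", "b", "c"), each = 8))
)

# Step 1: absolute deviation of each observation from its group median
groupMedian <- tapply(toy$y, toy$group, median)
toy$absDev  <- abs(toy$y - groupMedian[as.character(toy$group)])

# Step 2: one-way ANOVA on the absolute deviations
leveneByHand <- anova(lm(absDev ~ group, data = toy))
leveneByHand
```

A large F-value in this table would indicate that the typical deviation from the centre differs among groups, that is, that the variances are not homogeneous.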

Let us proceed with Levene’s test to determine if we can apply ANOVA to the circadian data discussed in Lecture 17.

0.1.2 Installing the Required Package in R

The standard R installation (R base) does not include Levene’s test. To perform this test, we need to install the car package, which is frequently used for this purpose. While other R packages also provide Levene’s test, car is a popular and reliable choice.

To install the car package, use the following code:

install.packages("car", dependencies = TRUE)

During the installation process, R may ask you "Do you want to install from sources the package which needs compilation? (Yes/no/cancel)". If that happens (it does not always), simply type no and press Enter.

You may also need to install the openxlsx package. The car package appears to depend on it, but it is not always installed along with car, even when setting dependencies=TRUE:

install.packages("openxlsx")

Now, call the package car:

library(car)

Download the circadian data file

Now load and inspect the data:

circadian <- read.csv("chap15e1KneesWhoSayNight.csv")
View(circadian)

Now we can run Levene’s test as follows:

leveneTest(shift ~ factor(treatment), data=circadian)

Problem 1:
Write all the code in the R file necessary to answer the problems in this tutorial.
1a) Using a significance level (alpha) equal to 0.05, should we reject the null hypothesis of homoscedasticity or not?

1b) Can we apply ANOVA to analyse these data?

In your file identify problem 1:
# Problem 1: write your answer.
# continue your answer

Graphical and table representation of group means

Let’s produce a stripchart of the data to observe the differences among groups in a graphical format. The argument pch=1 plots the data points as open circles.

stripchart(shift ~ treatment, data = circadian, vertical = TRUE, pch = 1, col = "firebrick", xlab = "light treatment", ylab = "shift in circadian rhythm (h)")

The groups (treatments) are displayed in alphabetical order by default. However, as discussed in class, the eyes treatment, which shows the most negative shift in circadian rhythm (i.e., the highest melatonin production), might be more effectively visualized if placed in the third column of the graph. This adjustment can enhance the contrast among the groups (treatments) and make the differences more apparent. To achieve this, we can reorder the treatment column in the dataset as follows:

circadian$treatment <- factor(circadian$treatment, levels = c("control", "knee", "eyes"))
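As a small check of what this reordering does, the sketch below applies the same factor call to a short made-up vector of labels (the vector treatment here is hypothetical, not the column from the data file). Only the order of the levels changes; the observations stay as they were.

```r
# Hypothetical treatment labels (not the real data file)
treatment <- c("knee", "control", "eyes", "control", "eyes", "knee")

# Default: levels are sorted alphabetically
levels(factor(treatment))

# Explicit order: control first, eyes last
reordered <- factor(treatment, levels = c("control", "knee", "eyes"))
levels(reordered)

# The observations themselves are unchanged; only the level order differs
as.character(reordered)
```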

Now plot the stripchart again:

stripchart(shift ~ treatment, data = circadian, vertical = TRUE, pch = 1, col = "firebrick", xlab = "light treatment", ylab = "shift in circadian rhythm (h)")

Let’s produce a table of descriptive statistics (mean and standard deviation) and sample size for each treatment. This kind of table is commonly used to report data from multiple groups:

meanShift <- tapply(circadian$shift, circadian$treatment, mean)
sdevShift <- tapply(circadian$shift, circadian$treatment, sd)
n         <- tapply(circadian$shift, circadian$treatment, length)
data.frame(mean = meanShift, std.dev = sdevShift, n = n)

The data.frame function (used above) is an extremely useful tool, but it will not be covered in detail in BIOL322. In simple terms, data.frame allows you to combine different variables into a unified data structure. In the example above, these variables include the mean, standard deviation, and sample size (n).
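To make the roles of tapply and data.frame more concrete, here is a minimal sketch on a tiny made-up data set. The vectors value and group and the choice of max as a second statistic are assumptions made only for illustration.

```r
# Hypothetical measurements for two groups
value <- c(1, 2, 3, 10, 20)
group <- c("a", "a", "a", "b", "b")

# tapply applies a function to the values of each group separately
groupMean <- tapply(value, group, mean)
groupMax  <- tapply(value, group, max)

# data.frame binds the per-group results into one table, one row per group
summaryTable <- data.frame(mean = groupMean, max = groupMax)
summaryTable

# Individual columns can be pulled out with $
summaryTable$mean
```

Each call to tapply returns one value per group, and data.frame lines these values up as columns of the same table.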

Problem 2:
Write and report code that generates a table similar to the one above but that also includes the median of each group (placed just after the mean).

In your file identify problem 2:
# Problem 2: enter your code here.
# continue your answer

Analysis of variance (ANOVA)

To conduct an Analysis of Variance (ANOVA), we use the aov function, which stands for analysis of variance. The aov function analyzes a response variable (dependent variable) as a function (indicated by the symbol ~) of a categorical predictor.

In this case, we aim to analyze the variation in shifts in circadian rhythm (response variable) as a function of light treatment (predictor). Once the aov model is created, the anova function is used to generate the ANOVA table. This table includes the F-value for the analysis and its associated P-value.

circadianANOVA <- aov(shift ~ treatment, data = circadian)
anova(circadianANOVA)
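To see where the F-value and P-value in the ANOVA table come from, they can be reproduced by hand as the ratio of the mean square among groups to the mean square error. The sketch below does this on a made-up data set (toy, with arbitrary group means and seed) and compares the result with the output of aov and anova; it is an illustration, not part of the circadian analysis.

```r
# Hypothetical data: three groups of 6 observations
set.seed(2)
toy <- data.frame(
  y     = rnorm(18, mean = rep(c(0, 0.5, 1.5), each = 6)),
  group = factor(rep(c("a", "b", "c"), each = 6))
)

k <- nlevels(toy$group)   # number of groups
N <- nrow(toy)            # total sample size

groupMeans <- tapply(toy$y, toy$group, mean)
groupN     <- tapply(toy$y, toy$group, length)
grandMean  <- mean(toy$y)

# Mean square among groups and mean square error (within groups)
MSgroups <- sum(groupN * (groupMeans - grandMean)^2) / (k - 1)
MSerror  <- sum((toy$y - groupMeans[as.character(toy$group)])^2) / (N - k)

Fmanual <- MSgroups / MSerror
Pmanual <- pf(Fmanual, df1 = k - 1, df2 = N - k, lower.tail = FALSE)

# Same quantities from aov and anova
toyTable <- anova(aov(y ~ group, data = toy))
c(Fmanual = Fmanual, Fanova = toyTable[["F value"]][1])
```

The two F-values should agree, and Pmanual should match the P-value in the Pr(>F) column of the table.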

Problem 3:
No need for code to answer the two questions below.

3.a) Using a significance level (alpha) equal to 0.05, should we reject the null hypothesis or not?

3.b) Provide an alpha value for which the null hypothesis would not be rejected.

In your file identify problem 3:
# Problem 3: write your answer.
# continue your answer

Post-hoc test: Tukey-Kramer tests between all pairs of means

The term post-hoc means performed after the event, with the event in this context being the ANOVA. This means we first conduct the ANOVA and then proceed with the Tukey-Kramer post-hoc test. The term “post-hoc” is commonly used as jargon in statistics texts.

Next, we will use the Tukey-Kramer test to compare all pairs of means and determine which contrasts (the differences between specific pairs of means) are statistically significant. The TukeyHSD function facilitates this by utilizing the object created earlier with the aov function to evaluate all possible mean differences (contrasts) between groups (treatments).

posthoc <- TukeyHSD(circadianANOVA, conf.level=0.95)
posthoc

The adjusted probability (p adj) is the P-value for each contrast, adjusted to account for the fact that three pairwise tests were performed on the same data.
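For the curious, the adjusted P-value of one contrast can be reproduced with the studentized range distribution (ptukey in base R). The sketch below assumes a made-up data set (toy) with three groups labelled a, b and c, and checks the contrast between b and a against the TukeyHSD output; the data, seed and object names are all hypothetical.

```r
# Hypothetical data: three groups of 6 observations
set.seed(3)
toy <- data.frame(
  y     = rnorm(18, mean = rep(c(0, 0.5, 1.5), each = 6)),
  group = factor(rep(c("a", "b", "c"), each = 6))
)

toyANOVA <- aov(y ~ group, data = toy)
toyTukey <- TukeyHSD(toyANOVA, conf.level = 0.95)

# Ingredients for the contrast between groups b and a
groupMeans <- tapply(toy$y, toy$group, mean)
groupN     <- tapply(toy$y, toy$group, length)
dfError    <- df.residual(toyANOVA)
MSerror    <- sum(residuals(toyANOVA)^2) / dfError

# Studentized range statistic for the pair, then its upper-tail probability
diffBA <- as.numeric(groupMeans["b"] - groupMeans["a"])
nA     <- as.numeric(groupN["a"])
nB     <- as.numeric(groupN["b"])
q      <- abs(diffBA) / sqrt((MSerror / 2) * (1 / nB + 1 / nA))
pAdjManual <- ptukey(q, nmeans = nlevels(toy$group), df = dfError,
                     lower.tail = FALSE)

c(manual = pAdjManual, TukeyHSD = toyTukey$group["b-a", "p adj"])
```

Because the probability is taken from the distribution of the largest difference among all group means, it is larger than the P-value of an ordinary two-sample comparison of the same pair.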

Problem 4:
Using a significance level (alpha) equal to 0.05, which pairs of means should be considered significantly different? Provide an explanation of the results in the context of the research problem.

In your file identify problem 4:
# Problem 4: write your answer.
# continue your answer

Submit the report in an RStudio file through Moodle.