Tutorial 6: ANCOVA

February 21, 2023 (6th week of classes)
Analysis of Covariance

Remember: The tutorials were produced in such a way that you should be able to easily adapt the codes to your own data!

Here we will use the overcompensation study involving the Grazing experiment covered in class. Data come from Belgelson and Crawley (1992) - Effects of grazing on fruit production.

Download the Grazing data

Load the data:

regrowth <- read.csv("GrazingAncova.csv")

Plotting data organized in groups (series)

One important aspect in biostatistics is to represent data in a way that depicts the research question you are tackling. In these data, we have two continuous and one categorical predictor. The categorical predictor has two groups (treatments).

Let’s plot these data using ggplot and learn some more about this system for creating graphs.

We should start by loading the package. Make sure that is installed (as seen in tutorial 3 on factorial ANOVA)


ggplots are based on a series of elements (in the form of functions) that can be added to one another using +. We can save the plot into objects that can be then used later to add more functionality. We will start with the basic graph and then add elements. The function ggplot sets the basic plot with the data that will be used; the function geom_point is then added to state that it will be a scatter plot and to determine the size of the points and the labels

grazing.plot <- ggplot(data=regrowth, aes(x=Root, y=Fruit, color=Grazing)) + geom_point(aes(shape=Grazing), size=2.5) + xlab("Root size") + ylab("Fruit production")

The default for ggplots are based on a grey grid as background. I prefer a more classic look but this is totally personal. Here is how to generate the graph in a more “classic” way: We add the initial plot saved in the object grazing.plot. We can either keep the same name for the new graph or create a new object. I will do the latter here so that we can see how the plot “evolved”:

grazing.plot.classicLook <- grazing.plot +  theme_classic()

We can also add regression lines for each group as follows. We can either use the classic theme (“look”) or the default.

grazing.plot.classicReg <- grazing.plot.classicLook + geom_smooth(method="lm",formula=y~x,se=FALSE)

Or we could had used also the default theme, which was saved in the object grazing.plot.

grazing.plot.default <- grazing.plot + geom_smooth(method="lm",formula=y~x,se=FALSE)

From now on I’ll use the classif theme but you can easily adapt the code to use the default theme. We may want to change the colors to our favourite ones:

grazing.plot.col <- grazing.plot.classicReg + scale_color_manual(values=c('blue','firebrick'))

Finally, we may also want to plot the mean values of fruit production for each group; for doing that we need also to plot the average value for root size. We start by calculating these mean values:

mean.treatment.fruit <- aggregate(Fruit~Grazing,FUN=mean,data=regrowth)
mean.treatment.root <- aggregate(Root~Grazing,FUN=mean,data=regrowth)

And the plot them:

grazing.plot.col.mean <- grazing.plot.col + 
           geom_point(data = data.frame(x=mean.treatment.root[,2],
           y=mean.treatment.fruit[,2],Grazing=regrowth$Grazing), aes(x=x, y=y),size=3,col="black")

And because we saved the plots using different names, we can now see their “evolution”:


Analysis of Covariance (ANCOVA)

As discussed in class, the first statiscal hypothesis that needs to be tested is whether the covariate (initial root size) can predict the response variable (fruit production). If fruit production is independent of initial size, then there is no need to adjust the values of fruit production and a simple ANOVA can be conducted.


The p-value is quite small, serving as strong evidence that fruit production can be predicted by root size.

The second relevant hypothesis is to test whether the groups (levels) of the factor (grazed and non-grazed) share a common slope (i.e., they do not differ significantly). If the interaction between the categorical (Grazing) and continuos variable (root size) is non-significant, then we can assume that the two (grazed/non-grazed) share a common slope for fruit production on root size that can be then used to predict a mean value that is common to both treatments:


The P-value for the interaction (Root:Grazing) is large (i.e., do not reject H0) and, as such, grazed and ungreazed treatments can be assumed to share a common slope for fruit production on root size. We can now proceed with assessing statistically whether “grazing affects fruit production once controlled for initial root size”. To do that, we remove the interaction term from the previous ANOVA but leave the factor grazing and covariate root size. Contrast the following two possible ANOVA results, one entering Root first and then Grazing; and the other ANOVA with the opposite entry:


Note how the sum-of-square (Sum Sq) and their respective F-values (and related probabilities) changed between the two model orders as we covered in class. Here, in either order, the contribution of the Grazing treatment was significant, i.e., the mean differences between grazed and ungrazed treatments was significantly independent of the initial root size. However, this may not always be the case and it could had been that in one order the factor (here grazing treatement) could had been significant but not in the other order. The problem here, as discussed in class, is the lack of orthogonality between grazing and root size (i.e., they are correlated by definition. So, here we need to use the type III sum of squares instead of the standard type I sum of squares in the function anova. To run ANOVAs based on type II and III sum of squares, we need the package car. You should have installed this package and used in tutorials 3 and 5.


The flexible command in package car is Anova and not anova as in the standard R (i.e., capital A for the first letter)

Anova(lm(Fruit~as.factor(Grazing)+Root,data=regrowth), type = "III") 

Note that it gives the same results (F-values and P-values) as the reversed order of factor + covariate:

Anova(lm(Fruit~Root+as.factor(Grazing),data=regrowth), type = "III") 

Now that we know that Grazing significantly affects fruit production independently of initial root size, the next important question underlies the direction of change, i.e., does grazing is still related to greater fruit production (as for the initial analysis) OR once the covariate is controlled for, the fruit production is greater in the absence of grazing?

To estimate the directionality of change, we start by calculating a common mean for the covariate, i.e., initial root size.


Now, build a regression object with the model:


Prepare a data frame for prediction, i.e., we want to predict what is the fruit production given a common mean (7.18115) for the initial root size. The predict data have to have the same names as is in the original model that will be used to predict otherwise the function predict won’t work:


Now, let’s predict the data:

predicted.values <- predict(lm.regrowth,data.predict)

Note that data.predict above has Grazed assigned as row 1 and Ungrazed as row 2. We need to know the order to pick the corrected predicted value for each case. So, the predicted mean fruit growth value for Grazed is 41.35888 and for Ungrazed is 77.46212. Here are the original uncorrected (by initial root size) fruit growth values for comparison (as you calculated earlier):


So the initial uncorrected mean values had fruit growth in the Grazed larger than Ungrazed. BUT, by adjusting the mean values for growth potential (i.e., initial root size), the outcome is the reverse (i.e., Ungrazed larger than Grazed). And the difference in corrected means is significant as tested earlier in Anova(lm(Fruit~Root+Grazing,data=regrowth),type = "III")

Let’s plot these adjusted mean values on the plot. We will simply add these as points in our ggplot

grazing.plot.col.mean + 
           geom_point(data = data.frame(x=c(7.18115,7.18115),
           y=predicted.values), aes(x=x, y=y),size=3,col="green")

Model adequacy steps

1) Assess normality; most data points fall on the line so we will assume normality.


To remove the caption ‘Normal Q-Q’ from the plot, you set caption as empty:


2) Assess residual heteroscedasticity:


The line looks “flat”, which suggests no tendency in increase in residual variation with predicted values.

We can test this assumption more formally using the Breusch-Pagan test against heteroscedasticity available in the package lmtest. The Levene’s test, as used previously in ANOVA, is appropriate when there are only categorical factors. Here we also have a continuous covariate (initial root size) so we can’t use the Levene’s test. You can think of the Breusch-Pagan test as testing the correlation between residual and predicted values.


To run the Breusch-Pagan test:


The P-values (0.4261) is reasonably large (greater than alpha=0.05), so the assumption of residual homoscedasticity holds for these data.