Tutorial 10: Mixed models - part 2

March 28, 2023 (10th week of classes)
Mixed-effects linear models part 2 (the regression case)

Remember: The tutorials were produced in such a way that you should be able to easily adapt the codes to your own data!

This tutorial uses some of the graph codes produced here by James Santangelo (a colleague of mine):
https://uoftcoders.github.io/rcourse/lec08-linear-mixed-effects-models.html

The RIKZ data

Here we will use the RIKZ data used by Zuur et al. (2007; Analysing Ecological Data, Springer). As explained in class, the data allow assessing whether and how two predictors:

  1. NAP (the height of a sampling station compared to mean tidal level); and
  2. Exposure - a nominal index (high/low) composed of the following elements: wave action, length of the surf zone, slope, grain size, and the depth of the anaerobic layer are linked to marine benthic richness. In each inter-tidal area (9 in total denoted as ‘beach’), five samples were taken, and the macro-fauna and abiotic variables were measured. The interest here is to model how species richness relates to NAP and Exposure.

As discussed in our lecture, NAP and Exposure will be treated as fixed predictors (NAP is a continuous predictor and Exposure is a discrete factor) and beach will be treated as a random factor. Remember that NAP was measured at the site level (i.e., 5 sites within each beach) and Exposure was measured only at the beach level.

This is a nice data because it mixed different types of predictors and one of them is replicated and the other isn’t. Replication (sites within beaches) allows estimating random effects whereas lack of replication within the random factor of interest (here beach) doesn’t. This is an important realization because often biologists can do replication in their studies but they do not. If no replication exists for a given random factor of interest, then it’s not possible to consider it in the model.

Download the RIKZ data

Let’s start by uploading the data:

RIKZ.data <- read.csv("RIKZ.csv")
View(RIKZ.data)

To facilitate the analysis, let’s create a variable that codes beach as a factor directly in the RIKZ data matrix. This variable will be used to indicate that beach should be treated as a random factor in our models. Note that although the values in factor_Beach are exactly the same as in variable beach, R needs to recognize that factor_Beach needs to treated as a factor:

RIKZ.data$factor_Beach <- factor(RIKZ.data$Beach)
View(RIKZ.data)

Starting simple: regressing species richness just on NAP (fixed-effects regression model)

Let’s start by focusing for now on the importance of the continuous predictor NAP. Later on we will work on the more complex model.

We will start by simply evaluating whether species richness varies as a function of the fixed continous predictor NAP. This is then referred as to a fixed effect linear model. Let’s run the linear model and assess its predictive ability (using a fixed factor ANOVA):

fixed.lm <- lm(Richness ~ NAP, data = RIKZ.data)
summary(fixed.lm)

The influence of NAP on number of species (Richness) is significant and negative. Let’s plot the regression model:

library(ggplot2)
ggplot(RIKZ.data, aes(x = NAP, y = Richness)) +
    geom_point() +
    geom_smooth(method = "lm",formula = y ~ x) +
    theme_classic()

We won’t check assumptions right now so that we can concentrate the tutorial on mixed-models first. We will come back to the assumptions later on in the tutorial.

The hierarchical (nested) nature of the RIKZ data

The data contain information for multiple (5) sites within beaches across multiple (9) beaches. As such, there is the possibility the the way richness varies as a function of NAP varies across beaches:

grid.reg <- ggplot(RIKZ.data, aes(x = NAP, y = Richness)) +
    geom_point() +
    geom_smooth(method = "lm",formula = y ~ x) +
    facet_wrap(~ Beach) +
    xlab("NAP") + ylab("Richness") +
    theme_classic()
grid.reg

There is good indication that linear models of species richness on NAP differ across beaches. We could certainly estimate the regression model separately for each beach, i.e., how NAP influences Richness for each beach separately. But our interest is not that. Our interest is to estimate the overall importance of NAP on richness while accounting for the fact that sites within beaches are not independent and that there is variation among beaches. That’s the reason mixed-effects models come handy!

The random intercept model

We could have included Exposure in this model as a factor; but as explained earlier, we will be keeping it simple for now. Later in this tutorial, we will work on the more complex model containing both predictors.

When considering a continuous predictor, as discussed in our lecture, we can assess whether intercepts and slopes vary among levels of the random factor (here beaches). Traditionally when using a mixed-effects model, we start by fitting the simplest model asking whether intercepts change among beaches. This is the simpler model (in contrast to the model that estimates both intercept and slopes) because intercepts are calculated on the basis of the mean of the response variables (i.e., richness).

We will start by estimating the random intercept model to assess whether NAP is a relevant predictor of richness despite the variation in intercepts among beaches. In this tutorial, we will use another package that is also very popular for running mixed models in R, the package lme4.

install.packages("lme4")
library(lme4)
MixedModel.interceptOnly <- lmer(Richness ~ NAP + (1 | factor_Beach), data = RIKZ.data)
summary(MixedModel.interceptOnly)

(1 | factor_Beach) is the random effect term, where the 1 denotes that this is a random-intercept model and the term on the right of | is a factor to be used as the random effect. The factor here is beach, i.e, we are nesting sites within beaches to form the random effect.

NAP is the fixed effect and (1 | factor_Beach) is the random effect that accounts for variation in intercepts among beaches.

Note that the author of lme4 made a conscious decision to not estimate p-values for the random-effects. Douglas Bates, the author of the package explains some of these decisions here:

https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html

One issue underlies the calculate degrees for random effects in the model. There are different theories about it and some don’t agree is even possible (as discussed in lectures). Here we will use another package lmerTest to estimate the degrees of freedom of the fixed effect. Note that there is no universally-accepted way to to estimate them.

install.packages("lmerTest")
library(lmerTest)
ranova(MixedModel.interceptOnly)

The fixed effect NAP on species richness is significant and so is the random effect for intercepts based on variation of intercepts among beaches. Note that NAP is significant after including the random effect, i.e., despite the fact the random effect accounts for 48% of the residual variation once the fixed effect NAP (predictor) is included in the model.

How did we get the 48% values? This is the random variation in model intercepts that exists among beaches is assessed via the intraclass correlation. The package lme4 does not have a function that estimates the intraclass correlation, but the intraclass correlation can be also calculated by dividing the variance of the factor_Beach (Intercept) under the Random effects portion of the summary table above (8.668) by the total variation (i.e., residual variation left once the fixed effect NAP was entered in the model. The value is then 8.668 / (8.668 + 9.362) = 0.48, which is pretty high. This calculation was covered in our lecture 16.

Let’s plot the predicted values of this model:

RIKZ.data$fit_InterceptOnly <- predict(MixedModel.interceptOnly)

ggplot(RIKZ.data, aes(x = NAP, y = Richness, colour = factor_Beach)) +
    geom_abline(aes(intercept = `(Intercept)`, slope = NAP),
                size = 2,as.data.frame(t(fixef(MixedModel.interceptOnly)))) +
    geom_line(aes(y = fit_InterceptOnly), size = 1) +
    geom_point(size = 3) +
    theme_classic() +
    theme(legend.position = "none") +
    scale_colour_brewer(palette="Set1")

The thicker line is the fixed-effect regression line once the random effect is considered in the model (i.e., Richness = 6.58 + -2.57*NAP; found in summary(MixedModel.interceptOnly)) and the colored lines are the regression lines for each beach. Because this model only looks at variation in intercepts across beaches (i.e., The random intercept model), they all have the same slope as calculated estimated by the model.

The random intercept and slope model

Now let’s consider a model in which both intercepts and slopes are allowed to vary among beaches (i.e., richness on each beach also varies in its response to NAP). (1 + NAP | factor_Beach) below is the random effect term, where 1 denotes that we should consider variation in intercepts and also variation in slopes of NAP among beaches, i.e., NAP | factor_Beach, i.e., we are nesting regressions based on the sites within beaches to form the random effect.

MixedModel.IntSlope <- lmer(Richness ~ NAP + (1 + NAP|factor_Beach), data = RIKZ.data, REML = FALSE)
summary(MixedModel.IntSlope)

An error boundary (singular) fit: see ?isSingular. This warning just indicates that one or residual variances within beaches are very close to zero (i.e., perfect fit). You will notice in the plot grid.reg generated earlier that many observed points are on top of the predicted, i.e., residual variance is close to zero:

grid.reg

Note now that summary(MixedModel.IntSlope) p-values because we loaded the package lmerTest early on.

Let’s calculate the intraclass correlation, which is 0.65 (even higher than the intercept only model, meaning that the random effect in this model is stronger and accounts for even more variation in Richness (65%):

(10.949+2.502) / (10.949+2.502+7.174) 

Let’s plot the regression lines varying in slope and intercept for each beach as well as the general regression line for all the data considering the mixed model (thicker line in black):

RIKZ.data$fit_IntSlope <- predict(MixedModel.IntSlope)

ggplot(RIKZ.data, aes(x = NAP, y = Richness, colour = factor_Beach)) +
    geom_abline(aes(intercept = `(Intercept)`, slope = NAP),
                size = 2,
                as.data.frame(t(fixef(MixedModel.IntSlope)))) +
    geom_line(aes(y = fit_IntSlope), size = 1) +
    geom_point(size = 3) +
    theme_classic() +
    theme(legend.position = "none") +
    scale_colour_brewer(palette="Set1")

Which model to retain? Random intercept OR random intercept and slope model?

AIC(fixed.lm,MixedModel.interceptOnly,MixedModel.IntSlope)

AIC is a widely used metric of goodness of fit and the model with the lowest AIC value is the model with the best fit.

The best model is The random intercept and slope model

The more complex model

Let’s now consider both Exposure and NAP together in a series of models. Remember that there is only a single Exposure value per beach and therefore there is no need to treat it as a random effect.

Let’s start with the intercept only models:

  1. No interaction or main effect of exposure, i.e., just NAP (as seen earlier):
mixed_model_IntOnly_NAP <- lmer(Richness ~ NAP + (1|factor_Beach), REML = FALSE, data = RIKZ.data)
summary(mixed_model_IntOnly_NAP)                            
  1. No interaction or main effect of NAP, i.e., just exposure:
mixed_model_IntOnly_Exp <- lmer(Richness ~ Exposure + (1|factor_Beach), REML = FALSE, data = RIKZ.data)
summary(mixed_model_IntOnly_Exp)
  1. Model with the two fixed effects but no interactions:
mixed_model_IntOnly_NoInter <- lmer(Richness ~ NAP + Exposure + (1|factor_Beach), REML = FALSE,  data = RIKZ.data)
summary(mixed_model_IntOnly_NoInter)
  1. Model with the two fixed effects and their interaction
mixed_model_IntOnly_Full <- lmer(Richness ~ NAP*Exposure + (1|factor_Beach), REML = FALSE, data = RIKZ.data)
summary(mixed_model_IntOnly_Full)
  1. Model with only random effect, i.e., no fixed effects:
mixed_model_IntOnly_NoFix <- lmer(Richness ~ 1 + (1|factor_Beach), REML = FALSE, data = RIKZ.data)
summary(mixed_model_IntOnly_NoFix)

Models 1 to 4 are random intercept models; for each of them we could have considered the random intercept and slope model. Let’s just consider it for the more complex model, i.e., two main effects and their interaction:

mixed_model_IntSlope_Full <- lmer(Richness ~ NAP*Exposure + (1 + NAP|factor_Beach), REML = FALSE, data = RIKZ.data)

Note the error describe early here as well; again, this is not an issue.

What model best fit the data:

AIC(mixed_model_IntOnly_Full, mixed_model_IntOnly_NoInter,
    mixed_model_IntOnly_NAP, fixed.lm, mixed_model_IntOnly_Exp,
    mixed_model_IntOnly_NoFix,mixed_model_IntSlope_Full)

The best model is the one that has both predictors and their interaction for the random intercept model (AIC=242.11). The model with both fixed predictors and their interaction considering a random intercept and slope model provides a slightly worse fit (AIC=243.22)

Note that I ordered them knowing which one would have the best fit, i.e., lowest AIC! Let’s interpret the model:

summary(mixed_model_IntOnly_Full)

Increases in both NAP and Exposure results in a significant decrease in species richness. There is also a significant interaction between NAP and Exposure.

Assessing assumptions

Let’s take a look into the normality and homoscedasticity of residuals of the best model:

Q-Q normal residual plot:

qqnorm(resid(mixed_model_IntOnly_Full))
qqline(resid(mixed_model_IntOnly_Full))

Count data (such as species richness) tend to be quite non-normal. Often a square-root transformation makes residuals closer to normally distributed.

mixed_model_IntOnly_Full.sqrt <- lmer(sqrt(Richness) ~ NAP*Exposure + (1|Beach), REML = FALSE, data = RIKZ.data)
qqnorm(resid(mixed_model_IntOnly_Full.sqrt))
qqline(resid(mixed_model_IntOnly_Full.sqrt))

It does look much better even though we could look into other transformations to improve normality (e.g., third-root transformation, log of square root, etc). And these transformations would have to be used in all the models we performed above.

Now let’s look at the residual variance:

plot(mixed_model_IntOnly_Full.sqrt)

Residuals look homoscedastic across predicted values but we can test more formally:

library(car)
leveneTest(residuals(mixed_model_IntOnly_Full.sqrt) ~ RIKZ.data$factor_Beach)

Note that there are other sophisticated and robust ways to assess assumptions in mixed models: https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html

And it has been recently established that mixed models are also quite robust even in cases of homoscedasticity: https://besjournals.onlinelibrary.wiley.com/doi/abs/10.1111/2041-210X.13434