Tutorial 10: Mixed models - part 2

March 24, 2026 Mixed-effects linear models part 2 (the regression case)

CRITICAL: Regular practice with R and RStudio, the statistical software used in BIOL 322 and introduced during tutorial sessions, and consistent engagement with tutorial exercises are essential for developing strong skills in Biostatistics.

EXERCISES: No exercises for tutorial 10.

The RIKZ data

Here we will use the RIKZ data used by Zuur et al. (2007; Analysing Ecological Data, Springer). As explained in class, the data allow assessing whether and how two predictors:

NAP (the height of a sampling station compared to mean tidal level); and
Exposure - a nominal index (high/low) composed of the following elements: wave action, length of the surf zone, slope, grain size, and the depth of the anaerobic layer are linked to marine benthic richness. In each inter-tidal area (9 in total denoted as ‘beach’), five samples were taken, and the macro-fauna and abiotic variables were measured. The interest here is to model how species richness relates to NAP and Exposure.

As discussed in our lecture, NAP and Exposure will be treated as fixed predictors (NAP is a continuous predictor and Exposure is a discrete factor) and beach will be treated as a random factor. Remember that NAP was measured at the site level (i.e., 5 sites within each beach) and Exposure was measured only at the beach level.

This is a particularly useful dataset because it combines different types of predictors, including one that is replicated and another that is not. Replication (e.g., multiple sites within beaches) allows for the estimation of random effects, whereas the absence of replication within a factor of interest (such as beach) prevents this. This distinction is important: although replication is often feasible in biological studies, it is frequently overlooked. Without replication at the level of a given random factor, that factor cannot be properly included or interpreted within a mixed-effects model.

Download the RIKZ data

Let’s start by uploading the data:

RIKZ.data <- read.csv("RIKZ.csv")
View(RIKZ.data)

To facilitate the analysis, let’s create a variable that codes beach as a factor directly in the RIKZ data matrix. This variable will be used to indicate that beach should be treated as a random factor in our models. Note that although the values in factor_Beach are exactly the same as in variable beach, R needs to recognize that factor_Beach needs to treated as a factor:

RIKZ.data$factor_Beach <- factor(RIKZ.data$Beach)
View(RIKZ.data)

Starting simple: regressing species richness just on NAP (fixed-effects regression model)

Let’s start by focusing for now on the importance of the continuous predictor NAP. Later on we will work on the more complex model.

We will start by simply evaluating whether species richness varies as a function of the fixed continous predictor NAP. This is then referred as to a fixed effect linear model. Let’s run the linear model and assess its predictive ability (using a fixed factor ANOVA):

fixed.lm <- lm(Richness ~ NAP, data = RIKZ.data)
summary(fixed.lm)

The influence of NAP on number of species (Richness) is significant and negative. Let’s plot the regression model:

library(ggplot2)
ggplot(RIKZ.data, aes(x = NAP, y = Richness)) +
    geom_point() +
    geom_smooth(method = "lm",formula = y ~ x) +
    theme_classic()

For now, we will not assess model assumptions so that we can focus the tutorial on understanding mixed models. We will return to the evaluation of assumptions later in the tutorial.

The hierarchical (nested) nature of the RIKZ data

The data contain (I write data often in the plural form) information for multiple (5) sites nested within each of several (9) beaches. As such, there is the possibility that the relationship between richness and NAP varies across beaches:

grid.reg <- ggplot(RIKZ.data, aes(x = NAP, y = Richness)) +
    geom_point() +
    geom_smooth(method = "lm",formula = y ~ x) +
    facet_wrap(~ Beach) +
    xlab("NAP") + ylab("Richness") +
    theme_classic()
grid.reg

There is strong evidence that the relationship between species richness and NAP differs across beaches. One approach would be to fit separate regression models for each beach—that is, to estimate how NAP influences richness within each beach independently. However, this is not our primary goal. Instead, we aim to estimate the overall effect of NAP on richness while accounting for the lack of independence among sites within the same beach and the variability among beaches. This is precisely where mixed-effects models become particularly useful.

The random intercept model

We could have included Exposure in this model as a factor; but as explained earlier, we will be keeping it simple for now. Later in this tutorial, we will work on the more complex model containing both predictors.
When considering a continuous predictor, as discussed in the lecture, we can evaluate whether both intercepts and slopes vary among levels of the random factor (here, beaches). In practice, we typically begin with the simplest mixed-effects model by asking whether intercepts vary among beaches. This model is simpler than one that allows both intercepts and slopes to vary, because intercepts are based on differences in the mean of the response variable (i.e., richness) across beaches.

We will start by estimating the random intercept model to assess whether NAP is a relevant predictor of richness despite the variation in intercepts among beaches. In this tutorial, we will use another package that is also very popular for running mixed models in R, the package lme4.

install.packages("lme4")
library(lme4)
MixedModel.interceptOnly <- lmer(Richness ~ NAP + (1 | factor_Beach), data = RIKZ.data)
summary(MixedModel.interceptOnly)

(1 | factor_Beach) is the random effect term, where the 1 denotes that this is a random-intercept model and the term on the right of | is a factor to be used as the random effect. The factor here is beach, i.e, we are nesting sites within beaches to form the random effect.

NAP is the fixed effect and (1 | factor_Beach) is the random effect that accounts for variation in intercepts among beaches.

Note that the author of lme4 made a conscious decision to not estimate p-values for the random-effects. Douglas Bates, the author of the package explains some of these decisions here:

https://stat.ethz.ch/pipermail/r-help/2006-May/094765.html

One important issue concerns how to calculate degrees of freedom for random effects in mixed models. There is no clear consensus on how this should be done, and some approaches even question whether it can be defined in a meaningful way, as discussed in the lectures. In practice, we will use the lmerTest package to obtain approximate degrees of freedom for the fixed effects. It is important to keep in mind, however, that there is no universally accepted method for estimating these values.

install.packages("lmerTest")
library(lmerTest)
ranova(MixedModel.interceptOnly)

The fixed effect of NAP on species richness is significant, and so is the random intercept effect, indicating substantial variation in intercepts among beaches. Importantly, NAP remains significant even after accounting for this random effect, despite the fact that the random intercept explains 48% of the remaining variation once NAP is included in the model.

How do we obtain this 48% value? It corresponds to the intraclass correlation coefficient (ICC), which quantifies the proportion of total variance attributable to differences among beaches (i.e., variation in intercepts). Although the lme4 package does not provide a direct function to compute the ICC, it can be calculated manually by dividing the variance of the random intercept (factor_Beach (Intercept) = 8.668) by the total variance, defined as the sum of the random intercept variance and the residual variance. In this case:

8.668 / (8.668 + 9.362) = 0.48

This relatively high value indicates that a large proportion of the variation in richness is structured at the beach level. This calculation was discussed in Lecture 16.

Let’s plot the predicted values of this model:

RIKZ.data$fit_InterceptOnly <- predict(MixedModel.interceptOnly)

ggplot(RIKZ.data, aes(x = NAP, y = Richness, colour = factor_Beach)) +
    geom_abline(aes(intercept = `(Intercept)`, slope = NAP),
                size = 2,as.data.frame(t(fixef(MixedModel.interceptOnly)))) +
    geom_line(aes(y = fit_InterceptOnly), size = 1) +
    geom_point(size = 3) +
    theme_classic() +
    theme(legend.position = "none") +
    scale_colour_brewer(palette="Set1")

The thicker line is the fixed-effect regression line once the random effect is considered in the model (i.e., Richness = 6.58 + -2.57*NAP; found in summary(MixedModel.interceptOnly)) and the colored lines are the regression lines for each beach. Because this model only looks at variation in intercepts across beaches (i.e., The random intercept model), they all have the same slope as calculated estimated by the model.

The random intercept and slope model

Now let’s consider a model in which both intercepts and slopes are allowed to vary among beaches (i.e., richness on each beach also varies in its response to NAP). (1 + NAP | factor_Beach) below is the random effect term, where 1 denotes that we should consider variation in intercepts and also variation in slopes of NAP among beaches, i.e., NAP | factor_Beach, i.e., we are nesting regressions based on the sites within beaches to form the random effect.

MixedModel.IntSlope <- lmer(Richness ~ NAP + (1 + NAP|factor_Beach), data = RIKZ.data, REML = FALSE)
summary(MixedModel.IntSlope)

An error boundary (singular) fit: see ?isSingular. This warning just indicates that one of the residual variances within beaches are very close to zero (i.e., perfect fit). You will notice in the plot grid.reg generated earlier that many observed points are on top of the predicted, i.e., residual variance is close to zero (perfect prediction):

grid.reg

Note now that summary(MixedModel.IntSlope) p-values because we loaded the package lmerTest early on.

Let’s calculate the intraclass correlation, which is 0.65 (even higher than the intercept only model, meaning that the random effect in this model is stronger and accounts for even more variation in Richness (65%):

(10.949+2.502) / (10.949+2.502+7.174) = 65%

Let’s plot the regression lines varying in slope and intercept for each beach as well as the general regression line for all the data considering the mixed model (thicker line in black):

RIKZ.data$fit_IntSlope <- predict(MixedModel.IntSlope)

ggplot(RIKZ.data, aes(x = NAP, y = Richness, colour = factor_Beach)) +
    geom_abline(aes(intercept = `(Intercept)`, slope = NAP),
                size = 2,
                as.data.frame(t(fixef(MixedModel.IntSlope)))) +
    geom_line(aes(y = fit_IntSlope), size = 1) +
    geom_point(size = 3) +
    theme_classic() +
    theme(legend.position = "none") +
    scale_colour_brewer(palette="Set1")

Which model to retain? Random intercept OR random intercept and slope model?

AIC(fixed.lm,MixedModel.interceptOnly,MixedModel.IntSlope)

AIC is a widely used metric of goodness of fit and the model with the lowest AIC value is the model with the best fit.

The best model is The random intercept and slope model

The more complex model

Let’s now consider both Exposure and NAP together in a series of models. Remember that there is only a single Exposure value per beach and therefore there is no need to treat it as a random effect.

Let’s start with the intercept only models:

No interaction or main effect of exposure, i.e., just NAP (as seen earlier):

mixed_model_IntOnly_NAP <- lmer(Richness ~ NAP + (1|factor_Beach), REML = FALSE, data = RIKZ.data)
summary(mixed_model_IntOnly_NAP)

No interaction or main effect of NAP, i.e., just exposure:

mixed_model_IntOnly_Exp <- lmer(Richness ~ Exposure + (1|factor_Beach), REML = FALSE, data = RIKZ.data)
summary(mixed_model_IntOnly_Exp)

Model with the two fixed effects but no interactions:

mixed_model_IntOnly_NoInter <- lmer(Richness ~ NAP + Exposure + (1|factor_Beach), REML = FALSE,  data = RIKZ.data)
summary(mixed_model_IntOnly_NoInter)

Model with the two fixed effects and their interaction

mixed_model_IntOnly_Full <- lmer(Richness ~ NAP*Exposure + (1|factor_Beach), REML = FALSE, data = RIKZ.data)
summary(mixed_model_IntOnly_Full)

Model with only random effect, i.e., no fixed effects:

mixed_model_IntOnly_NoFix <- lmer(Richness ~ 1 + (1|factor_Beach), REML = FALSE, data = RIKZ.data)
summary(mixed_model_IntOnly_NoFix)

Models 1 to 4 are random intercept models; for each of them we could have considered the random intercept and slope model. Let’s just consider it for the more complex model, i.e., two main effects and their interaction:

mixed_model_IntSlope_Full <- lmer(Richness ~ NAP*Exposure + (1 + NAP|factor_Beach), REML = FALSE, data = RIKZ.data)
mixed_model_IntSlope_Full

Note the error describe early here as well; again, this is not an issue.

What model best fit the data:

AIC(mixed_model_IntOnly_Full, mixed_model_IntOnly_NoInter,
    mixed_model_IntOnly_NAP, fixed.lm, mixed_model_IntOnly_Exp,
    mixed_model_IntOnly_NoFix,mixed_model_IntSlope_Full)

The best model is the one that has both predictors and their interaction for the random intercept model (AIC=242.11). The model with both fixed predictors and their interaction considering a random intercept and slope model provides a slightly worse fit (AIC=243.22).

Note that I ordered them knowing which one would have the best fit, i.e., lowest AIC! Let’s interpret the model:

summary(mixed_model_IntOnly_Full)

Increases in both NAP and Exposure results in a significant decrease in species richness. There is also a significant interaction between NAP and Exposure.

Assessing assumptions

Let’s take a look into the normality and homoscedasticity of residuals of the best model:

Q-Q normal residual plot:

qqnorm(resid(mixed_model_IntOnly_Full))
qqline(resid(mixed_model_IntOnly_Full))

Count data (such as species richness) tend to be quite non-normal. Often a square-root transformation makes residuals closer to normally distributed.

mixed_model_IntOnly_Full.sqrt <- lmer(sqrt(Richness) ~ NAP*Exposure + (1|Beach), REML = FALSE, data = RIKZ.data)
qqnorm(resid(mixed_model_IntOnly_Full.sqrt))
qqline(resid(mixed_model_IntOnly_Full.sqrt))

It does look much better even though we could look into other transformations to improve normality (e.g., third-root transformation, log of square root, etc). And these transformations would have to be used in all the models we performed above.

Now let’s look at the residual variance:

plot(mixed_model_IntOnly_Full.sqrt)

Residuals look homoscedastic across predicted values but we can test more formally:

library(car)
leveneTest(residuals(mixed_model_IntOnly_Full.sqrt) ~ RIKZ.data$factor_Beach)

Note that there are other sophisticated and robust ways to assess assumptions in mixed models: https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html

And it has been recently established that mixed models are also quite robust even in cases of homoscedasticity: https://besjournals.onlinelibrary.wiley.com/doi/abs/10.1111/2041-210X.13434