Tutorial 8: Heteroscedasticity and GLMs
March 17, 2026
CRITICAL: Regular practice with R and RStudio, the statistical software used in BIOL 322 and introduced during tutorial sessions, and consistent engagement with tutorial exercises are essential for developing strong skills in Biostatistics.
EXERCISES: No exercises for tutorial 8 given that we need to cover tutorial 9 as well.
Dealing with Heteroscedasticity using general linear models (GLMs)
Remember: The tutorials were produced in such a way that you should be able to easily adapt the code to your own data!
General (not Generalized) Linear Models assume that residuals are normally distributed and have constant variance (homoscedasticity). In practice, linear models are often fairly robust to moderate departures from normality, especially with large sample sizes. However, they are much less robust to violations of the constant-variance assumption (residual heteroscedasticity), because unequal variances can distort standard errors and statistical tests.
It’s also important to note that heteroscedasticity can itself generate non-normal residual distributions. When residuals with different variances are combined, the overall distribution may appear skewed or heavy-tailed. Consequently, addressing heteroscedasticity can sometimes also improve the apparent normality of residuals.
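As a quick illustration of this last point, here is a minimal sketch (ours, not part of the tutorial's analyses) showing that mixing residuals drawn with different standard deviations produces a heavy-tailed, non-normal overall distribution:

```r
# A minimal sketch: combining residuals from populations with different
# variances yields a heavy-tailed (non-normal) overall distribution.
set.seed(1)
e.homo   <- rnorm(10000, 0, 5)                    # one common variance
e.hetero <- rnorm(10000, 0, rep(c(1, 10), 5000))  # two variances mixed
# Excess kurtosis is near 0 for a normal sample and clearly positive
# (heavy tails) for the heteroscedastic mixture:
excess.kurtosis <- function(x) mean((x - mean(x))^4) / sd(x)^4 - 3
excess.kurtosis(e.homo)
excess.kurtosis(e.hetero)
```

You can also compare hist(e.homo) and hist(e.hetero) to see the heavier tails directly.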
Small set of simulations to understand the nature of residual homoscedasticity versus heteroscedasticity
Let’s start by understanding what residual homoscedasticity is by simulating a small example (as we saw in lectures 12 and 14). But first, it will be useful to understand how the function rnorm that generates normally distributed values in R works:
We can either state that for each value simulated we want the same variance (here 20 observations from the same population with mean = 0 and standard deviation = 10):
rnorm(20,0,10)
Alternatively, instead of specifying a single value for the standard deviation, we can generate data from several normally distributed populations that share the same mean (e.g., 0) but differ in their standard deviations.
std.dev.vector <- c(10,10,10,10,10,2,2,2,2,2,3,4,5,6,7,8,10,2,2,2)
rnorm(20,0,std.dev.vector)
In this case, values that share the same standard deviation come from the same normally distributed population, since they are generated with the same mean (zero) and the same standard deviation.
Now let’s generate samples from a regression population with an intercept of 12 and a slope of 2.5. The sample size is n = 100. This is considered a population because the parameters of the data-generating process (the intercept, slope, and error distribution) are fixed and known, allowing us to generate sample data directly from that underlying model.
The error term e (a vector of residuals) is assumed to have constant variance, meaning that all residuals are drawn from the same distribution (in this example, mean = 0 and standard deviation = 10). The predictor X is simply a sequence of integers from 1 to n (i.e., 1, 2, 3, …, 100).
n <- 100
X <- 1:n
X
e <- rnorm(n,0,10)
Y <- 12 + 2.5*X + e
View(cbind(X,Y))
Let’s fit the model:
fit <- lm(Y~X)
plot(Y~X,pch=16,cex=1.5,col="firebrick")
abline(fit,col="black",lwd=2)
Note how the residuals deviate from the regression line by roughly the same magnitude across all values of X, meaning that the variance of the residuals is constant (i.e., the residuals are homoscedastic). This happens because they were all sampled from the same distribution (in this example, mean = 0 and standard deviation = 10). We can verify this as follows.
We can see that residuals are normal:
plot(fit, which=2)
And residuals are homoscedastic:
plot(fit, which=3)
Let’s now generate residuals that come from populations with different variances (i.e., heteroscedastic residuals). Note that the expected mean of the residuals is always zero, as defined in regression and in any General Linear Model. Here, we will make the residual variance proportional to the predictor variable X, which is a pattern commonly observed in biological data sets: the variability of observations tends to increase (or sometimes decrease) along an environmental or biological gradient.
To create this pattern, we replace the constant standard deviation used previously with the vector X. As a result, the residual variance increases as X increases.
n <- 100
X <- 1:n
e <- rnorm(n,0,X)
Y <- 12 + 2.5*X + e
fit <- lm(Y~X)
plot(Y~X,pch=16,cex=1.5,col="firebrick")
abline(fit,col="black",lwd=2)
Note now how the residuals (the vertical offsets between the observed Y values and the regression line) increase with X. Let’s observe how the residuals change as a function of the predictor:
plot(X, resid(fit), xlab = "X",ylab = "Residuals",col="firebrick")
or:
plot(X, sqrt(abs(resid(fit))), xlab = "X",ylab = "Residuals",col="firebrick")
It’s clear that the residuals increase as a function of X. We can see that residuals do not appear normal (a common consequence of residual heteroscedasticity):
plot(fit, which=2)
And that residuals are clearly heteroscedastic:
plot(fit, which=3)
We could also have generated residuals with the opposite trend by using:
e <- rnorm(n,0,sort(X,decreasing=TRUE))
Y <- 12 + 2.5*X + e
fit <- lm(Y~X)
plot(Y~X,pch=16,cex=1.5,col="firebrick")
abline(fit,col="black",lwd=2)
And even more complex trends, in which residuals are greater for intermediate values of X:
e <- rnorm(n,0,c(rep(3,25),26:75,rep(3,25)))
Y <- 12 + 2.5*X + e
fit <- lm(Y~X)
plot(Y~X,pch=16,cex=1.5,col="firebrick")
abline(fit,col="black",lwd=2)
An empirical application to understand how to potentially find the origins of heteroscedasticity and use appropriate weights to fit models so that their residuals are homoscedastic
Here we will see how residual heteroscedasticity can be analyzed and even interpreted as a biological phenomenon. This will also allow us to introduce Generalized Least Squares (GLS), a widely used tool for handling heteroscedasticity in biological data.
Now that you have a good understanding of residual heteroscedasticity, we can move on to a real example using the squid data set discussed in class. Our goal is to assess seasonal changes in somatic versus reproductive investment.
Let’s upload the data:
squid <- read.csv("squid.csv")
dim(squid)
View(squid)
Let’s transform month into a factor given that we will be treating it as a categorical variable in the analysis:
squid$fMONTH <- factor(squid$MONTH)
Let’s start by running a linear model of the response (testis weight, i.e., reproductive tissue) against dorsal mantle length (DML; indicative of individual size), month, and their interaction:
lm.squid <- lm(Testisweight ~ DML * fMONTH, data = squid)
summary(lm.squid)
From the summary, testis weight changes as a function of DML and differs among months. The presence of several significant interactions between month and DML indicates that the relationship between testis weight (continuous response variable) and DML (continuous predictor) varies over time. In other words, the slopes of the regression lines change across months, reflecting temporal variation in how individuals allocate energy to somatic versus reproductive investment. In some months the slopes are steeper, indicating greater reproductive investment per unit increase in body size, whereas in other months they are shallower. (Note that the DML:fMONTH values in the Estimate column of the summary table are differences between each month’s slope and the slope of the reference month, not the slopes themselves.)
Before analyzing these differences in slopes among months in more detail, we first need to verify whether the assumptions of the model are satisfied.
Let’s first assess normality:
plot(lm.squid, which = 2,col="firebrick")
The residuals give some indication, particularly given the large sample size (768 individuals), that they are not normal. Let’s assess residual homoscedasticity:
plot(lm.squid, which = 3,col="firebrick")
Residuals seem to increase with fitted values. Let’s test this more formally:
library(lmtest) # install.packages("lmtest") in case you don't have it yet.
bptest(lm.squid)
The Breusch–Pagan test provides strong evidence that the residuals are not homoscedastic. Let’s therefore examine the residual variation more closely. By exploring how residuals change as a function of the predictors, we can better understand the structure of the heteroscedasticity and potentially incorporate it into the model.
Let’s start by assessing how they vary with one of the predictors (DML):
plot(squid$DML, resid(lm.squid), xlab = "DML",ylab = "Residuals",col="firebrick")
The residuals clearly increase with DML, similar to the pattern observed in the simulated example above. Residual variation also differs across months. In particular, variability is noticeably greater during September, October, and November, although residual variance varies among other months as well.
boxplot(resid(lm.squid)~squid$fMONTH,xlab = "Month",names=month.abb,ylab = "Residuals",col="firebrick")
The residual plot for the interaction between month and DML has already been produced above (i.e., plot(lm.squid, which = 3,col="firebrick")).
By understanding the structure of the residual variation, we can specify a variance structure that changes according to month. This allows the model to account for differences in residual variance across observations. Observations associated with larger variance receive less weight in the fitting process.
This approach is called Generalized Least Squares (GLS) and is implemented in the nlme package, which is used to fit and compare Gaussian linear and nonlinear mixed-effects models.
Let’s install the package:
install.packages("nlme")
library(nlme)
Let’s start by specifying the residual variance as a function of Month. The function varIdent, used below, allows the model to estimate a different residual variance for each month. In other words, each observation is assigned the variance of the month in which that specimen was collected and measured.
Categorical variables such as Month are written as ~1 | fMONTH. Note that we do not write ~1 | squid$fMONTH; we write only ~1 | fMONTH because the variable is found automatically in the data frame supplied to the model through data = squid.
In other words, the gls function will evaluate the variance structure using variables from the data frame that you supply to the function later on (via data = squid).
weight.I.month <- varIdent(form= ~1|fMONTH)
This variance structure is then used by the GLS procedure, which weights each observation by the reciprocal (inverse) of its estimated residual standard deviation: observations from high-variance months have less influence on the model fit than observations from low-variance months.
gls.I.month <- gls(Testisweight ~ DML*fMONTH,weights = weight.I.month, data = squid)
Just for curiosity’s sake, we can extract the weight assigned to each observation as follows:
weight.per.month <- attr(gls.I.month$model$varStruct, "weights")
weight.per.month
Note that all observations within the same month share the same weight, but each observation in the model has its own weight entry.
We can plot a boxplot of the residuals per month beside a plot of the weights assigned to each month:
par(mfrow=c(1,2))
boxplot(resid(lm.squid)~squid$fMONTH,names=month.abb,xlab = "Month",ylab = "Residuals",col="firebrick")
plot(attr(gls.I.month$model$varStruct, "weights")~squid$fMONTH,xlab = "Month",ylab = "GLM weights",pch=16,cex=1,names=month.abb)
There are many ways (called variance structures) to assign weights to residuals in statistical models. We cannot cover all of them in detail in this course, but the key idea is straightforward: we give different weights to observations so that some residuals have more influence on the regression fit than others. Residuals associated with large variance are given less weight (because they are less reliable), whereas residuals with small variance are given more weight.
By applying different weights to different observations, the model explicitly accounts for heteroscedasticity in the residuals. There are many ways to weight residuals because heteroscedasticity can arise in many different forms (e.g., changing with predictors, time, groups, or other factors). In contrast, there is essentially only one way to be homoscedastic: when the residual variance is constant across all observations.
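The core idea of weighting can be sketched with base R’s lm(), which accepts a weights argument (this simulated example and its weight choice are ours, not part of the squid analysis):

```r
# Weighted least squares minimizes sum(w_i * e_i^2): observations with
# small weights (i.e., large residual variance) influence the fit less.
set.seed(3)
x <- 1:100
y <- 2 + 0.5 * x + rnorm(100, 0, x)    # residual sd grows with x
fit.ols <- lm(y ~ x)                   # ignores heteroscedasticity
fit.wls <- lm(y ~ x, weights = 1/x^2)  # weight = 1/variance
coef(fit.ols)
coef(fit.wls)
```

Here the true variance is known (it is x^2); with real data, functions such as gls() estimate the variance structure instead of requiring us to supply it.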
In the example above, months with larger residual variances receive smaller weights than months with smaller residual variances. For instance, February receives a relatively large weight in the model fit because residual variation is small, whereas September receives a smaller weight because residual variation is much larger.
Let’s now examine the residuals again. Because the residuals are associated with different variance structures, we must standardize them before comparing them on a common scale. Standardization ensures that the residuals have mean = 0 and variance = 1. Since the residual mean is already zero by definition, we simply divide each residual by its standard deviation (i.e., the square root of the variance for its month). This can be done with the following command:
e.gls.I.month <- resid(gls.I.month, type = "normalized")
You can read more about standardization of residuals in:
http://www.r-tutor.com/elementary-statistics/simple-linear-regression/standardized-residual
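To see what resid(..., type = "normalized") does under the hood, here is a small self-contained sketch on simulated grouped data (variable names are ours; the by-hand check relies on the model having no correlation structure, as is the case throughout this tutorial):

```r
library(nlme)
set.seed(2)
g <- factor(rep(1:2, each = 50))                       # two groups
x <- rep(1:50, 2)
y <- 1 + 2 * x + rnorm(100, 0, ifelse(g == "1", 2, 10))  # group-specific sd
m <- gls(y ~ x, weights = varIdent(form = ~1 | g))
# Normalized residuals = raw residuals * weights / sigma, i.e., each raw
# residual divided by its estimated standard deviation:
e.hand <- resid(m) * attr(m$modelStruct$varStruct, "weights") / m$sigma
all.equal(as.numeric(e.hand), as.numeric(resid(m, type = "normalized")))
```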
We can now plot the standardized residuals for each month. Before that, let’s clear the graph output using dev.off() (device off; the device being the graphics window).
dev.off()
boxplot(e.gls.I.month~squid$fMONTH,xlab = "Month",ylab = "Residuals",col="firebrick",names=month.abb)
Residual variances are now much more similar across months.
Let’s check for normality. In this case, objects generated by the gls function need to be passed to a function called qqnorm instead of plot, as is done for lm-generated objects:
qqnorm(gls.I.month,col="firebrick",abline = c(0,1))
It looks normal, particularly in contrast to the original normal Q-Q residual plot generated from the lm model. Remember that linear models are fairly robust to departures from normality.
An important question, which will become even clearer shortly, is whether the variance structure used here actually improves the regression fit (i.e., reduces unexplained residual variation) compared with a simple regression model that assumes constant variance (homoscedasticity).
The function lm produces a model object that is not fully compatible with objects produced by the gls() function. Therefore, to make the models directly comparable, we must refit the model without weights using the gls() function as well. This ensures that both models are of the same type and can be properly compared:
lm.via.GLS <- gls(Testisweight ~ DML * fMONTH, data=squid)
Now we use the anova command to compare the two models: the GLS model without weights (equivalent to lm) and the GLS model with weights:
anova(lm.via.GLS,gls.I.month)
This analysis provides two main results. First, the model that includes weights fits the data better, as indicated by its smaller AIC. AIC is a measure of model fit that balances goodness of fit and model complexity. Unlike R^2 (introduced in the regression lecture and tutorial), smaller AIC values indicate a better-fitting model.
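For reference, AIC can be computed by hand from the model’s log-likelihood as -2*logLik + 2*k, where k is the number of estimated parameters. A minimal check using a built-in data set (our example, not part of the squid analysis):

```r
# AIC = -2 * log-likelihood + 2 * (number of estimated parameters)
aic.by.hand <- function(m) {
  ll <- logLik(m)
  -2 * as.numeric(ll) + 2 * attr(ll, "df")
}
fit <- lm(dist ~ speed, data = cars)  # any fitted model works here
aic.by.hand(fit)
AIC(fit)  # the two values match
```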
Second, the comparison shows that the weighted model explains significantly more variation than the model without weights. This is supported by the significant p-value obtained when comparing the two models.
The GLS model above assumes that the residual variance changes only across months. However, we also observed that residual variance appears to change as a function of the continuous predictor DML, as well as through the interaction between DML and month. We will now explore how to identify a variance structure that better captures these patterns in the data.
Let’s therefore specify a variance structure that depends on both Month and DML, and then refit the GLS model using these weights.
weight.F.DML <- varFixed(~DML)
Re-run the GLS model using this structure:
gls.F.DML <- gls(Testisweight ~ DML*fMONTH,weights = weight.F.DML, data = squid)
In this case, the residual variance is assumed to be proportional to DML, so the residual standard deviation is proportional to the square root of DML. Note how the weights generated by the gls function above (varFixed(~DML)) are the inverse of the square root of DML (1/sqrt(squid$DML)):
View(cbind(1/sqrt(squid$DML),attr(gls.F.DML$model$varStruct, "weights")))
Let’s determine which residual structure improved the model fit. We can again use the function anova to compare the three models with their three residual variance structures:
anova(lm.via.GLS,gls.I.month,gls.F.DML)
The model with the weight structure based on month provided the best fit (lowest AIC).
Stop and read this with attention
A common concern raised by students and users of statistics is: “There are many options, and I don’t fully understand them.” This is indeed a valid concern. However, many important statistical tools are routinely used without a complete understanding of all the details of how they are fitted. In many situations, a general understanding of their purpose and behavior is sufficient.
This is why it is important to compare models that use different approaches to weighting residuals. The underlying mathematics is not extremely difficult, but it is somewhat more complex than what we can cover at this level of statistical training.
In practice, the idea is to try several variance (weight) structures that are flexible enough to account for heteroscedasticity and potentially make the residuals approximately homoscedastic. We can then compare these models using appropriate diagnostic tools and model-selection criteria.
The key questions practitioners typically focus on are: Which variance structure provides the best fit? and Did the model succeed in making the residuals approximately homoscedastic?
Up to this point, the best residual variance structure we identified is the one based on month. However, we previously used a variance structure that assumes the residual variance changes linearly with the predictor. It is also possible that non-linear relationships between the variance and a predictor provide a better fit.
Let’s therefore try a few standard variance functions available in the nlme package to see whether they improve the model fit. Note that these variance transformations can only be applied to a covariate (continuous predictor)—in this case DML—and not to a factor (categorical predictor) such as Month.
In principle, we could also define custom variance functions so that the variance follows any mathematical relationship we specify and then incorporate them into the GLS model. However, creating such functions is beyond the scope of this tutorial.
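For reference, here is our summary (based on the nlme documentation) of how the variance functions used below model the residual standard deviation, with v standing for the covariate (here DML) and delta, c for parameters estimated during fitting:

```r
library(nlme)
# varFixed(~v):      sd proportional to sqrt(v)   (no estimated parameters)
# varPower(~v):      sd = sigma * |v|^delta
# varConstPower(~v): sd = sigma * (c + |v|^delta)
# varExp(~v):        sd = sigma * exp(delta * v)
# For example, varFixed weights are fixed at 1/sqrt(v):
v  <- 1:10
vf <- Initialize(varFixed(~v), data = data.frame(v = v))
all.equal(as.numeric(varWeights(vf)), 1 / sqrt(v))
```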
vP.DML <- varPower(form= ~DML)
gls.vP.DML <- gls(Testisweight ~ DML*fMONTH,weights = vP.DML, data = squid)
Let’s check whether the varPower structure improved over the best varIdent structure found so far (the one based on month):
anova(gls.I.month,gls.vP.DML)
It is clear that the variance based on a power function of DML improved the model fit. Let’s try the other variance structures:
vCPower.DML <- varConstPower(form=~DML)
The final residual structure that we will try here is one that combines different variance structures. In this case we can mix them: let’s try varIdent for month together with varExp for DML. You can obviously try other combinations as well.
vC.DML.Month <- varComb(varIdent(form =~ 1 | fMONTH) , varExp(form =~ DML))
Let’s run the GLS models for each of these structures:
gls.vCPower.DML <- gls(Testisweight ~ DML*fMONTH,weights = vCPower.DML, data = squid)
gls.vC.DML.Month <- gls(Testisweight ~ DML*fMONTH,weights = vC.DML.Month, data = squid)
Let’s now test these variance structures against each other:
anova(gls.vP.DML,gls.vCPower.DML,gls.vC.DML.Month)
The best weight structure is the combination of variance structures using month and DML. We could certainly have tried other combinations, but let’s stop here.
Now that we have evidence that we can trust the model estimates (i.e., the residuals are homoscedastic), let’s interpret the model:
anova(gls.vC.DML.Month)
The most important source of variation is DML, but because the interaction is significant, we need to interpret the slope of the regression of testis weight on DML separately for each month. To do that, it is also important to check the residuals against DML for each month:
e <- resid(gls.vC.DML.Month, type = "normalized")
coplot(e ~ DML | fMONTH, data = squid)
For the most part they all behave well, and it is unlikely that we will find a residual structure that fits the data better than the one we have here. Perhaps fitting each month separately with a different variance structure could improve the fit further, but that would not allow for an easy interpretation of the results.
Let’s check for normality:
dev.off() # just in case
qqnorm(gls.vC.DML.Month,col="firebrick",abline = c(0,1))
Looks great!
Since the slopes vary per month, let’s see how they vary through time. For that, we fit a separate GLS model (testis weight against DML) for each month and extract its slope. We can do that using a loop that goes through each month as follows:
plot(squid$Testisweight ~ squid$DML,pch=16,cex=0.5,xlab="DML",ylab="Testis weight")
slopes <- vector()
for (i in 1:12){
# which specimens were found in a particular month:
month <- which(squid$fMONTH == i)
data.tmp <- squid[month,]
vCPower.DML <- varConstPower(form=~DML)
gls.tmp <- gls(Testisweight ~ DML,weights = vCPower.DML,data=data.tmp)
abline(gls.tmp$coefficients[1],gls.tmp$coefficients[2],col=i,lwd=2)
slopes[i]=gls.tmp$coefficients[2]
}
Let’s observe how slopes change across months.
plot(1:12,slopes, xlab = "Month",ylab ="slopes",col="firebrick",pch=15,cex=2)
lm.time.slope <- lm(slopes~c(1:12))
abline(lm.time.slope)
The slopes increase from January to December, meaning that the investment of energy in reproduction increases throughout the year. This may reflect changes in environmental conditions that make reproduction more successful in later months.