Tutorial 8: Heteroscedasticity and GLMs

March 14, 2023 (8th week of classes)
Dealing with heteroscedasticity in general linear models (GLMs) using generalized least squares (GLS)


Remember: the tutorials were produced in such a way that you should be able to easily adapt the code to your own data!

General (not Generalized) Linear Models assume normality and homogeneity of residual variance (residual homoscedasticity). General linear models tend to be robust against violations of normality, but not against residual heteroscedasticity. Note, though, that residual heteroscedasticity can be a major source of non-normality in residuals. As such, learning how to deal with residual heteroscedasticity can also (sometimes) tackle the issue of non-normality as well.

Small set of simulations to understand the nature of residual homoscedasticity versus heteroscedasticity

Let’s start by understanding what residual homoscedasticity is by simulating a small example (as we saw in lectures 12 and 14). But first, it will be useful to understand how rnorm, the R function that generates normally distributed values, works:

We can either state that we want the same variance for every simulated value (here, 20 observations from the same population with mean = 0 and standard deviation = 10):

rnorm(20,0,10)

Or, instead of entering a single value for the standard deviation, we can generate values coming from different normally distributed populations that share the same mean (say, 0) but have different standard deviations:

std.dev.vector <- c(10,10,10,10,10,2,2,2,2,2,3,4,5,6,7,8,10,2,2,2)
rnorm(20,0,std.dev.vector)

In this case, values that share the same standard deviation come from the same normally distributed population, as they all have the same mean (zero) and the same standard deviation.
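A quick sketch to see this in action (not part of the original example): draw many values per standard deviation and check that the sample spread tracks the standard deviation we requested. set.seed is only there to make the sketch reproducible.

set.seed(1)                              # reproducibility of this sketch only
sds <- rep(std.dev.vector, each = 500)   # many draws per standard deviation
sim <- rnorm(length(sds), 0, sds)
round(tapply(sim, sds, sd), 2)           # sample SDs track the requested SDs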

Now let’s generate samples from a regression population with an intercept equal to 12 and a slope equal to 2.5; the sample size (n) is 100 observations. e (the vector of residuals) is assumed to have constant variance, i.e., all of its values are sampled from the same population (in the example below, mean = 0 and standard deviation = 10). X is simply a vector of discrete values 1 to n (i.e., 1, 2, 3, …, 100).

n <- 100
X <- 1:n
X
e <- rnorm(n,0,10)
Y <- 12 + 2.5*X + e
View(cbind(X,Y))

Let’s fit the model:

fit <- lm(Y~X)
plot(Y~X,pch=16,cex=1.5,col="firebrick")
abline(fit,col="black",lwd=2)

Note how the residuals deviate more or less in the same way from the regression line (i.e., residuals are homoscedastic) as they were all sampled from the same population, i.e., mean = 0 and standard deviation = 10. We can check that as follows:

We can see that residuals are normal:

plot(fit, which=2)

And residuals are homoscedastic:

plot(fit, which=3)

Let’s now generate residuals that come from populations varying in their variances (heteroscedastic residuals). Note that the mean of the residuals is always zero (as defined for regression and any General Linear Model). Here we will make the residual variation increase with the predictor variable X, which is a common pattern in many real biological data sets: the variance of the residuals increases (or decreases) with the values of the response. Below, we simply replace the single standard deviation of the residual vector with X, so the residual standard deviation (and thus the variance) increases with X.

n <- 100
X <-  1:n
e <-  rnorm(n,0,X)
Y <-  12 + 2.5*X + e
fit <-  lm(Y~X)
plot(Y~X,pch=16,cex=1.5,col="firebrick")
abline(fit,col="black",lwd=2)

Note now how the residuals (the vertical offsets between the observed Y and the predicted values, i.e., the regression line) increase with X. Let’s observe how the residuals change as a function of the predictor:

plot(X, resid(fit), xlab = "X",ylab = "Residuals",col="firebrick") 

or:

plot(X, sqrt(abs(resid(fit))), xlab = "X",ylab = "sqrt(|Residuals|)",col="firebrick") 

It’s clear that the residuals increase as a function of X.

We can see that residuals do not appear normal (a common consequence of residual heteroscedasticity):

plot(fit, which=2)

And that residuals are clearly heteroscedastic:

plot(fit, which=3)
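As a sketch of a more formal check (using the same Breusch-Pagan test that we will apply to real data later in this tutorial), the lmtest package should strongly reject homoscedasticity for this simulated fit:

library(lmtest)  # install.packages("lmtest") if you don't have it yet
bptest(fit)      # a very small p-value is expected: the variance grows with X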

We could also have generated residuals with the opposite trend by using:

e <- rnorm(n,0,sort(X,decreasing=TRUE))
Y <-  12 + 2.5*X + e
fit <-  lm(Y~X)
plot(Y~X,pch=16,cex=1.5,col="firebrick")
abline(fit,col="black",lwd=2)

And we could generate even more complex trends, e.g., residuals that are greater for intermediate values of X:

e <- rnorm(n,0,c(rep(3,25),26:75,rep(3,25)))
Y <-  12 + 2.5*X + e
fit <-  lm(Y~X)
plot(Y~X,pch=16,cex=1.5,col="firebrick")
abline(fit,col="black",lwd=2)

An empirical application to understand how to find the origins of heteroscedasticity and use appropriate weights to fit models so that their residuals are homoscedastic

Here we will see how residual heteroscedasticity can be studied and even interpreted as a biological phenomenon. This will also set the basis for mixed models, which are relatively modern and widely used tools for addressing heteroscedasticity, particularly in biology.

Now that you have a good grasp on what residual heteroscedasticity is, we can move on to a real example. Here we will use the squid data set described and analysed in class. The goal here is to assess seasonal changes in somatic versus reproductive investment.

Download the squid data

Let’s load the data:

squid <- read.csv("squid.csv")
dim(squid)
View(squid)

Let’s transform month into a factor given that we will be treating it as a categorical variable in the analysis:

squid$fMONTH <- factor(squid$MONTH)

Let’s start by running a simple linear model with the response (testis weight, i.e., reproductive tissue) as a function of dorsal mantle length (DML; indicative of individual size), month, and their interaction:

lm.squid <- lm(Testisweight ~ DML * fMONTH, data = squid) 
summary(lm.squid)

From the summary, testis weight changes as a function of DML and differs among months. Given that many interactions between month (time) and DML are significant, the slopes between testis weight and DML change as a function of time (months). In other words, there is temporal variation in how much tissue (energy) individuals invest in somatic versus reproductive tissue: in some months the slopes (column Estimate in the summary table) are steeper (more investment) than in others (less investment). Before we analyse the data further to understand the patterns of change in slopes among months, we need to check whether the assumptions hold.
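As a sketch of how the month-specific slopes relate to the summary table (this assumes R’s default treatment contrasts, with month 1 as the reference level, and is not part of the original analysis): the DML coefficient is the slope for the reference month, and each DML:fMONTHm coefficient is the difference between month m’s slope and the reference slope.

cf <- coef(lm.squid)
slope.ref <- cf["DML"]                             # slope for the reference month
slope.diffs <- cf[grep("^DML:fMONTH", names(cf))]  # differences from the reference
round(c(slope.ref, slope.ref + slope.diffs), 3)    # slopes for all 12 months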

Let’s first assess normality:

plot(lm.squid, which = 2,col="firebrick")

The residuals give some indication, particularly given the large sample size (768 individuals), that they are not normal. Let’s assess residual homoscedasticity:

plot(lm.squid, which = 3,col="firebrick")

Residuals seem to increase with fitted values. Let’s test this more formally:

library(lmtest) # install.packages("lmtest") in case you don't have it yet.
bptest(lm.squid)

The Breusch-Pagan test provides very strong evidence against residual homoscedasticity. Let’s study the residual variation a little further and see how the residuals change as a function of the predictors, so that we understand their structure and can perhaps incorporate this structure into the model to make the residuals homoscedastic.

Let’s start by assessing how they vary with one of the predictors (DML):

plot(squid$DML, resid(lm.squid), xlab = "DML",ylab = "Residuals",col="firebrick") 

They clearly vary (increase) with DML (similar to our small simulated data above). They also vary with season: note below how the residual variation is greater for the months of September, October and November, though it varies across the other months as well.

boxplot(resid(lm.squid)~squid$fMONTH,xlab = "Month",names=month.abb,ylab = "Residuals",col="firebrick")

The residual plot for the interaction between month and DML has already been produced above (i.e., plot(lm.squid, which = 3, col="firebrick")).

Having this knowledge about the structure of the residuals, we can create weights with a variance structure that changes according to month. These weights can then be used to down-weight observations that have large associated residual variation, thus reducing the influence of these observations on the model fitting process. This method is called Generalized Least Squares (GLS) and is implemented in the library nlme (a package to “Fit and compare Gaussian linear and nonlinear mixed-effects models”):

Let’s install the package:

install.packages("nlme")
library(nlme)

Let’s start by setting up the weights for the residuals as a function of month. The function varIdent, as used below, estimates the variance per month and builds a weight structure in which each observation is assigned the variance of its month (i.e., the month in which the individual, called a specimen in these data, was collected and measured). Categorical variables (such as month) are coded as 1|fMONTH. Note that you don’t write 1|squid$fMONTH below, but rather 1|fMONTH.

In other words, the gls function will use the weight structure assuming that this variable belongs to the data you enter in the function later on (here, data = squid).

weight.I.month = varIdent(form= ~1|fMONTH)

This structure is then used by the GLS procedure. The GLS procedure automatically weights observations by the reciprocal (inverse) of their estimated residual variation: observations with large residual variation have less influence on the model fit than observations with small residual variation.

gls.I.month <- gls(Testisweight ~ DML*fMONTH,weights = weight.I.month, data = squid)

Just for curiosity’s sake, we can extract the weight value for each observation as follows:

weight.per.month <- attr(gls.I.month$modelStruct$varStruct, "weights")
weight.per.month

Note that all observations from the same month share the same weight, but each month has its own weight.
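We can verify this with a quick sketch (assuming, as is the case here with no grouping or correlation structure, that the weights are returned in the same row order as the data):

# one distinct weight per month, shared by all observations from that month
w.by.month <- tapply(weight.per.month, squid$fMONTH, unique)
w.by.month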

We can plot a boxplot of the monthly variation of residuals beside a plot of the GLS weights per month:

par(mfrow=c(1,2))
boxplot(resid(lm.squid)~squid$fMONTH,names=month.abb,xlab = "Month",ylab = "Residuals",col="firebrick")
plot(attr(gls.I.month$modelStruct$varStruct, "weights")~squid$fMONTH,xlab = "Month",ylab = "GLS weights",pch=16,cex=1,names=month.abb)

There are lots of ways (called variance structures) to weight residuals in the literature, and we can’t possibly cover them in detail in this course and tutorial. The general idea (again) is to assign weights so that some residuals have more importance in the regression fit than others: large residuals are given less weight (we can trust them less) than small residuals. By applying different weights to different residuals, the residuals often become homoscedastic. Why are there different ways to weight residuals? Because there are many ways in which residuals can be heteroscedastic, but only one way to be homoscedastic.

You can see that months with larger residual variances have relatively smaller weights than months with small variances (e.g., contrast February, with a large weight to be used in the model fit, i.e., small residual variation, against September, with a small weight, i.e., large residual variation).
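A small sketch makes this inverse relationship explicit by placing the residual spread of the unweighted model next to the estimated weight for each month (w.by.month was computed in the sketch above):

# months with a larger residual spread receive smaller weights
sd.by.month <- tapply(resid(lm.squid), squid$fMONTH, sd)
round(cbind(sd.by.month, weight = w.by.month), 3)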

Let’s check the residuals again. Because the residuals have different variances, in order to compare them on a common scale we need to standardize them (mean = 0 and variance = 1). Since their means are zero by definition, we just need to divide each residual by its standard deviation (i.e., the square root of its monthly variance). This is done by the following command:

e.gls.I.month <- resid(gls.I.month, type = "normalized")

You can read more about standardization of residuals in: http://www.r-tutor.com/elementary-statistics/simple-linear-regression/standardized-residual
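To see concretely what this standardization does here, a sketch (with no correlation structure, “normalized” and Pearson residuals coincide: each raw residual is divided by its own estimated standard deviation, sigma divided by its weight; the slot names below follow current nlme versions):

w <- attr(gls.I.month$modelStruct$varStruct, "weights")
e.by.hand <- resid(gls.I.month) * w / gls.I.month$sigma
all.equal(as.numeric(e.by.hand), as.numeric(e.gls.I.month))  # should be TRUE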

We can now plot the standardized residuals for each month. Before that, let’s clear the graph output using dev.off() (device off; the device being the graphics window):

dev.off()
boxplot(e.gls.I.month~squid$fMONTH,xlab = "Month",ylab = "Residuals",col="firebrick",names=month.abb)

Residual variances are now much more similar across months.

Let’s check for normality. In this case, objects generated by the gls function need to be passed to a function called qqnorm instead of plot, as used for lm-generated objects:

qqnorm(gls.I.month,col="firebrick",abline = c(0,1))

It looks normal, particularly in contrast to the original normal Q-Q residual plot generated from the lm model. Remember, too, that linear models are quite robust against violations of normality.

One relevant question, which will become even more obvious a bit later, is whether the variance structure used here made the regression fit the data better (i.e., reduced residual variation) in contrast to a simple regression model (i.e., one assuming constant variance, or homoscedasticity). The function lm generates an object that is not fully compatible with the object produced by gls. As such, to make them comparable, we need to re-run the model without weights using the gls function as follows:

lm.via.GLS <- gls(Testisweight ~ DML * fMONTH, data=squid)

Now we use the anova command to compare the two models (the gls fit without weights and the gls fit with weights):

anova(lm.via.GLS,gls.I.month)

This analysis gives us two basic results. First, the better model is the one with weights: it has the smaller AIC, which is a metric of model fit (contrary to R2, seen in the regression lecture and tutorial, smaller AIC values indicate better fit). Second, the weighted model fits the data significantly better than the model without weights, as indicated by the significant p-value of the likelihood-ratio test.
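If you want the AIC values on their own, the generic AIC function also works on gls objects (a quick sketch):

AIC(lm.via.GLS, gls.I.month)  # smaller AIC = better fit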

The GLS model above assumed a residual variance structure that changed per month, but we also observed that the variance changes as a function of the continuous predictor (DML) and of the interaction between DML and month. Here we will learn how to find the residual variance structure that best fits the data. Let’s now set weights based on DML and re-run the GLS model with those weights.

weight.F.DML <- varFixed(~DML)

Re-run the GLS model with this structure:

gls.F.DML <- gls(Testisweight ~ DML*fMONTH,weights = weight.F.DML, data = squid)

In this case, the residual variance is assumed to be proportional to DML, so the residual standard deviation is proportional to the square root of DML. Note how the weights generated by the gls function above (via varFixed(~DML)) are the inverse of the square root of DML (1/sqrt(squid$DML)):

View(cbind(1/sqrt(squid$DML),attr(gls.F.DML$modelStruct$varStruct, "weights")))
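Since varFixed has no parameters to estimate, this particular gls fit is just ordinary weighted least squares. As a sketch (not part of the original analysis), lm with weights = 1/DML (i.e., variance proportional to DML) should reproduce the same coefficients:

# weighted least squares by hand: lm() weights are inverse variances
wls <- lm(Testisweight ~ DML * fMONTH, weights = 1/DML, data = squid)
round(cbind(lm.wls = coef(wls), gls.varFixed = coef(gls.F.DML)), 4)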

Let’s determine which residual variance structure improved the model fit. We can again use the function anova to compare the three models with their three residual variance structures:

anova(lm.via.GLS,gls.I.month,gls.F.DML)

The model with the weight structure based on month provided the best fit (lowest AIC).


Stop and read this carefully

An important concern that is often brought up by students and users of statistics is: “there are a bunch of options and I don’t understand them well”. That is indeed an important issue. That said, many major tools are used with only general knowledge, not a full understanding, of their fitting processes and details. And that’s all right in many situations.

That’s the reason why we need to compare the fit of models that use different ways of weighting residuals. The math is not too complicated, but it is a bit too involved for what we can cover at this level of statistical training.

So, the idea is to try a number of weight functions that are usually general enough to make the residuals homoscedastic, and to compare these different functions using appropriate tools: how do they compare? Did they make the residuals homoscedastic? The latter questions are often what practitioners (i.e., users of statistics) centre on.


Up until now, the best residual variance structure is the one based on month. That said, so far we have only used variance structures in which the variance is constant within groups (varIdent) or changes linearly with the covariate (varFixed). There are also non-linear functions worth trying to assess whether they improve the fit; let’s try a few that are standard in the nlme package (see the sketch below). Note, however, that these transformations can only be applied to a covariate (continuous predictor; here DML) and not to a factor (discrete predictor; here month). Finally, we could write special functions so that the variance structure takes whatever form we want and then use them in the gls function (this won’t be covered here).
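For reference, here is a sketch of the variance structures tried in this tutorial and the residual variance each one implies (based on the nlme documentation; delta and c denote parameters estimated from the data):

library(nlme)
varIdent(form = ~ 1 | fMONTH)  # Var = sigma^2 * delta[month]^2 (one delta per month)
varFixed(~ DML)                # Var = sigma^2 * DML (fixed; no parameters estimated)
varPower(form = ~ DML)         # Var = sigma^2 * |DML|^(2 * delta)
varExp(form = ~ DML)           # Var = sigma^2 * exp(2 * delta * DML)
varConstPower(form = ~ DML)    # Var = sigma^2 * (c + |DML|^delta)^2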

vP.DML <- varPower(form= ~DML)
gls.vP.DML <- gls(Testisweight ~ DML*fMONTH,weights = vP.DML, data = squid)

Let’s check whether the varPower structure improved the fit over the best varIdent structure (here we take the one based on month, which gave the best fit so far):

anova(gls.I.month,gls.vP.DML)

It is clear that the variance based on a power function of DML improved the model fit. Let’s try the other variance structures:

vCPower.DML <- varConstPower(form=~DML)

The final residual structure that we will try here is one that combines different variance structures. In this case we can mix them: let’s try varIdent for month combined with varExp for DML. You can obviously try other combinations as well.

vC.DML.Month <- varComb(varIdent(form =~ 1 | fMONTH) , varExp(form =~ DML))

Let’s run the GLS models for each of these structures:

gls.vCPower.DML <- gls(Testisweight ~ DML*fMONTH,weights = vCPower.DML, data = squid)
gls.vC.DML.Month <- gls(Testisweight ~ DML*fMONTH,weights = vC.DML.Month, data = squid)

Let’s now test these variance structures against each other:

anova(gls.vP.DML,gls.vCPower.DML,gls.vC.DML.Month)

The best weight structure is the combination of variance structures using month and DML. We could certainly have tried other combinations, but let’s stop here.

Now that we have evidence that we can trust the model estimation (i.e., the residuals are homoscedastic), let’s interpret the model:

anova(gls.vC.DML.Month)

The most important source of variation is DML, but because the interaction is significant, we need to interpret the slopes of the regression of testis weight on DML separately (i.e., per month). To do that, it is also important to check the residuals against DML for each month:

e <- resid(gls.vC.DML.Month, type = "normalized")
coplot(e ~ DML | fMONTH, data = squid)

For the most part they all behave well, and it is unlikely that we will find a structure for the residuals that fits the data better than the one we have here. Perhaps fitting each month separately with a different variance structure would, but that would not allow for an easy interpretation of the results.

Let’s check for normality:

dev.off() # just in case
qqnorm(gls.vC.DML.Month,col="firebrick",abline = c(0,1))

Looks great!

Since the slopes vary per month, let’s see how they vary through time. For that, we fit the model to each month separately and extract the slope of testis weight on DML. We can do that using a loop that goes through each month as follows:

plot(squid$Testisweight ~ squid$DML,pch=16,cex=0.5,xlab="DML",ylab="Testis weight")
slopes <- vector()
for (i in 1:12){
   # which specimens were collected in a particular month:
   month <- which(squid$fMONTH == i)
   data.tmp <- squid[month,]
   vCPower.DML <- varConstPower(form=~DML)
   gls.tmp <- gls(Testisweight ~ DML, weights = vCPower.DML, data = data.tmp)
   # add the fitted line for month i and store its slope:
   abline(gls.tmp$coefficients[1], gls.tmp$coefficients[2], col=i, lwd=2)
   slopes[i] <- gls.tmp$coefficients[2]
}

Let’s observe how slopes change across months.

plot(1:12,slopes, xlab = "Month",ylab ="slopes",col="firebrick",pch=15,cex=2)
lm.time.slope <- lm(slopes~c(1:12))
abline(lm.time.slope)

The slopes increase from January to December. This means that the investment of energy in reproduction increases throughout the year, likely due to changes in environmental conditions that make reproduction more successful in later months.