Tutorial 7: Multiple regression in practice

February 24, 2026
Multiple regression in practice

CRITICAL: Regular practice with R and RStudio, the statistical software used in BIOL 322 and introduced during tutorial sessions, and consistent engagement with tutorial exercises are essential for developing strong skills in Biostatistics.

EXERCISES: Tutorial reports are independent exercises and are not submitted for grading; however, students are strongly encouraged to complete them. Some tutorials include solutions at the end to support self-assessment and review. Other tutorials do not require model answers because they are procedural and can be easily self-assessed by verifying that the code runs correctly and produces the expected type of output.

Understanding estimation in regression analysis - inference based on samples from a population regression model:

Let’s generate data under the assumptions underlying the multiple linear regression model. Here we will use a very large sample size. You will notice how close the estimates are to the population (true) model.

Let’s assume the following model as the population (true) model:

\[Y=42+2.3X_1+11.0X_2+e\]

Let’s set the model coefficients and generate the population data; note how different the means and standard deviations are for the predictors X1 and X2. You will see that the mean and variance values do not affect the estimates of the model:

n <- 1000000
constant <- 42
slope_X1 <- 2.3
slope_X2 <- 11.0
X1 <- rnorm(n,1000,10) 
X2 <- rnorm(n,40,4)
error <- rnorm(n,0,10)

Generate the response variable Y as a function of the parameters (constant and slopes) and their associated predictors.

Y <- constant + slope_X1*X1 + slope_X2*X2 + error

Fit the model:

lm(Y ~ X1 + X2)

Note how close the estimates are to the real model (i.e., the model generated without error). This is because the sample size is huge (i.e., n = 1000000).

Now let’s generate a much smaller sample size:

n <- 30
X1 <- rnorm(n,1000,10) 
X2 <- rnorm(n,40,4)
error <- rnorm(n,0,10)
Y <- constant + slope_X1*X1 + slope_X2*X2 + error
lm(Y ~ X1 + X2)

Note the difference between the coefficient estimates and the real parameter values. As sample size decreases, sampling error increases, i.e., the sample estimates of the coefficients vary more around the parameter values of the true model. This is easily observed in the fact that the estimates based on n = 1000000 observations are much closer to the true model than the estimates based on n = 30.
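The effect of sample size on sampling error can be checked with a quick simulation (a sketch using the same population model as above; the number of replicates is arbitrary):

```r
# Sketch: repeat the estimation many times at two sample sizes and compare
# the spread of the slope estimates (values mirror the population model above).
set.seed(1)
constant <- 42; slope_X1 <- 2.3; slope_X2 <- 11.0
one_fit <- function(n) {
  X1 <- rnorm(n, 1000, 10)
  X2 <- rnorm(n, 40, 4)
  Y  <- constant + slope_X1 * X1 + slope_X2 * X2 + rnorm(n, 0, 10)
  coef(lm(Y ~ X1 + X2))["X1"]
}
slopes_small <- replicate(1000, one_fit(30))
slopes_large <- replicate(1000, one_fit(3000))
sd(slopes_small)  # larger spread around 2.3
sd(slopes_large)  # much smaller spread around 2.3
```

The spread of the slope estimates shrinks roughly with the square root of the sample size.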

Finally, observe what happens when we add an irrelevant predictor:

X3 <- rnorm(n,10,4) 
lm(Y ~ X1 + X2 + X3)

Note that the contribution of X3 is much smaller than the contributions of the other predictors, as it should be, and its expectation (i.e., its average over a large number of samples) is zero. In other words, if you repeated the model above for thousands of samples, the average slope of X3 would be zero, whereas the averages of the other estimates over thousands of samples would equal their population values, i.e., 2.3 and 11.0 for X1 and X2, respectively.
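This expectation can be verified empirically with a short simulation (a sketch; the number of replicates is arbitrary):

```r
# Sketch: the slope of an irrelevant predictor averages out to ~0
# over many samples drawn from the same population model as above.
set.seed(2)
constant <- 42; slope_X1 <- 2.3; slope_X2 <- 11.0
slope_X3_hat <- replicate(1000, {
  n  <- 30
  X1 <- rnorm(n, 1000, 10); X2 <- rnorm(n, 40, 4); X3 <- rnorm(n, 10, 4)
  Y  <- constant + slope_X1 * X1 + slope_X2 * X2 + rnorm(n, 0, 10)
  coef(lm(Y ~ X1 + X2 + X3))["X3"]
})
mean(slope_X3_hat)  # close to zero
```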

US Pollution - complete regression analysis

Here we will use a data set on air pollution across US cities. The goal here will be to fit a multiple regression model to understand the drivers of SO2 concentration (a measure of pollution) and make predictions that can be extended to other cities based on the same set of predictors.

Download the pollution data

Let’s load the data:

data.pollution <- read.csv("USairpollution.csv")
View(data.pollution)

The data consist of the following variables:

City: City
SO2: Sulfur dioxide content of air in micrograms per cubic meter
temp: Average annual temperature in degrees Fahrenheit
manu: Number of manufacturing enterprises employing 20 or more workers
popul: Population size in thousands from the 1970 census
wind: Average annual wind speed in miles per hour
precip: Average annual precipitation in inches
raindays: Average number of days with precipitation per year 

One should always start by looking at the general characteristics of the data. Note that the name of the city (column 1) is not of interest here, so we can drop it from the data by using [-1] below. The function summary generates summary statistics that help characterize the data:

summary(data.pollution[-1])

The package GGally has an interesting function that facilitates producing more detailed data summaries: pairwise plots between variables (response and predictors), correlation values between predictors and histograms for each variable:

install.packages("GGally")
library(GGally)
ggpairs(data.pollution[-1])

Note that some variables are quite skewed; we will come back to this issue later on in the analysis.

Let’s start by estimating the model. You will need to look at the names of the variables. It is rather tedious to write out all the variable names, as we do next:

colnames(data.pollution)
fit <- lm(SO2 ~ temp + manu + popul + wind + precip + raindays , data=data.pollution)

To facilitate the model building, we can include all the variables in the data set by entering response ~ .; but remember that the city name is not supposed to be in the model, so we remove it with SO2 ~ .-city. If you wanted to exclude other variables (e.g., rain) from the model as well, you would enter SO2 ~ .-city-rain instead:

fit <- lm(SO2 ~ .-city, data=data.pollution)
fit

Let’s start by assessing whether the assumptions underlying multiple regression are met:

  1. Residual normality:
plot(fit,which=2)

Observations (cities) 32 and 33 seem somewhat off from what is expected under normality. We can assess normality more formally using the Shapiro-Wilk test:

shapiro.test(residuals(fit))

The null hypothesis of normality is rejected (p-value = 0.008535 <= alpha of 0.05), indicating that the model residuals cannot be assumed to be normal. Let’s start by looking at the distribution of the response variable (SO2):

hist(data.pollution$SO2)

Normality affects inference (t-tests, F-tests), not unbiasedness of the slopes. With moderate sample sizes, regression is quite robust to mild non-normality.
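A quick simulation illustrates this point (a sketch with deliberately skewed errors; the true slope of 2 is an arbitrary choice):

```r
# Sketch: with skewed (non-normal) errors, slope estimates are still
# centered on the true value; only the usual t/F inference is affected.
set.seed(3)
true_slope <- 2
slope_hat <- replicate(2000, {
  x <- rnorm(50)
  e <- rexp(50, rate = 1) - 1   # strongly skewed errors with mean 0
  y <- 1 + true_slope * x + e
  coef(lm(y ~ x))["x"]
})
mean(slope_hat)  # close to 2 despite the non-normal errors
```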

The asymmetry in the response (SO2) could easily be the reason why the residuals are not normal.

Often, a log transformation helps make the data more symmetric:

hist(log(data.pollution$SO2))

It looks a bit better. Let’s retest the normality of the residuals after log-transforming SO2:

fit.log <- lm(log(SO2) ~ .-city, data=data.pollution)
plot(fit.log,which=2)

Looks great now. We can reassess it using the Shapiro test:

shapiro.test(residuals(fit.log))

Indeed, the distribution of residuals now meets the normality assumption.

Let’s assess the homoscedasticity assumption of the residuals:

plot(fit.log,which=3)

The line is not very straight, so let’s assess homoscedasticity using a formal test:

library(lmtest)
bptest(fit.log)

Ok, we don’t reject the null hypothesis of residual homoscedasticity.

We should also check the amount of collinearity in the model. Collinearity occurs when two or more predictors in a multiple regression model are strongly correlated with each other. When this happens, the model has difficulty separating their individual effects. In other words, the regression is trying to estimate the unique contribution of each predictor, but the predictors contain overlapping information.

Collinearity does not bias the estimated slopes (as we have seen in class). The coefficients remain unbiased if the model assumptions hold. The problem is reduced precision. When predictors are highly correlated, the standard errors of their slopes increase, confidence intervals become wider, and statistical significance may disappear even if the variables are truly related to the response.

A useful diagnostic for collinearity is the Variance Inflation Factor (VIF). For any given predictor \(X_j\), the VIF is:

\(\text{VIF}_j = \frac{1}{1 - R_j^2}\)

where \(R_j^2\) is obtained by regressing \(X_j\) on all the other predictors. If \(X_j\) is weakly related to the other predictors, \(R_j^2\) is small and the VIF is close to 1. If \(X_j\) is highly predictable from the others, \(R_j^2\) is large and the VIF increases. A large VIF means that the variance of the estimated slope is inflated because of collinearity.

As a rough guideline, VIF values above 5 suggest moderate collinearity and values above 10 suggest serious collinearity. The key idea is that collinearity affects precision and interpretation, not overall model fit.
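Before using a packaged function, the VIF formula can be illustrated by hand with two simulated predictors (a sketch; with the pollution data you would regress, e.g., manu on the remaining predictors in the same way):

```r
# Sketch: computing a VIF "by hand" from the auxiliary regression.
set.seed(4)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, 0, sqrt(1 - 0.9^2))  # strongly correlated with x1
R2_aux  <- summary(lm(x1 ~ x2))$r.squared      # R^2 of x1 regressed on the others
VIF_x1  <- 1 / (1 - R2_aux)
VIF_x1  # well above 1 because x1 is highly predictable from x2
```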

library(car)
vif(fit.log)

VIF > 5 or 10 indicates collinearity concerns; in this case, the number of manufacturing enterprises employing 20 or more workers (manu) and population size in thousands from the 1970 census (popul) seem problematic. This strong correlation between the two variables is not surprising.

Let’s take another look at the model:

summary(fit.log)

The summary table indicates that the overall regression model is significant (F=10.72; P=0.000001126). The coefficient of determination (R2) is 0.6541. Note that the R2 can be calculated in different ways (as far as I know, there are at least 13 ways). One simple way is the squared correlation between predicted and observed values:

cor(log(data.pollution$SO2),fit.log$fitted.values)^2

The R2 is a biased estimate of the population R2, and adjusted values should be used instead. From summary(fit.log) we find the adjusted R2 to be 0.5931. Look at the bottom of the output:

summary(fit.log)
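The adjusted value follows a simple formula, R2.adj = 1 - (1 - R2)(n - 1)/(n - p - 1), where p is the number of predictors. A sketch with simulated data (sample size and coefficients are arbitrary):

```r
# Sketch: reproducing the adjusted R2 reported by summary() from the formula
#   adj R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
set.seed(5)
n <- 41; p <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + 0.5 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
R2  <- summary(fit)$r.squared
adj <- 1 - (1 - R2) * (n - 1) / (n - p - 1)
all.equal(adj, summary(fit)$adj.r.squared)  # TRUE
```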

Let’s take a look just into the regression coefficients:

coefficients(fit.log) 

Notice that the size of the coefficients (how large, positive or negative) does not relate to their statistical significance (i.e., the probability of rejecting the null hypothesis for their slopes via t-tests). For instance, the slope for precipitation is 0.0173723 and the slope for manufacturing is 0.0012639. The slope for precipitation is greater than the one for manufacturing, yet precipitation is not significant while manufacturing is. This is because the slopes are scaled in the units of their predictors relative to the response variable SO2. You will see just below how the data can be transformed so that slopes are directly comparable among each other. The original coefficients are, however, necessary for prediction, i.e., one should predict in the original units of SO2 (logged values here, because of the log transformation of SO2 earlier on).
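The link between raw and standardized slopes is simply a rescaling by sd(x)/sd(y). A sketch with simulated data (variable names are illustrative):

```r
# Sketch: a standardized slope equals the raw slope times sd(x)/sd(y).
set.seed(6)
x <- rnorm(100, 50, 8)
y <- 3 + 0.2 * x + rnorm(100)
b_raw <- coef(lm(y ~ x))["x"]
b_std <- coef(lm(scale(y) ~ scale(x)))[2]
all.equal(unname(b_raw * sd(x) / sd(y)), unname(b_std))  # TRUE
```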

Two useful commands are:

fitted(fit.log) # estimate predicted values
residuals(fit.log) # estimate residuals

Let’s now standardize the data so that the coefficients are comparable with one another (i.e., large values, negative or positive, reflect their relative importance rather than the original units and variation of their respective variables). We will use the function scale so that all variables (response and predictors) are on a common scale (i.e., each variable will have a mean of zero and a variance of 1 after standardization). The standardized data have to be placed into a data frame to run the regression model:

standardized.data <- data.frame(scale(data.pollution[-1]))

The log transformation has to be performed prior to standardizing SO2. The easiest is to do it separately:

log.SO2 <- scale(log(data.pollution$SO2))

Now we need to replace the SO2 values in the standardized data from the standardized log SO2:

standardized.data$SO2 <- log.SO2 

Now let’s re-estimate the regression model; note that city was removed above [-1] so we don’t need to remove it from the model this time.

fit.standardized <- lm(SO2~ ., data=standardized.data)
summary(fit.standardized)

Note that the standardized coefficient (slope) for manufacturing (1.1014) is now much larger than the standardized slope for precipitation (0.2912), reflecting their relative contributions (importance) in predicting SO2 concentration. Note (obviously) that the transformation does not affect the R2 or the ANOVA results for the overall significance of the model.

Let’s produce the plot of predicted values against observed values. Note that we should use the model with the log-transformed data (i.e., non-standardized). We will use native R instead of ggplot2; note, though, that you could adapt ggplot code for scatter plots for this plot here:

plot(predict(fit.log),log(data.pollution$SO2),las=1,cex.axis=1,col="firebrick",pch=16,cex=2,ylab="Observed SO2",xlab="Predicted SO2")
abline(a=0,b=1,lwd=3)

Note, though, that the “2” in SO2 is not in subscript format. If you want to get it “perfect”, we can use the bquote function as below. This function can also be used in ggplot.

plot(predict(fit.log),log(data.pollution$SO2),las=1,cex.axis=1,col="firebrick",pch=16,cex=2,ylab=bquote("Observed" ~ SO[2]),xlab=bquote("Predicted" ~ SO[2]))
abline(a=0,b=1,lwd=3)

More examples on how to write math symbols in R plots and ggplots can be found in: https://astrostatistics.psu.edu/su07/R/html/grDevices/html/plotmath.html https://trinkerrstuff.wordpress.com/2018/03/15/2246/

Note how close the observed values are to the predicted values, indicating a reasonably good fit of the regression model.

Let’s calculate the relative contribution of each predictor to the model (i.e., variation partitioning). Here we also need to use the standardized model. First, let’s get the adjusted R2 value for the whole model (i.e., with all predictors):

R2.adj.total <- summary(fit.standardized)$adj.r.squared
R2.adj.total

To estimate the contribution of temperature, we need to calculate the model without temperature first:

fit.scaled.minus.temp <- lm(SO2~ .-temp, data=standardized.data)
R2.adj.temp <- summary(fit.scaled.minus.temp)$adj.r.squared
R2.adj.temp

Then the relative contribution of temperature is estimated via subtraction from the full model minus the model without temperature:

R2.adj.total - R2.adj.temp

The relative contribution of temperature to the prediction of SO2 in this regression model is 10.37%. I created a function that allows us to calculate (estimate) the contribution of each predictor independently of the others in a more automated way:

Download the unique contribution function

source("UniqueContribution.R")

In the function, you enter Y and X separately. SO2 is the first variable in standardized.data and the other predictors are from column 2 to column 7:

var.contr <- UniqueContribution(standardized.data[1],standardized.data[,2:7])
var.contr

The variable contributions are directly related to the partial regression slopes. They can be interpreted as the relative contribution of each predictor in explaining the total variation of SO2. Note that the relative contribution of number of rain days is negative (-0.0115). This is possible, though it deserves a much longer mathematical explanation. For now, an easy way to interpret negative contributions is to state that the contribution associated with such predictors is nil (i.e., zero).
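For reference, the drop-one logic behind such unique contributions can be sketched as follows (an illustrative re-implementation in the spirit of UniqueContribution.R, not its actual code; the toy data are simulated):

```r
# Sketch: unique contribution of each predictor = adjusted R2 of the full
# model minus the adjusted R2 of the model without that predictor.
unique_contrib <- function(response, predictors) {
  full    <- lm(response ~ ., data = predictors)
  R2_full <- summary(full)$adj.r.squared
  sapply(names(predictors), function(v) {
    keep    <- setdiff(names(predictors), v)
    reduced <- lm(response ~ ., data = predictors[, keep, drop = FALSE])
    R2_full - summary(reduced)$adj.r.squared
  })
}

# Toy illustration with two simulated predictors:
set.seed(7)
X  <- data.frame(a = rnorm(60), b = rnorm(60))
y  <- X$a + 0.3 * X$b + rnorm(60)
uc <- unique_contrib(y, X)
uc
```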

Note the sum of the individual contributions:

sum(var.contr)

It is 28.9%, which is smaller than the total variation explained by the entire model (R2.adj.total = 59.1%). The reason is that the predictors overlap quite a bit in their contributions to SO2. To calculate these shared variations we would need a more complex function. The package rdacca.hp conducts variation partitioning (known in the statistical literature as commonality analysis or hierarchical partitioning). The paper and examples using the package can be found at: https://besjournals.onlinelibrary.wiley.com/doi/abs/10.1111/2041-210X.13800

Finally, let’s predict what is the expected SO2 concentration for two cities with the following characteristics (e.g., the first city has a temp of 44.5F and the second city a temperature of 51.2F, etc):

new.cities <- data.frame(city=as.character(c("A","B")),temp=c(44.5,51.2),manu=c(1232.0,1456.0),popul = c(1567.0,1242.0), wind = c(8.4,9.3), precip = c(12.4,10.4), raindays = c(58.7,62.3))

Look at the predictor values for the new cities:

new.cities

Let’s predict the SO2 concentration of the two cities based on these values. Note that we use the log-transformed model, not the standardized log-transformed one (i.e., we are interested in predictions, not comparisons of predictors). But first, we need to remove the variable city from the model. Even though city was not included as a predictor, the lm object (fit.log) still keeps track of it. We can remove it as follows:

fit.log <- update(fit.log, . ~ . -city)

We can then proceed with the prediction of SO2 for the two cities of interest:

predict(fit.log, newdata = new.cities)

As discussed in class, predictions are reliable only within the range of observed predictor values. Extrapolation beyond observed ranges can be highly unreliable.

You can back-transform predictions to the original (non-logged) scale, but there is ongoing debate about whether this is always appropriate when the model was fitted on the log scale. Because the log transformation changes the error structure, simply exponentiating fitted values can introduce bias. For this reason, some analysts prefer to interpret and present results on the log scale, especially when the model assumptions are more naturally satisfied there.

exp(predict(fit.log, newdata = new.cities))
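One commonly cited correction, valid when the errors on the log scale are approximately normal, multiplies the naive back-transform by exp(sigma^2/2). A sketch with simulated data (not applied to the pollution model; whether to use such corrections is part of the debate mentioned above):

```r
# Sketch: naive exponentiation of log-scale fits underestimates the mean
# on the original scale; exp(sigma^2/2) is the standard lognormal correction.
set.seed(8)
x     <- rnorm(500)
log_y <- 1 + 0.5 * x + rnorm(500, 0, 0.8)
y     <- exp(log_y)
fit   <- lm(log(y) ~ x)
s2    <- summary(fit)$sigma^2
naive     <- exp(fitted(fit))           # biased low as an estimate of the mean
corrected <- exp(fitted(fit) + s2 / 2)  # lognormal bias correction
c(mean(y), mean(naive), mean(corrected))  # corrected tracks mean(y) better
```

With the pollution model you would use summary(fit.log)$sigma^2 in the same way, if you decide a correction is appropriate.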

Exercise

Stability of Regression Coefficients and the Role of Collinearity

In this exercise, you will study how stable regression coefficients are when the data are repeatedly sampled from the same population, and how collinearity affects their precision.

Using the log-transformed model (fit.log) as your reference model:

  1. Write a short loop that repeatedly (e.g., 500 times) samples 30 cities with replacement from the US pollution data.
  2. For each sample, fit the model \(\log(SO_2) \sim temp + manu + popul + wind + precip + raindays\).
  3. Store the estimated slopes for manu and popul.
  4. Plot histograms of the estimated coefficients across the 500 samples.
  5. Compute the mean and standard deviation of the estimated slopes.

Then answer the following conceptual questions:

  • Are the average slopes across repeated samples close to the original estimates?
  • Which variable shows more variability in its slope estimates?
  • Compare their VIF values from the full dataset. Does higher collinearity appear associated with greater variability in slope estimates?
  • Does the overall model R^2 vary as much as the individual slopes?

Finally, reflect on this: if two predictors are moderately correlated but both appear important in the full dataset, how confident should we be about interpreting their individual slopes?

In this exercise, the bootstrap (resampling) is used to assess the stability of regression coefficients. We only have one dataset, but regression is about inference from a sample to a population. By repeatedly sampling 30 cities with replacement and refitting the model each time, we mimic the idea of taking many similar samples from the same population. Bootstrap is a widely used technique in Biology for many purposes, including assessing stability of estimators.

The distribution of slopes across these resamples represents the sampling variability of the coefficients. If the histogram is narrow and centered away from zero, the estimate is stable. If it is wide or frequently crosses zero, the estimate is sensitive to sampling variation.

Here, the bootstrap helps us visualize uncertainty and understand how much our conclusions depend on the particular sample we observed.

Answer:

# Data
data.pollution <- read.csv("USairpollution.csv")

# Fit model on full dataset (log-scale)
fit.log <- lm(log(SO2) ~ temp + manu + popul + wind + precip + raindays,
              data = data.pollution)

summary(fit.log)

## VIF from full dataset (diagnostic for collinearity)
library(car)
vif_full <- vif(fit.log)
vif_full

# Resampling settings
B <- 500     # number of bootstrap samples
nboot <- 30  # sample size per bootstrap dataset

## Store values
b_manu  <- numeric(B)
b_popul <- numeric(B)
r2_boot <- numeric(B)

# Bootstrap loop (sample cities with replacement) 
for (b in 1:B) {

  idx <- sample(1:nrow(data.pollution), size = nboot, replace = TRUE)
  dat_b <- data.pollution[idx, ]

  fit_b <- lm(log(SO2) ~ temp + manu + popul + wind + precip + raindays,
              data = dat_b)

  coefs <- coef(fit_b)
  b_manu[b]  <- coefs["manu"]
  b_popul[b] <- coefs["popul"]

  r2_boot[b] <- summary(fit_b)$r.squared
}

# Summaries
cat("\nReference slopes (full dataset):\n")
print(coef(fit.log)[c("manu", "popul")])

cat("\nBootstrap slope summaries (B =", B, ", n =", nboot, "):\n")
cat("manu:  mean =", round(mean(b_manu), 4),
    " SD =", round(sd(b_manu), 4), "\n")
cat("popul: mean =", round(mean(b_popul), 4),
    " SD =", round(sd(b_popul), 4), "\n")

cat("\nBootstrap R^2 summary:\n")
cat("R^2: mean =", round(mean(r2_boot), 3),
    " SD =", round(sd(r2_boot), 3), "\n")

# Plots
op <- par(no.readonly = TRUE)
par(mfrow = c(1, 3), mar = c(4, 4, 3, 1))

hist(b_manu,  breaks = 25, main = "Bootstrap slopes: manu",
     xlab = "Estimated slope (manu)")
abline(v = coef(fit.log)["manu"], lwd = 2, lty = 2)

hist(b_popul, breaks = 25, main = "Bootstrap slopes: popul",
     xlab = "Estimated slope (popul)")
abline(v = coef(fit.log)["popul"], lwd = 2, lty = 2)

hist(r2_boot, breaks = 25, main = expression("Bootstrap " * R^2),
     xlab = expression(R^2))
abline(v = summary(fit.log)$r.squared, lwd = 2, lty = 2)

par(op)

# show VIFs for just manu and popul
cat("\nVIFs (full dataset):\n")
print(vif_full[c("manu", "popul")])

Important points:

  1. What do the histograms show?

The dashed vertical line in each histogram is the slope estimate from the full dataset (your “reference” fit). The histogram shows how much that estimate moves around when you only have n = 30 cities.

Are the average slopes across repeated samples close to the original estimates? Usually yes. Over many resamples, the mean of the bootstrap slope estimates should be near the full-data estimate (and, conceptually, near the population value). Any small mismatch is just Monte Carlo noise plus the fact that you’re resampling from a finite dataset.

  2. Comparing variability of the slopes:

Which variable shows more variability in its slope estimates?

manu has the larger bootstrap SD (printed by the code). In many runs of this dataset, it shows larger spread because it is more collinear with the other predictors.

Now compare their VIF values from the full dataset. Does higher collinearity appear associated with greater variability in slope estimates? Typically yes.

A higher VIF means that the predictor is more predictable from the others, so its “unique” information is smaller. This inflates the standard error and makes the slope estimates jump around more across samples. You can check this empirically by comparing VIF(manu) vs VIF(popul) and SD(β̂_manu) vs SD(β̂_popul).

  3. Stability of \(R^2\) versus individual slopes

Does the overall model \(R^2\) vary as much as the individual slopes? \(R^2\) often looks relatively stable compared to the individual coefficients. This is the key lesson here: the model can predict reasonably consistently, while the allocation of credit among correlated predictors is unstable. As such, prediction can remain stable even when interpretation becomes unstable.

  4. Interpreting slopes under collinearity

If two predictors are moderately correlated (collinear) but both appear important in the full dataset, how confident should we be about interpreting their individual slopes? Be cautious.

Collinearity does not necessarily hurt prediction, but it can make individual slopes hard to interpret because the model is trying to separate effects that overlap. If slopes change sign across bootstrap samples, or standard errors are very large, it becomes difficult to make credible statements about individual effects.

If the goal is prediction, this may be acceptable. If the goal is interpretation, i.e., understanding the unique contribution of each predictor, instability matters much more.

  5. Conceptual distinction versus redundancy among predictors

Do the two predictors (manu and popul) represent something conceptually distinct? If both variables measure “urbanization” (e.g., population size and number of manufacturing enterprises), then keeping both may not be necessary. But if they represent distinct mechanisms, removing one may oversimplify the biology. Statistical instability should trigger scientific reflection, not automatic deletion.

  6. When is instability concerning?

Instability becomes concerning when: (1) The bootstrap distribution frequently changes sign; (2) Confidence intervals often include zero; (3) VIF is large; and (4) Small perturbations in the data change conclusions.

That is when interpretation becomes unreliable. If instead the slope distribution is centered clearly away from zero, variability is moderate, and VIF is under 5, then instability is probably acceptable. In this dataset, the estimated slopes cross zero (going from positive to negative) quite a lot across resamples, indicating that their estimates are not reliable.
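The sign-crossing idea can be quantified directly (a sketch; you would apply sign_cross to the b_manu and b_popul vectors from the answer code above, the toy vector below is only for illustration):

```r
# Sketch: proportion of bootstrap slope estimates on the minority side of zero.
# Values near 0 mean a stable sign; values near 0.5 mean frequent sign changes.
sign_cross <- function(slopes) min(mean(slopes < 0), mean(slopes > 0))

toy <- c(-0.2, 0.1, 0.3, 0.5, -0.1, 0.4, 0.6, 0.2, 0.3, 0.1)
sign_cross(toy)  # 0.2: 20% of the estimates fall on the minority side
```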

  7. What can we do instead?

Instead of eliminating one of the two variables arbitrarily, consider combining them into a single “urbanization index”. For example, use Principal Component Analysis (PCA; we will see it later in the course) on popul and manu and retain the first principal component as an index of urbanization. Or fit two alternative models, one with manu and one with popul, and compare their interpretability.
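A sketch of the PCA idea with simulated stand-ins for manu and popul (with the real data you would pass data.pollution[, c("manu", "popul")] to prcomp in the same way):

```r
# Sketch: combining two strongly correlated predictors into a single index
# by keeping the first principal component.
set.seed(9)
manu  <- rnorm(41, 400, 300)
popul <- manu * 1.2 + rnorm(41, 0, 150)     # strongly correlated pair
pca   <- prcomp(cbind(manu, popul), scale. = TRUE)
urban_index <- pca$x[, 1]                   # first PC = "urbanization" score
summary(pca)$importance[2, 1]               # proportion of variance in PC1
```

The urbanization index (urban_index) could then replace both predictors in the regression model.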

  8. Correlated predictors and latent processes

Often, correlated predictors reflect a latent process. When we say that correlated predictors may reflect a latent process, we mean that the variables we observe are often indirect measurements of some underlying phenomenon that we are not measuring directly. In this case, both variables likely reflect the broader process of urbanization. Recognizing the latent process can help shift the interpretation from “Which variable is significant?” to “What underlying process are these variables capturing?”

  9. Further reading

For a broader discussion of methods to deal with collinearity, see: Dormann et al. (2013). Methods to deal with collinearity in ecological models. Ecography: https://nsojournals.onlinelibrary.wiley.com/doi/10.1111/j.1600-0587.2012.07348.x