Tutorial 7: Multiple regression in practice

March 7th, 2023 (7th week of classes)
Multiple regression in practice

Remember: The tutorials were produced in such a way that you should be able to easily adapt the codes to your own data!

Understanding estimation in regression analysis - inference based on samples from a population regression model:

Let’s generate data under the assumptions underlying the multiple linear regression model. Here we will use a really large sample size. You will notice how close the estimates are from the population (true) model.

Let’s assume the following model as the population (true) model:


Let’s set the model coefficients and generate the population data; note how different the means and the standard deviations are for the predictors X1 and X2. You will see that that the mean and variance values do not affect estimates of the model:

n <- 1000000
constant <- 42
slope_X1 <- 2.3
slope_X2 <- 11.0
X1 <- rnorm(n,1000,10) 
X2 <- rnorm(n,40,4)
error <- rnorm(n,0,10)

Generate the response variable Y as a function of the parameters (constant and slopes) and their associated predictors.

Y <- constant + slope_X1*X1 + slope_X2*X2 + error

Fit the model:


Note how close the estimates are to the real model generated (i.e., without error). This is because the sample size is huge (i.e., n = 1000000)

Now let’s generate a much smaller sample size:

n <- 30
X1 <- rnorm(n,1000,10) 
X2 <- rnorm(n,40,4)
error <- rnorm(n,0,10)
Y <- constant + slope_X1*X1 + slope_X2*X2 + error

Note the difference between the coefficient estimates and real parameter values. As sample size reduces, sampling error will be greater, i.e., the sample estimates of the coefficients will vary more around the parameter values for the true model. This can be easily observed by the fact that the estimates based on n = 1000000 observations are much closer to the true models than the estimates based on n = 30.

Finally, observe what happens when we add an irrelevant predictor:

X3 <- rnorm(n,10,4) 

Note that the contribution of X3 is much smaller than the contribution of the other predictors; as it should be. And its expectation (i.e., average over large number of samples) is zero. In other words, if you repeat the model above for 1000s of samples, the average slope of X3 will be zero; whereas for the other estimates, their averages over 1000s of samples will be equal to their population values, i.e., 2.3 and 11.0 for X1 and X2, respectively.

US Polution - complete regression analysis

Here we will use a data set on air pollution across US cities. The goal here will be to fit a multiple regression model to understand the drivers of SO2 concentration (a measure of pollution) and make predictions that can be extended to other cities based on the same set of predictors.

Download the Population data

Let’s load the data:

data.pollution <- read.csv("USairpollution.csv")

The data consist of the following variables:

City: City
SO2: Sulfur dioxide content of air in micrograms per cubic meter
temp: Average annual temperature in degrees Fahrenheit
manu: Number of manufacturing enterprises employing 20 or more workers
popul: Population size in thousands from the 1970 census
wind: Average annual wind speed in miles per hour
precip: Average annual precipitation in inches
raindays: Average number of days with precipitation per year 

One should always start by looking at the general characteristics of the data. Note that the name of the city (column 1) is not of interest here, so we can suppress it from the data, i.e., by using [-1] below. The function summary generates summary statistics that can be helpful to characterize the data:


The package GGally has an interesting function that facilitates producing more detailed data summaries: pairwise plots between variables (response and predictors), correlation values between predictors and histograms for each variable:


Note that some variables are quite skewed; we will come back to this issue later on in the analysis.

Let’s start by estimating the model. You will need to look at the names of the variables. It is rather tedious write all the variable names as we do next:

fit <- lm(SO2 ~ temp + manu + popul + wind + precip + raindays , data=data.pollution)

To facilitate the model bulding, we can consider all the variables in the data set by entering response ~ .; but remember that city name is not supposed to be in the model, so we can remove it by SO2 ~ .-city; if you would like to not consider other variables (e.g. rain) in the model, then you would enter instead SO2 ~ .-city-temp:

fit <- lm(SO2 ~ .-city, data=data.pollution)

Let’s start by assessing whether the assumptions underlying multiple regressions are met:

  1. Residual normality:

Observations (cities) 32 and 33 seem somewhat off from what is expected under normality. We can more formally assess normality statistically using the Shapiro’s test:


The null hypothesis of normality is rejected (p-value = 0.008535 <= alpha of 0.05), indicating that model residuals cannot be assumed as normal. Let’s start by looking at the distribution of the response variable (SO2):


The asymmetry in the response (SO2) could easily be the reason why the residuals are not normal.

Often, a log transformation can help out in making the data more symmetric:


It looks a bit better. Let’s retest the normality of residuals by log transforming SO2:

fit.log <- lm(log(SO2) ~ .-city, data=data.pollution)

Looks great now. We can reasses it using the Shapiro test:


Indeed the distribution of residuals now meet the normality assumption.

Let’s assess the homoscedastic assumption of residuals:


The line is not very straight and let’s assess the homoscedasticity using a formal test:


Ok, we don’t reject the null hypothesis of residual homoscedasticity; we are good to go!

Let’s take a look again into the model again:


The table for the regression model indicates that the entire regression model is significant (F=10.72; P=0.000001126). The coefficient of determination (R2) is 0.6541. Note that the R2 can be calculated in different ways (As far as I know, there are at least 13 ways). One simple way is the squared correlation between predicted and observed values:


The R2 is a biased estimate of the population R2 and adjusted values should be used instead. From the summary(model) we find the Adjusted R2 as 0.5931. Look at the bottom of the output:


Let’s take a look just into the regression coefficients:


Notice that the size of the coefficients (how big positive or negative) does not relate to the statistical significance (i.e., probability of rejection of their slopes via t-tests. For instance, the slope for precipitation is 0.0173723 and for manufacturing is 0.0012639. As such, the slope for precipitation is greater than manufacturing, but it is not significant, while manufacturing is significant. This is because slopes are scaled in relation to their units in relation to the response variable SO2. You will see just below how the data should be transformed in a way that slopes are directly comparable among each other. The original coefficients are, however, necessary for prediction, i.e., one should predict in the original units of SO2 (in logged values here because of the log transformation of SO2 early on).

Two useful commands are:

fitted(fit.log) # estimate predicted values
residuals(fit.log) # estimate residuals

Let’s now standardize the data so that coefficients are comparable with one another (i.e., large values (negative or positive) reflect their relative importance and not the original units and variation of their respective variables). We will use the function scale so that the mean of all variables (response and predictors) will be in a common scale (i.e., each variable will have a mean of zero and a variance of 1 after standardization). The standardized data have to placed into a data frame to run the regression model:

standardized.data <- data.frame(scale(data.pollution[-1]))

The log transformation has to be performed prior to standardizing SO2. The easiest is to do it separately:

log.SO2 <- scale(log(data.pollution$SO2))

Now we need to replace the SO2 values in the standardized data from the standardized log SO2:

standardized.data$SO2 <- log.SO2 

Now let’s re-estimate the regression model; note that city was removed above [-1] so we don’t need to remove it from the model this time.

fit.standardized <- lm(SO2~ ., data=standardized.data)

Note now that the standardized coefficients (slopes) for manufacturing (1.1014) is now much larger than the standardized slope of precipitation (0.2912), indicating their contribution (importance) in predicting SO2 concentration. Note (obviously) that the transformation does not affect the R2 or the ANOVA results for the overall significance of the model.

Let’s produce the plot of predicted values against observed values. Note that we should use the model with the log-transformed data (i.e., non-standardized). We will use native R instead of ggplot2; note, though, that you could adapt ggplot code for scatter plots for this plot here:

plot(predict(fit.log),log(data.pollution$SO2),las=1,cex.axis=1,col="firebrick",pch=16,cex=2,ylab="Observed SO2",xlab="Predicted S02")

Note though, that the “2” in SO is not in a subscript format. If you want to get it “perfect”, we can use the bquote function as bellow. This function can also be used in ggplot.

plot(predict(fit.log),log(data.pollution$SO2),las=1,cex.axis=1,col="firebrick",pch=16,cex=2,ylab=bquote("Observed" ~ S0[2]),xlab=bquote("Predicted" ~ S0[2]))

More examples on how to write math symbols in R plots and ggplots can be found in: https://astrostatistics.psu.edu/su07/R/html/grDevices/html/plotmath.html https://trinkerrstuff.wordpress.com/2018/03/15/2246/

Note how well the observed values are to the predicted values, indicating a reasonably good fit of the regression model.

Let’s calculate the relative contribution of each predictor to the model (i.e., variation partitioning). Here we also need to use the standardized model. First, let’s get the adjusted R2 value for the whole model (i.e., with all predictors):

R2.adj.total <- summary(fit.standardized)$adj.r.squared

To estimate the contribution of temperature, we need to calculate the model without temperature first:

fit.scaled.minus.temp <- lm(SO2~ .-temp, data=standardized.data)
R2.adj.temp <- summary(fit.scaled.minus.temp)$adj.r.squared

Then the relative contribution of temperature is estimated via subtraction from the full model minus the model without temperature:

R2.adj.total - R2.adj.temp

The relative contribution of temperature to the prediction of SO2 in this regression model is 10.37%. I created a function that allows to calculate (estimate) the contribution of each predictor independent of one another in a more automathic way:

Download the unique contribution function


In the function, you enter Y and X separately. SO2 is the first variable in standardized.data and the other predictors are from column 2 to column 7:

var.contr <- UniqueContribution(standardized.data[1],standardized.data[,2:7])

The variable contributions are directly related to the partial regression slopes. They can be interpreted as the relative contribution of each predictor in explaining the total variation of SO2. Note that the relative contribution of number of rain days is negative (-0.0115). This is possible though it deserves a much longer mathematical explanation. For now, an easy way to interpret negative contributions is to state that the contribution associated to their predictors are nil (i.e., zero).

Note the sum of the individual contributions:


It is 28.9% which is smaller than the total variation explained by the entire model (R2.adj.total = 59.1%). The reason is that the predictors overlap quite a bit in their contributions to SO2. To calculate these shared variations we would need a far more complex function. The package MBESS conducts variation partitioning (known in the statistical literature as commonality analysis). The outputs are quite complex and we won’t cover the package here. A paper and examples using the package can be found in: https://pdfs.semanticscholar.org/ba86/79cfce154f566d3bef504f286f6d8bc3b3cf.pdf

Finally, let’s predict what is the expected SO2 concentration for two cities with the following charateristics (e.g., the first city has a temp of 44.5F and the second city a temperature of 51.2F, etc):

new.cities <- data.frame(city=as.character(c("A","B")),temp=c(44.5,51.2),manu=c(1232.0,1456.0),popul = c(1567.0,1242.0), wind = c(8.4,9.3), precip = c(12.4,10.4), raindays = c(58.7,62.3))

Look at the predicted values:


Let’s predict the SO2 concentration of the two cities based on these values. Note that we use the log-transformed model and not the standardized log-transformed one (i.e., we are interested in predictions and not comparisons of predictors). But first, we need to remove the variable city from the data and refit the model. Even though the variable city was not included in the model, the function lm still considers it its object (i.e., fit.log). As such, we need to remove the variable city and this can be done as:

fit.log=update(fit.log, . ~ . -city) 

We can then proceed with the prediction of SO2 for the two cities of interest:

predict(fit.log, newdata = new.cities)

You can rescale these to the original (non-logged) form but there are some debate whether this is appropriate or not in regressions when responses were logged. Some prefer to leave predicted values in the log form.

exp(predict(fit.log, newdata = new.cities))