Tutorial 11: Simple Linear Regression

Simple Linear Regression
Week of November 18, 2024
(11th week of classes)


How tutorials work

The DEADLINE for your report is always at the end of the tutorial.

The INSTRUCTIONS: the report is based on answering a couple of questions along the tutorial.

While students may eventually be able to complete their reports independently, we strongly recommend attending the synchronous lab/tutorial sessions. Please note that your TA is not responsible for providing assistance with lab reports outside of these scheduled sessions.

The REPORT INSTRUCTIONS (details on what is required to earn full marks for this report) are provided at the end of this tutorial.

Your TAs

Section 0101 (Tuesday): 13:15-16:00 - Aliénor Stahl ()
Section 0103 (Thursday): 13:15-16:00 - Alexandra Engler ()


Simple Regression

We will use the Lion data set involving the relationship between proportion of black spots on male Lions’ noses and their age. Managing the trophy hunting of African lions is an important part of maintaining viable lion populations. Knowing the ages of the male lions helps, because removing males older than six years has little impact on lion social structure, whereas taking younger males is more disruptive. Whitman et al. (2004) showed that the amount of black pigmentation on the nose of male lions increases as they get older and so might be used to estimate the age of unknown lions for trophy hunting purposes.

Download the Lion data file

Now upload and inspect the data:

Lion <- read.csv("LionNose.csv")
View(Lion)

Plot the data:

plot(Age~PropBlack,data=Lion,xlab="Proportion black spots",ylab="Age (years)", cex=1.5, pch=16,col="firebrick",las=1)

Let’s produce the regression model using the function lm; the function is used to fit linear models such as simple linear regression:

model.Lion <- lm(Age~PropBlack,data=Lion)
model.Lion

Plot the regression line making sure that the scatter plot produced earlier is still active (i.e., it is still showing in the plotting window). This is necessary so that we can plot the regression line on the early plot. The parameter lwd below controls the regression line width on the plot.

abline(model.Lion,col="blue",lwd=2)

Problem 1:

Download the bird latitude data and reproduce the scatter plot and regression line for the data. Produce all the code involving the necessary steps that are needed: from data uploading to regression plot.

Download the Bird data file

In your file identify problem 1:
# Problem 1: write your answer.
# continue your answer

Estimating predicted values

The object model.Lion produced by the lm function contains the predicted value for each value in the predictor variable. They can be listed using the following command:

model.Lion$fitted

The residual values can be listed as follows:

model.Lion$residuals

Predicting data for values that are not in the original predictor, however, requires a slightly different approach. Let’s say we are interested in predicting what is the expected lion age for one that has 28% and 33% of their noses covered by black spots? We have to first to create a data structure (data frame) containing these two values.

new.data <- data.frame(PropBlack = c(0.28,0.33))
predicted.values<-predict(model.Lion,new.data)
predicted.values

The predicted values for 28% and 33% are 3.86 and 4.39 years respectively. To plot these predicted values on the regression line, we can use the function points. Again make sure that your last graph is still active (appear on the screen):

points(c(0.28,0.33),c(3.86,4.39),col="green",cex=2,pch=16)

Problem 2:
2a) Based on these predicted values, should these two lions be considered for hunting? Why?

2b) Using the bird latitude data, produce the code necessary to predict the number of bird species at the following three latitudes: 38.1, 38.4 and 39.4.

2c) Produce code to plot the predicted values on the regression line.
2d) What are the intercept and slope values for the bird data set?

Simple linear regression - statistical inferences

As we saw in our regression lecture, the summary of the model below reports the significance of both regression coefficients (intercept and slope), and the R2 value:

summary(model.Lion)

The ANOVA approach to regression, if you remember from our regression lectures, also tests for the same null hypothesis, i.e., whether the population slope equals zero.

anova(model.Lion)

The confidence intervals for predictions (prediction intervals) that we saw in our lecture is reproduced next. Assume a lion with 50% of their noses covered by black spots is being considered for hunting.

What is the predicted age in years?

new.data <- data.frame(PropBlack = 0.50)
predicted.value <- predict(model.Lion,new.data,se.fit = TRUE,interval = "prediction",level = 0.95)
predicted.value

The predicted value is 6.20 years as it appears under fit in the object predicted.value.How much can we trust this prediction? Note above that we set the confidence level for the prediction interval at 95% (0.95). The confidence limits for the prediction of 6.2 years is given by lwr and upr in the object, i.e., lower (lwr) and upper (upr) levels. For our predictive value the confidence interval is between 2.7 and 9.7 years. Unfortunately, this confidence is not very good! Under normality assumptions, we are 95% confident (a strong chance) that an individual with 50% of black spots could have between 2.7 and 9.7 years. Given that the age for hunting is 6 years and older, hunters should not consider this individual for hunting.

Let’s plot the confidence interval for individual observation (lions). Here we will use the package ggplot that can produce very nice confidence interval plots:

install.packages("ggplot2")

Let’s call the package so that we can use its functions:

library(ggplot2)

Next, lets produce confidence intervals for every predicted value in the lion data set.

predictions <- predict(model.Lion, interval = "prediction")

It produced the following warning message, which is not a problem (see below)

Warning message:
In predict.lm(model.Lion, interval = "predict") :
  predictions on current data refer to _future_ responses

The function is warning us that we are making predictions for multiple observations. We are estimating multiple predicted values for the data we have already in the data and that may generate issues but not for the exercise at hands (we see that in advanced courses such as BIOL422). We can though suppress the error by simply using the following command instead:

suppressWarnings(predictions <- predict(model.Lion, interval = "predict"))

Now we are ready to plot the data. ggplot takes a very different approach to plotting and since our course is not an R course, but rather a course in which we use R to conduct statistical analysis. But, since this is our last tutorial, I thought it would be a good idea to show you a little more of the power of R. Obviously you can copy and past the commands I entered below for you. Some of them seemn more logic than others. The command ggplot is the basic plot where you entered the data. Then to that basic plot, we add a number of different aspects through function using +. geom_point is where we define the size and the colour of plots (and other aspects not covered here). geom_smooth runs the regression and plot the regression line. labs stands for labels. theme_classic is a preset style of plotting that I like but there are many others. And each geom_line below plots the lower and upper confidence lines for individual predicted values.

new_data <- cbind(Lion,predictions)
ggplot(new_data, aes(PropBlack, Age)) + 
  geom_point(size = 2, col = "firebrick") + 
  geom_smooth(method = "lm", se = FALSE, col = "blue") +
  labs(x = "Proportion black", y = "Age (years)") +
  theme_classic() + 
  geom_line(aes(y = lwr), col = "coral2", linetype = "dashed") + #lwr pred interval
  geom_line(aes(y = upr), col = "coral2", linetype = "dashed")  #upr pred interval

Note that the statement geom_smooth() using formula 'y ~ x' is not an error. It just tells us that is regressing y on x as set in the aes part of the ggplot function.

Let’s say all you wanted was to produce a simple scatter plot of the lion data then all you have to enter is:

ggplot(new_data, aes(PropBlack, Age)) + 
  geom_point(size = 2, col = "firebrick") +
  theme_classic()

As we cover in our 2nd lecture on regression, one important aspect to determine whether a linear regression model is appropriate, is a plot of regression residuals against the predictor variable. This helps determining whether the residuals are homoscedastic:

new_data <- cbind(Lion,residuals=model.Lion$residuals)
ggplot(new_data, aes(PropBlack, residuals)) + 
  geom_point(size = 2, col = "firebrick") + 
  labs(x = "Prop. black", y = "residuals") +
  geom_hline(yintercept=0,color = "black", size=1)+
  theme_classic()

There is a series of diagnostics and formal statistical tests for regression models that we won’t cover in BIOL322 and leave for more advanced courses. Normality diagnostics, however, will be covered in tutorial 12.

Submit the RStudio file containing the report via Moodle.