Tutorial 12: Correlation and test assumptions

Correlation and verifying the normality assumption for all tests covered in the course
Week of November 29, 2022
(11th week of classes)


How tutorials work

The DEADLINE for your report is always at the end of the tutorial.


After a few tutorials, some students may be able to complete their reports without attending the synchronous lab/tutorial sessions. That said, we highly recommend that you attend, as your TA is not responsible for assisting you with lab reports outside of the synchronous lab sections.

The REPORT INSTRUCTIONS (what you need to do to get the marks for this report) are found at the end of this tutorial.

Your TAs

Section 101 (Wednesday): 10:15-13:00 - John Williams ()
Section 102 (Wednesday): 13:15-16:00 - Hammed Akande ()
Section 103 (Friday): 10:15-13:00 - Michael Paulauskas ()
Section 104 (Friday): 13:15-16:00 - Alexandra Engler ()


Correlation

We will use the wolves inbreeding data covered in class.

Download the Wolves data file

Now upload and inspect the data:

wolf <- read.csv("chap16e2InbreedingWolves.csv")
View(wolf)

Let’s plot the data:

plot(nPups ~ inbreedCoef, data = wolf, col = "firebrick", pch = 16, las = 0)

Calculate the correlation. Note that the function cor does not have a data argument, so each variable has to be passed to the function directly; as we learned in previous tutorials, we can use $ to access a variable within a data frame:

cor(wolf$inbreedCoef, wolf$nPups, method = "pearson")

To test the significance of the correlation:

cor.test(wolf$inbreedCoef, wolf$nPups, method = "pearson")

Problem 3:
Produce code to perform a regression of number of pups (response) on inbreeding coefficient (predictor).
a) What are the intercept and slope values?
b) What is the R2 value?
c) Produce a plot using ggplot of the data and regression line (no need to plot the confidence interval for predictions of individual observations)

# Problem: write your answer.


Assessing normality via Q-Q plot

As we saw in class, the Q-Q plot is used to assess normality. If the data points fall more or less on the Q-Q plot line, then one can assume that the data are normally distributed, and a parametric test can be applied. If that is not the case, then we need to resort to other types of statistical tests, such as the rank-based tests covered at the end of this tutorial.

Let’s start with some simulated data to understand the use of Q-Q plots. First generate a sample from a normally distributed population:

sample.normal <- rnorm(n = 100, mean = 10, sd = 4)
hist(sample.normal)

Now, let’s produce the Q-Q plot for the data:

qqnorm(sample.normal, pch = 1, frame = FALSE)

The following command will plot the line on the plot:

qqline(sample.normal, col = "steelblue", lwd = 2)

As the simulated data are normal, the dots fall quite well on the line.

Let’s now simulate a sample from a non-normal population (here, the uniform distribution):

sample.uniform <- runif(n=100)
hist(sample.uniform)

Now, let’s produce the Q-Q plot and associated line:

qqnorm(sample.uniform, pch = 1, frame = FALSE)
qqline(sample.uniform, col = "steelblue", lwd = 2)

Note that many data points fall well off the line, indicating that the assumption that the sample comes from a normally distributed population does not hold (i.e., the assumption of normality is not met). In this case, a non-parametric test should be used.

Let’s now put the Q-Q plot in practice by reviewing some of the data we used in our other tutorials.

Q-Q plot for the one sample t-test

Let’s produce the Q-Q plot for the random sample of body temperatures of 25 people. If the normality assumption holds, then the t-test performed in tutorial 7, which compared the sample against the “expected” temperature of 98.6°F taught in schools, was appropriate.

Download the human temperature data file

Now upload data and generate the Q-Q plot:

heat <- as.matrix(read.csv("chap11e3Temperature.csv"))
qqnorm(heat, pch = 1, frame = FALSE)
qqline(heat, col = "steelblue", lwd = 2)

Note how well the temperatures of the 25 individuals fall on the Q-Q plot line; so normality can be assumed and our t-test was the appropriate choice for these data.
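With normality supported, the one-sample t-test from tutorial 7 can be rerun here for reference; a minimal sketch, using the 98.6°F value from above (heat was read in as a matrix, so we convert it to a numeric vector first):

```r
# One-sample t-test against the "textbook" temperature of 98.6°F
t.test(as.numeric(heat), mu = 98.6)
```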

Q-Q plot for the paired t-test

The problem used in the tutorial to learn how to run a paired-t test was based on the following question: Are males with high testosterone paying a cost for this extra mating success in other ways? We will use the data on individual differences in immunocompetence before and after increasing the testosterone levels of red-winged blackbirds. Immunocompetence was measured as the log of optical measures of antibody production per minute (ln[mOD/min]).

Download the Testosterone data file

Let’s upload and inspect the data:

blackbird <- read.csv("chap12e2BlackbirdTestosterone.csv")
View(blackbird)

Calculate the differences between the after and before measurements (not log-transformed):

differences.d <- blackbird$afterImplant - blackbird$beforeImplant

Produce the Q-Q plot:

qqnorm(differences.d, pch = 1, frame = FALSE)
qqline(differences.d, col = "steelblue", lwd = 2)

Note how a couple of data points fall far from the line. Perhaps a log-transformation can help here and make the data closer to normally distributed:

differences.d.log <- log(blackbird$afterImplant) - log(blackbird$beforeImplant)
qqnorm(differences.d.log, pch = 1, frame = FALSE)
qqline(differences.d.log, col = "steelblue", lwd = 2)

Although the log transformation brought the data points closer to the line, one point still falls far from it. A single point is not strong evidence against normality, so we should be fine using the paired t-test (which assumes normality) for these data.
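Since the log-transformed differences look acceptably normal, the paired t-test can be run on the log scale; a minimal sketch:

```r
# Paired t-test on the log-transformed immunocompetence values
t.test(log(blackbird$afterImplant), log(blackbird$beforeImplant),
       paired = TRUE)
```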

Q-Q plot for the two sample t-test

In that tutorial, we compared the squamosal horn (spike) length of horned lizards between living and killed individuals.

Download the Lizard data file

Let’s upload and inspect the data:

lizard <- read.csv("chap12e3HornedLizards.csv")
View(lizard)

Given that we need to assess normality for all the data combined (as seen in class), here we will use the residuals from the t-test, which can be generated as follows:

fit <- lm(squamosalHornLength ~ Survival, data = lizard)
View(fit$residuals)

Now let’s generate the Q-Q plot for the residuals:

qqnorm(fit$residuals, pch = 1, frame = FALSE)
qqline(fit$residuals, col = "steelblue", lwd = 2)

Note that some of the smaller residuals do not quite fall on the line; but the large majority does. As such, we will assume that the data come from a normally distributed population.
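Having judged the residuals acceptably normal, the two-sample t-test from the earlier tutorial can be run; a minimal sketch (var.equal = TRUE gives the classic two-sample t-test assuming equal variances, as covered in class):

```r
# Two-sample t-test of horn length between living and killed lizards
t.test(squamosalHornLength ~ Survival, data = lizard, var.equal = TRUE)
```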

Q-Q plot for ANOVAs

Here we will revisit the variation in melatonin production (response) as a function of light treatment (predictor).

Download the Knees data file

Load and inspect the data:

circadian <- read.csv("chap15e1KneesWhoSayNight.csv")
View(circadian)

Again, as seen in class, we will analyse the residuals of the ANOVA instead of the original data. Unlike the function t.test, the function aov does return the residuals, so we don’t need the function lm as we did for the two-sample t-test (although we could have used lm to run the ANOVA analysis as well).

circadianANOVA <- aov(shift ~ treatment, data = circadian)
View(circadianANOVA$residuals)
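As noted above, lm could be used instead of aov; a sketch of the equivalent analysis (the residuals are identical either way):

```r
# Equivalent ANOVA via lm(): same residuals, same F-test
circadianLM <- lm(shift ~ treatment, data = circadian)
anova(circadianLM)  # ANOVA table
```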

Now let’s generate the Q-Q plot for the ANOVA residuals:

qqnorm(circadianANOVA$residuals, pch = 1, frame = FALSE)
qqline(circadianANOVA$residuals, col = "steelblue", lwd = 2)

The residuals do follow a linear trend and we can then assume that the data are normally distributed. As such, we can trust the results from the ANOVA analysis as performed in tutorial 10.

Q-Q plot for simple regression

Let’s revisit the lion data from the regression tutorial. The code below assumes the fitted regression model model.Lion from that tutorial is still in your workspace. Let’s generate the Q-Q plot for the residuals:

qqnorm(model.Lion$residuals, pch = 1, frame = FALSE)
qqline(model.Lion$residuals, col = "steelblue", lwd = 2)

Again, the residuals follow the straight line, so we can assume that the assumption of normality holds for these data and trust the slope estimate and the ANOVA results for the regression.
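If model.Lion is no longer in your workspace, refit it before producing the Q-Q plot. The file and variable names below are assumptions based on the regression tutorial and may differ from what you used; adjust them as needed:

```r
# Assumed file/variable names from the regression tutorial -- adjust if needed
lion <- read.csv("chap17e1LionNoses.csv")
model.Lion <- lm(ageInYears ~ proportionBlack, data = lion)
```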


The Kruskal-Wallis test - Non-parametric testing based on rank differences

Let’s use the Fst data measured from proteins and DNA to answer the following question: Do proteins differ in Fst values from anonymous DNA polymorphisms?

Download the Fst data file

Let’s upload and inspect the data:

Fst <- read.csv("Fst.csv", header = TRUE)
View(Fst)

Now let’s generate the Q-Q plot for the residuals:

fit <- lm(Fst ~ class, data = Fst)
qqnorm(fit$residuals, pch = 1, frame = FALSE)
qqline(fit$residuals, col = "steelblue", lwd = 2)

The residuals indicate that the assumption of normality does not hold for these data: both small and large residuals fall well off the line. As such, we had better use a non-parametric procedure to test for differences in Fst values between DNA and protein:

kruskal.test(Fst ~ class, data = Fst)

Given that the P-value is 0.8365, we should not reject the null hypothesis that the Fst values are the same for both classes (i.e., DNA and protein).

You need to send the report to your TA via email by the end of this tutorial. We have a lot of experience with tutorials, and they are designed so that students can complete them by the end of the tutorial in question. Don’t forget to attach the RStudio script and any relevant data files the tutorial uses (e.g., csv files). No written report (e.g., a Word file) is necessary; just attach the files to an email to your TA. Some students fail or forget to send their report with the necessary attachments. Please don’t forget.