Tutorial 5: Rank-transformation and Heteroscedasticity

February 07, 2023 (5th week of classes)
Dealing with non-normal and heteroscedastic data in ANOVA designs

REMEMBER: There are no reports or grades for tutorials. Note though that reports and midterm 2 are heavily based on tutorials and your knowledge of R. Therefore you should attend tutorials and/or work on them weekly on your own time. Also consider using the Forum for tutorials with you have any specific questions along the way.

General setup for the tutorial
A good way to work through is to have the tutorial opened in our WebBook and RStudio opened beside each other.

*NOTE: Tutorials are produced in such a way that you should be able to easily adapt the commands to your own data! And that’s the way that a lot of practitioners of statistics use R!

Ranked transformation to tackle data that cannot be assumed to be normally distributed

As we discussed in our lectures, even though parametric tests (e.g., t-test, ANOVAs, etc) assume normality, they are quite robust when the assumption of normality is not met. That said, depending on the field of research, there is a tradition in terms of requiring that non-parametric options should be used even in cases where data are not too different from being normally distributed. One of the most widely used alternatives approaches for when data cannot be assumed to be normal are based on rank transformed data (as seen in our lectures).

Here we will use the F_st data from McDonald et al. (1996) seen in class.

Description: F_st is a measure of the amount of geographic variation in genetic polymorphism. McDonald et al. (1996) compared two populations of the American oyster using F_st estimated on six anonymous DNA polymorphisms (variation in random bits of DNA of no known function) and 13 proteins.

Question: Do protein differ in their F_st values in contrast to anonymous DNA polymorphisms?

Download the F_st data
Load the data:

Fst <- read.csv("McDonald_etal_1996.csv")
View(Fst)

Although the F_st data only have two groups (i.e., DNA and protein), this is a good place to remember that ANOVA (usually applied to data with more than two groups) and two-sample t-tests (applied to data with 2 groups) are the same. Let’s first run the ANOVA; we start by running the linear model (using function lm) where F_st values are the response (dependent) variable and factor origin (DNA or protein) is the predictor variable treated as a factor. The function as.factor transforms the code for origin in a way that the function lm runs an ANOVA instead of a standard regression. Although we can use the function aov to run the ANOVA (as seen in previous R tutorials), the function lm is more useful when running more complex analyses as we will see later in our R tutorials.

lm.Fst <- lm(Fst~as.factor(origin),data=Fst)
anova(lm.Fst)

The F-value = 0.1045 and the and P-value = 0.7502.

Note that the numerator degrees of freedom (df) is 1 because we only have two groups and the denominator df is 18 because we have 20 observations (i.e., number of F_st values) and we loose 2 degrees of freedom because the mean of each group (DNA and protein) were used to calculate their variances, which in turn are used to estimate the residuals (i.e., within group sum-of-squares as seen in our ANOVA lecture).

Now let’s calculate the two-sample (two-groups) t-test:

t.test(Fst~as.factor(origin),data=Fst,var.equal=TRUE)

Look over the results and you will note first that the P-value is exactly the same as for the ANOVA, i.e., P-value = 0.7502. The t-value is 0.3233. Let’s square the t-value:

0.3233^2

The squared t-value is exactly the F-value for the ANOVA calculated earlier (i.e., 0.1045). This demonstrates that the t-test is equivalent to an ANOVA when based on two samples. Indeed, if we square a t-distribution, it equals the F-distribution with 1 degree of freedom in the numerator (i.e., 2 groups) and n-2 degrees of freedom in the denominator.

Now, let’s get back to our F_st example. The first thing we should do is to verify if a parametric ANOVA could be used to analyze the data by assessing whether residuals are normal via the Q-Q normal residual plot.

plot(lm.Fst,which=2,col="firebrick",las = 1,cex.axis=1,pch=16)

The residuals depart quite a lot from what would be expected if they were from normally distributed data within groups. Let’s now run the Kruskal-Wallis test:

kruskal.test(Fst~origin,data=Fst)

Now let’s see how similar the P-value from the Kruskal-Wallis test is to the ANOVA based on the ranked transformed data. Remember (as seen in class) that their P-values are very similar. Because the Kruskal-Wallis can’t be applied to multi-factorial designs, the ANOVA on ranks becomes the generalized analysis for assessing differences among groups using ranked transformed data (e.g., multifactorial ranked-based ANOVAs). Remember that ranked transformed data is a general solution to deal with non-normal data.

In R, the function rank transforms data on ranks:

Fst.ranked<-rank(Fst$Fst)
Fst.ranked

Let’s combine the original data with the transformed data:

ranked.data <- data.frame(Fst=Fst$Fst,Fst.ranked=Fst.ranked,origin=Fst$origin)
View(ranked.data)

Now let’s conduct the ANOVA on the ranked transformed data:

lm.Fst.ranked <- lm(Fst.ranked~as.factor(origin),data=ranked.data)
anova(lm.Fst.ranked)

The P-value of the Kruskal-Wallis is 0.8365; note how similar it is to the one estimated by the ANOVA on ranks (i.e., 0.8429).

Issues caused by heteroscedasticity: inflated type I errors

One of the issues underlying ANOVA when data are heteroscedastic is that false positives will be more common than the ones set by alpha. So, when the null hypothesis in fact is true, you will reject it more often than the risk of false positives you accepted for your study, i.e., alpha. In other words, the actual type I error may be greater than the one you allowed and expected when you chose an alpha (e.g., 0.05). We refer to this as “inflated type I errors”.

We will start by estimating type I error rates under homoscedasticity and then heteroscedasticity so that you can better understand the issues involved. We have done this in our lectures 7 and 8.

Let’s assume here that we sampled observations from 3 normally-distributed populations with the same means (10) and the same standard deviations (30). By using the same means, we are making the null hypothesis be true. Obviously this is a simulation to demonstrate to you the issues involved and in reality we never know the conditions we have, i.e., whether the null hypothesis is true or not (obviously). But simulations allow us to understand how statistical tests are affected if we don’t consider assumptions properly.

Remember that the square of the standard deviation equals variance. We will then test using an ANOVA whether their group means vary. We will repeat this operation 10 000 times and keep the P-value for each ANOVA. We will then count how many times we rejected the ANOVA H₀, i.e., that at least 2 groups varied in their means. The number of observations in each group was kept at 20 observations but remember that groups can vary in their sample sizes. Let’s assume an alpha=0.05

Groups <- c(rep(1,each=20),rep(2,each=20),rep(3,each=20))
Groups
alpha <- 0.05
n.tests <- 10000
data.simul <- data.frame(replicate(n.tests,c(rnorm(n=20,mean=10,sd=30),rnorm(n=20,mean=10,sd=30),rnorm(n=20,mean=10,sd=30))))
dim(data.simul)

Note that the simulated data is 60 rows (60 values based on 3 groups of 20 observations each, i.e., 20 x 3 = 60) and 10000 columns (each column containing one data set). Each of the 10000 data sets are then analyzed using ANOVA as follows. This will take are little while (2 or 3 minutes)

lm.results <- lapply(data.simul, function(x) {
  lm(x~Groups, data = data.simul)
})
ANOVA.results <- lapply(lm.results, anova)

Now let’s extract the p-values for each ANOVA. Remember that this is an one-factorial ANOVA and, as such, only one p-value is generated:

P.values <- sapply(ANOVA.results, function(x) {x$"Pr(>F)"[1]})
TypeIerror <- length(which(P.values<=0.05))/n.tests
TypeIerror

Let’s plot the frequency distribution of p-values:

hist(P.values,col="firebrick")

The Type I error is pretty close to alpha (0.05), right? If you had run the simulation infinite times, the value in TypeIerror would had been exactly equal to 0.05, i.e, 5% of tests rejected H₀ when H₀ was true. Note that the frequency distribution of P-values (histogram) is uniform (as seen in our FDR module in lecture 6 when H₀ was set to be true, as it was the case here).

Let’s now assume that we sample observations from 3 normally-distributed populations with the same means (here set to be 10, H₀ is true) but varying in their standard deviations (3, 30 and 300, respectively). Let’s keep the same sample sizes per group.

Groups <- c(rep(1,each=20),rep(2,each=20),rep(3,each=20))
Groups
n.tests <- 10000
data.simul <- data.frame(replicate(n.tests,c(rnorm(n=20,mean=10,sd=3),rnorm(n=20,mean=10,sd=30),rnorm(n=20,mean=10,sd=300))))
dim(data.simul)
lm.results <- lapply(data.simul, function(x) {
  lm(x~Groups, data = data.simul)
})
ANOVA.results <- lapply(lm.results, anova)
P.values <- sapply(ANOVA.results, function(x) {x$"Pr(>F)"[1]})
alpha <- 0.05
TypeIerror <- length(which(P.values <= alpha))/n.tests
TypeIerror

The type I error is much greater than the expected alpha (around 0.11). In other words, because variances among populations from which the samples were taken vary among them (i.e., populations are heteroscedastic), the expected alpha of 0.05 is no longer the true alpha that we have set at 5%. The consequence is that if we were to use samples from these populations to test for the H₀ under ANOVA, we would have 11% of the samples as false positives (type I errors) instead of the 5% as set by alpha.

As such, the risks of false positives (i.e., type I error) are much greater than the one we allowed initially (i.e., alpha = 0.05 = 5%). Note, however, that depending on how the variances (or standard deviations) of the populations vary, the Type I error may vary and can be bigger or smaller. Bottom line, we cannot trust that samples from heteroscedastic populations equals the alpha levels that we assumed.

Note that even though the H₀ for the ANOVA was true (i.e., group means are equal), the frequency distribution (histogram) of P-values is asymmetric to the right, i.e., P-values tend to be smaller for heteroscedastic populations.

hist(P.values)

Issues caused by heteroscedasticity: reduced statistical power

As we saw in our lecture, when the null hypothesis is false for ANOVAs (i.e., at least two groups differ in their means), heteroscedasticity (differences in variance) reduce statistical power (i.e., increase false negatives).

We won’t demonstrate this issue here computationally; we will request this demonstration in report 2. But you will need to set a simulation where the null hypothesis for ANOVA is false and data are heteroscedastic. The codes above can be easily modified to achieve this goal.

Some solutions for heteroscedasticity

1) Welch’s ANOVA on ranks: As discussed in class, there are different solutions for dealing with heteroscedasticity in ANOVA designs. We will cover two here that are well used in biology. The first one is to use the Welch’s ANOVA for heteroscedastic data using ranks. This is done in R using the function oneway.test in which the argument var.equal is set to FALSE (i.e., variance are assumed different) and the Welch’s correction for degrees of freedom when variances come from populations with different variances. We will use the F_st data here.

oneway.test(Fst.ranked~origin, var.equal = FALSE,data=ranked.data)

Note that the P-value (0.8732) is slightly greater than the one produced by both the Kruskal-Wallis test (P-value = 0.8365) and the ANOVA based on ranks (P-value = 0.8429). This is because the Welch’s test is more conservative (i.e., it makes it more difficult to reject H₀, hence usually generating larger P-values than other procedures). Because heteroscedasticity tends to increase Type I error (i.e., rate of false positives is greater than the chosen alpha), the Welch’s test makes the test more difficult to reject H₀.
For example, imagine that you chose an alpha = 0.05 (5%) to establish whether you should rejected or not the H₀. This alpha is the rate of Type I error for normally distributed data and under homoscedasticity (i.e., the populations where they were sampled from have the same variance). Let’s refer to this as the expected alpha. Unfortunately, it is possible that more than 5% of tests will be significant when they should not (false positives) for heteroscedastic data (as we saw just above). As such, the Welch’s test make probabilities greater, thus decreasing the chance of false positives.

2) Weighted least squares: Another way to deal with the issue of heteroscedasticity is to consider weights in the regression model, known as weighted least squares (WLS) as we saw in our lecture. The advantage of the WLS approach is that it generalizes to multi-factorial ANOVAs and regressions, whereas the Welch’s approach can be used only in the case of single-factorial ANOVAs.

To understand what the WLS approach does, we can start by understanding another way of assessing homoscedasticity (in addition to the Levene’s test). As seen in our lecture, we can assess whether the square root of the absolute values of the standardized ANOVA residuals are independent of the ANOVA predicted values. This can be performed as follows (if you are curious to see how the plot is produced using more R code-like, go later to the miscellanea section (#1) at the end of this tutorial):

plot(lm.Fst.ranked,which=3)

You can observe that the residual variances are quite different between the two groups (fitted values here represent each group), which causes their absolute values to be dependent on the predicted values (predicted values in ANOVA represent predicted F_st mean values per group). When the red line is either positive or negative (i.e., not very straight), it suggests heteroscedasticity.

To reduce the heteroscedasticity in the ranked transformed data, we can make the fitting of the model have weights that are the reciprocal of the group residual variances (i.e., 1/group residual variance; as covered in lecture 8). In this case, groups with large residual variances have less weight in the ANOVA (regression) estimation (fitting) process. This fitting process is generally called “weighted least squares” (WLS).

The easiest way to do conduct the WLS based on reciprocal residual variance per group is to use the package nlme. However, if you are curious to see how this procedure is done using more “standard” R code, go through miscellanea section (#2) at the end of this tutorial.

Installing and loading nlme:

install.packages("nlme")
library("nlme")

Now, let’s fit the WLS model using the function gls in nlme

mod.wls = gls(Fst.ranked ~ factor(origin), data=ranked.data, weights=varIdent(form= ~ 1 | origin))
summary(mod.wls)

Note how the P-value associated to F_st origin (P-value = 0.8702) is quite similar to the Welch’s test (P-value = 0.8732). Note, again, that the WLS can be generalized to multiple factorial ANOVA which is not the case of the Welch’s ANOVA. As such, the WLS becomes a general approach to deal with the issue of heteroscedasticity.

WLS based on estimated variances per group is often called a two-stage estimation because it fits two-models:

The regular ANOVA model to estimate group residual variances (see Miscellanea);
A WLS model using the information on residual variances per group from the first fit.

More importantly, the WLS procedure can be also used for ANOVA designs in which data are distributed normally but variances vary among groups. In this case, instead of using the ranked transformed data, simply apply the WLS on the raw data instead.

Let’s see how the WLS approach does for samples from heteroscesdatic populations.

Let’s simulate heteroscedastic data and see how the WLS procedure behaves. We will keep the same data structure as below, i.e., 3 groups and 20 observations per group:

Groups <- c(rep(1,each=20),rep(2,each=20),rep(3,each=20))
Groups

Unfortunately, the function sapply is extremely slow when dealing with object generated by the gls function from package nlme to conduct WLS. We will use a for loop that improves on speed (sapply is a wrapper for loops anyway). for loops are at the core of any programming environment and has a multitude of applications. Results will be kept in a list that can be created as follows:

wls.results <- list()

Run the for loop; this may take a couple of minutes:

n.tests <- 1000
for (count.data.sets in 1:n.tests){
    data.simul <- c(rnorm(n=20,mean=10,sd=3),rnorm(n=20,mean=10,sd=30),rnorm(n=20,mean=10,sd=300))
    wls.results[[count.data.sets]] <- gls(data.simul ~     
         factor(Groups), weights=varIdent(form= ~ 1 | Groups))
}

The result for each WLS analysis based on the gls function is kept in the list wls.results. Each result and the associated ANOVA can be run calling the appropriate data set. For instance, for simulated data 3 we simply:

wls.results[[3]]
anova(wls.results[[3]])

To pull only the p-value (necessary to estimate the type I error rate) from the anova table we can do the following:

anova(wls.results[[3]])$"p-value"[2]

Let’s run now the ANOVA for all the gls models and estimate the type I error:

ANOVA.results <- lapply(wls.results, anova)
P.values <- sapply(ANOVA.results, function(x) {x$"p-value"[2]})
alpha <- 0.05
TypeIerror <- length(which(P.values <= alpha))/n.tests
TypeIerror

You will notice that the type I error should be around 0.05; in some cases it could be 0.065 or so. Note that we ran 1000 data sets (and not infinite). But, the point here is to demonstrate computationally that the inflated type I error due to heteroscedasticity can be tackled using WLS.
Let’s plot the frequency distribution of p-values:

hist(P.values)

Note that it is much more uniform than for when we used the standard ANOVA approach on heteroscedastic data.

Observation: In the next tutorial, we will see an example of WLS for a two-factorial ANOVA and for post-hoc tests to compare means as well.

Miscellanea:

part 1

Code to reproduce plot(ranked.ols,which=3) as used above. Instead of using the standard R function, we are doing the calculations involved and then plotting. This should help understanding how the plot works.

plot(sqrt(abs(scale(lm.Fst.ranked$residuals)))~lm.Fst.ranked$fitted)
abline(lm(sqrt(abs(scale(lm.Fst.ranked$residuals)))~lm.Fst.ranked$fitted),col="red",ylim=c(0,1.2))

part 2

How to run WLS using more “standard” R code instead of the gls function. This will hopefully allow you to understand a bit better what WLS does; as we saw in our lecture.

2.1) Calculate the residual variance per group (i.e., origin)

var.per.origin<-aggregate(lm.Fst.ranked$residuals,list(as.factor(ranked.data$origin)),var)
var.per.origin

2.2) Duplicate variances respecting the order of origin in the data:

var.vector <- vector()
var.vector[which(ranked.data$origin=="DNA")] <- 66.44167
var.vector[which(ranked.data$origin=="Protein")] <- 25.40797
var.vector

Run the ANOVA using the standard lm function as used to conduct ANOVAs:

lm.Fst.WLS <- lm(Fst.ranked~as.factor(origin), weights = 1/var.vector, data=ranked.data)
summary(lm.Fst.WLS)

Now using the gls function

gls.result <- gls(Fst.ranked~as.factor(origin), weights=varIdent(form= ~ 1 | origin),data=ranked.data)
anova(gls.result)

Note that the p-values from summary(lm.Fst.WLS) and anova(gls.result) are the same!