Tutorial 2: Statistical Hypothesis Testing

January 17, 2023 (2nd week of classes)

The principles of statistical hypothesis testing: generating evidence-based conclusions without complete biological knowledge!

REMEMBER: There are no reports or grades for tutorials. Note though that reports and midterm 2 are heavily based on tutorials and your knowledge of R. Therefore you should attend tutorials and/or work on them weekly on your own time. Also consider using the Forum for tutorials with you have any specific questions along the way.

General setup for the tutorial
A good way to work through is to have the tutorial opened in our WebBook and RStudio opened beside each other.


Now that you are ready, let’s start our tutorial. READ! READ! and READ! Don’t simply copy and paste code.
As we discussed in lecture 2, statistical hypothesis testing is an intimate stranger!! Most users know how to implement and interpret it, but most users of statistics don’t really understand how it works, its intricacies & philosophy. This tutorials is designed to assist you towards gaining this much needed understanding.

Statistical hypothesis testing is the most used statistical framework to generate evidence for or against questions based on data (research questions or otherwise).

Remember as discussed in lecture 2:

The statistical hypothesis framework is a quantitative method of statistical inference that allows to generate evidence for or against a research hypothesis. Practitioners of statistics, however, often get confused that statistical hypothesis testing can only generates supports against (but not for) the null hypothesis. And statistical hypothesis testing doesn’t generate support for (or against) the alternative hypothesis.

The principles of statistical hypothesis testing using a simple problem: Do toads exhibit handedness?

The right hand of toads (as seen in class, but a small summary is provided here)

From Whitlock and Schluter (2014; book used in BIOL322): “Humans are predominantly right handed. Do other animals exhibit handedness as well? Bisazza et al. (1996) tested this possibility on the common toad. They sampled (randomly) 18 toads from the wild. They wrapped a balloon around each individual’s head and recorded which forelimb each toad used to try to remove the balloon. Do other animals exhibit handedness as well?”

First step: transform the research question into a statistical question

Statistically, the question above can be translated into: “Do right-handed and left-handed toads occur with equal frequency in the toad population, or is one type more frequent than the other?

Result of the study: 14 toads were right handed and four were left handed. Are these results evidence of handedness in toads?

The statistical hypothesis framework for inference is then used to pose the following question:

Do these results provide evidence to say that the sample value we obtained is consistent or inconsistent with the samples from a “theoretical” population of toads where right- and left-handed individuals are equal in proportion?

If inconsistent, then the data would have generated evidence that other animals exhibit handedness. This part of the tutorial was built using Whitlock and Schluter’s code (https://rpubs.com/mdlama/153914).

Since we sampled 18 toads, we need to use this sample size to produce our sampling distribution for the theoretical population in which there is no dominance of one limb (hands) over the other limb.

Let’s start by learning how to take a random sample from a categorical variable having two possible outcomes from a theoretical population of known probabilities (here, 50%/50% chance of each hand) with the desired sample size (i.e., 18 individuals).

The function below will simulate the following “manual” implementation:

  1. put millions of small pieces of papers into a bag where half are written “Left” and the other half “Right”. In practice we only need 36 pieces so by chance alone zero “Left” and 18 “Right” or any combination would be possible.

  2. Sample with replacement 18 pieces of paper from the bad. Replacement means that you will take one piece of the paper, write down whether you got a “L” or “R” and then put back in the bag before taking another piece of paper. Replacement here is critical because otherwise you would change the original theoretical population while sampling it. If you were to keep the piece of paper (i.e., sampling without replacement), you would reduce the probability of sampling “Ls” and “Rs”. For example, imagine that your first 3 values were “Ls” and that you did not replace them. Then, you increased the probablity of sampling an “R”. Remember one of the criterion for random sampling “The selection of observational units in the population (e.g., individual piece of paper here) must be independent, i.e., the selection of any unit (e.g., L or R) of the population must not influence the selection of any other unit.”

The function below is pretty self explanatory. But some details perhaps are useful. c("L", "R") sets the categories that we want; we could have written “Left” and “Right” but “L” and “R” makes the code look “cleaner”. If three categories were required, we could had set them as c("A", "B","C"). size is the sample size. prob is the probability of each category. Let’s say we had 3 categories with 33.33% each as probability. Then prob = c(0.3333, 0.3333,0.333). Lastly replace = TRUE states (for the purpose of the problem here) that each observation (“toad”) has the same probability of being sample.

Sample1 <- sample(c("L", "R"), size = 18, prob = c(0.5, 0.5), replace = TRUE)
Sample1

We can then calculate how many individuals in sample 1 were of a particular category (here we decided right-handed); you will later that selecting the left-handed wouldn’t make a difference as we know how many there are:

sum(Sample1 == "R")

Let’s take a second sample and calculate its sum:

Sample2 <- sample(c("L", "R"), size = 18, prob = c(0.5, 0.5), replace = TRUE)
Sample2
sum(Sample2 == "R")

Most likely Sample1 and Sample2 have different numbers of right-handed individuals.

Now, let’s generate a huge number of samples from the theoretical population (i.e., computer approximation of the infinite sampling process done via calculus) and for each of them keep the score of right-handed toads:

number.samples <- 100000
samples.n18 <- replicate(number.samples,sample(c("L", "R"), size = 18, prob = c(0.5, 0.5), replace = TRUE))
dim(samples.n18)

Let’s print the first 3 samples:

samples.n18[,1:3]

Let’s now calculate for each sample, the number of individuals that are right-handed:

results.n18 <- apply(samples.n18,MARGIN=2,FUN=function(x) sum(x == "R"))

The number of left-handed toads can be calculated simply as:

18-results.n18

Take a look into what the vector results.n18. Each value in the vector results.n18 represents the number of right-handed toads from a random sample of 18 individuals from a theoretical population in which the number of right- and left-handed toads are the same. Remember the function head lists the first 6 values in a vector or matrix:

head(results.n18)

Let’s build the frequency distribution table for the samples (i.e., sampling distribution of rigt-handed toads from the theoretical population):

Table.TheoreticalPop <- table(results.n18, dnn = "numberRightToads")
Table.TheoreticalPop

Obviously the number of samples in Table.TheoreticalPop is 100000:

sum(Table.TheoreticalPop)

We can also calculate the relative frequency of samples of different number of right-handed toads as:

Table.TheoreticalPop/100000

As you could have predicted, the most common category is samples of 18 toads containing 9 toads in which half of them are right- and the other half left-handed. But notice that many samples differed from this value by chance alone, demonstrating sampling variation around the theoretical value (50%/50% right- and left-handed). To make the table to look a bit better use the function data.frame:

data.frame(Table.TheoreticalPop)

We can also calculate the probability of each class (i.e., for each number of individuals that are right-handed) as follows:

Table.TheoreticalPop/100000

Because we are dealing with a discrete variable (number of right-handed toads), the histogram will lump discrete values (number of right-handed toads) into classes. In this case, we rather use a barplot using the table we generated for the theoretical population so that each discrete value appears separately:

barplot(height = Table.TheoreticalPop, space = 0, las = 1, cex.names = 0.8, col = "firebrick", xlab = "Number of right-handed toads", ylab = "Frequency")

The distribution is symmetric but note that the relative frequency 8 left- handed toads may not be exactly the same as 10-right handed toads; and the same is the case for the relative frequencies of 2-right handed and 16-right handed toads as we did not generate the distribution for infinite number of samples; note, however, that they are pretty similar!

Now, let’s go back to our original data (i.e., real sample) in which 14 toads were right-handed. What is the the probability of obtaining a sample as extreme or more extreme than 14 toads

frac14orMore <- sum(results.n18 >= 14)/number.samples
frac14orMore

Given that the sampling distribution is symmetric, the value above is pretty similar to:

frac4orLess <- sum(results.n18 <= 4)/number.samples
frac4orLess

If infinite samples were taken, then frac14orMore and frac4orLess would had been the same. That’s the reason we concentrated in one limb (right) because the counts for left-handed toads can be easily calculated by subtraction.

Given that the interest here is not to generate evidence that the dominant limb of toads is their right (or left) limb but rather generate evidence as to whether they have a dominant limb (either right or left), we adopt a probability that reflects both sides of the sampling distribution from the theoretical population:

2*frac14orMore

This probability can be equally estimated by:

frac4orLess+frac14orMore

Note that the difference between the two values above are due to the fact that our distribution was computanionally generated. The one generated by infinite sampling would had given exactly the same value, i.e., frac4orLess+frac14orMore would had been exactly the same as 2*frac14orMore.

The computational approach developed here allows you to understand the principle used to generate statistical evidence regarding a scientific hypothesis. In real applications, however, we would had used a method that generates the sampling distribution based on infinite sampling (referred as to an exact test). The probability based on an exact test can be estimated as follows. The first value is the proportion of right-legged toads (or left), 14; the second value is the total number of toads, i.e., 18 and the 3rd value is the proportion assumed under H0, i.e., 50%/50% (or 0.5):

binom.test(14,18,0.5,alternative="two.sided")$p.value 

The probability is 0.03088379, which is pretty close to our computational approach. In my case, 2 x frac14orMore was equal to 0.03194.

Just for sake of completion, let’s estimate the p-value based on the number of left-legged frogs, i.e., 4 individuals:

binom.test(4,18,0.5,alternative="two.sided")$p.value 

Obviously the same p-value is produced!

Regardless of the approach (infinite or computational), the probability of finding a sample of 18 toads (from a theoretical population in which the number of right- and left-handed individuals are equal) in which 14 are right-handed and 4 are left-handed (hence calculating the probabilities from both sides of the curve) is around 0.03. This probability tell us how unusual is to find a sample like the one found by the study (i.e., 14 right-handed and 4 left-handed) in a theoretical population. The probability is quite small (0.03111), thus providing evidence to say that the toads in the real data exhibit handedness. In other words, the probability of finding a sample value with 14 toads as right-handed (4 left-handed) is quite implausible (inconsistent) with what is expected from a theoretical population with half of the individuals being right-handed and the other half being left-handed!

As such, our original sample of 14 right-handed toads and 4 left-handed toads seem to come from a population different from our theoretical population of 50% right-handed and 50% left-handed. We then use this probability to generalize the results and state that “the study generated evidence that toads present handedness.

Now, let’s put this into an estimation perspective. The estimate of the proportion of right-legged frogs over left-handed frogs based on the sample is obviously:

14/18 = 0.7777778

Let’s remember what a confidence interval is: “A confidence interval is a range of values surrounding the sample estimate that is likely to contain the population parameter.” If you need a refresher on this concept, please go to the WebBook for BIOL322 (access available in Moodle).

The 95% confidence interval for the estimated proportion above (i.e., 0.7777778) is also given by the function binomial.

binom.test(14,18,0.5,alternative="two.sided")

As you can see, the 95% confidence interval is 0.5236272 - 0.9359080. As you can imagine, this is quite wide but qualitative agree (and will always do) with the results from the statistical hypothesis testing. Here, based on an alpha of 0.05 (the complement of a confidence interval of 95%) and p-value = 0.03088, we rejected the null hypothesis. Note (obviously) that the confidence interval DOES NOT contain the uninteresting value assumed under the null hypothesis 50%. Now let’s estimate the confidence interval for the proportion of left-legged over right-legged:

binom.test(4,18,0.5,alternative="two.sided")

The proportion is 22.22%, or 0.2222 and the confidence interval of 0.06409205 - 0.47637277 (as expected) also does not contain the value assumed under H0 of 0.5 (50%).

Now let’s estimate the confidence interval around the value assumed under the null hypothesis, i.e,, 50%/50% or 0.5; which means 9 individuals left-legged and 9 individuals right-legged:

binom.test(9,18,0.5,alternative="two.sided")

The confidence interval is 0.2601906 - 0.7398094. Note that the interval does not include the observed proportion 0.78 (i.e., 14/18). So,

1) the confidence interval for the observed value doesn't not include the value under the null hypothesis (0.50); and 

2) the observed confidence interval for the value assumed under the null hypothesis doesn't include the observed value (0.78)

Note, however, that the fact that they agree, that doesn’t mean that parameter estimation is not important and that we should aim at improving it (and its confidence interval) when possible. By always just aiming at statistical hypothesis testing and not parameter estimation, we restrain and become content with just a qualitative approach (reject or don’t reject the null hypothesis) instead of a quantitative one.

A critical note on sample size and its effect on statistical hypothesis testing and inference

By improving estimation, (and its confidence interval), we can improve quantitative characterization of a research problem. One way to improve estimation is to increase sample size, which may be extremely timely and financially consuming, and even ethically irresponsible (e.g., collecting and manipulating too many individuals that can put biological populations and species at risk). Let’s assume, for instance, that 1000 frogs were sampled instead but the number of right-legged were the same in proportion (i.e., 14/18 = 778/1000)

binom.test(778,1000,0.5,alternative="two.sided")

The 95% confidence interval (0.7509431 - 0.8034091) for 1000 individuals is much more informative than the original based on 18 individuals (0.5236272 - 0.9359080).

We could had used the computational approach to estimate confidence intervals (and/or p-values) for any desired proportions and sample sizes; this also demonstrates that the sampling apprach we use to create null distributions are the same as for estimating confidence intervals. For that, we need to change the expectation of right- and left-legged toads as observed.
We used here the command set.seed below so that the random number generator starts at the same point for all of us and we obtain the same results.

set.seed(100)
number.samples <- 20000
n <- 1000
samples <- replicate(number.samples,sample(c("L", "R"), size = n, prob = c(1-0.778,0.778), replace = TRUE))
results <- apply(samples,MARGIN=2,FUN=function(x) sum(x == "R"))
proportions <- results/n
quantile(proportions, probs = c(0.025, 0.975))

The estimated 95% (two-tailed, hence 0.025 and 0.975 above) confidence interval using the computation approach is 0.752-0.804. Note that this confidence interval extremelly similar to the one obtained using the binomial distribution (via the binom.test function), which was 0.7509431 - 0.8034091. If it were possible to set the number of samples for the computational approach as infinite, then the two intervals would had been the same.

We use this sort of simulations so that you can gain a better understanding of what statistical distributions are. Also, most distributions have closed-form solutions, i.e., they can be solved using a formula that has been built to generate probabilities and confidence intervals (among other types of statistical inferential tools).

Taken together, the goal here was to demonstrate that there is no real difference between them (statistical hypothesis testing) in their calculations. But there is a big differece in philosophy, aims and information.

Finally, for the sake of discussion, let’s just assume that the proportion found was different, say 1 left-legged and 17 right-legged toads:

binom.test(1,18,0.5,alternative="two.sided")

The p-value is obviously much smaller (0.000145) and the confidence interval much smaller, providing greater confidence about the sample estimate (0.001405556 - 0.272943600); and also agrees with the rejection of the null hypothesis, i.e., does not include the value assumed under H0 of 0.5!

As we saw it, the qualitative answer from the statistical hypothesis testing will always agree with the quantitative answer based on the confidence interval estimate. Why do we often use statistical hypothesis then? There are also two other important issues to consider:

  1. We love yes & no answers; much easier to make decisions (but there are issues involved as discussed in Lecture 3.

  2. Some estimates of effects are difficult to interpret on an easy and straightforward quantititave value. Let’s assume an ANOVA design for estimating the effect of four different temperature levels (low, intermediate, high and very high) on fish growth.

The data are transformed into an F-value to summarize all the information into one value. We can certainly estimate the confidence interval for the F-statistic. But that’s often difficult to translate quantitatively. So, researchers resort to an easier, value, i.e., a p-value for the F-value to arrive to a qualitative conclusion. That said, we should also consider always in reporting the confidence interval for each sample mean as they are indeed more informative than statistical hypothesis testing.

We hope that this tutorial assisted in gaining a better understanding of the elements we covered in lectures 2 and 3. The paper found in the link below, “Interpreting Research Findings With Confidence Interval”, is also very helpful to understand the points made here:

https://orthodontics-endodontics.imedpub.com/interpreting-research-findings-with-confidence-interval.php?aid=7224


Problem: for those that would like to apply the concepts covered here; this is also a good way to practice for Midterm 2.

All the information and code needed for you to solve this problem is in the tutorial.

The problem is based on the following problem adapted from Wikipedia (https://en.wikipedia.org/wiki/Ideal_free_distribution): The ideal free distribution (IFD; Fretwell and Lucas 1970) is a theory in ecology that states that the number of individuals from a given species that will aggregate in various patches is proportional to the amount of resources available in each patch. For example, if patch A contains twice as many resources as patch B, there will be twice as many individuals foraging in patch A as in patch B. The IFD theory predicts that the distribution of individuals among patches will minimize resource competition and maximize fitness.

Let’s assume that you run an experiment to test the IFD in a species of grasshoppers. You set two patches (two tanks linked by a tube) and put 50% of the food in one tank and 50% of the food in the other tank; you then place 10 grasshoppers in one tank and 10 grasshoppers in the other tank. Then you go back after 6 hours and count the number of grasshoppers on each tank (individuals may have moved between tanks). The result is that 3 individuals were found in tank 1 (and obviously 17 individuals were found in tank 2 assuming no individual died during the experiment).

Research question: Do these results provide evidence for or against the Ideal Free Distribution?

Develop code to estimate whether the results are consistent or not with the IDF. Code should include both the computational and the exact test.

What would be your answer on the basis of the code and the concepts you have learned above?

Reference: Fretwell, S. D. & Lucas, H. L., Jr. 1970. On territorial behavior and other factors influencing habitat distribution in birds. I. Theoretical Development. Acta Biotheoretica 19: 16–36.