# Tutorial 4: Describing data

**Describing data with R**

Week of September 26, 2022

(4th week of classes)

**How the tutorials work**

The DEADLINE for your report is always at the end of the tutorial period.

REPORT INSTRUCTIONS: In addition to the problem at the end of the tutorial, there may be questions throughout the tutorial that need to be answered in the report (in the RStudio file).

After a while, students may be able to complete their reports without attending the synchronous lab/tutorial sessions. That said, we highly recommend that you attend the tutorials, as your TA is not responsible for assisting you with lab reports outside of the synchronous lab sections.

**Your TAs**

Section 101 (Wednesday): 10:15-13:00 - John Williams (j.p.w@outlook.com)

Section 102 (Wednesday): 13:15-16:00 - Hammed Akande (hammed.akande@mail.mcgill.ca)

Section 103 (Friday): 10:15-13:00 - Michael Paulauskas (michael.paulauskas@mail.mcgill.ca)

Section 104 (Friday): 13:15-16:00 - Alexandra Engler (alexandra.engler@hotmail.fr)

**General Information**

Please read all the text; don’t skip anything. Tutorials are based on R exercises intended to give you practice running statistical analyses and interpreting their results.

This tutorial will allow you to develop basic to intermediate skills in plotting in R. It was produced using code adapted from the book “The Analysis of Biological Data” (Chapter 2). Here, however, we provide additional details and features of many of the R graphic functions.

**Note:** Once you get used to R, things get much easier. Since this is only your 3rd in-person tutorial, try to remember to:

- set the working directory
- create an RStudio file
- enter commands as you work through the tutorial
- save the file from time to time (to avoid losing it) and
- run the commands in the file to make sure they work
- if you want to comment the code, make sure to start the text with `#`, e.g., `# this code is to calculate…`

**Intro video for tutorial 4 - recorded in 2020, so don’t follow any report instructions given in the video; follow the instructions in this tutorial**
by Timothy Law (with input from the TA team)

**General setup for the tutorial**

A good way to work through is to have the tutorial open in our WebBook and RStudio open side by side.

**Mean, median, variance, standard deviation & coefficient of variation**

We will start the tutorial by using the data on Gliding snakes as seen in lecture 4.

Download the gliding snakes data file

First open the file “chap03e1GlidingSnakes.csv” in Excel and try to understand how the data are organized.

Now upload and inspect the data. You have probably noticed by now that the function `View` allows you to see all the data, including the names of the variables (i.e., the columns of the data matrix). The function `head` prints the first 6 rows of the data matrix and helps identify the names of the variables in case you don’t remember them; knowing the names of variables is very important, as you often need to refer to them during analyses. One often forgets variable names even for data one created oneself.

Note that you could also use the function `names` to print only the names of the variables (i.e., the column names of the data matrix).

```
snakeData <- read.csv("chap03e1GlidingSnakes.csv")
View(snakeData)
head(snakeData)
names(snakeData)
```

As you learned in tutorial 3, data matrices are usually organized with variables as matrix columns. Here we have the variable (column) `snake`, indicating the number of each observational unit (i.e., individual snake), and the variable `undulationRateHz`, containing the speed (cycles per second) for each snake. Again, there are several ways to access individual variables in a matrix, but the easiest way here (as seen in tutorial 3) is probably to use the `$` sign; in this case, the name of the data matrix precedes `$` and the name of the variable of interest follows it.
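As a quick illustration of how `$` works, here is a tiny made-up data frame (invented for this example only, not the real snake data); `$` extracts one column as a plain vector:

```r
# Toy data frame, invented for illustration only
toyData <- data.frame(snake = 1:3, undulationRateHz = c(1.2, 1.5, 0.9))
toyData$undulationRateHz  # the column as a numeric vector: 1.2 1.5 0.9
```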

Calculate location statistics:

```
mean(snakeData$undulationRateHz)
median(snakeData$undulationRateHz)
```

You can also save the values into a variable for later use:

```
mean.undulation <- mean(snakeData$undulationRateHz)
mean.undulation
median.undulation <- median(snakeData$undulationRateHz)
median.undulation
```

Calculate spread (often referred to as dispersion) statistics - the function `sd` stands for *standard deviation* and the function `var` for *variance*.

```
sd(snakeData$undulationRateHz)
var(snakeData$undulationRateHz)
```

Coefficient of Variation - R does not have a default function that calculates the coefficient of variation, but we can do that easily as follows:

`sd(snakeData$undulationRateHz)/mean(snakeData$undulationRateHz) * 100`

We can also calculate the coefficient of variation (CV) as follows:

```
sd.undulation <- sd(snakeData$undulationRateHz)
sd.undulation/mean.undulation * 100
```

Obviously you can also save CV into a variable as follows:

`CV.undulation <- sd.undulation/mean.undulation * 100`
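If you find yourself computing the CV repeatedly, you could also wrap the calculation in your own small function. This is just a sketch of the same formula as above (the function name `cv` is our invention, not a built-in):

```r
# A user-defined helper for the coefficient of variation (in %)
cv <- function(x) sd(x) / mean(x) * 100

cv(c(2, 4, 6))  # sd = 2 and mean = 4, so CV = 50
```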

Let’s plot the frequency distribution of the data (i.e., histogram):

`hist(snakeData$undulationRateHz)`

Let’s make the histogram look better:

`hist(snakeData$undulationRateHz, breaks = seq(0.8, 2.2, by = 0.2), col = "firebrick", xlab = "Undulation rate (Hz)", main = "")`
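The argument `breaks` above uses the function `seq` to generate the bin edges of the histogram; `seq` builds a regular sequence from a start value to an end value in steps of `by`:

```r
# The bin edges used in the histogram above
seq(0.8, 2.2, by = 0.2)  # 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2
```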

You could have used the argument `main` to insert a title that would appear at the top of the graph. In reports and publications we usually use figure legends instead, so we leave it empty, i.e., main = "" (nothing between the quotes).

Note that the X and Y axes are not touching. By setting the arguments `xaxs` and `yaxs` (for the x and y axes, respectively) to "i", we can make them touch. The "i" stands for "internal", as opposed to "r" for "regular" (the default); it is called "internal" in the sense that you set the axes to a configuration other than the standard one. Get used to it: R has a great number of "idiosyncrasies" - or at least as many as other languages (natural and computer alike). I often say: "don’t fight the language, accept it!"

`hist(snakeData$undulationRateHz, breaks = seq(0.8, 2.2, by = 0.2), col = "firebrick", xlab = "Undulation rate (Hz)", main = "", xaxs = "i", yaxs = "i")`

**Boxplots & interquartile range**

Let’s use a different data set, the spider running speed data set (as seen in lecture 5).

Download the running spider data file

First open the file “chap03e2SpiderAmputation.csv” in Excel and try to understand how the data are organized.

Upload and inspect the data:

```
spiderData <- read.csv("chap03e2SpiderAmputation.csv")
View(spiderData)
head(spiderData)
names(spiderData)
```

Note below that we changed the way we refer to variables: we will not use the `$` sign. Some functions, such as `boxplot`, also accept variable names directly. In these functions, the variables within the data matrix are recognized through the argument `data`, which makes typing easier and shortens the code:

`boxplot(speed ~ treatment, data = spiderData)`

Obviously we could have used `$` instead of the argument `data` in the function above. It is a bit more work to type, though:

`boxplot(spiderData$speed ~ spiderData$treatment)`

What is the problem with the boxplot? The issue is that the `after` speeds appear before the `before` speeds. This is because the treatment labels are not being treated as ordered categories but as strings (i.e., sequences of characters), and the letter `a` comes before the letter `b` in the alphabet. In this case, we need to transform the variable `treatment` into a coded categorical variable (referred to as a `factor` in R), for which we can also set the order of appearance of the levels. This is done using the function `factor`. This function is quite useful and we’ll use it in several types of analyses and tutorials in BIOL322:

`spiderData$treatment <- factor(spiderData$treatment, levels = c("before", "after"))`
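To confirm that the new ordering took effect, you can inspect the factor levels with the function `levels` (a quick optional check; the example below uses made-up labels rather than the spider data):

```r
# Made-up treatment labels, for illustration only
treatment <- factor(c("before", "after", "before"), levels = c("before", "after"))
levels(treatment)  # "before" "after"
```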

Now plot the boxplot again; `before` will now appear before `after`. OK, writing that was pretty circular, but true!

`boxplot(speed ~ treatment, data = spiderData)`

Now, let’s make the boxplot look much nicer. It is easy to forget the arguments of most R functions, including the ones used in `boxplot`. So here is some additional information that you will be able to use in the future to make boxplots nicer than the ones produced by `boxplot` with its default settings. The argument `boxwex` controls the width of the boxes so that we can make them narrower; here `boxwex` was set to 0.5, but “play” with the value to see what happens.

The argument `whisklty` sets the line type for the whiskers - setting it to 1 makes the whiskers appear as solid lines (and not as dashed lines as in the default, i.e., the way they were drawn in the earlier boxplot). Finally, the argument `las` controls the orientation of the axis labels - try `las = 0`, `las = 1` and `las = 2` to see what happens (try this in class, but no need to include it in the report). Here we used `las = 1` (this one needs to be included in the report):

`boxplot(speed ~ treatment, data = spiderData, ylim = c(0, max(spiderData$speed)), col = "goldenrod1", boxwex = 0.5, whisklty = 1, las = 1, xlab = "Amputation treatment", ylab = "Running speed (cm/s)")`

Just by looking at the boxplot, you can notice that the distributions of the `before` and `after` speeds are asymmetric. Look at our lecture slides and try to figure out which type of asymmetric distribution each variable represents (right- or left-skewed). Write this information in the tutorial report code file (see below) as follows:

`# before amputation the distribution type is "YOUR ANSWER" and after amputation the distribution type is "YOUR ANSWER".`

Don’t forget to use the symbol `#` so that the text you wrote is not interpreted as commands.

Because the data are quite asymmetric, we will use location and spread statistics based on the median and quartiles, respectively, instead of the mean and standard deviation. Here we have two groups in the data, i.e., `before` and `after`. In tutorial 3, we introduced the function `aggregate`. This function selects the data belonging to each categorical value in a list (here the treatment, i.e., `before` or `after`) and applies a function to each group separately. Here we want the median per treatment, hence `median` as the third argument of `aggregate`. The median is then calculated for each group as:

`aggregate(spiderData$speed, list(spiderData$treatment), median)`

Using the function `aggregate` with the function `quantile` is a bit “complex” (see the advanced notes below). To keep things simple, let’s create variables that contain the speed values before and after voluntary amputation separately. This is done using the function `subset` with a condition on the variable `treatment`. Note that here `treatment` is compared with a “==” instead of a “=”: the double equal sign “==” is needed for a logical statement meaning “is equal to” in R:

```
speedBefore <- subset(spiderData, treatment == "before")
speedBefore
speedAfter <- subset(spiderData, treatment == "after")
speedAfter
```
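As a side note, a comparison with `==` returns a logical vector (TRUE/FALSE), and `subset` keeps the rows where the condition is TRUE. A tiny made-up illustration:

```r
# Made-up labels, for illustration only
treatment <- c("before", "after", "before")
treatment == "before"  # TRUE FALSE TRUE
```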

Now let’s calculate the median and the 25% (Q1) and 75% (Q3) quartiles using a function called `quantile`, in which we specify the points of the distribution at which quantiles are calculated using the argument `probs`. The word *probs* (i.e., probabilities) is used here because we are inferring that there is a 25% chance of finding a value equal to or smaller than Q1 (which is based on the sample values) in the distribution of the entire statistical population of spiders (i.e., all male spiders of that species during the mating season, either before or after amputation), and a 75% chance of finding a value equal to or smaller than Q3. Note that here `chance` is based on the statistical population concept in which sample values are used to estimate population parameters. As such, we use the sample quartiles (Q1, Q2 & Q3) to make inferences about these two statistical populations: all male spiders of that species during the mating season either before (population 1) or after amputation (population 2).

```
median(speedBefore$speed)
median(speedAfter$speed)
quantile(speedBefore$speed, probs = c(0.25, 0.75))
quantile(speedAfter$speed, probs = c(0.25, 0.75))
```

Note that the median (`Q2`) of a distribution represents the value that divides the distribution of values in half; i.e., there is a 50% chance of finding a value equal to or smaller than Q2, and a 50% chance of finding a value equal to or greater than Q2 (i.e., the prob for Q2 = 0.50). So we could have calculated Q1, Q2 and Q3 using just the function `quantile` as follows:

```
quantile(speedBefore$speed, probs = c(0.25, 0.50, 0.75))
quantile(speedAfter$speed, probs = c(0.25, 0.50, 0.75))
```

The interquartile range (IQR) for both speeds (before or after) can be calculated using the function `IQR`

as follows:

```
IQR(speedBefore$speed)
IQR(speedAfter$speed)
```

Remember that IQR = Q3 - Q1; so we could have equally calculated IQR as follows:

```
quantile(speedBefore$speed, probs = c(0.75))-quantile(speedBefore$speed, probs = c(0.25))
quantile(speedAfter$speed, probs = c(0.75))-quantile(speedAfter$speed, probs = c(0.25))
```

**Learning and practicing R further using the stickleback lateral plates data**

Let’s practice the concepts of location and spread statistics using yet another data set. The nature of the data often differs from study to study, so working with different data sets helps you further develop your understanding of data structure and of R. We will use the stickleback lateral plates data (lecture 6).

Download the stickleback data file

First open the file “chap03e3SticklebackPlates.csv” in excel and try to understand how the data are organized.

Upload and inspect the data:

```
sticklebackData <- read.csv("chap03e3SticklebackPlates.csv")
View(sticklebackData)
head(sticklebackData)
```

Next, calculate the mean for each genotype. As in natural languages, there are many ways to say something, including words that are synonyms. Instead of the function `aggregate`, here we will do the group-wise calculations (including `IQR` further below) using the function `tapply`; we will also learn how to round values using the function `round`:

```
meanPlates <- tapply(sticklebackData$plates, INDEX = sticklebackData$genotype,FUN = mean)
meanPlates <- round(meanPlates, digits = 1)
meanPlates
```

Calculate the median for each group:

```
medianPlates <- tapply(sticklebackData$plates, INDEX = sticklebackData$genotype,FUN = median)
medianPlates <-round(medianPlates, digits = 1)
medianPlates
```

And the standard deviations for each group:

```
sdPlates <- tapply(sticklebackData$plates, INDEX = sticklebackData$genotype, FUN = sd)
sdPlates <- round(sdPlates, 1)
sdPlates
```

The interquartile range by group:

```
iqrPlates <- tapply(sticklebackData$plates, INDEX = sticklebackData$genotype,FUN = IQR, type = 5)
iqrPlates
```

Let’s learn something new: producing a table containing the number of individuals of each genotype. Here we will use the function `table` as follows:

```
sticklebackFreq <- table(sticklebackData$genotype)
sticklebackFreq
```
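If the behaviour of `table` is not obvious, here is what it does on a small made-up vector (invented labels, not the stickleback data): it counts how many times each distinct value occurs.

```r
# Invented genotype labels, for illustration only
genotype <- c("mm", "Mm", "mm", "MM", "mm")
counts <- table(genotype)
counts[["mm"]]  # 3
```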

The total number of individuals, independent of genotype, can be calculated in several ways in R, but here we can simply sum the table:

`sum(sticklebackFreq)`

The relative frequency of each genotype is given by:

`sticklebackFreq/sum(sticklebackFreq)`

Finally, let’s produce a boxplot. Note how we simply changed the names of the variables used in the first boxplot for the spider data - so it is pretty much a job of copying, pasting and changing the variable names from your previous applications. This is often what we do in R, so keeping files organized and well explained makes it “half of the job done” in R!

`boxplot(plates ~ genotype, data = sticklebackData, ylim = c(0, max(sticklebackData$plates)), col = "goldenrod1", boxwex = 0.5, whisklty = 1, las = 1, xlab = "Genotype", ylab = "Number of plates")`

**Miscellanea - Some advanced notes**

If you are curious about developing stronger R skills, this will help you to practice it further. That said, you won’t need to use these functions in your reports.

Calculate Q1, Q2 and Q3 using the function `aggregate` for both speeds at the same time. An anonymous function, `function(x)`, is used because the function `quantile` needs its argument `probs` to be set:

`aggregate(spiderData$speed, list(spiderData$treatment), function (x) quantile(x, probs = c(0.25, 0.50, 0.75)))`

Note that the above is a little more complex. IQR could have been calculated equally well for both speeds with `aggregate` used in the usual way we learned. Because we are not setting any arguments of the function `IQR`, as we needed to for `quantile`, we don’t need the anonymous `function(x)` in `aggregate`:

`aggregate(spiderData$speed, list(spiderData$treatment),IQR)`

We could also have calculated IQR using the function `tapply` (as seen earlier) as follows:

`tapply(spiderData$speed, INDEX = spiderData$treatment, FUN = IQR)`

And here is the more complicated usage with the function `quantile`, given that `quantile` requires the argument `probs`:

`tapply(spiderData$speed, INDEX = spiderData$treatment, FUN = function(x) quantile(x, probs = c(0.25, 0.50, 0.75)))`

Why all these different ways? As in natural languages, depending on the context we may prefer different forms of expression, e.g., “I have a problem”, “I have an issue” or “I have a dilemma” - these may sound quite similar or quite different depending on the context! The same goes for functions in R. Here the functions `tapply` and `aggregate` were used to produce the same calculations as before, but each of them can do things the other can’t, depending on the circumstances.

**ACADEMIC INTEGRITY & AVOIDING ACADEMIC MISCONDUCT.** We hold the principles of academic integrity - that is, honesty, responsibility and fairness in all aspects of academic life - in the highest regard.

Some of our tutorials take years to perfect, and they are built to enhance your learning of biostatistics. Some of the reports here are similar to those of previous years, and you are not allowed to consult reports from students in this year or previous years. WE HAVE SOFTWARE that can calculate the similarity between reports from students this year and in past years. This may trigger academic infraction charges against you and against any previous students who passed you their reports, even after they graduate: academic infractions can be brought against students after graduation, and their records can be modified to reflect past academic misconduct.

Using previous reports also demonstrates a complete lack of respect for the work of professors and instructors.

**Report instructions**

Describe a fictional biological problem (e.g., an experiment or observational study you designed) that has one quantitative variable. Observational units should be divided into two categories (e.g., male/female, genotype, freshwater/marine water, low/high pH, or any type of categorical variable with two categories). Invent a data set (make up numbers) for the problem at hand and place it in a csv file. You can use Excel and then save the file in `csv` format.

Then create R code in which the data file is read and you calculate the mean, median, standard deviation, variance and interquartile range for each category. Plot the boxplot as well. You can describe your fictional problem using `#` to comment the code; for example:

```
# I set up a study that .....
# use as many lines as necessary,
# always starting with #
Code here:
```

Submit the RStudio file and the relevant csv file via Moodle. We have a lot of experience with tutorials, and they are designed so that students can complete them by the end of the tutorial period. No written report (e.g., in a Word file) is necessary - just the two files.