Tutorial 2: The R and RStudio environments

Week of September 9, 2024 (2nd week of classes)


How the tutorials work
The in-person lab sections start this week. The one last week (tutorial 1) was asynchronous. This week’s tutorial is number 2. Tutorials = Labs; we use these terms interchangeably. Lab/Tutorial attendance is not mandatory but reports are.

The DEADLINE for your report is always at the end of the tutorial.

The INSTRUCTIONS for this report are found at the end of the tutorial.
With some experience, students may be able to complete their reports without attending the in-person lab/tutorial sessions. That said, we highly recommend that you attend the tutorials, as your TA is not necessarily responsible for assisting you with lab reports outside of the in-person lab sections.

Your TAs

Section 0101 (Tuesday): 13:15-16:00 - Aliénor Stahl
Section 0103 (Thursday): 13:15-16:00 - Alexandra Engler

General Information

This tutorial is meant to help you get acquainted with the R environment for statistical computing and its basic commands, ways to handle data and plot graphs.


General setup for the tutorial

A helpful approach is to have both the tutorial open in our WebBook and RStudio side by side. Please note that in our tutorials, code results are shown after ‘##’ (two hashtags), but this will not occur in RStudio. Additionally, we do not always display the output of every operation in the tutorial, as it is often provided for demonstration purposes only.


Your first R steps

This tutorial includes code adapted from R and Data Mining: Examples and Case Studies by Yanchang Zhao (2013), as well as from Raphael Gottardo’s lectures on R and basic statistics (Exploratory Data Analysis and Essential Statistics using R).

Below, you’ll find the results of various operations. Outputs are indicated by ‘##’ at the start of each result. For example, in the first set of commands, when you enter 1 + 1, the result 2 will be displayed, preceded by ‘##’.

1+1
## [1] 2
exp(-2)
## [1] 0.1353353
pi
## [1] 3.141593
exp(10)
## [1] 22026.47
round(3.123,digits=0)
## [1] 3
round(3.123,digits=2)
## [1] 3.12
round(sin(178.54),digits=3) # you can combine functions, e.g., sin and round, to produce a final value
## [1] 0.506
sqrt(10)
## [1] 3.162278

Note: The symbol # is used to add comments to the code, meaning any text that appears after # on a line will not be executed by R when you press Enter. For example, in the command -1/0 # don’t be afraid, the text after # is simply a comment.
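
As a small illustration of the note above, the sketch below shows a whole-line comment and two end-of-line comments. The comment wording is ours; only the arithmetic is executed by R.

```r
# A whole-line comment: R ignores this line entirely
-1/0   # dividing by zero does not stop R; it returns negative infinity
## [1] -Inf
2 + 3  # everything after the # is ignored, so only 2 + 3 is evaluated
## [1] 5
```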

When assigning values to a variable, keep in mind that the term ‘variable’ in computer programming differs from its usage in statistics, as discussed in class. In programming, a variable (or scalar) refers to a storage location identified by a memory address, paired with an associated symbolic name (an identifier), which holds a value—either known or unknown.

x=2 # variable here is x and it contains the number 2
y=2
x+y
## [1] 4
c=2

Creating a sequence of numbers:

x=0:5
x
## [1] 0 1 2 3 4 5

Note that many R users prefer to use the <- symbol, known as the assignment operator, instead of = (equal) for assigning values. While this distinction may not seem immediately important for this course, it’s something to be aware of when reading R materials both online and in this course. In our tutorials, we generally use <-, but for all purposes in this course and R programming, either <- or = is perfectly acceptable.

Re-creating a sequence of numbers now using <-:

x<-0:5
x
## [1] 0 1 2 3 4 5

It’s the same thing!
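
If you want to convince yourself, the short check below (with the hypothetical names a and b) uses identical() to compare the two results. It also shows the one place where = has a different job: inside a function call, = gives a value to a named argument, as in digits = 2.

```r
a = 0:5      # assignment with =
b <- 0:5     # assignment with <-
identical(a, b)
## [1] TRUE

# Inside a function call, = names an argument (here, digits) rather than creating a variable
round(3.123, digits = 2)
## [1] 3.12
```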

Your first plot in R:

x<-1:7
x
## [1] 1 2 3 4 5 6 7
y<-11:17
y
## [1] 11 12 13 14 15 16 17
plot(x,y)

Let’s plot the same graph but with red dots instead of black as above:

plot(x,y,col="red")

Notice that the vectors x and y didn’t need to be redefined, as they are stored in your computer’s memory. However, once you close R, these vectors will be erased from memory. Later in this tutorial, you will learn how to save scripts so you don’t need to retype commands.

Creating a series of numbers (in a vector) by using function c (for combine values):

x <-  c(2,3,5,2,7,1)
x
## [1] 2 3 5 2 7 1

More on calculations and dealing with vectors:

weight <- c(60,72,75,90,95,72)
# calling a particular element of the vector by its position
weight[1]
## [1] 60
weight[2]
## [1] 72
weight
## [1] 60 72 75 90 95 72
height <- c(1.75,1.80,1.65,1.90,1.74,1.91)
bmi <- weight/height^2 # body mass index: weight divided by height squared (assuming kg and m)
bmi
## [1] 19.59184 22.22222 27.54821 24.93075 31.37799 19.73630
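
Arithmetic on vectors works element by element: the first value of one vector is combined with the first value of the other, and so on. The sketch below uses invented values and hypothetical names (lengths_cm, widths_cm) to show this, and also how to pull out several elements at once.

```r
# Invented values for illustration only
lengths_cm <- c(10, 20, 30)
widths_cm  <- c(2, 4, 5)

lengths_cm * widths_cm    # first times first, second times second, ...
## [1]  20  80 150
lengths_cm / widths_cm
## [1] 5 5 6
lengths_cm^2              # the same operation applied to every element
## [1] 100 400 900
lengths_cm[c(1, 3)]       # selecting the first and third elements together
## [1] 10 30
```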

Let’s calculate the mean of a series:

(1+2+3+4+5)/5
## [1] 3

One of the major strengths of R lies in its ability to use scripted functions to calculate metrics of interest and perform statistical analyses. Here are two examples:

sum(height)
## [1] 10.75
mean(height)
## [1] 1.791667
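
To see what mean() is doing, you can reproduce it by hand: divide the sum by the number of values, which the function length() returns. The name manual_mean below is just an illustrative choice.

```r
height <- c(1.75, 1.80, 1.65, 1.90, 1.74, 1.91)  # same values as above
n <- length(height)              # number of values in the vector
n
## [1] 6
manual_mean <- sum(height) / n   # the mean computed "by hand"
manual_mean
## [1] 1.791667
all.equal(manual_mean, mean(height))
## [1] TRUE
```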

We will learn lots of functions in BIOL322.

R allows you to assign values calculated by functions (such as the mean) to a variable. You can name a variable almost anything you like; a safe rule is to start with a letter and use only letters, digits, dots and underscores (no spaces). In this example, I used mean.x for clarity, but you could just as easily name it doNotGetDistracted or anything else you prefer:

mean.x<-mean(height)
mean.x
## [1] 1.791667
doNotGetDistracted<-mean(height)
doNotGetDistracted
## [1] 1.791667

We can sort values in ascending order:

sort(x) 
## [1] 1 2 2 3 5 7

or descending order:

sort(x,decreasing = TRUE) 
## [1] 7 5 3 2 2 1

Earlier, we learned that R functions come with ‘default’ settings. For example, the default behaviour of the sort function is to arrange numbers in ascending order. However, you can change this by setting the ‘decreasing’ argument to TRUE; by default, ‘decreasing’ is set to FALSE. You can find this information and other details about the function by requesting help as shown below:

?sort
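
Defaults can also be seen directly in the console. In the sketch below, round() is called with and without its digits argument, and sort() is called with and without decreasing = FALSE to show that spelling out a default changes nothing. The vector name values is only for this illustration.

```r
round(3.123)               # digits is not given, so the default (digits = 0) is used
## [1] 3
round(3.123, digits = 1)   # overriding the default
## [1] 3.1

values <- c(2, 3, 5, 2, 7, 1)
# Writing the default explicitly gives exactly the same result as leaving it out
identical(sort(values), sort(values, decreasing = FALSE))
## [1] TRUE

args(sort)                 # lists the arguments of sort and their default values
```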

We’ll dive deeper into functions throughout the tutorials, so there’s no need to worry about their defaults or how they work just yet. These concepts will be introduced gradually as we progress through the material.


Reading data files

Let’s learn now how to read a file containing data. Download the file example_file.csv:

Download data file

This file is in the widely used CSV (comma-separated values) format, which can be created by programs like Excel, among others. CSV files are popular for data storage and sharing (learn more here: https://en.wikipedia.org/wiki/Comma-separated_values). While R can read various file formats, CSV and text files (.txt) are among the most commonly used. In BIOL322, CSV is the preferred format.

Open the file in Excel to see how it is structured. It contains two variables (columns), which the plotting command later in this section refers to as bodySize and wingSize.

The ‘read.csv’ function reads a file in CSV format and stores the data in a variable (in the example below, we named this variable my.first.data). The function ‘file.choose()’ prompts R to open a window, allowing you to select the file by clicking. While there are other methods to read files in R, this is the simplest and most user-friendly approach for beginners.

my.first.data<-read.csv(file.choose())
my.first.data

Let’s open the data in a window in which you can see the values in a much better format:

View(my.first.data)

Often a data set is quite large (lots of rows) and you just want to look at the first few rows and see the names of the variables (columns):

head(my.first.data)

Let’s produce a scatter plot of the data:

plot(wingSize~bodySize,data=my.first.data)
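
Because file.choose() needs you to click on a file, here is a self-contained sketch you can run without downloading anything. It builds a tiny data set with invented values (they are not the contents of example_file.csv), writes it to a temporary CSV file, and reads it back by its path. The names toy, toy_path and toy.data are hypothetical, and the $ operator, used here to pick one column by name, is not otherwise covered in this tutorial.

```r
# Invented values for illustration only; not the contents of example_file.csv
toy <- data.frame(bodySize = c(1.2, 1.5, 1.9, 2.3),
                  wingSize = c(3.1, 3.6, 4.4, 5.0))
toy_path <- tempfile(fileext = ".csv")      # a throwaway file location
write.csv(toy, file = toy_path, row.names = FALSE)

toy.data <- read.csv(toy_path)   # read by giving the path instead of file.choose()
names(toy.data)                  # names of the variables (columns)
## [1] "bodySize" "wingSize"
nrow(toy.data)                   # number of rows
## [1] 4
mean(toy.data$wingSize)          # $ selects one column by name
## [1] 4.025
```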


A small tour in RStudio

The default R interface can feel somewhat basic, but RStudio provides a more user-friendly environment, making it easier to work with R.

The four corners of RStudio
RStudio features four main panes that help organize the various tasks you can perform in R, such as writing and running code, viewing graphs, and manipulating data.

How to set the working directory
The ‘Working Directory’ is the folder where R will, by default, look for files you want to open (e.g., data files, scripts) and where it will save any files you create. In RStudio, you can set your Working Directory by navigating to ‘Session -> Set Working Directory -> Choose Directory…’. It’s often a good idea to create separate directories for different projects to keep your work organized.
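
The same thing can be done with commands: getwd() reports the current Working Directory and setwd() changes it. In the sketch below, tempdir() (a temporary folder that R creates for each session) is only a stand-in for one of your own project folders, and old_dir is a hypothetical name.

```r
old_dir <- getwd()    # remember the current working directory
setwd(tempdir())      # switch to another folder (a stand-in for your project folder)
getwd()               # now reports the temporary folder
list.files()          # files that R can open from here without a full path
setwd(old_dir)        # switch back to where we started
```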

How to create and save R scripts
Entering and running code directly in the R command line is straightforward. However, a downside is that you must retype your commands each time you want to run them again. This issue is easily resolved by creating and saving R scripts.

In BIOL322, we will work primarily with R scripts—files that contain a series of commands and comments. These scripts can be saved for later use, allowing you to rerun commands from previous analyses or adapt older scripts for new data or applications. Think of R scripts as a blank canvas where you can organize your analyses, document your code, and note decisions made during coding. This will become clearer as you gain more experience with R.

To create a new script in RStudio, click on the ‘New File’ icon in the top left corner of the RStudio toolbar and select ‘R Script’ (see figure below). Once the new script is created, it opens in the script editor (the upper-left pane), ready for code entry.

How to execute R scripts
To execute the code written (i.e., the input) in an R script, place the cursor on the line (or select several lines) and use the shortcut Ctrl+Enter on Windows or Cmd+Enter on macOS. You will see the output of your code in the console tab. See a screenshot below.

Notice that RStudio assigns a number to each line of code. These line numbers will become particularly useful as you work with more complex code and need to track multiple outputs.

Saving your newly created R script is straightforward. You can click the Save icon at the top of the script editor or use the shortcut Ctrl+S (or Cmd+S on Mac). To open a previously saved R script, simply click ‘File -> Open File’ or use the shortcut Ctrl+O (or Cmd+O on Mac).

Importing data into R
In common applications, we use a data format called CSV (comma-separated values). Data in Excel, for instance, can be saved as CSV. Importing data into RStudio is fairly simple using the function read.csv(). In this example, we will assign our dataset (here, the file example_file.csv, which must be located in your Working Directory) to an object “mydata” using <- (an arrow formed out of < and - ).

mydata <- read.csv("example_file.csv")

Once the object mydata is created, it will automatically appear in your workspace, shown in the Environment tab (upper right window in RStudio). To view your dataset, you can click on the object or use the function View() in the console as follows:

View(mydata)

Exporting files from R
Throughout the tutorials and coursework, you will need to export data or results generated in R and save them in various formats, such as CSV. The write.csv() function allows you to create CSV files, which will automatically be saved in your Working Directory. Here’s an example:

write.csv(my.first.data, file= "body.csv", row.names = FALSE)

Setting the argument row.names to FALSE avoids generating an extra column with row numbers in the saved file. That extra column may be useful in other applications.
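
To see the effect of row.names for yourself, the sketch below writes the same small data set twice, once with the default (row.names = TRUE) and once with row.names = FALSE, and then reads both files back. The values and the names small, with_rows and without_rows are invented for this illustration, and the files go to temporary locations rather than your Working Directory.

```r
small <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))   # invented values
with_rows    <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")

write.csv(small, file = with_rows)                        # default: row.names = TRUE
write.csv(small, file = without_rows, row.names = FALSE)

names(read.csv(with_rows))       # the row numbers come back as an extra column
## [1] "X" "a" "b"
names(read.csv(without_rows))    # only the two original columns
## [1] "a" "b"
```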

Saving and loading workspaces
You can save the objects (variables, matrices, results of analyses, etc.) currently loaded into RStudio using the function save.image(). The collection of objects created during an R session is called the “workspace”. By typing save.image(file="Tutorial.RData") in the console, you will save the workspace. Using the command load("Tutorial.RData"), you can reload all the previous work contained in that workspace.
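
A minimal sketch of the full round trip is shown below, assuming you run it at the console or from a script (so that your objects live in the workspace). The object tutorial_values and the temporary file location are hypothetical; in your own work you would normally save to your Working Directory as described above.

```r
# Hypothetical object and file location, used only for this illustration
workspace_file <- file.path(tempdir(), "Tutorial.RData")
tutorial_values <- c(10, 20, 30)

save.image(file = workspace_file)   # save every object currently in the workspace
rm(tutorial_values)                 # remove the object from memory
exists("tutorial_values")
## [1] FALSE
load(workspace_file)                # restore the saved objects
tutorial_values
## [1] 10 20 30
```

Note that load() restores objects under their original names, so it will overwrite any object of the same name that is already in memory.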

You will learn lots more about R and graphs in the tutorials. For now, you have finished your first tutorial! Don’t hesitate to ask your TA if you have additional questions.


Report instructions

Reports are based on completing a series of tasks using R and RStudio. To solve these tasks, you will make simple adaptations to the code learned in the corresponding tutorial. Write all your code in an RStudio file, save it, and submit it via Moodle, along with the CSV file generated below. While attendance at the tutorials is not mandatory, you are required to complete and submit the corresponding reports by their deadlines. No written report (e.g., in a Word file) is necessary.

Your report should consist of the following:

  1. Create 2 series of 8 numbers each (any numbers you want) using the c command as learned above.

  2. Calculate the mean (using the function mean seen above) of each series.

  3. Plot them using the command plot above.

  4. Create a csv file with 7 rows and two columns representing two variables, nitrogen and temperature. Plot these values with temperature on the Y axis and nitrogen on the X axis.

That’s your report for today!