Introduction to R
1: What is R?
R is a statistical programming language that is very commonly used for data management and analyses throughout the sciences.
In this online tutorial, we will practice the basics of working with R. There are boxes of R code embedded throughout this tutorial that you can run and edit.
In your own analyses, you would have a choice of how to use R:
Download the current version of R and write code within the R software environment directly
Download both R and an Integrated Development Environment (IDE) for R, which provides additional tools in the interface for working with your code documents. (Think of this, roughly, as the difference between writing a text file in Notepad and using a word processor like Microsoft Word.) RStudio is a very commonly used (and free) IDE for R, but others are also available.
Use a cloud-based IDE for R, which allows you access to equivalent tools without needing to download anything.
After this tutorial, we will be using Posit Cloud to work with R. Posit Cloud is an online platform through which you can access the RStudio development environment without needing to download or install anything. Posit Cloud allows us to share files with each other, so I can access your code without you needing to save or submit any files, and we can co-edit your work. It does have some limitations on processing speed, and it is a paid service, with costs that increase with usage time per month. So, in your own analyses after this class, you may prefer to download R and RStudio to avoid such limitations.
R is almost infinitely customizable. By default, it can perform a wide range of mathematical operations and statistical calculations on data tables.
Thousands of user-written packages of additional specialized R functions are available for free, and give you extra tools to carry out various tasks more efficiently.
Between base R and these packages, we can use R for summarizing data, statistically analyzing data, and data visualization, including very specialized methods used in particular fields (phylogenetic analysis, map-making, etc.).
If a task involves data, there is probably a way to do it in R.
2: Commands
To use R, you have to write and execute commands. You can give R one command at a time to run directly, or you can write a script - a file containing a series of commands.
We will start getting practice with writing and executing commands by using R to perform some basic arithmetic and data summaries.
Arithmetic Commands
In the box below, there is a command to add two numbers in R. Before you change anything in this code, click Run Code to see how it returns the output of this code.
After you observe the output, replace the numbers with two other values of your choice and hit Run Code again to see the new answer.
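As a static sketch of the kind of command described here, the block below adds two numbers. The values are an assumption chosen for illustration, not the numbers from the interactive box, and the line starting with # is a note for the reader rather than a command.

```r
# Illustrative values (an assumption, not the tutorial's original numbers)
2 + 3
```

Run in an R console, this prints [1] 5.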
Notice that the command is preceded by ‘1’, and the answer is preceded by ‘[1]’. The ‘1’ is a line number in the code box, and the ‘[1]’ marks the position of the first value in the output. Neither is part of the code or the answer.
With multiple lines in the command, this becomes clearer (hit Run Code below):
Now we can see that the three lines of code are numbered 1 - 3, but each answer is still preceded by [1]. That’s because each is a separate answer from a separate command, so each is the first (and only) value of its own result.
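A minimal sketch of a multi-line example, with three assumed commands standing in for the ones in the interactive box:

```r
# Three separate commands with assumed values; each prints its own result
2 + 3
10 - 4
6 * 7
```

Each command returns its own one-value result, so the console shows [1] 5, then [1] 6, then [1] 42.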
Other basic arithmetic is formatted similarly with the appropriate symbol. In the box below, there is one example each of the common mathematical operations: addition +, subtraction -, multiplication *, division /, exponents ^, natural logarithm log(), and square root sqrt().
Starting with this example, you will see that there are comments in the box in addition to the commands. Comments allow you to write text adjacent to your code, without R trying to interpret the text as code. Comments have to be preceded by the # symbol to mark that they should not be run as code. Comments can be on their own lines or after a command on the same line.
Click Run Code below to see the result.
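The operations and comment styles described above can be sketched as follows. The numbers are assumptions chosen so the results are easy to check by hand. They are not the values from the original example.

```r
# Illustrative values throughout (assumptions, not the original example).
# This comment sits on its own lines; the ones below follow a command.
5 + 3      # addition
5 - 3      # subtraction
5 * 3      # multiplication
6 / 3      # division
2 ^ 3      # exponent: 2 raised to the power of 3
log(1)     # natural logarithm (base e)
sqrt(16)   # square root
```

With these values, the seven results are 8, 2, 15, 2, 8, 0, and 4. Note that log() with a single number computes the natural logarithm.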
Now try adding your own comment to the box and run the code again. Finally, try removing the # from a comment and run the code again to see what happens when a comment is not correctly marked.
You can hit the Start Over button (↻) in the box to reset the edited code to the original example.
There won’t be a Start Over button in Posit Cloud, but we will have another strategy for setting up your work in a way that always lets you get back to the original starting point. Don’t worry, you can’t break anything permanently!
Formatting Code
As you work more with R, you will find that it is very strict about formatting in some respects, and very loose or inconsistent in others. This is the nature of working with an open-source language, especially when working with packages written by many authors that might have different styles.
We will learn more about formatting requirements as we learn different skills in R throughout the semester (often through trial and error), but we will start here with a couple of examples.
R is, generally, flexible about the use of spaces around operators like the arithmetic symbols we have practiced with. Above, we had a space between each number and its mathematical operator. Below are the same commands with more or fewer spaces. Run this code to confirm that you get the same result.
Try to change the spacing and run again.
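As a sketch of the spacing flexibility described above, the three commands below differ only in the spaces around the operator. The values are assumed for illustration.

```r
# The same calculation with different spacing (assumed values)
5+3
5 + 3
5    +    3
```

All three commands print [1] 8.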
While we saw flexible formatting in this example, in other cases a command will fail without the correct formatting. For instance, function names are case-sensitive (as we will see soon), and symbols are usually not flexible (e.g., text in quotation marks must open and close with the same kind of mark, single or double).
As we progress, you will encounter various common coding errors, many related to formatting issues. If you are experiencing an unexpected error in your R code, one of the first things you should check is whether there is a formatting issue. This will be frustrating, but is just part of working with R! It becomes second nature after some practice. It is a good idea to keep a list of such issues to use as a troubleshooting reference in future code.
Even where R gives formatting flexibility, it is good practice to develop and use a consistent code style that works best with your habits of how you read, write, and edit code. Where R gives you flexibility, try to make the same choice consistently (do you use a space or not? when do you use a line break?). We’ll keep practicing this throughout the semester - it’s not the main priority right now, but it is a good time to start developing good habits.
3: Objects and Functions
It’s rare that we want to do arithmetic calculations in R by typing values directly into a command.
Instead, we almost always want to write commands that reference data frames - tables of data that have been loaded into R to work with.
In this class, we will sometimes use sample datasets that are directly accessible within R, but we will mostly import datasets of interest - after all, that is what you will need to do to work with your own data! We will almost never enter data the way that is shown in the example below (except when we need to create a small sample dataset for demonstration). We will discuss how data import works in Posit Cloud another time.
In this demonstration, we will write a command to create a small dataset. We will then use that dataset to practice some summary statistics.
A summary statistic is any value that is calculated to summarize a whole group of numbers: means, medians, minima and maxima, ranges, standard deviations, etc.
For instance, if I want to calculate the mean of five numbers, I could manually code all of the necessary arithmetic:
… but that would get tedious very quickly. You don’t want to have to type all your data in as a command every time! It’s also not a replicable process if we update our dataset - we’d have to rewrite the command.
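The manual calculation described above might look like the following sketch. The five values are an assumption made for illustration, since the original numbers are not shown here.

```r
# Mean of five assumed values, written out by hand:
# add them all up, then divide by how many there are
(2 + 4 + 6 + 8 + 10) / 5
```

This prints [1] 6, but the values are locked inside the command, which is exactly the problem described above.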
What else can we do?
Objects
I could instead enter my data once, and save it as a named object in R.
Objects are a fundamental unit of information in R. An object is any information that is named and stored for later use in the programming environment, and could contain a single value, text, a data table, a list, a command or series of commands, a graphical plot, and more. Understanding how to create and assign values to objects is a key skill in R.
Here we are creating an object named data, and using the assignment operator <- to define what we want to save to the object.
We then use the c() function to list the values we want in our dataset. Inside the parentheses of the function, we provide a comma-separated list of the values in the dataset.
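A minimal sketch of this step is shown below. The object name data, the assignment operator <-, and the c() function come from the description above, while the five values are an assumption chosen for illustration.

```r
# Save five assumed values in an object named data
data <- c(2, 4, 6, 8, 10)
```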
Functions and Arguments
Functions and arguments will be fundamental to your work with R. Functions are shortcuts for code someone else has already written to carry out a certain task. Each function has a name followed by parentheses. When you need to use a function in a command, you first write the name of the function, and then provide the function whatever additional information it needs (arguments) within the parentheses. Different functions have different requirements for the input format within the parentheses. We’ll come back to all of that shortly and will get a lot of practice with different functions throughout the semester.
If I run the code above as-is, you will see that it returns no output. That is because the command does not request any output - it just requests that the information is saved, which is done silently with no immediate feedback from the code. You can see the saved object by simply using the name of the object alone as a command:
And now we can see that it returns the list of numbers we saved.
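As a sketch, again with assumed values, the second line below is a complete command consisting only of the object's name:

```r
# Assumed values, repeated here so this block runs on its own
data <- c(2, 4, 6, 8, 10)

# Typing the object's name alone prints its contents
data
```

This prints the five saved values, preceded by [1].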
Now that we have an object saved, we could write a command manually to calculate the mean of the values in that object, rather than adding the values to our command directly.
We know that a mean is the sum of all values in a list divided by the number of values in that list, so one option would be to use the sum() and length() functions, which add up and count the values in a list, respectively.
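The sum() and length() approach could be sketched like this, using the same assumed values as before:

```r
# Assumed values, repeated here so this block runs on its own
data <- c(2, 4, 6, 8, 10)

sum(data)                 # adds up the values: 30
length(data)              # counts the values: 5
sum(data) / length(data)  # the mean: 6
```

Because the command refers to data rather than to the numbers themselves, it would still work unchanged if the values saved in data were updated.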
But, there is also a function specifically for calculating a mean that would make this a lot easier:
In the code above, try changing the capitalization of mean() to Mean(), and run the code again. You’ll see that you get an error. As hinted earlier, the names of functions are case-sensitive. When you follow examples in writing your own code, make sure you are using the same capitalization.
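A sketch of the mean() approach, with the same assumed values, is shown below. The capitalized version is left as a comment so that the block runs without stopping.

```r
# Assumed values, repeated here so this block runs on its own
data <- c(2, 4, 6, 8, 10)

# The built-in function for a mean; prints [1] 6
mean(data)

# Mean(data)  # removing the first # gives an error: base R has no function named Mean
```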
As we’ve seen, there are multiple ways to solve the same problem in R: there was more than one correct way to write a command to calculate the mean of our values. This will frequently be true in our work in R.
You may find that my approach differs from the approach you find in other examples on the internet, or that your code differs from another person’s in the class while still producing the same result. That’s okay!
Some approaches are more efficient than others (in terms of either your own time to write it or the time needed for the computer to execute the command), but we don’t need to worry about maximizing the efficiency of our code at this stage.
4: Summary
By the end of this tutorial, you should know what a command, an object, and a function are in R. You should also know how to create objects using the assignment operator. Those are our core building blocks of working with data via code. You should also know what a comment is, and how to mark the difference between a comment and code.
Everything else will build from here - we can move to Posit Cloud for next steps.