Chapter 1 Introduction
Welcome! In this guide, we will teach you about statistics using the statistical software R with the interface provided by R Studio. The purpose of this chapter to is provide you with a set of activities that get you up-and-running in R quickly so get a sense of how it works. In later chapters we will revisit these same topics in more detail.
1.1 A focus on workflow
An important part of this guide is training you in a workflow that will avoid many problems than can occur when using R.
1.2 R works with plug-ins
R is a statistical language with many plug-ins called packages that you will use for analyses. You can think of R as being like your smartphone. To do things with your phone you need an App (R equivalent: a package) from the App Store (R equivalent: CRAN). Apps need to be downloaded (R equivalent: install.packages) before you can use them. To use the app you need Open it (R equivalent: library command). These similarities are illustrated in Table 1.1 below.
Smart Phone Terminology | R Terminology |
---|---|
App | package |
App Store | CRAN |
Download App from App Store | install.packages(“apaTables”, dependencies = TRUE) |
Open App | library(“apaTables”) |
1.3 Software Choice
The software for this course is free for Mac and PC/Windows. If you do not have a Mac or PC but instead are working with a tablet - contact the instructor during office hours.
1.3.1 Mac
On a Mac you need to download and install R and RStudio. The RStudio software is a “front end” for using R. Although you need to download and install both - you will only ever load RStudio.
1.3.1.1 Mac Download R
Go to this link to download the latest version of R. On the screen that appears select “Download R for (Mac) OS X”.
On the next screen you will see something like the image below. However, the number (4.0.3) will be different you. Download this file.
Run the file you just downloaded to install R.
Remember you will never run R - you just need it installed so you can use RStudio.
Now install RStudio (see below)
1.3.1.2 Mac Download RStudio
After installing R you will install RStudio. This is the software you will actually use in this course.
- Go to this link to download the latest version of RStudio. On the page that appears scroll down until you see the following:
Look at the line for Mac users and download the file. Here it is: RStudio-1.3.1093.dmg your numbers will be different - but this is the file you need. Download it.
Open the file you downloaded. You should see something like the screenshot below. Drag the RStudio icon to the Applications folder to install it.
- In the future, when you want to run RStudio you should do so from the Applications folder. You can access the Applications folder using the Menu shown below. This will open the applications folder.
- In the Applications folder there is now an RStudio icon. Click this icon to run RStudio each time you use it. You can also drag the RStudio icon in the Applications folder to the Dock for easy access if you prefer.
1.3.2 PC
1.3.2.1 PC Download R
Go to this link to download the latest version of R. On the screen that appears select “Download R for Windows”.
On the next screen you will see something like the image below. Select “base”.
- On the next screen you will see something like the image below. As before the specific file numbers will differ for you. Select “Download R 4.0.3 for Windows” and download it.
Run the file you just downloaded to install R.
Remember you will never run R - you just need it installed so you can use RStudio.
Now install RStudio
1.3.2.2 PC Download RStudio
After installing R you will install RStudio. This is the software you will actually use in this course.
- Go to this link to download the latest version of RStudio. On the page that appears scroll down until you see the following:
Look at the line for “Windows 10/8/7” users and download the file. Here it is: RStudio-1.3.1093.exe your numbers will be different - but this is the file you need. Download it.
Open the file you downloaded and run it to install RStudio.
In the future you will only use RStudio to access use R.
1.4 Run RStudio not R
As noted above, in the future you will run RStudio to use R. That is, you click the icon below to use RStudio moving forward. On macOs make sure it is the icon in the Applications folder. On Windows make sure it is the icon in the Program Files folder.
1.5 RStudio configuration, First run only
Prior using RStudio it is important it to configure it properly to avoid problems getting your code to work. Therefore, the very first time you run RStudio you should configure it correctly.
As illustrated below, go to the Tools menu and select Global Options…
Next, ensure General is selected on the left side. Then ensure that “Restore .RData into workspace at startup” is not checked. I repeat, NOT CHECKED. Also, make sure that “Save workspace to .RData on exit” is set to NEVER. Then click the OK button at the bottom of the window.
1.6 Begin with an RStudio Project
It can be hard to keep track of files you use and create when using RStudio. Therefore, you should always use an RStudio Project to keep track of things. This is basically just a Folder on your computer that RStudio associates with a particular task. Later in this introduction to RStudio we will use the file data_okcupid.csv. Right now we will set up a Project to make it easier to use that file.
Create a folder on your computer called “psyc3290_project_1”. Be sure you know where you create this folder so you can find it again easily.
Download the data_okcupid.csv file. NOTE:You must get this file from your Downloads folder. For many people (on macOS) the file may Open in Numbers - then they are tempted to save it from Numbers. This will not work. Just close Numbers. Then retrieve the file from you Downloads folder.
Place the data_okcupid.csv file into the psyc3290_project_1 folder.
Load RStudio.
Go to the File menu and select “New Project…”
You will see the window below. Select “Existing Directory”.
- You will see the window below. Click the “Browse” button and then locate the folder you created. Once this is done click the “Create Project” button.
- You’re ready to begin. Follow this process of creating an RStudio project for each class project/activity.
1.7 Exploring the R Studio Interface
Once you have opened (or created) a Project folder, you are presented with the R Studio interface. There are a few key elements to the user interface that are illustrated in Figure 1.1 In the lower right of the screen you can see the a panel with several tabs (i.e., Files, Plots, Packages, etc) that I will refer to as the Files pane. You look in this pane to see all the files associated with your project. On the left side of the screen is the Console which is an interactive pane where you type and obtain results in real time. I’ve placed two large grey blocks on the screen with text to more clearly identify the Console and Files panes. Not shown in this figure is the Script panel where we can store our commands for later reuse.
1.7.1 Console panel
When you first start R, the Console panel is on the left side of the screen. Sometimes there are two panels on the left side (one above the other); if so, the Console panel is the lower one (and labeled accordingly). We can use R a bit like a calculator. Try typing the following into the Console window: 8 + 6 + 7 + 5. You can see that R immediately produced the result on a line preceded by two hashtags (##).
## [1] 26
We can also put the result into a variable to store it. Later we can use the print command to see that result. In the example below we add the numbers 3, 0, and 9 and store the result in the variable my_sum. The text “<-” indicate you are putting what is on the right side of the arrow into the variable on the left side of the arrow. You can think of a variable as cup into which you can put different things. In this case, imagine a real-world cup with my_sum written on the outside and inside the cup we have stored the sum of 3, 0, and 9 (i.e., 12).
We can inspect the contents of the my_sum variable (i.e., my_sum cup) with the print command:
## [1] 12
Variable are very useful in R. We will use them to store a single number, an entire data set, the results of an analysis, or anything else.
1.7.2 Script Panel
Although you can use R with just with the Console panel, it’s a better idea to use scripts via the Script panel - not visible yet. Scripts are just text files with the commands you use stored in them. You can run a script (as you will see below) using the Run or Source buttons located in the top right of the Script panel.
Scripts are valuable because if you need to run an analysis a second time you don’t have to type the command in a second time. You can run the script again and again without retyping your commands. More importantly though, the script provides a record of your analyses.
A common problem in science is that after an article is published, the authors can’t reproduce the numbers in the paper. You can read more about the important problem in a surprising article in the journal Molecular Brain. In this article an editor reports how a request for the data underlying articles resulted in the wrong data for 40 out of 41 papers. Long story short – keep track of the data and scripts you use for your paper. In a later chapter, it’s generally poor practice to manipulate or modify or analyze your data using any menu driven software because this approach does not provide a record of what you have done.
1.8 Writing your first script
1.8.1 Create the script file
Create a script in your R Studio project by using the menu File > New File > R Script.
Save the file with an appropriate name using the File menu. The file will be saved in your Project folder. A common, and good, convention for naming is to start all script names with the word “script” and separate words with an underscore. You might save this first script file with the name “script_my_first_one.R”. The advantage of beginning all script files with the word script is that when you look at your list of files alphabetically, all the script files will cluster together. Likewise, it’s a good idea to save all data files such that they begin with “data_”. This way all the data files will cluster together in your directory view as well. You can see there is already a data file with this convention called “data_okcupid.csv”.
You can see as discussed previously, we are trying to instill an effective workflow as you learn R. Using a good naming convention (that is consistent with what others use) is part of the workflow. When you write your scripts it’s a good idea to follow the tidyverse style guide for script names, variable name, file names, and more.
1.8.2 Add a comment to your script
In the previous section you created your first script. We begin by adding a comment to the script. A comment is something that will be read by humans rather than the computer/R. You make comments for other people that will read your code and need to understand what you have done. However, realize that you are also making comments for your future self as illustrated in an XKCD cartoon.
A good way to start every script is with a comment that includes the date of your script (or even better when you installed your packages, more on this later). Like smartphone apps, packages are updated regularly. Sometimes after a package is updated it will no longer work with an older script. Fortunately, the checkpoint package lets users role back the clock and use older versions of packages. Adding a comment with the date of your script will help future users (including you) to use your script with the same version of the package used when you wrote the script. Dating your script is an important part of an effective and reproducible workflow.
Moving forward, I suggest you use comments to make your own personal notes in your own code as your write it. Note that in the above comment I used the internationally accepted date format order Year/Month/Day created by the International Organization for Standardization (ISO). Some people use the mnemonic You’re My Dream to remember the Year Month Day order. Wikipedia provides more information about this International Date Format (ISO 8601). An XKCD cartoon highlights the ISO date format:
1.8.3 Background about the tidyverse
There are generally two broad ways of using R, the older way and the newer way. Using R the older way is referred to as using base R. A more modern approach to using R is the tidyverse. The tidyverse represents a collection of packages the work together to give R a modern workflow. These packages do many things to help the data analyst (loading data, rearranging data, graphing, etc.). We will use the tidyverse approach to R in this guide.
A noted the tidyverse is a collection of packages. Each package adds new commands to R. The number of packages and correspondingly the number of new commands added to R by the tidyverse is large. Below is a list of the tidyverse packages:
## [1] "broom" "conflicted" "cli"
## [4] "dbplyr" "dplyr" "dtplyr"
## [7] "forcats" "ggplot2" "googledrive"
## [10] "googlesheets4" "haven" "hms"
## [13] "httr" "jsonlite" "lubridate"
## [16] "magrittr" "modelr" "pillar"
## [19] "purrr" "ragg" "readr"
## [22] "readxl" "reprex" "rlang"
## [25] "rstudioapi" "rvest" "stringr"
## [28] "tibble" "tidyr" "xml2"
## [31] "tidyverse"
Before you can use a package it needs to be installed – this is the same as downloading an app from the App Store. Normally, you can install a single packages with the install.packages command. Previously, you needed run an install.package command for every package in the tidyverse as illustrated below (though we no longer use this approach).
# The old way of installing the tidyverse packages
# Like downloading apps from the app store
install.packages("broom", dep = TRUE)
install.packages("cli", dep = TRUE)
install.packages("ggplot", dep = TRUE)
# etc
Fortunately, the tidyverse packages can now by installed with a single install.packages command. Specifically, the install.packages command below will install all of the packages listed above.
1.8.3.1 Warnings regarding install.packages()
Prior to running install.packages() ALWAYS use Session > Restart R. Failure to do so could corrupt the R installation on your computer.
The install.packages() command should NEVER be put in a script. ALWAYS run install.packages() from the Console not your script.
1.8.4 Add library(tidyverse) to your script
The tidyverse is now installed, so we need to activate it. We do that with the library command. Put the library line below at the top of your script file (below your comment):
1.8.5 Activate tidyverse auto-complete for your script
Select the library(tidyverse) text with your mouse/track-pad so that it is highlighted. Then click the Run button in the upper right of the Script panel. Doing this “runs” the selected text. After you click the Run button you should see text like the following the Console panel:
## ── Attaching core tidyverse packages ──────────────────
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ───────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
When you use library(tidyverse) to activate the tidyverse you activate the most commonly used subset of the tidyverse packages. In the output you see checkmarks beside names of the tidyverse packages you have activated.
By activating these packages you have added new commands to R that you will use. Sometimes these packages replace older versions of commands in R. The “Conflicts” section in the output shows you where the packages you activated replaced older R commands with newer R commands. You can activate the other tidyverse package by running a library command for each package – if needed. No need to do so now.
Most importantly, running the library(tidyverse) prior to entering the rest of your script allows R Studio to present auto-complete options when typing your text. Remember to start each script with the library(tidyverse) command and then Run it so you get the autocomplete options for the rest of the commands your enter.
1.9 Loading your data
1.9.1 Use read_csv (not read.csv) to open files.
If you inspect the Files pane on the right of the screen you see the data_okcupid.csv data file in our project directory. We will load this data with the commands below. If you followed the steps above, you should have auto-complete for the tidyverse commands you type for now in – in the current R session. Enter the command below into your script. As your start to type read_csv you will likely be presented with an auto-complete option. You can use the arrow keys to move up and down the list of options to select the one you want - then press tab to select it.
Once your command looks like the one below select the text and click on the “Run” button.
## Rows: 59946 Columns: 6
## ── Column specification ───────────────────────────────
## Delimiter: ","
## chr (4): diet, pets, sex, status
## dbl (2): age, height
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The output indicates that you have loaded a data file and the type of data in each column. The sex column is of type col_character which indicates it contains text/letters. Most of the columns are of the type character. The age and height columns contain numbers are correspondingly indicated to be the type col_double. The label col_double indicates that a column of numbers represented in R with high precision. There are other ways of representing numbers in R but this is the type we will see/use most often.
1.10 Checking out your data
There many ways of viewing the actual data you loaded. A few of these are illustrated now.
1.10.1 view(): See a spreadsheet view of your data
You can inspect your data in a spreadsheet view by using the view command. Do NOT add this command to your script file – EVER. Adding it to the script can cause substantial problems. Type this command in the Console.
1.10.2 print(): See you data in the Console
You can inspect the first few rows of your data with the print() command. It is OK to add a print command to your script. Try the print() command below in the Console:
## # A tibble: 59,946 × 6
## age diet height pets sex status
## <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 22 strictly anything 75 likes dogs a… m single
## 2 35 mostly other 70 likes dogs a… m single
## 3 38 anything 68 has cats m avail…
## 4 23 vegetarian 71 likes cats m single
## 5 29 <NA> 66 likes dogs a… m single
## 6 29 mostly anything 67 likes cats m single
## 7 32 strictly anything 65 likes dogs a… f single
## 8 31 mostly anything 65 likes dogs a… f single
## 9 24 strictly anything 67 likes dogs a… f single
## 10 37 mostly anything 65 likes dogs a… m single
## # ℹ 59,936 more rows
1.10.3 head(): Check out the first few rows of data
You can inspect the first few rows of your data with the head() command. Try the command below in the Console:
## # A tibble: 6 × 6
## age diet height pets sex status
## <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 22 strictly anything 75 likes dogs an… m single
## 2 35 mostly other 70 likes dogs an… m single
## 3 38 anything 68 has cats m avail…
## 4 23 vegetarian 71 likes cats m single
## 5 29 <NA> 66 likes dogs an… m single
## 6 29 mostly anything 67 likes cats m single
You can be even more specific and indicate you only want the first three row of your data with the head() command. Try the command below in the Console:
## # A tibble: 3 × 6
## age diet height pets sex status
## <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 22 strictly anything 75 likes dogs an… m single
## 2 35 mostly other 70 likes dogs an… m single
## 3 38 anything 68 has cats m avail…
1.10.4 tail(): Check out the last few rows of data
You can inspect the last few rows of your data with the tail() command. Try the command below in the Console:
## # A tibble: 6 × 6
## age diet height pets sex status
## <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 31 <NA> 62 likes dogs f single
## 2 59 <NA> 62 has dogs f single
## 3 24 mostly anything 72 likes dogs and … m single
## 4 42 mostly anything 71 <NA> m single
## 5 27 mostly anything 73 likes dogs and … m single
## 6 39 <NA> 68 likes dogs and … m single
You can be even more specific and indicate you only want the last three row of your data with the tail() command. Try the command below in the Console:
## # A tibble: 3 × 6
## age diet height pets sex status
## <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 42 mostly anything 71 <NA> m single
## 2 27 mostly anything 73 likes dogs and … m single
## 3 39 <NA> 68 likes dogs and … m single
1.10.5 summary(): Quick summaries
You can a short summary of your data with the summary() command. Note that we will use the summary() command in many places in the guide. The output of the summary() command changes depending on what you give it - that is put inside the brackets. You can give the summary() command many things such as data, the results of a regression analysis, etc.
Try the command below in the Console. You will see that summary() give the mean and median for each of the numeric variables (age and height).
## age diet height
## Min. : 18.0 Length:59946 Min. : 1.0
## 1st Qu.: 26.0 Class :character 1st Qu.:66.0
## Median : 30.0 Mode :character Median :68.0
## Mean : 32.3 Mean :68.3
## 3rd Qu.: 37.0 3rd Qu.:71.0
## Max. :110.0 Max. :95.0
## NA's :3
## pets sex status
## Length:59946 Length:59946 Length:59946
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
1.11 Run vs. Source with Echo vs. Source
There are different ways of running commands in R. So far you have used two of these. You can enter them into the Console as we have done already. Or you can put them in your script select the text and click the Run button. There are four ways of running commands in your script.
You can:
- Console: Enter commands directly
- Script: Select the command(s) and press the Run button.
- Script: Source (Without Echo)
- Script: Source With Echo
Two of these approaches involve using the Source button, see Figure 1.2. You bring up the options for the Source button, illustrated in this figure, by clicking on the small arrow to the right of the word Source.
1.11.1 Run select text
The Run button will run the text you highlight and present the relevant output. You have used this command a fair amount already.
I strongly suggest you ONLY use the Run button when testing a command to make sure it works or to debug a script. Or to run library(tidyverse) as you start working on your script so that you get the autocomplete options.
In general, you should always try to execute your R Scripts using the Source with Echo command (preceded by a Restart, see below). This ensures your script will work beginning to end for you in the future and for others that attempt to use it. Using the Run button in an ad lib basis can create output that is not reproducible.
1.11.2 Source (without Echo)
Source (without Echo) is not designed for the typical analysis workflow. It is mostly helpful when you run simulations. When you run Source (without Echo) much of the output you would wish to read is suppressed. In general, avoid this option. If you use it, you often won’t see what you want to see in the output.
1.11.3 Source with Echo
The Source with Echo command runs all of the contents of a script and presents the output in the R console. This is the approach you should use to running your scripts in most cases.
Prior to running Source with Echo (or just Source), it’s always a good idea to restart R. This makes sure you clear the computer memory of any errors from any previous runs.
So you should do the following EVERY time you run your script.
- Use the menu item: Session > Restart R
- Click the down arrow beside the Source button, and click on Source With Echo
This will clear potentially problematic previous stats, run the script commands, and display the output in the Console. Moving forward we will use this approach for running scripts. Once you have used Source with Echo once, you can just click the Source button and it will use Source with Echo automatically (without the need to use the pull down option for selecting Source with Echo).
1.12 Trying Source with Echo
Put the head(), tail(), and summary() command we used previously into your script. Then save your script using using the File > Save menu. You script should appear as below.
# Code written on: YYYY/MM/DD
# By: John Smith
library(tidyverse)
okcupid_profiles <- read_csv(file = "data_okcupid.csv")
head(okcupid_profiles)
tail(okcupid_profiles)
summary(okcupid_profiles)
Now do the following:
- Use the menu item: Session > Restart R
- Click the down arrow beside the Source button, and click on Source With Echo
You should see the output below:
# Code written on: YYYY/MM/DD
# By: John Smith
library(tidyverse)
okcupid_profiles <- read_csv(file = "data_okcupid.csv")
## Rows: 59946 Columns: 6
## ── Column specification ───────────────────────────────
## Delimiter: ","
## chr (4): diet, pets, sex, status
## dbl (2): age, height
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 6 × 6
## age diet height pets sex status
## <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 22 strictly anything 75 likes dogs an… m single
## 2 35 mostly other 70 likes dogs an… m single
## 3 38 anything 68 has cats m avail…
## 4 23 vegetarian 71 likes cats m single
## 5 29 <NA> 66 likes dogs an… m single
## 6 29 mostly anything 67 likes cats m single
## # A tibble: 6 × 6
## age diet height pets sex status
## <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 31 <NA> 62 likes dogs f single
## 2 59 <NA> 62 has dogs f single
## 3 24 mostly anything 72 likes dogs and … m single
## 4 42 mostly anything 71 <NA> m single
## 5 27 mostly anything 73 likes dogs and … m single
## 6 39 <NA> 68 likes dogs and … m single
## age diet height
## Min. : 18.0 Length:59946 Min. : 1.0
## 1st Qu.: 26.0 Class :character 1st Qu.:66.0
## Median : 30.0 Mode :character Median :68.0
## Mean : 32.3 Mean :68.3
## 3rd Qu.: 37.0 3rd Qu.:71.0
## Max. :110.0 Max. :95.0
## NA's :3
## pets sex status
## Length:59946 Length:59946 Length:59946
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
Congratulations you just ran your first script!
1.13 A few key points about
Sometimes you will need to send a command additional information. Moreover, that information often needs to be grouped together into a vector or a list before you can send it to the command. We’ll learn more about doing so in the future but here is a quick over view of vectors and lists to provide a foundation for future chapters.
1.13.0.1 Vector of numbers
We can create a vector of only numbers using the “c” function - which you can think of as being short for “combine” (or concatenate). In the commands below we create a vector of a few even numbers called “even_numbers”.
## [1] 2 4 6 8 10
We can obtain the second number in the vector using the following notation:
## [1] 4
1.13.0.2 Vector of characters
We can also create vectors using only characters. Note that I use SHIFT RETURN after each comma to move to the next line.
## [1] "copper kettles" "woolen mittens"
## [3] "brown paper packages"
As before, can obtain the second item in the vector using the following notation:
## [1] "woolen mittens"
1.13.1 Lists
Lists are similar to vectors in that you can create them and access items by their numeric position. Vectors must be all characters or all numbers. Lists can be a mix of characters or numbers. Most importantly items in lists can be accessed by their label. Note that I use SHIFT RETURN after each comma to move to the next line in the code below.
## $last_name
## [1] "Smith"
##
## $first_name
## [1] "John"
##
## $office_number
## [1] 1913
You can access an item in a list using double brackets:
## $first_name
## [1] "John"
You can access an item in a list by its label/name using the dollar sign:
## [1] "Smith"
## [1] 1913
1.14 Revisiting read_csv()
In the text above we learned about vectors of characters. This information can be used handle missing values in the read_csv() command. Consider a data set where for many participants we did not get a response to a question. We might have that represented in various ways within the data file. The value for that question might simply be omitted (i.e., represented by nothing – ““), represented by NA for not available, or maybe simply be something like -777.
When we load data with read_csv() we can indicate to the computer how missing values are represented. This ensures the computer sees something like -777 as a missing value rather than an actual value. We do so in the read_csv() command below using the na argument (not available). Note: use SHIFT-RETURN or SHIFT-ENTER to move the next line within the read_csv() command. Failure to indicate the missing value codes within read_csv() will result in incorrect answers in most scenarios with missing data.