Chapter 3 Week 2 Friday Workshop

3.1 Required Packages

The data files below are used in this chapter. The files are available at: https://github.com/dstanley4/psyc3250bookdown

Required Data
data_aff_survey.csv

The following CRAN packages must be installed:

Required CRAN Packages
tidyverse
janitor
skimr
sjstats

Important Note: You should NOT use library(psych) at any point! There are major conflicts between the psych package and the tidyverse. We will access the psych package commands by preceding each command with psych:: instead of using library(psych).

The complete script for this project can be downloaded here: script_workshop2.R

3.2 Goals

For this workshop our goals are to learn how to:

  • Use column naming conventions

  • Use Projects in RStudio

  • Load data

  • Clean and screen data

  • Reverse-key items

  • Make scale scores

  • Calculate variance using the population-level formula

3.3 Data and column names

A key component of effective data analysis is the use of naming conventions for column names. In this section we focus on one column naming convention that makes it easy to work with survey data. You will need to use this column naming convention for your major project.

The naming convention we advocate will save you hours of hassles and permit easy application of certain tidyverse commands. However, we must stress that although the naming convention we advocate is based on the tidyverse style guide, it is not “right” or “correct” - there are other naming conventions you can use (once you finish the course). Any naming convention is better than no naming convention. The naming convention we advocate here will solve many problems.

To make your life easier down the road, it is critical you set up your spreadsheet or online survey such that it uses a naming convention prior to data collection. The naming conventions suggested here are adapted from the tidyverse style guide.

The key components of the column naming convention are the following:

  • Lowercase letters only

  • If two-word column names are necessary, use only the underscore (“_”) character to separate words in the name. Do not use a period (“.”), a space (“ ”), or other symbols.

  • Use moderate-length (not short) column prefixes. For example, if you have an Affective Commitment Scale, do not preface each item with “acs”; instead, preface each item with a longer version like “aff_com”.

  • Indicate in the item name if it is a Likert-type scale and the number of points in the scale. For example, if your affective commitment scale has a 7-point Likert-type response scale, indicate that in the column name. The names for two commitment items might be “aff_com1_likert7” and “aff_com2_likert7”. This conveys important information to future users of the data set (including you).

  • Indicate in the item name if the item is reverse-keyed. Sometimes with Likert-type items, an item is reverse-keyed. For example, on a positive job affect scale, participants will typically respond to items that reflect job affect using the scale: 1 - Strongly Disagree, 2 - Moderately Disagree, 3 - Neutral, 4 - Moderately Agree, 5 - Strongly Agree. Higher numbers indicate more positive job affect. Sometimes, however, some items will use the same 1 to 5 response scale but be worded in the opposite manner, such as “I hate my job”. Responding with a 5 to this item would indicate high negative job affect (not positive affect). But the columns for a positive job affect scale should have high values to indicate more positive job affect, not less. Consequently, we flag the names of columns with reversed responses (i.e., reverse-keyed items) so that we know to treat those columns differently later. Columns with reverse-keyed items need to be processed by a script so that the values are flipped and scored in the right direction. On this 5-point scale, a normal positive job affect item might have a name like “job_aff1_likert5” whereas a reverse-keyed item would have a name like “job_aff2_likert5rev”. The “rev” in the column name indicates the item was a reverse-keyed item.

Notice how the data file we use for this workshop follows this naming convention.
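
As a rough illustration of the convention, the sketch below checks a few made-up column names in base R. The helper follows_convention() is hypothetical (it is not part of any package) and encodes only the rule about lowercase letters, digits, and underscores.

```r
# Hypothetical helper (not from any package): TRUE for names that start with
# a lowercase letter and contain only lowercase letters, digits, and underscores
follows_convention <- function(column_names) {
  grepl("^[a-z][a-z0-9_]*$", column_names)
}

# Made-up names: two that follow the convention and two that do not
example_names <- c("aff_com1_likert7", "aff_com4_likert7rev",
                   "Aff.Com1", "job aff2")

follows_convention(example_names)  # TRUE TRUE FALSE FALSE

# The "rev" suffix makes reverse-keyed items easy to find
endsWith(example_names, "rev")     # FALSE TRUE FALSE FALSE
```

A check like this could be run on names(your_data) before data collection begins, when renaming columns is still cheap.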

3.4 Install packages

Prior to starting this activity, you must have a number of packages installed. Because you are using RStudio Cloud on your computer, a package only needs to be installed once. If you installed it previously, you do not need to install it again. But if you’re not sure, you can always install it again to be safe.

  1. Use the menu Session > Restart R

  2. In the Console (NOT THE SCRIPT) type the following commands to install the required packages:

install.packages("tidyverse", dependencies = TRUE)
install.packages("skimr", dependencies = TRUE)
install.packages("janitor", dependencies = TRUE)
install.packages("sjstats", dependencies = TRUE)

  3. You’re done - these packages are now stored on your computer. You won’t need to install them again.

3.5 RStudio Project

  1. Create a folder on your computer for the example

  2. Download the data file for this example: data_aff_survey.csv. ONLY obtain this file from the Downloads folder. If you open it in another program and save it again, the data file will not work.

  3. Place the example data file in the folder you created in Step 1.

  4. Use the menu item File > New Project… to start the project

  5. On the window that appears select “Existing Directory”

  6. On the next screen, press the “Browse” button and find/select the folder with your data

  7. Press the Create Project Button

3.5.1 Make a script

  1. Go to the File menu: File > New File > R Script

  2. When you write your code, place it in this script.

3.6 Loading data

Place the code below into your script:

# Date: YYYY-MM-DD
# Name: your name here
# Example: Single occasion survey

# Activate packages
library(tidyverse)
library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(skimr)
library(sjstats)

# Load the data
my_missing_value_codes <- c("-999", "", "NA")

raw_data_survey <- read_csv(file = "data_aff_survey.csv",
                     na = my_missing_value_codes)
## Rows: 100 Columns: 3
## ── Column specification ────────────────────────────────
## Delimiter: ","
## dbl (3): aff_com2_likert7, aff_com3_likert7, aff_com4_li...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We load the initial data into raw_data_survey but immediately make a copy that we will work with, called analytic_data_survey. It’s good to keep a copy of the raw data for reference if you encounter problems.

analytic_data_survey <- raw_data_survey
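
To see what the missing value codes do, here is a small sketch that reads made-up CSV text instead of the workshop file. The toy_csv and toy_data names and their values are invented for illustration; wrapping the text in I() to mark it as literal data assumes readr 2.0 or later.

```r
library(readr)

# Made-up CSV text standing in for a file; I() marks the string as literal data
toy_csv <- I("aff_com2_likert7,aff_com3_likert7\n3,-999\n,4\n5,NA\n")

my_missing_value_codes <- c("-999", "", "NA")

toy_data <- read_csv(file = toy_csv,
                     na = my_missing_value_codes,
                     show_col_types = FALSE)

# Count the missing values in each column: -999, the empty cell, and NA
# are all read in as NA
colSums(is.na(toy_data))
```

Without the na argument, the -999 entry would be read as a real response and would distort any mean computed later.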

3.7 Clean and screen the data

Remove empty rows and columns from your data using remove_empty("rows") and remove_empty("cols"), respectively. As well, clean the names of your columns to ensure they conform to tidyverse naming conventions.

# Initial cleaning
analytic_data_survey <- analytic_data_survey %>%
  remove_empty("rows") %>%
  remove_empty("cols") %>%
  clean_names()
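
The workshop data file is already clean, so these commands change nothing visible. The sketch below applies the same three steps to a made-up messy data set (toy_raw is invented for illustration) to show what each step is for.

```r
library(tibble)
library(dplyr)
library(janitor)

# Made-up messy data: one fully empty row, one fully empty column,
# and column names that break the naming convention
toy_raw <- tibble(`Aff Com1 Likert7` = c(3, NA, 5),
                  `Empty Column` = c(NA, NA, NA),
                  `Aff.Com2.Likert7` = c(4, NA, 2))

toy_clean <- toy_raw %>%
  remove_empty("rows") %>%   # drops the row that is entirely NA
  remove_empty("cols") %>%   # drops the column that is entirely NA
  clean_names()              # converts names to lowercase with underscores

names(toy_clean)
dim(toy_clean)  # 2 rows and 2 columns remain
```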

You can confirm the column names follow our naming convention with the glimpse() command, which also shows the data type for each column.

glimpse(analytic_data_survey)
## Rows: 100
## Columns: 3
## $ aff_com2_likert7    <dbl> 3, 5, 2, 4, 2, 3, 4, 4, 4, 3, …
## $ aff_com3_likert7    <dbl> 3, 4, 3, 3, 3, 3, 4, 3, 3, 3, …
## $ aff_com4_likert7rev <dbl> 5, 4, 6, 5, 6, 5, 4, 5, 5, 5, …

Notice the column names in this file. We have aff_com2_likert7, aff_com3_likert7, and aff_com4_likert7rev. The likert columns contain the responses to three items measuring affective commitment to the organization. In particular, pay attention to the fact that the third item is reverse-keyed.

Just for your reference, the text of the items used to generate the data for these columns was:

  • aff_com2_likert7: I enjoy discussing my organization with people outside of it

  • aff_com3_likert7: I really feel as if this organization’s problems are my own

  • aff_com4_likert7rev: I think I could easily become as attached to another organization as I am to this one. (reverse-keyed)

3.7.1 Numeric screening

For numeric variables, we want to make sure that we don’t have impossible values. In this example, you want to ensure none of the Likert responses are impossible (e.g., outside the 1 to 7 rating scale) or clearly data entry errors. If a value is below 1 or higher than 7, it is an error. In the skim() output below, the p0 column indicates the lowest value for each variable. You can see the lowest value for aff_com2_likert7 is 1. The p100 column tells you the highest value in each column. You can see that the highest value for aff_com2_likert7 is 5. What are the lowest and highest values for aff_com3_likert7 and aff_com4_likert7rev? Also reported in this output are the mean and standard deviation. Note that the reported standard deviation is calculated with \(n - 1\) in the denominator. This may or may not be what you want to know.

analytic_data_survey %>%
  select(aff_com2_likert7, aff_com3_likert7, aff_com4_likert7rev) %>%
  skim()
##         skim_variable n_missing mean   sd p0 p50 p100
## 1    aff_com2_likert7         0 3.50 0.82  1 4.0    5
## 2    aff_com3_likert7         0 3.49 0.59  2 3.5    5
## 3 aff_com4_likert7rev         0 4.55 0.67  3 5.0    6
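
Reading p0 and p100 tells you whether a problem exists, but not which rows are responsible. As one possible follow-up (not part of the workshop script), the sketch below uses made-up data to list the rows that contain an impossible response. It assumes a dplyr version that provides if_any().

```r
library(dplyr)
library(tibble)

# Made-up responses; 9 and 0 are impossible on a 1 to 7 scale
toy_data <- tibble(aff_com2_likert7 = c(3, 5, 9, 4),
                   aff_com3_likert7 = c(3, 4, 3, 0))

# Keep only the rows where at least one response is below 1 or above 7
impossible_rows <- toy_data %>%
  filter(if_any(.cols = ends_with("_likert7"),
                .fns = ~ .x < 1 | .x > 7))

print(impossible_rows)  # the third and fourth rows of toy_data
```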

3.8 Flipping responses to reverse-key items

The way you deal with reverse-keyed items depends on how you scored them. Imagine you had a 7-point scale. You could have scored the scale with the values 1, 2, 3, 4, 5, 6, and 7. Alternatively, you could have scored the scale with the values 0, 1, 2, 3, 4, 5, and 6. The mathematical approach you use to correct reverse-keyed items depends upon whether the 7-point scale starts with 1 or 0.

In this example, we scored the data using the values 1 to 7, so that is the approach illustrated here. See the extra information box (below) for details on how to fix reverse-keyed items when the scale begins with zero.

In this data file all the reverse-keyed items were identified with the suffix “_likert7rev” in the column names. This suffix indicates the item was reverse keyed and that the original scale used the response points 1 to 7. We can see using the glimpse() command below that there was only one reverse-keyed item.

analytic_data_survey %>%
  glimpse()
## Rows: 100
## Columns: 3
## $ aff_com2_likert7    <dbl> 3, 5, 2, 4, 2, 3, 4, 4, 4, 3, …
## $ aff_com3_likert7    <dbl> 3, 4, 3, 3, 3, 3, 4, 3, 3, 3, …
## $ aff_com4_likert7rev <dbl> 5, 4, 6, 5, 6, 5, 4, 5, 5, 5, …

To correct a reverse-keyed item where the lowest possible rating is 1 (i.e., 1 on a 1 to 7 scale), we subtract every score from a value one more than the highest point possible on the scale (i.e., one more than 7). For example, if a 1 to 7 response scale was used, we subtract each response from 8 to obtain the recoded value.

Original value    Math     Recoded value
1                 8 - 1    7
2                 8 - 2    6
3                 8 - 3    5
4                 8 - 4    4
5                 8 - 5    3
6                 8 - 6    2
7                 8 - 7    1

Thus, we need to subtract every value in the aff_com4_likert7rev column from 8 to flip the reverse-keyed responses to the correct direction. You can see the code that does this below (but don’t type it yet).

# Do not type into your script. 
# This is PART of a command not a full command.
analytic_data_survey <- analytic_data_survey %>% 
  mutate(8 - across(.cols = ends_with("_likert7rev")) )

The code above is general in nature and will perform the subtraction for any column whose name ends in “_likert7rev”. In our case, only one column will be affected. The problem with the code above, though, is that it leaves you with the wrong column name. You have flipped the values in the column so that they are not reverse-keyed anymore, but the column name still indicates that the responses are reverse-keyed. So you need to add the code below to change the column name.

# Do not type into your script. 
# This is PART of a command not a full command.
  rename_with(.fn = str_replace,
              .cols = ends_with("_likert7rev"),
              pattern = "_likert7rev",
              replacement = "_likert7")

Let’s begin by looking at the first few rows of your data set:

head(analytic_data_survey)
## # A tibble: 6 × 3
##   aff_com2_likert7 aff_com3_likert7 aff_com4_likert7rev
##              <dbl>            <dbl>               <dbl>
## 1                3                3                   5
## 2                5                4                   4
## 3                2                3                   6
## 4                4                3                   5
## 5                2                3                   6
## 6                3                3                   5

You can see the first three values of the aff_com4_likert7rev column are 5, 4, and 6.

Now let’s fix the column with the code below, which you should put in your script.

# Place this code in your script

analytic_data_survey <- analytic_data_survey %>% 
  mutate(8 - across(.cols = ends_with("_likert7rev")) ) %>% 
  rename_with(.fn = str_replace,
              .cols = ends_with("_likert7rev"),
              pattern = "_likert7rev",
              replacement = "_likert7")

After you put the code above in your script, add another head() command:

head(analytic_data_survey)
## # A tibble: 6 × 3
##   aff_com2_likert7 aff_com3_likert7 aff_com4_likert7
##              <dbl>            <dbl>            <dbl>
## 1                3                3                3
## 2                5                4                4
## 3                2                3                2
## 4                4                3                3
## 5                2                3                2
## 6                3                3                3

Let’s run your full script. Go to the menu Session > Restart R. Then click the Source with Echo button to run the full script.

head(analytic_data_survey)
## # A tibble: 6 × 3
##   aff_com2_likert7 aff_com3_likert7 aff_com4_likert7
##              <dbl>            <dbl>            <dbl>
## 1                3                3                3
## 2                5                4                4
## 3                2                3                2
## 4                4                3                3
## 5                2                3                2
## 6                3                3                3

When you see the output of the second head() command, you can see that the aff_com4_likert7rev column has turned into aff_com4_likert7 (with no rev). You can also see the first few values of this column are 3, 4, and 2. That is, you can see the values in the column have been flipped.

Congratulations, you’ve finished fixing the reverse-keyed item in your data set.

If your scale had used response options numbered 0 to 6, the math is different. For each item, you would subtract values from the highest possible point (i.e., 6) instead of one larger than the highest possible point.
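
For reference, the same flip can be written with the function placed inside across(), which is the spelling you will often see in dplyr documentation. This is an alternative sketch on made-up data, not the workshop script; it also uses str_remove() to drop the “rev” suffix rather than str_replace().

```r
library(dplyr)
library(stringr)
library(tibble)

# Made-up data with one regular item and one reverse-keyed item
toy_data <- tibble(aff_com3_likert7 = c(3, 4, 3),
                   aff_com4_likert7rev = c(5, 4, 6))

toy_flipped <- toy_data %>%
  mutate(across(.cols = ends_with("_likert7rev"),
                .fns = ~ 8 - .x)) %>%              # flip the 1 to 7 responses
  rename_with(.fn = ~ str_remove(.x, "rev$"),      # drop the "rev" suffix
              .cols = ends_with("_likert7rev"))

names(toy_flipped)

# Sanity check: each original response plus its recoded value equals 8
toy_data$aff_com4_likert7rev + toy_flipped$aff_com4_likert7  # 8 8 8
```

Both spellings leave the other columns untouched, so either can be used as long as the flip and the rename are done together.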

Original value    Math     Recoded value
0                 6 - 0    6
1                 6 - 1    5
2                 6 - 2    4
3                 6 - 3    3
4                 6 - 4    2
5                 6 - 5    1
6                 6 - 6    0

Thus, the mutate command would instead be:

mutate(6 - across(.cols = ends_with("_likert7rev")) )
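
Both recoding rules are instances of one formula: subtract each response from the sum of the lowest and highest possible values (1 + 7 = 8, or 0 + 6 = 6). The helper below, reverse_key(), is a hypothetical function written only to illustrate that point; it is not part of the workshop script or any package.

```r
# Hypothetical helper: reverse-key responses on any scale by subtracting
# them from (lowest possible value + highest possible value)
reverse_key <- function(responses, scale_min, scale_max) {
  (scale_min + scale_max) - responses
}

# Scale scored 1 to 7: responses are subtracted from 8
reverse_key(c(1, 4, 7), scale_min = 1, scale_max = 7)  # 7 4 1

# Scale scored 0 to 6: responses are subtracted from 6
reverse_key(c(0, 3, 6), scale_min = 0, scale_max = 6)  # 6 3 0
```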

3.9 Making scale scores

With the reverse-keyed item corrected, we can make a scale score for each person by averaging their responses across the affective commitment items. The rowwise() command tells R to compute the mean within each row, and c_across() selects the columns to average.

analytic_data_survey <- analytic_data_survey %>% 
  rowwise() %>% 
  mutate(affect_mean = mean(c_across(starts_with("aff_com")),
                                     na.rm = TRUE)) %>%
  ungroup() 

You can see the new column you created with the head() command. The values in this column represent the average of the three affective commitment items for each person.

head(analytic_data_survey)
## # A tibble: 6 × 4
##   aff_com2_likert7 aff_com3_likert7 aff_com4_likert7 affec…¹
##              <dbl>            <dbl>            <dbl>   <dbl>
## 1                3                3                3    3   
## 2                5                4                4    4.33
## 3                2                3                2    2.33
## 4                4                3                3    3.33
## 5                2                3                2    2.33
## 6                3                3                3    3   
## # … with abbreviated variable name ¹​affect_mean

Check for yourself that the affect_mean column is the average of the three items for each person.
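
One way to do this check is to retype the first three rows from the head() output and average them with base R’s rowMeans(). The first_rows data frame below is only a hand-copied excerpt for illustration.

```r
# First three rows copied by hand from the head() output
first_rows <- data.frame(aff_com2_likert7 = c(3, 5, 2),
                         aff_com3_likert7 = c(3, 4, 3),
                         aff_com4_likert7 = c(3, 4, 2))

# Average across the three items within each row
row_averages <- rowMeans(first_rows, na.rm = TRUE)

round(row_averages, 2)  # 3.00 4.33 2.33
```

These values match the first three entries of the affect_mean column.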

3.10 Descriptive statistics for scale scores

You can quickly get descriptive statistics using the skim() command - as illustrated below.

analytic_data_survey %>%
  skim()
##      skim_variable mean   sd    p0   p50  p100
## 1 aff_com2_likert7 3.50 0.82 1.000 4.000 5.000
## 2 aff_com3_likert7 3.49 0.59 2.000 3.500 5.000
## 3 aff_com4_likert7 3.45 0.67 2.000 3.000 5.000
## 4      affect_mean 3.48 0.58 1.667 3.333 4.667

However, you can be more specific about the descriptive statistics you want. For example, if you wanted the mean and variance (rather than standard deviation), you could obtain them with the code below.

analytic_data_survey %>%
  summarise(mean_affect_mean_column = mean(affect_mean, na.rm = TRUE),
            var_affect_mean_column = var(affect_mean, na.rm = TRUE))
## # A tibble: 1 × 2
##   mean_affect_mean_column var_affect_mean_column
##                     <dbl>                  <dbl>
## 1                    3.48                  0.340
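
Note that var() divides by \(n - 1\). If you want the population-level formula, which divides by \(n\), one option is to compute it directly in base R. The sketch below uses a handful of illustrative scores rather than the workshop data, and the object names are our own.

```r
# Illustrative scale scores (not the full workshop data)
scores <- c(3, 4.33, 2.33, 3.33, 2.33, 3)
n <- length(scores)

# var() uses n - 1 in the denominator
var_sample <- var(scores)

# Population-level formula: sum of squared deviations divided by n
var_population <- sum((scores - mean(scores))^2) / n

# The two are related by a factor of (n - 1) / n
all.equal(var_population, var_sample * (n - 1) / n)  # TRUE
```

With a large sample the two values are close, because \((n - 1)/n\) approaches 1 as \(n\) grows.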