Chapter 4 Week 3 Friday Workshop
4.1 Required Packages
The data files below are used in this chapter. Right click to download data or scripts.
Required Data |
---|
data_aff_survey.csv |
The following CRAN packages must be installed. Use the install.packages() command.
Required CRAN Packages |
---|
tidyverse |
janitor |
skimr |
sjstats |
Important Note: You should NOT use library(psych) at any point! There are major conflicts between the psych package and the tidyverse. We will access the psych package commands by preceding each command with psych:: instead of using library(psych).
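As a small illustration of the `package::function()` form mentioned above, the sketch below calls a function through its namespace without attaching the package. It uses `stats::var()` only so that it runs without psych installed; the vector `x` is made up for the example.

```r
# Made-up values, used only to demonstrate the package::function() form
x <- c(2, 4, 4, 5)

# Call var() through its namespace; nothing is attached with library()
variance_via_namespace <- stats::var(x)
print(variance_via_namespace)

# The same pattern applies to psych (assumes psych is installed):
# psych::describe(x)
```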
The complete script for this project can be downloaded here: script_workshop2.R
4.2 Goals
For this workshop our goals are to learn how to:
Consolidate Workshop 2 concepts
Examine variance and covariance in the context of items
Illustrate how we can calculate the variance of a column of scale scores
4.3 Data and column names
A key component of effective data analysis is the use of naming conventions for column names. In this section we focus on one column naming convention that makes it easy to work with survey data. You will need to use this column naming convention for your major project.
The naming convention we advocate will save you hours of hassles and permit easy application of certain tidyverse commands. However, we must stress that although the naming convention we advocate is based on the tidyverse style guide, it is not “right” or “correct” - there are other naming conventions you can use (once you finish the course). Any naming convention is better than no naming convention. The naming convention we advocate here will solve many problems.
To make your life easier down the road, it is critical you set up your spreadsheet or online survey such that it uses a naming convention prior to data collection. The naming conventions suggested here are adapted from the tidyverse style guide.
The key components of the column naming convention are the following:
Lowercase letters only
If two-word column names are necessary, use only the underscore (“_”) character to separate words in the name. Do not use a period (“.”), a space (“ ”), or other symbols.
Use moderate-length (not short) column prefixes. For example, if you have an Affective Commitment Scale, do not preface each item with “acs”; instead, preface each item with a longer version like “aff_com”.
Indicate in the item name if it is a Likert-type scale and the number of points in the scale. For example, if your affective commitment scale has a 7-point Likert-type response scale, indicate that in the column name: the names for two commitment items might be “aff_com1_likert7” and “aff_com2_likert7”. This provides important information for future users of the data set (including you).
Indicate in the item name if the item is reverse keyed. Sometimes with Likert-type items, an item is reverse keyed. For example, on a positive job affect scale, participants will typically respond to items that reflect job affect using the scale: 1 - Strongly Disagree, 2 - Moderately Disagree, 3 - Neutral, 4 - Moderately Agree, 5 - Strongly Agree. Higher numbers indicate more positive job affect. Sometimes, however, some items will use the same 1 to 5 response scale but be worded in the opposite manner, such as “I hate my job”. Responding with a 5 to this item would indicate high negative job affect (not positive affect). But the columns for a positive job affect scale should have high values to indicate more positive job affect, not less. Consequently, we flag the names of columns with reversed responses (i.e., reverse-keyed items) so that we know to treat those columns differently later. Columns with reverse-keyed items need to be processed by a script so that the values are flipped and scored in the right direction. Because this example scale has 5 points, a normal positive job affect item might have a name like “job_aff1_likert5” whereas a reverse-keyed item would have a name like “job_aff2_likert5rev”. The “rev” in the column name indicates the item was a reverse-keyed item.
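The sketch below shows one way to screen column names against the rules listed above. The regular expression is only a rough, assumed encoding of the convention (lowercase letters and digits, with words separated by single underscores). It does not check prefix length or the Likert tag, and the example names are hypothetical.

```r
# Hypothetical column names: the first two follow the convention, the rest do not
column_names <- c("aff_com1_likert7", "aff_com4_likert7rev",
                  "Aff.Com1", "aff com2", "aff_com4-likert7rev")

# Lowercase letters/digits only, words separated by single underscores
follows_convention <- grepl("^[a-z][a-z0-9]*(_[a-z0-9]+)*$",
                            column_names, perl = TRUE)

# Among valid names, a trailing "rev" flags a reverse-keyed item
reverse_keyed <- follows_convention & endsWith(column_names, "rev")

print(data.frame(column_names, follows_convention, reverse_keyed))
```

Flagging reverse-keyed items with a fixed suffix is what later lets a script pick them out with a helper such as ends_with().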
4.4 Install packages
You will need the same packages installed as you did for the Week 2 Friday workshop. If you did not install the packages then, go back to that week and do so now.
4.5 RStudio Project
Create a folder on your computer for the example
Download the data file for this example: data_aff_survey.csv. ONLY obtain this file from the Downloads folder. If you open it in another program and save it again, the data file will not work.
Place the example data file in the folder you created in Step 1.
Use the menu item File > New Project… to start the project
On the window that appears select “Existing Directory”
On the next screen, press the “Browse” button and find/select the folder with your data
Press the Create Project Button
4.6 Week 2 Catchup
Place the code below into your script from last week:
# Date: YYYY-MM-DD
# Name: your name here
# Example: Workshop 3 PSYC 3250
## Activate packages
library(tidyverse)
library(janitor)
library(skimr)
library(sjstats)
## Load data
my_missing_value_codes <- c("-999", "", "NA")
raw_data_survey <- read_csv(file = "data_aff_survey.csv",
                            na = my_missing_value_codes)
## Rows: 100 Columns: 3
## ── Column specification ────────────────────────────────
## Delimiter: ","
## dbl (3): aff_com2_likert7, aff_com3_likert7, aff_com4_li...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
analytic_data_survey <- raw_data_survey
## Clean and screen
# Initial cleaning
analytic_data_survey <- analytic_data_survey %>%
  remove_empty("rows") %>%
  remove_empty("cols") %>%
  clean_names()
# Numeric screening
analytic_data_survey %>%
  select(aff_com2_likert7, aff_com3_likert7, aff_com4_likert7rev) %>%
  skim()
## Flipping responses to reverse-key items
head(analytic_data_survey)
analytic_data_survey <- analytic_data_survey %>%
  mutate(8 - across(.cols = ends_with("_likert7rev"))) %>%
  rename_with(.fn = str_replace,
              .cols = ends_with("_likert7rev"),
              pattern = "_likert7rev",
              replacement = "_likert7")
head(analytic_data_survey)
Now go to the menu: Session > Restart R
Then press the Source with Echo button to run the entire script.
Following this, type the head() command in the Console:
head(analytic_data_survey)
## # A tibble: 6 × 3
## aff_com2_likert7 aff_com3_likert7 aff_com4_likert7
## <dbl> <dbl> <dbl>
## 1 3 3 3
## 2 5 4 4
## 3 2 3 2
## 4 4 3 3
## 5 2 3 2
## 6 3 3 3
Sometimes there are too many columns. So another option is to use the glimpse() command:
analytic_data_survey %>%
  glimpse()
## Rows: 100
## Columns: 3
## $ aff_com2_likert7 <dbl> 3, 5, 2, 4, 2, 3, 4, 4, 4, 3, 4, …
## $ aff_com3_likert7 <dbl> 3, 4, 3, 3, 3, 3, 4, 3, 3, 3, 4, …
## $ aff_com4_likert7 <dbl> 3, 4, 2, 3, 2, 3, 4, 3, 3, 3, 4, …
At this point, all of the columns are coded in the right direction (i.e., there are no longer any reverse-keyed items). In the next section we will create the scale scores.
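The flip applied to the reverse-keyed item in the script above is a subtraction: on a 1 to 7 scale, each response is replaced by 8 minus the response. A minimal sketch of that arithmetic is below, using made-up responses; the names scale_min and scale_max are introduced here only for illustration.

```r
# Made-up responses to one reverse-keyed item on a 1-to-7 scale
responses <- c(1, 2, 4, 7, NA)

scale_min <- 1
scale_max <- 7

# General rule: flipped = (min + max) - response; for 1-to-7 that is 8 - response
flipped <- (scale_min + scale_max) - responses

# Show the original and flipped values side by side
print(rbind(responses, flipped))
```

The midpoint (4) stays where it is, the endpoints swap, and missing values remain missing. Applying the same rule twice returns the original responses.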
4.7 Creating composite/scale scores
When we AVERAGE (i.e., take the mean of) the commitment items for each person, we call the result a scale score. It represents our best guess of their overall commitment level based on those three items. The advantage of taking the AVERAGE of the three items is that we can interpret the result on the original 1 to 7 scale. We know the lowest score that someone can get is 1.0 (i.e., they obtained an average of 1.0 because they have a score of 1.0 on all three items). Correspondingly, the highest score someone can get is 7.0 (i.e., they obtained an average of 7.0 because they have a score of 7.0 on all three items). This is an interpretation approach that many researchers prefer. If you had more items, the range of possible scores would stay the same.
Other researchers prefer to use the SUM of the three items when calculating a scale score. This approach is used sometimes, but interpretation needs to take into account the number of items. For example, if you used the SUM approach with three items, the range of possible scores is 3 to 21. That is, if a person scored 1.0 for all three items their score would be 3.0. Likewise, if a person scored 7.0 for all three items then their score would be 21.0. If you had more items, the range of possible scores would change.
Both approaches (AVERAGE vs. SUM) are acceptable; you just need to make sure you know which one you’re dealing with.
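The two kinds of scale score are tied together by the number of items. As a sketch, assume a person answered all \(k\) items on a 1 to 7 scale (no missing responses); the symbol \(k\) is introduced here only for illustration:

\[ \text{SUM} = k \times \text{AVERAGE}, \qquad 1 \le \text{AVERAGE} \le 7, \qquad k \le \text{SUM} \le 7k \]

With \(k = 3\) items this gives the 3 to 21 range described above. With \(k = 10\) items the SUM would range from 10 to 70, while the AVERAGE would still range from 1 to 7.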
Add the code below to your script. Notice we calculate the scale score in two different ways in this script. The affect_mean and affect_sum columns are created. The affect_mean column corresponds to the affective_commitment column we created last week.
## Creating scale scores
analytic_data_survey <- analytic_data_survey %>%
  rowwise() %>%
  mutate(affect_mean = mean(c_across(starts_with("aff_com")),
                            na.rm = TRUE)) %>%
  mutate(affect_sum = sum(c_across(starts_with("aff_com")),
                          na.rm = TRUE)) %>%
  ungroup()
Use the head() command to see the columns you created:
analytic_data_survey %>%
  head()
## # A tibble: 6 × 5
## aff_com2_likert7 aff_com3_likert7 aff_co…¹ affec…² affec…³
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3 3 3 3 9
## 2 5 4 4 4.33 13
## 3 2 3 2 2.33 7
## 4 4 3 3 3.33 10
## 5 2 3 2 2.33 7
## 6 3 3 3 3 9
## # … with abbreviated variable names ¹aff_com4_likert7,
## # ²affect_mean, ³affect_sum
Check for yourself that the affect_mean column is the average of the three items for each person. As well, check for yourself that the affect_sum column is the sum of the three items for each person.
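One way to do this check is to retype the six rows printed by head() and recompute the scores with base R. The sketch below uses rowMeans() and rowSums() as an independent cross-check. It is not the method used in the workshop script, and the object names are made up for the example.

```r
# The six rows of item responses shown in the head() output above
items <- data.frame(
  aff_com2_likert7 = c(3, 5, 2, 4, 2, 3),
  aff_com3_likert7 = c(3, 4, 3, 3, 3, 3),
  aff_com4_likert7 = c(3, 4, 2, 3, 2, 3)
)

# Recompute the scale scores one row (person) at a time
check_mean <- rowMeans(items, na.rm = TRUE)
check_sum  <- rowSums(items, na.rm = TRUE)

# Compare these against the affect_mean and affect_sum columns
print(round(check_mean, 2))
print(check_sum)
```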
Sometimes there are too many columns to show across your screen when using the head() command. So you can also use the glimpse() command to see the columns:
analytic_data_survey %>%
  glimpse()
## Rows: 100
## Columns: 5
## $ aff_com2_likert7 <dbl> 3, 5, 2, 4, 2, 3, 4, 4, 4, 3, 4, …
## $ aff_com3_likert7 <dbl> 3, 4, 3, 3, 3, 3, 4, 3, 3, 3, 4, …
## $ aff_com4_likert7 <dbl> 3, 4, 2, 3, 2, 3, 4, 3, 3, 3, 4, …
## $ affect_mean <dbl> 3.000, 4.333, 2.333, 3.333, 2.333…
## $ affect_sum <dbl> 9, 13, 7, 10, 7, 9, 12, 10, 10, 9…
4.8 Variance of scale scores
We will now calculate the variance of the two scale score columns:
analytic_data_survey %>%
  summarise(var_affect_mean_column = var(affect_mean, na.rm = TRUE),
            var_affect_sum_column = var(affect_sum, na.rm = TRUE))
## # A tibble: 1 × 2
## var_affect_mean_column var_affect_sum_column
## <dbl> <dbl>
## 1 0.340 3.06
In particular, notice that the variance of the affect_sum column is 3.06. In the section below we will obtain this value through a very different calculation approach.
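The two variances in the output are related. When nobody has missing item responses, each sum is 3 times the corresponding mean, so the variance of the sums is 3 squared (9) times the variance of the means; here 9 × 0.340 ≈ 3.06. The sketch below checks that relationship on the six rows printed earlier. It assumes complete data, and the numbers it produces are for those six rows only, not the full data set.

```r
# The six rows of item responses shown earlier (complete data, no NAs)
items <- data.frame(
  aff_com2_likert7 = c(3, 5, 2, 4, 2, 3),
  aff_com3_likert7 = c(3, 4, 3, 3, 3, 3),
  aff_com4_likert7 = c(3, 4, 2, 3, 2, 3)
)

affect_mean <- rowMeans(items)
affect_sum  <- rowSums(items)

# Multiplying scores by 3 multiplies their variance by 3^2 = 9
print(var(affect_sum))
print(9 * var(affect_mean))
```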
4.9 Variance of items
In this section we focus on the item variances.
4.9.1 Approach 1
Below is the traditional approach to calculating the variance for columns. Try it out in RStudio.
analytic_data_survey %>%
  summarise(var_aff_com2 = var(aff_com2_likert7),
            var_aff_com3 = var(aff_com3_likert7),
            var_aff_com4 = var(aff_com4_likert7))
## # A tibble: 1 × 3
## var_aff_com2 var_aff_com3 var_aff_com4
## <dbl> <dbl> <dbl>
## 1 0.677 0.353 0.452
If you inspect the output above you see that the variances of the aff_com2, aff_com3, and aff_com4 items are 0.677, 0.353, and 0.452, respectively. Remember these numbers; we’re going to look for them in the output of the second approach.
4.9.2 Approach 2
With the second approach to getting the variance of the items we use the cov() (i.e., covariance) command. The commands below produce a covariance matrix.
cov_matrix <- analytic_data_survey %>%
  select(aff_com2_likert7, aff_com3_likert7, aff_com4_likert7) %>%
  cov()
print(cov_matrix)
| | aff_com2_likert7 | aff_com3_likert7 | aff_com4_likert7 |
|---|---|---|---|
| aff_com2_likert7 | 0.6768 | 0.2677 | 0.3283 |
| aff_com3_likert7 | 0.2677 | 0.3534 | 0.1914 |
| aff_com4_likert7 | 0.3283 | 0.1914 | 0.4520 |
Examine the diagonal values of the covariance matrix. Notice that the (rounded) diagonal values are 0.677, 0.353, and 0.452. These are the same values we observed with Approach 1. We can conclude from this that in a covariance matrix the values along the diagonal represent variance.
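In contrast, the off-diagonal values in the matrix above represent covariances between pairs of items. The matrix is symmetric: each covariance appears twice, once above and once below the diagonal.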
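A quick way to confirm the diagonal claim is to pull the diagonal out with diag() and compare it with variances computed one column at a time. The sketch below does this on the six rows printed earlier, so its numbers will differ from the full-data matrix above; the object names are made up for the example.

```r
# The six rows of item responses shown earlier
items <- data.frame(
  aff_com2_likert7 = c(3, 5, 2, 4, 2, 3),
  aff_com3_likert7 = c(3, 4, 3, 3, 3, 3),
  aff_com4_likert7 = c(3, 4, 2, 3, 2, 3)
)

cov_matrix_six_rows <- cov(items)

# Diagonal of the covariance matrix versus column-by-column variances
diagonal_values <- diag(cov_matrix_six_rows)
item_variances  <- sapply(items, var)

print(diagonal_values)
print(item_variances)
```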
4.10 Variance of affect_sum column
Previously, we saw that the variance of the affect_sum column was 3.06. In lecture, we learned that you can calculate the variance of this column from the item variances and covariances, using the formula below.
\[ s_x^2= s_{A1}^2 + s_{A2}^2 + s_{A3}^2 + 2*COV(A1,A2) + 2*COV(A1,A3) + 2*COV(A2,A3) \]
If we contextualize this formula for the current example it looks like the following:
\[ s_x^2 = s_{\text{aff_com2}}^2 + s_{\text{aff_com3}}^2 + s_{\text{aff_com4}}^2 \\ + 2*COV(\text{aff_com2},\text{aff_com3}) \\ + 2*COV(\text{aff_com2},\text{aff_com4}) \\ + 2*COV(\text{aff_com3},\text{aff_com4}) \]
However, the fancy formulas are just a way of saying sum up all the values in the covariance matrix.
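To see why, plug the rounded values from the covariance matrix in Section 4.9.2 into the formula. The three diagonal values are the item variances. Each covariance appears twice in the matrix (once above and once below the diagonal), which is where the factor of 2 comes from:

\[ s_x^2 \approx 0.6768 + 0.3534 + 0.4520 + 2*(0.2677) + 2*(0.3283) + 2*(0.1914) \\ = 1.4822 + 1.5748 = 3.0570 \]

This is the same as adding up all nine entries of the matrix.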
We see the covariance matrix below:
print(cov_matrix)
| | aff_com2_likert7 | aff_com3_likert7 | aff_com4_likert7 |
|---|---|---|---|
| aff_com2_likert7 | 0.6768 | 0.2677 | 0.3283 |
| aff_com3_likert7 | 0.2677 | 0.3534 | 0.1914 |
| aff_com4_likert7 | 0.3283 | 0.1914 | 0.4520 |
We can sum up all the values in the covariance matrix with the code below:
sum(cov_matrix)
## [1] 3.057
You see we get a value of 3.057, which rounds to 3.06, the variance of the affect_sum column. Recall our previous calculation (repeated below), in which we computed the variance of the affect_sum column directly; we obtained the same value (taking rounding into account).
analytic_data_survey %>%
  summarise(var_affect_sum_column = var(affect_sum, na.rm = TRUE))
## # A tibble: 1 × 1
## var_affect_sum_column
## <dbl>
## 1 3.06
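The agreement between the two calculations is not specific to this data set. As a sketch, the code below repeats both calculations on the six rows printed earlier and confirms they match. It assumes complete data. With missing item responses, cov() and a sum computed with na.rm = TRUE handle the gaps differently, so the two values may not agree.

```r
# The six rows of item responses shown earlier (complete data, no NAs)
items <- data.frame(
  aff_com2_likert7 = c(3, 5, 2, 4, 2, 3),
  aff_com3_likert7 = c(3, 4, 3, 3, 3, 3),
  aff_com4_likert7 = c(3, 4, 2, 3, 2, 3)
)

# Route 1: add up every entry of the item covariance matrix
variance_from_matrix <- sum(cov(items))

# Route 2: build the sum score for each person, then take its variance
variance_direct <- var(rowSums(items))

print(variance_from_matrix)
print(variance_direct)
```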