Chapter 11 Equivalence tests and sample size

11.1 Required

The following CRAN packages must be installed:

  • MBESS

  • TOSTER

11.2 Interpreting non-significant findings

A common problem in the psychological literature is the interpretation of non-significant effects. As Kirk (1996) notes, “some researchers mistakenly interpret a failure to reject the null hypothesis as evidence for accepting it”. Indeed, it is unfortunately common to see incorrect sentences like the following examples based on fictitious data: “The difference between the mean IQs for males and females was non-significant, \(t\)(98) = 0.745, \(p\) = .458, indicating that males and females have, on average, the same intelligence.” Or alternatively an incorrect sentence like: “The correlation between height and IQ was non-significant, \(r\) = .15, \(p\) = .854, indicating that height was not related to IQ.” Both of these sentences are statistically incorrect because the conclusions do not follow from the reported statistics.

The tendency for researchers to incorrectly conclude that there is no effect when \(p\) > .05 is particularly troubling (and ubiquitous) when interpreting the interaction in an ANOVA. For example, consider a 2 (sex) by 2 (occasion) between/within ANOVA where the dependent variable is reaction time. Imagine there is a significant sex × occasion interaction. The significant interaction indicates the relation between occasion and reaction time depends on the level of sex. The researcher describes this significant interaction by comparing the reaction times of males and females at occasion 1 and then again at occasion 2: “A comparison of the mean reaction times of males and females at occasion 1, \(t\)(28) = 1.06, \(p\) = .300, revealed no difference in reaction time. In contrast, at occasion 2, there was a difference in the mean reaction times of males and females, \(t\)(28) = 2.15, \(p\) = .040. Thus, reaction times were the same, on average, for males and females at occasion 1 but not occasion 2.” In this example the researcher erred in their interpretation of the results at occasion 1. Specifically, the researcher incorrectly concluded that the reaction times for males and females were the same at occasion 1 because \(p\) > .05. That type of conclusion is not possible when \(p\) > .05 in a standard paired comparison / \(t\)-test.

The calculation of a \(p\)-value begins by assuming the null hypothesis is true; consequently, you cannot use a \(p\)-value as evidence the null hypothesis is true. That is, when a \(p\)-value exceeds the threshold for significance (.05) you cannot conclude there is no effect. This fact is discussed at the 10 minute mark in an excellent video by Daniel Lakens.

You might well wonder what to do if you do want to conclude that there is no effect/relation. This type of conclusion is possible but you need to use the right tool to do so. One tool for concluding there is no effect is the Bayes factor (BF), but that is beyond the scope of this course. An easily accessible alternative for concluding there is no effect or relation is the equivalence test (Lakens, Scheel, and Isager 2018).

11.3 When would I use an equivalence test?

You can use an equivalence test when you want to conclude there is no effect or relation. This situation might be more common than you might think. A few of the scenarios where you would like to use an equivalence test are outlined below:

11.3.1 \(t\)-test

  • You calculate a \(t\)-test and expect to find a difference between the two conditions. Unexpectedly, the hypothesized difference is non-significant. At this point you can’t draw much of a conclusion. You can say the groups were not statistically different but you cannot say the groups were statistically the same. An equivalence test could allow you to conclude the groups are statistically equivalent.

  • For theoretical reasons the primary purpose of your study may be to determine whether two groups are the same (e.g., two treatments for the same disease that are believed to be equally effective).

11.3.2 Correlation

  • You calculate a correlation and expect to find a relation between the two variables. Unexpectedly, the hypothesized relation is non-significant. At this point you can’t draw much of a conclusion. You can say you didn’t find evidence for a relation but you cannot say you found evidence of no relation. An equivalence test could allow you to conclude there is no relation between the two variables.

  • For theoretical reasons the purpose of your study may be to provide evidence that there is no relation between two variables (e.g., video game use and violent behaviors).

11.4 Possible Outcomes

Lakens, Scheel, and Isager (2018) review a variety of outcomes from an equivalence test. These are easiest to understand in the context of a \(t\)-test with two groups. You could find the two groups are:

  1. Not statistically equivalent and not statistically different

  2. Statistically equivalent and not statistically different

  3. Statistically equivalent and statistically different

  4. Not statistically equivalent and statistically different

You can see from the possible outcomes above it is still possible to obtain an outcome that is difficult to interpret. The outcomes that are challenging to interpret are most likely to occur when you have small sample sizes and low statistical power for the equivalence test.
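
The four outcomes can be read directly from the two \(p\)-values: one from the traditional test against zero and one from the equivalence test. The sketch below is only an illustration; the helper name classify_outcome() and the \(p\)-values passed to it are hypothetical and are not part of the TOSTER package.

```r
# Hypothetical helper: label the joint outcome of a traditional test (NHST)
# and an equivalence test (TOST) from their two p-values.
classify_outcome <- function(p_nhst, p_tost, alpha = .05) {
  different  <- p_nhst < alpha  # NHST rejects "the effect is zero"
  equivalent <- p_tost < alpha  # TOST rejects "the effect is outside the bounds"
  paste(
    if (equivalent) "Statistically equivalent" else "Not statistically equivalent",
    "and",
    if (different) "statistically different" else "not statistically different"
  )
}

# Illustrative p-values only
classify_outcome(p_nhst = .40, p_tost = .001)  # the clear "no effect" outcome
classify_outcome(p_nhst = .70, p_tost = .06)   # the ambiguous outcome
```

Only the first of these two calls supports a claim of equivalence; the second leaves you unable to claim either a difference or equivalence.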

To avoid an ambiguous outcome from an equivalence test make sure you conduct a sample size analysis for an equivalence test prior to running your study. The sample size demands for an equivalence test may be substantially greater than for a traditional analysis. Therefore we encourage you to conduct both a traditional sample size analysis and a sample size analysis for an equivalence test before you start collecting data.

11.5 What is an equivalence test?

An equivalence test is just a pair of one-sided \(t\)-tests that are used to establish if an effect falls within a specified range of effect sizes bounding zero. That is, an equivalence test is used to indicate if an effect/relation is close enough to zero to be considered zero for practical purposes.

11.6 Defining a zero effect

How do you decide how close to zero is close enough to be practically zero? You already did so in the previous chapter on “NHST and sample size”. In that chapter you reviewed various ways of determining the smallest effect size of interest (SESOI). Any effect below the SESOI is logically close enough to zero to, for practical purposes, be zero. However, I encourage you to review Lakens, Scheel, and Isager (2018) to see the full discussion on this issue.

For now, the most important aspect of conducting an equivalence test is that you need to determine the smallest effect size of interest prior to data collection. Otherwise equivalence testing becomes a fancy form of intentional or unintentional \(p\)-hacking. Take note of this point: a recent article indicated that approximately 25% of researchers have engaged in at least one form of \(p\)-hacking in just the last 12 months.

11.7 Equivalence - repeated measures

Consider the following scenario where you begin a study with the intent to prove there is no effect. Many years ago the cereal Shreddies engaged in an interesting marketing strategy. They decided to market new Diamond Shreddies as illustrated below. Imagine that you are a researcher tasked with comparing the taste of the two types of Shreddies. Participants are given a bowl of the old Shreddies and then asked to rate it on a 15-point scale (1 to 15) where higher ratings indicate a better taste. Following this they are given a bowl of the new Diamond Shreddies and asked to rate the taste. How do you go about comparing the taste ratings if your goal is to establish that the taste of the old Shreddies is the same as that of the new Diamond Shreddies?

An incorrect approach to determining if the two types of cereal taste the same would be to just conduct a repeated measures \(t\)-test and look for a non-significant difference. A non-significant repeated measures \(t\)-test would leave you with no conclusion. The appropriate approach in this circumstance is to use an equivalence test. Prior to collecting data you set your smallest effect size of interest. Specifically, you imagine getting a mean for each cereal and calculating a difference using the original numbers on the 15-point rating scale. You decide that if that difference is between -1 and +1 (the smallest jump on the rating scale) then you will consider the two types of Shreddies to have equivalent taste. Notably, as per Lakens, Scheel, and Isager (2018), you set this smallest effect size of interest before you examine your data, to avoid being a \(p\)-hacker.

11.7.1 Raw units

After you collect your data (N = 50) you have ratings for old Shreddies (M = 12.1, SD = 2.50) and new Diamond Shreddies (M = 11.9, SD = 2.50). Because the same people taste both cereals you also have a correlation between the two taste ratings of \(r\) = .80. You run the R-code below to conduct the equivalence test.

library(TOSTER)

tsum_TOST(m1=12.1, sd1=2.5,
          m2=11.9,sd2=2.5,
          n1=50, n2=50,
          r12=.8,
          low_eqbound=-1,
          high_eqbound=1,
          eqbound_type = "raw",
          paired = TRUE)
## 
## Paired t-test
## 
## The equivalence test was significant, t(49) = -3.578, p = 3.96e-04
## The null hypothesis test was non-significant, t(49) = 0.894, p = 3.75e-01
## NHST: don't reject null significance hypothesis that the effect is equal to zero 
## TOST: reject null equivalence hypothesis
## 
## TOST Results 
##                  t df p.value
## t-test      0.8944 49   0.375
## TOST Lower  5.3666 49 < 0.001
## TOST Upper -3.5777 49 < 0.001
## 
## Effect Sizes 
##               Estimate     SE              C.I. Conf. Level
## Raw             0.2000 0.2236 [-0.1749, 0.5749]         0.9
## Hedges's g(z)   0.1245 0.1420 [-0.1061, 0.3539]         0.9
## Note: SMD confidence intervals are an approximation. See vignette("SMD_calcs").
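
To see where these numbers come from, the sketch below recomputes the three \(t\)-values by hand from the same summary statistics using only base R. It assumes the usual formula for the standard deviation of difference scores; the variable names are illustrative and this is not how TOSTER is implemented internally.

```r
# Summary statistics from the Shreddies example
m1 <- 12.1; m2 <- 11.9   # mean ratings
sd1 <- 2.5; sd2 <- 2.5   # standard deviations
r12 <- .80               # correlation between the two ratings
n <- 50                  # number of participants (pairs)
low <- -1; high <- 1     # equivalence bounds in raw units

# Standard deviation and standard error of the difference scores
sd_diff <- sqrt(sd1^2 + sd2^2 - 2 * r12 * sd1 * sd2)
se_diff <- sd_diff / sqrt(n)
df <- n - 1

mean_diff <- m1 - m2
t_nhst  <- mean_diff / se_diff           # traditional test against zero
t_lower <- (mean_diff - low)  / se_diff  # one-sided test against the lower bound
t_upper <- (mean_diff - high) / se_diff  # one-sided test against the upper bound

p_nhst  <- 2 * pt(abs(t_nhst), df, lower.tail = FALSE)
p_lower <- pt(t_lower, df, lower.tail = FALSE)
p_upper <- pt(t_upper, df, lower.tail = TRUE)
p_tost  <- max(p_lower, p_upper)         # both one-sided tests must be significant

round(c(t_nhst = t_nhst, t_lower = t_lower, t_upper = t_upper), 4)
round(c(p_nhst = p_nhst, p_tost = p_tost), 4)
```

The equivalence test reported in the output is the one-sided test with the larger \(p\)-value, which here is the test against the upper bound.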

We can then report that:

A repeated measures \(t\)-test indicated that the mean taste ratings for old Shreddies (M = 12.1, SD = 2.50) and new Diamond Shreddies (M = 11.9, SD = 2.50) were not significantly different, \(d\) = 0.13, 95% CI [-0.15, 0.40], \(t\)(49) = 0.894, \(p\) = .375. Prior to conducting analyses we established that a raw difference in the -1 to +1 range (the smallest possible change on the scale) would, for practical purposes, be considered equivalent to zero. The equivalence test was significant, \(t\)(49) = -3.578, \(p\) < .001, indicating the means for the two conditions were equivalent. Thus, we can conclude that the taste ratings of the two types of Shreddies are not statistically different and that they are statistically equivalent.

Note that the \(d\)-value with 95% CI was obtained with the MBESS command: ci.sm(ncp = 0.894, N = 50)

11.7.2 Standardized units

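You could also have run this test by indicating the smallest effect size of interest using a standardized effect size (i.e., a repeated measures \(d\)-value). Here the raw bound of 1 rating point is divided by the standard deviation of the difference scores (1.581, given SD = 2.50 for both ratings and \(r\) = .80), which gives \(d\) = .6325. The code below produces a result essentially identical to the above code; the TOST \(t\)-values differ only in the fourth decimal place because the bound is rounded. We simply indicate the range of values that count as practically equivalent to zero using \(d\)-values.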

library(TOSTER)

tsum_TOST(m1=12.1, sd1=2.5,
          m2=11.9,sd2=2.5,
          n1=50, n2=50,
          r12=.8,
          low_eqbound=-.6325,
          high_eqbound=.6325,
          eqbound_type = "SMD",
          bias_correction = TRUE,
          paired = TRUE)
## Warning: setting bound type to SMD produces biased results!
## 
## Paired t-test
## 
## The equivalence test was significant, t(49) = -3.578, p = 3.96e-04
## The null hypothesis test was non-significant, t(49) = 0.894, p = 3.75e-01
## NHST: don't reject null significance hypothesis that the effect is equal to zero 
## TOST: reject null equivalence hypothesis
## 
## TOST Results 
##                  t df p.value
## t-test      0.8944 49   0.375
## TOST Lower  5.3669 49 < 0.001
## TOST Upper -3.5780 49 < 0.001
## 
## Effect Sizes 
##               Estimate     SE              C.I. Conf. Level
## Raw             0.2000 0.2236 [-0.1749, 0.5749]         0.9
## Hedges's g(z)   0.1245 0.1420 [-0.1061, 0.3539]         0.9
## Note: SMD confidence intervals are an approximation. See vignette("SMD_calcs").

11.8 Equivalence - Independent groups

11.8.1 Raw units

You could also have run this study as an independent groups \(t\)-test where different people received each type of cereal. In this case, the R-code would be as below. Note the output in this case provides an ambiguous outcome: “the observed effect is statistically not different from zero and statistically not equivalent to zero” due to our small sample size. Thus, the primary finding from this study is that we should have used a larger number of participants.

library(TOSTER)

tsum_TOST(m1=12.1, sd1=2.5,
          m2=11.9,sd2=2.5,
          n1=50, n2=50,
          low_eqbound=-1,
          high_eqbound=1,
          eqbound_type = "raw",
          paired = FALSE)
## 
## Welch Two Sample t-test
## 
## The equivalence test was non-significant, t(98) = -1.600, p = 5.64e-02
## The null hypothesis test was non-significant, t(98) = 0.400, p = 6.9e-01
## NHST: don't reject null significance hypothesis that the effect is equal to zero 
## TOST: don't reject null equivalence hypothesis
## 
## TOST Results 
##               t df p.value
## t-test      0.4 98   0.690
## TOST Lower  2.4 98   0.009
## TOST Upper -1.6 98   0.056
## 
## Effect Sizes 
##                Estimate     SE              C.I.
## Raw             0.20000 0.5000 [-0.6303, 1.0303]
## Hedges's g(av)  0.07939 0.2021 [-0.2474, 0.4058]
##                Conf. Level
## Raw                    0.9
## Hedges's g(av)         0.9
## Note: SMD confidence intervals are an approximation. See vignette("SMD_calcs").
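
Another way to read this ambiguous result is through the 90% confidence interval: the groups are statistically equivalent only if the whole interval falls inside the equivalence bounds. The base R sketch below recomputes the interval from the summary statistics, assuming a Welch test as in the output above; the variable names are illustrative.

```r
# Summary statistics for the independent groups version of the study
m1 <- 12.1; m2 <- 11.9; sd1 <- 2.5; sd2 <- 2.5; n1 <- 50; n2 <- 50
low <- -1; high <- 1; alpha <- .05

v1 <- sd1^2 / n1; v2 <- sd2^2 / n2
se <- sqrt(v1 + v2)                                       # standard error of the difference
df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))   # Welch-Satterthwaite df

mean_diff <- m1 - m2
ci90 <- mean_diff + c(-1, 1) * qt(1 - alpha, df) * se     # 90% CI matches two tests at alpha = .05

equivalent <- ci90[1] > low && ci90[2] < high             # CI-inclusion rule
p_upper <- pt((mean_diff - high) / se, df)                # one-sided test against the upper bound

round(ci90, 4)
equivalent
round(p_upper, 4)
```

The upper limit of the interval sits just above +1, which is why the test against the upper bound is non-significant and equivalence cannot be claimed.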

11.8.2 Standardized units

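The R-code using standardized effect sizes is shown below. Dividing the raw bound of 1 rating point by the standard deviation of 2.50 gives bounds of \(d\) = -.4 and +.4.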

library(TOSTER)

tsum_TOST(m1=12.1, sd1=2.5,
          m2=11.9,sd2=2.5,
          n1=50, n2=50,
          low_eqbound=-.4,
          high_eqbound=.4,
          eqbound_type = "SMD",
          paired = FALSE)
## Warning: setting bound type to SMD produces biased results!
## 
## Welch Two Sample t-test
## 
## The equivalence test was non-significant, t(98) = -1.600, p = 5.64e-02
## The null hypothesis test was non-significant, t(98) = 0.400, p = 6.9e-01
## NHST: don't reject null significance hypothesis that the effect is equal to zero 
## TOST: don't reject null equivalence hypothesis
## 
## TOST Results 
##               t df p.value
## t-test      0.4 98   0.690
## TOST Lower  2.4 98   0.009
## TOST Upper -1.6 98   0.056
## 
## Effect Sizes 
##                Estimate     SE              C.I.
## Raw             0.20000 0.5000 [-0.6303, 1.0303]
## Hedges's g(av)  0.07939 0.2021 [-0.2474, 0.4058]
##                Conf. Level
## Raw                    0.9
## Hedges's g(av)         0.9
## Note: SMD confidence intervals are an approximation. See vignette("SMD_calcs").
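
The standardized bounds used in the two SMD examples (.6325 for repeated measures and .4 for independent groups) both correspond to the same raw bound of 1 rating point. The sketch below shows the conversion. The choice of standardizer is an assumption inferred from the fact that the raw and SMD outputs agree: the standard deviation of the difference scores for the paired design, and the average standard deviation for independent groups.

```r
raw_bound <- 1   # smallest effect size of interest in rating-scale units

# Paired design: standardize by the SD of the difference scores
sd_diff <- sqrt(2.5^2 + 2.5^2 - 2 * .80 * 2.5 * 2.5)

# Independent groups: standardize by the root mean square of the two SDs
sd_avg <- sqrt((2.5^2 + 2.5^2) / 2)

d_paired <- raw_bound / sd_diff
d_indep  <- raw_bound / sd_avg

round(c(paired = d_paired, independent = d_indep), 4)
```

Because the standardizers differ, the same raw bound maps onto different \(d\)-values in the two designs.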

11.9 Equivalence - Correlation

Imagine we are interested in conducting a study to prove our theory that there is a strong positive relation between academic performance and self-esteem. Prior to conducting the study we determined our smallest effect size of interest was .20. A traditional sample size analysis (with a desire for 90% power) indicated we should use a sample size of 210 (for a one-sided test). However, we also conducted a sample size analysis for an equivalence test, in case the correlation was non-significant, which suggested a larger sample size of 267. Consequently, we collected data from 267 students.

Our study found \(r\) = .09, \(p\) = .071 (one-tailed). This non-significant correlation means there is little to conclude from this study. We can’t say there is a relation due to the non-significance. But we also can’t say there is no relation. Fortunately, because we established our smallest effect size of interest prior to looking at our data, we can run an equivalence test with the code below.

library(TOSTER)

# regular NHST test
# two-sided so divide p-value by 2 to get one-sided if needed
corsum_test(r = .09,
            n = 267)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## z = 1.5, N = 267, p-value = 0.1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03037  0.20780
## sample estimates:
##  cor 
## 0.09
# Equivalence test
corsum_test(r = .09,
            n = 267,
            alternative = "e",
            null = .2)
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## z = -1.8, N = 267, p-value = 0.03
## alternative hypothesis: equivalence
## null values:
## correlation correlation 
##         0.2        -0.2 
## 90 percent confidence interval:
##  -0.01099  0.18917
## sample estimates:
##  cor 
## 0.09
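
The same results can be reproduced by hand with the Fisher \(z\) transformation, which is a useful check on what the equivalence test is doing. The sketch below uses only base R and assumes the standard error of a Fisher-transformed correlation is 1/sqrt(n - 3); the variable names are illustrative.

```r
r <- .09; n <- 267; bound <- .20; alpha <- .05

z_r  <- atanh(r)         # Fisher z transformation of the observed correlation
se_z <- 1 / sqrt(n - 3)  # standard error on the Fisher z scale

z_nhst  <- z_r / se_z                     # traditional test against zero
z_upper <- (z_r - atanh(bound))  / se_z   # test against the upper bound (+.20)
z_lower <- (z_r - atanh(-bound)) / se_z   # test against the lower bound (-.20)

p_nhst  <- 2 * pnorm(abs(z_nhst), lower.tail = FALSE)  # two-sided
p_upper <- pnorm(z_upper)
p_lower <- pnorm(z_lower, lower.tail = FALSE)
p_tost  <- max(p_lower, p_upper)          # both one-sided tests must be significant

# 90% confidence interval, back-transformed to the correlation scale
ci90 <- tanh(z_r + c(-1, 1) * qnorm(1 - alpha) * se_z)

round(c(z_nhst = z_nhst, z_upper = z_upper), 2)
round(c(p_one_sided = p_nhst / 2, p_tost = p_tost), 3)
round(ci90, 5)
```

The 90% interval lies entirely between -.20 and +.20, which is the confidence interval counterpart of the significant equivalence test.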

Because we conducted the equivalence test we can write a clear results section.

The relation between academic performance and self-esteem was non-significant, \(r\) = .09, 95% CI [-.03, .21], \(p\) = .071 (one-sided). A follow-up equivalence test (based on our a priori smallest effect size of interest, \(\rho\) = .20) was significant, \(p\) = .034, which indicates that for practical purposes the relation was equivalent to zero. That is, the observed correlation, \(r\) = .09, was not statistically different from zero and it was statistically equivalent to zero.

Notice how much clearer and stronger this results section is with the inclusion of an equivalence test. Note that the 95% CI for the correlation was obtained with the MBESS command: ci.cc(r = .09, n = 267)

11.10 Sample sizes for equivalence testing

11.10.1 Independent t-test

11.10.1.1 Standardized units

library(TOSTER)

power_t_TOST(alpha = .05,
             power = .90,
             eqb = .4,
             type = "two.sample")
## 
##      Two-sample TOST power calculation 
## 
##           power = 0.9
##            beta = 0.1
##           alpha = 0.05
##               n = 136
##           delta = 0
##              sd = 1
##          bounds = -0.4, 0.4
## 
## NOTE: n is number in *each* group

11.10.1.2 Raw units

library(TOSTER)

power_t_TOST(alpha = .05,
             power = .90,
             sd = 2.50,
             eqb = 1,
             type = "two.sample")
## 
##      Two-sample TOST power calculation 
## 
##           power = 0.9
##            beta = 0.1
##           alpha = 0.05
##               n = 136
##           delta = 0
##              sd = 2.5
##          bounds = -1, 1
## 
## NOTE: n is number in *each* group
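
The standardized and raw analyses give the same answer because the bounds are the same once they are expressed relative to the standard deviation (1 / 2.50 = .4). A rough normal-approximation sketch makes this visible. It assumes a true difference of zero and is only an approximation; power_t_TOST bases its calculation on the \(t\) distribution, so its answer can differ slightly.

```r
alpha <- .05; power <- .90

# With a true effect of zero, the Type II error is split across the two one-sided tests
z_sum <- qnorm(1 - alpha) + qnorm(1 - (1 - power) / 2)

# Approximate n per group for a two-sample equivalence test (illustrative helper)
n_tost <- function(eqb, sd = 1) 2 * z_sum^2 * (sd / eqb)^2

n_smd <- n_tost(eqb = .4)            # standardized bounds
n_raw <- n_tost(eqb = 1, sd = 2.5)   # raw bounds with SD = 2.50

c(standardized = n_smd, raw = n_raw)
ceiling(n_smd)
```

Rounded up to the next whole participant, the approximation agrees with the 136 per group reported above.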

11.10.2 Repeated t-test

11.10.2.1 Standardized units

library(TOSTER)


power_t_TOST(alpha = .05,
             power = .90,
             eqb = .6325,
             type = "paired")
## 
##      Paired TOST power calculation 
## 
##           power = 0.9
##            beta = 0.1
##           alpha = 0.05
##               n = 28.47
##           delta = 0
##              sd = 1
##          bounds = -0.6325, 0.6325
## 
## NOTE: n is number of *pairs*

11.10.2.2 Raw units

library(TOSTER)

power_t_TOST(alpha = .05,
             power = .90,
             sd = 1.581,
             eqb = 1,
             type = "paired")
## 
##      Paired TOST power calculation 
## 
##           power = 0.9
##            beta = 0.1
##           alpha = 0.05
##               n = 28.46
##           delta = 0
##              sd = 1.581
##          bounds = -1, 1
## 
## NOTE: n is number of *pairs*
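
The value sd = 1.581 is the standard deviation of the difference scores from the Shreddies example, and .6325 is the raw bound of 1 divided by that value. The sketch below shows one way to approximate the required number of pairs with base R. It assumes a true difference of zero and uses a noncentral \(t\) approximation to the power of the two one-sided tests; this is an illustration rather than TOSTER's own algorithm, and the function name is hypothetical.

```r
# Approximate power of a paired equivalence test when the true difference is zero
tost_power_paired <- function(n, eqb, sd_diff, alpha = .05) {
  df <- n - 1
  t_crit <- qt(1 - alpha, df)
  ncp <- eqb * sqrt(n) / sd_diff        # distance from zero to the bound in SE units
  2 * pt(-t_crit, df, ncp = -ncp) - 1   # symmetric bounds, true effect of zero
}

# SD of the difference scores from SD = 2.50 for both ratings and r = .80
sd_diff <- sqrt(2.5^2 + 2.5^2 - 2 * .80 * 2.5 * 2.5)

# Solve for the n that gives 90% power
sol <- uniroot(function(n) tost_power_paired(n, eqb = 1, sd_diff = sd_diff) - .90,
               interval = c(5, 100))
n_pairs <- sol$root

round(sd_diff, 3)
round(n_pairs, 2)
ceiling(n_pairs)
```

In practice you would round the result up to the next whole number of pairs.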

11.10.3 Correlation

library(TOSTER)

power_z_cor(alternative = "equivalence",
            alpha = .05,
            null = .2, 
            power = .9, 
            rho = 0)
## 
##      Approximate Power for Pearson Product-Moment Correlation (z-test) 
## 
##               n = 266.3
##             rho = 0
##           alpha = 0.05
##            beta = 0.1
##           power = 0.9
##            null = 0.2, -0.2
##     alternative = equivalence
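
This sample size can be checked by hand with the Fisher \(z\) approximation. The sketch below assumes a true correlation of zero, so the Type II error is split evenly across the two one-sided tests; the variable names are illustrative.

```r
alpha <- .05; power <- .90; bound <- .20

z_alpha <- qnorm(1 - alpha)             # one-sided critical value
z_beta  <- qnorm(1 - (1 - power) / 2)   # Type II error split across the two tests

# Fisher z approximation; the + 3 comes from the standard error 1 / sqrt(n - 3)
n <- ((z_alpha + z_beta) / atanh(bound))^2 + 3

round(n, 1)
ceiling(n)
```

Rounding up gives the 267 participants used in the academic performance and self-esteem example.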

References

Kirk, Roger E. 1996. “Practical Significance: A Concept Whose Time Has Come.” Educational and Psychological Measurement 56 (5): 746–59.
Lakens, Daniël, Anne M Scheel, and Peder M Isager. 2018. “Equivalence Testing for Psychological Research: A Tutorial.” Advances in Methods and Practices in Psychological Science 1 (2): 259–69.