Chapter 15 Deeper dive sampling [under construction]
15.1 Previously: Equal Population Means
In the last chapter, we examined the scenario where there was a distribution of sample means in which the sample means differed due to random sampling only. In this chapter, we examine the scenario with multiple populations where the population means are different. We specifically focus on the variance of the distribution of sample means because that is the theoretical foundation for Analysis of Variance (ANOVA). We begin, however, with the simple situation of a single population.
15.2 Theory sample mean variance
15.2.1 Single population (theory)
Recall that when there is a single population we express the variance in sample means using the formula below. This formula is a conceptual one that calculates the variance in sample means due to sampling error for a given population variance and sample size.
\[ \begin{aligned} \sigma_{\bar{x}}^2 &= \text{variance due to sampling error}\\ &= \frac{\sigma_{people}^2}{n}\\ \end{aligned} \]
15.2.2 Multiple populations (theory)
When there are multiple populations with identical variances, but different population means, that same formula applies for sampling error. That is, because the populations of people all have the same variance, we can use the old formula to calculate the variability in sample means due to sampling error.
\[ \begin{aligned} \text{variance due to sampling error} &= \frac{\sigma_{people}^2}{n}\\ \end{aligned} \]
However, when there are multiple populations we also have to take into account the variability in population means when determining the variance of sample means. We use \(A\) to indicate the number of populations. If there are three populations then \(A = 3\). We use the formula below to calculate the variance of the population means:
\[ \begin{aligned} \text{variance in population means} &= \sigma_{\mu_A}^2 \\ &= \frac{\sum (\mu_{A_i}-\mu_{\mu_A})^2}{A} \\ \end{aligned} \]
where
\[ \begin{aligned} \mu_{\mu_A} &= \frac{\sum{\mu_A}}{A} \end{aligned} \]
So when there are multiple populations the variability in sample means is due to both sampling error and the variability in population means.
\[ \begin{aligned} \sigma_{\bar{x}}^2 &= \text{variance due to sampling error} + \text{variance in population means} \\ &= \frac{\sigma_{people}^2}{n} + \sigma_{\mu_A}^2\\ \end{aligned} \]
Importantly, if the populations all have the same mean then \(\sigma_{\mu_A}^2=0\). When this happens, sample means differ only due to sampling error.
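The two components above can be sketched as a small base R helper. This is only an illustration: the function name `sample_mean_variance()` and its arguments are hypothetical (they are not part of the learnSampling package), and the numeric inputs are the values used in the practice sections of this chapter.

```r
# Hypothetical helper: theoretical variance of sample means when all
# populations share one variance but may have different means.
#   pop_means    - vector with one population mean per population
#   pop_variance - common variance of the people in each population
#   n            - sample size behind every sample mean
sample_mean_variance <- function(pop_means, pop_variance, n) {
  sampling_error <- pop_variance / n
  # Population-style variance of the means: divide by A, not A - 1
  var_pop_means <- mean((pop_means - mean(pop_means))^2)
  c(sampling_error = sampling_error,
    var_pop_means  = var_pop_means,
    total          = sampling_error + var_pop_means)
}

equal_means     <- sample_mean_variance(c(172.5, 172.5, 172.5), 156.3236, 5)
different_means <- sample_mean_variance(c(152.5, 172.5, 192.5), 156.3236, 5)

print(equal_means)      # var_pop_means is 0, so total equals sampling_error
print(different_means)  # total is sampling_error plus a positive var_pop_means
```

Returning the two components alongside the total makes it easy to see which part of the variance changes when the population means move apart.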
15.3 Practice sample mean variance
15.3.1 Single population (practice)
Let’s quickly revisit the scenario where there is a single population with a mean of 172.50 and a variance of 156.3236. Sample means (n = 5) will only differ due to sampling error. We can calculate the variance due to sampling error with the equation below.
\[ \begin{aligned} \text{variance due to sampling error} &= \frac{\sigma_{people}^2}{n}\\ &= \frac{156.3236}{5}\\ &= 31.26472\\ \end{aligned} \]
Thus, when there is a single population (with variance 156.3236) and the sample size is n = 5, sample means will differ only due to sampling error. The variance in the sample means will be 31.26472.
We can confirm this with the R code below. We create a population and then take 9,000 sample means.
library(tidyverse)
library(learnSampling)
set.seed(1)
pop1 <- make_population(mean = 172.50, variance = 156.3236)
set.seed(1)
sample_means <- get_mean_samples(pop1,
                                 n = 5,
                                 number.of.trials = 9000)
## Number of populations: 1
## Using a = 1
The first few rows of the data set with 9,000 sample means are below:
head(sample_means)
##   trial source_population sample_mean
## 1     1                 1       176.2
## 2     2                 1       167.4
## 3     3                 1       168.8
## 4     4                 1       175.8
## 5     5                 1       174.0
## 6     6                 1       157.4
The trial column indicates which trial (of the 9,000) is represented in that row. The source population column indicates the population we obtained the sample from (1, since we only had one population). The sample mean column indicates the sample mean in that trial.
sample_means %>%
  summarise(var_means = var(sample_mean))
##   var_means
## 1  30.80005
We take the variance of all of those sample means and get a value of 30.80005, which closely approximates the theoretical value of 31.26472 (the value we would get with an infinite number of sample means).
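The same check can be sketched in base R without the learnSampling package. This version assumes a normally distributed population (an assumption made here for illustration; the chapter does not state the shape of the population) and draws values with `rnorm()`, so its result will not match the output above digit for digit.

```r
set.seed(1)

n      <- 5      # people per sample
trials <- 9000   # number of sample means to collect

# Assumption: a normal population with the chapter's mean and variance
pop_mean <- 172.50
pop_sd   <- sqrt(156.3236)

# Each replicate draws n people and records their mean
sample_means <- replicate(trials, mean(rnorm(n, mean = pop_mean, sd = pop_sd)))

simulated_variance   <- var(sample_means)
theoretical_variance <- pop_sd^2 / n

print(c(simulated = simulated_variance, theoretical = theoretical_variance))
```

With 9,000 sample means the simulated value should land close to, but not exactly on, the theoretical value.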
15.3.2 Multiple populations, equal \(\mu\) (practice)
Now let’s see what happens when we have three populations with the same mean, as illustrated in the table and figure below. Notice the mean for all three populations is 172.50 in the table below.
| Population | Mean | Variance |
|---|---|---|
| 1 | \(\mu_{A1_{people}}= 172.50\) | \(\sigma_{1_{people}}^2= 156.3236\) |
| 2 | \(\mu_{A2_{people}}= 172.50\) | \(\sigma_{2_{people}}^2= 156.3236\) |
| 3 | \(\mu_{A3_{people}}= 172.50\) | \(\sigma_{3_{people}}^2= 156.3236\) |
By creating three populations with identical means (and variances), we created a scenario where the sample means differ due to random sampling only. This situation is depicted in the figure below.
FIGURE 15.1: Sampling from three populations with the same mean (and variance)
Because the populations have equal variances we calculate the variance due to sampling error the way we did previously:
\[ \begin{aligned} \text{variance due to sampling error} &= \frac{\sigma_{people}^2}{n}\\ &= \frac{156.3236}{5}\\ &= 31.26472\\ \end{aligned} \]
We also have to calculate the variance in population means. To do that, we first need the mean of the population means. This value is calculated below - it’s simply the average of 172.50 repeated three times:
\[ \begin{aligned} \mu_{\mu_A} &= \frac{\sum{\mu_A}}{A} \\ &= \frac{\mu_{A1}+\mu_{A2}+\mu_{A3}}{A} \\ &= \frac{172.50 + 172.50 + 172.50}{3} \\ &= \frac{517.50}{3}\\ &= 172.50 \\ \end{aligned} \]
We use this value to calculate the variance of the population means, which we find to be 0 because the population means are all the same.
\[ \begin{aligned} \sigma_{\mu_A}^2 &= \frac{\sum (\mu_{A_i}-\mu_{\mu_A})^2}{A} \\ &= \frac{(\mu_{A_1} - \mu_{\mu_A})^2 + (\mu_{A_2} - \mu_{\mu_A})^2 + (\mu_{A_3} - \mu_{\mu_A})^2}{3} \\ &= \frac{(172.50 - 172.50)^2 + (172.50 - 172.50)^2 + (172.50 - 172.50)^2}{3} \\ &= \frac{0}{3} \\ &= 0\\ \end{aligned} \]
So the total variance in sample means is:
\[ \begin{aligned} \sigma_{\bar{x}}^2 &= \text{variance due to sampling error} + \text{variance in population means} \\ &= \frac{\sigma_{people}^2}{n} + \sigma_{\mu_A}^2\\ &= 31.26472 + 0\\ &= 31.26472\\ \end{aligned} \]
We can confirm this with the simulation below. In this simulation we still take 9,000 sample means, but it’s 3,000 from each of the three populations.
library(tidyverse)
library(learnSampling)
set.seed(1)
pop1 <- make_population(mean = 172.50, variance = 156.3236)
pop2 <- pop1 # populations are identical
pop3 <- pop1 # populations are identical
set.seed(1)
sample_means <- get_mean_samples(pop1, pop2, pop3,
                                 n = 5,
                                 number.of.trials = 9000)
## Number of populations: 3
## Using a = 3
The first few rows of the data set with 9,000 sample means are below:
head(sample_means)
##   trial source_population sample_mean
## 1     1                 1       176.2
## 2     2                 2       167.4
## 3     3                 3       168.8
## 4     4                 1       175.8
## 5     5                 2       174.0
## 6     6                 3       157.4
The trial column indicates which trial (of the 9,000) is represented in that row. The source population column indicates the population we obtained the sample from (1, 2, or 3). The sample mean column indicates the sample mean in that trial.
sample_means %>%
  summarise(var_means = var(sample_mean))
##   var_means
## 1  30.80853
We take the variance of all of those sample means and get a value of 30.80853, which closely approximates the theoretical value of 31.26472 (the value we would get with an infinite number of sample means).
That is, when multiple populations have identical means, the variance of sample means from those populations is due entirely to sampling error. That variance of sample means is represented by the formula below:
\[ \begin{aligned} \text{variance due to sampling error} &= \frac{\sigma_{people}^2}{n}\\ \end{aligned} \]
15.3.3 Multiple populations, different \(\mu\) (practice)
Now let’s see what happens when we have three populations with different means - as illustrated in the table and figure below.
| Population | Mean | Variance |
|---|---|---|
| 1 | \(\mu_{A1_{people}}= 152.50\) | \(\sigma_{1_{people}}^2= 156.3236\) |
| 2 | \(\mu_{A2_{people}}= 172.50\) | \(\sigma_{2_{people}}^2= 156.3236\) |
| 3 | \(\mu_{A3_{people}}= 192.50\) | \(\sigma_{3_{people}}^2= 156.3236\) |
FIGURE 15.2: Sampling from three populations with different means (and the same variance)
Because the populations have equal variances (even if the means are different) we calculate the variance due to sampling error the way we did previously:
\[ \begin{aligned} \text{variance due to sampling error} &= \frac{\sigma_{people}^2}{n}\\ &= \frac{156.3236}{5}\\ &= 31.26472\\ \end{aligned} \]
We also have to calculate the variance in population means. To do that, we first need the mean of the population means. This value is calculated below - notice that we are simply averaging the three population means: 152.5, 172.5, and 192.5.
\[ \begin{aligned} \mu_{\mu_A} &= \frac{\sum{\mu_A}}{A} \\ &= \frac{\mu_{A1}+\mu_{A2}+\mu_{A3}}{A} \\ &= \frac{152.50 + 172.50 + 192.50}{3} \\ &= \frac{517.50}{3}\\ &= 172.50 \\ \end{aligned} \]
We use this value to calculate the variance of the population means, which we find to be 266.6667 because the population means differ from one another.
\[ \begin{aligned} \sigma_{\mu_A}^2 &= \frac{\sum (\mu_{A_i}-\mu_{\mu_A})^2}{A} \\ &= \frac{(\mu_{A_1} - \mu_{\mu_A})^2 + (\mu_{A_2} - \mu_{\mu_A})^2 + (\mu_{A_3} - \mu_{\mu_A})^2}{3} \\ &= \frac{(152.50 - 172.50)^2 + (172.50 - 172.50)^2 + (192.50 - 172.50)^2}{3} \\ &= \frac{(-20)^2 + (0)^2 + (20)^2}{3} \\ &= \frac{400 + 0 + 400}{3} \\ &= \frac{800}{3} \\ &= 266.6667\\ \end{aligned} \]
So the total variance in sample means is:
\[ \begin{aligned} \sigma_{\bar{x}}^2 &= \text{variance due to sampling error} + \text{variance in population means} \\ &= \frac{\sigma_{people}^2}{n} + \sigma_{\mu_A}^2\\ &= 31.26472 + 266.6667\\ &= 297.9314\\ \end{aligned} \]
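The hand calculation above can be reproduced in a few lines of base R. One detail is worth flagging in this sketch: R's built-in `var()` divides by \(A - 1\), whereas the population formula in this chapter divides by \(A\), so the variance of the population means is computed explicitly here. The variable names are illustrative only.

```r
pop_means <- c(152.50, 172.50, 192.50)
A <- length(pop_means)

# Population formula used in this chapter: divide by A
var_pop_means <- sum((pop_means - mean(pop_means))^2) / A

# var() divides by A - 1, so it gives 800 / 2 = 400 rather than 800 / 3
var_sample_style <- var(pop_means)

sampling_error <- 156.3236 / 5
total_variance <- sampling_error + var_pop_means

print(c(var_pop_means    = var_pop_means,
        var_sample_style = var_sample_style,
        total_variance   = total_variance))
```

Using `var(pop_means)` by mistake would inflate the second component from 266.6667 to 400 and would no longer match the theoretical total of 297.9314.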
We can confirm this with the simulation below. In this simulation we still take 9,000 sample means, but it’s 3,000 from each of the three populations.
library(tidyverse)
library(learnSampling)
set.seed(1)
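# Note: populations 1 and 2 are created in the opposite order from the table above.
# The labelling does not change the variance of the sample means.
pop1 <- make_population(mean = 172.50, variance = 156.3236)
pop2 <- make_population(mean = 152.50, variance = 156.3236)
pop3 <- make_population(mean = 192.50, variance = 156.3236)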
set.seed(1)
sample_means <- get_mean_samples(pop1, pop2, pop3,
                                 n = 5,
                                 number.of.trials = 9000)
## Number of populations: 3
## Using a = 3
The first few rows of the data set with 9,000 sample means are below:
head(sample_means)
##   trial source_population sample_mean
## 1     1                 1       176.2
## 2     2                 2       147.4
## 3     3                 3       188.8
## 4     4                 1       175.8
## 5     5                 2       154.0
## 6     6                 3       177.4
The trial column indicates which trial (of the 9,000) is represented in that row. The source population column indicates the population we obtained the sample from (1, 2, or 3). The sample mean column indicates the sample mean in that trial.
sample_means %>%
  summarise(var_means = var(sample_mean))
##   var_means
## 1  297.7683
We take the variance of all of those sample means and get a value of 297.7683, which closely approximates the theoretical value of 297.9314 (the value we would get with an infinite number of sample means).
That is, when multiple populations have different means, the variance of sample means from those populations is due to both sampling error and the variance of the population means. That variance of sample means is represented by the formula below:
\[ \begin{aligned} \sigma_{\bar{x}}^2 &= \frac{\sigma_{people}^2}{n} + \sigma_{\mu_A}^2\\ \end{aligned} \]
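A base R sketch of the three-population simulation is shown below. It does not use the learnSampling package and it assumes normally distributed populations (an assumption made here for illustration), so its result will differ slightly from the output above while still landing near the theoretical value.

```r
set.seed(1)

n                     <- 5
trials_per_population <- 3000
pop_means             <- c(152.50, 172.50, 192.50)
pop_sd                <- sqrt(156.3236)   # same variance in every population

# For each population mean, collect 3,000 sample means of size n,
# then pool all 9,000 sample means into one vector
sample_means <- unlist(lapply(pop_means, function(mu) {
  replicate(trials_per_population, mean(rnorm(n, mean = mu, sd = pop_sd)))
}))

simulated_variance   <- var(sample_means)
theoretical_variance <- pop_sd^2 / n + mean((pop_means - mean(pop_means))^2)

print(c(simulated = simulated_variance, theoretical = theoretical_variance))
```

Setting all three entries of `pop_means` to 172.50 turns this into the equal-means scenario, where the simulated variance should fall back to roughly 31.26.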
15.4 Back to ANOVA
15.4.1 Conceptual/Expected \(F\)-values
In the preceding simulations we saw how the variance of the distribution of sample means differed between two scenarios: a) the population means are the same, and b) the population means are different. That is, when the population means were different the variance of the distribution of sample means was larger. Think about this finding in terms of an experiment.
Consider a scenario where you are interested in examining the effect of crowding on aggressive behavior. Your dependent variable is aggressive behavior. Your independent variable is Crowding with three levels: low, medium, and high. Your study has 5 people in each of these three conditions. The 5 people’s aggression scores in the low crowding condition represent a sample of aggression scores from a low crowding population. Likewise, the 5 people’s aggression scores in the medium crowding condition represent a sample of aggression scores from a medium crowding population. Finally, the 5 people’s aggression scores in the high crowding condition represent a sample of aggression scores from a high crowding population.
You calculate a sample mean for each of the three conditions. You are asking if there is an effect of crowding on aggression. To answer this question we need to think of the distribution of sample means that would occur if you repeated the study thousands of times. If you repeated the study thousands of times, and crowding had no effect on aggression, the sample means would differ only due to sampling error and the variance of their distribution would be described by the formula below. We think of this formula as describing the distribution of sample means under the Null Hypothesis (i.e., the scenario where the population means are equal).
\[ \begin{aligned} \sigma_{\bar{x}}^2 &= \text{variance due to sampling error} \\ &= \frac{\sigma_{people}^2}{n} \\ \end{aligned} \]
In contrast, if the crowding manipulation did influence aggressive behavior, this means there are different population means for the three crowding conditions. If the population means are different, then the variance for the distribution of sample means would correspond to:
\[ \begin{aligned} \sigma_{\bar{x}}^2 &= \text{variance due to sampling error} + \text{variance in population means} \\ &= \frac{\sigma_{people}^2}{n} + \sigma_{\mu_A}^2\\ \end{aligned} \]
Conceptually, an \(F\)-value represents the comparison of these two approaches:
\[ \begin{aligned} F &= \frac{\text{variance due to sampling error} + \text{variance in population means}}{\text{variance due to sampling error}}\\ &= \frac{\frac{\sigma_{people}^2}{n} + \sigma_{\mu_A}^2}{\frac{\sigma_{people}^2}{n}}\\ \end{aligned} \]
15.4.2 Two scenarios
Consider what happens to the above \(F\)-value when the population means are all the same. In this case the variance in population means is zero, \(\sigma_{\mu_A}^2=0\). When this happens, the numerator and denominator are the same - so the expected \(F\)-value is 1.00.
Now consider what happens to the above \(F\)-value when the population means are different. In this case, the variance of population means is larger than zero, \(\sigma_{\mu_A}^2 > 0\). When this happens, the numerator is larger than the denominator - so the expected \(F\)-value is greater than 1.00.
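The two scenarios can be illustrated numerically with the population values used earlier in this chapter. The helper below is a sketch of the conceptual ratio only; the name `conceptual_f()` is hypothetical and the function works from known population parameters, which are not available in a real study.

```r
# Hypothetical helper: conceptual F ratio from known population parameters
conceptual_f <- function(pop_means, pop_variance, n) {
  sampling_error <- pop_variance / n
  var_pop_means  <- mean((pop_means - mean(pop_means))^2)  # divide by A
  (sampling_error + var_pop_means) / sampling_error
}

# Scenario 1: identical population means
f_same <- conceptual_f(c(172.5, 172.5, 172.5), pop_variance = 156.3236, n = 5)

# Scenario 2: different population means
f_different <- conceptual_f(c(152.5, 172.5, 192.5), pop_variance = 156.3236, n = 5)

print(c(f_same = f_same, f_different = f_different))
```

In the first scenario the ratio is exactly 1. In the second it is roughly 9.53, because the numerator (297.9314) is much larger than the denominator (31.26472).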
15.5 Calculation
The \(F\)-value formulas presented above are for the expected values. How do we calculate \(F\)-values in practice? The formula below estimates the variance in sample means due to all sources (sampling error and the variance in population means):
\[ \begin{aligned} s_{\bar{x}}^2 = \frac{\sum_{i=1}^{a=3}{(\bar{x}_i - \bar{\bar{x}})^2}}{a-1} \end{aligned} \]
In contrast, the formula below estimates the variance due to sampling error alone:
\[ \begin{aligned} s_{\bar{x}}^2 = \frac{s_{people}^2}{n} \end{aligned} \]
Consequently, in practice we calculate \(F\)-values using the equation below.
\[ \begin{aligned} F = \frac{\frac{\sum_{i=1}^{a=3}{(\bar{x}_i - \bar{\bar{x}})^2}}{a-1}}{\frac{s_{people}^2}{n}} \end{aligned} \]
FIGURE 15.3: Linking conceptual and calculation formulas for the F-value
15.6 A confusing rearrangement
Unfortunately, most textbooks don’t present \(F\)-values in the conceptually easy-to-follow way presented above. Instead, they use the algebraic rearrangement presented below. The rearrangement is important, however, because it is the basis for how information is presented in an ANOVA table. Unfortunately, when the rearrangement is used, it’s hard to understand the equation.
\[ \begin{aligned} F &= \frac{\frac{\sigma_{people}^2}{n} + \sigma_{\mu_A}^2}{\frac{\sigma_{people}^2}{n}}*1\\ &= \frac{\frac{\sigma_{people}^2}{n} + \sigma_{\mu_A}^2}{\frac{\sigma_{people}^2}{n}}*\frac{n}{n}\\ &= \frac{n(\frac{\sigma_{people}^2}{n} + \sigma_{\mu_A}^2)}{n(\frac{\sigma_{people}^2}{n})}\\ &= \frac{\sigma_{people}^2 + n \sigma_{\mu_A}^2}{ \sigma_{people}^2}\\ &= \frac{\sigma_{error}^2 + n \sigma_{\mu_A}^2}{ \sigma_{error}^2}\\ \end{aligned} \]
Likewise, the calculation version is often presented in a different way - which is also harder to understand:
\[ \begin{aligned} F &= \frac{\frac{\sum_{i=1}^{a=3}{(\bar{x}_i - \bar{\bar{x}})^2}}{a-1}}{\frac{s_{people}^2}{n}}\\ &= \frac{\frac{\sum_{i=1}^{a=3}{(\bar{x}_i - \bar{\bar{x}})^2}}{a-1}}{\frac{s_{people}^2}{n}}*1\\ &= \frac{\frac{\sum_{i=1}^{a=3}{(\bar{x}_i - \bar{\bar{x}})^2}}{a-1}}{\frac{s_{people}^2}{n}}*\frac{n}{n}\\ &= \frac{n \frac{\sum_{i=1}^{a=3}{(\bar{x}_i - \bar{\bar{x}})^2}}{a-1}}{n \frac{s_{people}^2}{n}}\\ &= \frac{\frac{n \sum_{i=1}^{a=3}{(\bar{x}_i - \bar{\bar{x}})^2}}{a-1}}{s_{people}^2}\\ \end{aligned} \]
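To see that the two calculation forms give the same number, the sketch below computes both on a small made-up data set and compares them with the \(F\)-value from R's built-in `anova()`. The data are hypothetical (randomly generated with arbitrary means and a common standard deviation) and the design is balanced, which is what allows the people variance to be estimated as the simple average of the group variances.

```r
set.seed(1)
n <- 5  # people per condition

# Hypothetical data: three crowding conditions with n people each
scores <- data.frame(
  condition  = factor(rep(c("low", "medium", "high"), each = n)),
  aggression = c(rnorm(n, 10, 3), rnorm(n, 12, 3), rnorm(n, 15, 3))
)

group_means <- as.vector(tapply(scores$aggression, scores$condition, mean))
group_vars  <- as.vector(tapply(scores$aggression, scores$condition, var))
s2_people   <- mean(group_vars)  # average of group variances (equal n)

# Conceptual form: variance of sample means over sampling error variance
f_conceptual <- var(group_means) / (s2_people / n)

# Rearranged form: n times variance of sample means over people variance
f_rearranged <- (n * var(group_means)) / s2_people

# Built-in one-way ANOVA for comparison
f_anova <- anova(lm(aggression ~ condition, data = scores))[["F value"]][1]

print(c(conceptual = f_conceptual, rearranged = f_rearranged, anova = f_anova))
```

All three values agree because, in a balanced design, the numerator of the rearranged form is the between-groups mean square and the denominator is the within-groups mean square reported in an ANOVA table.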
15.7 Worked example
A researcher was interested in people’s reactions to video game feedback. Participants were randomly assigned to each of three conditions: negative feedback (subjects were insulted for their lack of ability to learn the game by a person over a loudspeaker who was reported to be in a control booth monitoring their performance), positive feedback (subjects were encouraged at how well they were doing), and a control condition in which subjects did not receive feedback. The dependent measure was video game score.
| Feedback | Mean | Variance | n |
|---|---|---|---|
| negative | \(\bar{x}_{negative}= 17498.97959\) | \(s_{negative}^2= 41157141\) | 196 |
| none | \(\bar{x}_{none}= 20196.90306\) | \(s_{none}^2= 37084531\) | 196 |
| positive | \(\bar{x}_{positive}= 25003.62245\) | \(s_{positive}^2= 31086669\) | 196 |
First, we estimate the population variance by averaging the three sample variances (the sample sizes are equal):
\[ \begin{aligned} s_{people}^2 &= \frac{s_{negative}^2+s_{none}^2+s_{positive}^2}{3}\\ &= \frac{41157141+37084531+31086669}{3}\\ &= 36442780\\ \end{aligned} \]
Second, we estimate the variance in sample means as if sampling error was the only source of variability:
\[ \begin{aligned} s_{\bar{x}}^2 &= \frac{s_{people}^2}{n}\\ &= \frac{36442780}{196}\\ &= 185932.6\\ \end{aligned} \]
Third, we estimate the variance in the distribution of sample means due to all sources:
\[ \begin{aligned} s_{\bar{x}}^2 &= \frac{\sum_{i=1}^{a=3}{(\bar{x}_i - \bar{\bar{x}})^2}}{a-1}\\ &= \frac{(\bar{x}_{negative}-\bar{\bar{x}})^2+(\bar{x}_{none}-\bar{\bar{x}})^2+(\bar{x}_{positive}-\bar{\bar{x}})^2}{3-1}\\ &= \frac{(17498.97959-20899.83503)^2+(20196.90306-20899.83503)^2+(25003.62245-20899.83503)^2}{3-1}\\ &= 14450501.13 \end{aligned} \]
Fourth, we compare those two estimates in a ratio:
\[ \begin{aligned} F &= \frac{\text{variance due to sampling error} + \text{variance in population means}}{\text{variance due to sampling error}}\\ &= \frac{\frac{\sum_{i=1}^{a=3}{(\bar{x}_i - \bar{\bar{x}})^2}}{a-1}}{\frac{s_{people}^2}{n}}\\ &= \frac{14450501.13}{\frac{36442780}{196}}\\ &= 77.72\\ \end{aligned} \]
The rearrangement of the \(F\)-value equation is below. This can be used to understand the other components of the \(F\)-table.
\[ \begin{aligned} F &= \frac{ \frac{(n)\sum_{i=1}^{a=3}{(\bar{x}_i - \bar{\bar{x}})^2}}{a-1}}{s_{people}^2}\\ &= \frac{(196)14450501.13}{36442780}\\ &= \frac{2832298221}{36442780}\\ &= 77.72\\ \end{aligned} \]
Below we see this formula mapped to the \(F\)-table:
FIGURE 15.4: Mapping formulas to the F-table.
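The four steps of the worked example can be reproduced in base R directly from the summary statistics in the table. This is a sketch with illustrative variable names. Note that `var()` divides by \(a - 1\), which is the divisor the calculation formula for the variance of sample means calls for.

```r
# Summary statistics from the feedback table
group_means <- c(negative = 17498.97959, none = 20196.90306, positive = 25003.62245)
group_vars  <- c(negative = 41157141,    none = 37084531,    positive = 31086669)
n <- 196  # people per condition

# Step 1: estimate the population (people) variance
s2_people <- mean(group_vars)

# Step 2: variance in sample means if sampling error were the only source
sampling_error <- s2_people / n

# Step 3: variance in sample means due to all sources (divides by a - 1)
var_of_means <- var(group_means)

# Step 4: compare the two estimates in a ratio
f_value <- var_of_means / sampling_error

# Rearranged version used in the F-table
f_rearranged <- (n * var_of_means) / s2_people

print(round(c(s2_people = s2_people, sampling_error = sampling_error,
              var_of_means = var_of_means, f_value = f_value,
              f_rearranged = f_rearranged), 2))
```

Both `f_value` and `f_rearranged` come out at about 77.72, matching the hand calculation.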
15.8 Summary
| Attribute | Parameter Calculation | Estimate of Parameter | Reflects |
|---|---|---|---|
| Method 1: Variance | \(\sigma_{\bar{x}}^2 = \frac{\sum_{i=1}^{K=\infty}{(\bar{x}_i - \mu_{\bar{x}})^2}}{K}\) | \(s_{\bar{x}}^2 = \frac{\sum_{i=1}^{a}{(\bar{x}_i - \bar{\bar{x}})^2}}{a-1}\) | Sampling Error & Variance of Population Means |
| Method 2 CLT: Variance | \(\sigma_{\bar{x}}^2 = \frac{\sigma_{people}^2}{n}\) | \(s_{\bar{x}}^2 = \frac{s_{people}^2}{n}\) | Just Sampling Error |