
Statistical power from scratch

2019-12-29

In the 1970s, the American government wanted to save fuel by allowing drivers to turn right at a red light (Reinhart (2020)). Many studies found that this Right-Turn-On-Red (RTOR) change caused more accidents. Unfortunately, these studies concluded that the results were not statistically significant. Only years later, when the data was combined, were the changes found to be significant (Preusser et al. (1982)). Statisticians nowadays solve these kinds of problems by considering not only significance, but also power. This post aims to explain and demonstrate power from scratch. Specifically, data is simulated to show power and the concepts underlying it: the binomial distribution, the normal distribution, the sampling distribution, and the t-test, which together finally explain power. There is also a running example, introduced below, which is used throughout the post.

This post is not meant as a rigorous treatment of statistics; instead, it aims to explain the concepts as clearly as possible. Therefore, the post contains many plots. To see how the plots are generated, visit the R Markdown file.

Running example

To make the material more concrete, a running example is used. Suppose the American government is trying to determine whether it should enable RTOR. It wants to save fuel, but only if RTOR does not cause more deaths. Suppose that, on average, there are 100 accidents per year in Ohio. Suppose also that we (as readers, but not as the government) know that with RTOR the average rises to 120 accidents per year. These means for the old and new situation are denoted by $\mu_o$ and $\mu_n$ respectively. Knowing these means, it is clear that RTOR should not be allowed. How soon could the government know this from only seeing the data?

Binomial distribution

One way to generate data for this situation is as follows. Suppose 100 000 people turn right in Ohio per year. Since the average is 100 accidents per year, the probability of having an accident when turning right is $\tfrac{100}{100\,000} = 0.001$. With RTOR, this probability becomes $P_n = \tfrac{120}{100\,000} = 0.0012$. We can now simulate the situation by taking 100 000 random numbers between 0 and 1 and checking whether each sampled number is below $P_n$. (Specifically, the random numbers come from a uniform distribution between 0 and 1.) After one year, the simulation reports that 132 accidents have occurred. It makes sense that a government would not change its opinion based on that single report. Next, we simulate what happens over the course of 20 years.
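A minimal sketch of this simulation in R (the variable names are my own; the post's R Markdown file may differ):

```r
# One simulated year: 100 000 right turns, each causing an accident
# with probability P_n = 0.0012 under RTOR.
set.seed(1)            # arbitrary seed, only for reproducibility
n_turns <- 100000
p_n <- 0.0012
accidents_per_year <- function() sum(runif(n_turns) < p_n)
accidents_per_year()   # one simulated year; values land near 120
```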

Histogram for the number of accidents over 20 years

After 20 years, a government would be very likely to change its opinion. The histogram starts to approximate a bell curve. The simulation consists of $n$ independent experiments, each with a boolean-valued outcome. As the number of years approaches infinity, the histogram will approach a binomial distribution. A binomial distribution is denoted by $B(n, p)$, where $n$ is the number of trials and $p$ the success probability for each trial. Next, two density plots are shown for 100 and 1000 years. The plots are combined with a line for $B(100\,000, 0.0012)$. The figures show that the density plots approach the aforementioned binomial distribution.

Histogram for the number of accidents over 100 years

Histogram for the number of accidents over 1000 years
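Such a comparison could be made as follows; a sketch, reusing the simulation from above:

```r
# Density of 1000 simulated years, with B(100 000, 0.0012) overlaid
years <- replicate(1000, sum(runif(100000) < 0.0012))
plot(density(years))
ks <- 80:160
lines(ks, dbinom(ks, size = 100000, prob = 0.0012), col = "red")
```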

At first thought, it might seem strange that data generated from a uniform distribution approaches a smooth bell curve. The simplest way to see this is via a Galton box, a device invented by Sir Francis Galton; see the next image for a schematic view.

Galton box (source: Wikimedia Commons)

The picture shows a board with rows of pins. When balls are dropped from the top, they bounce randomly to the left or the right at each pin. At the bottom, a bell curve arises. With this image in mind, the bell curve makes sense. For example, it is very unlikely that a ball falls to the right at every pin.

Normal distribution

The central limit theorem (CLT) states that when many independent and identically distributed random variables with finite mean and variance are aggregated, the result approaches a normal distribution. The normal distribution is denoted by $N(\mu, \sigma^2)$, with mean $\mu$ and variance $\sigma^2$. Note that, in statistics, Greek letters are generally used "for measures of the population (called 'parameters')" and Latin letters "for measures of one or more samples (called 'statistics')" (Smith (2018)). Let $X$ be a discrete random variable; then $\mu$ and $\sigma^2$ are defined as

$$ \mu = \sum_{x \in X} x \, p(x), \quad \text{and} \quad \sigma^2 = E\left[(X - E(X))^2\right]. $$

The following table lists $\mu_n$ and $\sigma^2_n$ for the histograms shown above.

| Number of years | $\mu_n$ | $\sigma^2_n$ |
| --- | --- | --- |
| 20 | 119.7 | 66.6 |
| 100 | 120.9 | 106.6 |
| 1000 | 120.1 | 117.3 |
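These statistics can be computed directly from the simulated data; a sketch (the exact values depend on the random seed):

```r
years <- replicate(20, sum(runif(100000) < 0.0012))
mean(years)  # approaches 120 as the number of years grows
var(years)   # approaches n * p * (1 - p), which is roughly 119.9
```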

After more and more years, the mean approaches the expected mean of 120. The data approaches the binomial distribution by definition; specifically, it approaches $B(100\,000, 0.0012)$. By the CLT and the fact that $n$ is large, the data is also close to $N(120, 120)$, as can be seen when comparing the distributions.

Two binomial distributions and a normal distribution
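The closeness of $B(100\,000, 0.0012)$ and $N(120, 120)$ can also be checked directly; a minimal sketch:

```r
ks <- 80:160
plot(ks, dbinom(ks, size = 100000, prob = 0.0012), type = "l")
lines(ks, dnorm(ks, mean = 120, sd = sqrt(120)), col = "red")
```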

Sampling distribution

When taking only a few samples, there will be a lot of variance, which causes the sampling distribution to be wider than the underlying normal distribution. For example, we can plot the distribution of taking one sample from $N(120, 120)$, repeated 10 000 times.

Histogram for taking one sample (10 000 times)

The distribution is all over the place. Some samples returned values near 160; such values are very unlikely and therefore occur rarely. Obtaining 160 as the average of two samples is even more unlikely.

Histogram for taking the mean over two samples (10 000 times)

With more and more samples, the variance of the histogram decreases. This is caused by taking the average over the samples: the more samples in the set, the less likely it is to observe extreme outliers in the average.
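These sampling distributions can be simulated directly; a minimal sketch, assuming the same $N(120, 120)$ population as above:

```r
# Mean of n samples from N(120, 120), repeated 10 000 times
sample_means <- function(n) {
  replicate(10000, mean(rnorm(n, mean = 120, sd = sqrt(120))))
}
hist(sample_means(1), breaks = 50)  # single samples: very wide
hist(sample_means(2), breaks = 50)  # means of two samples: narrower
```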

So, the pattern for the variance of the distribution is clear. For easier comparison with drawn samples, it is convenient to standardize the distribution. First, the distribution's mean can be subtracted to center the distribution around zero.

Histogram for the distribution of taking two samples around zero

The width still depends on the data. This can be standardized by dividing each result by the distribution's standard deviation.

Histogram for the distribution of taking two samples around zero with a normalized variance

After also correcting for the sample size, a standardized sampling distribution remains. This distribution is known as the t-distribution $T(d)$, where $d$ denotes the degrees of freedom, with $d = n - 1$.
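This can be verified by standardizing simulated sample means and overlaying the corresponding t-distribution; a sketch for $n = 3$:

```r
# Standardized means of n = 3 samples, compared to T(2)
n <- 3
t_stats <- replicate(10000, {
  x <- rnorm(n, mean = 120, sd = sqrt(120))
  (mean(x) - 120) / (sd(x) / sqrt(n))  # center, then scale
})
hist(t_stats, breaks = 200, freq = FALSE, xlim = c(-6, 6))
curve(dt(x, df = n - 1), add = TRUE, col = "red")
```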

Various t-distributions compared to the standard normal distribution

t-test

Now that the distributions are known, statistical tests can be conducted. Let $S_o$ and $S_n$ be two sets of samples. The samples come from two normal distributions with a variance of 120. Specifically, $S_o$ is a set of samples from $N(\mu_o, 120)$, and $S_n$ is a set of samples from $N(\mu_n, 120)$.
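A sketch of how such sample sets could be drawn (the post's original seed is unknown, so the values will differ from the table below):

```r
set.seed(2)  # arbitrary seed
s_o <- round(rnorm(10, mean = 100, sd = sqrt(120)), 1)
s_n <- round(rnorm(10, mean = 120, sd = sqrt(120)), 1)
```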

| Index | $S_o$ | $S_n$ |
| --- | --- | --- |
| 1 | 110.6 | 117.0 |
| 2 | 113.9 | 112.9 |
| 3 | 105.2 | 120.7 |
| 4 | 105.6 | 98.5 |
| 5 | 95.5 | 131.2 |
| 6 | 100.8 | 126.0 |
| 7 | 96.9 | 104.3 |
| 8 | 102.8 | 114.0 |
| 9 | 110.5 | 127.6 |
| 10 | 84.6 | 130.3 |

Suppose a scientist is looking into the situation. The scientist states that any reported result should hold with 95% confidence. $S_o$ will be compared with $S_n$ in an effort to see whether $\mu_o$ differs from $\mu_n$, that is, whether $\mu_o < \mu_n$ or $\mu_o > \mu_n$. This is called a t-test. For a t-test, normality should be validated; normality tests are omitted here.

Let $\overline{x}$ be the sample mean, $s$ the sample standard deviation, and $n$ the sample size. Suppose two years have passed. The numbers of accidents for these years are 117.0 and 112.9. Then $\overline{x} = 114.95$, $s \approx 2.899$, and $n = 2$. To obtain the t-distribution, values were standardized. Therefore, the researcher also standardizes the sample statistic according to

$$ t = \frac{\overline{x} - \mu}{s / \sqrt{n}}. $$

With $\mu = \mu_o = 100$, the t-value becomes $t \approx 7.293$. To test the significance of this $t$-value, it is compared with a critical value. The critical value denotes the boundary of a confidence interval (CI) on the t-distribution. The t-value can be compared with a one-sided or a two-sided CI. Below, the 60% two-sided and the 80% one-sided confidence intervals are colored to give an intuition about the relationship between the intervals. Note that

$$ \begin{aligned} P(-a < T(1) < a) &= 60\%, \ \text{and} \\ P(T(1) < a) &= 80\%. \end{aligned} $$

T(1) and the 80% one-sided and 60% two-sided confidence intervals

So, the scientist tries to determine whether $\mu_o > \mu_n$ or $\mu_o < \mu_n$. This is reason to look at the two-sided CI. The calculated t-value is compared to the 95% two-sided critical value.

Calculated t-value compared to critical values

Since $t < 12.706$, the scientist concludes that there is no significant difference between the sample and the population; that is, the scientist concludes that $\mu_n = \mu_o$ and that RTOR does not cause more accidents.
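These numbers can be reproduced in a few lines of R; a sketch, with `t.test` being the built-in one-sample test:

```r
x <- c(117.0, 112.9)           # the two observed years from S_n
t_value <- (mean(x) - 100) / (sd(x) / sqrt(length(x)))
t_value                        # roughly 7.293
qt(0.975, df = length(x) - 1)  # 95% two-sided critical value: 12.706
t.test(x, mu = 100)            # the equivalent built-in test
```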

Power

When looking again after a few more years, the results start to change.

P-values for comparing $S_n$ with $\mu_o$

After more and more samples, it becomes clear that there is a significant difference between the two samples. Suppose the researcher does not know $\mu_o$ and instead has to estimate it based on $S_o$. The results then become less significant.

P-values for comparing $S_n$ with $S_o$

The power "describes the probability that a study will detect an effect when there is a genuine effect to be detected" (Ellis (2010)). (So, a higher power is better.) From the plots it seems that the power depends on the sample size nn, significance criterion α\alpha, and some other effect which depend on the test and the samples. The latter is called the effect size.

One way to calculate the effect size for the t-test is by using Cohen's d (Cohen (1992)). For two groups of samples $a$ and $b$, with means $m_a$ and $m_b$, standard deviations $s_a$ and $s_b$, and sample sizes $n_a$ and $n_b$ respectively,

$$ d = \frac{m_a - m_b}{s_p}, $$

where

$$ s_p = \sqrt{\frac{(n_a - 1)s_a^2 + (n_b - 1)s_b^2}{(n_a - 1) + (n_b - 1)}} = \sqrt{\frac{(n_a - 1)s_a^2 + (n_b - 1)s_b^2}{n_a + n_b - 2}}. $$

When comparing the first two samples from $S_n$ and $S_o$, we get $s_p \approx 2.632$ and $d \approx 1.026$. The positive effect size $d$ indicates that $m_a > m_b$, that is, $m_n > m_o$. Repeating this calculation for more samples shows that the effect size becomes larger and more reliable.
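A sketch of this calculation, using the first two values of each column from the table above:

```r
s_o <- c(110.6, 113.9)  # first two values of S_o
s_n <- c(117.0, 112.9)  # first two values of S_n
s_p <- sqrt((var(s_n) + var(s_o)) / 2)  # pooled sd for n_a = n_b = 2
d <- (mean(s_n) - mean(s_o)) / s_p
s_p  # roughly 2.632
d    # roughly 1.026
```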

Cohen's $d$ for comparing $S_n$ with $S_o$

Ellis (2010) argues that "effect size estimates need to be interpreted separately from tests of statistical significance". For example, a small effect size of r = 0.04 might be huge if it is about Propranolol and heart attack survival. In effect, it could mean a "4% decrease in heart attacks for people at risk" (Ellis (2010)).

For the running example, it also makes sense to look at the effect size. After 4 samples, the t-test shows insignificant results, while Cohen's $d$ shows that the samples from $S_n$ are half a standard deviation higher than those from $S_o$. Cohen would classify this as a medium effect (Cohen (1992)). After 5 samples, the results are still insignificant. However, Cohen's $d$ is higher than 1, which Cohen calls a large effect. Seeing such an "insignificant" but large effect could make one doubt the safety of RTOR.

Even though Cohen's rules of thumb can be useful, it is better to reason about the expected effect size and to design the study around that. Ellis (2010) provides various options for predicting effect sizes.

Power formulas tend to be non-trivial. Therefore, most books advise using power tables or software packages. This post does the latter and plots some results that are applicable to the running example. One could, for example, look at the sample size required to detect various effect sizes.

Power as a function of the sample size and effect size

For example, suppose the researcher expects the effect size $d$ to be 0.50 (a "medium effect" according to Cohen). To be 60% sure of not making a type II error, the researcher needs about 40 samples per group. With only 2 samples per group, the power is 0.06, which means that the chance of making a type II error is 1 - 0.06 = 94%.
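A sketch of how such numbers could be computed with the `pwr` package (an assumption; the post does not name the software it used):

```r
library(pwr)
# Sample size per group for d = 0.5, two-sided alpha = 0.05, power = 0.6
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.6)  # n is about 40
# Power with only 2 samples per group
pwr.t.test(n = 2, d = 0.5, sig.level = 0.05)        # power is about 0.06
```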

In the case of RTOR, it would be better to think about the cost of making a type I versus a type II error. The researcher would prefer raising a false alarm about the danger (type I error) over not raising an alarm when one should have been raised (type II error). Assume a medium effect size $d$ of 0.50. Then the plot for the type I versus the type II error looks as follows.

Power as a function of sample size and significance level

Specifically, for $\alpha = 0.05$ we have:

| Number of samples | Power |
| --- | --- |
| 2 | 0.06 |
| 20 | 0.34 |
| 80 | 0.88 |
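Assuming the same `pwr` package, this table can be reproduced as follows:

```r
sapply(c(2, 20, 80), function(n) pwr.t.test(n = n, d = 0.5)$power)
# roughly 0.06, 0.34, 0.88
```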

This corresponds with the earlier observation that tests will only yield significant results if the sample size is high enough.

Conclusion

Much of the statistics literature is based on a common set of notions, for example, distributions and variance. It is assumed that the reader is familiar with all of them. Working through these concepts from scratch has been insightful. The running example proved to be a useful common thread. Most fundamentals could be visualised surprisingly easily. For example, the normal, binomial, and t-distributions can all be obtained by drawing many random values between 0 and 1.

Even though the example was clear, power analysis remains vague. This is due to the fact that real-world information needs to be taken into account. For example, in most situations small effects are of little interest, while in others they might save hundreds of lives per year. This post has given a brief introduction to power analysis and omitted many aspects. For a more detailed description, see Ellis (2010).

References

Cohen, J. (1992). A power primer. Psychological bulletin, 112(1), 155.

Ellis, P. D. (2010). The Essential Guide to Effect Sizes. Cambridge University Press.

Preusser, D. F., Leaf, W. A., DeBartolo, K. B., Blomberg, R. D., & Levy, M. M. (1982). The effect of right-turn-on-red on pedestrian and bicyclist accidents. Journal of safety research, 13(2), 45-55.

Reinhart, A. (2020). Statistical power and underpowered statistics. https://www.statisticsdonewrong.com/power.html. Accessed: 2020-02-07.

Smith, W. (2018). The Mathematical Symbols used in Statistics. https://ocw.smithw.org/csunstatreview/statisticalsymbols.pdf. Accessed: 2020-02-13.