
Statistical power from scratch

2019-12-29

In the 1970s, the American government wanted to save fuel by allowing drivers to turn right at a red light (Reinhart (2020)). Many studies found that this Right-Turn-On-Red (RTOR) change caused more accidents. Unfortunately, these studies concluded that the results were not statistically significant. Only years later, when the data was combined, were the changes found to be significant (Preusser et al. (1982)). Statisticians nowadays solve these kinds of problems by considering not only significance, but also power. This post aims to explain and demonstrate power from scratch. Specifically, data is simulated to show power and the concepts underlying it: the binomial distribution, the normal distribution, the sampling distribution, and the t-test, which together finally explain power. There is also a running example, introduced below, which is used throughout the post.

This post is not meant as a rigorous treatment of statistics; instead, it aims to explain the concepts as clearly as possible. Therefore, the post contains many plots. To see how the plots are generated, visit the R Markdown file.

Running example

To make the material more concrete, a running example is used. Suppose the American government is trying to determine whether it should enable RTOR. It wants to save fuel, but only if RTOR does not cause more deaths. Suppose that, on average, there are 100 accidents per year in Ohio. Suppose also that we (as readers, but not as the government) know that with RTOR the average rises to 120 accidents per year. These means for the old and new situation are denoted by $\mu_o$ and $\mu_n$ respectively. Knowing these means, it is clear that RTOR should not be allowed. How soon could the government know this from only seeing the data?

Binomial distribution

One way to generate data for this situation is as follows. Suppose 100 000 people turn right in Ohio per year. Since the average is 100 accidents per year, the probability of having an accident when turning right is $\tfrac{100}{100\,000} = 0.001$. With RTOR, this probability becomes $P_n = \tfrac{120}{100\,000} = 0.0012$. We can now simulate the situation by taking 100 000 random numbers between 0 and 1 and checking whether each sampled number is below $P_n$. (Specifically, the random numbers come from a uniform distribution between 0 and 1.) After one year, the simulation reports that 132 accidents have occurred. It makes sense that a government would not change its opinion based on that single report. Next, we simulate what happens over the course of 20 years.
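A minimal sketch of this simulation in R (the variable names are my own; the post's R Markdown file may differ):

```r
# One simulated year: 100 000 right turns, each causing an accident
# with probability P_n = 0.0012 under RTOR.
set.seed(1)            # arbitrary seed, only for reproducibility
n_turns <- 100000
p_n <- 0.0012
accidents_per_year <- function() sum(runif(n_turns) < p_n)
accidents_per_year()   # one simulated year; values land near 120
```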

Histogram for the number of accidents over 20 years

After 20 years, a government would be very likely to change its opinion. The histogram starts to approximate a bell curve. The simulation consists of $n$ independent experiments, each with a boolean-valued outcome. As the number of years approaches infinity, the histogram will approach a binomial distribution. A binomial distribution is denoted by $B(n, p)$, where $n$ is the number of trials and $p$ the success probability for each trial. Next, two density plots are shown for 100 and 1000 years. The plots are combined with a line for $B(100\,000, 0.0012)$. The figures show that the density plots approach the aforementioned binomial distribution.

Histogram for the number of accidents over 100 years

Histogram for the number of accidents over 1000 years
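Such a comparison could be made as follows; a sketch, reusing the simulation from above:

```r
# Density of 1000 simulated years, with B(100 000, 0.0012) overlaid
years <- replicate(1000, sum(runif(100000) < 0.0012))
plot(density(years))
ks <- 80:160
lines(ks, dbinom(ks, size = 100000, prob = 0.0012), col = "red")
```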

At first thought, it might seem strange that data generated from a uniform distribution approaches a smooth bell curve. The simplest way to see this is via a Galton box, a device invented by Sir Francis Galton; see the next image for a schematic view.

Galton box (source: Wikimedia Commons)

The picture shows a board with rows of pins. When balls are dropped from the top, they bounce randomly to the left or the right at each pin. At the bottom, a bell curve arises. With this image in mind, the bell curve makes sense. For example, it is very unlikely that a ball falls to the right at every pin.

Normal distribution

The central limit theorem (CLT) states that when many independent and identically distributed random variables with finite mean and variance are aggregated, the result approaches a normal distribution. The normal distribution is denoted by $N(\mu, \sigma^2)$, with mean $\mu$ and variance $\sigma^2$. Note that, in statistics, Greek letters are generally used "for measures of the population (called 'parameters')" and Latin letters "for measures of one or more samples (called 'statistics')" (Smith (2018)). Let $X$ be a discrete random variable; then $\mu$ and $\sigma^2$ are defined as

$$ \mu = \sum_{x \in X} x \, p(x), \quad \text{and} \quad \sigma^2 = E\left[(X - E(X))^2\right]. $$

The following table lists $\mu_n$ and $\sigma^2_n$ for the histograms shown above.

| Number of years | $\mu_n$ | $\sigma^2_n$ |
| --- | --- | --- |
| 20 | 119.7 | 66.6 |
| 100 | 120.9 | 106.6 |
| 1000 | 120.1 | 117.3 |
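These statistics can be computed directly from the simulated data; a sketch (the exact values depend on the random seed):

```r
years <- replicate(20, sum(runif(100000) < 0.0012))
mean(years)  # approaches 120 as the number of years grows
var(years)   # approaches n * p * (1 - p), which is roughly 119.9
```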

After more and more years, the mean approaches the expected mean of 120. The data approaches the binomial distribution by definition; specifically, it approaches $B(100\,000, 0.0012)$. By the CLT and the fact that $n$ is large, the data is also close to $N(120, 120)$, as can be seen when comparing the distributions.

Two binomial distributions and a normal distribution
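The closeness of $B(100\,000, 0.0012)$ and $N(120, 120)$ can also be checked directly; a minimal sketch:

```r
ks <- 80:160
plot(ks, dbinom(ks, size = 100000, prob = 0.0012), type = "l")
lines(ks, dnorm(ks, mean = 120, sd = sqrt(120)), col = "red")
```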

Sampling distribution

When taking only a few samples, there will be a lot of variance, which causes the sampling distribution to be wider than the underlying normal distribution. For example, we can plot the distribution of taking one sample from $N(120, 120)$, repeated 10 000 times.

Histogram for taking one sample (10 000 times)

The distribution is all over the place. Some samples returned values near 160; such values are very unlikely and therefore occur rarely. Obtaining 160 as the average of two samples is even more unlikely.

Histogram for taking the mean over two samples (10 000 times)

With more and more samples, the variance of the histogram decreases. This is caused by taking the average over the samples: the more samples in the set, the less likely it is to observe extreme outliers in the average.
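These sampling distributions can be simulated directly; a minimal sketch, assuming the same $N(120, 120)$ population as above:

```r
# Mean of n samples from N(120, 120), repeated 10 000 times
sample_means <- function(n) {
  replicate(10000, mean(rnorm(n, mean = 120, sd = sqrt(120))))
}
hist(sample_means(1), breaks = 50)  # single samples: very wide
hist(sample_means(2), breaks = 50)  # means of two samples: narrower
```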

So, the pattern for the variance of the distribution is clear. For easier comparison with drawn samples, it is convenient to standardize the distribution. First, the distribution's mean can be subtracted to center the distribution around zero.

Histogram for the distribution of taking two samples around zero

The width still depends on the data. This can be standardized by dividing each result by the distribution's standard deviation.

Histogram for the distribution of taking two samples around zero with a normalized variance

After also correcting for the sample size, a standardized sampling distribution remains. This distribution is known as the t-distribution $T(d)$, where $d$ denotes the degrees of freedom, with $d = n - 1$.
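This can be verified by standardizing simulated sample means and overlaying the corresponding t-distribution; a sketch for $n = 3$:

```r
# Standardized means of n = 3 samples, compared to T(2)
n <- 3
t_stats <- replicate(10000, {
  x <- rnorm(n, mean = 120, sd = sqrt(120))
  (mean(x) - 120) / (sd(x) / sqrt(n))  # center, then scale
})
hist(t_stats, breaks = 200, freq = FALSE, xlim = c(-6, 6))
curve(dt(x, df = n - 1), add = TRUE, col = "red")
```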

Various t-distributions compared to the standard normal distribution

t-test

Now that the distributions are known, statistical tests can be conducted. Let $S_o$ and $S_n$ be two sets of samples. The samples come from two normal distributions with a variance of 120. Specifically, $S_o$ is a set of samples from $N(\mu_o, 120)$, and $S_n$ is a set of samples from $N(\mu_n, 120)$.
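A sketch of how such sample sets could be drawn (the post's original seed is unknown, so the values will differ from the table below):

```r
set.seed(2)  # arbitrary seed
s_o <- round(rnorm(10, mean = 100, sd = sqrt(120)), 1)
s_n <- round(rnorm(10, mean = 120, sd = sqrt(120)), 1)
```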

| Index | $S_o$ | $S_n$ |
| --- | --- | --- |
| 1 | 110.6 | 117.0 |
| 2 | 113.9 | 112.9 |
| 3 | 105.2 | 120.7 |
| 4 | 105.6 | 98.5 |
| 5 | 95.5 | 131.2 |
| 6 | 100.8 | 126.0 |
| 7 | 96.9 | 104.3 |
| 8 | 102.8 | 114.0 |
| 9 | 110.5 | 127.6 |
| 10 | 84.6 | 130.3 |

Suppose a scientist is looking into the situation. The scientist states that any reported result should hold with 95% confidence. $S_o$ will be compared with $S_n$ in an effort to see whether $\mu_o$ differs from $\mu_n$, that is, whether $\mu_o < \mu_n$ or $\mu_o > \mu_n$. This is called a t-test. For a t-test, normality should be validated; normality tests are omitted here.

Let $\overline{x}$ be the sample mean, $s$ the sample standard deviation, and $n$ the sample size. Suppose two years have passed. The numbers of accidents for these years are 117.0 and 112.9. Then $\overline{x} = 114.95$, $s \approx 2.899$, and $n = 2$. To obtain the t-distribution, values were standardized. Therefore, the researcher also standardizes the sample statistic according to

$$ t = \frac{\overline{x} - \mu}{s / \sqrt{n}}. $$

With $\mu = \mu_o = 100$, the t-value becomes $t \approx 7.293$. To test the significance of this $t$-value, it is compared with a critical value. The critical value denotes the boundary of a confidence interval (CI) on the t-distribution. The t-value can be compared with a one-sided or a two-sided CI. Below, the 60% two-sided and the 80% one-sided confidence intervals are colored to give an intuition about the relationship between the intervals. Note that

$$ \begin{aligned} P(-a < T(1) < a) &= 60\%, \ \text{and} \\ P(T(1) < a) &= 80\%. \end{aligned} $$

T(1) and the 80% one-sided and 60% two-sided confidence intervals

So, the scientist tries to determine whether $\mu_o > \mu_n$ or $\mu_o < \mu_n$. This is reason to look at the two-sided CI. The calculated t-value is compared to the 95% two-sided critical value.

Calculated t-value compared to critical values

Since $t < 12.706$, the scientist concludes that there is no significant difference between the sample and the population; that is, the scientist concludes that $\mu_n = \mu_o$ and that RTOR does not cause more accidents.
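These numbers can be reproduced in a few lines of R; a sketch, with `t.test` being the built-in one-sample test:

```r
x <- c(117.0, 112.9)           # the two observed years from S_n
t_value <- (mean(x) - 100) / (sd(x) / sqrt(length(x)))
t_value                        # roughly 7.293
qt(0.975, df = length(x) - 1)  # 95% two-sided critical value: 12.706
t.test(x, mu = 100)            # the equivalent built-in test
```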

Power

When looking again after a few more years, the results start to change.

P-values for comparing $S_n$ with $\mu_o$

After more and more samples, it becomes clear that there is a significant difference between the two samples. Suppose the researcher does not know $\mu_o$ and instead has to estimate it based on $S_o$. The results then become less significant.

P-values for comparing $S_n$ with $S_o$

The power "describes the probability that a study will detect an effect when there is a genuine effect to be detected" (Ellis (2010)). (So, a higher power is better.) From the plots it seems that the power depends on the sample size nn, significance criterion α\alpha, and some other effect which depend on the test and the samples. The latter is called the effect size.

One way to calculate the effect size for the t-test is by using Cohen's d (Cohen (1992)). For two groups of samples $a$ and $b$, with means $m_a$ and $m_b$, standard deviations $s_a$ and $s_b$, and sample sizes $n_a$ and $n_b$ respectively,

$$ d = \frac{m_a - m_b}{s_p}, $$

where

$$ s_p = \sqrt{\frac{(n_a - 1)s_a^2 + (n_b - 1)s_b^2}{(n_a - 1) + (n_b - 1)}} = \sqrt{\frac{(n_a - 1)s_a^2 + (n_b - 1)s_b^2}{n_a + n_b - 2}}. $$

When comparing the first two samples from $S_n$ and $S_o$, we get $s_p \approx 2.632$ and $d \approx 1.026$. The positive effect size $d$ indicates that $m_a > m_b$, that is, $m_n > m_o$. Repeating this calculation for more samples shows that the effect size becomes larger and more reliable.
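A sketch of this calculation, using the first two values of each column from the table above:

```r
s_o <- c(110.6, 113.9)  # first two values of S_o
s_n <- c(117.0, 112.9)  # first two values of S_n
s_p <- sqrt((var(s_n) + var(s_o)) / 2)  # pooled sd for n_a = n_b = 2
d <- (mean(s_n) - mean(s_o)) / s_p
s_p  # roughly 2.632
d    # roughly 1.026
```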

Cohen's $d$ for comparing $S_n$ with $S_o$

Ellis (2010) argues that "effect size estimates need to be interpreted separately from tests of statistical significance". For example, a small effect size of r = 0.04 might be huge if it is about Propranolol and heart attack survival. In effect, it could mean a "4% decrease in heart attacks for people at risk" (Ellis (2010)).

For the running example, it also makes sense to look at the effect size. After 4 samples, the t-test shows insignificant results, while Cohen's $d$ shows that the samples from $S_n$ are half a standard deviation higher than those from $S_o$. Cohen would classify this as a medium effect (Cohen (1992)). After 5 samples, the results are still insignificant. However, Cohen's $d$ is higher than 1, which Cohen calls a large effect. Seeing such an "insignificant" but large effect could make one doubt the safety of RTOR.

Even though Cohen's rules of thumb can be useful, it is better to reason about the expected effect size and to design the study around that. Ellis (2010) provides various options for predicting effect sizes.

Power formulas tend to be non-trivial. Therefore, most books advise using power tables or software packages. This post does the latter and plots some results that are applicable to the running example. One could, for example, look at the sample size required to detect various effect sizes.

Power as a function of the sample size and effect size

For example, suppose the researcher expects the effect size $d$ to be 0.50 (a "medium effect" according to Cohen). To be 60% sure of not making a type II error, the researcher needs about 40 samples per group. With only 2 samples per group, the power is 0.06, which means that the chance of making a type II error is 1 - 0.06 = 94%.
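A sketch of how such numbers could be computed with the `pwr` package (an assumption; the post does not name the software it used):

```r
library(pwr)
# Sample size per group for d = 0.5, two-sided alpha = 0.05, power = 0.6
pwr.t.test(d = 0.5, sig.level = 0.05, power = 0.6)  # n is about 40
# Power with only 2 samples per group
pwr.t.test(n = 2, d = 0.5, sig.level = 0.05)        # power is about 0.06
```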

In the case of RTOR, it would be better to think about the cost of making a type I versus a type II error. The researcher would prefer raising a false alarm about the danger (type I error) over not raising an alarm when one should have been raised (type II error). Assume a medium effect size $d$ of 0.50. Then the plot for the type I versus the type II error looks as follows.

Power as a function of sample size and significance level

Specifically, for $\alpha = 0.05$ we have:

| Number of samples | Power |
| --- | --- |
| 2 | 0.06 |
| 20 | 0.34 |
| 80 | 0.88 |
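Assuming the same `pwr` package, this table can be reproduced as follows:

```r
sapply(c(2, 20, 80), function(n) pwr.t.test(n = n, d = 0.5)$power)
# roughly 0.06, 0.34, 0.88
```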

This corresponds with the earlier observation that tests will only yield significant results if the sample size is high enough.

Conclusion

Much of the statistics literature is based on a common set of notions, for example, distributions and variance. It is assumed that the reader is familiar with all of them. Working through these concepts from scratch has been insightful. The running example proved to be a useful common thread. Most fundamentals could be visualised surprisingly easily. For example, the normal, binomial, and t-distributions can all be obtained by drawing many random values between 0 and 1.

Even though the example was clear, power analysis remains vague. This is due to the fact that real-world information needs to be taken into account. For example, in most situations small effects are of little interest, while in others they might save hundreds of lives per year. This post has given a brief introduction to power analysis and omitted many aspects. For a more detailed description, see Ellis (2010).

References

Cohen, J. (1992). A power primer. Psychological bulletin, 112(1), 155.

Ellis, P. D. (2010). The Essential Guide to Effect Sizes. Cambridge University Press.

Preusser, D. F., Leaf, W. A., DeBartolo, K. B., Blomberg, R. D., & Levy, M. M. (1982). The effect of right-turn-on-red on pedestrian and bicyclist accidents. Journal of safety research, 13(2), 45-55.

Reinhart, A. (2020). Statistical power and underpowered statistics. https://www.statisticsdonewrong.com/power.html. Accessed: 2020-02-07.

Smith, W. (2018). The Mathematical Symbols used in Statistics. https://ocw.smithw.org/csunstatreview/statisticalsymbols.pdf. Accessed: 2020-02-13.