## Sunday, February 11, 2018

### By how much would we need to increase our sample sizes to have adequate power with an alpha level of 0.005?

At our department seminar last week, the recent paper by Benjamin et al. on redefining statistical significance was brought up. In this paper, a large group of researchers argue that findings with a p-value close to 0.05 reflect only weak evidence for an effect. Thus, to claim a new discovery, the authors propose a stricter threshold, α = 0.005.

When the proposal came up, the immediate reaction in the seminar room was horror at some rough estimates of either the loss of power or the increase in required sample size that it would involve. I imagine that this reaction is rather standard among researchers. Yet from a quick scan of the “Redefine Statistical Significance” paper and the four responses to it that I have found (“Why redefining statistical significance will not improve reproducibility and could make the replication crisis worse” by Crane, “Justify your alpha” by Lakens et al., “Abandon statistical significance” by McShane et al., and “Retract p < 0.005 and propose using JASP instead” by Perezgonzalez & Frías-Navarro), none of them gives updated sample size estimates.

Required sample size estimates for α = 0.05 and α = 0.005 are very easy to calculate with G*Power. So, here are the sample size estimates for achieving 80% power, for two-tailed independent-samples t-tests and four different effect sizes:

| Alpha | N for d = 0.2 | N for d = 0.4 | N for d = 0.6 | N for d = 0.8 |
| --- | --- | --- | --- | --- |
| 0.05 | 788 | 200 | 90 | 52 |
| 0.005 | 1336 | 338 | 152 | 88 |
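For readers without G*Power at hand, the numbers in this table can be approximated in a few lines of Python. This is only a rough sketch: it uses the normal approximation with a small-sample correction term rather than the noncentral t distribution that G*Power works with, and the function name `n_total_two_sample` is made up for illustration. It assumes two equally sized groups and a two-tailed test.

```python
from math import ceil
from statistics import NormalDist


def n_total_two_sample(d, alpha, power=0.80):
    """Approximate total N for a two-tailed independent-samples t-test.

    Hypothetical helper: normal approximation plus a small-sample
    correction, not the exact noncentral-t calculation.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)  # two-tailed critical value
    z_power = z(power)          # quantile matching the desired power
    # Per-group n from the normal approximation, plus z_alpha^2 / 4
    # to compensate for using z instead of t.
    n_per_group = 2 * (z_alpha + z_power) ** 2 / d ** 2 + z_alpha ** 2 / 4
    return 2 * ceil(n_per_group)  # two equal groups


for alpha in (0.05, 0.005):
    row = [n_total_two_sample(d, alpha) for d in (0.2, 0.4, 0.6, 0.8)]
    print(alpha, row)
```

Dividing the α = 0.005 row of the table by the α = 0.05 row (e.g., 1336 / 788) gives a factor of about 1.7 at every effect size, so the stricter threshold costs roughly 70% more participants for this design.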

It is worth noting that most effects in psychology tend to be closer to the d = 0.2 end of the scale, and that most designs are nowadays more complicated than simple main effects in a between-subject comparison. More complex designs (e.g., when one is looking at an interaction) usually require even more participants.

The argument of Benjamin et al., that p-values close to 0.05 provide very weak evidence, is convincing. But their solution raises practical issues which should be considered. For some research questions, collecting a sample of 1336 participants could be achievable, for example by using online questionnaires instead of testing participants in the lab. For other research questions, collecting samples of this size is unimaginable. It’s not impossible, of course, but doing so would require a collective change in mindset, in the research structure (e.g., investing more resources into a single project, providing longer-term contracts for early career researchers), and in incentives (e.g., relaxing the requirement to have many first-author publications).

If we ignore people’s concerns about the practical issues associated with collecting this many participants, the Open Science movement may lose a great many supporters.

Can I end this blog post on a positive note? Well, there are some things we can do to make the numbers from the table above seem less scary. For example, we can use within-subject designs when possible. Things already start to look brighter: using the same settings in G*Power as above, but calculating the required sample size for “Difference between two dependent means”, we get the following:

| Alpha | N for d = 0.2 | N for d = 0.4 | N for d = 0.6 | N for d = 0.8 |
| --- | --- | --- | --- | --- |
| 0.05 | 199 | 52 | 24 | 15 |
| 0.005 | 337 | 88 | 41 | 25 |

We could also pre-register our study, including the expected direction of a test, which would allow us to use a one-sided t-test. If we do this, in addition to using a within-subject design, we have:

| Alpha | N for d = 0.2 | N for d = 0.4 | N for d = 0.6 | N for d = 0.8 |
| --- | --- | --- | --- | --- |
| 0.05 | 156 | 41 | 19 | 12 |
| 0.005 | 296 | 77 | 36 | 22 |
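The two within-subject tables can be approximated in the same spirit. The sketch below is again a normal approximation with a small-sample correction, not G*Power's exact calculation, so it can be off by a participant here and there. The helper name `n_paired` is hypothetical, and `d` is assumed to be the standardized mean of the paired differences (what G*Power calls dz).

```python
from math import ceil
from statistics import NormalDist


def n_paired(d, alpha, power=0.80, two_sided=True):
    """Approximate N for a paired (dependent-means) t-test.

    Hypothetical helper: d is assumed to be the effect size of the
    paired differences; the result is an approximation only.
    """
    z = NormalDist().inv_cdf
    # A one-sided test puts all of alpha into one tail.
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    z_power = z(power)
    # Normal approximation plus z_alpha^2 / 2 as a small-sample correction.
    return ceil((z_alpha + z_power) ** 2 / d ** 2 + z_alpha ** 2 / 2)


effect_sizes = (0.2, 0.4, 0.6, 0.8)
for two_sided in (True, False):
    label = "two-sided" if two_sided else "one-sided"
    for alpha in (0.05, 0.005):
        row = [n_paired(d, alpha, two_sided=two_sided) for d in effect_sizes]
        print(label, alpha, row)
```

Comparing the one-sided with the two-sided output shows how much of the saving comes from the directional hypothesis alone, separately from the switch to a within-subject design.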

The bottom line is: A comprehensive solution to the replication crisis should address the practical issues associated with getting larger sample sizes.