At our department seminar last
week, the recent paper by Benjamin et al. on redefining statistical
significance was brought up. In this paper, a large group of researchers
argue that findings with a p-value
close to 0.05 reflect only weak evidence for an effect. Thus, to claim a new
discovery, the authors propose a stricter threshold, α = 0.005.
After hearing of this proposal, the immediate reaction in the seminar room was horror at some rough estimates of the loss of power, or the increase in required sample size, that the stricter threshold would involve. I imagine that this reaction is fairly standard among researchers, but from a quick scan of the “Redefine Statistical Significance” paper and the four responses to it that I have found (“Why redefining statistical significance will not improve reproducibility and could make the replication crisis worse” by Crane, “Justify your alpha” by Lakens et al., “Abandon statistical significance” by McShane et al., and “Retract p < 0.005 and propose using JASP instead” by Perezgonzales & Frías-Navarro), there are no updated sample size estimates.
Required sample sizes for α = 0.05 and α = 0.005 are very easy to calculate with G*Power. So, here are the sample size estimates for achieving 80% power with a two-tailed independent-samples t-test, for four different effect sizes:
Alpha | N for d = 0.2 | N for d = 0.4 | N for d = 0.6 | N for d = 0.8
0.05  | 788           | 200           | 90            | 52
0.005 | 1336          | 338           | 152           | 88
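These numbers are also easy to reproduce in code. Below is a minimal sketch in Python using the statsmodels package (my own choice; the numbers in this post come from G*Power). Note that solve_power returns the required n per group, so it is doubled to get the total N reported above; small rounding differences from G*Power are possible.

```python
import math

from statsmodels.stats.power import TTestIndPower

# Independent-samples t-test, two-tailed, 80% power.
analysis = TTestIndPower()
for alpha in (0.05, 0.005):
    for d in (0.2, 0.4, 0.6, 0.8):
        # solve_power returns the required sample size per group;
        # G*Power reports the total across both groups.
        n_per_group = analysis.solve_power(
            effect_size=d, alpha=alpha, power=0.8, alternative="two-sided"
        )
        print(f"alpha={alpha}, d={d}: total N = {2 * math.ceil(n_per_group)}")
```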
It is worth noting that most effects in psychology tend to be closer to the d = 0.2 end of the scale, and that most designs nowadays are more complicated than a simple main effect in a between-subjects comparison. More complex designs (e.g., those looking at an interaction) usually require even more participants.
The argument of Benjamin et al., that p-values close to 0.05 provide very weak evidence, is convincing. But their solution raises practical issues that should be considered. For some research questions, collecting a sample of 1336 participants could be achievable, for example by using online questionnaires instead of testing participants in the lab. For other research questions, collecting samples of this size is hard to imagine. It is not impossible, of course, but doing so would require a collective change in mindset, in research structure (e.g., investing more resources into a single project, providing longer-term contracts for early career researchers), and in incentives (e.g., relaxing the requirement to have many first-author publications).
If we ignore people’s concerns about the practical issues associated with collecting this many participants, the Open Science movement may lose a great many supporters.
Can I end this blog post on a positive note? Well, there are some things we can do to make the numbers in the table above seem less scary. For example, we can use within-subject designs when possible. Things already start to look brighter: using the same settings in G*Power as above, but calculating the required sample size for “Difference between two dependent means”, we get the following:
Alpha | N for d = 0.2 | N for d = 0.4 | N for d = 0.6 | N for d = 0.8
0.05  | 199           | 52            | 24            | 15
0.005 | 337           | 88            | 41            | 25
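The same check works in code for the within-subject case. In statsmodels the paired design corresponds to TTestPower, since a paired t-test is a one-sample test on the difference scores (again a sketch, assuming the effect size is the standardized mean difference of the paired scores):

```python
import math

from statsmodels.stats.power import TTestPower

# Paired (within-subject) t-test, two-tailed, 80% power.
paired = TTestPower()
for alpha in (0.05, 0.005):
    ns = [math.ceil(paired.solve_power(effect_size=d, alpha=alpha, power=0.8))
          for d in (0.2, 0.4, 0.6, 0.8)]
    print(f"alpha={alpha}: N = {ns}")
```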
We could also pre-register our
study, including the expected direction of a test, which would allow us to use
a one-sided t-test. If we do this, in addition to using a within-subject
design, we have:
Alpha | N for d = 0.2 | N for d = 0.4 | N for d = 0.6 | N for d = 0.8
0.05  | 156           | 41            | 19            | 12
0.005 | 296           | 77            | 36            | 22
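In the sketch above, the one-sided test is just a change to the alternative argument (where "larger" assumes the effect goes in the pre-registered direction):

```python
import math

from statsmodels.stats.power import TTestPower

# Paired t-test, one-tailed (directional hypothesis), 80% power.
paired = TTestPower()
for alpha in (0.05, 0.005):
    ns = [math.ceil(paired.solve_power(effect_size=d, alpha=alpha,
                                       power=0.8, alternative="larger"))
          for d in (0.2, 0.4, 0.6, 0.8)]
    print(f"alpha={alpha}: N = {ns}")
```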
The bottom line is: A comprehensive
solution to the replication crisis should address the practical issues
associated with getting larger sample sizes.