At our department seminar last
week, the recent paper by Benjamin et al. on redefining statistical
significance was brought up. In this paper, a large group of researchers
argue that findings with a p-value
close to 0.05 reflect only weak evidence for an effect. Thus, to claim a new
discovery, the authors propose a stricter threshold, α = 0.005.
After hearing of this proposal, the
immediate reaction in the seminar room was horror at some rough estimations of
either the loss of power or increase in the required sample size that this
would involve. I imagine that this reaction is rather standard among
researchers, but from a quick scan of the “Redefine Statistical Significance”
paper and four responses to the paper that I have found (“Why redefining statistical significance
will not improve reproducibility and could make the replication crisis worse”
by Crane, “Justify your alpha” by
Lakens et al., “Abandon statistical
significance” by McShane et al., and “Retract p < 0.005 and propose using JASP
instead” by Perezgonzalez & Frías-Navarro), there are no updated sample
size estimates.
Required sample size estimates for α = 0.05 and α = 0.005 are very easy to
calculate with G*Power. So, here
are the total sample sizes (both groups combined) needed to achieve 80% power, for two-tailed
independent-sample t-tests and four different effect sizes:
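| Alpha | N for d = 0.2 | N for d = 0.4 | N for d = 0.6 | N for d = 0.8 |
|-------|---------------|---------------|---------------|---------------|
| 0.05  | 788           | 200           | 90            | 52            |
| 0.005 | 1336          | 338           | 152           | 88            |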
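
For readers without G*Power at hand, the numbers in this table can be roughly reproduced with a normal approximation to the t-test. This is only a sketch: G*Power itself works with the noncentral t distribution, so the approximation comes out a few participants below the table values. The function name `total_n_independent` is made up for illustration.

```python
from math import ceil
from statistics import NormalDist


def total_n_independent(d, alpha, power=0.80):
    """Approximate total N (two equal groups) for a two-tailed
    independent-samples t-test, using the normal approximation
    n_per_group = 2 * ((z_{1 - alpha/2} + z_{power}) / d) ** 2.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_power = z.inv_cdf(power)          # quantile for the desired power
    n_per_group = ceil(2 * ((z_alpha + z_power) / d) ** 2)
    return 2 * n_per_group


# Same grid of alpha levels and effect sizes as in the table.
for alpha in (0.05, 0.005):
    print(alpha, [total_n_independent(d, alpha) for d in (0.2, 0.4, 0.6, 0.8)])
```

For actual study planning, an exact calculation (G*Power or any routine based on the noncentral t distribution) is the safer choice, since the approximation slightly underestimates the required N, especially for small samples.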

It is worth noting that most
effects in psychology tend to be closer to the d
= 0.2 end of the scale, and that most designs are nowadays more complicated than simple main
effects in a between-subject comparison. More complex designs (e.g., when one
is looking at an interaction) usually require even more participants.
The argument of Benjamin et al.,
that p-values close to 0.05 provide
very weak evidence, is convincing. But their solution raises practical issues
which should be considered. For some research questions, collecting a sample of
1336 participants could be achievable, for example by using online
questionnaires instead of testing participants at the lab. For other research
questions, collecting these kinds of samples is unimaginable. It’s not
impossible, of course, but doing so would require a collective change in
mindset, the research structure (e.g., investing more resources into a single
project, providing longer-term contracts for early career researchers), and
incentives (e.g., relaxing the requirement to have many first-author publications).
If we ignore people’s
concerns about the practical issues associated with collecting this many
participants, the Open Science movement may lose many supporters.
Can I end this blog post on a
positive note? Well, there are some things we can do to make the numbers from
the table above seem less scary. For example, we can use within-subject designs
when possible. Things already start to look brighter: Using the same settings in
G*Power as above, but calculating the required sample size for “Difference
between two dependent means”, we get the following:
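| Alpha | N for d = 0.2 | N for d = 0.4 | N for d = 0.6 | N for d = 0.8 |
|-------|---------------|---------------|---------------|---------------|
| 0.05  | 199           | 52            | 24            | 15            |
| 0.005 | 337           | 88            | 41            | 25            |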

We could also preregister our
study, including the expected direction of a test, which would allow us to use
a one-sided t-test. If we do this, in addition to using a within-subject
design, we have:
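| Alpha | N for d = 0.2 | N for d = 0.4 | N for d = 0.6 | N for d = 0.8 |
|-------|---------------|---------------|---------------|---------------|
| 0.05  | 156           | 41            | 19            | 12            |
| 0.005 | 296           | 77            | 36            | 22            |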
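
The within-subject numbers, two-sided and one-sided, can be approximated with a few lines of Python. This is a rough sketch using a normal approximation, not G*Power's exact noncentral t calculation, so it returns values a few participants below those in the tables. It assumes the effect size is entered as dz (the mean of the paired differences divided by their standard deviation), which is what G*Power expects for “Difference between two dependent means”. The function name `n_pairs` is invented for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_pairs(dz, alpha, power=0.80, two_sided=True):
    """Approximate number of participants for a paired t-test,
    using the normal approximation n = ((z_alpha + z_power) / dz) ** 2.
    """
    z = NormalDist()  # standard normal distribution
    # A one-sided test puts all of alpha in one tail, which lowers
    # the critical value and therefore the required sample size.
    tail = alpha / 2 if two_sided else alpha
    z_alpha = z.inv_cdf(1 - tail)
    z_power = z.inv_cdf(power)
    return ceil(((z_alpha + z_power) / dz) ** 2)


effect_sizes = (0.2, 0.4, 0.6, 0.8)
for two_sided in (True, False):
    label = "two-sided" if two_sided else "one-sided"
    for alpha in (0.05, 0.005):
        print(label, alpha, [n_pairs(dz, alpha, two_sided=two_sided) for dz in effect_sizes])
```

The approximation is least accurate for the smallest samples, so for the large effect sizes in particular the exact G*Power values should be preferred.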

The bottom line is: A comprehensive
solution to the replication crisis should address the practical issues
associated with getting larger sample sizes.