Amidst the outcries and discussions about the replication
crisis, there is one point on which there is a general consensus: very often,
studies in psychology are underpowered. An underpowered study is one that, even if the effect under investigation is real, runs a high risk of failing to detect it at the significance threshold. The message that we need to run bigger studies has seeped through the layer of replication bullies to the general scientific population. Papers are increasingly often being rejected for having small sample sizes. If nothing else, that should be reason enough
to care about this issue.
Despite the general consensus about the importance of
properly-powered studies, there is no real consensus about what we should
actually do about it, in practice. Of course, the solution, in theory, is
simple – we need to run bigger studies. But this solution is only simple if you
have the resources to do so. In practice, as I will discuss below, there are
many issues that remain unaddressed. I argue that, despite the upwards trend in
psychological science, drastic measures need to be taken to enable scientists
(regardless of their background) to produce good science.
For those who believe
that underpowered studies are not a problem
Meehl, Cohen, Schmidt, Gelman – they all explain the problem
of underpowered studies much better than I ever could. The notion that
underpowered studies give you misleading results is not an opinion – it’s a
mathematical fact. But seeing is believing, and if you still believe that you
can get useful information about a small or medium-sized effect with 20 participants per group, the best way to convince you otherwise is to show you some
simulations. If you haven’t tinkered around with simulating data, download R,
copy-and-paste the code below, and see what happens. Do it. Now. It doesn't matter if you're an undergraduate student, professor, or lay person who somehow stumbled across this blog post. I’ll wait.
*elevator
music*
# Simulating the populations
Population1=rnorm(n=10000,mean=100,sd=15)
Population2=rnorm(n=10000,mean=106,sd=15)
# This gives us a true effect in the population of Cohen's d = 0.4.

# RUN THE CODE BELOW MULTIPLE TIMES

# Sampling 20 participants from each population
Sample1=sample(Population1,20)
Sample2=sample(Population2,20)

# Calculating the means for the two samples
mean(Sample1)
mean(Sample2)
# Note how the means vary each time we run the simulation.

t.test(Sample1,Sample2)
# Note how many of the results give you a "significant" p-value.
The populations that we are simulating have means (e.g., IQ scores) of 100 and 106, respectively, and a standard deviation of 15. The difference can be summarised as a Cohen's d effect size of 0.4, a medium-sized effect. To get an intuitive feeling for this effect size, consider how strong an experimental manipulation would need to be to cause a true difference of 6 IQ points. The power (i.e., the
probability of obtaining a significant result, given that we know that the
alternative hypothesis is true and we have an effect of Cohen’s d = 0.4 in the
population) is 23% with 20 participants per cell (i.e., 40 altogether). You
should see the observed means jumping around quite a lot, suggesting that if
you care about quantifying the size of the effect you will get very unstable
results. You should also see a large number of simulations returning
non-significant effects, despite the fact that we know that there is an effect
in the population, suggesting that if you want to make reject/accept H0
decisions based on a single study you will be wrong most of the time.
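If you don't want to take the 23% figure on faith, you can check it yourself. The sketch below assumes you have already run the simulation code above, since it reuses Population1 and Population2; it asks base R's power.t.test for the analytical answer and then brute-forces the same number by repeating the sampling ten thousand times.

# Analytical power for 20 participants per cell, a 6-point difference, and SD = 15
power.t.test(n=20,delta=6,sd=15,sig.level=0.05)

# Brute-force check: repeat the sampling many times and count the significant results
pvals=replicate(10000,t.test(sample(Population1,20),sample(Population2,20))$p.value)
mean(pvals<0.05)
# Both numbers should land at roughly 0.23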
For the professors
who forgot what it’s like to be young
So, we need to increase our sample sizes if we study
small-to-medium effects. What’s the problem? The problems are practical in
nature. Maybe you are lucky enough to have gone through all stages of your career
at a department that has a very active participant pool, unlimited resources
for paying participants, and maybe even an army of bored research assistants
just waiting to be assigned with the task of going out and finding hundreds of
participants. In this case, you can count yourself incredibly lucky. My PhD
experience was similar to this. With a pool of keen undergraduates, enough
funds to pay a practically unlimited number of participants, and modern booth labs where I could test up to 8 people in parallel, I once managed to collect enough data for a four-experiment paper within a month. I list the following experiences from later in my career to respectfully remind colleagues that things aren't always this
easy. These are my experiences, of course – I don’t know how many people have
similar stories. My guess is that I’m not alone. Especially early-career
researchers and scientists from non-first-world countries, where giving funding
to social sciences is not really a thing yet, probably have similar
experiences. Or maybe I’m wrong about that, and I’m just unlucky. Either way, I
would be interested to hear about those others’ experiences in the comments.
- Working in a small, stuffy lab with no windows and only one computer that takes about as long to start as it takes you to run a participant.
- Relying on bachelor students to collect data. They have no resources for this. They can ask their friends and families, stop people in the corridor, and only their genuine interest and curiosity in the research question stops them from just sitting in the lab for ten hours and testing themselves over and over again, or learning how to write code for a random number generator to produce the data that is expected of them.
- Paying for participants from your own pocket.
- Commuting for two hours (one way) to a place with participants, with a 39-degree fever, then trying hard not to cough while the participants do tasks involving voice recording.
- Pre-registering your study, then having your contract run out before you have managed to collect the number of participants you'd promised.
- Trying to find free spots on the psychology department notice boards or toilet doors to plaster the flyer for your study between an abundance of other recruitment posters, and getting, on average, less than one participant per week, despite incessant spamming.
- Raising the issue of participant recruitment with senior colleagues, but not being able to come up with a practically feasible way to recruit participants more efficiently.
- Trying to find collaborators to help you with data collection, but learning that while people are happy to help, they rarely have spare resources they could use to recruit and test participants for you.
- Writing to lecturers to ask if you can advertise your study in their lectures. Being told that so many students ask the same question that allowing everyone to present their study in class is just not feasible anymore.
I can consider myself lucky in the sense that I’m doing
mostly behavioural studies with unselected samples of adults. If you are conducting
imaging studies, the price of a single participant cannot be covered from your
own pocket if the university decides not to pay. If you are studying a special
population, such as people with a rare disease, finding seven participants in the entire
country during your whole PhD or post-doc contract could already be an
achievement. If you are conducting experiments with children, bureaucratic
hurdles may prevent you from directly approaching your target population.
So, can we keep it
small?
It’s all well and good, some people say, to make theoretical
claims about the sample sizes that we need. But there are practical hurdles
that make it impossible in many cases. So, can we ignore the armchair
theoreticians’ hysteria about power and use practical feasibility to guide our
sample sizes?
Well, in theory we can. But in order to allow science to
progress, we, as a field, need to meet some conditions:
- Every study should be published, i.e., there should be no publication bias.
- Every study should provide full data in a freely accessible online repository.
- Every couple of years, someone needs to do a meta-analysis to synthesise the results from the existing small studies.
- Replications (including direct replications) are not frowned upon.
- We cannot, ever, draw conclusions from a single study.
At this stage, none of these conditions is met.
Therefore, if we continue to conduct small studies in the current system, those
that show non-significant results will likely disappear in a file drawer. Ironically,
the increased awareness of power amongst reviewers is increasing publication
bias at the same time: reviewers who recommend rejection based on small sample
sizes have good intentions, but this leads to an even larger amount of data that
never see the light of day. In addition, studies that have marginally
significant effects will be p-hacked
beyond recognition. Any meta-analysis of the published literature will then give
us a completely skewed view of the world. And in the end, we’ve wasted a lot of
resources and learned nothing.
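To see how badly this combination of small studies and a file drawer skews things, here is a rough sketch, again reusing Population1 and Population2 from the simulation above (so the true effect is d = 0.4): it runs many 20-participant-per-group studies, throws away the non-significant ones, and compares the average effect size of the "published" survivors to the truth.

# A rough sketch of the file-drawer problem, reusing Population1 and Population2 from above
run_study=function(n=20){
  s1=sample(Population1,n)
  s2=sample(Population2,n)
  d=(mean(s2)-mean(s1))/sqrt((sd(s1)^2+sd(s2)^2)/2) # approximate Cohen's d
  c(p=t.test(s1,s2)$p.value,d=d)
}
studies=replicate(10000,run_study())
mean(studies["d",]) # all studies: close to the true effect of 0.4
mean(studies["d",studies["p",]<0.05]) # "published" (significant) studies only: clearly inflated

Because only the luckiest samples clear the significance threshold at 23% power, the estimates that survive are systematically too large, and that inflation is exactly what a meta-analysis restricted to the published studies would inherit.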
So, increasing sample
size it is?
Unless we, as a field, tackle the issues described in the previous section, we will need
to increase our sample sizes. There is no way around it. This solution will
work, under a single premise:
- Research is not for everyone: Publishable studies will be conducted by a handful of labs in elite universities, who have the funding to recruit hundreds of participants within weeks or months. These will be the labs that will produce high-quality research at a fast pace, which will result in them winning more grants and producing even more high-quality research. And those who don’t have the resources to conduct large studies from the beginning? Well, fuck ’em.
This is a valid viewpoint, in the sense that a world where this is the norm would not have any of the problems associated with the small-study world described above. And yet, I would say that such a world would be very bad. First, it would be bad for individuals such as me (of course, I have some personal interest in writing this blog post), who spend months and months lugging the testing laptop through trains and different departments in search of participants, while other researchers snap their fingers and get their research assistants to run the same study in a matter of weeks.
Second, it disadvantages populations of researchers who may have systematically
different views. As mentioned above, populations with fewer resources probably
include younger researchers and those from non-first-world countries. Reducing
the opportunity for these researchers to contribute to their field of expertise
will create a homogeneous field, where scientific theories are based,
to a large extent, on the musings of old white men. By this process, the field
would lose an overwhelming amount of potential by locking out a majority of
scholars.
In short, I argue that publishing only well-powered studies
without consideration of practical issues that some researchers face will be
bad for individual researchers, as well as the whole field. So, how can we
increase power without creating a Matthew Effect, where the rich get richer and
the poor get poorer?
- Collaborate more, as I’ve argued here.
- Routinely use StudySwap, both to look for collaborators who can help you reach the sample size you need and to collect data for other researchers if you happen to have some bored research assistants or lots of keen undergrads.
- For the latter part of the last point, “rich” researchers will need to start sacrificing their own resources, which they could well use for a study of their own that would have a chance of getting them another first-author publication instead of ending up as fifth out of seven authors on someone else’s paper.
- As a logical consequence of the last point, researchers need to change their mindset, such that they prefer to publish fewer first-author papers and to spend more time collecting data, both for their own pet projects and for others’.
- And why are we so obsessed with first-author publications in the first place? It’s our incentive system, of course. We, as a field, should stop giving scholarships, jobs, grants, and promotions to researchers with the most first-author publications.
And where to now?
Perhaps an ideal world would consist of a mix of large-scale studies, small studies, and meta-analyses, as it kind of does already. But in order to allow for the build-up of knowledge in such a system, and to be able to separate true effects from crap in candy wrappers, we, as a field, need to fix all of the issues above.
And in the meantime, there are more questions than answers
for individual researchers. Do I conduct a large study? Do I bank all of my
resources on a single experiment, with a chance that, for whatever reason, it
may not work out, and I will finish my contract without a single publication?
Do I risk looking, in front of a prospective longish-term employer, like a
dreamer, one who promises the moon but in the end fails to recruit enough
participants? Or do I conduct small studies during my short-term contract? Do I
risk that journals will reject all of my papers because they are underpowered?
Do I run a small study, knowing that, most likely, the results will be
uninterpretable? Knowing that I may face pressure to p-hack to get publishable results, from journals, collaborators, or
the shrewd little devil sitting on my shoulder, reminding me that I won’t have
a job if I don’t get publications?