There are two types of psychological researchers: those who
are acutely aware of what is widely referred to as the replication crisis,
and those who have heard about it. The gist of the crisis is simple: it is
becoming increasingly obvious that some proportion of the published literature reports
empirical results that cannot be reproduced and that are likely to be false.
The implication is clear: if scientists believe in results that are false, they
will build theories based on erroneous assumptions, which in turn has
detrimental effects when these theories are applied to practical
domains (e.g., treatment of psychological disorders, evidence-based teaching
methods, hiring decisions by organisational psychologists). Clearly, this is an
outcome that needs to be avoided at all costs.
Luckily, it is relatively clear what needs to be done to
improve the way science works. Three issues that drive the crisis are generally
discussed: (1) Publication bias (e.g., Ioannidis, 2005; Rosenthal, 1979; Simonsohn, Nelson, & Simmons,
2014),
(2) underpowered studies (e.g., Button et al., 2013; Christley, 2010; Cohen, 1962; Royall, 1986),
and (3) questionable research practices (e.g., John, Loewenstein, & Prelec, 2012; Wagenmakers, Wetzels,
Borsboom, van der Maas, & Kievit, 2012). If all three of these issues
were addressed, the replication crisis would be – if not eliminated altogether
– strongly reduced. Addressing Point (1) requires changes in the incentive
system of academia (see, e.g., Lakens & Evers, 2014, for a discussion), and in the way journals
decide whether or not to publish a given submission. While this is possibly the
most important issue to solve, an individual researcher cannot take an easy set
of routine steps to achieve this desired end state. Point (2) requires
researchers to think carefully about the expected or minimal practically
meaningful effect size, and to determine the number of participants a priori, so that they can draw meaningful
conclusions about the presence or absence of such an effect. While this is not
too tricky conceptually, it often requires sample sizes that appear unrealistic
to anyone who is used to testing 16 participants per condition. In some instances –
for example, when working with a clinical population – collecting a large
number of participants can indeed be unrealistic. Thus, addressing this issue
also requires some deeper changes in the research structure, such as
encouraging collaborations between many labs to share the work associated with
collecting sufficient data, and making the raw data of all smaller
studies available (regardless of their outcome), to allow for meta-analyses across them.
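To make the a-priori power calculation of Point (2) concrete, here is a minimal sketch in R, using the base-R function power.t.test for a two-sample comparison; the standardised effect sizes are purely illustrative assumptions and should be replaced by whatever is minimally meaningful in a given research area:

# A priori power analysis for a two-sample t-test (alpha = .05, power = .80).
# The effect sizes d = 0.5 and d = 0.3 below are illustrative assumptions only.
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)  # n ≈ 64 per group
power.t.test(delta = 0.3, sd = 1, sig.level = 0.05, power = 0.80)  # n ≈ 176 per group

Even for a medium-sized effect, the required sample is considerably larger than the 16 participants per condition mentioned above.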
In this blog post, I focus on Point (3). The prevalence of
questionable research practices in existing studies is easiest to address, in
the sense that each individual researcher could contribute to resolving this
issue. In practical terms, this would require spreading awareness of how not to
p-hack. The term (and concept of) p-hacking is something I became aware of
relatively late during my PhD. At the department, we had a workshop by Zoltan Dienes,
who talked, among other things, about optional stopping. For the first time, I
heard that it was not good practice to collect some data, look at the results,
and then decide whether or not to collect more data. There were many
surprised reactions to this bit of information, including from senior researchers,
showing that researchers in general are not necessarily aware of what
constitutes questionable research practices.
In the majority of cases, questionable research practices,
such as optional stopping in a frequentist framework, are not a result of
malevolence or dishonesty, but rather stem from a lack of awareness. In
addition to this workshop, I have encountered many other examples of senior
researchers who did not seem to be aware that some common research practices
are considered to be questionable. Here is some advice that I received from
various researchers throughout my studies:
At a student pre-conference
workshop, a senior professor said: “Always check your data after you’ve tested,
say, five participants. If there is not even a hint of an effect, there’s no
point wasting your time by collecting more data.”
In response to some data that
did not show a hypothesised significant correlation between two variables,
another senior scientist advised me to remove some “obvious outliers” and look at
a subgroup of participants, because “clearly, there’s something there!”
Another senior researcher told me:
“For any effect that you want to test, repeat the analysis with at least five
different trimming methods and dependent variable transformations, and only believe in the
effect if the results of all analysis methods converge.”
All of these researchers genuinely wanted to give good and
helpful advice to a student. However, all of these practices could be
considered questionable. Again, this demonstrates a lack of awareness of what
counts as good research practice and what doesn’t. Addressing the issue of
questionable research practices, therefore, would require, first and foremost, spreading
awareness of what constitutes p-hacking.
To this end, I attempt to make a list of common practices that are not
acceptable in a frequentist framework:
(1) Optional
stopping: Collect some data, and then decide whether or not more data should be
collected based on the obtained results. Note that things change if we plan the
multiple analyses in advance and adjust the alpha-level accordingly, as
described, e.g., in this blogpost by Daniël Lakens. The simulation sketch after
this list illustrates how unplanned peeking inflates the false-positive rate.
(2) Creative
trimming: If the expected effect is not significant, have a closer look at the
individual subject data (and/or item-level data, in a psycholinguistic
experiment), and exclude any subjects (and/or items) that could be considered
outliers, until the required effect becomes significant.
(3) Adding
covariates: If the expected effect is not significant, perhaps there are some
confounding variables which we can include as covariates in various
constellations until the effect becomes significant?
(4) Unplanned
follow-up tests: When an independent variable has more than two levels and the
overall main effect is not significant, do follow-up tests anyway, to see if
the effect emerges at any level.
(5) Looking
at subgroups: If the expected effect is not significant, maybe it shows up only
in women? Or in participants aged 40 and over? Or those with particularly fast
reaction times?
(6) Cramming
as many independent variables as possible into a multi-way ANOVA: The more
contrasts you have, the greater the chance that at least one of them will be
statistically significant.
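To see how much a practice like optional stopping (Point 1) can inflate the Type-I error rate, here is a small simulation sketch in R; the peeking schedule (a t-test after every 10 participants per group, up to 100 per group) is an arbitrary assumption chosen for illustration:

# Simulate optional stopping under the null hypothesis (no true effect):
# run a t-test after every 10 participants per group and stop as soon as
# p < .05; give up once 100 participants per group have been tested.
peek_experiment <- function(n_max = 100, step = 10, alpha = 0.05) {
  x <- c()
  y <- c()
  for (n in seq(step, n_max, by = step)) {
    x <- c(x, rnorm(step))
    y <- c(y, rnorm(step))
    if (t.test(x, y)$p.value < alpha) return(TRUE)  # declared "significant"
  }
  FALSE
}
set.seed(123)
mean(replicate(5000, peek_experiment()))  # roughly 0.2 instead of the nominal 0.05

Even though there is no effect at all, roughly every fifth simulated “experiment” ends up significant, simply because the data were checked repeatedly.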
All of these issues are related to the problem of multiple
testing. In the frequentist framework, multiple testing is a problem, because
the p-value does not do what many
people seem to think it does: it does not tell us anything about the probability
that an effect exists. Instead, it tells us the probability of obtaining data at
least as extreme as those observed, if we assume that there is no effect. The
usefulness of a p-value, therefore, lies not in telling you whether or not there is
an effect, but in ensuring that, in the long run, your Type-I error
rate (i.e., the probability of erroneously concluding that there is an effect) is no greater than the
alpha-level (i.e., 5% for a cut-off of p
< 0.05). If one calculates multiple p-values
from the same data, the probability that at least one of these is smaller than
0.05 increases. This undermines exactly that desirable property: with multiple
comparisons, the probability of a false positive under the null hypothesis
rises well above the nominal 5%. This is described in detail in multiple
publications (e.g., Cramer et al., 2015; Dienes, 2011; Forstmeier & Schielzeth,
2011) and blog posts (e.g., by Hilda Bastian and Dorothy Bishop).
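The arithmetic behind this inflation is simple: for k independent tests at an alpha-level of .05, the probability of at least one false positive is 1 - (1 - .05)^k (for correlated tests the inflation is somewhat smaller, but it does not go away). A quick check in R:

k <- c(1, 3, 6, 10, 20)
round(1 - (1 - 0.05)^k, 2)  # 0.05 0.14 0.26 0.40 0.64

With six tests on the same data, the chance of at least one spurious “effect” already exceeds 25%.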
What can we do?
If we want to avoid questionable research practices (as we
should), it is important to consider alternatives that would allow us to do
sound and honest analyses. Three possibilities have been discussed in the
literature:
(1) Alpha-level
corrections (e.g., see Cramer et al., 2015); a minimal example in R follows after this list.
(2) Pre-registration
(e.g., Cramer et al., 2015; Wagenmakers et al., 2012): Before collecting data, a
pre-registration report can be uploaded, e.g., on the Open Science Framework
(OSF) website (osf.io). The OSF website allows you to
create a non-modifiable record of your report, which can be submitted along
with the paper. Such a pre-registration report should contain details about the
independent and dependent variables, the comparison of interest (including an
explicit hypothesis), the number of participants that will be tested, and the
trimming procedure. Any post-hoc decisions
(e.g., about including an unexpected confound in the final analyses) can still
be added, assuming there is a sound justification, because it will be clear
both to the authors and readers that this was a post-hoc decision. Pre-registration is possibly the quickest and
most efficient way to do good science – not to mention that deciding on these
analysis issues a priori will save a
lot of time during the analyses!
(3) Changing
to Bayesian analysis methods (e.g., Dienes, 2011; Rouder, 2014; Wagenmakers, 2007):
For reasons explained in these papers, Bayesian analyses are not susceptible to
multiple testing problems. The Bayesian philosophical approach is different from that of the
frequentists: here, we are not concerned with long-term error rates,
but rather with quantifying the degree of support for one hypothesis over
another (which, incidentally, seems to be what some researchers think they are
doing when they calculate a p-value).
Bayes factors scare some researchers because they are more difficult to
calculate than p-values, but this is
changing. R users can download the BayesFactor
package, which allows them to do Bayesian model comparisons using the same
models they would use in a frequentist analysis (ANOVAs, LMEs, regressions); a
minimal example follows after this list. SPSS users can use the newly developed software JASP,
which is user-friendly and similar in layout to SPSS – and it’s open source!
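For illustration, here is a minimal sketch in R of what options (1) and (3) can look like in practice. The data are simulated, the family of p-values is hypothetical, and the ttestBF call relies on the BayesFactor package’s default prior settings; treat it as a starting point rather than a complete analysis pipeline:

library(BayesFactor)  # install.packages("BayesFactor") if necessary

# Option (1): correct a family of p-values for multiple testing,
# here with the Holm method (Bonferroni is the classic, more conservative choice).
p_values <- c(0.012, 0.034, 0.041)  # hypothetical p-values from one data set
p.adjust(p_values, method = "holm")

# Option (3): a Bayesian t-test comparing two independent groups.
# The resulting Bayes factor quantifies the evidence for a group difference
# relative to the null hypothesis, using the package's default priors.
set.seed(1)
group1 <- rnorm(40, mean = 0)
group2 <- rnorm(40, mean = 0.4)
ttestBF(x = group1, y = group2)

JASP offers the same kinds of Bayesian t-tests and ANOVAs through a point-and-click interface, for those who prefer not to work in R.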
Conclusion
Thanks to some excellent work by various researchers (see
the reference list below), we have a pretty good idea of what needs to be done
to address the replication crisis. While two possible solutions require deeper
changes to the way in which research is conducted and disseminated – namely
combatting the practice of conducting under-powered studies and publication
bias – the prevalence of questionable research practices can (and should) be
addressed by each individual researcher. As shown by my anecdotes above, some
researchers are genuinely unaware of the drawbacks of multiple tests in frequentist
frameworks – although this is based, of course, on my anecdotal evidence, and
is hopefully changing with the recent publicity associated with the replication
crisis. My guess is that once researchers become aware of the practices that
are considered dishonest, and learn that there are viable alternatives, the
majority will start to use honest and sound methods, because it’s a
win-win situation: Overall, science will become more reproducible, and
researchers will have a warm, fuzzy feeling, knowing that they can be more
confident in their exciting new findings.
References
Button, K. S., Ioannidis, J. P. A., Mokrysz, C.,
Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013).
Confidence and precision increase with high statistical power. Nature Reviews Neuroscience, 14(8).
doi:10.1038/nrn3475-c4
Christley, R. (2010). Power and error:
increased risk of false positive results in underpowered studies. Open Epidemiology Journal, 3, 16-19.
Cohen, J. (1962). The statistical power of
abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65(3), 145-153.
Cramer, A. O., van Ravenzwaaij, D., Matzke,
D., Steingroever, H., Wetzels, R., Grasman, R. P., . . . Wagenmakers, E.-J.
(2015). Hidden multiplicity in exploratory multiway ANOVA: Prevalence and
remedies. Psychonomic Bulletin &
Review, 1-8.
Dienes, Z. (2011). Bayesian versus orthodox
statistics: Which side are you on? Perspectives
on Psychological Science, 6(3), 274-290.
Forstmeier, W., & Schielzeth, H.
(2011). Cryptic multiple hypotheses testing in linear models: overestimated
effect sizes and the winner's curse. Behavioral
Ecology and Sociobiology, 65(1), 47-55.
Ioannidis, J. P. A. (2005). Why most
published research findings are false. PLoS Medicine, 2(8), 696-701. doi:10.1371/journal.pmed.0020124
John, L. K., Loewenstein, G., & Prelec,
D. (2012). Measuring the prevalence of questionable research practices with
incentives for truth telling. Psychological Science, 23(5), 524-532.
Lakens, D., & Evers, E. R. (2014).
Sailing from the seas of chaos into the corridor of stability: Practical
recommendations to increase the informational value of studies. Perspectives on Psychological Science, 9(3),
278-292.
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638-641.
Rouder, J. N. (2014). Optional stopping: No
problem for Bayesians. Psychonomic
Bulletin & Review, 21(2), 301-308.
Royall, R. M. (1986). The effect of sample size on the meaning of significance tests. The American Statistician, 40(4), 313-315. doi:10.2307/2684616
Simonsohn, U., Nelson, L. D., &
Simmons, J. P. (2014). P-curve: a key to the file-drawer. Journal of Experimental Psychology: General, 143(2), 534-547.
doi:10.1037/a0033242
Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779-804.
Wagenmakers, E.-J., Wetzels, R., Borsboom,
D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely
confirmatory research. Perspectives on
Psychological Science, 7(6), 632-638.