According to a guy on the
internet whom I unfortunately don’t know and can’t credit, “The amount of
energy necessary to refute bullshit is an order of magnitude bigger than to
produce it”. Anyone who has tried to argue for a lack of an effect can probably
attest to this. Convincing an audience that there is an effect can be done
relatively easily by showing a p-value
smaller than 0.05. However, if the p-value
is larger than 0.05, a much stronger case is needed: it could be argued that
the sample was too small to provide sufficient power to detect an effect, that
the manipulation was not strong enough, that the effect occurs, but only under specific
circumstances, that the experimenter probably fudged the analyses or is
incompetent – the list goes on and on.
Especially after the recent Open
Science Collaboration paper (2015),
which showed that only around a third of published findings in psychological science could be replicated, I don’t
think any competent psychological researcher would argue that null results are
not important. If only positive findings are published while null-results are
hidden in a file drawer, the inevitable consequence is that a considerable
proportion of published findings are false positives (Ioannidis, 2005; Rosenthal, 1979).
In such a system, false positives are never corrected, because this would
require someone to fail to replicate the finding and then convince
the reviewers and a journal editor that this failure is not due to a lack of
competence.
The issue still stands, though,
that providing convincing evidence for a null result is hard. The argument that
a null-result is probably due to a moderating factor or that the researcher is
not competent enough to find an effect often comes off as desperate straw-clutching, but other issues require careful consideration. In
particular, it is worth considering why providing evidence for the absence of
an effect is harder than providing evidence for an effect. Here, I propose some
reasons why this may be the case. I would welcome any thoughts or discussions on
this – needless to say, this issue is very important in all areas of
psychology, so we need to understand all obstacles that a replicator needs to
take into account.
Let’s say I conduct an exact
replication of a published experiment, which had found a significant effect. I
fail to find such an effect. If I followed the exact procedures of the original
authors, the argument that a moderating factor is responsible for the results
is off the table. The only explanation that remains is that both studies represent
random variance around a true population mean. Either that, or I botched the
experiment (because I’m a malevolent replicator and/or a stupid junior
researcher), but note that it is equally likely that the original authors
botched their experiment. If we assume that the different outcomes of the two
studies reflect random noise around a true population mean, that still doesn’t
tell us anything about whether this population mean is different from zero. For
this reason, among others, simply failing to obtain a significant p-value where the original study found one does
not constitute a convincing argument for the null hypothesis.
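To make this concrete, here is a minimal simulation sketch (my own illustration; the effect size, sample size, and seed are assumed for the example, not taken from any study): two studies of 25 participants each, drawn from the same population with the same modest true effect, will quite often disagree about significance.

```python
# Two "exact replications" of the same one-sample design, sampling from the
# same population: how often is one significant and the other not?
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_effect = 0.3        # assumed true standardised effect (Cohen's d)
n_per_study = 25         # typical small psycholinguistics sample
n_pairs = 10_000

disagree = 0
for _ in range(n_pairs):
    # two independent studies of the same design
    p = [stats.ttest_1samp(rng.normal(true_effect, 1.0, n_per_study), 0).pvalue
         for _ in range(2)]
    disagree += (p[0] < 0.05) != (p[1] < 0.05)

print(f"One study significant, the other not, in {disagree / n_pairs:.0%} of pairs")
```

With these assumed numbers, roughly four in ten pairs of studies end up on opposite sides of the significance threshold, even though both sample from exactly the same population.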
We could crank it up by doing a
Bayes Factor analysis: unlike a p-value,
a Bayes Factor quantifies the degree to which the data are compatible with the
null hypothesis relative to an alternative hypothesis (Dienes, 2014; Rouder, Speckman, Sun, Morey, & Iverson, 2009).
Thus, based on both data sets, it can tell you whether there is a case for the
null hypothesis or for the alternative hypothesis. If the data are about
equally compatible with both models (which is likely if the sample size is
small, as is typical in psychological studies),
the Bayes Factor will give you an ‘equivocal’ value around 1 (conventionally
ranging from 1/3 to 3). Assuming the original authors send me their raw data, I
can do a Bayes Factor analysis on both data sets. Most likely, I would get
equivocal evidence in one or both of the studies, which would indicate that
larger sample sizes are needed to detect whether the effect exists or not.
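For readers who want to see the mechanics, below is a rough sketch of the default JZS Bayes factor for a one-sample t-test, following the formula in Rouder et al. (2009). The Cauchy prior scale of r ≈ 0.707 and the example data are assumptions for illustration, not values from any actual study.

```python
# Sketch of the JZS Bayes factor (Rouder et al., 2009) for a one-sample t-test.
# BF01 > 3 is conventionally read as evidence for the null, BF01 < 1/3 as
# evidence for the effect, and values around 1 as equivocal.
import numpy as np
from scipy import stats
from scipy.integrate import quad

def jzs_bf01(t, n, r=np.sqrt(2) / 2):
    """BF01 (null over alternative) for a one-sample t-test with N participants."""
    nu = n - 1
    # Marginal likelihood of the data under H0 (effect size fixed at zero)
    null_density = (1 + t**2 / nu) ** (-(nu + 1) / 2)
    # Marginal likelihood under H1: average over a Cauchy(0, r) prior on the
    # effect size, written as a mixture over g (Rouder et al., 2009, Eq. 1)
    def integrand(g):
        return ((1 + n * g * r**2) ** -0.5
                * (1 + t**2 / ((1 + n * g * r**2) * nu)) ** (-(nu + 1) / 2)
                * (2 * np.pi) ** -0.5 * g ** -1.5 * np.exp(-1 / (2 * g)))
    alt_density, _ = quad(integrand, 0, np.inf)
    return null_density / alt_density

# Hypothetical numbers: a replication with 25 participants and a small observed effect
rng = np.random.default_rng(0)
scores = rng.normal(loc=0.1, scale=1.0, size=25)
t_stat = stats.ttest_1samp(scores, popmean=0).statistic
print(f"t = {t_stat:.2f}, BF01 = {jzs_bf01(t_stat, n=25):.2f}")
```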
But let’s say I get evidence for
an effect in the original authors’ data set, and evidence against it in
my own. Here, it could be argued that the effect is real, but it
has been over-estimated in the original study – which is plausible, given the
principle of the “winner’s curse” (Button et al., 2013; Gelman & Carlin, 2014).
As the Bayes Factor requires us to pre-specify the size of the expected effect
if H1 is true (the prior), it may tell us that the data are more in
line with a zero effect than with the unrealistic alternative hypothesis of a large
effect. Of course, we could reduce the prior scale post hoc,
which would strengthen our case if we still get evidence for the null over the
alternative hypothesis. But more likely, we would just get equivocal evidence,
suggesting that we don’t have enough data to distinguish between the new
alternative hypothesis and the null. Even if we do get evidence for the null, a
desperate reviewer may argue that the effect may be even smaller than the one
assumed by the new prior. After all, most theories in cognitive psychology make
directional rather than quantitative predictions (at best): either the effect
is zero, or it is larger (or smaller) than zero. If the effect is even slightly
different from zero, it can therefore have strong theoretical implications.
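To see how much this hinges on the prior, the short continuation below reuses the jzs_bf01() sketch from above with a made-up non-significant t-value and several assumed prior scales.

```python
# Same made-up result, different priors: a wide prior lets the data look like
# evidence for the null, while a prior expecting only a tiny effect leaves the
# evidence equivocal (BF01 close to 1).
for r in (1.0, 0.707, 0.2, 0.05):   # assumed Cauchy prior scales on Cohen's d
    print(f"prior scale r = {r:<5} -> BF01 = {jzs_bf01(t=0.8, n=25, r=r):.2f}")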
In an ideal world, this dilemma
would be resolved by testing hundreds or thousands of participants: this would
allow us to be confident that the data are more compatible with the null than even
a teeny tiny effect. Alternatively, we need to be satisfied with the conclusion
that if an effect exists, it must be smaller than the effect reported in the
original study. Both would be logically sound ways of arguing for the null if
there is a previous study that has supported the alternative hypothesis.
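For a rough sense of what “hundreds or thousands” means, here is a back-of-the-envelope power calculation (the effect sizes and the 90% power target are my assumptions): the smaller the effect we want to be able to detect or rule out, the faster the required sample size grows.

```python
# Sample size needed by a one-sample t-test to reach 90% power for
# progressively smaller assumed effect sizes.
from math import ceil
from statsmodels.stats.power import TTestPower

for d in (0.5, 0.2, 0.1, 0.05):     # assumed true effect sizes (Cohen's d)
    n = TTestPower().solve_power(effect_size=d, alpha=0.05, power=0.9)
    print(f"d = {d:<5} -> N ≈ {ceil(n)}")
```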
However, if I do a study which is
not a replication, but tests a prediction of a new theory, I am unlikely to
have a concrete idea about the effect size. In a typical psycholinguistic
experiment, I would test about 25 participants – and if I’m very conscientious,
I would replicate my own experiment with another 25 participants. As I am testing
the prediction of a theory, a null-result is theoretically important, but here
I am very open to the attack that the sample size is not sufficient
to test the hypothesis that an effect exists but is small. More importantly,
as my experimental procedure has not been used, in this exact form, in any
previous experiment, it could be argued that my manipulation was inadequate.
For example, if I wanted to show that adults were physically stronger than
children, and my dependent measure was the ability to lift a feather, I would
find evidence for the null-hypothesis even if I tested millions of children and
adults and used a tiny prior.
Here, the burden of proof lies
with the experimenter. As much as possible, the experimental design should maximise
the chances – not of finding evidence for the alternative hypothesis – but of
allowing the experimenter to draw theoretically informative conclusions,
regardless of the outcome. This is different from what appears to be the
conventional mindset, but it should considerably increase a researcher’s
productivity. In my (admittedly not-so-extensive) experience, it happens quite
frequently that researchers design a thoughtful experiment based on a very neat
theoretical prediction, only to end up throwing out the data because the
expected effect was not significant and arguing for the null was impossible,
given the design.
In summary, making a convincing
argument for a null hypothesis is not easy. But here is the million-dollar
question: is this different when your conclusion is that the null is true,
compared to when your conclusion is that the null is false? Underpowered
studies can lead to erroneous conclusions regardless of whether the H1
appears to be supported or not (Royall, 1986).
Both outcomes may be driven by botched experimental procedures or incompetent
data analysis. (Arguably, purposely dodgy data analyses are more
common when the authors try to argue for H1 than when they argue for
H0.) Random noise is more likely to suppress an existing effect than
to lead to a systematic pattern in the absence of an effect – but either way,
this would be much less of a problem if we move towards larger sample
sizes and more replications (as we should). While a null-result can often be
argued to be the result of a weak manipulation, a positive result can often be
argued to be the result of a confounding variable. When an alternative hypothesis is
supported, researchers tend to conclude that it is “in line” with their theory,
but a creative person could almost always propose a dozen other theories that
would explain the same effect.
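A quick simulation bears out the point above about random noise (again with an assumed effect size and sample size): at N = 25, a modest true effect is missed far more often than pure noise produces a spurious significant result.

```python
# Miss rate for a modest true effect vs. false-positive rate under pure noise,
# both at N = 25 (all numbers assumed for illustration).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, n_sims = 25, 10_000

def significant_rate(true_effect):
    hits = 0
    for _ in range(n_sims):
        sample = rng.normal(loc=true_effect, scale=1.0, size=n)
        hits += stats.ttest_1samp(sample, popmean=0).pvalue < 0.05
    return hits / n_sims

print(f"Miss rate with a true effect of d = 0.3: {1 - significant_rate(0.3):.0%}")
print(f"False-positive rate with no effect:      {significant_rate(0.0):.0%}")
```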
So, is there anything at all that
makes it more difficult to refute bullshit than to produce it?
References
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Confidence and precision increase with high statistical power. Nature Reviews Neuroscience, 14(8). doi:10.1038/nrn3475-c4
Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in Psychology, 5.
Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641-651.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), 696-701. doi:10.1371/journal.pmed.0020124
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638-641.
Rouder, J. N., Speckman, P. L., Sun, D. C., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225-237. doi:10.3758/PBR.16.2.225
Royall, R. M. (1986). The effect of sample size on the meaning of significance tests. The American Statistician, 40(4), 313-315. doi:10.2307/2684616