Wednesday, September 16, 2015

On refuting bullshit

According to a guy on the internet whom I unfortunately don’t know and can’t credit, “The amount of energy necessary to refute bullshit is an order of magnitude bigger than to produce it”. Anyone who has tried to argue for a lack of an effect can probably attest to this. Convincing an audience that there is an effect can be done relatively easily by showing a p-value smaller than 0.05. However, if the p-value is larger than 0.05, a much stronger case is needed: it could be argued that the sample was too small to provide sufficient power to detect an effect, that the manipulation was not strong enough, that the effect occurs, but only under specific circumstances, that the experimenter probably fudged the analyses or is incompetent – the list goes on and on.

Especially after the recent Open Science Collaboration paper (2015), in which only around 1/3 of the replication attempts in psychological science reproduced the original significant result, I don’t think any competent psychological researcher would argue that null results are not important. If only positive findings are published while null results are hidden in a file drawer, the inevitable consequence is that a considerable proportion of published findings are false positives (Ioannidis, 2005; Rosenthal, 1979). In such a system, false positives are never corrected, because this would require someone to fail to replicate a finding and subsequently to manage to convince the reviewers and a journal editor that this failure is not due to a lack of competence.
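To put rough numbers on the file-drawer argument, here is a minimal sketch of the underlying arithmetic. It assumes that only significant results get published and that there is no additional bias from flexible analyses; the function name and the input values (20% of tested hypotheses true, 35% power) are hypothetical and chosen purely for illustration.

```python
def false_positive_share(prior_true, power, alpha=0.05):
    """Share of significant (published) results that are false positives.

    prior_true: assumed proportion of tested hypotheses that are actually true
    power:      assumed probability of detecting a true effect
    alpha:      significance threshold
    """
    true_positives = prior_true * power
    false_positives = (1.0 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)


# Hypothetical inputs, not estimates for any real literature.
share = false_positive_share(prior_true=0.2, power=0.35)
print(f"False positives among significant results: {share:.0%}")
```

The lower the power and the fewer of the tested hypotheses are true, the larger this share becomes, even though every single study used the conventional 0.05 threshold.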

The issue still stands, though, that providing convincing evidence for a null result is hard. The argument that a null result is probably due to a moderating factor or that the researcher is not competent enough to find an effect often comes off as desperate straw-clutching, but other issues require careful consideration. In particular, it is worth considering why providing evidence for the absence of an effect is harder than providing evidence for an effect. Here, I propose some reasons why this may be the case. I would welcome any thoughts or discussions on this – needless to say, this issue is very important in all areas of psychology, so we need to understand all obstacles that a replicator needs to take into account.

Let’s say I conduct an exact replication of a published experiment, which had found a significant effect. I fail to find such an effect. If I followed the exact procedures of the original authors, the argument that a moderating factor is responsible for the results is off the table. The only explanation that remains is that both studies represent random variance around a true population mean. Either that, or I botched the experiment (because I’m a malevolent replicator and/or a stupid junior researcher) – but note that it is equally likely that the original authors botched their experiment. If we assume that the different outcomes of the two studies reflect random noise around a true population mean, that still doesn’t tell us anything about whether this population mean is different from zero. For this reason, among others, simply failing to find a p-value on the same side of significance as the original study does not constitute a convincing argument for the null hypothesis.

We could crank it up by doing a Bayes Factor analysis: unlike p-values, Bayes Factors quantify the degree to which the data are compatible with the null hypothesis relative to an alternative hypothesis (Dienes, 2014; Rouder, Speckman, Sun, Morey, & Iverson, 2009). Thus, they can tell you, based on both data sets, whether the data favour the null hypothesis, or whether there might be a case for the alternative hypothesis. If the data are equally compatible with both models – which is likely, if the sample size is small, as is typical in psychological studies – the Bayes Factor will give you an ‘equivocal’ value around 1 (conventionally ranging from 1/3 to 3). Assuming the original authors send me their raw data, I can do a Bayes Factor analysis on both data sets. Most likely, I would get equivocal evidence in one or both of the studies, which would indicate that larger sample sizes are needed to detect whether the effect exists or not.
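As a rough illustration of how such an analysis behaves, here is a minimal sketch of a Bayes Factor under a normal approximation. This is not the default t-test of Rouder et al. (2009): it assumes the observed effect is normally distributed around the true effect with a known standard error, and that H1 is described by a normal prior centred on zero. The function names and all numbers are hypothetical.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def bf01(observed_effect, standard_error, prior_sd):
    """Bayes Factor for H0 (effect = 0) over H1 (effect ~ Normal(0, prior_sd))."""
    likelihood_h0 = normal_pdf(observed_effect, 0.0, standard_error)
    # Under H1, prior uncertainty and sampling noise add up in variance.
    marginal_sd_h1 = math.sqrt(standard_error ** 2 + prior_sd ** 2)
    likelihood_h1 = normal_pdf(observed_effect, 0.0, marginal_sd_h1)
    return likelihood_h0 / likelihood_h1


def label(bf):
    """Conventional reading of a BF01 value (cut-offs at 1/3 and 3)."""
    if bf > 3.0:
        return "evidence for H0"
    if bf < 1.0 / 3.0:
        return "evidence for H1"
    return "equivocal"


# Hypothetical small study: the observed effect is exactly zero,
# but the standard error is large relative to the prior.
small_null = bf01(observed_effect=0.0, standard_error=0.2, prior_sd=0.5)
print(round(small_null, 2), label(small_null))

# Hypothetical study with a clearly non-zero observed effect.
clear_effect = bf01(observed_effect=0.6, standard_error=0.2, prior_sd=0.5)
print(round(clear_effect, 3), label(clear_effect))
```

Under these assumptions, even an observed effect of exactly zero only yields an equivocal Bayes Factor when the study is small, which is the situation described above.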

But let’s say I get evidence for an effect in the data set of the original authors, and evidence against it in my own data set. Here, it could be argued that an effect could be real, but it has been over-estimated in the original study – which is plausible, given the principle of the “winner’s curse” (Button et al., 2013; Gelman & Carlin, 2014). As the Bayes Factor requires us to pre-specify the size of the expected effect if H1 is true (the prior), it may tell us that the data is more in line with a zero-effect than the unrealistic alternative hypothesis of a large effect. Of course, we could lower the prior post-hoc, which would strengthen our case if we still get evidence for the null over the alternative hypothesis. But more likely, we would just get equivocal evidence, suggesting that we don’t have enough data to distinguish between the new alternative hypothesis and the null. Even if we do get evidence for the null, a desperate reviewer may argue that the effect may be even smaller than the one assumed by the new prior. After all, most theories in cognitive psychology make (at best) directional rather than quantitative predictions: either the effect is zero, or it is larger (or smaller) than zero. If the effect is even slightly different from zero, it can therefore have strong theoretical implications.
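The “winner’s curse” can be made concrete with a small calculation in the spirit of the Type M (magnitude) error of Gelman and Carlin (2014). The sketch below assumes a normally distributed estimate with a known standard error and a positive true effect; it computes the power of a two-sided test and the factor by which significant estimates overstate the true effect on average. Function names and input values are hypothetical.

```python
import math


def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def power_and_exaggeration(true_effect, standard_error, z_crit=1.96):
    """Power and expected exaggeration ratio among significant estimates.

    Assumes estimate ~ Normal(true_effect, standard_error) and true_effect > 0.
    """
    cutoff = z_crit * standard_error
    upper = (cutoff - true_effect) / standard_error
    lower = (-cutoff - true_effect) / standard_error
    p_upper = 1.0 - std_normal_cdf(upper)  # significant, correct sign
    p_lower = std_normal_cdf(lower)        # significant, wrong sign
    power = p_upper + p_lower
    # Expected absolute estimate, restricted to the significant region
    # (closed form for the partial means of a normal distribution).
    abs_estimate_sum = (
        true_effect * (p_upper - p_lower)
        + standard_error * (std_normal_pdf(upper) + std_normal_pdf(lower))
    )
    exaggeration = abs_estimate_sum / power / true_effect
    return power, exaggeration


# Hypothetical underpowered study: true effect equal to one standard error.
power, ratio = power_and_exaggeration(true_effect=0.1, standard_error=0.1)
print(f"power = {power:.2f}, exaggeration = {ratio:.2f}")
```

With low power, the estimates that happen to cross the significance threshold are necessarily much larger than the true effect, so a prior based on the original estimate describes an unrealistically large effect.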

In an ideal world, this dilemma would be resolved by testing hundreds or thousands of participants: this would allow us to be confident that the data are more compatible with the null than even a teeny tiny effect. Alternatively, we need to be satisfied with the conclusion that if an effect exists, it must be smaller than the effect reported in the original study. Both would be logically sound ways of arguing for the null if there is a previous study that has supported the alternative hypothesis.
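To get a feeling for “hundreds or thousands”, here is a best-case sketch. It assumes a standardised effect measured once per participant (so the standard error is 1/sqrt(n)), a normal prior centred on zero for H1, and an observed effect of exactly zero. Under these assumptions the Bayes Factor for the null is sqrt(1 + n * prior_sd^2), which can be solved for n. The function name and the prior values are hypothetical.

```python
import math


def participants_needed(prior_sd, target_bf01=3.0):
    """Smallest n at which BF01 reaches the target, in the best case.

    Best case means the observed effect is exactly zero, so that
    BF01 = sqrt(1 + n * prior_sd**2). Any sampling noise in the
    observed effect would push the required n higher.
    """
    return math.ceil((target_bf01 ** 2 - 1.0) / prior_sd ** 2)


# Hypothetical prior scales, from a large expected effect to a tiny one.
for prior_sd in (0.5, 0.2, 0.1, 0.05):
    print(prior_sd, participants_needed(prior_sd))
```

Because the required n grows with the inverse square of the prior scale, halving the effect size we want to rule out roughly quadruples the number of participants we need.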

However, if I do a study which is not a replication, but tests a prediction of a new theory, I am unlikely to have a concrete idea about the effect size. In a typical psycholinguistic experiment, I would test about 25 participants – and if I’m very conscientious, I would replicate my own experiment with another 25 participants. As I am testing the prediction of a theory, a null result is theoretically important, but here I am very open to the attack that the sample size is not sufficient to test the hypothesis that an effect exists, but is small. More importantly, as my experimental procedure has not been used, in this exact format, in any previous experiment, it could be argued that my manipulation was inadequate. For example, if I wanted to show that adults were physically stronger than children, and my dependent measure was the ability to lift a feather, I would find evidence for the null hypothesis even if I tested millions of children and adults and used a tiny prior.

Here, the burden of proof lies with the experimenter. As much as possible, the experimental design should maximise the chances – not of finding evidence for the alternative hypothesis – but of allowing the experimenter to draw theoretically informative conclusions, regardless of the outcome. This is different from what appears to be the conventional mindset, but it should considerably increase a researcher’s productivity. In my (admittedly not-so-extensive) experience, it happens quite frequently that researchers design a thoughtful experiment based on a very neat theoretical prediction, only to end up throwing the data out, because the expected effect was not significant, and arguing for the null was impossible, given the design.

In summary, making a convincing argument for a null hypothesis is not easy. But here is the million-dollar question: is this different when your conclusion is that the null is true, compared to when your conclusion is that the null is false? Underpowered studies can lead to erroneous conclusions regardless of whether H1 appears to be supported or not (Royall, 1986). Both outcomes may be driven by botched experimental procedures or incompetent data analysis procedures. (Arguably, purposely dodgy data analyses are more common when the authors try to argue for H1 than when they argue for H0.) Random noise is more likely to suppress an existing effect than to lead to a systematic pattern in the absence of an effect – but either way, this would be much less of a problem if we move towards larger sample sizes and more replications (as we should). While a null result can often be argued to be the result of a weak manipulation, a positive result can often be argued to be the result of a covariate. When an alternative hypothesis is supported, researchers tend to conclude that it is “in line” with their theory, but a creative person could almost always propose a dozen other theories that would explain the same effect.

So, is there anything at all that makes it more difficult to refute bullshit than to produce it?

Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Confidence and precision increase with high statistical power. Nature Reviews Neuroscience, 14(8). doi:10.1038/nrn3475-c4
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in Psychology, 5.
Gelman, A., & Carlin, J. (2014). Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychological Science, 9(6), 641-651.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2(8), 696-701. doi:10.1371/journal.pmed.0020124
Rosenthal, R. (1979). The "File Drawer Problem" and Tolerance for Null Results. Psychological Bulletin, 86(3), 638-641.
Rouder, J. N., Speckman, P. L., Sun, D. C., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2), 225-237. doi:10.3758/Pbr.16.2.225
Royall, R. M. (1986). The Effect of Sample-Size on the Meaning of Significance Tests. American Statistician, 40(4), 313-315. doi:10.2307/2684616