Friday, October 30, 2015

How to p-hack

There are two types of psychological researchers: those who are acutely aware of what is widely referred to as the replication crisis, and those who have merely heard about it. The crux of this crisis is simple: it is becoming increasingly obvious that some proportion of the published literature reports empirical results that cannot be reproduced and that are likely to be false. The implication is clear: if scientists believe in results that are false, they will build theories on erroneous assumptions, which in turn has detrimental effects when these theories are applied to practical domains (e.g., treatment of psychological disorders, evidence-based teaching methods, hiring decisions by organisational psychologists). Clearly, this is an outcome that needs to be avoided at all costs.
Luckily, it is relatively clear what needs to be done to improve the way science works. Three issues that drive the crisis are generally discussed: (1) Publication bias (e.g., Ioannidis, 2005; Rosenthal, 1979; Simonsohn, Nelson, & Simmons, 2014), (2) underpowered studies (e.g., Button et al., 2013; Christley, 2010; Cohen, 1962; Royall, 1986), and (3) questionable research practices (e.g., John, Loewenstein, & Prelec, 2012; Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012). If all three of these issues were addressed, the replication crisis would be – if not eliminated altogether – greatly reduced. Addressing Point (1) requires changes in the incentive system of academia (see, e.g., Lakens & Evers, 2014, for a discussion) and in the way journals decide whether or not to publish a given submission. While this is possibly the most important issue to solve, there is no easy set of routine steps an individual researcher can take to achieve this desired end state. Point (2) requires researchers to think carefully about the expected or minimal practically meaningful effect size and to determine, a priori, the number of participants needed to draw meaningful conclusions about the presence or absence of such an effect. While this is not too tricky conceptually, it often requires sample sizes that appear unrealistic to anyone who is used to testing 16 participants per condition. In some instances – for example, when working with a clinical population – collecting a large number of participants can indeed be unrealistic. Thus, addressing this issue also requires some deeper changes in the research structure, such as encouraging collaborations between many labs to share the work of collecting sufficient data, and making the raw data of all smaller studies available (regardless of their outcome) to allow for meta-analyses across them.
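To make the "a priori" part of Point (2) concrete, here is a minimal sketch in R using the built-in power.t.test function; the smallest effect size of interest (0.3 standard deviations) and the target power of 80% are hypothetical values chosen purely for illustration.

# How many participants per group does a two-sample t-test need in order
# to detect a (hypothetical) smallest effect of interest of 0.3 SDs
# with 80% power at the conventional alpha of .05?
power.t.test(delta = 0.3, sd = 1, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")
# The answer is roughly 175 participants per group -- a far cry from
# the 16 per condition that many of us were trained on.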
In this blog post, I focus on Point (3). The prevalence of questionable research practices is the easiest issue to address, in the sense that each individual researcher can contribute to resolving it. In practical terms, this requires spreading awareness of how not to p-hack. The term (and concept of) p-hacking is something I became aware of relatively late during my PhD. At the department, we had a workshop by Zoltan Dienes, who talked, among other things, about optional stopping. For the first time, I heard that it was not good practice to collect some data, look at the results, and then decide whether or not to collect more data. There were many surprised reactions to this bit of information, including from senior researchers, showing that researchers in general are not necessarily aware of what constitutes questionable research practices.
In the majority of cases, questionable research practices, such as optional stopping in a frequentist framework, are not a result of malevolence or dishonesty, but rather stem from a lack of awareness. Beyond the reactions at this workshop, I have encountered many other examples of senior researchers who did not seem to be aware that some common research practices are considered questionable. As some additional examples, here is some advice that I got from various researchers throughout my studies:

At a student pre-conference workshop, a senior professor said: “Always check your data after you’ve tested, say, five participants. If there is not even a hint of an effect, there’s no point wasting your time by collecting more data.”

In response to some data which did not show a hypothesised significant correlation between two variables, another senior scientist advised to remove some “obvious outliers” and look at a subgroup of participants, because “clearly, there’s something there!”

Another senior researcher told me: “For any effect that you want to test, repeat the analysis with at least five different trimming methods and dependent variable transformations, and only believe in the effect if the results of all analysis methods converge.”

All of these researchers genuinely wanted to give good and helpful advice to a student. However, all of these practices could be considered questionable. Again, this demonstrates the lack of awareness of what counts as good research practice and what doesn't. Addressing the issue of questionable research practices, therefore, would require, first and foremost, spreading awareness of what constitutes p-hacking. To this end, I attempt to make a list of common practices that are not acceptable in a frequentist framework:

(1)  Optional stopping: Collect some data, and then decide whether or not more data should be collected based on the obtained results (see the simulation sketch just after this list). Note that things change if the multiple analyses are planned in advance and the alpha-level is adjusted accordingly, as described, e.g., in this blog post by Daniël Lakens.

(2)  Creative trimming: If the expected effect is not significant, have a closer look at the individual subject data (and/or item-level data, in a psycholinguistic experiment), and exclude any subjects (and/or items) that could be considered outliers, until the required effect becomes significant.

(3)  Adding covariates: If the expected effect is not significant, perhaps there are some confounding variables which we can include as covariates in various constellations until the effect becomes significant?

(4)  Unplanned follow-up tests: When an independent variable has more than two levels and the overall main effect is not significant, do follow-up tests anyway, to see if the effect emerges at any level.

(5)  Looking at subgroups: If the expected effect is not significant, maybe it shows up only in women? Or in participants aged 40 and over? Or those with particularly fast reaction times?

(6)  Cramming as many independent variables as possible into a multi-way ANOVA: The more main effects and interactions you test, the greater the chance that at least one of them will be statistically significant.
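To illustrate why Point (1) in particular is a problem, here is a minimal simulation sketch in R; the batch size of 20 participants, the single interim peek, and the 10,000 simulated experiments are arbitrary choices for illustration. Even though the null hypothesis is true in every simulated experiment, peeking once and topping up only the "non-significant" studies pushes the false positive rate well above the nominal 5%.

set.seed(1)
n_sim <- 10000
false_pos <- replicate(n_sim, {
  x <- rnorm(20)                    # first batch of data; no true effect
  if (t.test(x)$p.value < .05) {
    TRUE                            # "significant" after peeking: stop and report
  } else {
    x <- c(x, rnorm(20))            # otherwise collect 20 more participants
    t.test(x)$p.value < .05         # and run the test again on all 40
  }
})
mean(false_pos)                     # comes out around .08 instead of .05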

All of these issues are related to the problem of multiple testing. In the frequentist framework, multiple testing is a problem because the p-value does not do what many people seem to think it does: it does not tell us how likely it is that an effect exists. Instead, it tells us the probability of obtaining data at least as extreme as those observed, if we assume that there is no effect. The usefulness of a p-value, therefore, lies not in telling you whether or not there is an effect, but in serving as a tool to ensure that, in the long run, your Type-I error rate (i.e., the probability of erroneously concluding that there is an effect) is no greater than the alpha-level (i.e., 5% for a cut-off of p < 0.05). If one calculates multiple p-values from the same data, the probability that at least one of them is smaller than 0.05 increases. This compromises exactly that desirable property: with multiple comparisons, the long-run probability of a false positive under the null is no longer capped at 5%, but increases substantially. This is described in detail in multiple publications (e.g., Cramer et al., 2015; Dienes, 2011; Forstmeier & Schielzeth, 2011) and blog posts (e.g., by Hilda Bastian and Dorothy Bishop).
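To put a number on this, assume for simplicity that we run k independent tests and that the null hypothesis is true for all of them; the chance of at least one p-value below .05 is then 1 - 0.95^k, which grows quickly with k:

k <- 1:6
round(1 - (1 - 0.05)^k, 2)
# [1] 0.05 0.10 0.14 0.19 0.23 0.26
# With six independent tests of true null hypotheses, the chance of at
# least one "significant" result is already about one in four.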

What can we do?
If we want to avoid questionable research practices (as we should), it is important to consider alternatives that allow us to do sound and honest analyses. There are three possibilities, which have been discussed in the literature:

(1)  Alpha-level corrections (e.g., see Cramer et al., 2015). A short example is sketched after this list.

(2)  Pre-registration (e.g., Cramer et al., 2015; Wagenmakers et al., 2012): Before collecting data, a pre-registration report can be uploaded, e.g., on the Open Science Framework (OSF) website (osf.io). The OSF website allows you to create a non-modifiable record of your report, which can be submitted along with the paper. Such a pre-registration report should contain details about the independent and dependent variables, the comparison of interest (including an explicit hypothesis), the number of participants that will be tested, and the trimming procedure. Any post-hoc decisions (e.g., about including an unexpected confound in the final analyses) can still be added, assuming there is a sound justification, because it will be clear both to the authors and readers that this was a post-hoc decision. Pre-registration is possibly the quickest and most efficient way to do good science – not to mention that deciding on these analysis issues a priori will save a lot of time during the analyses!

(3)  Changing to Bayesian analysis methods (e.g., Dienes, 2011; Rouder, 2014; Wagenmakers, 2007): For reasons explained in these papers, Bayesian analyses are not susceptible to multiple testing problems. The Bayesian philosophical approach is different from that of the frequentists: here, we are not concerned with long-term error rates, but rather about quantifying the degree of support for one hypothesis over another (which, incidentally, seems to be what some researchers think they are doing when they are calculating a p-value). Bayes Factors scare some researchers because they are more difficult to calculate than p-values, but this is changing. R users can download the BayesFactor package, which allows them to do Bayesian model comparisons using the same models they would use in a frequentist analysis (ANOVAs, LMEs, regressions). SPSS users can use the newly developed software JASP, which is user-friendly and similar in layout to SPSS – and it’s open source!
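Two of these options can be illustrated with a few lines of R. For option (1), base R already ships with the p.adjust function, which applies Bonferroni, Holm, and related corrections to a set of p-values; the four p-values below are made up for illustration.

p <- c(0.008, 0.03, 0.04, 0.20)     # hypothetical raw p-values from four tests
p.adjust(p, method = "bonferroni")  # 0.032 0.120 0.160 0.800
p.adjust(p, method = "holm")        # 0.032 0.090 0.090 0.200

After either correction, only the smallest of these p-values survives the conventional .05 cut-off.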
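And for option (3), the BayesFactor package mentioned above reduces a Bayesian two-sample comparison to a single call; the simulated data and the package's default prior on effect size are placeholders, not recommendations.

library(BayesFactor)
set.seed(1)
group_a <- rnorm(40, mean = 0)      # simulated control group
group_b <- rnorm(40, mean = 0.5)    # simulated group with a moderate effect
ttestBF(x = group_a, y = group_b)   # Bayes factor for a group difference
                                    # relative to the null of no difference

The resulting Bayes factor quantifies how much more likely the data are under the alternative than under the null, rather than controlling a long-run error rate.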

Conclusion
Thanks to some excellent work by various researchers (see the reference list below), we have a pretty good idea of what needs to be done to address the replication crisis. While two of the underlying problems require deeper changes to the way in which research is conducted and disseminated – namely publication bias and the practice of running underpowered studies – the prevalence of questionable research practices can (and should) be addressed by each individual researcher. As my anecdotes above suggest, some researchers are genuinely unaware of the drawbacks of multiple testing in a frequentist framework – although this is, of course, anecdotal evidence, and it is hopefully changing with the recent publicity surrounding the replication crisis. My guess is that once researchers become aware of which practices are considered questionable, and learn that there are viable alternatives, the majority will start to use honest and sound methods, because it's a win-win situation: overall, science will become more reproducible, and researchers will have a warm, fuzzy feeling, knowing that they can be more confident in their exciting new findings.

References

Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafo, M. R. (2013). Confidence and precision increase with high statistical power. Nature Reviews Neuroscience, 14(8). doi:10.1038/nrn3475-c4
Christley, R. (2010). Power and error: increased risk of false positive results in underpowered studies. Open Epidemiology Journal, 3, 16-19.
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65(3), 145-153.
Cramer, A. O., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R. P., . . . Wagenmakers, E.-J. (2015). Hidden multiplicity in exploratory multiway ANOVA: Prevalence and remedies. Psychonomic Bulletin & Review, 1-8.
Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on? Perspectives on Psychological Science, 6(3), 274-290.
Forstmeier, W., & Schielzeth, H. (2011). Cryptic multiple hypotheses testing in linear models: overestimated effect sizes and the winner's curse. Behavioral Ecology and Sociobiology, 65(1), 47-55.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), 696-701. doi:10.1371/journal.pmed.0020124
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524-532.
Lakens, D., & Evers, E. R. (2014). Sailing from the seas of chaos into the corridor of stability: Practical recommendations to increase the informational value of studies. Perspectives on Psychological Science, 9(3), 278-292.
Rosenthal, R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86(3), 638-641.
Rouder, J. N. (2014). Optional stopping: No problem for Bayesians. Psychonomic Bulletin & Review, 21(2), 301-308.
Royall, R. M. (1986). The effect of sample size on the meaning of significance tests. American Statistician, 40(4), 313-315. doi:10.2307/2684616
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: a key to the file-drawer. Journal of Experimental Psychology: General, 143(2), 534-547. doi:10.1037/a0033242
Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779-804.
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. Perspectives on Psychological Science, 7(6), 632-638.