Thursday, July 30, 2015

What’s wrong with p-values, and (why) are Bayes Factors better?

First off, let me say that I have no strong feelings about p-values, and some positive feelings towards Bayes Factors. There are personal reasons for this: During my PhD, p-values set me off on a wild goose chase (see my previous blog post). A few months before my thesis submission, Bayes Factors saved the day by informing me that there was mostly no evidence for the effect I was chasing in the first place. This would justify strong negative feelings towards p-values on my part, but then again, in situations of personal conflict it is always worth taking a step back to consider what went wrong, whether the blame lies at least in part with myself, and what I can learn from the experience for personal growth. This blog post is my attempt to make peace with p-values: my conclusion, so far, is that perhaps they are not to blame for everything, and that there is a lot to learn from my mistakes.
P-values seem to elicit strong feelings in a lot of people. They have received a great deal of negative publicity lately, and they have been blamed for the replicability crisis in psychology (e.g., Cumming, 2014; Halsey, Curran-Everett, Vowler, & Drummond, 2015). The replicability crisis refers to the outcry over the claim that most research findings published in scientific journals are false. Basically, what seems to be happening is that journals prefer to publish papers which report sensational, unexpected effects that are likely to receive a lot of attention from the general public. In contrast, journals prefer not to publish studies that replicate an experiment, especially if these report a null result. The consequences of such an incentive system are clear: the literature gets filled with sensational papers that report type-I errors (i.e., they report an effect when in reality the null hypothesis is true).
Given this explanation, the problem leading to the replication crisis is not with p-values per se, but rather with the way they are used and interpreted. As the main argument of the current blog post, I propose that it is a grave mistake to draw conclusions based on a single p-value (i.e., a single published study). P-values are a frequentist statistic, where the outcome of a single experiment can only be considered in relation to a hypothetical infinite chain of events (or, in practice, a large number of events). Taking the classical coin example, that most psychology majors will remember from their introductory statistics course: if you want to determine whether a coin is fair or not, you toss it 10 000 times to check whether around 50% of the outcomes are heads. Now imagine that you want to determine whether the coin is fair or not, but you only toss it once. What can you conclude about the fairness of the coin? Nothing. You toss it twice, and get a head and a tail. Still, you cannot conclude anything. If you toss it three times and get heads each time, you might raise an eyebrow.
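To put numbers on the coin example, here is a minimal sketch that computes an exact two-sided binomial p-value for the null hypothesis that the coin is fair. The function name `two_sided_p_fair_coin` and the figure of 5100 heads in 10 000 tosses are my own illustrative assumptions, not something taken from a real data set.

```python
from math import comb

def two_sided_p_fair_coin(heads: int, tosses: int) -> float:
    """Exact two-sided binomial p-value for H0: P(heads) = 0.5."""
    observed = comb(tosses, heads)
    # Under a fair coin every toss sequence is equally likely, so an outcome
    # is "at least as extreme" as the observed one when its binomial
    # coefficient is no larger than the observed coefficient.
    extreme = sum(
        c for i in range(tosses + 1) if (c := comb(tosses, i)) <= observed
    )
    return extreme / 2 ** tosses

print(two_sided_p_fair_coin(1, 1))         # one toss, one head
print(two_sided_p_fair_coin(1, 2))         # two tosses, a head and a tail
print(two_sided_p_fair_coin(3, 3))         # three heads in a row -> 0.25
print(two_sided_p_fair_coin(5100, 10000))  # hypothetical large sample
```

Three heads in a row give p = 0.25, so an eyebrow may be raised, but the result is nowhere near the conventional 5% threshold.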
The same holds true for an experiment: If the null hypothesis of a given effect is true, and we run hundreds of experiments to test for the presence of this effect, we would expect, by definition, that around 5% of all p-values will be significant at the 5% level (p < 0.05). If we run a single experiment, and get a p-value (i.e., the probability of obtaining data at least as extreme as those observed if the null hypothesis were true) of 0.03, we can conclude nothing. If we replicate this experiment and get a p-value of 0.3, we still cannot conclude anything. However, if we run a series of experiments and get low p-values in a majority of them, we may raise an eyebrow and consider the possibility that the effect is actually real. Thus, p-values can be interpreted within a chain of events, but it is theoretically impossible to draw conclusions from a single observation.
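The 5% figure can be checked with a small simulation. This is only a sketch under simplifying assumptions of my own: each experiment draws 20 observations from a normal distribution with mean 0 and a known standard deviation of 1, and is analysed with a two-sided z-test. The function names and the number of experiments are hypothetical.

```python
import math
import random

def z_test_p_value(sample):
    """Two-sided p-value for H0: mean = 0, assuming a known SD of 1."""
    z = sum(sample) / math.sqrt(len(sample))
    # 2 * (1 - Phi(|z|)) written with the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))

def fraction_significant(experiments=4000, n=20, alpha=0.05, seed=1):
    """Share of experiments with p < alpha when the null hypothesis is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(experiments):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if z_test_p_value(sample) < alpha:
            hits += 1
    return hits / experiments

print(fraction_significant())  # should land close to 0.05
```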
Some related caveats about p-values are explained very nicely by Schmidt (1992). In a set of experiments, due to sampling error, the observed effect size will vary approximately normally around the true population effect size. If a population effect is real (δ > 0), it is nevertheless possible to obtain an observed effect size of exactly zero (d = 0). Conversely, if the null hypothesis is true, it is possible to obtain a significant p-value and a reasonably-sized observed effect. Therefore, obtaining one significant (p < 0.05) and one non-significant (p > 0.05) result in two experiments tells us very little about whether the effect may be real or not. Even more detrimental is the conclusion that the two p-values are intrinsically different and must come from different populations. Such a result is often interpreted to mean that there must be some unknown moderator determining whether the effect is present or not, when it is likely that both observations simply reflect sampling noise around a single true population parameter value (which may or may not be zero).
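Schmidt's point can be illustrated with a simulation in which every experiment samples from the same population. The numbers here are assumptions chosen for illustration only: a true effect of δ = 0.3, 30 observations per experiment, a known standard deviation of 1, and a two-sided z-test.

```python
import math
import random

def run_experiments(delta=0.3, n=30, experiments=2000, seed=2):
    """Return (observed effect, two-sided p-value) for each simulated experiment."""
    rng = random.Random(seed)
    results = []
    for _ in range(experiments):
        sample = [rng.gauss(delta, 1.0) for _ in range(n)]
        d = sum(sample) / n                   # observed effect (SD is 1)
        z = d * math.sqrt(n)                  # z statistic for H0: mean = 0
        p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
        results.append((d, p))
    return results

results = run_experiments()
effects = [d for d, _ in results]
share_significant = sum(p < 0.05 for _, p in results) / len(results)
print(f"observed effects range from {min(effects):.2f} to {max(effects):.2f}")
print(f"share of significant results: {share_significant:.2f}")
```

Under these assumptions only a minority of experiments reach significance, and some even show an effect in the wrong direction, although there is a single true effect and no moderator anywhere in the simulation.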
So much for p-values (and other frequentist statistics, such as confidence intervals or power calculations): they are only meaningful when numerous studies are available, and if we can be sure that the studies that are available are truly random repeated samples. Given the current publication system, as described above, neither of the two conditions is met: (1) Regardless of whether replications yield significant p-values or not, their publication is discouraged. Therefore, for some effects, there is only one p-value available in the literature. (2) Replications are especially difficult to publish if they yield a non-significant result. This means that positive-result replications are over-represented in the literature. In the worst-case scenario, a “consistent effect” in the literature simply represents the 5% false positives (Rosenthal, 1979). In defence of p-values, then: their distribution can tell us a lot. It can give us information about the presence or absence of an effect, and even about questionable research practices for a given research area (Simonsohn, Nelson, & Simmons, 2014a; 2014b). However, this becomes difficult or impossible in a system where a large proportion of experiment outcomes are never reported because they do not meet an arbitrary criterion, and where conclusions are generally drawn from a single p-value.
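The worst-case scenario can be sketched as a toy simulation of the file drawer. Everything here is an assumption made for illustration: the true effect is exactly zero, each study has 20 observations with a known standard deviation of 1, and a study is "published" only if it is significant in the positive direction.

```python
import math
import random

def simulate_literature(experiments=2000, n=20, alpha=0.05, seed=3):
    """Return (effects of all studies, effects of the published studies)."""
    rng = random.Random(seed)
    all_effects, published = [], []
    for _ in range(experiments):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # true effect is zero
        d = sum(sample) / n
        p = math.erfc(abs(d) * math.sqrt(n) / math.sqrt(2))
        all_effects.append(d)
        if p < alpha and d > 0:  # hypothetical publication filter
            published.append(d)
    return all_effects, published

all_effects, published = simulate_literature()
print(f"mean effect, all studies:       {sum(all_effects) / len(all_effects):.3f}")
print(f"published studies:              {len(published)} of {len(all_effects)}")
print(f"mean effect, published studies: {sum(published) / len(published):.3f}")
```

The full set of studies averages out to roughly zero, while the published subset shows what looks like a consistent, sizeable effect.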
But are Bayes Factors better, and if so, how? Unlike frequentist theorists, Bayesians do not start off with the assumption of a hypothetical infinite chain of events. Instead, one starts off with a prior belief about what kind of effect one might expect. If research on this question is already available, one can use that to guide the expected effect size; otherwise, an educated guess will do, with an adjustment in the prior distribution to reflect a high degree of uncertainty. After the data is collected, the prior belief is combined with the data to yield a posterior distribution. Thus, a Bayesian analysis allows one to update one’s a priori beliefs in light of incoming data. More comprehensive explanations of the Bayesian framework are available in Dienes (2011) and van de Schoot et al. (2014).
Back to the coin example: if a Bayesian wants to check whether a coin is fair, she starts off with an a priori belief. If there is no reason to believe that the coin is rigged, the belief would be that in a coin toss, heads and tails are equally likely outcomes. Say the first two coin tosses provide two heads. While this is not overly strong evidence against this prior belief, she might be somewhat inclined to consider the possibility that the coin is biased towards heads. If the third coin toss provides tails, the degree of belief shifts back towards the “unbiased”-hypothesis. Thus, each consecutive coin toss allows the Bayesian to update her belief about the coin. The more data is available, the more confident the Bayesian will be that her belief is correct.
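One common way to formalise this updating is a Beta prior on the probability of heads, which each toss updates by simple counting. The sketch below assumes a Beta(10, 10) prior to stand in for "probably fair, but not certain"; that choice and the function names are mine, and a different prior would shift the numbers.

```python
def update_beta(belief, toss):
    """Update a Beta(a, b) belief about P(heads) after one toss ('H' or 'T')."""
    a, b = belief
    return (a + 1, b) if toss == "H" else (a, b + 1)

def posterior_mean(belief):
    """Expected probability of heads under a Beta(a, b) belief."""
    a, b = belief
    return a / (a + b)

belief = (10, 10)  # assumed prior, centred on a fair coin
for toss in "HHT":
    belief = update_beta(belief, toss)
    print(toss, belief, round(posterior_mean(belief), 3))
# H (11, 10) 0.524
# H (12, 10) 0.545
# T (12, 11) 0.522
```

The expected probability of heads drifts upwards after the two heads and moves back towards 0.5 after the tail, which is the pattern described above.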
If a large amount of unbiased data is available, a frequentist and a Bayesian studying the same question would probably always converge in their conclusions. This is a big “if”, though, that does not seem to be met in the available published literature. As such, a Bayesian approach might be more suited to drawing conclusions based on a small amount of available data: even though it would be optimal to have as much data as possible within any theoretical framework, a Bayesian approach, but not a frequentist one, allows for conclusions based on few studies. It is important to bear in mind that, in a Bayesian framework, there is a large degree of uncertainty in one’s beliefs if the data on which they are based are sparse. Furthermore, Bayesian analyses are not immune to publication bias: if one considers only papers that report type-I errors, one is less likely to arrive at the conclusion that the null-hypothesis is true, even if this were evident from a set of experiments using truly random samples.
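For the coin, a Bayes Factor can be written down in closed form, which shows how little sparse data can say. This sketch compares H0 (the coin is exactly fair) against an alternative H1 in which every bias between 0 and 1 is equally plausible a priori. The uniform prior is an assumption made purely for simplicity; under it, the probability of observing k heads in n tosses is 1/(n + 1) for every k.

```python
from math import comb

def bf01_fair_coin(heads: int, tosses: int) -> float:
    """Bayes Factor for H0 (fair coin) over H1 (uniform prior on the bias).

    P(data | H0) = C(n, k) / 2**n and P(data | H1) = 1 / (n + 1),
    so the ratio is (n + 1) * C(n, k) / 2**n.
    """
    return (tosses + 1) * comb(tosses, heads) / 2 ** tosses

print(bf01_fair_coin(3, 3))     # 0.5
print(bf01_fair_coin(10, 10))   # ten heads in a row
print(bf01_fair_coin(50, 100))  # an even split in 100 tosses
```

Three heads in three tosses give a Bayes Factor of 0.5, meaning the data are only twice as likely under a biased coin as under a fair one, so the beliefs should barely move. An even split in 100 tosses, in contrast, yields a factor of about 8 in favour of the fair coin, which is something a non-significant p-value alone cannot express.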
In conclusion, it seems that the main problem with p-values is not intrinsic to p-values per se, but rather lies in the data that is available in the literature, and in the common practice of drawing conclusions based on a single study. The former may be a consequence of the latter: if researchers believe that the p-value of a single study can provide a convincing answer to the question "Is there an effect?", there indeed seems to be very little use in publishing replications. However, we know that this is not true. Furthermore, regardless of the theoretical framework that one adopts for statistical inference, it is always necessary to have multiple studies in order to be confident about the conclusions that we draw.

Cumming, G. (2014). The New Statistics: Why and How. Psychological Science, 25(1), 7-29. doi: 10.1177/0956797613504966
Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on? Perspectives on Psychological Science, 6(3), 274-290.
Halsey, L. G., Curran-Everett, D., Vowler, S. L., & Drummond, G. B. (2015). The fickle P value generates irreproducible results. Nature Methods, 12(3), 179-185. doi: 10.1038/nmeth.3288
Rosenthal, R. (1979). The "File Drawer Problem" and Tolerance for Null Results. Psychological Bulletin, 86(3), 638-641.
Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47(10), 1173.
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014a). p-Curve and Effect Size: Correcting for Publication Bias Using Only Significant Results. Perspectives on Psychological Science, 9(6), 666-681. doi: 10.1177/1745691614553988
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014b). P-curve: a key to the file-drawer. J Exp Psychol Gen, 143(2), 534-547. doi: 10.1037/a0033242
van de Schoot, R., Denissen, J., Neyer, F., Kaplan, D., Asendorpf, J., & van Aken, M. (2014). A gentle introduction to Bayesian analysis: Applications to developmental research. Child Development, 85(3), 842-860. doi: 10.1111/cdev.12169

** Disclaimer: I am far from an expert on either frequentist or Bayesian statistics. I would welcome any corrections or discussions! **


  1. Thanks for posting this. I do have one comment, and it may be that I've misread you. The idea that a single P value is a single datum and can never tell you anything seems odd to me - it seems to ignore the sample size on which that P-value was based. The "problem" with the single P=0.03 in the example isn't that it's a single P-value; it's that its value is in the region where we need to think carefully about the rate and consequences of false positives. If I flipped a coin 100,000 times and got a P-value of 0.0000001 for a test that it was fair, would you keep betting on it at 1:1 because there was only a single datum and we couldn't conclude anything about its fairness? I'll be waiting to take your bet :-)

    P values come under a lot of unfair attack, as you acknowledge in your post - it's mostly not about P values, but about the way they are used and (behind that) some surprisingly frequent misunderstandings about what they do. More about this here:

  2. This comment has been removed by the author.

  3. Dear Stephen,
    Thanks for your comment, and for the interesting link!
    I was trying to be provocative with that statement. :) I would be reluctant to bet on a coin being fair after 100000 tosses and with a tiny p-value. Having a large n-size does tell us something, but I think it's fair to say that this is not how p-values are actually used by many researchers in psychology. Often, conclusions about the presence or absence of an effect are drawn only from the p-value of a single underpowered study, which is kind of like drawing conclusions about the fairness of a coin after tossing it once.