First off, let me say that I have no strong feelings about p-values, and some positive feelings
towards Bayes Factors. There are personal reasons for this: During my PhD, p-values set me off on a wild goose chase
(see my previous blog post). A few months before my thesis submission, Bayes
Factors saved the day by informing me that there was mostly no evidence for the
effect I was chasing in the first place. This would justify strong negative feelings
towards p-values on my part. Then
again, in situations of personal conflict it is always worth taking a step back
to consider what went wrong, whether the blame lies at least in part with
myself, and what I can learn from the experience for personal growth. This blog
post is my attempt to make peace with p-values:
my conclusion, so far, is that perhaps they are not to blame for everything,
and that there is a lot to learn from my mistakes.
P-values seem to elicit strong feelings in many people. They have received a
great deal of negative publicity lately, and have been blamed for the
replicability crisis in psychology (e.g., Cumming, 2014; Halsey, Curran-Everett, Vowler, & Drummond,
2015).
The replicability crisis refers to the outcry that most research findings
published in scientific journals are false. Basically, what seems to be
happening is that journals prefer to publish papers that report sensational,
unexpected effects, which are likely to receive a lot of attention from the
general public. In contrast, journals prefer not to publish studies that
replicate an experiment, especially if these report a null result. The consequences
of such an incentive system are clear: the literature gets filled with
sensational papers that report type-I errors (i.e., reporting the presence of an effect
when in reality the null hypothesis is true).
Given this explanation, the problem leading to the
replication crisis is not with p-values
per se, but rather with the way they
are used and interpreted. As the main argument of the current blog post, I propose that it is a grave mistake to draw conclusions
based on a single p-value (i.e., a
single published study). P-values are frequentist statistics: the outcome of a single experiment can only be
considered in relation to a hypothetical infinite chain of events (or, in
practice, a large number of events). Take the classical coin example, which
most psychology majors will remember from their introductory statistics course:
if you want to determine whether a coin is fair or not, you toss it 10 000
times to check whether around 50% of the outcomes are heads. Now imagine that
you want to determine whether the coin is fair or not, but you only toss it
once. What can you conclude about the fairness of the coin? Nothing. You toss
it twice, and get a head and a tail. Still, you cannot conclude anything. If
you toss it three times and get heads each time, you might raise an eyebrow.
The same holds true for an experiment: If the null
hypothesis of a given effect is true, and we run hundreds of experiments to
test for the presence of this effect, we would expect, by definition, that
around 5% of all p-values will be
significant at the 5% level (p < 0.05). If we run a single experiment and get a p-value (i.e., the probability of obtaining
data at least as extreme as ours if the null hypothesis were true) of 0.03, we can conclude nothing. If we
replicate this experiment and get a p-value
of 0.3, we still cannot conclude anything. However, if we run a series of
experiments and get low p-values in a
majority of them, we may raise an eyebrow and consider the possibility that the
effect is actually real. Thus, p-values
can be interpreted within a chain of events, but it is theoretically impossible
to draw conclusions from a single observation.
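To make this concrete, here is a small simulation sketch in Python (my own toy example, not taken from any of the papers cited here), with arbitrary numbers: 1000 hypothetical experiments, 20 participants per group, and a two-group t-test standing in for "an experiment". When the null hypothesis is true, roughly 5% of the chain comes out significant at the 5% level, and that 5% is a property of the chain, not of any single p-value in it.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n_experiments = 1000   # length of the hypothetical chain of experiments
    n_per_group = 20       # participants per group in each experiment (arbitrary)

    p_values = []
    for _ in range(n_experiments):
        # Both groups are drawn from the SAME population, so the null hypothesis is true.
        group_a = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
        group_b = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
        p_values.append(stats.ttest_ind(group_a, group_b).pvalue)

    proportion_significant = np.mean(np.array(p_values) < 0.05)
    print(f"Proportion of p < .05 under the null: {proportion_significant:.3f}")
    # Expected: close to 0.05. The false-positive rate describes the long-run
    # chain of experiments, not any single p-value within it.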
Some related caveats about p-values are explained very nicely by Schmidt (1992).
In a set of experiments, due to sampling error, the observed effect size will
vary around the true population effect size, following an approximately normal distribution. If a
population effect is real (δ > 0), it is nevertheless possible to obtain an
observed effect size of exactly zero (d = 0). Conversely, if the null
hypothesis is true, it is possible to obtain a significant p-value and a reasonably-sized observed effect. Therefore,
obtaining one significant (p < 0.05)
and one non-significant (p > 0.05)
result in two experiments tells us very little about whether the effect may be
real or not. Even more detrimental is the conclusion that the two p-values are intrinsically different and
must come from different populations. Such a result is often interpreted to
mean that there must be some unknown moderator determining whether the effect
is present or not, when it is likely that both observations simply reflect
sampling noise around a single true population parameter value (which may or
may not be zero).
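To illustrate Schmidt's point, here is another small sketch (my own illustration, not Schmidt's code, with an assumed true effect of delta = 0.3 and an assumed 30 participants per group): observed effect sizes scatter widely around the true value, so one significant and one non-significant study can easily come from the very same population.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    delta = 0.3          # assumed true standardized population effect
    n_per_group = 30     # assumed sample size per group
    n_studies = 1000

    observed_d, significant = [], []
    for _ in range(n_studies):
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(delta, 1.0, n_per_group)
        # Simple standardized effect size for this study (Cohen's d with pooled SD).
        pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
        observed_d.append((treatment.mean() - control.mean()) / pooled_sd)
        significant.append(stats.ttest_ind(treatment, control).pvalue < 0.05)

    observed_d = np.array(observed_d)
    print(f"True delta: {delta}, mean observed d: {observed_d.mean():.2f}")
    print(f"Range of observed d: {observed_d.min():.2f} to {observed_d.max():.2f}")
    print(f"Proportion of significant studies: {np.mean(significant):.2f}")
    # With 30 participants per group, power is well below 1, so a mix of
    # significant and non-significant replications is exactly what sampling
    # error predicts, even though every study samples the same population.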
So much for p-values
(and other frequentist statistics, such as confidence intervals or power
calculations): they are only meaningful when numerous studies are available,
and if we can be sure that the studies that are available are truly random
repeated samples. Given the current publication system, as described above,
neither of the two conditions is met: (1) Regardless of whether replications
yield significant p-values or not,
their publication is discouraged. Therefore, for some effects, there is only
one p-value available in the
literature. (2) Replications are especially difficult to publish if they yield
a non-significant result. This means that positive-result replications are
over-represented in the literature. In the worst-case scenario, a “consistent
effect” in the literature simply represents the 5% false positives (Rosenthal, 1979). In the defence of p-values, then: their distribution can tell us a lot. It can give us information about the presence
or absence of an effect, and even about questionable research practices for a given research area (Simonsohn, Nelson, & Simmons, 2014a; 2014b).
However, this becomes difficult or impossible in a system where a large
proportion of experiment outcomes are never reported because they do not meet
an arbitrary criterion, and where conclusions are generally drawn from a single
p-value.
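To give a rough idea of how the distribution of p-values can be informative, here is a toy version of the intuition behind p-curve (my own sketch, not Simonsohn and colleagues' actual method, and with made-up simulation settings): among significant results only, p-values pile up close to zero when a real effect exists, but are roughly flat when the null hypothesis is true.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)

    def significant_p_values(delta, n_studies=5000, n_per_group=30):
        """Return only the p-values below .05 from simulated two-group studies."""
        ps = []
        for _ in range(n_studies):
            a = rng.normal(0.0, 1.0, n_per_group)
            b = rng.normal(delta, 1.0, n_per_group)
            p = stats.ttest_ind(a, b).pvalue
            if p < 0.05:
                ps.append(p)
        return np.array(ps)

    for label, delta in [("null effect (delta = 0)", 0.0),
                         ("real effect (delta = 0.5)", 0.5)]:
        ps = significant_p_values(delta)
        counts, _ = np.histogram(ps, bins=[0, 0.01, 0.02, 0.03, 0.04, 0.05])
        print(label, "-> counts in bins .00-.01 through .04-.05:", counts)
    # A right-skewed curve (most significant p-values below .01) suggests
    # evidential value; a flat curve is what a pile of false positives looks like.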
But are Bayes Factors better, and if so, how? Unlike
frequentists, Bayesians do not start off with the assumption of a
hypothetical infinite chain of events. Instead, one starts off with a prior
belief about what kind of effect one might expect. If research on this question
is already available, one can use that to guide the expected effect size;
otherwise, an educated guess will do, with an adjustment in the prior
distribution to reflect a high degree of uncertainty. After the data is
collected, the prior belief is combined with the data to yield a posterior
distribution. Thus, a Bayesian analysis allows one to update one’s a priori beliefs in light of incoming
data. More comprehensive explanations of the Bayesian framework are available
in Dienes (2011)
and van de Schoot et al. (2014).
Back to the coin example: if a Bayesian wants to check
whether a coin is fair, she starts off with an a priori belief. If there is no reason to believe that the coin is
rigged, the belief would be that in a coin toss, heads and tails are equally
likely outcomes. Say the first two coin tosses provide two heads. While this is
not overly strong evidence against this prior belief, she might be somewhat
inclined to consider the possibility that the coin is biased towards heads. If the
third coin toss comes up tails, the degree of belief shifts back towards the "unbiased" hypothesis.
Thus, each consecutive coin toss allows the Bayesian to update her belief about
the coin. The more data is available, the more confident the Bayesian will be
that her belief is correct.
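For the curious, this updating process can be written down in a few lines. The sketch below uses a Beta-Binomial model with a uniform Beta(1, 1) prior, which is my own choice for illustration rather than the only reasonable prior, to track the belief about the probability of heads after each of the three tosses.

    from scipy import stats

    alpha, beta = 1.0, 1.0          # uniform prior: no reason to suspect bias
    tosses = ["H", "H", "T"]        # the three tosses from the example above

    for i, outcome in enumerate(tosses, start=1):
        # Conjugate update: each head adds 1 to alpha, each tail adds 1 to beta.
        if outcome == "H":
            alpha += 1
        else:
            beta += 1
        posterior = stats.beta(alpha, beta)
        print(f"After toss {i} ({outcome}): "
              f"P(heads) estimate = {posterior.mean():.2f}, "
              f"95% credible interval = "
              f"({posterior.ppf(0.025):.2f}, {posterior.ppf(0.975):.2f})")
    # After two heads the posterior leans towards "biased"; after the third toss
    # (tails) it shifts back towards fairness. The wide credible intervals also
    # make clear how little three tosses can tell us.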
If a large amount of unbiased data is available, a
frequentist and a Bayesian studying the same question would probably always
converge in their conclusions. This is a big “if”, though, that does not seem
to be met in the available published literature. As such, a Bayesian approach
might be more suited to drawing conclusions based on a small amount of available data: even
though it would be optimal to have as much data as possible within any
theoretical framework, a Bayesian, but not a frequentist, approach allows for conclusions based on
a few studies. It is important to bear in mind that, in a Bayesian
framework, there is a large degree of uncertainty in one's beliefs if the data
on which they are based are sparse. Furthermore, Bayesian analyses are not immune to publication bias: if one considers only papers that report type-I errors, one is less likely to arrive at the conclusion that the null hypothesis is true, even if this would be evident from a set of experiments using truly random samples.
In conclusion, it seems that the main problem with p-values is not intrinsic to p-values per se, but rather lies in the data that are available in the literature,
and in the common practice of drawing conclusions based on a single study. The former may be a consequence of the latter: if researchers believe that the p-value of a single study can provide a convincing answer to the question "Is there an effect?", there indeed seems to be very little use in publishing replications. However, we know that this is not true. Furthermore, regardless of the theoretical framework that one adopts for statistical inference, it is always necessary to have multiple studies in order to be confident about the conclusions that we draw.
References
Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7-29. doi: 10.1177/0956797613504966
Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on? Perspectives on Psychological Science, 6(3), 274-290.
Halsey, L. G., Curran-Everett, D., Vowler, S. L., & Drummond, G. B. (2015). The fickle P value generates irreproducible results. Nature Methods, 12(3), 179-185. doi: 10.1038/nmeth.3288
Rosenthal, R. (1979). The "file drawer problem" and tolerance for null results. Psychological Bulletin, 86(3), 638-641.
Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47(10), 1173.
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014a). P-curve and effect size: Correcting for publication bias using only significant results. Perspectives on Psychological Science, 9(6), 666-681. doi: 10.1177/1745691614553988
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014b). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143(2), 534-547. doi: 10.1037/a0033242
van de Schoot, R., Denissen, J., Neyer, F., Kaplan, D., Asendorpf, J., & van Aken, M. (2014). A gentle introduction to Bayesian analysis: Applications to developmental research. Child Development, 85(3), 842-860. doi: 10.1111/cdev.12169
** Disclaimer: I am far from an expert on either frequentist or Bayesian statistics. I would welcome any corrections or discussions! **
Thanks for posting this. I do have one comment, and it may be that I've misread you. The idea that a single P value is a single datum and can never tell you anything seems odd to me - it seems to ignore the sample size on which that P-value was based. The "problem" with the single P=0.03 in the example isn't that it's a single P-value; it's that its value is in the region where we need to think carefully about the rate and consequences of false positives. If I flipped a coin 100,000 times and got a P-value of 0.0000001 for a test that it was fair, would you keep betting on it at 1:1 because there was only a single datum and we couldn't conclude anything about its fairness? I'll be waiting to take your bet :-)
P values come under a lot of unfair attack, as you acknowledge in your post - it's mostly not about P values, but about the way they are used and (behind that) some surprisingly frequent misunderstandings about what they do. More about this here: http://wp.me/p5x2kS-Y.
Dear Stephen,
Thanks for your comment, and for the interesting link!
I was trying to be provocative with that statement. :) I would be reluctant to bet on a coin being fair after 100,000 tosses and with a tiny p-value. Having a large sample size does tell us something, but I think it's fair to say that this is not how p-values are actually used by many researchers in psychology. Often, conclusions about the presence or absence of an effect are drawn only from the p-value of a single underpowered study, which is kind of like drawing conclusions about the fairness of a coin after tossing it once.