Thursday, July 30, 2015

What’s wrong with p-values, and (why) are Bayes Factors better?

First off, let me say that I have no strong feelings about p-values, and some positive feelings towards Bayes Factors. There are personal reasons for this: During my PhD, p-values set me off on a wild goose chase (see my previous blog post). A few months before my thesis submission, Bayes Factors saved the day by informing me that there was mostly no evidence for the effect I was chasing in the first place. This would justify strong negative feelings towards p-values on my part, but then again, in situations of personal conflict it is always worth taking a step back to consider what went wrong, whether the blame lies at least in part with myself, and what I can learn from the experience for personal growth. This blog post is my attempt to make peace with p-values: my conclusion, so far, is that perhaps they are not to blame for everything, and that there is a lot to learn from my mistakes.
P-values seem to elicit strong feelings in a lot of people. They have received a great deal of negative publicity lately, and they have been blamed for the replicability crisis in psychology (e.g., Cumming, 2014; Halsey, Curran-Everett, Vowler, & Drummond, 2015). The replicability crisis refers to the concern that most research findings published in scientific journals are false. Basically, what seems to be happening is that journals prefer to publish papers which report sensational, unexpected effects that are likely to receive a lot of attention from the general public. In contrast, journals prefer not to publish studies that replicate an experiment, especially if these report a null result. The consequences of such an incentive system are clear: the literature gets filled with sensational papers that report type-I errors (i.e., the presence of an effect when in reality the null hypothesis is true).
Given this explanation, the problem leading to the replication crisis is not with p-values per se, but rather with the way they are used and interpreted. As the main argument of the current blog post, I propose that it is a grave mistake to draw conclusions based on a single p-value (i.e., a single published study). P-values are a frequentist statistic, where the outcome of a single experiment can only be considered in relation to a hypothetical infinite chain of events (or, in practice, a large number of events). Take the classical coin example, which most psychology majors will remember from their introductory statistics course: if you want to determine whether a coin is fair, you toss it 10,000 times and check whether around 50% of the outcomes are heads. Now imagine that you want to determine whether the coin is fair, but you only toss it once. What can you conclude about the fairness of the coin? Nothing. You toss it twice, and get a head and a tail. Still, you cannot conclude anything. If you toss it three times and get heads each time, you might raise an eyebrow.
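To put rough numbers on this intuition, here is a minimal sketch in Python (the toss counts are arbitrary choices of mine for illustration, and it assumes scipy version 1.7 or later for binomtest):

# How much can a handful of coin tosses tell us about fairness?
# Requires scipy >= 1.7 for scipy.stats.binomtest.
from scipy.stats import binomtest

# (heads, total tosses): a few single tosses versus a long run of 10,000
for heads, tosses in [(1, 1), (2, 2), (3, 3), (5200, 10000)]:
    result = binomtest(heads, tosses, p=0.5)  # two-sided test against a fair coin
    print(f"{heads:>5} heads in {tosses:>6} tosses: p = {result.pvalue:.4f}")

Three heads out of three tosses gives a two-sided p-value of 0.25 (eyebrow-raising at best), whereas even a modest imbalance of 5,200 heads over 10,000 tosses yields a p-value far below 0.05.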
The same holds true for an experiment: if the null hypothesis of a given effect is true, and we run hundreds of experiments to test for the presence of this effect, we would expect, by definition, that around 5% of all p-values will be significant at the 5% level (p < 0.05). If we run a single experiment and get a p-value of 0.03 (i.e., a 3% probability of obtaining data at least as extreme as ours if the null hypothesis were true), we can conclude nothing. If we replicate this experiment and get a p-value of 0.3, we still cannot conclude anything. However, if we run a series of experiments and get low p-values in a majority of them, we may raise an eyebrow and consider the possibility that the effect is actually real. Thus, p-values can be interpreted within a chain of events, but it is theoretically impossible to draw conclusions from a single observation.
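This, too, is easy to see with a small simulation. The sketch below (sample sizes and number of experiments are arbitrary assumptions of mine) draws both "groups" of each experiment from the same population, so the null hypothesis is true by construction:

# Distribution of p-values when the null hypothesis is true:
# both groups come from the same population, so any "effect" is pure noise.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_experiments, n_per_group = 10_000, 30

p_values = np.array([
    ttest_ind(rng.normal(0, 1, n_per_group), rng.normal(0, 1, n_per_group)).pvalue
    for _ in range(n_experiments)
])

# Under the null, p-values are uniformly distributed, so roughly 5% land below .05 ...
print(f"Proportion of p < .05: {np.mean(p_values < .05):.3f}")
# ... but any single draw from p_values, taken on its own, tells us nothing.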
Some related caveats about p-values are explained very nicely by Schmidt (1992). In a set of experiments, due to sampling error, the observed effect size will vary approximately normally around the true population effect size. If a population effect is real (δ > 0), it is nevertheless possible to obtain an observed effect size of exactly zero (d = 0). Conversely, if the null hypothesis is true, it is possible to obtain a significant p-value and a reasonably sized observed effect. Therefore, obtaining one significant (p < 0.05) and one non-significant (p > 0.05) result in two experiments tells us very little about whether the effect is real or not. Even more detrimental is the conclusion that the two p-values are intrinsically different and must come from different populations. Such a result is often interpreted to mean that there must be some unknown moderator determining whether the effect is present or not, when it is likely that both observations simply reflect sampling noise around a single true population parameter value (which may or may not be zero).
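The same point can be made by simulating a set of studies of one real effect. In the sketch below, the true standardised effect (δ = 0.3) and the sample size (20 per group) are illustrative assumptions of mine, not values from Schmidt (1992):

# Sampling error around a single true effect (cf. Schmidt, 1992).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
delta, n, n_studies = 0.3, 20, 5_000   # true effect, group size, number of studies

d_values, p_values = [], []
for _ in range(n_studies):
    a = rng.normal(delta, 1, n)   # "treatment" group, true mean = delta
    b = rng.normal(0.0, 1, n)     # "control" group, true mean = 0
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d_values.append((a.mean() - b.mean()) / pooled_sd)   # observed Cohen's d
    p_values.append(ttest_ind(a, b).pvalue)

d_values, p_values = np.array(d_values), np.array(p_values)
print(f"Observed d ranges from {d_values.min():.2f} to {d_values.max():.2f}")
print(f"Proportion of studies significant at .05: {np.mean(p_values < .05):.2f}")

Even though every simulated study samples the same population with the same real effect, the observed effect sizes scatter widely (some at or below zero), and the studies split into "significant" and "non-significant" purely through sampling error.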
So much for p-values (and other frequentist statistics, such as confidence intervals or power calculations): they are only meaningful when numerous studies are available, and if we can be sure that those studies constitute truly random repeated samples. Given the current publication system, as described above, neither of the two conditions is met: (1) Regardless of whether replications yield significant p-values or not, their publication is discouraged. Therefore, for some effects, there is only one p-value available in the literature. (2) Replications are especially difficult to publish if they yield a non-significant result. This means that positive-result replications are over-represented in the literature. In the worst-case scenario, a “consistent effect” in the literature simply represents the 5% false positives (Rosenthal, 1979). In the defence of p-values, then: their distribution can tell us a lot. It can give us information about the presence or absence of an effect, and even about questionable research practices for a given research area (Simonsohn, Nelson, & Simmons, 2014a; 2014b). However, this becomes difficult or impossible in a system where a large proportion of experiment outcomes are never reported because they do not meet an arbitrary criterion, and where conclusions are generally drawn from a single p-value.
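As a rough sketch of the intuition behind p-curve (not the actual method of Simonsohn and colleagues, and with effect and sample sizes of my own choosing), one can compare the shape of the distribution of significant p-values when the null is true versus when the effect is real:

# Shape of the distribution of *significant* p-values: null vs. real effect.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)

def significant_p_values(delta, n_studies=20_000, n=30):
    """Simulate two-group studies and keep only the p-values below .05."""
    p = np.array([
        ttest_ind(rng.normal(delta, 1, n), rng.normal(0, 1, n)).pvalue
        for _ in range(n_studies)
    ])
    return p[p < .05]

for delta in (0.0, 0.5):
    sig = significant_p_values(delta)
    print(f"delta = {delta}: {np.mean(sig < .01):.2f} of significant p-values are < .01")

When the null is true, the significant p-values are spread evenly between 0 and .05 (about 20% fall below .01); when the effect is real, they pile up near zero. This is exactly the kind of distributional information that is lost when only a selected subset of results gets published.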
But are Bayes Factors better, and if so, how? Unlike frequentist theorists, Bayesians do not start off with the assumption of a hypothetical infinite chain of events. Instead, one starts off with a prior belief about what kind of effect one might expect. If research on this question is already available, one can use that to guide the expected effect size; otherwise, an educated guess will do, with an adjustment in the prior distribution to reflect a high degree of uncertainty. After the data is collected, the prior belief is combined with the data to yield a posterior distribution. Thus, a Bayesian analysis allows one to update one’s a priori beliefs in light of incoming data. More comprehensive explanations of the Bayesian framework are available in Dienes (2011) and van de Schoot et al. (2014).
Back to the coin example: if a Bayesian wants to check whether a coin is fair, she starts off with an a priori belief. If there is no reason to believe that the coin is rigged, the belief would be that in a coin toss, heads and tails are equally likely outcomes. Say the first two coin tosses come up heads. While this is not overly strong evidence against the prior belief, she might be somewhat inclined to consider the possibility that the coin is biased towards heads. If the third coin toss comes up tails, the degree of belief shifts back towards the “unbiased” hypothesis. Thus, each consecutive coin toss allows the Bayesian to update her belief about the coin. The more data is available, the more confident the Bayesian will be that her belief is correct.
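As a concrete sketch of this updating process, here is a Beta-Binomial version of the coin example (the conjugate model and the prior pseudo-counts are assumptions of mine; the description above does not commit to a particular model), followed by a simple Bayes factor comparing a fair-coin point null against a uniform prior on the bias:

# Bayesian belief updating for the coin example (Beta-Binomial model).
from math import comb
from scipy.stats import beta

a, b = 5, 5          # prior pseudo-counts: a fairly confident "the coin is fair" prior
tosses = [1, 1, 0]   # 1 = heads, 0 = tails: two heads, then one tail

for i, toss in enumerate(tosses, start=1):
    a += toss        # conjugate update: add the new observation
    b += 1 - toss    # to the matching pseudo-count
    print(f"after toss {i}: posterior mean P(heads) = {beta(a, b).mean():.3f}")

# A simple Bayes factor for the same data: point null "fair coin" (theta = 0.5)
# versus a uniform Beta(1, 1) prior on the bias.
heads, n = sum(tosses), len(tosses)
marginal_null = comb(n, heads) * 0.5 ** n   # P(2 heads in 3 tosses | fair coin)
marginal_alt = 1 / (n + 1)                  # P(2 heads in 3 tosses | uniform prior) = 1/(n+1)
print(f"BF01 (evidence for fairness) = {marginal_null / marginal_alt:.2f}")

With two heads and one tail, the posterior mean drifts up after the two heads and back down after the tail, and the Bayes factor comes out at about 1.5 in favour of the fair coin, i.e., next to no evidence either way, which is exactly what one would hope for after three tosses.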
If a large amount of unbiased data is available, a frequentist and a Bayesian studying the same question would generally converge on the same conclusions. This is a big “if”, though, one that does not seem to be met in the published literature. As such, a Bayesian approach might be better suited to drawing conclusions based on a small amount of available data: even though it would be optimal to have as much data as possible within any theoretical framework, a Bayesian, but not a frequentist, approach allows for conclusions based on few studies. It is important to bear in mind that, in a Bayesian framework, there is a large degree of uncertainty in one’s beliefs if the data on which they are based is sparse. Furthermore, Bayesian analyses are not immune to publication bias: if one considers only papers that report type-I errors, one is less likely to arrive at the conclusion that the null hypothesis is true, even if this would be evident from a set of experiments using truly random samples.
In conclusion, it seems that the main problem is not intrinsic to p-values per se, but rather lies in the data that is available in the literature and in the common practice of drawing conclusions based on a single study. The former may be a consequence of the latter: if researchers believe that the p-value of a single study can provide a convincing answer to the question "Is there an effect?", there seems indeed very little use in publishing replications. However, we know that this is not true. Furthermore, regardless of the theoretical framework that one adopts for statistical inference, it is always necessary to have multiple studies in order to be confident about the conclusions we draw.

References
Cumming, G. (2014). The New Statistics: Why and How. Psychological Science, 25(1), 7-29. doi: 10.1177/0956797613504966
Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on? Perspectives on Psychological Science, 6(3), 274-290.
Halsey, L. G., Curran-Everett, D., Vowler, S. L., & Drummond, G. B. (2015). The fickle P value generates irreproducible results. Nature Methods, 12(3), 179-185. doi: 10.1038/nmeth.3288
Rosenthal, R. (1979). The "File Drawer Problem" and Tolerance for Null Results. Psychological Bulletin, 86(3), 638-641.
Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47(10), 1173.
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014a). p-Curve and Effect Size: Correcting for Publication Bias Using Only Significant Results. Perspectives on Psychological Science, 9(6), 666-681. doi: 10.1177/1745691614553988
Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014b). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143(2), 534-547. doi: 10.1037/a0033242
van de Schoot, R., Denissen, J., Neyer, F., Kaplan, D., Asendorpf, J., & van Aken, M. (2014). A gentle introduction to Bayesian analysis: Applications to developmental research. Child Development, 85(3), 842-860. doi: 10.1111/cdev.12169

** Disclaimer: I am far from an expert on either frequentist or Bayesian statistics. I would welcome any corrections or discussions! **

Friday, July 17, 2015

Hyman’s Maxim: The most important principle in observational sciences?

I grew up in a house of mathematicians. Among other things, this means that throughout my childhood, I heard a lot of jokes about physicists (the mathematicians’ equivalent of blonde jokes). As a child, I used to find those jokes mildly amusing. When I started learning about research methods in psychology, I started finding them really funny, but sad at the same time – after all, if a hard science like physics is a laughing stock because of its dodgy scientific practices, what does this mean for a soft science like psychology?
Here’s an example of a physicist joke:

A physicist is giving a talk at a conference. He is holding up a graph and explaining what the data on it means. After half an hour, a graduate student timidly raises her hand and politely notes that the graph is being held upside down. The physicist stops, looks at the graph, turns it the right way up, and says: “Why, you’re right! Well, in this case the data is even easier to explain!”

The moral of the story is that, for a reasonably intelligent and creative person, it is almost always possible to come up with a plausible-sounding explanation for any set of results. This, of course, is already well known: for this reason, the scientific method entails explicitly stating a hypothesis before the data is collected and analysed. However, in psychological research this principle is not straightforward to follow, for a simple reason: it is very rare for the data to behave in the way that was anticipated.
Like probably everyone else in the field, I learnt this the hard way during my PhD. I conducted a study to look at an effect in three conditions (let’s call the size of the effect in the three conditions A, B, and C, respectively). Two theories (let’s call them X and Y) made opposing predictions:
If X is true, A < B < C.
If Y is true, A > B > C.
The result? B = 0 < A = C.
I think this scenario is familiar to anyone who has ever done an experiment in the so-called soft sciences, and it’s a PhD student’s worst nightmare. What does one do with a set of results like these? One of my advisors said I could do one of the following: (1) figure out why we got this unexpected set of results, (2) write up a paper with our initial predictions in the introduction and our results, and conclude that ‘more research is needed’ to understand this unexpected set of results, or (3) forget about the whole thing.
In retrospect, I should have done (3), but due to my stubbornness I went for (1) instead. I had numerous meetings with my advisors to discuss how any theory could account for the obtained results – but in this case, we could not even come up with a reasonable-sounding explanation. Then I decided to collect some more data. I conducted four more studies with larger samples, and eventually performed a meta-analysis of all the data on this effect that I could get my hands on (which, aside from the data that I had collected, was not much). Having thus maximised the power to detect a true effect if there was one, I found that A = B = 0, and that C was only slightly larger than zero. On the bright side, this finally allowed me to conclude that the data is more compatible with Theory X than Theory Y, but by this stage I had wasted hours of my advisors’ time and most of my PhD trying to understand the results of the first experiment, which were basically just random noise.
This is where Hyman’s Maxim comes in. I came across it in this blogpost by chance, after I had already submitted my PhD thesis. The maxim says: “Do not try to explain something until you are sure there is something to be explained.” Ray Hyman started off as a magician, but later became a skeptic and a psychologist. Aside from the blogpost, I have not found any publications on Hyman’s Maxim, but in my opinion, this is the most important principle in psychological science, and possibly any science that involves drawing inferences from data. As a scientist’s main job is to obtain data that can support or refute theories, it is easy to get carried away with drawing the link between data and theory, and to forget how important it is to ensure that the data actually tells you what you think it tells you. In psychology, with generally small effects and noisy data, the non-zero probability that a statistically significant effect reflects random noise is often forgotten. Consequently, any statistically significant result is in danger of being interpreted as ‘meaningful’: if the a priori theory did not predict it, we must be missing something; there must be some explanation or moderating factor that accounts for this unexpected result.

In conclusion, unless we have ensured that an unexpected result is replicable, drawing inferences from a single study with a statistically significant result that was not predicted a priori is a lot like telling someone’s future from the stars or their tea leaves. In fact, if the null hypothesis happens to be true, it is literally like telling someone’s future from the stars or their tea leaves. On some level, everyone knows this already, but perhaps it is easy to forget this point. My proposed solution to this problem: Create motivational posters starring Hyman’s Maxim. Put them up in every psychological scientist’s office and bathroom.

*********************************************
Edit (29/3/15): I created some motivational posters starring Hyman's Maxim. Sorry for my awful photoshop skills; feel free to improve or make your own!