Wednesday, February 13, 2019

P-values 101: An attempt at an intuitive but mathematically correct explanation


P-values are those things that we want to be smaller than 0.05; at least, that is how I knew them during my undergraduate degree, and (I must admit) throughout my PhD. Even if you’re a non-scientist or work in a field that does not use p-values, you’ve probably heard of terms like p-hacking (e.g., from this video by John Oliver). P-values and, more specifically, p-hacking, are getting the blame for a lot of the things that go wrong in psychological science, and probably in other fields as well.

So, what exactly are p-values, what is p-hacking, and what does all of that have to do with the replication crisis? At this point, a lot of stats-savvy people shake their heads and say: “Well, it’s complicated.” Or they start explaining the formal definition of the p-value, which, in my experience, sounds to someone who doesn’t already know a lot about statistics like blowing smoke and nitpicking about wording. There are rules which, if one doesn’t understand how a p-value works, just sound like religious rituals, such as: “If you calculate a p-value halfway through data collection, the p-value that you calculate at the end of your study will be invalid.”

Of course, the mathematics behind p-values is on the complicated side: I had the luxury of being able to take the time to do a graduate certificate in statistics, and to learn more about statistics than most of my colleagues will have the time for. Still, I won’t be able to explain how to calculate the p-value without revising and preparing. However, the logic behind it is relatively simple, and I often wonder why people don’t explain p-values (and p-hacking) in a much less complicated way1. So here goes my attempt at an intuitive, maths-free, but at the same time mathematically correct explanation.

What are p-values?
A p-value is the conditional probability of obtaining the data, or data more extreme, given the null hypothesis. This is a paraphrased formal textbook definition. So, what does this mean?

Conditional means that we’re assuming a universe where the null hypothesis is always true. For example, let’s say I pick 100 people and randomly divide them into two groups. Group 1 gets a placebo, and Group 2 also gets a placebo. Here, we’ve defined the null hypothesis to be true.

Then I conduct my study; I might ask the participants whether they feel better after having taken this pill and look at differences in well-being between Group 1 and Group 2. In the long run, I don’t expect any differences, but in a given sample, there will be at least some numerical differences due to random variation. Then – and this point is key for the philosophy behind the p-value – I repeat this procedure, 10 times, 100 times, 1,000,000 times (this is why the p-value is called a frequentist statistic). And, according to the definition of the p-value, I will find that, as the number of repetitions increases, the percentage of experiments where I get a p-value smaller than or equal to 0.05 will get closer and closer to 5%. Again, this works in our modeled universe where the null hypothesis is true.
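If you prefer to see this in action, here is a minimal Python sketch of the placebo-vs-placebo universe. The normally distributed well-being scores, the group size, and the two-sample t-test are my own assumptions, chosen purely for illustration:

```python
# A minimal simulation of the "null is always true" universe described above:
# both groups get a placebo, so any difference is pure noise, and in the long
# run roughly 5% of the t-tests come out "significant" at alpha = 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments = 20_000   # how often we repeat the placebo-vs-placebo study
n_per_group = 50         # 100 participants split into two groups

false_positives = 0
for _ in range(n_experiments):
    group1 = rng.normal(loc=0, scale=1, size=n_per_group)  # "well-being" under placebo
    group2 = rng.normal(loc=0, scale=1, size=n_per_group)  # also placebo: null is true
    _, p = stats.ttest_ind(group1, group2)
    if p <= 0.05:
        false_positives += 1

print(f"Proportion of p <= 0.05: {false_positives / n_experiments:.3f}")
# -> close to 0.05, exactly as the definition of the p-value promises
```

Every one of those “significant” results is, by construction, a false positive.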

The second important aspect of the definition of the p-value is that it is the probability of the data, not of the null hypothesis. Why this is important to bear in mind can be, again, explained with the example above. Let’s say we conduct our experiment with the two identical placebo treatments, and get a p-value of 0.037. What’s the probability that the null hypothesis is false? 0%: We know that it’s true, because this is how we designed the experiment. What if we get a p-value of 0.0000001? The probability of the null hypothesis being false is still 0%.

If we’re interested in the probability of the hypothesis rather than the probability of the data, we’ll need to use Bayes’ Theorem (which I won’t explain here, as this would be a whole different blog post). The reason why the p-value is not interpretable as a probability about the hypothesis (null or otherwise) is that we need to consider how likely the null hypothesis is to be true. In the example above, if we do the experiment 1,000,000 times, the null hypothesis will always be true, by design, therefore we can be confident that every single significant p-value that we get is a false positive.

We can take a different example, where we know that the null hypothesis is always false. For example, we can take children between 0 and 10 years old, and correlate their height with their age. If we collect data repeatedly and accumulate a huge number of samples to calculate the p-value associated with the correlation coefficient, we will occasionally get p > 0.05 (the proportion of times that this will happen depends both on the sample size and the true effect size, and equals 1 minus the statistical power). So, if we do the experiment and find that the correlation is not significantly different from zero, what is the probability that the null hypothesis is false? It’s 100%, because everyone knows that children’s height increases with age.
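Here is a similar sketch for the age–height example. The growth numbers, the noise, and the deliberately small sample size are invented, so treat it as an illustration of the power idea rather than a realistic growth model:

```python
# A toy sketch of the age-height example: the null is false by construction,
# but with a small, noisy sample we still sometimes get p > 0.05. The
# proportion of such misses estimates 1 minus the statistical power.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments = 20_000
n_children = 8            # a deliberately small sample

misses = 0
for _ in range(n_experiments):
    age = rng.uniform(0, 10, size=n_children)                    # years
    height = 75 + 6 * age + rng.normal(0, 15, size=n_children)   # cm, noisy
    _, p = stats.pearsonr(age, height)
    if p > 0.05:
        misses += 1

print(f"Proportion of p > 0.05 (misses): {misses / n_experiments:.3f}")
print(f"Estimated power: {1 - misses / n_experiments:.3f}")
```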

In these two cases we know that, in reality, the null hypothesis is true and false, respectively. In an actual experiment, of course, we don’t know whether it’s true or false – that’s why we’re doing the experiment in the first place. What is relevant, however, is how many of our hypotheses we expect to be true in the long run.

If we study mainly crazy ideas that we had in the shower, chances are, the null hypothesis will often be true. This means that the posterior probability of the null hypothesis being false, even after obtaining a p-value smaller than 0.05, will still be relatively small.

If we carefully build on previous empirical work, design a methodologically sound study, and derive our predictions from well-founded theories, we will be more likely to study effects where the null hypothesis is often false. Therefore, the posterior probability that the null hypothesis is false will be greater than for our crazy shower idea, even if we get an identical p-value.
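To make this concrete, here is a back-of-the-envelope Bayes calculation. The prior probabilities and the assumed power of 80% are invented for illustration; the point is only how strongly the prior changes the posterior for an identical significant result:

```python
# P(null false | p < alpha) via Bayes' theorem, with invented numbers.
def prob_null_false_given_significant(prior_h1, power, alpha=0.05):
    """Posterior probability that the null is false, given a significant result."""
    true_positives = power * prior_h1          # null false, and we detect it
    false_positives = alpha * (1 - prior_h1)   # null true, significant by chance
    return true_positives / (true_positives + false_positives)

# Crazy shower idea: say only 1 in 10 such hypotheses is actually true.
print(prob_null_false_given_significant(prior_h1=0.10, power=0.8))  # ~0.64
# Prediction derived from solid theory: say 1 in 2 is true.
print(prob_null_false_given_significant(prior_h1=0.50, power=0.8))  # ~0.94
```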

So what about p-hacking?
That was the crash course on p-values, so now we turn to p-hacking. P-hacking practices are often presented as lists in talks about open science, and include:
-       Optional stopping: Collecting N participants, calculating a p-value, and deciding whether or not to continue with data collection depending on whether this first peek yields a significant p-value or not (see the sketch just below this list),
-       Cherry-picking of outcome variables: Collecting many potential outcome variables in a given experiment, and deciding on or changing the outcome-of-interest after having looked at the results,
-       Determining or changing data cleaning procedures conditional on whether the p-value is significant or not,
-       Hypothesising after results are known (HARKing): First collecting the data, and then writing a story to fit the results and framing it as a confirmatory study.
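To see why the first of these practices inflates false positives, here is a sketch of optional stopping in the placebo-vs-placebo universe from above. The batch sizes and the single interim peek are arbitrary choices of mine:

```python
# Optional stopping under a true null: test after the first 25 participants
# per group, and only collect 25 more per group if that first peek is not yet
# significant. The overall false-positive rate ends up above the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_experiments = 20_000
significant = 0

for _ in range(n_experiments):
    g1 = rng.normal(0, 1, 25)
    g2 = rng.normal(0, 1, 25)
    _, p = stats.ttest_ind(g1, g2)
    if p > 0.05:  # not significant yet? collect more data and test again
        g1 = np.concatenate([g1, rng.normal(0, 1, 25)])
        g2 = np.concatenate([g2, rng.normal(0, 1, 25)])
        _, p = stats.ttest_ind(g1, g2)
    if p <= 0.05:
        significant += 1

print(f"False-positive rate with one peek: {significant / n_experiments:.3f}")
# -> noticeably above 0.05, even though the null hypothesis is true throughout
```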

The reasons why each of these practices is wrong are not intuitive when we think about data. However, they become clearer when we think of a different example: guessing the outcome of a coin toss. We can use this different example because both the correctness of a coin-toss guess and the data that we collect in an experiment are random variables: they follow the same principles of probability.

Imagine that I want to convince you that I have clairvoyant powers, and that I can guess the outcome of a coin toss. If I toss the coin 10 times and correctly guess the outcome every single time, you will certainly be impressed, and wonder whether there might be something to my claim. You will be less impressed, however, if I keep tossing the coin until I happen to get ten correct guesses in a row. Of course, if you’re not present, I can instead make a video recording, cut out all of my unsuccessful attempts, and leave in only the ten correct guesses. From a probabilistic perspective, this would be the same as optional stopping.2

Now to cherry-picking of outcome variables: In order to convince you of my clairvoyant skills, I declare that I will toss the coin 10 times and guess the outcome correctly every single time. I toss the coin 10 times, and guess correctly 7 times out of 10. I can backtrack and argue that I still guessed correctly more often than incorrectly, and that therefore my claim about my supernatural ability holds true. But you will not be convinced at all.
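For readers who like numbers, here is a quick binomial check of how impressive these two claims actually are. The calculation is standard; the scenario is, of course, just the toy example above:

```python
# How impressive are the two coin-guessing claims by pure chance alone?
from scipy import stats

# Announced in advance: "I will guess all 10 tosses correctly."
p_ten_out_of_ten = 0.5 ** 10
print(p_ten_out_of_ten)  # ~0.001: very unlikely to happen by luck

# Claimed after the fact: "I guessed correctly more often than incorrectly."
# Probability of 7 or more correct guesses out of 10 by chance:
p_seven_or_more = stats.binom.sf(6, n=10, p=0.5)
print(p_seven_or_more)   # ~0.17: not impressive at all
```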

The removal of outliers, in the coin-toss metaphor, would be equivalent to discarding every toss where I guess incorrectly. HARKing would be equivalent to tossing the coin multiple times, then deciding on the story to tell later (“My guessing accuracy was significantly above chance”/ “I broke the world record in throwing the most coin tosses within 60 seconds” / “I broke the world record in throwing the coin higher than anyone else” / “I’m the only person to have ever succeeded in accidentally hitting my eye while tossing a coin!”). 

So, if you're unsure whether a given practice can be considered p-hacking or not, try to think of an equivalent coin example, where it will be more intuitive to decide whether the reasoning behind the analysis or data processing choice is logically sound or not. 

More formally, when the final analysis choice is conditional on the results, it messes with the frequentist properties of the statistic (i.e., the p-value is designed to have certain properties which hold true if you repeat a procedure an infinite number of times, and which no longer hold true if you add an unaccounted-for conditional term to the procedure). Such an additional conditional term could be based on the p-value (we will continue with data collection under the condition that p > 0.05), or it could be based on descriptive statistics (we will calculate a p-value between two participant groups out of ten under the condition that the graph shows the biggest difference between these particular groups).
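The second kind of conditioning can be simulated as well. The group sizes below are arbitrary; the point is only that testing the pair with the biggest observed difference no longer has the advertised 5% false-positive rate:

```python
# Cherry-picking the comparison: simulate ten groups drawn from the same
# distribution, then run a t-test only on the two groups that happen to show
# the biggest difference in means.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_experiments = 20_000
n_groups, n_per_group = 10, 20
significant = 0

for _ in range(n_experiments):
    groups = rng.normal(0, 1, size=(n_groups, n_per_group))
    means = groups.mean(axis=1)
    lo, hi = np.argmin(means), np.argmax(means)   # the "biggest difference"
    _, p = stats.ttest_ind(groups[lo], groups[hi])
    if p <= 0.05:
        significant += 1

print(f"False-positive rate after cherry-picking: {significant / n_experiments:.3f}")
# -> far above 0.05, although every group came from the same distribution
```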

Some final thoughts
I hope that readers who have not heard explanations of p-values that made sense to them found my attempt useful. I also hope that researchers who often find themselves in a position where they explain p-values will see my explanations as a useful example. I would be happy to get feedback on my explanation, specifically about whether I achieved these goals: I occasionally do talks about open science and relevant methods, and I aim to streamline my explanation such that it is truly understandable to a wide audience.

It is not the case that people who engage in p-hacking practices are incompetent or dishonest. At the same time, it is also not the case that people who engage in p-hacking practices would stop doing so if they learned to rattle off the formal definition of a p-value in their sleep. But, clearly, fixing the replication crisis involves improving the general knowledge of statistics (including how to correctly use p-values for inference). There seems to be a general consensus about this, but judging by many Twitter conversations that I’ve followed, there is no consensus about how much statistics, say, a researcher in psychological science needs to learn. After all, time is limited, and there is a lot of other general as well as topic-specific knowledge that needs to be acquired in order to do good research. If I’m asked how much statistics psychology researchers should know, my answer is, as always: “As much as possible”. But a widespread understanding of statistics is not going to be achieved by wagging a finger at researchers who get confused when confronted with the difference between the probability of the hypothesis versus the probability of the data conditional on the hypothesis. Instead, providing intuitive explanations should not only improve understanding, but also show that, on some level (a level that, arguably, is both necessary and sufficient for researchers to do sound research), statistics is not an impenetrable swamp of maths.

-----------------------------------------
Edit (18/2/2019): If you prefer a different format of the above information, here is a video of me explaining the same things in German.

-----------------------------------------
1 There are some intuitive explanations that I’ve found very helpful, like this one by Dorothy Bishop, or in general anything by Daniel Lakens (many of his blog entries, his Coursera course). If he holds any workshops anywhere near you, I strongly recommend going. In 2015, I woke up at 5am to get from Padova to Daniel's workshop in Rovereto, and at the end of the day, I was stuck in Rovereto because my credit card had been blocked and I couldn't buy a ticket back to Padova. It was totally worth it!
2 Optional stopping does not involve willingly excluding relevant data, while my video recording example does. Unless disclosed, however, optional stopping does involve the withholding of information that is critical in order to correctly interpret the p-value. Therefore, I consider the two cases to be equivalent.