P-values are those things that we want to be smaller than 0.05, as I knew them during my undergraduate degree, and (I must admit) throughout my PhD. Even if you’re a non-scientist or work in a field that does not use p-values, you’ve probably heard of terms like p-hacking (e.g., from this video by John Oliver). P-values and, more specifically, p-hacking, are getting the blame for a lot of the things that go wrong in psychological science, and probably in other fields as well.
So, what exactly are p-values, what is p-hacking, and what does all of that have to do with the replication crisis? Here, a lot of stats-savvy people shake their heads and say: “Well, it’s complicated.” Or they start explaining the formal definition of the p-value, which, in my experience, to someone who doesn’t already know a lot about statistics, sounds like blowing smoke and nitpicking on wording. There are rules which, if one doesn’t understand how a p-value works, just sound like religious rituals, such as: “If you calculate a p-value halfway through data collection, the p-value that you calculate at the end of your study will be invalid.”
Of course, the mathematics behind p-values is on the complicated side: I had the luxury of being able to take the time to do a graduate certificate in statistics, and to learn more about statistics than most of my colleagues will have the time for. Still, I wouldn’t be able to explain how to calculate a p-value without revising and preparing. However, the logic behind it is relatively simple, and I often wonder why people don’t explain p-values (and p-hacking) in a much less complicated way^{1}. So here goes my attempt at an intuitive, maths-free, but at the same time mathematically correct explanation.
What are p-values?
A p-value is the conditional probability of obtaining the data, or data more extreme, given the null hypothesis. This is a paraphrased formal textbook definition. So, what does this mean?
Conditional means that we’re assuming a universe where the null hypothesis is always true. For example, let’s say I pick 100 people and randomly divide them into two groups. Group 1 gets a placebo, and Group 2 also gets a placebo. Here, we’ve defined the null hypothesis to be true.
Then I conduct my study; I might ask the participants whether they feel better after having taken this pill and look at differences in well-being between Group 1 and Group 2. In the long run, I don’t expect any differences, but in a given sample, there will be at least some numerical differences due to random variation. Then – and this point is key for the philosophy behind the p-value – I repeat this procedure 10 times, 100 times, 1,000,000 times (this is why the p-value is called a frequentist statistic). And, according to the definition of the p-value, I will find that, as the number of repetitions increases, the percentage of experiments where I get a p-value smaller than or equal to 0.05 will get closer and closer to 5%. Again, this works in our modelled universe where the null hypothesis is true.
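This long-run behaviour is easy to check by simulation. Below is a minimal sketch (Python with NumPy; the group sizes, the number of repetitions, and the large-sample z-test are my own illustrative choices, not taken from any particular study) that repeatedly draws two placebo groups from the same distribution and counts how often the p-value falls at or below 0.05:

```python
import math
import numpy as np

def two_sample_p(x, y):
    """Two-sided p-value for a difference in means.

    Uses a large-sample z approximation (erfc) so the example stays
    dependency-light; a t-test would be the textbook choice.
    """
    se = math.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    z = (x.mean() - y.mean()) / se
    return math.erfc(abs(z) / math.sqrt(2))

rng = np.random.default_rng(0)
n_experiments = 10_000
significant = 0
for _ in range(n_experiments):
    # Both groups receive the placebo: the null hypothesis is true by design.
    group1 = rng.normal(0.0, 1.0, 100)
    group2 = rng.normal(0.0, 1.0, 100)
    if two_sample_p(group1, group2) <= 0.05:
        significant += 1

print(f"Proportion of p <= 0.05: {significant / n_experiments:.3f}")
```

Over many repetitions the proportion settles near 0.05; and every one of those “significant” results is a false positive, since we built the null hypothesis into the simulation.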
The second important aspect of the definition of the p-value is that it is the probability of the data, not of the null hypothesis. Why this is important to bear in mind can, again, be explained with the example above. Let’s say we conduct our experiment with the two identical placebo treatments, and get a p-value of 0.037. What’s the probability that the null hypothesis is false? 0%: We know that it’s true, because this is how we designed the experiment. What if we get a p-value of 0.0000001? The probability of the null hypothesis being false is still 0%.
If we’re interested in the probability of the hypothesis rather than the probability of the data, we’ll need to use Bayes’ Theorem (which I won’t explain here, as that would be a whole different blog post). The reason why the p-value is not interpretable as a probability about the hypothesis (null or otherwise) is that we would also need to consider how likely the null hypothesis is to be true in the first place. In the example above, if we do the experiment 1,000,000 times, the null hypothesis will always be true, by design; therefore, we can be confident that every single significant p-value that we get is a false positive.
We can take a different example, where we know that the null hypothesis is always false. For example, we can take children between 0 and 10 years old, and correlate their height with their age. If we collect data repeatedly and accumulate a huge number of samples to calculate the p-value associated with the correlation coefficient, we will occasionally get p > 0.05 (the proportion of times that this will happen depends both on the sample size and the true effect size, and equals 1 minus the statistical power). So, if we do the experiment and find that the correlation is not significantly different from zero, what is the probability that the null hypothesis is false? It’s 100%, because everyone knows that children’s height increases with age.
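We can simulate this scenario, too. In the sketch below (Python with NumPy; the growth rate, the noise level, and the small sample of 20 children are invented for illustration), the null hypothesis is false by construction, yet a non-significant result still turns up in a fraction of experiments roughly equal to 1 minus the power:

```python
import math
import numpy as np

def correlation_p(x, y):
    """Two-sided p-value for a Pearson correlation via the
    Fisher z transform (a standard large-sample approximation)."""
    r = np.corrcoef(x, y)[0, 1]
    z = math.atanh(r) * math.sqrt(len(x) - 3)
    return math.erfc(abs(z) / math.sqrt(2))

rng = np.random.default_rng(0)
n_experiments = 2_000
misses = 0
for _ in range(n_experiments):
    age = rng.uniform(0, 10, 20)                   # 20 children aged 0-10
    height = 50 + 6 * age + rng.normal(0, 25, 20)  # height truly grows with age
    if correlation_p(age, height) > 0.05:          # a miss: real effect, p > 0.05
        misses += 1

print(f"Proportion of misses: {misses / n_experiments:.3f}")
```

The proportion of misses is the Type II error rate; with a bigger sample or less noise it shrinks, because statistical power grows.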
In these two cases we know that, in reality, the null hypothesis is true and false, respectively. In an actual experiment, of course, we don’t know whether it’s true or false – that’s why we’re doing the experiment in the first place. What matters, however, is how many of our hypotheses we expect to be true in the long run.
If we study mainly crazy ideas that we had in the shower, chances are, the null hypothesis will often be true. This means that the posterior probability of the null hypothesis being false, even after obtaining a p-value smaller than 0.05, will still be relatively small.
If we carefully build on previous empirical work, design a methodologically sound study, and derive our predictions from well-founded theories, we will be more likely to study effects where the null hypothesis is often false. Therefore, the posterior probability that the null hypothesis is false will be greater than for our crazy shower idea, even if we get an identical p-value.
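Bayes’ Theorem makes this contrast concrete. The little calculation below (Python; the priors, the 80% power, and the 5% false positive rate are all made-up illustrative numbers, not values from the text) shows how the same significant p-value supports the two kinds of hypotheses very differently:

```python
def prob_effect_given_significant(prior_effect, power=0.8, alpha=0.05):
    """P(real effect | p <= alpha) via Bayes' Theorem, assuming a fixed
    power and false positive rate (illustrative values only)."""
    # Total probability of a significant result: true positives + false positives.
    p_significant = power * prior_effect + alpha * (1 - prior_effect)
    return power * prior_effect / p_significant

# Crazy shower idea: a real effect is a priori unlikely (say, 5%).
print(prob_effect_given_significant(0.05))  # about 0.46

# Theory-driven prediction: a real effect is a priori plausible (say, 50%).
print(prob_effect_given_significant(0.50))  # about 0.94
```

An identical p-value leaves very different posterior beliefs, purely because the prior plausibility of the effect differs.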
So what about p-hacking?
That was the crash course on p-values, so now we turn to p-hacking.
P-hacking practices are often presented as lists in talks about open science, and include:

Optional stopping: Collecting N participants, calculating a p-value, and deciding whether or not to continue with data collection depending on whether this first peek gives a significant p-value or not;

Cherry-picking of outcome variables: Collecting many potential outcome variables in a given experiment, and deciding on or changing the outcome-of-interest after having looked at the results;

Determining or changing data cleaning procedures conditional on whether the p-value is significant or not;

Hypothesising after results are known (HARKing): First collecting the data, and then writing a story to fit the results and framing it as a confirmatory study.
The reasons why each of these practices is wrong are not intuitive when we think about data. However, they become clearer when we think of a different example: guessing the outcome of a coin toss. We can use this different example because both the correctness of a coin-toss guess and the data that we collect in an experiment are random variables: they follow the same principles of probability.
Imagine that I want to convince you that I have clairvoyant powers, and that I can guess the outcome of a coin toss. You will certainly be impressed, and wonder whether there might be something to it, if I toss the coin 10 times and correctly guess the outcome every single time. You will be less impressed, however, if I simply toss the coin until I have correctly guessed ten outcomes in a row. Of course, if you’re not present, I can instead make a video recording, cut out all of my unsuccessful attempts, and leave in only the ten correct guesses. From a probabilistic perspective, this would be the same as optional stopping.^{2}
Now to cherry-picking of outcome variables: In order to convince you of my clairvoyant skills, I declare that I will toss the coin 10 times and guess the outcome correctly every single time. I toss the coin 10 times, and guess correctly 7 times out of 10. I can backpedal and argue that I still guessed correctly more often than incorrectly, and that therefore my claim about my supernatural ability holds true. But you will not be convinced at all.
The removal of outliers, in the coin-toss metaphor, would be equivalent to discarding every toss where I guess incorrectly. HARKing would be equivalent to tossing the coin multiple times, then deciding on the story to tell later (“My guessing accuracy was significantly above chance” / “I broke the world record in throwing the most coin tosses within 60 seconds” / “I broke the world record in throwing the coin higher than anyone else” / “I’m the only person to have ever succeeded in accidentally hitting my eye while tossing a coin!”).
So, if you're unsure whether a given practice can be considered p-hacking or not, try to think of an equivalent coin-toss example, where it will be more intuitive to decide whether the reasoning behind the analysis or data processing choice is logically sound.
More formally, when the final analysis choice is conditional on the results, it messes with the frequentist properties of the statistic (i.e., the p-value is designed to have certain properties which hold true if you repeat a procedure an infinite number of times, and which no longer hold true if you add an unaccounted-for conditional term to the procedure). Such an additional conditional term could be based on the p-value (we will continue with data collection under the condition that p > 0.05), or it could be based on descriptive statistics (we will calculate a p-value between two participant groups out of ten under the condition that the graph shows the biggest difference between these particular groups).
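This distortion of the frequentist properties can itself be demonstrated by simulation. In the sketch below (Python with NumPy; the sample sizes and the single interim peek are my own illustrative choices), the null hypothesis is true, yet peeking halfway through data collection, with the option to stop early if p ≤ 0.05, pushes the false positive rate above the nominal 5%:

```python
import math
import numpy as np

def two_sample_p(x, y):
    """Two-sided p-value for a difference in means (large-sample z approximation)."""
    se = math.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    z = (x.mean() - y.mean()) / se
    return math.erfc(abs(z) / math.sqrt(2))

rng = np.random.default_rng(0)
n_experiments = 5_000
significant = 0
for _ in range(n_experiments):
    # Null hypothesis true by design: both groups come from the same distribution.
    a = rng.normal(0.0, 1.0, 100)
    b = rng.normal(0.0, 1.0, 100)
    # Optional stopping: peek after 50 participants per group; report that
    # p-value if it is significant, otherwise collect the remaining data.
    if two_sample_p(a[:50], b[:50]) <= 0.05 or two_sample_p(a, b) <= 0.05:
        significant += 1

print(f"False positive rate with one peek: {significant / n_experiments:.3f}")
```

With a single peek, the rate in this setup typically lands around 8% rather than 5%, and it keeps climbing with more peeks: this is the unaccounted conditional term at work.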
Some final thoughts
I hope that readers who had not yet heard an explanation of p-values that made sense to them found my attempt useful. I also hope that researchers who often find themselves in a position where they explain p-values will see my explanations as a useful example. I would be happy to get feedback on my explanation, specifically about whether I achieved these goals: I occasionally give talks about open science and related methods, and I aim to streamline my explanation such that it is truly understandable to a wide audience.
It is not the case that people who engage in p-hacking practices are incompetent or dishonest. At the same time, it is also not the case that people who engage in p-hacking practices would stop doing so if they learned to rattle off the formal definition of a p-value in their sleep. But, clearly, fixing the replication crisis involves improving the general knowledge of statistics (including how to correctly use p-values for inference). There seems to be a general consensus about this, but judging by many Twitter conversations that I’ve followed, there is no consensus about how much statistics, say, a researcher in psychological science needs to learn. After all, time is limited, and there is much other general as well as topic-specific knowledge that needs to be acquired in order to do good research.
If I’m asked how much statistics psychology researchers should know, my answer is, as always: “As much as possible.” But a widespread understanding of statistics is not going to be achieved by wagging a finger at researchers who get confused when confronted with the difference between the probability of the hypothesis and the probability of the data conditional on the hypothesis. Instead, providing intuitive explanations should not only improve understanding, but also show that, on some level (one that, arguably, is both necessary and sufficient for researchers to do sound research), statistics is not an impenetrable swamp of maths.

Edit (18/2/2019): If you prefer a different format of the above information, here is a video of me explaining the same things in German.

^{1} There are some intuitive explanations that I’ve found very helpful, like this one by Dorothy Bishop, or in general anything by Daniel Lakens (many of his blog entries, his Coursera course). If he holds any workshops anywhere near you, I strongly recommend going. In 2015, I woke up at 5am to get from Padova to Daniel's workshop in Rovereto, and at the end of the day, I was stuck in Rovereto because my credit card had been blocked and I couldn't buy a ticket back to Padova. It was totally worth it!
^{2} Optional stopping does not involve willingly excluding relevant data, while my video recording example does. Unless disclosed, however, optional stopping does involve the withholding of information that is critical in order to correctly interpret the p-value. Therefore, I consider the two cases to be equivalent.