Wednesday, February 13, 2019

P-values 101: An attempt at an intuitive but mathematically correct explanation

P-values are those things that we want to be smaller than 0.05, as I knew them during my undergraduate degree, and (I must admit) throughout my PhD. Even if you’re a non-scientist or work in a field that does not use p-values, you’ve probably heard of terms like p-hacking (e.g., from this video by John Oliver). P-values and, more specifically, p-hacking, are getting the blame for a lot of the things that go wrong in psychological science, and probably other fields as well.

So, what exactly are p-values, what is p-hacking, and what does all of that have to do with the replication crisis? Here, a lot of stats-savvy people shake their heads and say: “Well, it’s complicated.” Or they start explaining the formal definition of the p-value, which, in my experience, to someone who doesn’t already know a lot about statistics, sounds like blowing smoke and nitpicking on wording. There are rules, which, if one doesn’t understand how a p-value works, just sound like religious rituals, such as: “If you calculate a p-value halfway through data collection, the p-value that you calculate at the end of your study will be invalid.”

Of course, the mathematics behind p-values is on the complicated side: I had the luxury of being able to take the time to do a graduate certificate in statistics, and learn more about statistics than most of my colleagues will have the time for. Still, I won’t be able to explain how to calculate the p-value without revising and preparing. However, the logic behind it is relatively simple, and I often wonder why people don’t explain p-values (and p-hacking) in a much less complicated way1. So here goes my attempt at an intuitive, maths-free, but at the same time mathematically correct explanation.

What are p-values?
A p-value is the conditional probability of obtaining the data or data more extreme, given the null hypothesis. This is a paraphrased formal text-book definition. So, what does this mean?

Conditional means that we’re assuming a universe where the null hypothesis is always true. For example, let’s say I pick 100 people and randomly divide them into two groups. Group 1 gets a placebo, and Group 2 also gets a placebo. Here, we’ve defined the null hypothesis to be true.

Then I conduct my study; I might ask the participants whether they feel better after having taken this pill and look at differences in well-being between Group 1 and Group 2. In the long run, I don’t expect any differences, but in a given sample, there will be at least some numerical differences due to random variation. Then – and this point is key for the philosophy behind the p-value – I repeat this procedure, 10 times, 100 times, 1,000,000 times (this is why the p-value is called a frequentist statistic). And, according to the definition of the p-value, I will find that, as the number of repetitions increases, the percentage of experiments where I get a p-value smaller than or equal to 0.05 will get closer and closer to 5%. Again, this works in our modeled universe where the null hypothesis is true.
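This long-run behaviour is easy to see in a simulation. The sketch below is not from the post itself: the group size, the assumption of normally distributed well-being scores, and the use of a t-test are all illustrative choices of mine, but any reasonable test of a true null behaves the same way.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_experiments = 10_000  # number of simulated placebo-vs-placebo studies
n_per_group = 50        # participants per group (an arbitrary choice)

n_significant = 0
for _ in range(n_experiments):
    # Both groups get a placebo, so the null hypothesis is true by design
    group1 = rng.normal(size=n_per_group)
    group2 = rng.normal(size=n_per_group)
    _, p = stats.ttest_ind(group1, group2)
    if p <= 0.05:
        n_significant += 1

print(n_significant / n_experiments)  # converges to 0.05 as n_experiments grows
```

Every one of those significant results is, by construction, a false positive: the 5% is a property of the procedure, not evidence about any particular experiment.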

The second important aspect of the definition of the p-value is that it is the probability of the data, not of the null hypothesis. Why this is important to bear in mind can be, again, explained with the example above. Let’s say we conduct our experiment with the two identical placebo treatments, and get a p-value of 0.037. What’s the probability that the null hypothesis is false? 0%: We know that it’s true, because this is how we designed the experiment. What if we get a p-value of 0.0000001? The probability of the null hypothesis being false is still 0%.

If we’re interested in the probability of the hypothesis rather than the probability of the data, we’ll need to use Bayes’ Theorem (which I won’t explain here, as this would be a whole different blog post). The reason why the p-value is not interpretable as a probability about the hypothesis (null or otherwise) is that we need to consider how likely the null hypothesis is to be true. In the example above, if we do the experiment 1,000,000 times, the null hypothesis will always be true, by design, therefore we can be confident that every single significant p-value that we get is a false positive.

We can take a different example, where we know that the null hypothesis is always false. For example, we can take children between 0 and 10 years old, and correlate their height with their age. If we collect data repeatedly and accumulate a huge number of samples to calculate the p-value associated with the correlation coefficient, we will occasionally get p > 0.05 (the proportion of times that this will happen depends both on the sample size and the true effect size, and equals 1 minus the statistical power). So, if we do the experiment and find that the correlation is not significantly different from zero, what is the probability that the null hypothesis is false? It’s 100%, because everyone knows that children’s size increases with age.
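This, too, can be simulated. In the sketch below, the growth curve, the noise level, and the deliberately small sample of 10 children per study are illustrative assumptions of mine; the point is only that a real effect still produces some non-significant results, at a rate of 1 minus the statistical power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_experiments = 10_000
n_children = 10  # a deliberately small sample, so power is well below 1

n_misses = 0
for _ in range(n_experiments):
    age = rng.uniform(0, 10, size=n_children)                    # age in years
    height = 50 + 6 * age + rng.normal(0, 15, size=n_children)   # noisy growth
    _, p = stats.pearsonr(age, height)
    if p > 0.05:  # a miss: a real effect, but a non-significant result
        n_misses += 1

print(n_misses / n_experiments)  # proportion of misses = 1 - statistical power
```

Here the null hypothesis is false in every single simulated study, so each non-significant result is a false negative, no matter how large the p-value.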

In these two cases we know that, in reality, the null hypothesis is true and false, respectively. In an actual experiment, of course, we don’t know if it’s true or false – that’s why we’re doing the experiment in the first place. It is relevant, however, how many hypotheses we expect to be true in the long run.

If we study mainly crazy ideas that we had in the shower, chances are, the null hypothesis will often be true. This means that the posterior probability of the null hypothesis being false, even after obtaining a p-value smaller than 0.05, will still be relatively small.

If we carefully build on previous empirical work, design a methodologically sound study, and derive our predictions from well-founded theories, we will be more likely to study effects where the null hypothesis is often false. Therefore, the posterior probability that the null hypothesis is false will be greater than for our crazy shower idea, even if we get an identical p-value.
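The effect of prior plausibility can be made concrete with Bayes’ theorem applied to the base rate of true hypotheses. The sketch below is my own illustration, not part of the original argument, and the numbers (a power of 0.8, an alpha of 0.05, and the two base rates) are arbitrary assumptions chosen for the example.

```python
def prob_effect_given_significant(prior, power=0.8, alpha=0.05):
    """Posterior probability that the effect is real, given p <= alpha.

    Bayes' theorem: of all studies, a fraction `prior` have a real effect,
    and a fraction `power` of those come out significant; of the remaining
    null studies, a fraction `alpha` come out significant by chance.
    """
    return (power * prior) / (power * prior + alpha * (1 - prior))

# Crazy shower idea: say only 1 in 20 such hypotheses is true
print(prob_effect_given_significant(prior=0.05))  # ≈ 0.46

# Well-founded, theory-driven prediction: say 1 in 2 is true
print(prob_effect_given_significant(prior=0.50))  # ≈ 0.94
```

The identical significant p-value thus means something quite different in the two scenarios: for the shower idea, a significant result is close to a coin flip, while for the theory-driven study it is strong evidence.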

So what about p-hacking?
That was the crash course on p-values, so now we turn to p-hacking. P-hacking practices are often presented as lists in talks about open science, and include:
-       Optional stopping: Collecting N participants, calculating a p-value, and deciding whether or not to continue with data collection depending on whether this first peek yields a significant p-value or not,
-       Cherry-picking of outcome variables: Collecting many potential outcome variables in a given experiment, and deciding on or changing the outcome-of-interest after having looked at the results,
-       Determining or changing data cleaning procedures conditional on whether the p-value is significant or not,
-       Hypothesising after results are known (HARKing): First collecting the data, and then writing a story to fit the results and framing it as a confirmatory study.

The reason why each of these is wrong is not intuitive when we think about data. However, it becomes clearer when we think of a different example: Guessing the outcome of a coin toss. We can use this different example, because both the correctness of a coin toss guess and the data that we collect in an experiment are random variables: they follow the same principles of probability.

Imagine that I want to convince you that I have clairvoyant powers, and that I can guess the outcome of a coin toss. You will certainly be impressed, and wonder whether there might be something to it, if I toss the coin 10 times, and correctly guess the outcome every single time. You will be less impressed, however, if I toss the coin until I guess ten outcomes in a row. Of course, if you’re not present, I can instead make a video recording, cut out all of my unsuccessful attempts, and leave in only the ten correct guesses. From a probabilistic perspective, this would be the same as optional stopping.2

Now to cherry-picking of outcome variables: In order to convince you of my clairvoyant skills, I declare that I will toss the coin 10 times and guess the outcome correctly every single time. I toss the coin 10 times, and guess correctly 7 times out of 10. I can backpedal and argue that I still guessed correctly more often than incorrectly, and therefore my claim about my supernatural ability holds true. But you will not be convinced at all.

The removal of outliers, in the coin-toss metaphor, would be equivalent to discarding every toss where I guess incorrectly. HARKing would be equivalent to tossing the coin multiple times, then deciding on the story to tell later (“My guessing accuracy was significantly above chance”/ “I broke the world record in throwing the most coin tosses within 60 seconds” / “I broke the world record in throwing the coin higher than anyone else” / “I’m the only person to have ever succeeded in accidentally hitting my eye while tossing a coin!”). 

So, if you're unsure whether a given practice can be considered p-hacking or not, try to think of an equivalent coin example, where it will be more intuitive to decide whether the reasoning behind the analysis or data processing choice is logically sound or not. 

More formally, when the final analysis choice is conditional on the results, it messes with the frequentist properties of the statistic (i.e., the p-value is designed to have certain properties which hold true if you repeat a procedure an infinite number of times, which no longer hold true if you add an unaccounted conditional term to the procedure). Such an additional conditional term could be based on the p-value (we will continue with data collection under the condition that p > 0.05), or it could be based on descriptive statistics (we will calculate a p-value between two participant groups out of ten under the condition that the graph shows the biggest difference between these particular groups).
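The inflation caused by such an unaccounted conditional term can be demonstrated with a simulation of optional stopping. As before, the sketch is my own illustration: the group sizes, the number of peeks, and the batch size added after each non-significant peek are all arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

n_experiments = 2_000
false_positives = 0

for _ in range(n_experiments):
    # The null hypothesis is true by design: both groups get a placebo
    g1 = list(rng.normal(size=20))
    g2 = list(rng.normal(size=20))
    for _ in range(5):  # up to five peeks at the accumulating data
        _, p = stats.ttest_ind(g1, g2)
        if p <= 0.05:
            false_positives += 1
            break  # stop collecting as soon as the result is significant
        g1 += list(rng.normal(size=10))  # otherwise add 10 more participants
        g2 += list(rng.normal(size=10))  # per group and test again

print(false_positives / n_experiments)  # well above the nominal 5%
```

A single test at a fixed sample size would give about 5% false positives; giving yourself five chances to stop on a significant result pushes the rate far above that, even though each individual test looks perfectly ordinary.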

Some final thoughts
I hope that readers who have not heard explanations of p-values that made sense to them found my attempt useful. I also hope that researchers who often find themselves in a position where they explain p-values will see my explanations as a useful example. I would be happy to get feedback on my explanation, specifically about whether I achieved these goals: I occasionally do talks about open science and relevant methods, and I aim to streamline my explanation such that it is truly understandable to a wide audience.

It is not the case that people who engage in p-hacking practices are incompetent or dishonest. At the same time, it is also not the case that people who engage in p-hacking practices would not do so if they learn to rattle off the formal definition of a p-value in their sleep. But, clearly, fixing the replication crisis involves improving the general knowledge of statistics (including how to correctly use p-values for inference). There seems to be a general consensus about this, but judging by many Twitter conversations that I’ve followed, there is no consensus about how much statistics, say, a researcher in psychological science needs to learn. After all, time is limited, and there is a lot of other general as well as topic-specific knowledge that needs to be acquired in order to do good research. If I’m asked how much statistics psychology researchers should know, my answer is, as always: “As much as possible”. But a widespread understanding of statistics is not going to be achieved by wagging a finger at researchers who get confused when confronted with the difference between the probability of the hypothesis versus the probability of the data conditional on the hypothesis. Instead, providing intuitive explanations should not only improve understanding, but also show that, on some level (that, arguably, is both necessary and sufficient for researchers to do sound research), statistics is not an impenetrable swamp of maths.

Edit (18/2/2019): If you prefer a different format of the above information, here is a video of me explaining the same things in German.

1 There are some intuitive explanations that I’ve found very helpful, like this one by Dorothy Bishop, or in general anything by Daniel Lakens (many of his blog entries, his coursera course). If he holds any workshops anywhere near you, I strongly recommend going. In 2015, I woke up at 5am to get from Padova to Daniel's workshop in Rovereto, and at the end of the day, I was stuck in Rovereto because my credit card had been blocked and I couldn't buy a ticket back to Padova. It was totally worth it! 
2 Optional stopping does not involve willingly excluding relevant data, while my video recording example does. Unless disclosed, however, optional stopping does involve the withholding of information that is critical in order to correctly interpret the p-value. Therefore, I consider the two cases to be equivalent.

Tuesday, January 8, 2019

Scientific New Year’s Resolutions

Last year, I wrote an awful lot of blog posts about registered reports. My first New Year’s Resolution is to scale down my registered reports activism, which I start off by writing a blog post about my new year’s resolutions, where I’ll try hard not to mention registered reports more than 5 times.

Scientifically, last year was very successful for me, with the major event being a grant from the DFG, which will give me a full-time position until mid-2021 and the opportunity to work on my own project. This gives me my second New Year’s Resolution, which is to focus on the project that I’m supposed to be working on, and not to get distracted by starting any (or too many) new side projects.

Having a full-time job means I’ll have to somewhat reduce the amount of time I’ll spend on Open Science, compared to the past few years. However, Open Science is still very important to me: I see it as an integral part of the scientific process, and consequently as one of the things I should do as part of my job as a researcher. As a Freies Wissen Fellow, an ambassador of the Center for Open Science, and a member of the LMU Open Science Center, I have additional motivation and support in improving the openness of my own research and helping others to do the same. My third New Year’s Resolution is to start prioritising various Open Science projects.

In order to prioritise, I need to select some areas which I think are the most effective in increasing the openness of research. In my experience, talks and workshops are particularly effective (in fact, that’s how I became interested in Open Science). Last year, I gave talks as part of two Open Science Workshops (in Neuchâtel and Linz), and at our department, I organised an introduction to R (to encourage reproducible data analyses). The attendance of these workshops was quite good, and the attendees seemed very interested and motivated, which further supports my hypothesis about the effectiveness of such events. I hope to hold more workshops this year: so far, I have one invitation for a workshop in Göttingen.

I still have two mentions of Registered Reports left for this blogpost (oops, just one now): as I think that they provide one of the most efficient ways to mitigate the biases that make a lot of psychological science non-replicable, I will continue trying to encourage journals to accept them as an additional publication format. I have explained why I think that this is important here, here, here, and here [in German], and how this can be achieved here and here. Please note that we’re still accepting signatures for the letters to editors and for an open letter, for more information see here.

As a somewhat less effective thing that I did in the previous years to support open science, I had signed all of my peer reviews. I am now wondering if it’s not doing more harm than good. Signing my reviews is a good way to force myself to stay constructive, but sometimes there really are mistakes in a paper that, objectively speaking, mean that the conclusions are wrong. One cannot blame authors for being upset at the reviewers for pointing this out – after all, we’ve all experienced this ourselves, and so many things depend on the number of publications. And, as much as I try to convince myself to see the review process as a constructive discussion between scientists, perhaps there is a good reason that peer review often happens anonymously, especially for an early career researcher.

As my fourth New Year’s Resolution, I will strongly reduce my use of Twitter compared to previous years. I’ve learned a lot through Twitter, especially about statistics and open science. Lately, however, I started feeling like a lot of the discussions are repeating; the Open Science community has grown since I joined Twitter, leading to interpersonal conflicts within this community that have little to do with the common goal of improving reproducibility and replicability of research. A few months ago, I created a lurking account, where I follow only a few reading researchers: this way, I can compulsively check Twitter without scrolling down endlessly, and I can keep up-to-date with any developments that are discussed in my actual field of research. So far, I really don’t have the feeling that I’m missing out, though I still check my old Twitter account occasionally, especially when I get notifications.

My fifth New Year’s Resolution is to continue learning, especially about programming and statistics. The more I learn, the more I realise how little I know. However, looking back, I also realise that I can now do things that I couldn’t even have dreamed of a couple of years ago, and it’s a nice feeling (and it substantially improves the quality of my work).

My final New Year’s Resolution goes both for my working life and for my personal life: Be nice: judge less, focus on the good things, not on the bad, take actions instead of complaining about things, be constructive in criticism. 

So here's to a good 2019! 

Oh, and I have one left: Please support Registered Reports!

Friday, December 28, 2018

Realising Registered Reports Part 3: Dispelling some myths and addressing some FAQs

While collecting more signatures for our Registered Reports campaign, I spent some time arguing with people on the internet – something that I don’t normally do. Internet arguments, as well as most arguments in real life, tend to be about people with opposing views shouting at each other, then going away without changing their opinions in the slightest. I’m no exception: nothing of what people on the internet told me so far convinced me that registered reports are a bad idea. But as their arguments kept repeating, I decided to write a blog post with my counter-arguments. I hope that laying out my arguments and counter-arguments in a systematic manner will be more effective in showing what exactly we want to achieve, and why we think that registered reports are the best way to reach these goals. And, if nothing else, I can refer people on the internet to this blog post rather than reiterating my points in the future.

What do we actually want?
For a registered report, the authors write the introduction and methods section, as well as a detailed analysis plan, and submit the report to a journal, where it undergoes peer review. Once the reviewers are happy with the proposed methods (i.e., they think the methods give the authors the biggest chance to obtain interpretable results), the authors get conditional acceptance, and can go ahead with data collection. The paper will be published regardless of the results. Registered Reports is not synonymous with Preregistration: For the latter, you also write a report before collecting the data, but rather than submitting it to a journal for peer review, you upload a time-stamped, non-modifiable version of the report, and include a link to it in the final report.
The registered reports format is not suited for all studies, and at this point, it is worth stressing that we don’t want to ban the publishing of any study which is not pre-registered. This is an important point, because it is my impression that at least 90% of the criticisms of registered reports assume that we want them to be the only publication format. So, once again: We merely want journals to offer the possibility to submit registered reports, alongside the traditional formats that they already offer. Exploratory studies or well-powered multi-experiment studies offer a lot of insights, and banning them and publishing only registered reports would be a very stupid idea. Among the supporters of registered reports that I’ve met, I don’t think a single one would disagree with me on this account.
Having said this, I think there are studies for which the majority of people (scientists and the general public alike) would really prefer for registered reports and pre-registration to be the norm. I guess that very few people would be happy to take some medication, when the only study supporting its effectiveness is a non-registered trial conducted by the pharmaceutical company that sells this drug, and shows that it totally works and has absolutely no side effects. In my research area, some researchers (or pseudo-researchers) occasionally produce a “cure” for dyslexia, for example in the form of a video game that supposedly trains some cognitive skill that is supposedly deficient in dyslexia. This miracle cure then gets sold for a lot of money, thus taking away both time and money from children with dyslexia and their parents. For studies showing the effectiveness of such “cures”, I think it would be justifiable, from the side of the parents, as well as the tax payers who fund the research, to demand that such studies are pre-registered.
To reiterate: Registered reports should absolutely not be the only publication format. But when it comes to making important decisions based on research, we should be as sure as possible that the research is replicable. In my ideal world, one should conduct a registered study before marketing a product or basing a policy on a result.

Aren’t there better ways to achieve a sound and reproducible psychological science?
Introducing the possibility of publishing registered reports also by no means suggests that other ways of addressing the replication crisis are unnecessary. Of course, it’s important that researchers understand methodological and statistical issues that can lead to a result being non-replicable or irreproducible. Open data and analysis scripts are also very important. If researchers treated a single experiment as a brick in the wall rather than a source of the absolute truth, if there was no selective publication of positive results, if there was transparency and data sharing, and researchers would conduct incremental studies that would be regularly synthesised in the form of meta-analyses, registered reports might indeed be unnecessary. But until we achieve such a utopia, registered reports are arguably the fastest way to work towards a replicable and reproducible science. Perhaps I’m wrong here: I’m more than happy to be convinced that I am. If there is a more efficient way to reduce publication bias, p-hacking, and HARKing, I will reduce the amount of time I spend pushing for registered reports, and start supporting this more efficient method instead.

Don’t registered reports take away the power from the authors?
As an author, you submit your study before you even collect the data. Some people might perceive it as unpleasant to get their study torn to pieces by reviewers before it’s even finished. Others worry that reviewers and editors get more power to stop studies that they don’t want to see published. The former concern, I guess, is a matter of opinion. As far as I’m concerned, the more feedback I get before I collect data, the more likely it will be that any flaws in the study will be picked up. Yes, I don’t love being told that I designed a stupid study, but I hate even more when I conduct an experiment only to be told (or realise myself) that it’s totally worthless because I overlooked some methodological issue. Other people may be more confident in their ability to design a good study, which is fine: They don’t have to submit their studies as registered reports if they don’t want to.
As for the second point: Imagine that you submit a registered report, and the editor or one of the reviewers says: “This study should not be conducted unless the authors use my/my friend’s measurement of X.” In this way, the argument goes, the editor has the power to influence not only what kind of studies get published, but even what kind of studies are being conducted. Except, if many journals publish registered reports, the authors can simply submit their registered report to a different journal, if they think that the reviewer’s or editor’s request is driven by politics rather than scientific considerations. This is why I’m trying to encourage as many journals as possible to start offering registered reports.
Besides, if we compare this to the current situation, I would argue that the power that editors and reviewers have would either diminish or stay the same. It doesn’t matter how many studies get conducted, if (as in the current system) many of them get rejected on the basis of “they don’t fit my/my friend’s pet theory”. Let’s say I want to test somebody’s pet theory, or replicate some important result. In my experience, original authors genuinely believe in their work: chances are, they will be supportive during the planning stage of the experiment. Things might look different if the results are already in, and the theory is not supported: then, they often try to find any reason to convince the editor that the study is flawed and that the authors are incompetent.
As an example, let’s imagine the worst case scenario: You want to replicate Professor X’s ground-breaking study, but you don’t know that Professor X actually fabricated the data. It’s in Professor X’s interest to prevent any replication of this original study, because it would likely show that it’s not replicable. As the replicator, you can submit a registered report to journal after journal after journal, and it is likely that Professor X will be asked to review it. Sure, it’s annoying, but at some stage you’re likely to either find an editor who sees through Professor X’s strategy, or Professor X will be too busy to review. Either way, I don’t see how this is different from the usual game of get-your-study-published-in-the-highest-possible-IF-journal that we all know and love in the traditional publication system.
And, if you really can’t get your registered report accepted and you think it’s for political reasons, you can still conduct the study. I will bet that the final version of the study, with the data and conclusion, will be even more difficult to publish than the registered report. But at least you’ll be able to publish it as a preprint, which would be an extremely valuable addition to the literature.

I’m still not convinced.
That’s fine – we can agree to disagree. I would be very happy if you could add, in the comments section below, any further arguments against registered reports that cannot be countered by the following points:
(1) Other publication formats should continue to be available, even if registered reports become the norm for some types of studies, and
(2) Definitely, we need to continue working on other ways to address the replication crisis, such as improving statistics courses for undergraduate and graduate psychology programs.

I think that registered reports are a good idea. What can I do to support registered reports?*
I’m very happy to hear that! See here (about writing letters to editors) and here (being a signatory on an open letter and letters to editors). 

* I wish this was a FAQ...

Edit (1.8.2019): In response to a comment, I've added a sentence about preregistration, and how it differs from RRs.

Monday, December 3, 2018

Realising Registered Reports: Part 2

TL;DR: Please support Registered Reports by becoming a signatory on letters to editors by leaving your name here.

The current blog post is a continuation of a blogpost that I wrote half a year ago. Back then, frustrated at the lack of field-specific journals which accept Registered Reports, I’d decided to start collecting signatures from reading researchers and write to the editors of journals which publish reading-related research, to encourage them to offer Registered Reports as a publication format. My exact procedure is described in the blog post (linked above).

On the 18th of June, I contacted 44 journal editors. Seven of the editors wrote back to promise that they would discuss Registered Reports at the next meeting of the editorial board. One had already started piloting this new format (Discourse Processes), and one declined for understandable organisational reasons. To my knowledge, two reading-related journals have already decided to offer Registered Reports in the future: the Journal of Research in Reading, and the Journal of the Scientific Studies of Reading. So, there is change, but it’s slow.

At the same time as I was collecting signatories and sending emails to editors about Registered Reports, a number of researchers decided to do the same. A summary of all journals that have been approached by one of us, and the journals’ responses, can be found here. A few months ago, we decided to join forces.

First, as a few of us were from the area of psycholinguistics, we decided to pool our signatories, and continue writing to editors in this field. The template on which the letters to the editors would be based can be found here.

Second, we decided to start approaching journals which are aimed at a broader audience and have a higher impact. Here, our pooled lists would already contain hundreds of researchers who would like to see more Registered Reports offered by academic journals. The first journal that we aim to contact is PNAS: As they recently announced that they will be offering a new publication format (Brief Reports), we would like to encourage them to consider also adding Registered Reports to their repertoire. The draft letter to the editor can be found here.

Third, we also decided to write an open letter, addressed to all editors and publishers. The ambition is to publish it in Nature News and Comments or a similar outlet. The draft letter can be found here.

I am writing this blog post because we’re still on the lookout for signatories. You can choose to support all three initiatives, or any combination of them, by taking two minutes to fill in this Google form. You can also email me directly – whichever is easier for you. Every signatory matters: from any field, any career stage, any country. It would also be helpful if you could forward this link to anyone who you think might support this initiative. I’m also happy to answer any questions, take in suggestions, or discuss concerns about this format.

Tuesday, August 21, 2018

“But has this study been pre-registered?” Can registered reports improve the credibility of science?

There is a lot of bullshit out there. Every day, we are faced with situations where we need to decide whether a given piece of information is trustworthy or not. This is a difficult task: A lot of information that we encounter requires a great deal of expertise in a specific field, and nobody is able to become an expert on all issues which we encounter on a day-to-day basis (just to name a few: politics, history, nutrition, medicine, psychology, educational sciences, physics, law, artificial intelligence).

In the current blogpost, I will focus on educational sciences. This is an area where it is very important – to everyone ranging from parents and teachers through to education researchers – to be able to distinguish bullshit from trustworthy information. Believing in bullshit can, in the best case, lead to a waste of time and money (parents investing into educational methods that don’t work; researchers building on studies which turn out to be non-replicable). In the worst case, children will undergo educational practices or interventions for developmental disorders which distract from more effective, evidence-based methods or may even be harmful in the long run.

Many people are interested in science and the scientific method, and most of them know that the first question to ask when you encounter something that sounds dodgy is: “But has this study been peer-reviewed?” We also know that peer-review is fallible: This can be shown simply by pointing to predatory journals, which will publish anything, under the appearance of a peer-reviewed paper, for a fee. While it is often (but not always) obvious to experts in a field that a given journal is predatory, this is a more difficult task for someone without scientific training. In this blogpost, I will mainly focus on a thought-experiment: What if, instead, we (researchers, as well as the general public) asked: “But has this study been pre-registered?”

I will discuss the advantages and potential pitfalls of this shift in mind-set. But first, because I’m still working on convincing some of my colleagues that pre-registration is important for educational sciences and developmental psychology, I describe two examples that demonstrate how important it is to be able to distinguish trustworthy from untrustworthy research. These are real-life examples that I encountered in the last few weeks, but I changed some of the names: while the concepts described raise a lot of red flags associated with pseudoscience, I don’t have the time or resources to conduct a thorough investigation to show that they are not trustworthy, and I don’t want to get into any fights (or lawsuits) about these particular issues.

Example 1: Assessment and treatment for healthy kids
The first example comes from someone I know who asked me for advice. They had found a centre which assesses children on a range of tests, to see if they have any hidden developmental problems or talents. After a thorough assessment session (and, as I found out through a quick Google search, a $300 fee), the child received a report of about 20 pages. As the centre specialises in children who have both a problem and a talent, it is not surprising that the child was diagnosed with both a problem and a talent (although, interestingly, a series of standardised IQ tests showed no problems). The non-standardised assessments tested for disorders that, during 7 years of study and 4 years of working as a post-doc in cognitive psychology, I had never heard of before. A quick Google search revealed that there was some peer-reviewed literature on these disorders. But the research on a given disorder always came from one and the same person or “research” group, mostly with the affiliation of an institute that made money by selling treatments for this disorder.

The problem with the above assessment is: Most skills are normally distributed, meaning that, on a given test, some children will be very good, and some children will be very bad. If you take a single child and give them a gazillion tests, you will always find a test on which they perform particularly badly and one on which they perform particularly well. One child might be particularly fast at peeling eggs, for example. A publication could describe a study where 200 children were asked to peel as many eggs as possible within 3 minutes, and there was a small number of children who were shockingly bad at peeling eggs (“Egg Peeling Disorder”, or EPD for short). This does not mean that this ability will have any influence whatsoever on their academic or social development. But, in addition, we can collect data on a large number of variables that are indicative of children’s abilities: five reading tests, five mathematics tests, tests of fine motor skills, gross motor skills, vocabulary, syntactic skills, physical strength, the frequency of social interactions – the list goes on and on. Again, by the laws of probability, as we increase the number of variables, we increase the probability that at least one of them will be correlated with the ability to peel an egg, just by chance.
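The “laws of probability” argument above can be checked with a quick simulation (this is my own hypothetical illustration, not an analysis from any real study; the sample size, number of tests, and significance threshold are arbitrary choices): give each simulated child a purely random “egg peeling” score plus several equally random outcome variables, and count how often at least one correlation comes out “significant” at the 5% level.

```python
import math
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def chance_of_false_positive(n_children=200, n_tests=20, n_sims=500, seed=1):
    """Fraction of simulated studies in which a purely random 'egg peeling'
    score correlates 'significantly' with at least one of n_tests equally
    random outcome variables. |r| > 1.96/sqrt(n) is a large-sample
    approximation to a two-tailed test at alpha = .05."""
    rng = random.Random(seed)
    threshold = 1.96 / math.sqrt(n_children)
    hits = 0
    for _ in range(n_sims):
        egg = [rng.gauss(0, 1) for _ in range(n_children)]
        if any(
            abs(pearson_r(egg, [rng.gauss(0, 1) for _ in range(n_children)])) > threshold
            for _ in range(n_tests)
        ):
            hits += 1
    return hits / n_sims

for k in (1, 5, 20):
    print(k, "tests:", chance_of_false_positive(n_tests=k))
```

With a single test, the false-positive rate stays near the nominal 5%; with 20 tests of pure noise, a spurious “significant” correlation with egg-peeling ability turns up in well over half of the simulated studies.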

Would it help to ask: “Has this study been pre-registered?” Above, I described a way in which any stupid idea can be turned into a paper showing that a given skill can be measured and correlates with real-life outcomes. By maximising the number of variables, the laws of probability give us a very good chance to find a significant result. In a pre-registered report, the researchers would have to declare, before they collect or look at the data, which tests they plan to use, and where they expect to find significant correlations. This gives less wiggle room for significance-fishing, or combing the data for significant results which likely just reflect random noise.

Example 2: Clinical implications of ghost studies
The second example is from the perspective of a researcher. A recent paper I came across reviewed studies on a certain clinical population performing tasks tapping a cognitive skill – let’s call it “stylistical turning”. The review concluded that clinical groups perform, on average, worse than control groups on stylistical turning tasks, and suggested stylistical turning training to improve outcomes in this clinical population. Even disregarding the correlation-causation confusion, the conclusion of this paper is problematic, because in this particular case, I happen to know of two well-designed unpublished studies which did not find that the clinical group performed worse than a control group – in fact, both found that the stylistical turning task used by the original study doesn’t even work! Yet, as far as I know, neither has been published (even though I’d encouraged the researchers behind these studies to submit). So, the presence of unpublished “ghost” studies, which cannot be found through a literature search, has profound consequences for a question of clinical importance.

Would it help in this case to demand that studies are pre-registered? Yes, because pre-registration involves creating a record, prior to the collection of data, that this study will be conducted. In the case of our ghost studies, someone conducting a literature review would at least be able to find the registration plan. Even if the data did not end up being published, the person conducting the literature review could (and should) contact the authors and ask what became of these studies.

Is pre-registration really the holy grail?
As with most complex issues, it would be overly simplistic to conclude that pre-registration would fix everything. Pre-registration should help combat questionable research practices (fishing for significance in a large sea of data, as described in Example 1), and publication bias (selective publication of positive results, leading to a distorted literature, as described in Example 2). These are issues that standard peer-review cannot address: When I review the manuscript of a non-pre-registered paper, I cannot possibly know if the authors collected data on 50 other variables and report only the ones that came out as significant. Similarly, I cannot possibly know if, somewhere else in the world, a dozen other researchers conducted the exact same experiment and did not find a significant result.
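To put a number on the “50 other variables” worry: if each hidden analysis on its own has a 5% false-positive rate, and we make the simplifying assumption that the analyses are independent, the chance that at least one of them comes out significant is 1 − 0.95^k. A two-line sketch:

```python
# Family-wise chance of at least one spurious "significant" result,
# assuming k independent tests, each with a 5% false-positive rate.
for k in (1, 5, 20, 50):
    print(f"{k:>2} hidden analyses -> {1 - 0.95 ** k:.0%} chance of a false positive")
```

With 50 hidden variables, that chance is over 90% – so a single reported “significant” finding, with the other 49 analyses left in the drawer, is close to worthless as evidence.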

What would happen if we – researchers and the general public alike – began to demand that studies must be pre-registered if they are to be used to inform practice? Luckily, medical research is ahead of educational sciences on this front: Indeed, pre-registration seems to decrease the number of significant findings, which probably reflects a more realistic view of the world.

So, what could possibly go wrong with pre-registration? First, if we start demanding pre-registered reports now, we can pretty much throw away everything we’ve learned about educational sciences so far. There is a lot of bullshit out there for sure, but there are also effective methods, which show consistent benefits across many studies. But, as pre-registration has not really kicked off yet in educational sciences, none of these studies have been pre-registered. This raises important questions: Should we repeat all existing studies in a pre-registered format, even when there is a general consensus among researchers that a given method is effective? On the one hand, this would be very time- and resource-consuming. On the other hand, some things that we think we know turn out to be false. And besides, even when a method is, in reality, effective, the selective publication of positive results makes it look like the method is much more effective than it really is. In addition to knowing what works and what doesn’t, we also need to make decisions about which method works better: this requires a good understanding of the extent to which a method helps.

It is also clear that we need to change the research structure before we unconditionally demand pre-registered reports: at this stage, it would be unfair to judge a paper as worthless if it has not been pre-registered, because pre-registration is just not done in educational sciences (yet). If more journals offered the registered report format, and researchers were incentivised to publish pre-registered studies rather than to mass-produce significant results, this would set the conditions for the general public to start demanding that practice is informed by pre-registered studies only.

As a second issue, there is one thing we have learned from peer-review: When there are strong incentives, people learn to play the system. At this stage, peer-review is the major determining factor of which studies are considered trustworthy by fellow researchers, policy-makers and stakeholders. This has resulted in a market for predatory journals, which publish anything under the appearance of a peer-reviewed paper, if you give them money. Would it be possible to play the system of pre-registered reports?

It is worth noting that there are two ways to do a pre-registration. One way is for a researcher to write the pre-registration report, upload it by themselves as a time-stamped, non-modifiable document, and then to go ahead with the data collection. In the final paper, they add a link to the pre-registration report. Peer-review occurs at the stage when the data has already been collected. Both the peer-reviewers and the readers can download the pre-registration report and compare the authors’ plans with what they actually did. It is possible to cheat with this format: Run the study, look at the result, and write the pre-registered report retrospectively, based on the results that have already come out as significant. The final paper can then be submitted with a link to the fake pre-registered report, and with a bit of luck, the study would appear as pre-registered in a peer-reviewed journal. This would be straight-out cheating as opposed to being in a moral grey-zone, which is the current status of questionable research practices. But it could be a real concern when there are strong incentives involved.

The second way to do a pre-registration is the so-called registered report (RR) format. Here, journals conduct peer-review of the pre-registered report rather than the final paper. This means that the paper is evaluated based on its methodological soundness and the strength of the proposed analyses. After the reviewers approve of the pre-registered plan, the authors get the thumbs-up to start data collection. Cheating by submitting a plan for a study that has already been conducted becomes difficult in this format, because reviewers are likely to propose some changes to the methodology: if the data has already been collected, the cheating authors would be put in a checkmate position because they would need to collect new data after making the methodological changes.

For both formats, there are more subtle ways to maximise the chances of supporting your theory (let’s say, if you have a financial interest in the results coming out in a certain way). A bad pre-registration report could be written in a way that is vague: As we saw in Example 1, this would still give the authors room to wiggle with their analyses until they find a significant result (e.g., “We will test mathematical skills”, but neglecting to mention that 5 different tests will be used, and all possible permutations of these tests will be used to calculate an average score until one of them turns out to be significant). This would be less likely to happen with the RR format than with non-peer-reviewed pre-registration, because a good peer-reviewer should be able to pick up on this vagueness, and demand that the authors specify exactly which variables they will measure, how they will measure them, and how they will analyse them. But the writer of the registered report could hope for inattentive reviewers, or submit to many different journals until one finally accepts the sloppily-written report. To circumvent this problem, then, it is necessary to combine RRs with rigorous peer-review. From this perspective, the most important task of the reviewer is to make sure that the registered report is written in a clear and unambiguous manner, and that the resulting paper closely follows what the authors said they would do in the registered report.

So, should we start demanding that educational practice is based on pre-registered studies? In an ideal world: Yes. But for now, we need top-down changes inside the academic system, which would encourage researchers to conduct pre-registered studies.

Is it possible to cheat with pre-registered reports in such a way that we don’t end up solving the problems I outlined in this blogpost? Probably yes, although a combination of the RR format (where the pre-registered report rather than the final paper is submitted to a journal) and rigorous peer-review should minimise such issues.

What should we do in the meantime? My proposed course of action will be to focus on making it more common among education researchers to pre-register their studies. One way to achieve this is to encourage as many journals as possible to adopt the RR format. To have good peer-review for RRs, we also need to spread awareness among researchers about what to look out for when reviewing an RR. Some journals which publish RRs, such as Cortex, have very detailed guidelines for reviewers. In addition, perhaps workshops about how to review an RR could be useful.

Monday, June 18, 2018

Realising Registered Reports: Part 1

Two weeks ago, I set out on a quest to increase the number of journals which publish reading-related research and offer the publication format of Registered Reports (RRs). I wrote a tweet about my vague idea to start writing to journal editors, which was answered with a detailed description of 7 easy steps by Chris Chambers (reproduced in full below, because I can’t figure out how to embed tweets):

1)    Make a list of journals that you want to see offer Registered Reports
2)    Check Responses/ to see if each journal has already been publicly approached
3)    If it hasn’t, adapt template invitation letter here: Requests/
4)    Assemble colleagues (ECRs & faculty, as many as possible) & send a group letter to chief editor & senior editors. Feel free to include [Chris Chambers] as a signatory and include [him] and David Mellor (@EvoMellor) in CC.
5)    Login with your OSF account (if you don’t have one then create one:
6)    Once logged in via OSF, add the journal and status (e.g., “Under consideration [date]”) to Responses/. Update status as applicable. Any OSF user can edit.
7)    Repeat for every relevant journal & contact [Chris] or David Mellor if you have any questions.

So far, I have completed Steps 1-3, and I’m halfway through Step 4: I have a list of signatories, and will start emailing editors this afternoon. It could be helpful for some, and interesting for others, to read about my experiences following the above steps, so I decided to describe them in a series of blog posts. So, here is my experience so far:

My motivation
I submitted my first RR a few months ago. It was a plan to conduct a developmental study in Russian-speaking children, which would assess what kind of orthographic characteristics may pose problems during reading acquisition. There were two issues that I encountered while writing this report: (1) For practical reasons, it’s very difficult to recruit a large sample size to make it a well-powered study, and (2) it’s not particularly interesting for a general audience. This made it very difficult to make a good case for submitting it to any of the journals that currently offer the RR format. Still, I think it’s important that such studies are being conducted, and that there is the possibility to pre-register, in order to avoid publication bias and unintentional p-hacking (and to get peer review and conditional acceptance before you start data collection – this, I think, should be a very good ‘selfish’ reason for everyone to support RRs). So it would be good if some more specialist journals started accepting the RR format for smaller studies that may be only interesting to a narrow audience.

Making lists
The 7-step process involves making two lists: One list of journals where I’d like to see RRs, and another list of signatories who agree that promoting RRs is a good thing. I created a Google Doc spreadsheet, which anyone can modify to add journals to one sheet and their name to another sheet, here. A similar list has been created, at around the same time, by Tim Schoof, for the area of hearing science, here.

Getting signatories
The next question is how to recruit people to modify the list and add their names as signatories. I didn’t want to spam anyone, but at the same time I wanted to get as many signatures as possible, and of course, I was curious how many reading researchers would actively support RRs. I started off by posting on twitter, which already got me around 15 signatures. Then I wrote to my current department, my previous department, and made a list of reading researchers that came to my mind. Below is the email that I sent to them:

Dear fellow reading/dyslexia researchers,

Many of you have probably heard of a new format for journal articles: Registered Reports (RR). With RRs, you write a study and analysis plan, and submit it to a journal before you collect the data. You get feedback from the reviewers on the design of the study. If the reviewers approve of the methods, you get conditional acceptance. This means that the study will be published regardless of its outcome. Exploratory analyses can still be reported, but they will be explicitly distinguished from the confirmatory analyses relating to the original hypotheses.

The RR format is good for science, because it combats publication bias. It is also good for us as individual researchers, because it helps us to avoid situations where we invest time and resources into a study, only to find out in retrospect that we overlooked some design flaw and/or that non-significant findings are uninformative, rendering the study unpublishable. You can find more information and answers to some FAQs about RRs here:

There are some journals which offer the RR format, but not many of them are relevant to a specialised audience of reading researchers. Therefore, I'd like to contact the editors of some more specialised journals to suggest accepting the RR format. I started off by making a list of journals which publish reading-related research, which you can view and modify here:

To increase the probability of as many journals as possible deciding to offer RRs (alongside the traditional article formats) in the future, I would like to ask you three favours:

1) If there are any journals where you would like to see RRs, please add them to the list (linked above) before the 15th of June.
2) If you would like to be a signatory on the emails to the editors, please let me know or add your name to the list of signatories in the second sheet (same link as the list). I will then add your name to the emails that I will send to the editors. Here is a template of the email:
3) If you are part of any networks for which this could be relevant, please help to spread the information and my two requests above.
If you have any questions or concerns about RRs, I would be very happy to discuss.

Thank you very much in advance!
Kind regards,

The list of reading researchers that I could think of contained 62 names, including collaborators, friends, people I talked to at conferences and who seemed to be enthusiastic about open science, some big names, people who have voiced scepticism about open science; in short: a mixture of early-career and senior researchers, with varying degrees of support for open science practices. After I sent these emails, I gained about 15 more signatures.

After some consideration, I also wrote to the mailing list of the Scientific Studies of Reading. I hesitated because, again, I didn’t want to spam anyone. But in the end, I decided that there is value in disseminating the information to those who would like to do something to support RRs, even if it means annoying some other people. There were some other societies’ mailing lists that I would have liked to try, but where I couldn’t find a public mailing list and did not get a response from the contact person.

After having done all this, I have 45 signatures, excluding my own. In response to my emails and tweets, I also learned that two reading-related journals are already considering implementing RRs: The Journal of Research in Reading and Reading Psychology. 

Who supports RRs, anyway?
I would find it incredibly interesting to answer some questions using the signature “data” that I have, including: At what career stage are people likely to support RRs? Are there some countries and universities which support RRs more than others? I will describe the observed trends below. However, the data is rather anecdotal, so it should not be taken any more seriously than a horoscope.

Out of the 46 signatories, I classified 17 as early career researchers (PhD students or post-docs), and 29 as seniors (either based on my knowledge of their position or through a quick Google search). This is in contrast to the conventional wisdom that young people strive for change while older people cling to the existing system. However, there are alternative explanations: for example, it could be that ECRs are more shy about adding their name to such a public list.

The signatories gave affiliations from 11 different countries, namely UK (N = 10), US (N = 7), Canada (N = 6), Australia (N = 5), Belgium (N = 4), Germany (N = 4), the Netherlands (N = 4), Italy (N = 2), Norway (N = 2), Brazil (N = 1), and China (N = 1).

The 46 signatories came from 32 different affiliations. The most signatures came from Macquarie University, Australia (my Alma Mater, N = 5). The second place is shared between Dalhousie University, Canada, and Université Libre de Bruxelles, Belgium (N = 3). The shared third place goes to Radboud University, the Netherlands; Royal Holloway, University of London, UK; Scuola Internazionale Superiore di Studi Avanzati (SISSA), Italy; University of Oslo, Norway; University of York, UK, and University of Oxford, UK (N = 2).

All of these numbers are difficult to interpret, because I don’t know exactly how widely and to which networks my emails and tweets were distributed. However, I can see whether there are any clear trends in the response rate among the researchers I contacted directly, via my list of researchers I could think of. This list contained 62 names of 22 ECRs and 40 senior researchers. Of these, 5 ECRs and 6 senior signed, which indeed is a higher response rate amongst ECRs than senior researchers.

Before sending the group email, I’d tried to rate the researchers, based on my impression of them, on how likely they would be to sign in support of RRs. My original idea was to write only to those who I thought were likely to sign, but ultimately I decided to see if any of those whom I’d consider unlikely would positively surprise me. I rated the researchers on a scale of 1 (very unlikely) to 10 (I’m positive they will sign). If I didn’t know what to expect, I left this field blank. This way, I rated 26 out of the 62 researchers, with a mean rating of 6.1 (SD = 2.1, min = 2, max = 10). Splitting up by ECR and senior researchers, my average ratings were similar across the two groups (ECRs: mean = 5.9, SD = 1.5, min = 3, max = 7; seniors: mean = 6.2, SD = 2.4, min = 2, max = 10).

How accurate were my a priori guesses? Of the 26 people I’d rated, 9 had signed. Their average rating was 7.3 (SD = 1.1, min = 6, max = 9). The other 17 researchers had an average rating of 5.4 (SD = 2.2, min = 2, max = 10). So it seems I was pretty accurate in my guesses about who would be unlikely to sign (i.e., there were no ‘pleasant surprises’). Some of the people I considered to be highly likely to sign did not do so (‘unpleasant surprises’), though I like to think that there are reasons other than a lack of support for the cause (e.g., they are on sabbatical, forgot about my email, or they think there are better ways to promote open science).

What now?
Now I will start sending out the emails to the editors. I’m very curious how it will go, and I’m looking forward to sharing my experiences and the reactions in an upcoming blogpost.

Tuesday, May 8, 2018

Some thoughts on trivial results, or: Yet another argument for Registered Reports

A senior colleague once joked: “If I read about a new result, my reaction is either: ‘That’s trivial!’, or ‘I don’t believe that!’”

These types of reactions are pretty common when presenting the results of a new study (in my experience, anyway). In peer review, the former especially can be a reason for rejecting a paper. In conversations with colleagues, one sometimes gets told, jokingly: “Well, I could have told you in advance that you’d get this result, you didn’t have to run the study!” This can be quite discouraging, especially if, while you were planning your study, it did not seem at all obvious to you that you would get the obtained result.

In many cases, perhaps, the outcome of a study is obvious, especially to someone who has been in the field for much longer than you have. For some effects, there might be huge file drawers, such that it’s a well-known secret that an experimental paradigm which seems perfectly reasonable at first sight doesn’t actually work. In this case, it would be very helpful to hear that it’s probably not the best idea to invest time and resources on this paradigm. However, it would be even more helpful to hear about this before you plan and execute your study.

One also needs to take into account that there is hindsight bias. If you hear the results first, it’s easy to come up with an explanation for the exact obtained pattern. Thus, a result that seems trivial in hindsight may actually have been not so easy to predict a priori. There is also often disagreement about the triviality of an outcome: It's not unheard of (not only in my experience) that Reviewer 1 claims that the paper shouldn't be published because the result is trivial, while Reviewer 2 recommends rejection because (s)he doesn’t believe this result.

Registered reports should strongly reduce the amount of times that people tell you that your results are trivial. If you submit a plan to do an experiment that really is trivial, the reviewers should point this out while evaluating the Stage 1 manuscript. If they have a good point, this will save you from collecting data for a study that many people might not find interesting. And if the reviewers agree that the research question is novel and interesting, they cannot later do a backflip and say that it’s trivial after having seen the results.

So, this is another advantage of registered reports. And, if I’m brave enough, I’ll change the way I tell (senior) colleagues about my work in informal conversations, from: “I did experiment X, and I got result Y” to “I did experiment X. What do you think happened?”