Thursday, March 14, 2019

Why I plan to move away from statistical learning research

Statistical learning is a hot topic, with papers about a link between statistical learning ability and reading and/or dyslexia mushrooming all over the place. In this blog post, I am very sceptical about statistical learning, but before I continue, I should make it clear that it is, in principle, an interesting topic, and there are a lot of studies which I like very much.

I’ve published two papers on statistical learning and reading/dyslexia. My main research interest is in cross-linguistic differences in skilled reading, reading acquisition, and dyslexia, which was also the topic of my PhD. The reason why, during my first post-doc, I became interested in the statistical learning literature, was, in retrospect, exactly the reason why I should have stayed away from it: It seemed relevant to everything I was doing.

From the perspective of cross-linguistic reading research, statistical learning seemed to be integral to understanding cross-linguistic differences. This is because the statistical distributions underlying the print-to-speech correspondences differ across orthographies: in orthographies such as English, children need to extract statistical regularities such as a being often pronounced as /ɔ/ when it succeeds a w (e.g., in “swan”). The degree to which these statistical regularities provide reliable cues differ across orthographies: for example, in Finnish, letter-phoneme correspondences are reliable, such that children don’t need to extract a large number of subtle regularities in order to be able to read accurately.

From a completely different perspective, I became interested in the role of letter bigram frequency during reading. One can count how often a given letter pair co-occurs in a given orthography. The question is whether the average (or summed) frequency of the bigrams within a word affects the speed with which this word is processed. This is relevant to psycholinguistic experiments from a methodological perspective: if letter bigram frequency affects reading efficiency, it’s a factor that needs to be controlled while selecting items for an experiment. Learning the frequency of letter combinations can be thought of as a sort of statistical learning task, because it involves the conditional probabilities of a letter given the other.

The relevance of statistical learning to everything should have been a warning sign, because, as we know from Karl Popper, something that explains everything actually explains nothing. This becomes clearer when we ask the first question that a researcher should ask: What is statistical learning? I don’t want to claim that there is no answer to this question, nor do I want to provide an extensive literature review of the studies that do provide a precise definition. Suffice it to say: Some papers have definitions of statistical learning that are extremely broad, which is the reason why it is often used as a hand-wavy term denoting a mechanism that explains everything. This is an example of a one-word explanation, a term coined by Gerd Gigerenzer in his paper “Surrogates for theories” (one of my favourite papers). Other papers provide more specific definitions, for example, by defining statistical learning based on a specific task that is supposed to measure it. However, I have found no consensus among these definitions: and given that different researchers have different definitions for the same terminology, the resulting theoretical and empirical work is (in my view) a huge mess.

In addition to these theoretical issues, there is also a big methodological mess when it comes to the literature on statistical learning and reading or dyslexia. I’ve written about this in more detail in our two papers (linked above), but here I will list the methodological issues in a more compact manner: First, when we’re looking at individual differences (for example, by correlating reading ability and statistical learning ability), the lack of a task with good psychometric properties becomes a huge problem. This issue has been discussed in a number of publications by Noam Siegelman and colleagues, who even developed a task with good psychometric properties for adults (e.g., here and here). However, as far as I’ve seen, there are still no published studies on reading ability or dyslexia using improved tasks. Furthermore, recent evidence suggests that a statistical learning task which works well with adults still has very poor psychometric properties when applied to children.

Second, the statistical learning and reading literature is a good illustration of all the issues that are associated with the replication crisis. Some of these are discussed in our systematic review about statistical learning and dyslexia (linked above). The publication bias in this area (selective publication of significant results) became even clearer to me when I presented our study on statistical learning and reading ability – where we obtained a null result – at the SSSR conference in Brighton (2018). There were several proponents of the statistical learning theory (if we can call it that) of reading and dyslexia, but none of them came to my poster to discuss this null result. Conversely, a number of people dropped by to let me know that they’ve conducted similar studies and also gotten null results.

Papers on statistical learning and reading/dyslexia continue to be published, and at some point, I was close to being convinced that maybe, visual statistical learning is related to learning to read in orthographies with a visually complex orthography. But then, some major methodological or statistical issue always jumps out at me when I read a paper closely enough. The literature reviews of these papers tend to be biased, often listing studies with null-results as evidence for the presence of an effect, or else picking out all the flaws of papers with null results, while treating the studies with positive results as a holy grail. I have stopped reading such papers, because it does not feel like a productive use of my time.

I have also stopped accepting invitations to review papers about statistical learning and reading/dyslexia, because I have started to doubt my ability to give an objective review. By now, I have a strong prior that there is no link between domain-general statistical learning ability and reading/dyslexia. I could be convinced otherwise, but would require very strong evidence (i.e., a number large-scale pre-registered studies from independent labs with psychometrically well-established tasks). While I strongly believe that such evidence is required, I realise that it is unreasonable to expect such studies from most researchers who conduct this type of research, who are mainly early-career researchers who base their methodology on previous studies.

I also stopped doing or planning any studies on domain-general statistical learning. The amount of energy necessary to refute bullshit is an order of magnitude bigger than to produce it, as Alberto Brandolini famously tweeted. This is not to say that everything to do with statistical learning and reading/dyslexia is bullshit, but – well, some of it definitely is. I hope that good research will continue to be done in this area, and that the state of the literature will become clearer because of this. In the meantime, I have made the personal decision to move away from this line of research. I have received good advice from one of my PhD supervisors: not to get hung up on research that I think is bad, but to pick an area where I think there is good work and to build on that. Sticking to this advice definitely makes the research process more fun (for me). Statistical learning studies are likely to yield null results, which end up uninterpretable because of the psychometric issues with statistical learning tasks. Trying to publish this kind of work is not a pleasant experience.

Why did I write this blog post? Partly, just to vent. I wrote it as a blog post and not as a theoretical paper, because it lacks the objectivity and a systematic approach which would be required for a scientifically sound piece of writing. If I were to write a scientifically sound paper, I would need to break my resolution to stop doing research on statistical learning, so a blog post it is. Some of the issues above have been discussed in our systematic review about statistical learning and dyslexia, but I also thought it would be good to summarise these arguments in a more concise form. Perhaps some beginning PhD student who is thinking about doing their project on statistical learning and reading will come across this post. In this case, my advice would be: pick a different topic. 

Sunday, March 10, 2019

What’s next for Registered Reports? Selective summary of a meeting (7.3.2019)

Last week, I attended a meeting about Registered Reports. It was a great opportunity, not only to discuss Registered Reports, but also to meet some people whom I had previously only known from twitter, over a glass of Whiskey close to London Bridge.

The meeting felt very productive, and I took away a lot of new information, about the Registered Report Format in general, and also some specific things that will be useful to me when I submit my next Registered Report. Here, I don’t aim to summarise everything that was discussed, but to focus on those aspects that could be of practical importance to individual researchers.

What’s stopping researchers from submitting Registered Reports?
We dedicated the entire morning to discussing how to increase the submission rate of Registered Reports. Before the meeting, I had done an informal survey among colleagues and on twitter to see what reasons people had for not submitting Registered Reports. The response rate was pretty low, suggesting that a lack of interest may be a leading factor (due either to apathy or scepticism – from my informal survey, I can’t tell). From people who did respond, the main reason was time: often, especially younger researchers are on short-term contracts (1-3 years), and are pressured for various reasons to start data collection as soon as possible. Among such reasons, people mentioned grants: funders often expect strict adherence to a timeline. And, unfortunately, such timing pressures disproportionately affect earlier career researchers, exactly the demographic which is most open to trying out a new way of conducting and publishing research.

Submitting a Registered Report may take a while – there is no point sugar-coating this. In contrast to standard studies, authors of Registered Reports need to spend more time to plan the study, because writing the report involves planning in detail; there may be several rounds of review before in-principle acceptance, and addressing reviewers’ comments may involve collecting pilot data. Given my limited experience, I would estimate that about 6-9 months would need to be added to the study timeline before one can count with in-principle acceptance and data collection can be started.

Of course, the increase in time that you spend before conducting the experiment will substantially improve the quality of the paper. A Registered Report is very likely to cut a lot of time at the end of the research cycle: when realising how long it may take to get in-principle acceptance, you should always bear in mind the painstakingly slow and frustrating process of submitting a paper to one journal after the other, accumulating piles of reviews varying in constructiveness and politeness, being told about methodological flaws that now you can’t fix, about how your results should have been different, and eventually unceremoniously throwing the study which you started with such great enthusiasm into the file-drawer.

Long-term benefits aside, unfortunately the issue of time remains for researchers on short-term contracts and with grant pressures. We could not think of any quick fix to this problem. In the long term, solutions may involve planning in this time when you write your next grant application. One possibility could be to write that you plan to conduct a systematic review during the time that you wait for in-principle acceptance. In my recently approved grant from the Deutsche Forschungsgemeinschaft, I proposed two studies: for the first study, I optimistically included a period of three months for “pre-registration and set-up”, and for the second study a period of twelve months (because this would happen in parallel to data collection for the first study). This somewhat backfired, because, while the grant was approved, they cut 6 months from my proposed timeline because they considered 12 months to be way too long for “pre-registration and set-up”. So, the strategy of planning for registered reports in grant applications may work, but bear in mind that it’s not risk-free.

A new thing that I learned about during the meeting are Registered Report Research Grants: Here, journals pair up with funding agencies, and reviews of the Registered Report happens in parallel to the review of the funding proposal. This way, once in-principle acceptance is in, the funding is released and data collection can start. This sounds like an amazingly efficient win-win-win solution, and I sincerely hope that funding agencies will routinely offer such grants.

How to encourage researchers to publish Registered Reports?
Here, I’ll list a few bits and pieces that were suggested as solutions. Some of these are aimed at the individual researcher, though many would require some top-down changes. The demographic most happy to try out a new publication system, as mentioned above, are likely to be early-career researchers, especially PhD students.

Members at the meeting reported positive experiences with department-driven working groups, such as the ReproducibiliTea initiative or Open Science Cafés. In some departments, such working groups have led to PhD students taking the initiative and proposing to their advisors that they would like to do their next study as a Registered Report. We discussed that encouraging PhD students to publish one study as a Registered Report could be a good recommendation. For departments which have formal requirements about the number of publications that are needed in order to graduate, a Registered Report could count more than a standard publication: let’s say, they either need to publish three standard papers, or one standard paper and a Registered Report (or two Registered Reports).

Deciding to publish a Registered Report is like jumping into cold water: the format requires some pretty big changes in the way that a study is conducted, and one is unsure if it will really impress practically important people (such as potential employers or grant reviewers) over pumping out standard publications. Taking a step back, taking a deep breath and thinking about the pros and cons, I would say that, in many cases, the advantages outweigh the disadvantages. Yes, the planning stage may take longer, but you will cut time at the end,  during the publication process, with a much higher success that the study will be published. A fun fact I learned during the meeting: At the journal Cortex, once a Registered Report gets past the editorial desk (i.e., the editors established that the paper fits the scope of the journal), the rejection rate is only 10% (which is why we need more journals adopting the Registered Report format: this way, any paper, including those of interest to a specialised audience, will be able to find a good home). And, once you have in-principle acceptance, you can list the paper in on your CV, which is (to many professors) much more impressive than a list of "in preparation"/"submitted" publications. If the Stage 1 review process takes unusually long and you're running out of time in your contract, you can withdraw the Registered Report, incorporate the comments to date, and conduct the experiment as a Preregistered Study. 

Some of the suggestions listed above are aimed at individual researchers. The meeting was encouraging and helpful in terms of getting some suggestions that could be applied here and now. It also made it clear that top-down changes are required: the Registered Report format involves a different timeline compared to standard submissions, so university expectations (e.g., in terms of the required number of publications for PhD students, short-term post-doc contracts) and funding structures need to be changed.

Wednesday, February 13, 2019

P-values 101: An attempt at an intuitive but mathematically correct explanation

P-values are those things that are we want to be smaller than 0.05, as I knew them during my undergraduate degree, and (I must admit) throughout my PhD. Even if you’re a non-scientist or work in a field that does not use p-values, you’ve probably heard of terms like p-hacking (e.g., from this video by John Oliver). P-values and, more specifically, p-hacking, are getting the blame for a lot of the things that go wrong in psychological science, and probably other fields as well.

So, what exactly are p-values, what is p-hacking, and what does all of that have to do with the replication crisis? Here, a lot of stats-savvy people shake their heads and say: “Well, it’s complicated.” Or they start explaining the formal definition of the p-value, which, in my experience, to someone who doesn’t already know a lot about statistics, sounds like blowing smoke and nitpicking on wording. There are rules, which, if one doesn’t understand how a p-value works, just sound like religious rituals, such as: “If you calculate a p-value halfway through data collection, the p-value that you calculate at the end of your study will be invalid.”

Of course, the mathematics behind p-values is on the complicated side: I had the luxury of being able to take the time to do a graduate certificate in statistics, and learn more about statistics than most of my colleagues will have the time for. Still, I won’t be able to explain how to calculate the -value without revising and preparing. However, the logic behind it is relatively simple, and I often wonder why people don’t explain p-values (and p-hacking) in a much less complicated way1. So here goes my attempt at an intuitive, maths-free, but at the same time mathematically correct explanation.

What are p-values?
A p-value is the conditional probability of obtaining the data or data more extreme, given the null hypothesis. This is a paraphrased formal text-book definition. So, what does this mean?

Conditional means that we’re assuming a universe where the null hypothesis is always true. For example, let’s say I pick 100 people and randomly divide them into two groups. Group 1 gets a placebo, and Group 2 also gets a placebo. Here, we’ve defined the null hypothesis to be true.

Then I conduct my study; I might ask the participants whether they feel better after having taken this pill and look at differences in well-being between Group 1 and Group 2. In the long run, I don’t expect any differences, but in a given sample, there will be at least some numerical differences due to random variation. Then – and this point is key for the philosophy behind the p-value – I repeat this procedure, 10 times, 100 times, 1,000,000 times (this is why the p-value is called a frequentist statistic). And, according to the definition of the p-value, I will find that, as the sample size increases, the percentage of experiments where I get a p-value smaller than or equal to 0.05 will get closer and closer to 5%. Again, this works in our modeled universe where the null hypothesis is true.

The second important aspect of the definition of the p-value is that it is the probability of the data, not of the null hypothesis. Why this is important to bear in mind can be, again, explained with the example above. Let’s say we conduct our experiment with the two identical placebo treatments, and get a p-value of 0.037. What’s the probability that the null hypothesis is false? 0%: We know that it’s true, because this is how we designed the experiment. What if we get a p-value of 0.0000001? The probability of the null hypothesis being false is still 0%.

If we’re interested in the probability of the hypothesis rather than the probability of the data, we’ll need to use Bayes’ Theorem (which I won’t explain here, as this would be a whole different blog post). The reason why the p-value is not interpretable as a probability about the hypothesis (null or otherwise) is that we need to consider how likely the null hypothesis is to be true. In the example above, if we do the experiment 1,000,000 times, the null hypothesis will always be true, by design, therefore we can be confident that every single significant p-value that we get is a false positive.

We can take a different example, where we know that the null hypothesis is always false. For example, we can take children between 0 and 10 year old, and correlate their height with their age. If we collect data repeatedly and accumulate a huge number of samples to calculate the p-value associated with the correlation coefficient, we will occasionally get p > 0.05 (the proportion of times that this will happen depends both on the sample size and the true effect size, and equals to 1- statistical power). So, if we do the experiment and find that the correlation is not significantly different from zero, what is the probability that the null hypothesis is false? It’s 100%, because everyone knows that children’s size increases with age.

In these two cases we know that, in reality, the null hypothesis is true and false, respectively. In an actual experiment, of course, we don’t know if it’s true of false – that’s why we’re doing the experiment in the first place. It is relevant, however, how many hypotheses we expect to be true in the long run.

If we study mainly crazy ideas that we had in the shower, chances are, the null hypothesis will often be true. This means that the posterior probability of the null hypothesis being false, even after obtaining a p-value smaller than 0.05, will still be relatively small.

If we carefully build on previous empirical work, design a methodologically sound study, and derive our predictions from well-founded theories, we will be more likely to study effects where the null hypothesis is often false. Therefore, the posterior probability that the null hypothesis is false will be greater than for our crazy shower idea, even if we get an identical p-value.

So what about p-hacking?
That was the crash course to p-values, so now we turn to p-hacking. P-hacking practices are often presented as lists in talks about open science, and include:
-       Optional stopping: Collecting N participants, calculating a p-value, and deciding whether or not to continue with data collection depending on whether this first peek gives them a significant p-value or not,
-       Cherry-picking of outcome variables: Collecting many potential outcome variables in a given experiment, and deciding on or changing the outcome-of-interest after having looked at the results,
-       Determining or changing data cleaning procedures after conditional on whether the p-value is significant or not,
-       Hypothesising after results are known (HARKing): First collecting the data, and then writing a story to fit the results and framing it as a confirmatory study.

The reasons why each of these are wrong are not intuitive when we think about data. However, they become clearer when we think of a different example: Guessing the outcome of a coin toss. We can use this different example, because both the correctness of a coin toss guess and the data that we collect in an experiment are random variables: they follow the same principles of probability.

Imagine that I want to convince you that I have clairvoyant powers, and that I can guess the outcome of a coin toss. You will certainly be impressed, and think whether there might not be something to that, if I toss the coin 10 times, and correctly guess the outcome every single time. You will be less impressed, however, if I toss the coin until I guess ten outcomes in a row. Of course, if you’re not present, I can instead make a video recording, cut out all of my unsuccessful attempts, and leave in only the ten correct guesses. From a probabilistic perspective, this would be the same as optional stopping.2

Now to cherry-picking of outcome variables: In order to convince you of my clairvoyant skills, I declare that I will toss the coin 10 times and guess the outcome correctly every single time. I toss the coin 10 times, and guess correctly 7 times out of 10. I can back-flip and argue that I still guessed correctly more often than incorrectly, and therefore my claim about my supernatural ability holds true. But you will not be convinced at all.

The removal of outliers, in the coin-toss metaphor, would be equivalent to discarding every toss where I guess incorrectly. HARKing would be equivalent to tossing the coin multiple times, then deciding on the story to tell later (“My guessing accuracy was significantly above chance”/ “I broke the world record in throwing the most coin tosses within 60 seconds” / “I broke the world record in throwing the coin higher than anyone else” / “I’m the only person to have ever succeeded in accidentally hitting my eye while tossing a coin!”). 

So, if you're unsure whether a given practice can be considered p-hacking or not, try to think of an equivalent coin example, where it will be more intuitive to decide whether the reasoning behind the analysis or data processing choice is logically sound or not. 

More formally, when the final analysis choice is conditional on the results, it messes with the frequentist properties of the statistic (i.e., the p-value is designed to have certain properties which hold true if you repeat a procedure an infinite amount of times, which no longer hold true if you add an unaccounted conditional term to the procedure). Such an additional conditional term could be based on the p-value (we will continue with data collection under the condition that p > 0.05), or it could be based on descriptive statistics (we will calculate a p-value between two participant groups out of ten under the condition that the graph shows the biggest difference between these particular groups).

Some final thoughts
I hope that readers who have not heard explanations of p-values that made sense to them found my attempt useful. I also hope that researchers who often find themselves in a position where they explain p-values will see my explanations as a useful example. I would be happy to get feedback on my explanation, specifically about whether I achieved these goals: I occasionally do talks about open science and relevant methods, and I aim to streamline my explanation such that it is truly understandable to a wide audience.

It is not the case that people who engage in p-hacking practices are incompetent or dishonest. At the same time, it is also not the case that people who engage in p-hacking practices would not do so if they learn to rattle off the formal definition of a p-value in their sleep. But, clearly, fixing the replication crisis involves improving the general knowledge of statistics (including how to correctly use p-values for inference). There seems to be a general consensus about this, but judging by many Twitter conversations that I’ve followed, there is no consensus about how much statistics, say, a researcher in psychological science needs to learn. After all, time is limited, and there is many other general as well as topic-specific knowledge that needs to be acquired in order to do good research. If I’m asked how much statistics psychology researchers should know, my answer is, as always: “As much as possible”. But a wide-spread understanding of statistics is not going to be achieved by wagging a finger at researchers who get confused when confronted with the difference between the probability of the hypothesis versus the probability of the data conditional on the hypothesis. Instead, providing intuitive explanations should not only improve understanding, but also show that, on some level (that, arguably, is both necessary and sufficient for researchers to do sound research), statistics is not an impenetrable swamp of maths.

Edit (18/2/2019): If you prefer a different format of the above information, here is a video of me explaining the same things in German.

1 There are some intuitive explanations that I’ve found very helpful, like this one by Dorothy Bishop, or in general anything by Daniel Lakens (many of his blog entries, his coursera course). If he holds any workshops anywhere near you, I strongly recommend to go. In 2015, I woke up at 5am to get from Padova to Daniel's workshop in Rovereto, and at the end of the day, I was stuck in Rovereto because my credit card had been blocked and I couldn't buy a ticket back to Padova. It was totally worth it! 
2 Optional stopping does not involve willingly excluding relevant data, while my video recording example does. Unless disclosed, however, optional stopping does involve the withholding of information that is critical in order to correctly interpret the p-value. Therefore, I consider the two cases to be equivalent.

Tuesday, January 8, 2019

Scientific New Year’s Resolutions

Last year, I wrote an awful lot of blog posts about registered reports. My first New Year’s Resolution is to scale down my registered reports activism, which I start off by writing a blog post about my new year’s resolutions, where I’ll try hard not to mention registered reports more than 5 times.

Scientifically, last year was very successful for me, with the major event being a grant from the DFG, which will give me a full-time position until mid-2021 and the opportunity to work on my own project. This gives me my second New Year’s Resolution, which is to focus on the project that I’m supposed to be working on, and not to get distracted by starting any (or too many) new side projects.

Having a full-time job means I’ll have to somewhat reduce the amount of time I’ll spend on Open Science, compared to the past few years. However, Open Science is still very important to me: I see it as an integral part of the scientific process, and consequently as one of the things I should do as part of my job as a researcher. As a Freies Wissen Fellow, an ambassador of the Center for Open Science, and a member of the LMU Open Science Center, I have additional motivation and support in improving the openness of my own research and helping others to do the same. My third New Year’s Resolution is to start prioritising various Open Science projects.

In order to prioritise, I need to select some areas which I think are the most effective in increasing the openness of research. In my experience, talks and workshops are particularly effective (in fact, that’s how I became interested in Open Science). Last year, I gave talks as part of two Open Science Workshops (in Neuchâtel and Linz), and at our department, I organised an introduction to R (to encourage reproducible data analyses). The attendance of these workshops was quite good, and the attendees seemed very interested and motivated, which further supports my hypothesis about the efficiency of such events. I hope to hold more workshops this year: so far, I have one invitation for a workshop in Göttingen.

I still have two mentions of Registered Reports left for this blogpost (oops, just one now): as I think that they provide one of the most efficient ways to mitigate the biases that make a lot of psychological science non-replicable, I will continue trying to encourage journals to accept them as an additional publication format. I have explained why I think that this is important here, here, here, and here [in German], and how this can be achieved here and here. Please note that we’re still accepting signatures for the letters to editors and for an open letter, for more information see here.

As a somewhat less effective thing that I did in the previous years to support open science, I had signed all of my peer reviews. I am now wondering if it’s not doing more harm than good. Signing my reviews is a good way to force myself to stay constructive, but sometimes there really are mistakes in a paper that, objectively speaking, mean that the conclusions are wrong. One cannot blame authors for being upset at the reviewers for pointing this out – after all, we’ve all experienced this ourselves, and so many things depend on the number of publications. And, as much as I try to convince myself to see the review process is a constructive discussion between scientists, perhaps there is a good reason that peer review often happens anonymously, especially for an early career researcher.

As my fourth New Year’s Resolution, I will strongly reduce my use of Twitter compared to the last years. I’ve learned a lot through Twitter, especially about statistics and open science. Lately, however, I started feeling like a lot of the discussions are repeating; the Open Science community has grown since I joined Twitter, leading to interpersonal conflicts within this community that have little to do with the common goal of improving reproducibility and replicability of research. A few months ago, I’ve created a lurking account, where I will follow only a few reading researchers: this way, I can compulsively check Twitter without scrolling down endlessly, and I can keep up-to-date with any developments that are discussed in my actual field of research. So far, I really don’t have the feeling that I’m missing out, though I still check my old Twitter account occasionally, especially when I get notifications.

My fifth New Year’s Resolution is to continue learning, especially about programming and statistics. The more I learn, the more I realise how little I know. However, looking back, I also realise that I can now do things that I couldn’t even have dreamed of a couple of years ago, and it’s a nice feeling (and it substantially improves the quality of my work).

My final New Year’s Resolution goes both for my working life and for my personal life: Be nice: judge less, focus on the good things, not on the bad, take actions instead of complaining about things, be constructive in criticism. 

So here's to a good 2019! 

Oh, and I have one left: Please support Registered Reports!