Thursday, April 4, 2019

On being happy in academia

tl;dr: Don’t take your research too seriously.

I like reading blog posts with advice about how to survive a PhD, things one wished one had known before one started a PhD, and other similar topics. Here goes my own attempt at writing such a blog post. I’m not a PhD student anymore, so I can’t talk about my current PhD experiences, nor am I a professor who can look back and list all of the personal mistakes and successes that have led to “making it” in academia. It has been a bit over 4 years since I finished my PhD and started working as a post-doc, and comparing myself now and then I realise that I’m happier working in academia now. This is not to say that I was ever unhappy during my time in academia, but some changes in attitude have led to – let’s say – a healthier relationship to my research. This is what I would like to write this blog post about.

Don’t let your research define you
In the end, all of the points below can be summarised as: Don’t take your research too seriously. Research inevitably involves successes and failures; everybody produces some good research and some bad research, and it’s not always easy for the researcher to decide which it is at the time. So there will always be criticism, some of it justified, some of it reflecting the bad luck of meeting Reviewer 2 on a bad day.

Receiving criticism has become infinitely easier for me over the years: after getting an article rejected, it used to take at least one evening of moping and a bottle of wine to recover, while now I only shrug. It’s difficult to identify exactly why my reaction to rejection changed over time, but I think it has something to do with seeing my research less as an integral part of my identity. I sometimes produce bad research, but this doesn’t make me a bad person. This way, even if a reviewer rightfully tears my paper to shreds, my ego remains intact.

Picking a research topic
Following up from the very abstract point above, I’ll try to isolate some more concrete suggestions that, in my case, may or may not have contributed to my changed mindset. The first one is about picking a research topic. At the beginning of my PhD, I wanted to pick a topic that is of personal relevance, such as bilingualism or reading in different orthographies. Then, becoming more and more cynical about the research literature, I started following up on topics where I’d read a paper and think: “That’s gotta be bullshit!”

Now, I’ve moved away from both approaches. On the one hand, picking a topic that one is too passionate about can, in my view, lead to a personal involvement which can (a) negatively impact one’s ability to view the research from an objective perspective, and (b) become an unhealthy obsession. To take a hypothetical example: if I had followed up on my interest in bilingualism, it is – just theoretically – possible that I would consistently find that being bilingual comes with some cognitive disadvantages. As someone who strongly believes in the benefit of a multilingual society, it would be difficult for me to objectively interpret and report my findings.

On the other hand, focussing on bad research can result in existential crises, anger at poor researchers, a permanently bad mood, and from a practical perspective, annoying some people with high statuses while having a relatively small impact on improving the state of the literature.

My conclusion has been that it’s good to choose topics that I find interesting, where there is good groundwork, and where I know that, no matter what the outcome of my research, I will be comfortable reporting it.

Working 9-to-5
My shift in mindset coincides with having met my husband (during my first post-doc in Italy). As a result, I started spending less time working outside of office hours. Coming home at a reasonable time, trying out some new hobbies (cross-country skiing, hiking, cycling), and spending weekends together or catching up with my old hobbies (music, reading) distracts from research, in a good way. When I get to work, I can approach my research with a fresh mind and potentially from a new perspective.

Having said this, I’ve always been good at not working too hard, which is probably the reason why I’ve always been pretty happy during my time in academia. (Having strong Australian and Russian cultural ties, I have both the “she’ll be right” and the “авось повезёт” attitudes. Contrary to popular belief, a relaxed attitude towards work is also compatible with a German mindset: in Germany, people tend to work hard during the day, but switch off as soon as they leave the office.) At the beginning of my PhD, one of the best pieces of advice that I received was to travel as much as possible. I tried to combine my trips with lab or conference visits, but I also spent a lot of time discovering new places and not thinking about research at all. During my PhD in Sydney, I also pursued old and new hobbies: I joined a book club, an orchestra, a French conversation group, took karate lessons, and thereby met lots of great people and have many good memories from my time in Sydney.

Stick to your principles
For me, this point is especially relevant from an Open Science perspective. Perhaps, if I spent less time on doing research in a way that is acceptable for me, I’d have double the number of publications. This could, of course, be extremely advantageous on the job market. On the flip side, there are also more and more researchers who value quality over quantity: a job application and CV with lots of shoddy publications may be valued by some professors, but may be immediately trashed by others who are more on board with the open science movement.

The moral of this story is: One can’t make everyone happy, so it’s best to stick to one’s own principles, which also has the side effect that you’ll be valued by researchers who share your principles. 

A project always takes longer than one initially thinks
Writing a research proposal of any kind involves writing a timeline. In my experience, the actual project will always take much longer than anticipated, often due to circumstances beyond your control (e.g., recruitment takes longer than expected, collaborators take a long time to read drafts). For planning purposes, it’s good to add a couple of months to account for this. And if you notice that you can’t keep up with your timeline: that’s perfectly normal.

Have a backup plan
For a long time, I saw the prospect of leaving academia as the ultimate personal failure. This changed when I made the decision that my priority is to work within commutable distance of my husband, which, in the case of an academic couple, may very well involve one or both leaving academia at some stage. It helped to get a more concrete idea of what leaving academia would actually mean. It is ideal if there is a “real world” profession where one’s research experience would be an advantage. In my case, I decided to learn more about statistics and data science. In addition to opening job prospects that sound very interesting and involve a higher salary than the one I would get in academia, it gave me an opportunity to learn things that helped take my research to a different level.

Choosing a mentor
From observing colleagues, I have concluded that the PhD supervisor controls at least 90% of a student’s PhD experience. For prospective PhD students, my advice would be to be very careful in choosing a supervisor. One of the biggest warning signs (from observing colleagues’ experiences) is a supervisor who reacts negatively when a (female) PhD student or post-doc decides to start a family. If you get the possibility to talk to your future colleagues before starting a PhD, ask them about their family life, and how easy they find it to combine family with their PhD or post-doc work. If you’re stuck in a toxic lab, my advice would be: Get out as soon as you can. Graduate as soon as possible and get a post-doc in a better lab; start a new PhD in a better lab, even if it means losing a few years; leave academia altogether. I’ve seen friends and colleagues getting long-lasting physical and psychological health problems because of a toxic research environment: nothing is worth going through this.

Having a backup plan, as per the point above, could be particularly helpful in getting away from a toxic research environment. Probably one would be much less willing to put up with an abusive supervisor if one is confident that there are alternatives out there.

Choosing collaborators
Collaborators are very helpful when it comes to providing feedback about aspects that you may not have thought about. One should bear in mind, though, that they have projects of their own: chances are, they will not be as enthusiastic about your project as you are, and may not have time to contribute as much as you expect. This is good to take into account when planning a project: assuming that you will need to do most of the work yourself will reduce misunderstandings and stress due to the perception of collaborators not working hard enough on this project.

Be aware of the Imposter Syndrome
During my PhD, there were several compulsory administrative events that, at the time, I thought were a waste of time. Among other things, we were told about the imposter syndrome at one such event (also, we were given the advice to travel as much as possible, by a recently graduated PhD student). It was relatively recently that I discovered that many other early-career researchers have never heard of the imposter syndrome, and often feel inadequate, guilty, and tired from their research. Putting a label on this syndrome may help researchers to become more aware that most people often feel like an imposter in academia, and to take this feeling less seriously.

Thursday, March 14, 2019

Why I plan to move away from statistical learning research

Statistical learning is a hot topic, with papers about a link between statistical learning ability and reading and/or dyslexia mushrooming all over the place. In this blog post, I am very sceptical about statistical learning, but before I continue, I should make it clear that it is, in principle, an interesting topic, and there are a lot of studies which I like very much.

I’ve published two papers on statistical learning and reading/dyslexia. My main research interest is in cross-linguistic differences in skilled reading, reading acquisition, and dyslexia, which was also the topic of my PhD. The reason why, during my first post-doc, I became interested in the statistical learning literature, was, in retrospect, exactly the reason why I should have stayed away from it: It seemed relevant to everything I was doing.

From the perspective of cross-linguistic reading research, statistical learning seemed to be integral to understanding cross-linguistic differences. This is because the statistical distributions underlying the print-to-speech correspondences differ across orthographies: in orthographies such as English, children need to extract statistical regularities such as the letter “a” often being pronounced as /ɔ/ when it follows a “w” (e.g., in “swan”). The degree to which these statistical regularities provide reliable cues differs across orthographies: for example, in Finnish, letter-phoneme correspondences are reliable, such that children don’t need to extract a large number of subtle regularities in order to be able to read accurately.

From a completely different perspective, I became interested in the role of letter bigram frequency during reading. One can count how often a given letter pair co-occurs in a given orthography. The question is whether the average (or summed) frequency of the bigrams within a word affects the speed with which this word is processed. This is relevant to psycholinguistic experiments from a methodological perspective: if letter bigram frequency affects reading efficiency, it’s a factor that needs to be controlled while selecting items for an experiment. Learning the frequency of letter combinations can be thought of as a sort of statistical learning task, because it involves the conditional probabilities of a letter given the other.
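To make the bigram idea a bit more concrete, here is a minimal Python sketch. The word list is a toy example that I made up for illustration, and the function names are my own; a real study would count bigrams in a large corpus and would probably weight each word by how often it occurs.

```python
from collections import Counter


def bigram_counts(words):
    """Count every adjacent letter pair across a list of words."""
    counts = Counter()
    for word in words:
        w = word.lower()
        counts.update(w[i:i + 2] for i in range(len(w) - 1))
    return counts


def mean_bigram_frequency(word, counts):
    """Average count of the bigrams that occur inside one word."""
    w = word.lower()
    bigrams = [w[i:i + 2] for i in range(len(w) - 1)]
    if not bigrams:
        return 0.0
    return sum(counts[b] for b in bigrams) / len(bigrams)


def conditional_probability(first, second, counts):
    """Estimate P(second letter | first letter) from the bigram counts."""
    total = sum(n for bigram, n in counts.items() if bigram[0] == first)
    return counts[first + second] / total if total else 0.0


# Toy word list, invented for illustration only (each word counted once).
corpus = ["swan", "wand", "want", "wasp", "sand", "hand"]
counts = bigram_counts(corpus)

print(mean_bigram_frequency("swan", counts))
print(conditional_probability("a", "n", counts))
```

In this toy list, “an” occurs in five of the six words, so a word like “swan” gets a fairly high mean bigram frequency; the conditional probability function shows in what sense the counts can be read as “the probability of a letter given the other”.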

The relevance of statistical learning to everything should have been a warning sign, because, as we know from Karl Popper, something that explains everything actually explains nothing. This becomes clearer when we ask the first question that a researcher should ask: What is statistical learning? I don’t want to claim that there is no answer to this question, nor do I want to provide an extensive literature review of the studies that do provide a precise definition. Suffice it to say: Some papers have definitions of statistical learning that are extremely broad, which is the reason why it is often used as a hand-wavy term denoting a mechanism that explains everything. This is an example of a one-word explanation, a term coined by Gerd Gigerenzer in his paper “Surrogates for theories” (one of my favourite papers). Other papers provide more specific definitions, for example, by defining statistical learning based on a specific task that is supposed to measure it. However, I have found no consensus among these definitions: and given that different researchers have different definitions for the same terminology, the resulting theoretical and empirical work is (in my view) a huge mess.

In addition to these theoretical issues, there is also a big methodological mess when it comes to the literature on statistical learning and reading or dyslexia. I’ve written about this in more detail in our two papers (linked above), but here I will list the methodological issues in a more compact manner: First, when we’re looking at individual differences (for example, by correlating reading ability and statistical learning ability), the lack of a task with good psychometric properties becomes a huge problem. This issue has been discussed in a number of publications by Noam Siegelman and colleagues, who even developed a task with good psychometric properties for adults (e.g., here and here). However, as far as I’ve seen, there are still no published studies on reading ability or dyslexia using improved tasks. Furthermore, recent evidence suggests that a statistical learning task which works well with adults still has very poor psychometric properties when applied to children.

Second, the statistical learning and reading literature is a good illustration of all the issues that are associated with the replication crisis. Some of these are discussed in our systematic review about statistical learning and dyslexia (linked above). The publication bias in this area (selective publication of significant results) became even clearer to me when I presented our study on statistical learning and reading ability – where we obtained a null result – at the SSSR conference in Brighton (2018). There were several proponents of the statistical learning theory (if we can call it that) of reading and dyslexia, but none of them came to my poster to discuss this null result. Conversely, a number of people dropped by to let me know that they’ve conducted similar studies and also gotten null results.

Papers on statistical learning and reading/dyslexia continue to be published, and at some point, I was close to being convinced that maybe, visual statistical learning is related to learning to read in visually complex orthographies. But then, some major methodological or statistical issue always jumps out at me when I read a paper closely enough. The literature reviews of these papers tend to be biased, often listing studies with null results as evidence for the presence of an effect, or else picking out all the flaws of papers with null results, while treating the studies with positive results as a holy grail. I have stopped reading such papers, because it does not feel like a productive use of my time.

I have also stopped accepting invitations to review papers about statistical learning and reading/dyslexia, because I have started to doubt my ability to give an objective review. By now, I have a strong prior that there is no link between domain-general statistical learning ability and reading/dyslexia. I could be convinced otherwise, but would require very strong evidence (i.e., a number of large-scale pre-registered studies from independent labs with psychometrically well-established tasks). While I strongly believe that such evidence is required, I realise that it is unreasonable to expect such studies from most researchers who conduct this type of research, who are mainly early-career researchers who base their methodology on previous studies.

I also stopped doing or planning any studies on domain-general statistical learning. The amount of energy necessary to refute bullshit is an order of magnitude bigger than to produce it, as Alberto Brandolini famously tweeted. This is not to say that everything to do with statistical learning and reading/dyslexia is bullshit, but – well, some of it definitely is. I hope that good research will continue to be done in this area, and that the state of the literature will become clearer because of this. In the meantime, I have made the personal decision to move away from this line of research. I have received good advice from one of my PhD supervisors: not to get hung up on research that I think is bad, but to pick an area where I think there is good work and to build on that. Sticking to this advice definitely makes the research process more fun (for me). Statistical learning studies are likely to yield null results, which end up uninterpretable because of the psychometric issues with statistical learning tasks. Trying to publish this kind of work is not a pleasant experience.

Why did I write this blog post? Partly, just to vent. I wrote it as a blog post and not as a theoretical paper, because it lacks the objectivity and a systematic approach which would be required for a scientifically sound piece of writing. If I were to write a scientifically sound paper, I would need to break my resolution to stop doing research on statistical learning, so a blog post it is. Some of the issues above have been discussed in our systematic review about statistical learning and dyslexia, but I also thought it would be good to summarise these arguments in a more concise form. Perhaps some beginning PhD student who is thinking about doing their project on statistical learning and reading will come across this post. In this case, my advice would be: pick a different topic. 

Sunday, March 10, 2019

What’s next for Registered Reports? Selective summary of a meeting (7.3.2019)

Last week, I attended a meeting about Registered Reports. It was a great opportunity, not only to discuss Registered Reports, but also to meet some people whom I had previously only known from Twitter, over a glass of whiskey close to London Bridge.

The meeting felt very productive, and I took away a lot of new information, about the Registered Report Format in general, and also some specific things that will be useful to me when I submit my next Registered Report. Here, I don’t aim to summarise everything that was discussed, but to focus on those aspects that could be of practical importance to individual researchers.

What’s stopping researchers from submitting Registered Reports?
We dedicated the entire morning to discussing how to increase the submission rate of Registered Reports. Before the meeting, I had done an informal survey among colleagues and on Twitter to see what reasons people had for not submitting Registered Reports. The response rate was pretty low, suggesting that a lack of interest may be a leading factor (due either to apathy or scepticism – from my informal survey, I can’t tell). From people who did respond, the main reason was time: often, especially younger researchers are on short-term contracts (1-3 years), and are pressured for various reasons to start data collection as soon as possible. Among such reasons, people mentioned grants: funders often expect strict adherence to a timeline. And, unfortunately, such timing pressures disproportionately affect early-career researchers, exactly the demographic which is most open to trying out a new way of conducting and publishing research.

Submitting a Registered Report may take a while – there is no point sugar-coating this. In contrast to standard studies, authors of Registered Reports need to spend more time planning the study, because writing the report involves planning in detail; there may be several rounds of review before in-principle acceptance, and addressing reviewers’ comments may involve collecting pilot data. Given my limited experience, I would estimate that about 6-9 months would need to be added to the study timeline before one can count on in-principle acceptance and data collection can be started.

Of course, the increase in time that you spend before conducting the experiment will substantially improve the quality of the paper. A Registered Report is very likely to cut a lot of time at the end of the research cycle: when realising how long it may take to get in-principle acceptance, you should always bear in mind the painstakingly slow and frustrating process of submitting a paper to one journal after the other, accumulating piles of reviews varying in constructiveness and politeness, being told about methodological flaws that now you can’t fix, about how your results should have been different, and eventually unceremoniously throwing the study which you started with such great enthusiasm into the file-drawer.

Long-term benefits aside, unfortunately the issue of time remains for researchers on short-term contracts and with grant pressures. We could not think of any quick fix to this problem. In the long term, solutions may involve planning in this time when you write your next grant application. One possibility could be to write that you plan to conduct a systematic review during the time that you wait for in-principle acceptance. In my recently approved grant from the Deutsche Forschungsgemeinschaft, I proposed two studies: for the first study, I optimistically included a period of three months for “pre-registration and set-up”, and for the second study a period of twelve months (because this would happen in parallel to data collection for the first study). This somewhat backfired, because, while the grant was approved, they cut 6 months from my proposed timeline because they considered 12 months to be way too long for “pre-registration and set-up”. So, the strategy of planning for registered reports in grant applications may work, but bear in mind that it’s not risk-free.

A new thing that I learned about during the meeting is the Registered Report Research Grant: Here, journals pair up with funding agencies, and the review of the Registered Report happens in parallel to the review of the funding proposal. This way, once in-principle acceptance is in, the funding is released and data collection can start. This sounds like an amazingly efficient win-win-win solution, and I sincerely hope that funding agencies will routinely offer such grants.

How to encourage researchers to publish Registered Reports?
Here, I’ll list a few bits and pieces that were suggested as solutions. Some of these are aimed at the individual researcher, though many would require some top-down changes. The demographic most happy to try out a new publication system, as mentioned above, is likely to be early-career researchers, especially PhD students.

Members at the meeting reported positive experiences with department-driven working groups, such as the ReproducibiliTea initiative or Open Science Cafés. In some departments, such working groups have led to PhD students taking the initiative and proposing to their advisors that they would like to do their next study as a Registered Report. We discussed that encouraging PhD students to publish one study as a Registered Report could be a good recommendation. For departments which have formal requirements about the number of publications that are needed in order to graduate, a Registered Report could count more than a standard publication: let’s say, they either need to publish three standard papers, or one standard paper and a Registered Report (or two Registered Reports).

Deciding to publish a Registered Report is like jumping into cold water: the format requires some pretty big changes in the way that a study is conducted, and one is unsure if it will really impress practically important people (such as potential employers or grant reviewers) more than pumping out standard publications would. Taking a step back, taking a deep breath and thinking about the pros and cons, I would say that, in many cases, the advantages outweigh the disadvantages. Yes, the planning stage may take longer, but you will cut time at the end, during the publication process, with a much higher chance that the study will be published. A fun fact I learned during the meeting: At the journal Cortex, once a Registered Report gets past the editorial desk (i.e., the editors established that the paper fits the scope of the journal), the rejection rate is only 10% (which is why we need more journals adopting the Registered Report format: this way, any paper, including those of interest to a specialised audience, will be able to find a good home). And, once you have in-principle acceptance, you can list the paper on your CV, which is (to many professors) much more impressive than a list of "in preparation"/"submitted" publications. If the Stage 1 review process takes unusually long and you're running out of time in your contract, you can withdraw the Registered Report, incorporate the comments to date, and conduct the experiment as a Preregistered Study.

Some of the suggestions listed above are aimed at individual researchers. The meeting was encouraging and helpful in terms of getting some suggestions that could be applied here and now. It also made it clear that top-down changes are required: the Registered Report format involves a different timeline compared to standard submissions, so university expectations (e.g., in terms of the required number of publications for PhD students, short-term post-doc contracts) and funding structures need to be changed.

Wednesday, February 13, 2019

P-values 101: An attempt at an intuitive but mathematically correct explanation

P-values are those things that we want to be smaller than 0.05, as I knew them during my undergraduate degree, and (I must admit) throughout my PhD. Even if you’re a non-scientist or work in a field that does not use p-values, you’ve probably heard of terms like p-hacking (e.g., from this video by John Oliver). P-values and, more specifically, p-hacking, are getting the blame for a lot of the things that go wrong in psychological science, and probably other fields as well.

So, what exactly are p-values, what is p-hacking, and what does all of that have to do with the replication crisis? Here, a lot of stats-savvy people shake their heads and say: “Well, it’s complicated.” Or they start explaining the formal definition of the p-value, which, in my experience, to someone who doesn’t already know a lot about statistics, sounds like blowing smoke and nitpicking on wording. There are rules, which, if one doesn’t understand how a p-value works, just sound like religious rituals, such as: “If you calculate a p-value halfway through data collection, the p-value that you calculate at the end of your study will be invalid.”

Of course, the mathematics behind p-values is on the complicated side: I had the luxury of being able to take the time to do a graduate certificate in statistics, and learn more about statistics than most of my colleagues will have the time for. Still, I won’t be able to explain how to calculate the p-value without revising and preparing. However, the logic behind it is relatively simple, and I often wonder why people don’t explain p-values (and p-hacking) in a much less complicated way [1]. So here goes my attempt at an intuitive, maths-free, but at the same time mathematically correct explanation.

What are p-values?
A p-value is the conditional probability of obtaining the data or data more extreme, given the null hypothesis. This is a paraphrased formal text-book definition. So, what does this mean?

Conditional means that we’re assuming a universe where the null hypothesis is always true. For example, let’s say I pick 100 people and randomly divide them into two groups. Group 1 gets a placebo, and Group 2 also gets a placebo. Here, we’ve defined the null hypothesis to be true.

Then I conduct my study; I might ask the participants whether they feel better after having taken this pill and look at differences in well-being between Group 1 and Group 2. In the long run, I don’t expect any differences, but in a given sample, there will be at least some numerical differences due to random variation. Then – and this point is key for the philosophy behind the p-value – I repeat this procedure, 10 times, 100 times, 1,000,000 times (this is why the p-value is called a frequentist statistic). And, according to the definition of the p-value, I will find that, as the number of repetitions increases, the percentage of experiments where I get a p-value smaller than or equal to 0.05 will get closer and closer to 5%. Again, this works in our modeled universe where the null hypothesis is true.
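For readers who like to see this in action, the placebo-versus-placebo universe can be simulated in a few lines of Python. This is only a sketch under simplifying assumptions of mine: well-being scores are drawn from a normal distribution with a known standard deviation of 1, so that a simple z-test can stand in for whatever test one would actually run, and the group size and number of repetitions are arbitrary.

```python
import random
from statistics import NormalDist, mean


def two_group_p_value(group1, group2, sd=1.0):
    """Two-sided z-test p-value for a difference in means (sd assumed known)."""
    se = sd * (1 / len(group1) + 1 / len(group2)) ** 0.5
    z = (mean(group1) - mean(group2)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_experiments, n_per_group=50, alpha=0.05, seed=1):
    """Share of placebo-vs-placebo experiments that end with p <= alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # Both groups come from the same distribution: the null is true by design.
        group1 = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group2 = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_group_p_value(group1, group2) <= alpha:
            hits += 1
    return hits / n_experiments


rate = false_positive_rate(2000)
print(f"Proportion of significant results under the null: {rate:.3f}")
```

The printed proportion should hover around 0.05, and it gets closer to it the more repetitions one runs; every one of those significant results is a false positive.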

The second important aspect of the definition of the p-value is that it is the probability of the data, not of the null hypothesis. Why this is important to bear in mind can be, again, explained with the example above. Let’s say we conduct our experiment with the two identical placebo treatments, and get a p-value of 0.037. What’s the probability that the null hypothesis is false? 0%: We know that it’s true, because this is how we designed the experiment. What if we get a p-value of 0.0000001? The probability of the null hypothesis being false is still 0%.

If we’re interested in the probability of the hypothesis rather than the probability of the data, we’ll need to use Bayes’ Theorem (which I won’t explain here, as this would be a whole different blog post). The reason why the p-value is not interpretable as a probability about the hypothesis (null or otherwise) is that we need to consider how likely the null hypothesis is to be true. In the example above, if we do the experiment 1,000,000 times, the null hypothesis will always be true, by design, therefore we can be confident that every single significant p-value that we get is a false positive.

We can take a different example, where we know that the null hypothesis is always false. For example, we can take children between 0 and 10 years old, and correlate their height with their age. If we collect data repeatedly and accumulate a huge number of samples to calculate the p-value associated with the correlation coefficient, we will occasionally get p > 0.05 (the proportion of times that this will happen depends both on the sample size and the true effect size, and equals 1 − statistical power). So, if we do the experiment and find that the correlation is not significantly different from zero, what is the probability that the null hypothesis is false? It’s 100%, because everyone knows that children’s height increases with age.

In these two cases we know that, in reality, the null hypothesis is true and false, respectively. In an actual experiment, of course, we don’t know if it’s true or false – that’s why we’re doing the experiment in the first place. It is relevant, however, how many hypotheses we expect to be true in the long run.

If we study mainly crazy ideas that we had in the shower, chances are, the null hypothesis will often be true. This means that the posterior probability of the null hypothesis being false, even after obtaining a p-value smaller than 0.05, will still be relatively small.

If we carefully build on previous empirical work, design a methodologically sound study, and derive our predictions from well-founded theories, we will be more likely to study effects where the null hypothesis is often false. Therefore, the posterior probability that the null hypothesis is false will be greater than for our crazy shower idea, even if we get an identical p-value.
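
To put rough numbers on this, here is a minimal sketch of the Bayes' Theorem calculation. It conditions on "the result was significant" rather than on one exact p-value, and the prior probabilities and the power of 0.8 are assumptions picked purely for illustration.

```python
def prob_null_false_given_significant(prior_null_false, power, alpha=0.05):
    """P(null is false | p < alpha), by Bayes' Theorem.

    prior_null_false: share of studied hypotheses for which the null is false.
    power: P(p < alpha | null is false).
    alpha: P(p < alpha | null is true).
    """
    true_positives = power * prior_null_false
    false_positives = alpha * (1 - prior_null_false)
    return true_positives / (true_positives + false_positives)

# Assumed priors: 5% of shower ideas are real effects, 60% of well-founded ones.
shower_idea = prob_null_false_given_significant(prior_null_false=0.05, power=0.8)
well_founded = prob_null_false_given_significant(prior_null_false=0.60, power=0.8)

print(f"Crazy shower idea:   {shower_idea:.2f}")
print(f"Well-founded theory: {well_founded:.2f}")
```

With a prior of 0 (the placebo example) the function returns 0, and with a prior of 1 (the height example) it returns 1, no matter what the data say.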

So what about p-hacking?
That was the crash course on p-values, so now we turn to p-hacking. P-hacking practices are often presented as lists in talks about open science, and include:
-       Optional stopping: Collecting N participants, calculating a p-value, and deciding whether or not to continue with data collection depending on whether this first peek gives a significant p-value or not,
-       Cherry-picking of outcome variables: Collecting many potential outcome variables in a given experiment, and deciding on or changing the outcome-of-interest after having looked at the results,
-       Determining or changing data cleaning procedures conditional on whether the p-value is significant or not,
-       Hypothesising after results are known (HARKing): First collecting the data, and then writing a story to fit the results and framing it as a confirmatory study.

The reasons why each of these is wrong are not intuitive when we think about data. However, they become clearer when we think of a different example: Guessing the outcome of a coin toss. We can use this different example, because both the correctness of a coin toss guess and the data that we collect in an experiment are random variables: they follow the same principles of probability.

Imagine that I want to convince you that I have clairvoyant powers, and that I can guess the outcome of a coin toss. You will certainly be impressed, and wonder whether there might be something to it, if I toss the coin 10 times, and correctly guess the outcome every single time. You will be less impressed, however, if I toss the coin until I guess ten outcomes in a row. Of course, if you’re not present, I can instead make a video recording, cut out all of my unsuccessful attempts, and leave in only the ten correct guesses. From a probabilistic perspective, this would be the same as optional stopping.2
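
The same logic can be checked for p-values directly. The sketch below compares a fixed sample size with optional stopping when the null hypothesis is true. The design is my own assumption for illustration: a z-test with known standard deviation, a peek after every 10 participants per group, and a maximum of 100 per group.

```python
import random
from statistics import NormalDist

ALPHA = 0.05
STANDARD_NORMAL = NormalDist()

def p_from_sums(sum_a, sum_b, n):
    """Two-sided z-test p-value from running sums (sd = 1, n per group)."""
    z = (sum_a / n - sum_b / n) / (2 / n) ** 0.5
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))

def false_positive_rate(peek, n_experiments=2_000, batch=10, max_n=100, seed=1):
    """Share of null experiments that end with p < ALPHA."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        sum_a = sum_b = 0.0
        n = 0
        while n < max_n:
            for _participant in range(batch):
                sum_a += rng.gauss(0, 1)
                sum_b += rng.gauss(0, 1)
            n += batch
            # With peeking we test after every batch; otherwise only at the end.
            if (peek or n == max_n) and p_from_sums(sum_a, sum_b, n) < ALPHA:
                hits += 1
                break  # stop collecting data as soon as the result is significant
    return hits / n_experiments

fixed_n_rate = false_positive_rate(peek=False)
optional_stopping_rate = false_positive_rate(peek=True)

print(f"Fixed N:           {fixed_n_rate:.3f}")
print(f"Optional stopping: {optional_stopping_rate:.3f}")
```

With a fixed sample size, the false positive rate stays near 5%. With ten peeks, it should come out several times higher, even though every individual test looks perfectly ordinary.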

Now to cherry-picking of outcome variables: In order to convince you of my clairvoyant skills, I declare that I will toss the coin 10 times and guess the outcome correctly every single time. I toss the coin 10 times, and guess correctly 7 times out of 10. I can back-flip and argue that I still guessed correctly more often than incorrectly, and therefore my claim about my supernatural ability holds true. But you will not be convinced at all.

The removal of outliers, in the coin-toss metaphor, would be equivalent to discarding every toss where I guess incorrectly. HARKing would be equivalent to tossing the coin multiple times, then deciding on the story to tell later (“My guessing accuracy was significantly above chance”/ “I broke the world record in throwing the most coin tosses within 60 seconds” / “I broke the world record in throwing the coin higher than anyone else” / “I’m the only person to have ever succeeded in accidentally hitting my eye while tossing a coin!”). 

So, if you're unsure whether a given practice can be considered p-hacking or not, try to think of an equivalent coin example, where it will be more intuitive to decide whether the reasoning behind the analysis or data processing choice is logically sound or not. 

More formally, when the final analysis choice is conditional on the results, it messes with the frequentist properties of the statistic (i.e., the p-value is designed to have certain properties which hold true if you repeat a procedure an infinite number of times, which no longer hold true if you add an unaccounted conditional term to the procedure). Such an additional conditional term could be based on the p-value (we will continue with data collection under the condition that p > 0.05), or it could be based on descriptive statistics (we will calculate a p-value between two participant groups out of ten under the condition that the graph shows the biggest difference between these particular groups).
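
The second kind of conditional term, choosing which groups to compare after looking at the graph, can be sketched the same way. Here, ten groups are drawn from one and the same distribution, and only the two groups with the most extreme means are tested against each other. The group size and the use of a z-test with known standard deviation are assumptions for illustration.

```python
import random
from statistics import NormalDist, mean

def biggest_gap_false_positive_rate(n_experiments=2_000, n_groups=10,
                                    n_per_group=20, seed=1):
    """Share of null experiments with p < 0.05 when only the two most
    extreme groups (picked after seeing the data) are compared."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_experiments):
        # All groups share the same true mean: the null is true for every pair.
        group_means = [
            mean([rng.gauss(0, 1) for _ in range(n_per_group)])
            for _group in range(n_groups)
        ]
        # The "graph" shows us which two groups differ most; we test only those.
        biggest_gap = max(group_means) - min(group_means)
        z = biggest_gap / (2 / n_per_group) ** 0.5
        p = 2 * (1 - NormalDist().cdf(z))
        if p < 0.05:
            significant += 1
    return significant / n_experiments

cherry_picked_rate = biggest_gap_false_positive_rate()
print(f"False positive rate after picking the biggest gap: {cherry_picked_rate:.2f}")
```

The p-value formula is unchanged, but the procedure that produced it is not the one the formula assumes, so the 5% guarantee no longer holds.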

Some final thoughts
I hope that readers who have not heard explanations of p-values that made sense to them found my attempt useful. I also hope that researchers who often find themselves in a position where they explain p-values will see my explanations as a useful example. I would be happy to get feedback on my explanation, specifically about whether I achieved these goals: I occasionally do talks about open science and relevant methods, and I aim to streamline my explanation such that it is truly understandable to a wide audience.

It is not the case that people who engage in p-hacking practices are incompetent or dishonest. At the same time, it is also not the case that people who engage in p-hacking practices would not do so if they learned to rattle off the formal definition of a p-value in their sleep. But, clearly, fixing the replication crisis involves improving the general knowledge of statistics (including how to correctly use p-values for inference). There seems to be a general consensus about this, but judging by many Twitter conversations that I’ve followed, there is no consensus about how much statistics, say, a researcher in psychological science needs to learn. After all, time is limited, and there is much other general as well as topic-specific knowledge that needs to be acquired in order to do good research. If I’m asked how much statistics psychology researchers should know, my answer is, as always: “As much as possible”. But a wide-spread understanding of statistics is not going to be achieved by wagging a finger at researchers who get confused when confronted with the difference between the probability of the hypothesis versus the probability of the data conditional on the hypothesis. Instead, providing intuitive explanations should not only improve understanding, but also show that, on some level (that, arguably, is both necessary and sufficient for researchers to do sound research), statistics is not an impenetrable swamp of maths.

Edit (18/2/2019): If you prefer a different format of the above information, here is a video of me explaining the same things in German.

1 There are some intuitive explanations that I’ve found very helpful, like this one by Dorothy Bishop, or in general anything by Daniel Lakens (many of his blog entries, his coursera course). If he holds any workshops anywhere near you, I strongly recommend going. In 2015, I woke up at 5am to get from Padova to Daniel's workshop in Rovereto, and at the end of the day, I was stuck in Rovereto because my credit card had been blocked and I couldn't buy a ticket back to Padova. It was totally worth it!
2 Optional stopping does not involve willingly excluding relevant data, while my video recording example does. Unless disclosed, however, optional stopping does involve the withholding of information that is critical in order to correctly interpret the p-value. Therefore, I consider the two cases to be equivalent.

Tuesday, January 8, 2019

Scientific New Year’s Resolutions

Last year, I wrote an awful lot of blog posts about registered reports. My first New Year’s Resolution is to scale down my registered reports activism, which I start off by writing a blog post about my new year’s resolutions, where I’ll try hard not to mention registered reports more than 5 times.

Scientifically, last year was very successful for me, with the major event being a grant from the DFG, which will give me a full-time position until mid-2021 and the opportunity to work on my own project. This gives me my second New Year’s Resolution, which is to focus on the project that I’m supposed to be working on, and not to get distracted by starting any (or too many) new side projects.

Having a full-time job means I’ll have to somewhat reduce the amount of time I’ll spend on Open Science, compared to the past few years. However, Open Science is still very important to me: I see it as an integral part of the scientific process, and consequently as one of the things I should do as part of my job as a researcher. As a Freies Wissen Fellow, an ambassador of the Center for Open Science, and a member of the LMU Open Science Center, I have additional motivation and support in improving the openness of my own research and helping others to do the same. My third New Year’s Resolution is to start prioritising various Open Science projects.

In order to prioritise, I need to select some areas which I think are the most effective in increasing the openness of research. In my experience, talks and workshops are particularly effective (in fact, that’s how I became interested in Open Science). Last year, I gave talks as part of two Open Science Workshops (in Neuchâtel and Linz), and at our department, I organised an introduction to R (to encourage reproducible data analyses). The attendance of these workshops was quite good, and the attendees seemed very interested and motivated, which further supports my hypothesis about the efficiency of such events. I hope to hold more workshops this year: so far, I have one invitation for a workshop in Göttingen.

I still have two mentions of Registered Reports left for this blogpost (oops, just one now): as I think that they provide one of the most efficient ways to mitigate the biases that make a lot of psychological science non-replicable, I will continue trying to encourage journals to accept them as an additional publication format. I have explained why I think that this is important here, here, here, and here [in German], and how this can be achieved here and here. Please note that we’re still accepting signatures for the letters to editors and for an open letter, for more information see here.

As a somewhat less effective thing that I did in the previous years to support open science, I had signed all of my peer reviews. I am now wondering if it’s not doing more harm than good. Signing my reviews is a good way to force myself to stay constructive, but sometimes there really are mistakes in a paper that, objectively speaking, mean that the conclusions are wrong. One cannot blame authors for being upset at the reviewers for pointing this out – after all, we’ve all experienced this ourselves, and so many things depend on the number of publications. And, as much as I try to convince myself to see the review process as a constructive discussion between scientists, perhaps there is a good reason that peer review often happens anonymously, especially for an early career researcher.

As my fourth New Year’s Resolution, I will strongly reduce my use of Twitter compared to the last few years. I’ve learned a lot through Twitter, especially about statistics and open science. Lately, however, I started feeling like a lot of the discussions are repeating; the Open Science community has grown since I joined Twitter, leading to interpersonal conflicts within this community that have little to do with the common goal of improving reproducibility and replicability of research. A few months ago, I created a lurking account, where I will follow only a few reading researchers: this way, I can compulsively check Twitter without scrolling down endlessly, and I can keep up-to-date with any developments that are discussed in my actual field of research. So far, I really don’t have the feeling that I’m missing out, though I still check my old Twitter account occasionally, especially when I get notifications.

My fifth New Year’s Resolution is to continue learning, especially about programming and statistics. The more I learn, the more I realise how little I know. However, looking back, I also realise that I can now do things that I couldn’t even have dreamed of a couple of years ago, and it’s a nice feeling (and it substantially improves the quality of my work).

My final New Year’s Resolution goes both for my working life and for my personal life: Be nice: judge less, focus on the good things, not on the bad, take actions instead of complaining about things, be constructive in criticism. 

So here's to a good 2019! 

Oh, and I have one left: Please support Registered Reports!

Friday, December 28, 2018

Realising Registered Reports Part 3: Dispelling some myths and addressing some FAQs

While collecting more signatures for our Registered Reports campaign, I spent some time arguing with people on the internet – something that I don’t normally do. Internet arguments, as well as most arguments in real life, tend to be about people with opposing views shouting at each other, then going away without changing their opinions in the slightest. I’m no exception: nothing that people on the internet have told me so far has convinced me that registered reports are a bad idea. But as their arguments kept repeating, I decided to write a blog post with my counter-arguments. I hope that laying out my arguments and counter-arguments in a systematic manner will be more effective in showing what exactly we want to achieve, and why we think that registered reports are the best way to reach these goals. And, if nothing else, I can refer people on the internet to this blog post rather than reiterating my points in the future.

What do we actually want?
For a registered report, the authors write the introduction and methods section, as well as a detailed analysis plan, and submit the report to a journal, where it undergoes peer review. Once the reviewers are happy with the proposed methods (i.e., they think the methods give the authors the biggest chance to obtain interpretable results), the authors get conditional acceptance, and can go ahead with data collection. The paper will be published regardless of the results. Registered Reports is not synonymous with Preregistration: For the latter, you also write a report before collecting the data, but rather than submitting it to a journal for peer review, you upload a time-stamped, non-modifiable version of the report, and include a link to it in the final report.
The registered reports format is not suited for all studies, and at this point, it is worth stressing that we don’t want to ban the publishing of any study which is not pre-registered. This is an important point, because it is my impression that at least 90% of the criticisms of registered reports assume that we want them to be the only publication format. So, once again: We merely want journals to offer the possibility to submit registered reports, alongside the traditional formats that they already offer. Exploratory studies or well-powered multi-experiment studies offer a lot of insights, and banning them and publishing only registered reports would be a very stupid idea. Among the supporters of registered reports that I’ve met, I don’t think a single one would disagree with me on this account.
Having said this, I think there are studies for which the majority of people (scientists and the general public alike) would really prefer for registered reports and pre-registration to be the norm. I guess that very few people would be happy to take some medication, when the only study supporting its effectiveness is a non-registered trial conducted by the pharmaceutical company that sells this drug, and shows that it totally works and has absolutely no side effects. In my research area, some researchers (or pseudo-researchers) occasionally produce a “cure” for dyslexia, for example in the form of a video game that supposedly trains some cognitive skill that is supposedly deficient in dyslexia. This miracle cure then gets sold for a lot of money, thus taking away both time and money from children with dyslexia and their parents. For studies showing the effectiveness of such “cures”, I think it would be justifiable, from the side of the parents, as well as the tax payers who fund the research, to demand that such studies are pre-registered.
To reiterate: Registered reports should absolutely not be the only publication format. But when it comes to making important decisions based on research, we should be as sure as possible that the research is replicable. In my ideal world, one should conduct a registered study before marketing a product or basing a policy on a result.

Aren’t there better ways to achieve a sound and reproducible psychological science?
Introducing the possibility of publishing registered reports also by no means suggests that other ways of addressing the replication crisis are unnecessary. Of course, it’s important that researchers understand methodological and statistical issues that can lead to a result being non-replicable or irreproducible. Open data and analysis scripts are also very important. If researchers treated a single experiment as a brick in the wall rather than a source of the absolute truth, if there was no selective publication of positive results, if there was transparency and data sharing, and researchers would conduct incremental studies that would be regularly synthesised in the form of meta-analyses, registered reports might indeed be unnecessary. But until we achieve such a utopia, registered reports are arguably the fastest way to work towards a replicable and reproducible science. Perhaps I’m wrong here: I’m more than happy to be convinced that I am. If there is a more efficient way to reduce publication bias, p-hacking, and HARKing, I will reduce the amount of time I spend pushing for registered reports, and start supporting this more efficient method instead.

Don’t registered reports take away the power from the authors?
As an author, you submit your study before you even collect the data. Some people might perceive it as unpleasant to get their study torn to pieces by reviewers before it’s even finished. Others worry that reviewers and editors get more power to stop studies that they don’t want to see published. The former concern, I guess, is a matter of opinion. As far as I’m concerned, the more feedback I get before I collect data, the more likely it will be that any flaws in the study will be picked up. Yes, I don’t love being told that I designed a stupid study, but I hate even more when I conduct an experiment only to be told (or realise myself) that it’s totally worthless because I overlooked some methodological issue. Other people may be more confident in their ability to design a good study, which is fine: They don’t have to submit their studies as registered reports if they don’t want to.
As for the second point: Imagine that you submit a registered report, and the editor or one of the reviewers says: “This study should not be conducted unless the authors use my/my friend’s measurement of X.” In this way, the argument goes, the editor has the power to influence not only what kind of studies get published, but even what kind of studies are being conducted. Except, if many journals publish registered reports, the authors can simply submit their registered report to a different journal, if they think that the reviewer’s or editor’s request is driven by politics rather than scientific considerations. This is why I’m trying to encourage as many journals as possible to start offering registered reports.
Besides, if we compare this to the current situation, I would argue that the power that editors and reviewers have would either diminish or stay the same. It doesn’t matter how many studies get conducted, if (as in the current system) many of them get rejected on the basis of “they don’t fit my/my friend’s pet theory”. Let’s say I want to test somebody’s pet theory, or replicate some important result. In my experience, original authors genuinely believe in their work: chances are, they will be supportive during the planning stage of the experiment. Things might look different if the results are already in, and the theory is not supported: then, they often try to find any reason to convince the editor that the study is flawed and that the authors are incompetent.
As an example, let’s imagine the worst case scenario: You want to replicate Professor X’s ground-breaking study, but you don’t know that Professor X actually fabricated the data. It’s in Professor X’s interest to prevent any replication of this original study, because it would likely show that it’s not replicable. As the replicator, you can submit a registered report to journal after journal after journal, and it is likely that Professor X will be asked to review it. Sure, it’s annoying, but at some stage you’re likely to either find an editor who sees through Professor X’s strategy, or Professor X will be too busy to review. Either way, I don’t see how this is different from the usual game of get-your-study-published-in-the-highest-possible-IF-journal that we all know and love in the traditional publication system.
And, if you really can’t get your registered report accepted and you think it’s for political reasons, you can still conduct the study. I will bet that the final version of the study, with the data and conclusion, will be even more difficult to publish than the registered report. But at least you’ll be able to publish it as a preprint, which would be an extremely valuable addition to the literature.

I’m still not convinced.
That’s fine – we can agree to disagree. I would be very happy if you could add any further arguments against registered reports in the comments section below, that cannot be countered by the following points:
(1) Other publication formats should continue to be available, even if registered reports become the norm for some types of studies, and
(2) Definitely, we need to continue working on other ways to address the replication crisis, such as improving statistics courses for undergraduate and graduate psychology programs.

I think that registered reports are a good idea. What can I do to support registered reports?*
I’m very happy to hear that! See here (about writing letters to editors) and here (being a signatory on an open letter and letters to editors). 

* I wish this was a FAQ...

Edit (1.8.2019): In response to a comment, I've added a sentence about preregistration, and how it differs from RRs.

Monday, December 3, 2018

Realising Registered Reports: Part 2

TL;DR: Please support Registered Reports by becoming a signatory on letters to editors by leaving your name here.

The current blog post is a continuation of a blogpost that I wrote half a year ago. Back then, frustrated at the lack of field-specific journals which accept Registered Reports, I’d decided to start collecting signatures from reading researchers and write to the editors of journals which publish reading-related research, to encourage them to offer Registered Reports as a publication format. My exact procedure is described in the blog post (linked above).

On the 18th of June, I contacted 44 journal editors. Seven of the editors wrote back to promise that they would discuss Registered Reports at the next meeting of the editorial board. One had already started piloting this new format (Discourse Processes), and one declined for understandable organisational reasons. To my knowledge, two reading-related journals have already taken the decision to offer Registered Reports in the future: the Journal of Research in Reading, and the Journal of the Scientific Studies of Reading. So, there is change, but it’s slow.

At the same time as I was collecting signatories and sending emails to editors about Registered Reports, a number of researchers decided to do the same. A summary of all journals that have been approached by one of us, and the journals’ responses, can be found here. A few months ago, we decided to join forces.

First, as a few of us were from the area of psycholinguistics, we decided to pool our signatories, and continue writing to editors in this field. The template on which the letters to the editors would be based can be found here.

Second, we decided to start approaching journals which are aimed at a broader audience and have a higher impact. Here, our pooled lists would already contain hundreds of researchers who would like to see more Registered Reports offered by academic journals. The first journal that we aim to contact is PNAS: As they recently announced that they will be offering a new publication format (Brief Reports), we would like to encourage them to consider also adding Registered Reports to their repertoire. The draft letter to the editor can be found here.

Third, we also decided to write an open letter, addressed to all editors and publishers. The ambition is to publish it in Nature News and Comments or a similar outlet. The draft letter can be found here.

I am writing this blog post, because we’re still on the lookout for signatories. You can choose to support all three initiatives, or any combination of them, by taking two minutes to fill in this Google form. You can also email me directly – whichever is easier for you. Every signatory matters: from any field, any career stage, any country. It would also be helpful if you could forward this link to anyone who you think might support this initiative. I’m also happy to answer any questions, take in suggestions, or discuss concerns about this format.