Friday, May 24, 2019

The perfect article

Last year, I went to an R Ladies event. This event took place at the Süddeutsche Zeitung, one of the biggest and most serious newspapers in Germany. The workshop was presented by two R Ladies from the data-driven journalism department of the newspaper. The event was extremely interesting: as it turns out, the job of a data-driven journalist is to collect or find data, and present it to the readers in an understandable way. One project which was presented included an analysis of the transcripts from the Bundestag meetings, presented in easy-to-digest graphs. Another project contained new data on the very socially relevant question of housing prices in Germany.

Throughout the event, I kept thinking: They are much further along in terms of open communication than we are. As an essential part of their job, data-driven journalists need to present often complex data in a way that any interested reader can interpret. At the same time, the R Ladies at the event kept emphasising that the data and R/RMarkdown scripts were publicly available, for anyone who doubted their conclusions or wanted to try out things for themselves.

This brings me to the idea of what the perfect article would look like. I guess you know where this is going, but before I go there, to avoid disappointment, I will add that, in this blog post, I will not give any advice on how to actually write such a perfect article, nor how to achieve a research world where such articles will be the norm. I will just provide a dreamer’s description of a utopian world, and finish off with some questions that I have no answer for.

The perfect article would have a pyramidal structure. At the top layer would be a description of the study, written at a level that a high school student could understand. The data could be presented in an interactive shiny app, and there would be easy-to-read explanations of the research question, its importance, how the data should be interpreted to answer this research question, and any limitations that may affect the interpretation of the data.

Undergraduate students in the field of study (or very interested readers) would be navigated to a more detailed description of the study, which describes the research methods in more detail. Here, the statistical analyses and the theoretical relevance would need to be explained, and a more thorough description of methodological limitations should be provided.

The next level would be aimed at researchers in the field of study. Here, the study would need to be placed in relation to previous work on this topic, and a very thorough discussion of the theoretical implications would be needed.

The final level would include all the data, all the materials, and all the analysis scripts. This level would be aimed at researchers who plan to build on this work. It would allow them to double-check that the results are robust and that there are no mistakes in the data analysis. They would also be able to get the materials, allowing them to build as closely as possible on previous work.

Even in an ideal world, this format would not be suitable for all fields. For example, in theoretical mathematics, it would probably be very difficult to come up with a project that could be explained to a lay audience through a shiny app. More applied mathematics could, however, be presented as the deeper layers of a project where these methods are applied.

Many practical concerns jump out of my perfect article proposal. Most obviously, an article of this form would be unsuitable for a paper format. It would be relatively straightforward to implement in online journals, but doing so would require expertise that not all academic authors have. (In fact, I would guess: an expertise that most academic authors don’t have.) Even for those who do have the skills, it would require much more time, and as we all know, time is something that we don’t have, because we need to publish in large quantities if we want to have a job. Another issue with this format is that many studies are incremental, and they would not be at all interesting to a general audience. So why spend time on creating the upper layers of the pyramid?

A solution to the last issue would be to completely re-think the role that papers have in the academic process. Instead of publishing papers, the mentality could switch to publishing projects. Often, a researcher or lab is concerned with a broader research question. Perhaps what would be, in our current system, ten separate publications could be combined to make a more general point about such a broad research question, which would be of interest to the general public. Such a switch in mindset would also give researchers a greater sense of purpose, as they would need to keep this broad research question in the back of their minds while they conduct separate studies.

Another question would fall out of this proposal to publish projects rather than individual studies: What would happen with authorship? If five different PhD students conducted the individual studies, some of them would need to give up their first authorship if their work is combined into a single project. Here, the solution would be to move away from an authorship model, and instead list each researcher’s contribution along with the project’s content. And, as part of the team, one could also find a programmer (or data-driven journalist), who would be able to contribute to the technical side of presenting the content, and to making sure that the upper layers of the presentation are really understandable to the intended audience.

The problem would remain that PhD students would go without first authorship. But, in an ideal world, this would not matter, because their contributions to the project would be clearly acknowledged, and potential employers could actually judge them based on the quality, not the quantity of their work. In an ideal world…

Thursday, May 16, 2019

Why I stopped signing my reviews

Since the beginning of this year, I have stopped signing my peer reviews. I had systematically signed my reviews for a few years: I think I started this at the beginning of my first post-doc, back in 2015. My reasons for signing were the following: (1) Science should be about an open exchange of ideas. I have previously benefitted from signed reviews, because I could contact the reviewer with follow-up questions, which has resulted in very fruitful discussions. (2) Something ideological about open science (I don’t remember the details). (3) As an early career researcher, one is still very unknown. Signing reviews might help colleagues to associate your name with your work. As for the drawbacks, there is the often-cited concern that authors may want to take revenge if they receive a negative review, and even in the absence of any bad intentions, they may develop implicit biases against you. I weighed this disadvantage against the advantages listed above, and I decided that it was worth the risk.

So then, why did I stop? There was a specific review that made me change my mind, because I realised that by signing reviews, one might get into all kinds of unanticipated awkward situations. I will recount this particular experience, of course, removing all details to protect the authors’ identity (which, by the way, I don’t know, but perhaps others might be able to guess with sufficient detail).

A few months ago, I was asked to review a paper about an effect, which I had not found in one of my previous studies. This study reported a significant effect. I could not find anything wrong with the methods or analyses, but the introduction was rather biased, in the sense that it cited only studies that did show this effect, and did not cite my study. I asked the authors to cite my study. I also asked them to provide a scatterplot of their data.

The next version of this manuscript that I received included the scatterplot, as I’d asked, and a citation of my study. Except, my study was cited in the following context (of course, fully paraphrased): “The effect was found in a previous study (citation). Schmalz et al. did not find the effect, but their study sucks.” At the same time, I noticed something very strange about the scatterplot. After asking several stats-savvy colleagues to verify that this strange thing was, indeed, very strange, I wrote in my review that I don’t believe the results, because the authors must have made a coding error during data processing.

I really did not like sending this review, because I was afraid that it would look (both to the editor and to the authors) like I had picked out a reason to dismiss the study because they had criticised my paper. However, I had signed my previous review, and whether or not I would sign during this round, it would be clear to the authors that it was me.

In general, I still think that signing reviews has a lot of advantages. Whether the disadvantages outweigh the benefits depends on each reviewer’s preference. For myself, the additional drawback that there may be unexpected awkward situations that one really doesn’t want to get into as an early career researcher tipped the balance, but it’s still a close call.

Thursday, April 4, 2019

On being happy in academia

tl;dr: Don’t take your research too seriously.

I like reading blog posts with advice about how to survive a PhD, things one wished one had known before one started a PhD, and other similar topics. Here goes my own attempt at writing such a blog post. I’m not a PhD student anymore, so I can’t talk about my current PhD experiences, nor am I a professor who can look back and list all of the personal mistakes and successes that have led to “making it” in academia. It has been a bit over 4 years since I finished my PhD and started working as a post-doc, and comparing myself now and then, I realise that I’m happier working in academia now. This is not to say that I was ever unhappy during my time in academia, but some changes in attitude have led to – let’s say – a healthier relationship to my research. This is what I would like to write this blog post about.

Don’t let your research define you
In the end, all of the points below can be summarised as: Don’t take your research too seriously. Research inevitably involves successes and failures; everybody produces some good research and some bad research, and it’s not always easy for the researcher to decide which it is at the time. So there will always be criticism, some of it justified, some of it reflecting the bad luck of meeting Reviewer 2 on a bad day.

Receiving criticism has become infinitely easier for me over the years: after getting an article rejected, it used to take at least one evening of moping and a bottle of wine to recover, while now I only shrug. It’s difficult to identify exactly why my reaction to rejection changed over time, but I think it has something to do with seeing my research less as an integral part of my identity. I sometimes produce bad research, but this doesn’t make me a bad person. This way, even if a reviewer rightfully tears my paper to shreds, my ego remains intact.

Picking a research topic
Following up from the very abstract point above, I’ll try to isolate some more concrete suggestions that, in my case, may or may not have contributed to my changed mindset. The first one is about picking a research topic. At the beginning of my PhD, I wanted to pick a topic that is of personal relevance, such as bilingualism or reading in different orthographies. Then, becoming more and more cynical about the research literature, I started following up on topics where I’d read a paper and think: “That’s gotta be bullshit!”

Now, I’ve moved away from both approaches. On the one hand, picking a topic that one is too passionate about can, in my view, lead to a personal involvement which can (a) negatively impact one’s ability to view the research from an objective perspective, and (b) become an unhealthy obsession. To take a hypothetical example: if I had followed up on my interest in bilingualism, it is – just theoretically – possible that I would consistently find that being bilingual comes with some cognitive disadvantages. As someone who strongly believes in the benefit of a multilingual society, it would be difficult for me to objectively interpret and report my findings.

On the other hand, focussing on bad research can result in existential crises, anger at poor researchers, a permanently bad mood, and from a practical perspective, annoying some people with high statuses while having a relatively small impact on improving the state of the literature.

My conclusion has been that it’s good to choose topics that I find interesting, where there is good ground work, and where I know that, no matter what the outcome of my research, I will be comfortable to report it.

Working 9-to-5
My shift in mindset coincides with having met my husband (during my first post-doc in Italy). As a result, I started spending less time working outside of office hours. Coming home at a reasonable time, trying out some new hobbies (cross-country skiing, hiking, cycling), and spending weekends together or catching up with my old hobbies (music, reading) distracts from research, in a good way. When I get to work, I can approach my research with a fresh mind and potentially from a new perspective.

Having said this, I’ve always been good at not working too hard, which is probably the reason why I’ve always been pretty happy during my time in academia. (Having strong Australian and Russian cultural ties, I have both the “she’ll be right” and the “авось повезёт” attitudes. Contrary to popular belief, a relaxed attitude towards work is also compatible with a German mindset: in Germany, people tend to work hard during the day, but switch off as soon as they leave the office.) At the beginning of my PhD, one of the best pieces of advice that I received was to travel as much as possible. I tried to combine my trips with lab or conference visits, but I also spent a lot of time discovering new places and not thinking about research at all. During my PhD in Sydney, I also pursued old and new hobbies: I joined a book club, an orchestra, a French conversation group, took karate lessons, and thereby met lots of great people and have many good memories from my time in Sydney.

Stick to your principles
For me, this point is especially relevant from an Open Science perspective. Perhaps, if I spent less time on doing research in a way that is acceptable for me, I’d have double the number of publications. This could, of course, be extremely advantageous on the job market. On the flip side, there are also more and more researchers who value quality over quantity: a job application and CV with lots of shoddy publications may be valued by some professors, but may be immediately trashed by others who are more on board with the open science movement.

The moral of this story is: One can’t make everyone happy, so it’s best to stick to one’s own principles, which also has the side effect that you’ll be valued by researchers who share your principles. 

A project always takes longer than one initially thinks
Writing a research proposal of any kind involves writing a timeline. In my experience, the actual project will always take much longer than anticipated, often due to circumstances beyond your control (e.g., recruitment takes longer than expected, collaborators take a long time to read drafts). For planning purposes, it’s good to add a couple of months to account for this. And if you notice that you can’t keep up with your timeline: that’s perfectly normal.

Have a backup plan
For a long time, I saw the prospect of leaving academia as the ultimate personal failure. This changed when I made the decision that my priority is to work within commutable distance of my husband, which, in the case of an academic couple, may very well involve one or both leaving academia at some stage. It helped to get a more concrete idea of what leaving academia would actually mean. It is ideal if there is a “real world” profession where one’s research experience would be an advantage. In my case, I decided to learn more about statistics and data science. In addition to opening job prospects that sound very interesting and involve a higher salary than the one I would get in academia, it gave me an opportunity to learn things that helped take my research to a different level.

Choosing a mentor
From observing colleagues, I have concluded that the PhD supervisor controls at least 90% of a student’s PhD experience. For prospective PhD students, my advice would be to be very careful in choosing a supervisor. One of the biggest warning signs (from observing colleagues’ experiences) is a supervisor who reacts negatively when a (female) PhD student or post-doc decides to start a family. If you get the possibility to talk to your future colleagues before starting a PhD, ask them about their family life, and how easy they find it to combine family with their PhD or post-doc work. If you’re stuck in a toxic lab, my advice would be: Get out as soon as you can. Graduate as soon as possible and get a post-doc in a better lab; start a new PhD in a better lab, even if it means losing a few years; leave academia altogether. I’ve seen friends and colleagues getting long-lasting physical and psychological health problems because of a toxic research environment: nothing is worth going through this.

Having a backup plan, as per the point above, could be particularly helpful in getting away from a toxic research environment. Probably one would be much less willing to put up with an abusive supervisor if one is confident that there are alternatives out there.

Choosing collaborators
Collaborators are very helpful when it comes to providing feedback about aspects that you may not have thought about. One should bear in mind, though, that they have projects of their own: chances are, they will not be as enthusiastic about your project as you are, and may not have time to contribute as much as you expect. This is good to take into account when planning a project: assuming that you will need to do most of the work yourself will reduce misunderstandings and stress due to the perception of collaborators not working hard enough on this project.

Be aware of the Imposter Syndrome
During my PhD, there were several compulsory administrative events that, at the time, I thought were a waste of time. Among other things, we were told about the imposter syndrome at one such event (also, we were given the advice to travel as much as possible, by a recently graduated PhD student). It was relatively recently that I discovered that many other early-career researchers have never heard of the imposter syndrome, and often feel inadequate, guilty, and tired from their research. Putting a label on this syndrome may help researchers to become more aware that most people often feel like an imposter in academia, and to take this feeling less seriously.

Thursday, March 14, 2019

Why I plan to move away from statistical learning research

Statistical learning is a hot topic, with papers about a link between statistical learning ability and reading and/or dyslexia mushrooming all over the place. In this blog post, I am very sceptical about statistical learning, but before I continue, I should make it clear that it is, in principle, an interesting topic, and there are a lot of studies which I like very much.

I’ve published two papers on statistical learning and reading/dyslexia. My main research interest is in cross-linguistic differences in skilled reading, reading acquisition, and dyslexia, which was also the topic of my PhD. The reason why, during my first post-doc, I became interested in the statistical learning literature, was, in retrospect, exactly the reason why I should have stayed away from it: It seemed relevant to everything I was doing.

From the perspective of cross-linguistic reading research, statistical learning seemed to be integral to understanding cross-linguistic differences. This is because the statistical distributions underlying the print-to-speech correspondences differ across orthographies: in orthographies such as English, children need to extract statistical regularities such as the letter “a” often being pronounced as /ɔ/ when it follows a “w” (e.g., in “swan”). The degree to which these statistical regularities provide reliable cues differs across orthographies: for example, in Finnish, letter-phoneme correspondences are reliable, such that children don’t need to extract a large number of subtle regularities in order to be able to read accurately.

From a completely different perspective, I became interested in the role of letter bigram frequency during reading. One can count how often a given letter pair co-occurs in a given orthography. The question is whether the average (or summed) frequency of the bigrams within a word affects the speed with which this word is processed. This is relevant to psycholinguistic experiments from a methodological perspective: if letter bigram frequency affects reading efficiency, it’s a factor that needs to be controlled while selecting items for an experiment. Learning the frequency of letter combinations can be thought of as a sort of statistical learning task, because it involves the conditional probability of one letter given the other.
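
To make the bigram idea a bit more concrete, here is a minimal Python sketch of the three quantities mentioned above: bigram counts, the mean bigram frequency of a word, and the conditional probability of one letter given the preceding one. Everything in it is an assumption for illustration only: the function names, the six-word toy corpus, and the choice to count each word once (type frequency) rather than weighting by how often the word occurs in text. It is not the procedure used in any particular study.

```python
from collections import Counter


def bigram_counts(words):
    """Count how often each adjacent letter pair occurs across a word list."""
    counts = Counter()
    for word in words:
        w = word.lower()
        counts.update(w[i:i + 2] for i in range(len(w) - 1))
    return counts


def mean_bigram_frequency(word, counts):
    """Average corpus count of the bigrams contained in one word."""
    w = word.lower()
    bigrams = [w[i:i + 2] for i in range(len(w) - 1)]
    if not bigrams:
        return 0.0
    return sum(counts[b] for b in bigrams) / len(bigrams)


def conditional_probability(first, second, counts):
    """Estimate P(second letter | first letter) from the bigram counts."""
    total = sum(n for bigram, n in counts.items() if bigram[0] == first)
    return counts[first + second] / total if total else 0.0


# Hypothetical toy corpus; a real analysis would use a large lexical database.
toy_corpus = ["swan", "swamp", "want", "wash", "cat", "hat"]
counts = bigram_counts(toy_corpus)

print(counts["wa"])                               # count of the bigram "wa"
print(mean_bigram_frequency("swan", counts))      # mean over "sw", "wa", "an"
print(conditional_probability("w", "a", counts))  # P("a" | "w") in this corpus
```

Replacing the mean with a plain sum in `mean_bigram_frequency` gives the summed variant; which of the two is the better predictor of reading speed is exactly the kind of question the paragraph above leaves open.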

The relevance of statistical learning to everything should have been a warning sign, because, as we know from Karl Popper, something that explains everything actually explains nothing. This becomes clearer when we ask the first question that a researcher should ask: What is statistical learning? I don’t want to claim that there is no answer to this question, nor do I want to provide an extensive literature review of the studies that do provide a precise definition. Suffice it to say: Some papers have definitions of statistical learning that are extremely broad, which is the reason why it is often used as a hand-wavy term denoting a mechanism that explains everything. This is an example of a one-word explanation, a term coined by Gerd Gigerenzer in his paper “Surrogates for theories” (one of my favourite papers). Other papers provide more specific definitions, for example, by defining statistical learning based on a specific task that is supposed to measure it. However, I have found no consensus among these definitions: and given that different researchers have different definitions for the same terminology, the resulting theoretical and empirical work is (in my view) a huge mess.

In addition to these theoretical issues, there is also a big methodological mess when it comes to the literature on statistical learning and reading or dyslexia. I’ve written about this in more detail in our two papers (linked above), but here I will list the methodological issues in a more compact manner: First, when we’re looking at individual differences (for example, by correlating reading ability and statistical learning ability), the lack of a task with good psychometric properties becomes a huge problem. This issue has been discussed in a number of publications by Noam Siegelman and colleagues, who even developed a task with good psychometric properties for adults (e.g., here and here). However, as far as I’ve seen, there are still no published studies on reading ability or dyslexia using improved tasks. Furthermore, recent evidence suggests that a statistical learning task which works well with adults still has very poor psychometric properties when applied to children.

Second, the statistical learning and reading literature is a good illustration of all the issues that are associated with the replication crisis. Some of these are discussed in our systematic review about statistical learning and dyslexia (linked above). The publication bias in this area (selective publication of significant results) became even clearer to me when I presented our study on statistical learning and reading ability – where we obtained a null result – at the SSSR conference in Brighton (2018). There were several proponents of the statistical learning theory (if we can call it that) of reading and dyslexia, but none of them came to my poster to discuss this null result. Conversely, a number of people dropped by to let me know that they’ve conducted similar studies and also gotten null results.

Papers on statistical learning and reading/dyslexia continue to be published, and at some point, I was close to being convinced that maybe visual statistical learning is related to learning to read in visually complex orthographies. But then, some major methodological or statistical issue always jumps out at me when I read a paper closely enough. The literature reviews of these papers tend to be biased, often listing studies with null results as evidence for the presence of an effect, or else picking out all the flaws of papers with null results, while treating the studies with positive results as a holy grail. I have stopped reading such papers, because it does not feel like a productive use of my time.

I have also stopped accepting invitations to review papers about statistical learning and reading/dyslexia, because I have started to doubt my ability to give an objective review. By now, I have a strong prior that there is no link between domain-general statistical learning ability and reading/dyslexia. I could be convinced otherwise, but would require very strong evidence (i.e., a number of large-scale pre-registered studies from independent labs with psychometrically well-established tasks). While I strongly believe that such evidence is required, I realise that it is unreasonable to expect such studies from most researchers who conduct this type of research, who are mainly early-career researchers who base their methodology on previous studies.

I also stopped doing or planning any studies on domain-general statistical learning. The amount of energy necessary to refute bullshit is an order of magnitude bigger than to produce it, as Alberto Brandolini famously tweeted. This is not to say that everything to do with statistical learning and reading/dyslexia is bullshit, but – well, some of it definitely is. I hope that good research will continue to be done in this area, and that the state of the literature will become clearer because of this. In the meantime, I have made the personal decision to move away from this line of research. I have received good advice from one of my PhD supervisors: not to get hung up on research that I think is bad, but to pick an area where I think there is good work and to build on that. Sticking to this advice definitely makes the research process more fun (for me). Statistical learning studies are likely to yield null results, which end up uninterpretable because of the psychometric issues with statistical learning tasks. Trying to publish this kind of work is not a pleasant experience.

Why did I write this blog post? Partly, just to vent. I wrote it as a blog post and not as a theoretical paper, because it lacks the objectivity and a systematic approach which would be required for a scientifically sound piece of writing. If I were to write a scientifically sound paper, I would need to break my resolution to stop doing research on statistical learning, so a blog post it is. Some of the issues above have been discussed in our systematic review about statistical learning and dyslexia, but I also thought it would be good to summarise these arguments in a more concise form. Perhaps some beginning PhD student who is thinking about doing their project on statistical learning and reading will come across this post. In this case, my advice would be: pick a different topic. 

Sunday, March 10, 2019

What’s next for Registered Reports? Selective summary of a meeting (7.3.2019)

Last week, I attended a meeting about Registered Reports. It was a great opportunity, not only to discuss Registered Reports, but also to meet some people whom I had previously only known from Twitter, over a glass of whiskey close to London Bridge.

The meeting felt very productive, and I took away a lot of new information, about the Registered Report Format in general, and also some specific things that will be useful to me when I submit my next Registered Report. Here, I don’t aim to summarise everything that was discussed, but to focus on those aspects that could be of practical importance to individual researchers.

What’s stopping researchers from submitting Registered Reports?
We dedicated the entire morning to discussing how to increase the submission rate of Registered Reports. Before the meeting, I had done an informal survey among colleagues and on Twitter to see what reasons people had for not submitting Registered Reports. The response rate was pretty low, suggesting that a lack of interest may be a leading factor (due either to apathy or scepticism – from my informal survey, I can’t tell). Among the people who did respond, the main reason was time: younger researchers in particular are often on short-term contracts (1-3 years), and are pressured for various reasons to start data collection as soon as possible. Among such reasons, people mentioned grants: funders often expect strict adherence to a timeline. And, unfortunately, such timing pressures disproportionately affect early-career researchers, exactly the demographic which is most open to trying out a new way of conducting and publishing research.

Submitting a Registered Report may take a while – there is no point sugar-coating this. In contrast to standard studies, authors of Registered Reports need to spend more time planning the study, because writing the report involves planning in detail; there may be several rounds of review before in-principle acceptance, and addressing reviewers’ comments may involve collecting pilot data. Given my limited experience, I would estimate that about 6-9 months would need to be added to the study timeline before one can count on in-principle acceptance and data collection can be started.

Of course, the increase in time that you spend before conducting the experiment will substantially improve the quality of the paper. A Registered Report is very likely to cut a lot of time at the end of the research cycle: when realising how long it may take to get in-principle acceptance, you should always bear in mind the painstakingly slow and frustrating process of submitting a paper to one journal after the other, accumulating piles of reviews varying in constructiveness and politeness, being told about methodological flaws that now you can’t fix, about how your results should have been different, and eventually unceremoniously throwing the study which you started with such great enthusiasm into the file-drawer.

Long-term benefits aside, unfortunately the issue of time remains for researchers on short-term contracts and with grant pressures. We could not think of any quick fix to this problem. In the long term, solutions may involve planning in this time when you write your next grant application. One possibility could be to write that you plan to conduct a systematic review during the time that you wait for in-principle acceptance. In my recently approved grant from the Deutsche Forschungsgemeinschaft, I proposed two studies: for the first study, I optimistically included a period of three months for “pre-registration and set-up”, and for the second study a period of twelve months (because this would happen in parallel to data collection for the first study). This somewhat backfired, because, while the grant was approved, they cut 6 months from my proposed timeline because they considered 12 months to be way too long for “pre-registration and set-up”. So, the strategy of planning for registered reports in grant applications may work, but bear in mind that it’s not risk-free.

A new thing that I learned about during the meeting is Registered Report Research Grants: here, journals pair up with funding agencies, and the review of the Registered Report happens in parallel to the review of the funding proposal. This way, once in-principle acceptance is in, the funding is released and data collection can start. This sounds like an amazingly efficient win-win-win solution, and I sincerely hope that funding agencies will routinely offer such grants.

How to encourage researchers to publish Registered Reports?
Here, I’ll list a few bits and pieces that were suggested as solutions. Some of these are aimed at the individual researcher, though many would require some top-down changes. The demographic most happy to try out a new publication system, as mentioned above, is likely to be early-career researchers, especially PhD students.

Members at the meeting reported positive experiences with department-driven working groups, such as the ReproducibiliTea initiative or Open Science Cafés. In some departments, such working groups have led to PhD students taking the initiative and proposing to their advisors that they would like to do their next study as a Registered Report. We discussed that encouraging PhD students to publish one study as a Registered Report could be a good recommendation. For departments which have formal requirements about the number of publications that are needed in order to graduate, a Registered Report could count more than a standard publication: let’s say, they either need to publish three standard papers, or one standard paper and a Registered Report (or two Registered Reports).

Deciding to publish a Registered Report is like jumping into cold water: the format requires some pretty big changes in the way that a study is conducted, and one is unsure if it will really impress practically important people (such as potential employers or grant reviewers) more than pumping out standard publications. Taking a step back, taking a deep breath and thinking about the pros and cons, I would say that, in many cases, the advantages outweigh the disadvantages. Yes, the planning stage may take longer, but you will cut time at the end, during the publication process, with a much higher chance that the study will be published. A fun fact I learned during the meeting: At the journal Cortex, once a Registered Report gets past the editorial desk (i.e., the editors established that the paper fits the scope of the journal), the rejection rate is only 10% (which is why we need more journals adopting the Registered Report format: this way, any paper, including those of interest to a specialised audience, will be able to find a good home). And, once you have in-principle acceptance, you can list the paper on your CV, which is (to many professors) much more impressive than a list of "in preparation"/"submitted" publications. If the Stage 1 review process takes unusually long and you're running out of time in your contract, you can withdraw the Registered Report, incorporate the comments to date, and conduct the experiment as a Preregistered Study.

Some of the suggestions listed above are aimed at individual researchers. The meeting was encouraging and helpful in terms of getting some suggestions that could be applied here and now. It also made it clear that top-down changes are required: the Registered Report format involves a different timeline compared to standard submissions, so university expectations (e.g., in terms of the required number of publications for PhD students, short-term post-doc contracts) and funding structures need to be changed.

Wednesday, February 13, 2019

P-values 101: An attempt at an intuitive but mathematically correct explanation

P-values are those things that we want to be smaller than 0.05: that is how I knew them during my undergraduate degree, and (I must admit) throughout my PhD. Even if you’re a non-scientist or work in a field that does not use p-values, you’ve probably heard of terms like p-hacking (e.g., from this video by John Oliver). P-values and, more specifically, p-hacking, are getting the blame for a lot of the things that go wrong in psychological science, and probably other fields as well.

So, what exactly are p-values, what is p-hacking, and what does all of that have to do with the replication crisis? Here, a lot of stats-savvy people shake their heads and say: “Well, it’s complicated.” Or they start explaining the formal definition of the p-value, which, in my experience, to someone who doesn’t already know a lot about statistics, sounds like blowing smoke and nitpicking on wording. There are rules, which, if one doesn’t understand how a p-value works, just sound like religious rituals, such as: “If you calculate a p-value halfway through data collection, the p-value that you calculate at the end of your study will be invalid.”

Of course, the mathematics behind p-values is on the complicated side: I had the luxury of being able to take the time to do a graduate certificate in statistics, and learn more about statistics than most of my colleagues will have the time for. Still, I won’t be able to explain how to calculate the p-value without revising and preparing. However, the logic behind it is relatively simple, and I often wonder why people don’t explain p-values (and p-hacking) in a much less complicated way1. So here goes my attempt at an intuitive, maths-free, but at the same time mathematically correct explanation.

What are p-values?
A p-value is the conditional probability of obtaining the data or data more extreme, given the null hypothesis. This is a paraphrased formal text-book definition. So, what does this mean?

Conditional means that we’re assuming a universe where the null hypothesis is always true. For example, let’s say I pick 100 people and randomly divide them into two groups. Group 1 gets a placebo, and Group 2 also gets a placebo. Here, we’ve defined the null hypothesis to be true.

Then I conduct my study; I might ask the participants whether they feel better after having taken this pill and look at differences in well-being between Group 1 and Group 2. In the long run, I don’t expect any differences, but in a given sample, there will be at least some numerical differences due to random variation. Then – and this point is key for the philosophy behind the p-value – I repeat this procedure, 10 times, 100 times, 1,000,000 times (this is why the p-value is called a frequentist statistic). And, according to the definition of the p-value, I will find that, as the number of repetitions increases, the percentage of experiments where I get a p-value smaller than or equal to 0.05 will get closer and closer to 5%. Again, this works in our modeled universe where the null hypothesis is true.
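For readers who like to try things out, the repeated placebo-versus-placebo experiment can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the example above: well-being scores are drawn from a normal distribution with a known standard deviation of 1, there are 50 people per group, and the p-value comes from a simple two-sided z-test. The function names `two_sided_p` and `share_significant` are made up for this sketch.

```python
import math
import random


def two_sided_p(group1, group2, sd=1.0):
    """Two-sided p-value of a z-test on the difference in means (sd assumed known)."""
    n1, n2 = len(group1), len(group2)
    diff = sum(group1) / n1 - sum(group2) / n2
    z = diff / (sd * math.sqrt(1 / n1 + 1 / n2))
    # 2 * (1 - Phi(|z|)) is the same as erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def share_significant(n_experiments, n_per_group=50, alpha=0.05, seed=1):
    """Share of placebo-vs-placebo experiments that end with p <= alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # Both groups come from the same distribution: the null hypothesis is true.
        group1 = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group2 = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sided_p(group1, group2) <= alpha:
            hits += 1
    return hits / n_experiments


if __name__ == "__main__":
    for repetitions in (100, 1000, 10000):
        print(repetitions, share_significant(repetitions))
```

With few repetitions the share of significant results bounces around; with many repetitions it should settle close to 0.05, even though both groups received the same placebo.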

The second important aspect of the definition of the p-value is that it is the probability of the data, not of the null hypothesis. Why this is important to bear in mind can be, again, explained with the example above. Let’s say we conduct our experiment with the two identical placebo treatments, and get a p-value of 0.037. What’s the probability that the null hypothesis is false? 0%: We know that it’s true, because this is how we designed the experiment. What if we get a p-value of 0.0000001? The probability of the null hypothesis being false is still 0%.

If we’re interested in the probability of the hypothesis rather than the probability of the data, we’ll need to use Bayes’ Theorem (which I won’t explain here, as this would be a whole different blog post). The reason why the p-value is not interpretable as a probability about the hypothesis (null or otherwise) is that we need to consider how likely the null hypothesis is to be true. In the example above, if we do the experiment 1,000,000 times, the null hypothesis will always be true, by design, therefore we can be confident that every single significant p-value that we get is a false positive.

We can take a different example, where we know that the null hypothesis is always false. For example, we can take children between 0 and 10 years old, and correlate their height with their age. If we collect data repeatedly and accumulate a huge number of samples to calculate the p-value associated with the correlation coefficient, we will occasionally get p > 0.05 (the proportion of times that this will happen depends both on the sample size and the true effect size, and equals 1 − statistical power). So, if we do the experiment and find that the correlation is not significantly different from zero, what is the probability that the null hypothesis is false? It’s 100%, because everyone knows that children’s height increases with age.

In these two cases we know that, in reality, the null hypothesis is true and false, respectively. In an actual experiment, of course, we don’t know if it’s true or false – that’s why we’re doing the experiment in the first place. It is relevant, however, how many hypotheses we expect to be true in the long run.

If we study mainly crazy ideas that we had in the shower, chances are, the null hypothesis will often be true. This means that the posterior probability of the null hypothesis being false, even after obtaining a p-value smaller than 0.05, will still be relatively small.

If we carefully build on previous empirical work, design a methodologically sound study, and derive our predictions from well-founded theories, we will be more likely to study effects where the null hypothesis is often false. Therefore, the posterior probability that the null hypothesis is false will be greater than for our crazy shower idea, even if we get an identical p-value.
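The contrast between the crazy shower idea and the well-founded study can be made concrete with Bayes’ Theorem. The sketch below is a simplification: it conditions on "the result was significant" (p ≤ alpha) rather than on one exact p-value, and the numbers for the prior and for statistical power are invented purely for illustration. The function name `prob_null_false_given_significant` is hypothetical.

```python
def prob_null_false_given_significant(prior_false, power, alpha=0.05):
    """P(null hypothesis is false | significant result), via Bayes' Theorem.

    prior_false: assumed share of studied hypotheses where the null is false
    power:       probability of a significant result when the null is false
    alpha:       probability of a significant result when the null is true
    """
    true_positives = prior_false * power
    false_positives = (1 - prior_false) * alpha
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Invented numbers: the null is false for 5% of shower ideas,
    # but for 50% of carefully derived hypotheses; power is 80% in both cases.
    print(prob_null_false_given_significant(prior_false=0.05, power=0.8))
    print(prob_null_false_given_significant(prior_false=0.50, power=0.8))
```

Under these made-up numbers, a significant result for the shower idea leaves the probability that the null hypothesis is false at roughly 46%, while the same significant result for the well-founded study puts it at roughly 94%. Setting `prior_false` to 0 or 1 reproduces the placebo and the height examples: the answer is then 0% or 100%, whatever the data say.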

So what about p-hacking?
That was the crash course to p-values, so now we turn to p-hacking. P-hacking practices are often presented as lists in talks about open science, and include:
-       Optional stopping: Collecting N participants, calculating a p-value, and deciding whether or not to continue with data collection depending on whether this first peek gives them a significant p-value or not,
-       Cherry-picking of outcome variables: Collecting many potential outcome variables in a given experiment, and deciding on or changing the outcome-of-interest after having looked at the results,
-       Determining or changing data cleaning procedures conditional on whether the p-value is significant or not,
-       Hypothesising after results are known (HARKing): First collecting the data, and then writing a story to fit the results and framing it as a confirmatory study.

The reasons why each of these are wrong are not intuitive when we think about data. However, they become clearer when we think of a different example: Guessing the outcome of a coin toss. We can use this different example, because both the correctness of a coin toss guess and the data that we collect in an experiment are random variables: they follow the same principles of probability.

Imagine that I want to convince you that I have clairvoyant powers, and that I can guess the outcome of a coin toss. You will certainly be impressed, and think whether there might not be something to that, if I toss the coin 10 times, and correctly guess the outcome every single time. You will be less impressed, however, if I toss the coin until I guess ten outcomes in a row. Of course, if you’re not present, I can instead make a video recording, cut out all of my unsuccessful attempts, and leave in only the ten correct guesses. From a probabilistic perspective, this would be the same as optional stopping.2

Now to cherry-picking of outcome variables: In order to convince you of my clairvoyant skills, I declare that I will toss the coin 10 times and guess the outcome correctly every single time. I toss the coin 10 times, and guess correctly 7 times out of 10. I can back-flip and argue that I still guessed correctly more often than incorrectly, and therefore my claim about my supernatural ability holds true. But you will not be convinced at all.
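To see why moving the goalposts from "10 out of 10" to "more often right than wrong" is unconvincing, it helps to compare how likely each outcome is for someone who is purely guessing. The following sketch uses the binomial distribution with a fair coin; the helper name `prob_at_least` is made up for this illustration.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """Probability of at least k correct guesses in n tosses when guessing at random."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    print(prob_at_least(10, 10))  # all ten guesses correct: 1/1024
    print(prob_at_least(7, 10))   # seven or more correct: 176/1024
```

Guessing all ten tosses correctly by luck has a probability of about 0.1%, whereas guessing at least seven correctly happens by luck about 17% of the time. The claim that was announced in advance would have been impressive; the claim that was substituted afterwards is not.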

The removal of outliers, in the coin-toss metaphor, would be equivalent to discarding every toss where I guess incorrectly. HARKing would be equivalent to tossing the coin multiple times, then deciding on the story to tell later (“My guessing accuracy was significantly above chance”/ “I broke the world record in throwing the most coin tosses within 60 seconds” / “I broke the world record in throwing the coin higher than anyone else” / “I’m the only person to have ever succeeded in accidentally hitting my eye while tossing a coin!”). 

So, if you're unsure whether a given practice can be considered p-hacking or not, try to think of an equivalent coin example, where it will be more intuitive to decide whether the reasoning behind the analysis or data processing choice is logically sound or not. 

More formally, when the final analysis choice is conditional on the results, it messes with the frequentist properties of the statistic (i.e., the p-value is designed to have certain properties which hold true if you repeat a procedure an infinite number of times, which no longer hold true if you add an unaccounted conditional term to the procedure). Such an additional conditional term could be based on the p-value (we will continue with data collection under the condition that p > 0.05), or it could be based on descriptive statistics (we will calculate a p-value between two participant groups out of ten under the condition that the graph shows the biggest difference between these particular groups).
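The effect of such a conditional term can be checked by simulation. The sketch below models one specific, assumed version of optional stopping: two placebo groups, a first peek at 25 participants per group, and data collection up to 50 per group only if the first peek gave p > 0.05. Scores are drawn from a normal distribution with known standard deviation and tested with a simple z-test; all names and numbers are assumptions made for illustration.

```python
import math
import random


def two_sided_p(group1, group2, sd=1.0):
    """Two-sided p-value of a z-test on the difference in means (sd assumed known)."""
    n1, n2 = len(group1), len(group2)
    diff = sum(group1) / n1 - sum(group2) / n2
    z = diff / (sd * math.sqrt(1 / n1 + 1 / n2))
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rates(n_experiments=10000, n_first=25, n_final=50,
                         alpha=0.05, seed=1):
    """Return (rate with a fixed sample size, rate with one unreported peek)."""
    rng = random.Random(seed)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_experiments):
        # The null hypothesis is true: both groups get a placebo.
        # All data are generated up front; the peek only looks at the first part.
        group1 = [rng.gauss(0, 1) for _ in range(n_final)]
        group2 = [rng.gauss(0, 1) for _ in range(n_final)]
        significant_at_peek = two_sided_p(group1[:n_first], group2[:n_first]) <= alpha
        significant_at_end = two_sided_p(group1, group2) <= alpha
        fixed_hits += significant_at_end
        # With optional stopping, a significant peek ends the study as a "success";
        # otherwise data collection continues and the final test decides.
        peeking_hits += significant_at_peek or significant_at_end
    return fixed_hits / n_experiments, peeking_hits / n_experiments


if __name__ == "__main__":
    fixed, peeking = false_positive_rates()
    print("fixed sample size:", fixed)
    print("with one peek:   ", peeking)
```

With a fixed sample size, the share of false positives should stay close to the nominal 5%. With a single undisclosed peek it ends up noticeably higher, because there are now two chances to cross the threshold instead of one.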

Some final thoughts
I hope that readers who have not heard explanations of p-values that made sense to them found my attempt useful. I also hope that researchers who often find themselves in a position where they explain p-values will see my explanations as a useful example. I would be happy to get feedback on my explanation, specifically about whether I achieved these goals: I occasionally do talks about open science and relevant methods, and I aim to streamline my explanation such that it is truly understandable to a wide audience.

It is not the case that people who engage in p-hacking practices are incompetent or dishonest. At the same time, it is also not the case that people who engage in p-hacking practices would not do so if they learn to rattle off the formal definition of a p-value in their sleep. But, clearly, fixing the replication crisis involves improving the general knowledge of statistics (including how to correctly use p-values for inference). There seems to be a general consensus about this, but judging by many Twitter conversations that I’ve followed, there is no consensus about how much statistics, say, a researcher in psychological science needs to learn. After all, time is limited, and there is much other general as well as topic-specific knowledge that needs to be acquired in order to do good research. If I’m asked how much statistics psychology researchers should know, my answer is, as always: “As much as possible”. But a wide-spread understanding of statistics is not going to be achieved by wagging a finger at researchers who get confused when confronted with the difference between the probability of the hypothesis versus the probability of the data conditional on the hypothesis. Instead, providing intuitive explanations should not only improve understanding, but also show that, on some level (that, arguably, is both necessary and sufficient for researchers to do sound research), statistics is not an impenetrable swamp of maths.

Edit (18/2/2019): If you prefer a different format of the above information, here is a video of me explaining the same things in German.

1 There are some intuitive explanations that I’ve found very helpful, like this one by Dorothy Bishop, or in general anything by Daniel Lakens (many of his blog entries, his coursera course). If he holds any workshops anywhere near you, I strongly recommend going. In 2015, I woke up at 5am to get from Padova to Daniel's workshop in Rovereto, and at the end of the day, I was stuck in Rovereto because my credit card had been blocked and I couldn't buy a ticket back to Padova. It was totally worth it!
2 Optional stopping does not involve willingly excluding relevant data, while my video recording example does. Unless disclosed, however, optional stopping does involve the withholding of information that is critical in order to correctly interpret the p-value. Therefore, I consider the two cases to be equivalent.

Tuesday, January 8, 2019

Scientific New Year’s Resolutions

Last year, I wrote an awful lot of blog posts about registered reports. My first New Year’s Resolution is to scale down my registered reports activism, which I start off by writing a blog post about my new year’s resolutions, where I’ll try hard not to mention registered reports more than 5 times.

Scientifically, last year was very successful for me, with the major event being a grant from the DFG, which will give me a full-time position until mid-2021 and the opportunity to work on my own project. This gives me my second New Year’s Resolution, which is to focus on the project that I’m supposed to be working on, and not to get distracted by starting any (or too many) new side projects.

Having a full-time job means I’ll have to somewhat reduce the amount of time I’ll spend on Open Science, compared to the past few years. However, Open Science is still very important to me: I see it as an integral part of the scientific process, and consequently as one of the things I should do as part of my job as a researcher. As a Freies Wissen Fellow, an ambassador of the Center for Open Science, and a member of the LMU Open Science Center, I have additional motivation and support in improving the openness of my own research and helping others to do the same. My third New Year’s Resolution is to start prioritising various Open Science projects.

In order to prioritise, I need to select some areas which I think are the most effective in increasing the openness of research. In my experience, talks and workshops are particularly effective (in fact, that’s how I became interested in Open Science). Last year, I gave talks as part of two Open Science Workshops (in Neuchâtel and Linz), and at our department, I organised an introduction to R (to encourage reproducible data analyses). The attendance of these workshops was quite good, and the attendees seemed very interested and motivated, which further supports my hypothesis about the efficiency of such events. I hope to hold more workshops this year: so far, I have one invitation for a workshop in Göttingen.

I still have two mentions of Registered Reports left for this blogpost (oops, just one now): as I think that they provide one of the most efficient ways to mitigate the biases that make a lot of psychological science non-replicable, I will continue trying to encourage journals to accept them as an additional publication format. I have explained why I think that this is important here, here, here, and here [in German], and how this can be achieved here and here. Please note that we’re still accepting signatures for the letters to editors and for an open letter, for more information see here.

As a somewhat less effective thing that I did in the previous years to support open science, I had signed all of my peer reviews. I am now wondering if it’s not doing more harm than good. Signing my reviews is a good way to force myself to stay constructive, but sometimes there really are mistakes in a paper that, objectively speaking, mean that the conclusions are wrong. One cannot blame authors for being upset at the reviewers for pointing this out – after all, we’ve all experienced this ourselves, and so many things depend on the number of publications. And, as much as I try to convince myself to see the review process as a constructive discussion between scientists, perhaps there is a good reason that peer review often happens anonymously, especially for an early career researcher.

As my fourth New Year’s Resolution, I will strongly reduce my use of Twitter compared to the last few years. I’ve learned a lot through Twitter, especially about statistics and open science. Lately, however, I started feeling like a lot of the discussions are repeating; the Open Science community has grown since I joined Twitter, leading to interpersonal conflicts within this community that have little to do with the common goal of improving reproducibility and replicability of research. A few months ago, I created a lurking account, where I will follow only a few reading researchers: this way, I can compulsively check Twitter without scrolling down endlessly, and I can keep up-to-date with any developments that are discussed in my actual field of research. So far, I really don’t have the feeling that I’m missing out, though I still check my old Twitter account occasionally, especially when I get notifications.

My fifth New Year’s Resolution is to continue learning, especially about programming and statistics. The more I learn, the more I realise how little I know. However, looking back, I also realise that I can now do things that I couldn’t even have dreamed of a couple of years ago, and it’s a nice feeling (and it substantially improves the quality of my work).

My final New Year’s Resolution goes both for my working life and for my personal life: Be nice: judge less, focus on the good things, not on the bad, take actions instead of complaining about things, be constructive in criticism. 

So here's to a good 2019! 

Oh, and I have one left: Please support Registered Reports!