Tuesday, August 21, 2018

“But has this study been pre-registered?” Can registered reports improve the credibility of science?


There is a lot of bullshit out there. Every day, we are faced with situations where we need to decide whether a given piece of information is trustworthy or not. This is a difficult task: A lot of information that we encounter requires a great deal of expertise in a specific field, and nobody is able to become an expert on all issues which we encounter on a day-to-day basis (just to name a few: politics, history, nutrition, medicine, psychology, educational sciences, physics, law, artificial intelligence).

In the current blogpost, I will focus on educational sciences. This is an area where it is very important – to everyone ranging from parents and teachers through to education researchers – to be able to distinguish bullshit from trustworthy information. Believing in bullshit can, in the best case, lead to a waste of time and money (parents investing into educational methods that don’t work; researchers building on studies which turn out to be non-replicable). In the worst case, children will undergo educational practices or interventions for developmental disorders which distract from more effective, evidence-based methods or may even be harmful in the long run.

Many people are interested in science and the scientific method. These people mostly know that the first question you ask if you encounter something that sounds dodgy is: “But has this study been peer-reviewed?” We know that peer-review is fallible: This can be shown simply by taking the example of predatory journals, which will publish anything, under the appearance of a peer-reviewed paper, for a fee. While it is often (but not always) obvious to experts in a field that a given journal is predatory, this will be a more difficult task for someone without scientific training. In this blogpost, I will mainly focus on a thought-experiment: What if, instead, we (researchers, as well as the general public), asked: “But has this study been pre-registered?”

I will discuss the advantages and potential pitfalls of this shift in mind-set. But first, because I’m still working on convincing some of my colleagues that pre-registration is important for educational sciences and developmental psychology, I describe two examples that demonstrate how important it is to be able to tell trustworthy research from untrustworthy research. These are real-life examples that I encountered in the last few weeks, but I changed some of the names: while the concepts described raise a lot of red flags associated with pseudoscience, I don’t have the time or resources to conduct a thorough investigation to show that they are not trustworthy, and I don’t want to get into any fights (or lawsuits) about these particular issues.

Example 1: Assessment and treatment for healthy kids
The first example comes from someone I know who asked me for advice. They had found a centre which assesses children on a range of tests, to see if they have any hidden developmental problems or talents. After a thorough assessment session (and, as I found out through a quick google search, a $300 fee), the child received a report of about 20 pages. As the centre specialises in children who have both a problem and a talent, it is not surprising that the child was diagnosed with both a problem and a talent (although, interestingly, a series of standardised IQ tests showed no problems). The non-standardised assessments tested for disorders that, during 7 years of study and 4 years of working as a post-doc in cognitive psychology, I had never heard of before. A quick google search revealed that there was some peer-reviewed literature on these disorders. But the research on a given disorder always came from one and the same person or “research” group, mostly with the affiliation of an institute that made money by selling treatments for this disorder.

The problem with the above assessment is: Most skills are normally distributed, meaning that, on a given test, some children will be very good, and some children will be very bad. If you take a single child and give them a gazillion tests, you will always find a test on which they perform particularly badly and one on which they perform particularly well. One child might be particularly fast at peeling eggs, for example. A publication could describe a study where 200 children were asked to peel as many eggs as possible within 3 minutes, and there was a small number of children who were shockingly bad at peeling eggs (“Egg Peeling Disorder”, or EPD for short). This does not mean that this ability will have any influence whatsoever on their academic or social development. But, in addition, we can collect data on a large number of variables that are indicative of children’s abilities: five reading tests, five mathematics tests, tests of fine motor skills, gross motor skills, vocabulary, syntactic skills, physical strength, the frequency of social interactions – the list goes on and on. Again, by the laws of probability, as we increase the number of variables, we increase the probability that at least one of them will be correlated with the ability to peel an egg, just by chance.
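To put a rough number on this “laws of probability” argument, here is a minimal sketch (not part of the original examples, just an illustration): under the null hypothesis, every p-value is uniformly distributed, so the chance that at least one of m truly null tests comes out “significant” at α = 0.05 is 1 − 0.95^m.

```python
import random

random.seed(42)

def chance_of_false_positive(n_tests, alpha=0.05, n_simulations=20_000):
    """Monte Carlo estimate of the chance that at least one of
    n_tests truly null tests comes out 'significant' by luck alone.
    Under the null hypothesis, each p-value is uniform on [0, 1]."""
    hits = 0
    for _ in range(n_simulations):
        if any(random.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_simulations

for m in (1, 5, 20, 60):
    analytic = 1 - (1 - 0.05) ** m
    print(f"{m:>2} tests: simulated {chance_of_false_positive(m):.2f}, "
          f"analytic {analytic:.2f}")
```

With 20 outcome variables, the chance of at least one spurious “significant” correlate of Egg Peeling Disorder is already around 64%; with 60 variables, it is over 95%.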

Would it help to ask: “Has this study been pre-registered?” Above, I described a way in which any stupid idea can be turned into a paper showing that a given skill can be measured and correlates with real-life outcomes. By maximising the number of variables, the laws of probability give us a very good chance of finding a significant result. In a pre-registered report, the researchers would have to declare, before they collect or look at the data, which tests they plan to use, and where they expect to find significant correlations. This leaves less wiggle room for significance-fishing, or combing the data for significant results which likely just reflect random noise.

Example 2: Clinical implications of ghost studies
The second example is from the perspective of a researcher. A recent paper I came across reviewed studies on a certain clinical population performing tasks tapping a cognitive skill – let’s call it “stylistical turning”. The review concluded that clinical groups perform, on average, worse than control groups on stylistical turning tasks, and suggested stylistical turning training to improve outcomes in this clinical population. Even disregarding the correlation-causation confusion, the conclusion of this paper is problematic, because in this particular case, I happen to know of two well-designed unpublished studies which did not find that the clinical group performed worse than a control group – in fact, both found that the stylistical turning task used by the original study doesn’t even work! Yet, as far as I know, neither has been published (even though I’d encouraged the researchers behind these studies to submit). So, the presence of unpublished “ghost” studies, which cannot be found through a literature search, has profound consequences for a question of clinical importance.

Would it help in this case to demand that studies are pre-registered? Yes, because pre-registration involves creating a record, prior to the collection of data, that this study will be conducted. In the case of our ghost studies, someone conducting a literature review would at least be able to find the registration plan. Even if the data did not end up being published, the person conducting the literature review could (and should) contact the authors and ask what became of these studies.

Is pre-registration really the holy grail?
As for most complex issues, it would be overly simplistic to conclude that pre-registration would fix everything. Pre-registration should help combat questionable research practices (fishing for significance in a large sea of data, as described in Example 1), and publication bias (selective publication of positive results, leading to a distorted literature, as described in Example 2). These are issues that standard peer-review cannot address: When I review the manuscript of a non-pre-registered paper, I cannot possibly know if the authors collected data on 50 other variables and report only the ones that came out as significant. Similarly, I cannot possibly know if, somewhere else in the world, a dozen other researchers conducted the exact same experiment and did not find a significant result.

What would happen if we – researchers and the general public alike – began to demand that studies must be pre-registered if they are to be used to inform practice? Luckily, medical research is ahead of educational sciences on this front: there, pre-registration seems to decrease the number of significant findings, which probably reflects a more realistic view of the world.

So, what could possibly go wrong with pre-registration? First, if we start demanding pre-registered reports now, we can pretty much throw away everything we’ve learned about educational sciences so far. There is a lot of bullshit out there for sure, but there are also effective methods, which show consistent benefits across many studies. But, as pre-registration has not really taken off yet in educational sciences, none of these studies have been pre-registered. This raises important questions: Should we repeat all existing studies in a pre-registered format, even when there is a general consensus among researchers that a given method is effective? On the one hand, this would be very time- and resource-consuming. On the other hand, some things that we think we know turn out to be false. And besides, even when a method is, in reality, effective, the selective publication of positive results makes it look like the method is much more effective than it really is. In addition to knowing what works and what doesn’t, we also need to make decisions about which method works better: this requires a good understanding of the extent to which a method helps.

It is also clear that we need to change the research structure before we unconditionally demand pre-registered reports: at this stage, it would be unfair to judge a paper as worthless if it has not been pre-registered, because pre-registration is just not done in educational sciences (yet). If more journals offered the registered report format, and researchers were incentivised for publishing pre-registered studies rather than mass-producing significant results, this would set the conditions for the general public to start demanding that practice is informed by pre-registered studies only.

As a second issue, there is one thing we have learned from peer-review: When there are strong incentives, people learn to play the system. At this stage, peer-review is the major determining factor of which studies are considered trustworthy by fellow researchers, policy-makers and stakeholders. This has resulted in a market for predatory journals, which publish anything under the appearance of a peer-reviewed paper, if you give them money. Would it be possible to play the system of pre-registered reports?

It is worth noting that there are two ways to do a pre-registration. One way is for a researcher to write the pre-registration report, upload it by themselves as a time-stamped, non-modifiable document, and then to go ahead with the data collection. In the final paper, they add a link to the pre-registration report. Peer-review occurs at the stage when the data has already been collected. Both the peer-reviewers and the readers can download the pre-registration report and compare the authors’ plans with what they actually did. It is possible to cheat with this format: Run the study, look at the result, and write the pre-registered report retrospectively, based on the results that have already come out as significant. The final paper can then be submitted with a link to the fake pre-registered report, and with a bit of luck, the study would appear as pre-registered in a peer-reviewed journal. This would be straight-out cheating as opposed to being in a moral grey-zone, which is the current status of questionable research practices. But it could be a real concern when there are strong incentives involved.

The second way to do a pre-registration is the so-called registered report (RR) format. Here, journals conduct peer-review of the pre-registered report rather than the final paper. This means that the paper is evaluated based on its methodological soundness and the strength of the proposed analyses. After the reviewers approve of the pre-registered plan, the authors get the thumbs-up to start data collection. Cheating by submitting a plan for a study that has already been conducted becomes difficult in this format, because reviewers are likely to propose some changes to the methodology: if the data has already been collected, the cheating authors would be put in a checkmate position because they would need to collect new data after making the methodological changes.

For both formats, there are more subtle ways to maximise the chances of supporting your theory (let’s say, if you have a financial interest in the results coming out in a certain way). A bad pre-registration report could be written in a way that is vague: As we saw in Example 1, this would still give the authors wiggle room in their analyses until they find a significant result (e.g., “We will test mathematical skills”, but neglecting to mention that 5 different tests will be used, and all possible permutations of these tests will be used to calculate an average score until one of them turns out to be significant). This would be less likely to happen with the RR format than with non-peer-reviewed pre-registration, because a good peer-reviewer should be able to pick up on this vagueness, and demand that the authors specify exactly which variables they will measure, how they will measure them, and how they will analyse them. But the writer of the registered report could hope for inattentive reviewers, or submit to many different journals until one finally accepts the sloppily-written report. To circumvent this problem, then, it is necessary to combine RRs with rigorous peer-review. From this perspective, the most important task of the reviewer is to make sure that the registered report is written in a clear and unambiguous manner, and that the resulting paper closely follows what the authors said they would do in the registered report.

Conclusion
So, should we start demanding that educational practice is based on pre-registered studies? In an ideal world: Yes. But for now, we need top-down changes inside the academic system, which would encourage researchers to conduct pre-registered studies.

Is it possible to cheat with pre-registered reports in such a way that we don’t end up solving the problems I outlined in this blogpost? Probably yes, although a combination of the RR format (where the pre-registered report rather than the final paper is submitted to a journal) and rigorous peer-review should minimise such issues.

What should we do in the meantime? My proposed course of action will be to focus on making it more common among education researchers to pre-register their studies. One way to achieve this is to encourage as many journals as possible to adopt the RR format. To have good peer-review for RRs, we also need to spread awareness among researchers about what to look out for when reviewing an RR. Some journals which publish RRs, such as Cortex, have very detailed guidelines for reviewers. In addition, perhaps workshops about how to review an RR could be useful.

Monday, June 18, 2018

Realising Registered Reports: Part 1


Two weeks ago, I set out on a quest to increase the number of journals which publish reading-related research and offer the publication format of Registered Reports (RRs: https://cos.io/rr/). I wrote a tweet about my vague idea to start writing to journal editors, which was answered by a detailed description of 7 easy steps by Chris Chambers (reproduced in full below, because I can’t figure out how to embed tweets):

1)    Make a list of journals that you want to see offer Registered Reports
2)    Check osf.io/3wct2/wiki/Journal Responses/ to see if each journal has already been publicly approached
3)    If it hasn’t, adapt template invitation letter here: osf.io/3wct2/wiki/Journal Requests/
4)    Assemble colleagues (ECRs & faculty, as many as possible) & send a group letter to chief editor & senior editors. Feel free to include [Chris Chambers] as a signatory and include [him] and David Mellor (@EvoMellor) in CC.
5)    Login with your OSF account (if you don’t have one then create one: https://osf.io/)
6)    Once logged in via OSF, add the journal and status (e.g., “Under consideration [date]”) to osf.io/3wct2/wiki/Journal Responses/. Update status as applicable. Any OSF user can edit.
7)    Repeat for every relevant journal & contact [Chris] or David Mellor if you have any questions.

So far, I have completed Steps 1-3, and I’m halfway through Step 4: I have a list of signatories, and will start emailing editors this afternoon. It could be helpful for some, and interesting for others to read about my experiences with following the above steps, so I decided to describe them in a series of blog posts. So, here is my experience so far:

My motivation
I submitted my first RR a few months ago. It was a plan to conduct a developmental study in Russian-speaking children, which would assess what kind of orthographic characteristics may pose problems during reading acquisition. There were two issues that I encountered while writing this report: (1) For practical reasons, it’s very difficult to recruit a large enough sample to make it a well-powered study, and (2) it’s not particularly interesting for a general audience. This made it very difficult to make a good case for submitting it to any of the journals that currently offer the RR format. Still, I think it’s important that such studies are being conducted, and that there is the possibility to pre-register, in order to avoid publication bias and unintentional p-hacking (and to get peer review and conditional acceptance before you start data collection – this, I think, should be a very good ‘selfish’ reason for everyone to support RRs). So it would be good if some more specialist journals started accepting the RR format for smaller studies that may be only interesting to a narrow audience.

Making lists
The 7-step process involves making two lists: One list of journals where I’d like to see RRs, and another list of signatories who agree that promoting RRs is a good thing. I created a Google Doc spreadsheet, which anyone can modify to add journals to one sheet and their name to another sheet, here. A similar list has been created, at around the same time, by Tim Schoof, for the area of hearing science, here.

Getting signatories
The next question is how to recruit people to modify the list and add their names as signatories. I didn’t want to spam anyone, but at the same time I wanted to get as many signatures as possible, and of course, I was curious how many reading researchers would actively support RRs. I started off by posting on twitter, which already got me around 15 signatures. Then I wrote to my current department, my previous department, and made a list of reading researchers that came to my mind. Below is the email that I sent to them:

Dear fellow reading/dyslexia researchers,

Many of you have probably heard of a new format for journal articles: Registered Reports (RR). With RRs, you write a study and analysis plan, and submit it to a journal before you collect the data. You get feedback from the reviewers on the design of the study. If the reviewers approve of the methods, you get conditional acceptance. This means that the study will be published regardless of its outcome. Exploratory analyses can still be reported, but they will be explicitly distinguished from the confirmatory analyses relating to the original hypotheses.

The RR format is good for science, because it combats publication bias. It is also good for us as individual researchers, because it helps us to avoid situations where we invest time and resources into a study, only to find out in retrospect that we overlooked some design flaw and/or that non-significant findings are uninformative, rendering the study unpublishable. You can find more information and answers to some FAQs about RRs here: https://cos.io/rr/.

There are some journals which offer the RR format, but not many of them are relevant to a specialised audience of reading researchers. Therefore, I'd like to contact the editors of some more specialised journals to suggest accepting the RR format. I started off by making a list of journals which publish reading-related research, which you can view and modify here: https://docs.google.com/spreadsheets/d/1Ewutk2pU6-58x5iSRr18JlgfRDmvqgh2tfDFv_-5TEU/edit?usp=sharing

To increase the probability of as many journals as possible deciding to offer RRs (alongside the traditional article formats) in the future, I would like to ask you three favours:

1) If there are any journals where you would like to see RRs, please add them to the list (linked above) before the 15th of June.
2) If you would like to be a signatory on the emails to the editors, please let me know or add your name to the list of signatories in the second sheet (same link as the list). I will then add your name to the emails that I will send to the editors. Here is a template of the email: https://osf.io/3wct2/wiki/Journal%20Requests/
3) If you are part of any networks for which this could be relevant, please help to spread the information and my two requests above.
If you have any questions or concerns about RRs, I would be very happy to discuss.

Thank you very much in advance!
Kind regards,
Xenia.

The list of reading researchers that I could think of contained 62 names, including collaborators, friends, people I talked to at conferences and who seemed to be enthusiastic about open science, some big names, people who have voiced scepticism about open science; in short: a mixture of early-career and senior researchers, with varying degrees of support for open science practices. After I sent these emails, I gained about 15 more signatures.

After some consideration, I also wrote to the mailing list of the Scientific Studies of Reading. I hesitated because, again, I didn’t want to spam anyone. But in the end, I decided that there is value in disseminating the information to those who would like to do something to support RRs, even if it means annoying some other people. There were some other societies’ mailing lists that I would have liked to try, but where I couldn’t find a public mailing list and did not get a response from the contact person.

After having done all this, I have 45 signatures, excluding my own. In response to my emails and tweets, I also learned that two reading-related journals are already considering implementing RRs: The Journal of Research in Reading and Reading Psychology. 

Who supports RRs, anyway?
I would find it incredibly interesting to answer some questions using the signature “data” that I have, including: At what career stage are people likely to support RRs? Are there some countries and universities which support RRs more than others? I will describe the observed trends below. However, the data is rather anecdotal, so it should not be taken any more seriously than a horoscope.

Out of the 46 signatories, I classified 17 as early career researchers (PhD students or post-docs), and 29 as seniors (either based on my knowledge of their position or through a quick google search). This is in contrast to the conventional wisdom that young people strive for change while older people cling on to the existing system. However, there are alternative explanations: for example, it could be that ECRs are more shy about adding their name to such a public list.

The signatories gave affiliations from 11 different countries, namely UK (N = 10), US (N = 7), Canada (N = 6), Australia (N = 5), Belgium (N = 4), Germany (N = 4), the Netherlands (N = 4), Italy (N = 2), Norway (N = 2), Brazil (N = 1), and China (N = 1).

The 46 signatories came from 32 different affiliations. The most signatures came from Macquarie University, Australia (my Alma Mater, N = 5). The second place is shared between Dalhousie University, Canada, and Université Libre de Bruxelles, Belgium (N = 3). The shared third place goes to Radboud University, the Netherlands; Royal Holloway, University of London, UK; Scuola Internazionale Superiore di Studi Avanzati (SISSA), Italy; University of Oslo, Norway; University of York, UK, and University of Oxford, UK (N = 2).

All of these numbers are difficult to interpret, because I don’t know exactly how widely and to which networks my emails and tweets were distributed. However, I can see whether there are any clear trends in the response rate among the researchers I contacted directly, via my list of researchers I could think of. This list contained 62 names of 22 ECRs and 40 senior researchers. Of these, 5 ECRs and 6 senior signed, which indeed is a higher response rate amongst ECRs than senior researchers.
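For what it’s worth, the arithmetic behind that last comparison, using the counts reported above, looks like this (a trivial check, nothing more):

```python
# Counts reported above: 62 researchers contacted directly,
# 22 ECRs (of whom 5 signed) and 40 seniors (of whom 6 signed).
ecr_rate = 5 / 22       # about 0.23
senior_rate = 6 / 40    # exactly 0.15
print(f"ECR response rate:    {ecr_rate:.0%}")
print(f"Senior response rate: {senior_rate:.0%}")
```

So roughly 23% of directly-contacted ECRs signed, against 15% of seniors, although with numbers this small the difference could easily be noise.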

Before sending the group email, I’d tried to rate the researchers, based on my impression of them, of how likely they would be to sign in support of RRs. My original idea was to write only to those who I thought are likely to sign, but ultimately I decided to see if any of those whom I’d consider unlikely would positively surprise me. I rated the researchers on a scale of 1 (very unlikely) to 10 (I’m positive they will sign). If I didn’t know what to expect, I left this field blank. This way, I rated 26 out of the 62 researchers, with a mean rating of 6.1 (SD = 2.1, min = 2, max = 10). Splitting up by ECR and senior researchers, my average ratings were similar across the two groups (ECRs: mean = 5.9, SD = 1.5, min = 3, max = 7; seniors: mean = 6.2, SD = 2.4, min = 2, max = 10).

How accurate were my a priori guesses? Of the 26 people I’d rated, 9 had signed. Their average rating was 7.3 (SD = 1.1, min = 6, max = 9). The other 17 researchers had an average rating of 5.4 (SD = 2.2, min = 2, max = 10). So it seems I was pretty accurate in my guesses about who would be unlikely to sign (i.e., there were no ‘pleasant surprises’). Some of the people I considered to be highly likely to sign did not do so (‘unpleasant surprises’), though I like to think that there are reasons other than a lack of support for the cause (e.g., they are on sabbatical, forgot about my email, or they think there are better ways to promote open science).

What now?
Now I will start sending out the emails to the editors. I’m very curious how it will go, and I’m looking forward to sharing my experiences and the reactions in an upcoming blogpost.

Tuesday, May 8, 2018

Some thoughts on trivial results, or: Yet another argument for Registered Reports


A senior colleague once joked: “If I read about a new result, my reaction is either: ‘That’s trivial!’, or ‘I don’t believe that!’”

These types of reactions are pretty common when presenting the results of a new study (in my experience, anyway). In peer review, especially the former can be a reason for a paper rejection. In conversations with colleagues, one sometimes gets told, jokingly: “Well, I could have told you in advance that you’d get this result, you didn’t have to run the study!” This can be quite discouraging, especially if, while you were planning your study, it did not seem at all obvious to you that you would get the obtained result.

In many cases, perhaps, the outcomes of a result are obvious, especially to someone who has been in the field for much longer than you have. For some effects, there might be huge file drawers, such that it’s a well-known secret that an experimental paradigm which seems perfectly reasonable at first sight doesn’t actually work. In this case, it would be very helpful to hear that it’s probably not the best idea to invest time and resources in this paradigm. However, it would be even more helpful to hear about this before you plan and execute your study.

One also needs to take into account that there is hindsight bias. If you hear the results first, it’s easy to come up with an explanation for the exact obtained pattern. Thus, a result that might seem trivial in hindsight would actually have been not so easy to predict a priori. There is also often disagreement about the triviality of an outcome: It's not unheard of (not only in my experience) that Reviewer 1 claims that the paper shouldn't be published because the result is trivial, while Reviewer 2 recommends rejection because (s)he doesn’t believe this result.

Registered reports should strongly reduce the amount of times that people tell you that your results are trivial. If you submit a plan to do an experiment that really is trivial, the reviewers should point this out while evaluating the Stage 1 manuscript. If they have a good point, this will save you from collecting data for a study that many people might not find interesting. And if the reviewers agree that the research question is novel and interesting, they cannot later do a backflip and say that it’s trivial after having seen the results.

So, this is another advantage of registered reports. And, if I’m brave enough, I’ll change the way I tell (senior) colleagues about my work in informal conversations, from: “I did experiment X, and I got result Y” to “I did experiment X. What do you think happened?”

Tuesday, March 27, 2018

Can peer review be objective in “soft” sciences?


Love it or hate it – peer review is likely to elicit strong emotional reactions from researchers, at least at those times when they receive an editorial letter with a set of reviews. Reviewer 1 is mildly positive, Reviewer 2 says that the paper would be tolerable if you rewrote the title, abstract, introduction, methods, results and discussion sections, and Reviewer 3 seems to have missed the point of the paper altogether.
 
This is not going to be an anti-peer-review blog post. In general, I like peer review (even if you might get a different impression if you talk to me right after I get a paper rejected). In principle, peer review is an opportunity to engage in academic discussion with researchers who have similar research interests as you. I have learned a lot from having my papers peer reviewed, and most of my papers have been substantially improved by receiving feedback from reviewers, who often have different perspectives on my research question.

In practice, unfortunately, the peer review process is not a pleasant chat between two researchers who are experts on the same topic. The reviewer has the power to influence whether the paper will be rejected or accepted. With the pressure to publish, the reviewer may well delay the graduation of a PhD student or impoverish a researcher’s publication record just before an important deadline. That is a different matter, one that has more to do with the incentive system than with the peer review system. The thing about peer review is, though, that it should be as objective as possible: especially given that, in practice, a PhD student’s graduation may depend on it.

Writing an objective peer review is probably not a big issue for harder sciences and mathematics, where verifying the calculations should often be enough to decide whether the conclusions of the paper are warranted. In contrast, in soft sciences, one can always find methodological flaws, think of potential confounds or alternative explanations that could explain the results, or require stronger statistical evidence. The limit is the reviewer’s imagination: whether your paper gets accepted or not may well be a function of the reviewers’ mood, creativity, and implicit biases for or against your lab.

This leads me to the goal of the current blog post: Can we achieve objective peer review in psychological science? I don’t have an answer to this. But here, I aim to summarise the kinds of things that I generally pay attention to when I review a paper (or would like to pay more attention to in the future), and hope for some discussion (in the comments, on twitter) about whether or not this constitutes as-close-to-objective-as-possible peer review.

Title and abstract
Here, I generally check whether the title and abstract reflect the results of the paper. This shouldn’t even be an issue, but I have reviewed papers where the analysis section described a clear null result, but the title and abstract implied that the effect was found.

Introduction
Here, I aim to check whether the review of the literature is complete and unbiased, to the best of my knowledge. Examples of issues that I would point out: the authors selectively cite studies with positive results (or, worse, cite studies with null results as if they had found positive results), or misattribute a finding or theory. As minor points, I note if I cannot follow the authors’ reasoning.
I also consider the a priori plausibility of the authors’ hypothesis. The idea is to try and pick up on instances of HARKing, or hypothesising after results are known. If there is little published information on an effect, but the authors predict the exact pattern of results of a 3x2x2x5-ANOVA, I ask them to clarify whether the results were found in exploratory analyses, and if so, to rewrite the introduction accordingly. I then write that exploratory results are valuable and should be published, but should not be phrased as confirmatory findings.
It is always possible to list other relevant articles or other theories which the authors should cite in the introduction (e.g., more of the reviewer’s papers). Here, I try to hold back with suggestions: the reader will read the paper through the lens of their own research question, anyway, and if the results are relevant for their own hypothesis, they will be able to relate them without the authors writing a novel on all possible perspectives from which their paper could be interesting.

Methods
No experiment’s methods are perfect, but imperfections come in different kinds: some make the results uninterpretable; others should be pointed out as limitations; yet others are found in all papers using the paradigm, so it’s perfectly OK for a published paper to have them, unless they are a reviewer’s personal pet peeve. In some instances, it is even considered rude to point out certain imperfections, and sometimes doing so will just result in the authors citing some older papers which have the same flaws. Some imperfections can be addressed with follow-up analyses (e.g., by including covariates), but then it’s not clear what the authors should do if they get ambiguous results, or results that conflict with the original analyses.
Perhaps this is the section with which you can always sink a paper, if you want to: if for no other reason, then, in most cases, on the grounds that the experiment(s) are underpowered. What level of imperfection can be tolerated probably varies from topic to topic and from lab to lab, and I can’t think of any general rules for evaluating the adequacy of the experimental methods. If authors reported a priori power analyses, one could objectively scrutinise their proposed effect size. In practice, though, demanding power analyses in a review would likely just lead to post-hoc justifications from the authors, which is not the point of a power analysis.
So, perhaps the best thing is to simply ask the authors for the 21-word statement, proposed by Simmons et al., which includes a clarification about whether or not the sample size, analyses, and comparisons were determined a priori. I must admit that I don’t do this (but will start to do this in the future): so far, failing to include such a declaration, in my area, seems to fall into the category of “imperfections that are found in all papers, so it could be seen as rude to point them out”. But this is something that ought to change.
Even though the methods themselves may be difficult to review objectively, one can always focus on the presentation of the methods. Could the experiment be reproduced by someone wanting to replicate the study? It is always best if the materials are available as appendices: the actual stimuli that were used, and the scripts (e.g., Python, DMDX) that were used for item presentation. For psycholinguistic experiments, I ask for a list of words with their descriptive statistics (frequency, orthographic neighbourhood, other linguistic variables that could be relevant).

Results
In some ways, results sections are the most fun to review. (I think some people whose paper I reviewed would say that this is the section that is the least fun to get reviewed by me.) The first question I try to answer is: Is it likely that the authors are describing accidental patterns in random noise? As warning signs, I take a conjunction of small sample sizes, strange supposedly a priori hypotheses (see “Methods” section), multiple comparisons without corrections, and relatively large p-values for the critical effects.
Content-wise: Do the analyses and results reflect the authors’ hypotheses and conclusions? Statistics-wise: Are there any strange things? For example, are the degrees of freedom in order, or could there be some mistake during data processing?
There are other statistical things that one could look at, which I have not done to date, but perhaps should start doing. For example, are the descriptive statistics mathematically possible? One can use Nick Brown’s and James Heathers’ GRIM test for this. Is the distribution of the variables, as described by the means and standard deviations, plausible? If there are multiple experiments, are there suspiciously many significant p-values despite low experimental power? Uli Schimmack’s Incredibility Index can be used to check this. Doing such routine checks is very uncommon in peer review, as far as I know. Perhaps reviewers don’t want to include anything in their report that could be misconstrued (or correctly construed) as implying that there is some kind of fraud or misconduct involved. On the other hand, it should also be in the authors' best interests if reviewers manage to pick up potentially embarrassing honest mistakes. Yes, checking such things is a lot of work, but arguably it is the reviewer’s job to make sure that the paper is, objectively, correct, i.e., that the results are not due to some typo, trimming error, or more serious issue. After all, reviewers of papers in mathematics have to reproduce all calculations, and journals such as the Journal of Statistical Software verify the reproducibility of all code before even sending a paper out to review.
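The logic behind the GRIM test is simple enough to sketch in a few lines of Python (a minimal illustration of the idea, not Brown and Heathers’ actual implementation; the function name and the check-the-neighbouring-integers detail are my own choices, and a serious implementation would handle rounding conventions more carefully): the sum of n integer responses is itself an integer, so the true mean must equal k/n for some integer k.

```python
def grim_consistent(reported_mean, n, decimals=2):
    """Check whether a mean of n integer values can round to reported_mean.

    The sum of n integers is itself an integer, so the true mean must be
    k / n for some integer k. We test the integers nearest to
    reported_mean * n.
    """
    target = round(reported_mean, decimals)
    k = round(reported_mean * n)
    return any(
        round(candidate / n, decimals) == target
        for candidate in (k - 1, k, k + 1)
    )

# With n = 25, possible means are multiples of 0.04:
print(grim_consistent(3.48, 25))  # True: 87 / 25 = 3.48
print(grim_consistent(3.49, 25))  # False: no integer sum gives 3.49
```

One caveat: Python’s built-in round() uses banker’s rounding, which can differ from the round-half-up convention some authors use for means ending in exactly 5 at the reported precision, so borderline cases deserve a manual check.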
And speaking of reproducibility: Ideally, the analyses and results of any paper should be reproducible. This means that the reviewers (or anyone else, for that matter), can take the same data, run the same analyses, and get the same results. In fact, this is more than an ideal scenario: running the same analyses and getting the same results, as opposed to getting results that are not at all compatible with the authors’ conclusions, is kind of a must. This requires that authors upload their data (unless this is impossible, e.g., due to privacy issues), and an analysis script.
The Peer Reviewers’ Openness (PRO) Initiative proposes that reviewers refuse to review any paper that does not provide the data, i.e., that fails to meet this minimum standard of reproducibility. I have signed this initiative, but admit that I still review papers even when the data are not available. This is not because I don’t think it’s important to request transparency: I generally get overly excited when I’m asked to review a paper, and get halfway through the review before I remember that, as a signatory of the PRO Initiative, I shouldn’t be reviewing it at all until the authors provide the data or a reason why this is not possible. I compromise by including a major concern at the beginning of my reviews, stating that the data should be made available unless there are reasons for not making them public. So far, I think, I’ve succeeded only once in convincing the authors to actually upload their data, and a few editors have mentioned my request in their decision letter.

Discussion
Here, the main questions I ask are: Are the conclusions warranted by the data? Are any limitations clearly stated? Can I follow the authors’ reasoning? Is it sufficiently clear which conclusions follow from the data, and which are more speculative? 
As with the introduction section, it’s always possible to suggest alternative hypotheses or theories for which the results may have relevance. Again, I try not to get too carried away with this, because I see it as the reader’s task to identify any links between the paper and their own research questions.

In conclusion
Peer review is a double-edged sword. Reviewers have the power to influence an editor’s decision, and should use it wisely. In order to be an unbiased gatekeeper who sifts out bad science, a reviewer ought to write a report that is as objective as possible. I did not aim to make this blog post about Open Science, but looking through what I have written so far, making sure that the methods and results of a paper are openly available (where possible) and reproducible might be the major goal of an objective peer reviewer. After all, if all information is transparently presented, each reader has what they need to decide for themselves whether they want to believe the conclusions of the paper. The probability of your paper being accepted for publication would no longer depend on whether your particular reviewers happen to find your arguments and data convincing.

I will finish the blog post with an open question: Is it possible, or desirable, to have completely objective peer review in psychological science?

Sunday, February 11, 2018

By how much would we need to increase our sample sizes to have adequate power with an alpha level of 0.005?


At our department seminar last week, the recent paper by Benjamin et al. on redefining statistical significance was brought up. In this paper, a large group of researchers argue that findings with a p-value close to 0.05 reflect only weak evidence for an effect. Thus, to claim a new discovery, the authors propose a stricter threshold, α = 0.005.

After hearing of this proposal, the immediate reaction in the seminar room was horror at some rough estimations of either the loss of power or the increase in required sample size that this would involve. I imagine that this reaction is rather standard among researchers, but from a quick scan of the “Redefine Statistical Significance” paper and four responses to it that I have found (“Why redefining statistical significance will not improve reproducibility and could make the replication crisis worse” by Crane, “Justify your alpha” by Lakens et al., “Abandon statistical significance” by McShane et al., and “Retract p < 0.005 and propose using JASP instead” by Perezgonzalez & Frías-Navarro), there are no updated sample size estimates.

Required sample size estimates for α = 0.05 and α = 0.005 are very easy to calculate with G*Power. So, here are the sample size estimates for achieving 80% power, for two-tailed independent-samples t-tests and four different effect sizes:

Alpha    N for d = 0.2    N for d = 0.4    N for d = 0.6    N for d = 0.8
0.05           788              200               90               52
0.005         1336              338              152               88

It is worth noting that most effects in psychology tend to be closer to the d = 0.2 end of the scale, and that most designs are nowadays more complicated than simple main effects in a between-subject comparison. More complex designs (e.g., when one is looking at an interaction) usually require even more participants.

The argument of Benjamin et al., that p-values close to 0.05 provide very weak evidence, is convincing. But their solution raises practical issues which should be considered. For some research questions, collecting a sample of 1336 participants could be achievable, for example by using online questionnaires instead of testing participants in the lab. For other research questions, collecting samples of this size seems unimaginable. It’s not impossible, of course, but doing so would require a collective change in mindset, in the research structure (e.g., investing more resources into a single project, providing longer-term contracts for early career researchers), and in incentives (e.g., relaxing the requirement to have many first-author publications).

If we ignore people’s concerns about the practical issues associated with collecting this many participants, the Open Science movement may lose a great deal of supporters.

Can I end this blog post on a positive note? Well, there are some things we can do to make the numbers from the table above seem less scary. For example, we can use within-subject designs when possible. Things already start to look brighter: Using the same settings in G*Power as above, but calculating the required sample size for “Difference between two dependent means”, we get the following:

Alpha    N for d = 0.2    N for d = 0.4    N for d = 0.6    N for d = 0.8
0.05           199               52               24               15
0.005          337               88               41               25

We could also pre-register our study, including the expected direction of a test, which would allow us to use a one-sided t-test. If we do this, in addition to using a within-subject design, we have:

Alpha    N for d = 0.2    N for d = 0.4    N for d = 0.6    N for d = 0.8
0.05           156               41               19               12
0.005          296               77               36               22
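The G*Power numbers above can also be cross-checked in Python with the statsmodels power module (a sketch; both tools solve the same noncentral-t power problem, so the results should agree, though ceiling and rounding conventions can shift an entry by a participant or two):

```python
import math

from statsmodels.stats.power import TTestIndPower, TTestPower

def required_n(d, alpha, design="between", alternative="two-sided"):
    """Smallest N giving at least 80% power for a t-test.

    For the between-subject design, returns the total N across the two
    (equal-sized) groups; for the within-subject (paired) design, the
    number of participants.
    """
    if design == "between":
        n_per_group = TTestIndPower().solve_power(
            effect_size=d, alpha=alpha, power=0.80, alternative=alternative)
        return 2 * math.ceil(n_per_group)
    # Within-subject: a one-sample t-test on the difference scores.
    return math.ceil(TTestPower().solve_power(
        effect_size=d, alpha=alpha, power=0.80, alternative=alternative))

for alpha in (0.05, 0.005):
    # Each row should match the corresponding table (up to rounding).
    print(alpha, [required_n(d, alpha) for d in (0.2, 0.4, 0.6, 0.8)])
```

The within-subject and one-sided entries come out of the same function, e.g. required_n(0.2, 0.005, design="within", alternative="larger") for the bottom-left cell of the last table.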

The bottom line is: A comprehensive solution to the replication crisis should address the practical issues associated with getting larger sample sizes.

Thursday, February 8, 2018

Should early-career researchers make their own website?


TL;DR: Yes.

For a while, I have been thinking about whether or not to make my own website. I could see some advantages, but at the same time, I was wondering how it would be perceived. After all, I don’t think any of my superiors at work have their own website, so why should I?

To see what people actually think, I made a poll on Twitter. It received some attention and generated some interesting discussions and many supportive comments (you can read them directly by viewing the responses to the poll I linked above). In this blogpost, I would like to summarise the arguments that were brought up (they were mainly pro-website).

But first, and without any further ado, here are the results:

The results are pretty clear, so – here it is: https://www.xenia-schmalz.net/. It’s still a work-in-progress, so I would be happy to get any feedback!

It is noteworthy that there are some people who did think that it’s narcissistic for Early Career Researchers (ECRs) to create their own website. It would have been interesting to get some information about the demographics of these 5%, and their thoughts behind their vote. If you are an ECR who is weighing up the pros and cons of creating a website, then, as Leonid Schneider pointed out, you may want to think about whether you would want to positively impress someone who judges you for creating an online presence. Either way, I decided that the benefits outweigh any potential costs.

Several people have pointed out in response to the twitter poll that a website is only as narcissistic as you make it. This leads to the question: what comes off as narcissistic? I can imagine that there are many differences in opinion on this. Does one focus on one’s research only? Or include some fun facts about oneself? I decided to take the former approach, for the reason that people who google me are probably more interested in my research rather than my political opinion or to find out whether I’m a cat or a dog person.

In general, people who spend more time on self-promotion than on actually doing things that they brag they can do are not very popular. I would rather not self-promote at all than come off as someone with a head full of hot air. Ideally, I would want to let my work speak for itself and for colleagues to judge me based on the quality of my work. This, of course, requires that people can access my work – which is where the website comes in. Depending on how you design your website, this is literally what it is: A way for people to access your work, so they can make their own opinion about its quality. 

In principle, universities create websites for their employees. However, things can get complicated, especially for ECRs. ECRs often change affiliations, and sometimes go for months without an official job. For example, I refer to myself as a “post-doc in transit”: my two-year post-doc contract at the University of Padova ran until March last year, and I’m currently on a part-time short-term contract at the University of Munich until I (hopefully) get my own funding. In the meantime, I don’t have a website at the University of Munich, only an out-of-date and incomplete website at the University of Padova, and a still-functioning, rather detailed and up-to-date website at the Centre of Cognition and its Disorders (where I did my PhD in 2011-2014; I’m still affiliated with the CCD as an associate investigator until this year, so this site will probably disappear or stop being updated rather soon). Several people pointed out, in the responses to my Twitter poll, that they get a negative impression if they google a researcher and find only an incomplete university page: this may come across as laziness or not caring.

What kind of information should be available about an ECR? First, their current contact details. I somehow assumed that my email address would be findable by anyone who looks for it, but come to think of it, people have contacted me through ResearchGate or Twitter to say that they couldn’t find my email address.

Let’s suppose that Professor Awesome is looking to hire a post-doc, and has heard that you’re looking for a job and have all the skills that she needs. She might google you, only to find an outdated university website with an email address that no longer works. In order to contact you, she would need to find you on ResearchGate (where she would probably need an account to contact you), or search for your recent publications, find one where you are the corresponding author, and hope that the listed email address is still valid. At some stage, Professor Awesome might give up and look up the contact details of another ECR who fits the job description.

Admittedly, I have never heard of anyone receiving a job offer via email out of the blue. But one can think of other situations where people might want to contact you with good news: Invitations to review, to become a journal editor, to participate in a symposium, to give a talk at someone else’s department, to collaborate, to give an interview about your research, or simply to discuss some aspects of your work. These things are very likely to increase your chances of getting a position in Professor Awesome’s lab. For me, it remains an open question whether having a website will actually result in any of these things, but I will report back in one year with my anecdotal data on this.

Second, having your own website (rather than relying on your university to create one for you) gives you more control of what people find out about you. In my case, a dry list of publications would probably not bring across my dedication to Open Science, which I see as a big part of my identity as a scientist.

Third, a website can be a useful tool to link to your work: not just a list of publications, but also links to full texts, data, materials and analysis scripts. One can even link to unpublished work. In fact, this was one of my main goals while creating the website. In addition to the list of publications in the CV section, I included information about projects that I’m working on or have worked on in the past. This was a good reason to get myself organised. First, I sorted my studies by overarching research question (which has helped me to figure out: what am I actually doing?). Then, for each study, I added a short description (which has helped me to figure out what I have achieved so far), and links to the full text, data and materials (which helped me to verify that I really did make this information publicly accessible, as I always tell everyone else to do).

Creating the website has therefore been a useful way for me to keep track of what I'm doing. People on Twitter pointed out in their comments that it can also be useful for others: not only for the fictional Professor Awesome who is just waiting to make you a job offer, but also, for example, for students who would like to apply for a PhD at your department and want more information about what people there are doing.

I have included information about ongoing projects, published articles, and projects on hold. Including information about unpublished projects could be controversial: given that preprints are presented alongside published papers, unsuspecting readers might get confused and mistake an unpublished study for a peer-reviewed paper. However, I think that the benefits of making data and materials for unpublished studies available outweigh the costs. Some of these papers are unpublished for practical reasons (e.g., because I ran out of resources to conduct a follow-up experiment). And even if an experiment turned out to be unpublishable because I made some mistakes in the experimental design, other people might learn from my mistakes when conducting their own research. This is one of the main reasons why I created the website: to make all aspects of all of my projects fully accessible.

Conclusion
As with everything, there are pros and cons to creating a personal website. A con is that some people might perceive you as narcissistic. There are many pros, though: especially as an ECR, you get a platform with information about your work which remains available independently of your employment status. You increase your visibility, so that others can contact you more easily. You can control what others find out about you. And, finally, you can provide information about your work that, for whatever reason, does not come across in your publication list. So, in conclusion: I recommend that ECRs make their own website.