Friday, October 16, 2020

Anecdotes showing the flaws of the current peer review system: Case 2

Earlier this week, I published a Twitter poll with a question relating to peer review. Here is the poll, as well as the results of 91 voters (see here for a link to the poll and to the responses):

The issue is one that I touched on in my last blogpost: When we think that a paper that we are reviewing should be rejected, should we make this opinion clear to the editor, or is it simply our role to list the shortcomings and leave it up to the editor to decide whether they are serious enough to warrant a rejection?

Most respondents would agree to re-review a paper that they think should not be published, but add a very clear statement to the editor about this opinion. This corresponds to the view of the reviewer as a gatekeeper, whose task it is to make sure that bad papers don't get published. About half as many respondents would agree to review again with an open mind, and to accept the paper if, eventually, the authors improve it sufficiently to warrant publication. This response reflects the view of a reviewer as a guide, who provides constructive criticism that will help the authors produce a better manuscript. About equally common was the response of declining to re-review in the first place. This reflects the view that it's ultimately not the reviewers' decision whether the paper should be published, but the editor's. The reviewers list the pros and cons, and if the concerns remain unaddressed and the editor still passes the paper on to the reviewers, clearly the editor doesn't think these concerns are major obstacles to publication. The problem with this approach is that it creates a loophole for a really bad paper: if the editor keeps inviting re-submissions and critical reviewers only provide one round of peer review, it is only a matter of time until the lottery results in only non-critical reviewers who are happy to wave the paper through.

The view that it's the reviewer's role to provide pros and cons, and the editor's role to decide what to do with them, is the one that I held for a while, and which led me to decline a few invitations to re-review that, in retrospect, I regret. One of these I described in my last blogpost, linked above. Today, I'll describe the second case study. 

I don't want to attack anyone personally, so I made sure to describe the paper from my last blogpost in as little detail as possible. Here, I'd like to describe some more details, because the paper is on a controversial theory which has practical implications, some strong believers, and in my view, close-to-no supporting evidence. Publications which make it look like the evidence is stronger than it actually is can potentially cause damage, both to other researchers, who invest their resources in following up on an illusory effect, and to the general public, who may trust a potential treatment that is not backed up by evidence. The topic is - unsurprisingly for anyone who has read my recent publications (e.g., here and here) - statistical learning and dyslexia.

A while ago, I was asked to review a paper that compared a group of children with dyslexia and a group of children without dyslexia on statistical learning, along with some other cognitive tasks. The authors showed a huge group difference, and I started to think that maybe I was wrong with my whole skepticism thing. Still, I asked for the raw data, as I do routinely; the authors argued against this with privacy concerns, but added scatterplots of their data instead. At this stage, after two rounds of peer review, I noticed something very strange: There was absolutely no overlap in the statistical learning scores between children with dyslexia and children without dyslexia. After having checked with a stats-savvy friend, I wrote the following point (this is an excerpt from the whole review, with only the relevant information):

"I have noticed something unusual about the data, after inspecting the scatterplots (Figure 2). The scatterplots show the distribution of scores for reading, writing, orthographic awareness and statistical learning, separated by condition (dyslexic versus control). It seems that in the orthographic awareness and statistical learning tasks, there is no overlap between the two groups. I find this highly unlikely: Even if there is a group difference in the population, it would be strange not to find any child without dyslexia who isn’t worse than any child with dyslexia. If we were to randomly pick 23 men and 23 women, we would be very surprised if all women were shorter than all men – and the effects we find in psychology are generally much smaller than the sex difference in heights. Closer to home, White et al. (2006) report a multiple case study, where they tested phonological awareness, among other tasks, in 23 children with dyslexia and 22 controls. Their Figure 1 shows some overlap between the two groups of participants – and, unlike the statistical learning deficit, a phonological deficit has been consistently shown in dozens of studies since the 1980s, suggesting that the population effect size should be far greater for the phonological deficit compared to any statistical learning deficit. In the current study, it even seems that there was some overlap between scores in the reading and writing tasks across groups, which would suggest that a statistical learning task is more closely related to a diagnosis of dyslexia than reading and writing ability. In short, the data unfortunately do not pass a sanity check. I can see two reasons for this: (1) Either, there is a coding error (the most likely explanation I can think of would be some mistake in using the “sort” function in excel), or (2) by chance, the authors obtained an outlier set of data, where indeed all controls performed better than all children with dyslexia on a statistical learning task. 
I strongly suggest that the authors double check that the data is reported correctly. If this is the case, the unusual pattern should be addressed in the discussion section. If the authors obtained an outlier set of data, the implication is that they are very likely to report a Magnitude Error (see Gelman & Carlin, 2014): The obtained effect size is likely to be much larger than the real population effect size, meaning that future studies using the same methods are likely to give much smaller effect sizes. This should be clearly stated as a limitation and direction for future research."
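
As a rough illustration of the sanity check in the excerpt above, here is a minimal simulation sketch. It assumes that scores in both groups are normally distributed with equal standard deviations and that each group has 23 participants (the number borrowed from the height example in the review, not necessarily the study's sample size). The function name `prob_no_overlap` and the effect sizes tried are hypothetical choices made for this illustration. The sketch estimates how often every control would score higher than every child with dyslexia for a given standardised group difference `d`.

```python
import random


def prob_no_overlap(d, n_per_group=23, n_sims=5000, seed=0):
    """Estimate the probability that two simulated groups do not overlap at all.

    Assumptions (illustrative only): both groups are normally distributed
    with SD = 1, and the control mean is `d` SDs above the dyslexia mean.
    """
    rng = random.Random(seed)
    separated = 0
    for _ in range(n_sims):
        # Lowest score among the simulated controls
        lowest_control = min(rng.gauss(d, 1.0) for _ in range(n_per_group))
        # Highest score among the simulated children with dyslexia
        highest_dyslexic = max(rng.gauss(0.0, 1.0) for _ in range(n_per_group))
        if lowest_control > highest_dyslexic:
            separated += 1
    return separated / n_sims


if __name__ == "__main__":
    # Illustrative effect sizes, from moderate to implausibly large
    for d in (0.5, 1.0, 2.0, 4.0, 6.0):
        print(f"d = {d}: estimated P(no overlap) = {prob_no_overlap(d):.4f}")
```

Under these assumptions, complete separation becomes likely only when the group difference is several standard deviations, which is far larger than the effects typically reported in psychology.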

Months later, I was invited to re-review the paper. The editor, in the invitation letter, wrote that the authors had collected more data and analysed it together with this already existing dataset. This, of course, is not an appropriate course of action, assuming I was right with my sorting function hypothesis (no matter what, to me that still seems like the most plausible benign explanation): analysing a probably non-real and definitely strongly biased dataset with some additional real data points still leads to a very biased final result.
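
To make the point about pooling concrete, here is a small arithmetic sketch. It assumes equal numbers of children per group within each batch of data, and all numbers used are made up for illustration; they are not taken from the paper. Under that assumption, the group difference in the combined dataset is simply the sample-size-weighted average of the differences in the two batches, so an inflated first batch keeps pulling the combined estimate upwards.

```python
def pooled_difference(diff_old, n_old, diff_new, n_new):
    """Group mean difference after combining two batches of data.

    Assumes each batch has the same number of participants in both groups
    (n_old per group in the first batch, n_new per group in the second).
    The pooled difference is then a weighted average of the batch differences.
    """
    return (diff_old * n_old + diff_new * n_new) / (n_old + n_new)


if __name__ == "__main__":
    # Hypothetical values: an implausibly large difference in the first batch
    # and a small one in the newly collected batch (in arbitrary score units).
    print(pooled_difference(diff_old=3.0, n_old=23, diff_new=0.2, n_new=23))
```

With these hypothetical values, the combined difference sits halfway between the two batches, so it is still many times larger than the difference in the new data alone.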

After some hesitation, I declined, with the justification that the editor and other reviewers should decide whether they think that my concerns were justified. Now, again months later, this article has been published, and frequently shows up in my ResearchGate feed, with recommendations from colleagues who, I feel, would not endorse it if they knew its peer review history. The scatterplots in the published paper show the combined dataset: indeed, among the newly collected data, there is a lot of overlap in statistical learning between the two groups, which adds noise to the unrealistically and suspiciously neat plots from the original dataset. This means that even a cynical person looking at these scatterplots is unlikely to come to the same conclusion as I did. To be fair, I did not read the final version of the paper beyond looking at the plots: perhaps the authors honestly describe the very strange pattern that's probably fake in their original dataset, or provide an amazingly logical and obvious reason for this data pattern that I did not think of.

This anecdote demonstrates my own failure in acting as a gatekeeper who prevents articles that should not be published from making it into the peer-reviewed body of literature. The moral for myself is that, from now on, I will agree to re-review papers I've reviewed previously (unless there are some timing constraints that prevent me from doing so), and I will be clearer when my recommendation is not to publish the paper, ever. (In my reviewing experience so far, this happens extremely rarely, but I have learned that it does happen, and not only in this single case.)

As in my last blogpost, I will conclude with some broader questions and vague suggestions about the publication system in general. Some open questions: Are reviewers obliged to do their best to keep a bad paper out of the peer-reviewed literature? Should we blame them if they decline to re-review a paper instead of making sure that some serious concern of theirs has been addressed (and, if so, what about those who decline for a legitimate reason, such as health reasons or leaving academia)? Or is it the editor's responsibility to ensure that all critical points raised by any of the reviewers are addressed before publication? If so, how should this be implemented in practice? Even as a reviewer, I sometimes find that, during the time that passes between having written a review and seeing the revised version, I have forgotten all about the issues that I'd raised previously. For the editors, remembering all reviewers' points when they probably handle more manuscripts than an average reviewer might be too much to ask.

And as a vague suggestion: To some extent, this issue would be addressed by publishing the reviews along with the paper. This practice wouldn't need to add weight to the manuscript: on the article page, there would simply be an option to download the reviews, next to the option to download any supplementary materials such as the raw data. This is already done, to some extent, by some journals, such as Collabra: Psychology. However, the authors need to agree to this, which for a case such as the one I described above seems very unlikely. To really address the issue, publishing the reviews (whether with or without the reviewers' identities) would need to be compulsory. This would come with the possibility of collateral damage to authors if a reviewer throws around wild and unjustified accusations. Post-publication peer review, such as is done on PubPeer, would not fully address this particular issue. First, it comes with the same danger of unjustified criticism potentially damaging honest authors' reputation. Second, ultimately, a skeptical reviewer who doesn't follow the paper until the issues are resolved or the paper is rejected helps the authors to hide these issues, such that another skeptical reader will not be able to spot them so easily without knowing about the peer review history.

Thursday, October 8, 2020

Anecdotes showing the flaws of the current peer review system: Case 1

A friend, who had decided not to pursue a PhD and an academic career after finishing his Master's degree, asked me how it's possible that so many of the papers that are published in peer-reviewed journals are - well - bullshit. As a response, I told him about a recent experience of mine.

A while ago, I was asked to review a paper by a journal with a pretty high impact factor. I agreed: the paper was right in my area of expertise and sounded very interesting. When I read the manuscript, however, I was less enthusiastic. Let's say: I've seen better papers desk-rejected by lower impact factor journals. This was a sloppily designed study with overstated conclusions. I wrote the review following my standard template: First, summarise the paper in a few sentences, then write something nice about it, then list major and minor points, with suggestions that would address them whenever possible. I hold on to the belief that any study that the authors thought was worth conducting is also worth publishing, at least in some form. In the paper, I detected a potential major confound, and I had the impression that the authors wanted to hide some of the information relating to it, so I asked for clarifications.

I submitted my review, and as always, a while later, received the decision letter. The other reviews were also lukewarm at best, so I was very surprised that the action editor invited a revision! When the authors resubmitted the paper, I agreed to review it again. However, most of my comments remained unaddressed, and my overall impression was that of the authors trying to hide some of the design flaws to blow up the importance of the conclusions. I wrote a slightly less reserved review, stating more clearly that I didn't think the paper should be published unless the authors addressed my comments. When I was invited to participate in the third round of reviews, I declined: I just didn't want to deal with it. 

Several months later, I saw the paper published in the very same high impact factor journal. As the academic world is small, I now knew for sure what I had suspected despite the anonymity of the peer review process: the senior author of that paper was a friend of the action editor's.

This is, of course, an anecdote, coloured by my own perceptions and preconceptions. There is nothing to suggest, other than my own impression, that the paper was published only because of the friendship between the author and editor. Maybe (probably) I'm way too skeptical in my reading of articles. That was also one of the reasons why I had declined to do a third round of review: I wanted to leave it up to the editor and the other reviewers to decide whether my concerns were justified. But let's be honest: Is anyone truly surprised that there are some cases where editors are more lenient when they personally know the author(s)? And, if we are truly honest, isn't this just a very natural thing that we do ourselves whenever we judge our colleagues' papers, be it as reviewers or editors or simply as readers: letting people we know and like get away with things that we would judge strangers harshly for? 

Maybe this anecdote, along with your own personal experiences, is convincing enough to show that at least sometimes, personal interest interferes with objective judgements and allows articles to pass peer review when they wouldn't hold up to scrutiny under other circumstances. This raises two questions, to which I don't have an answer: How often does this happen, and is this really a problem? And, more importantly, what is a better system? 

For years, I've been an advocate for as much transparency as possible in all aspects of the research process, and in line with this principle, I started signing my reviews shortly after I finished my PhD (though I stopped signing them later). Now, I am coming to the conclusion that anonymity has substantial advantages, not only if the reviewers don't know the identity of the authors, but also if the editors don't know the identity of the authors. Would this help? Well, maybe not. Years ago, I was told by a senior researcher that it doesn't matter whether peer review is anonymous or not, because it's normally obvious who exactly - or at least which lab - produced the paper. In my experience (I've reviewed ca. 60 papers since then), I'd say this is often true, and when I review an anonymous paper I cannot stop myself from taking a guess at who the authors are.

So, to conclude, I don't have the answers to the two questions I asked above. But I do know that experiencing such anecdotes leaves me discouraged and frustrated about a system where one's chances of being employed are determined based on whether one publishes in high impact factor journals or not.