Xenia Schmalz's blog: Anecdotes showing the flaws of the current peer review system: Case 2

Earlier this week, I published a Twitter poll with a question relating to peer review. Here is the poll, as well as the results of 91 voters (see here for a link to the poll and to the responses):

The issue is one that I touched in my last blogpost: When we think that a paper that we are reviewing should be rejected, should we make this opinion clear to the editor, or is it simply our role to list the shortcomings and leave it up to the editor to decide whether they are serious enough to warrant a rejection?

Most respondents would agree to re-review a paper that they think should not be published, but add a very clear statement to the editor about this opinion. This corresponds to the view of the reviewer as a gate keeper, whose task it is to make sure that bad papers don't get published. About half as many respondents would agree to review again with an open mind, and to accept it if, eventually, the authors improve the paper sufficiently to warrant publication. This response reflects the view of a reviewer as a guide, who provides constructive criticism that will help the authors produce a better manuscript. About equally common was the response of declining to re-review in the first place. This reflects the view that it's ultimately not the reviewers' decision whether the paper should be published, but the editor's. The reviewers list the pros and cons, and if the concerns remain unaddressed and the editor still passes it on to the reviewers, clearly the editor doesn't think these concerns are major obstacles to a publication. The problem with this approach is that it creates a loophole for a really bad paper: if the editor keeps inviting re-submissions and critical reviewers only provide one round of peer review, it is only matter of time until the lottery results in only non-critical reviewers who are happy to wave the paper through.

The view that it's the reviewer's role to provide pros and cons, and the editor's role to decide what to do with them, is the one that I held for a while, and which led me to decline a few invitations to re-review that, in retrospect, I regret. One of these I described in my last blogpost, linked above. Today, I'll describe the second case study.

I don't want to attack anyone personally, so I made sure to describe the paper from my last blogpost in as little detail as possible. Here, I'd like describe some more details, because the paper is on a controversial theory which has practical implications, some strong believers, and in my view, close-to-no supporting evidence. Publications which make it look like the evidence is stronger than it actually is can potentially cause damage, both to other researchers, who invest their resources on following up on an illusory effect, and for the general public, who may trust a potential treatment that is not backed up by evidence. The topic is - unsurprisingly for anyone who has read my recent publications (e.g., here and here) - statistical learning and dyslexia.

A while ago, I was asked to review a paper that compared a group of children with dyslexia and a group of children without dyslexia on statistical learning, among with some other cognitive tasks. They showed a huge group difference, and I started to think that maybe I was wrong with my whole skepticism thing. Still, I asked for the raw data, as I do routinely; the authors argued against this with privacy concerns, but added scatterplots of their data instead. At this stage, after two rounds of peer review, I noticed something very strange: There was absolutely no overlap in the statistical learning scores between children with dyslexia and children without dyslexia. After having checked with a stats-savvy friend, I wrote the following point (this is an excerpt from the whole review, with only the relevant information):

"I have noticed something unusual about the data, after inspecting the scatterplots (Figure 2). The scatterplots show the distribution of scores for reading, writing, orthographic awareness and statistical learning, separated by condition (dyslexic versus control). It seems that in the orthographic awareness and statistical learning tasks, there is no overlap between the two groups. I find this highly unlikely: Even if there is a group difference in the population, it would be strange not to find any child without dyslexia who isn’t worse than any child with dyslexia. If we were to randomly pick 23 men and 23 women, we would be very surprised if all women were shorter than all men – and the effects we find in psychology are generally much smaller than the sex difference in heights. Closer to home, White et al. (2006) report a multiple case study, where they tested phonological awareness, among other tasks, in 23 children with dyslexia and 22 controls. Their Figure 1 shows some overlap between the two groups of participants – and, unlike the statistical learning deficit, a phonological deficit has been consistently shown in dozens of studies since the 1980s, suggesting that the population effect size should be far greater for the phonological deficit compared to any statistical learning deficit. In the current study, it even seems that there was some overlap between scores in the reading and writing tasks across groups, which would suggest that a statistical learning task is more closely related to a diagnosis of dyslexia than reading and writing ability. In short, the data unfortunately do not pass a sanity check. I can see two reasons for this: (1) Either, there is a coding error (the most likely explanation I can think of would be some mistake in using the “sort” function in excel), or (2) by chance, the authors obtained an outlier set of data, where indeed all controls performed better than all children with dyslexia on a statistical learning task. I strongly suggest that the authors double check that the data is reported correctly. If this is the case, the unusual pattern should be addressed in the discussion section. If the authors obtained an outlier set of data, the implication is that they are very likely to report a Magnitude Error (see Gelman & Carlin, 2014): The obtained effect size is likely to be much larger than the real population effect size, meaning that future studies using the same methods are likely to give much smaller effect sizes. This should be clearly stated as a limitation and direction for future research."

Months later, I was invited to re-review the paper. The editor, in the invitation letter, wrote that the authors had collected more data and analysed it together with this already existing dataset. This, of course, is not an appropriate course of action, assuming I was right with my sorting function hypothesis (no matter what, to me that still seems like the most plausible benign explanation): analysing a probably non-real and definitely strongly biased dataset with some additional real data points still leads to a very biased final result.

After some hesitation, I declined, with the justification that the editor and other reviewers should decide whether they think that my concerns were justified. Now, again months later, this article has been published, and frequently shows up in my researchgate feed, with recommendations from colleagues who, I feel, would not endorse it if they knew its peer review history. The scatterplots in the published paper show the combined dataset: indeed, among the newly collected data, there is a lot of overlap in statistical learning between the two groups, which adds noise to the unrealistically and suspiciously neat plots from the original dataset. This means that a cynical person looking at this scatterplot is unlikely to come to the same conclusion as I did. To be fair, I did not read the final version of the paper beyond looking at the plots: perhaps the authors honestly describe the very strange pattern that's probably fake in their original dataset, or provide an amazingly logical and obvious reason for this data pattern that I did not think of.

This anecdote demonstrates my own failure in acting as a gatekeeper who prevents articles that should not be published from making it into the peer-reviewed body of literature. The moral for myself is that, from now on, I will agree to re-review papers I've reviewed previously (unless there are some timing constraints that prevent me from doing so), and I will be more clear when my recommendation is not to publish the paper, ever. (In my reviewing experience so far, this happens extremely rarely, but I have learned that it does happen, and not only in this single case.)

As for my last blogpost, I will conclude with some broader questions and vague suggestions about the publication system in general. Some open questions: Are reviewers obliged to do their best to keep a bad paper out of the peer-reviewed literature? Should we blame them if they decline to re-review a paper instead of making sure that some serious concern of theirs has been addressed (and, if so, what about those who decline for a legitimate reason, such as health reasons or leaving academia)? Or is it the editor's responsibility to ensure that all critical points raised by any of the reviewers are addressed before publication? If so, how should this be implemented in practice? Even as a reviewer, I sometimes find that, during the time that passes between having written a review and seeing the revised version, I forgot all about the issues that I'd raised previously. For the editors, remembering all reviewers' points when they probably handle more manuscripts than an average reviewer might be too much to ask.

And as a vague suggestion: To some extent, this issue would be addressed by publishing the reviews along with the paper. This practice wouldn't need to add weight to the manuscript: on the article page, there would simply be an option to download the reviews, next to the option to download any supplementary materials such as the raw data. This is already done, to some extent, by some journals, such as Collabra: Psychology. However, the authors need to agree to this, which for a case such as the one I described above seems very unlikely. To really address the issue, publishing the reviews (whether with or without the reviewers' identities) would need to be compulsory. This would come with the possibility of collateral damage to authors if a reviewer throws around wild and unjustified accusations. Post-publication peer review, such as is done on PubPeer, would not fully address this particular issue. First, it comes with the same danger of unjustified criticism potentially damaging honest authors' reputation. Second, ultimately, a skeptical reviewer who doesn't follow the paper until the issues are resolved or the paper is rejected, helps the authors to hide these issues, such that another skeptical reader will not be able to spot them so easily without knowing about the peer review history.

Xenia Schmalz's blog

Friday, October 16, 2020

Anecdotes showing the flaws of the current peer review system: Case 2

No comments:

Post a Comment

Blog Archive