Tuesday, March 27, 2018

Can peer review be objective in “soft” sciences?

Love it or hate it – peer review is likely to elicit strong emotional reactions from researchers, at least at those times when they receive an editorial letter with a set of reviews. Reviewer 1 is mildly positive, Reviewer 2 says that the paper would be tolerable if you rewrote the title, abstract, introduction, methods, results and discussion sections, and Reviewer 3 seems to have missed the point of the paper altogether.
This is not going to be an anti-peer-review blog post. In general, I like peer review (even if you might get a different impression if you talk to me right after I get a paper rejected). In principle, peer review is an opportunity to engage in academic discussion with researchers whose research interests are similar to yours. I have learned a lot from having my papers peer reviewed, and most of my papers have been substantially improved by receiving feedback from reviewers, who often have different perspectives on my research question.

In practice, unfortunately, the peer review process is not a pleasant chat between two researchers who are experts on the same topic. The reviewer has the power to influence whether the paper will be rejected or accepted. With the pressure to publish, the reviewer may well delay the graduation of a PhD student or impoverish a researcher’s publication record just before an important deadline. That is a separate matter, which has more to do with the incentive system than with the peer review system. The thing about peer review is, though, that it should be as objective as possible, especially given that, in practice, a PhD student’s graduation may depend on it.

Writing an objective peer review is probably not a big issue for harder sciences and mathematics, where verifying the calculations should often be enough to decide whether the conclusions of the paper are warranted. In contrast, in soft sciences, one can always find methodological flaws, think of potential confounds or alternative explanations that could explain the results, or require stronger statistical evidence. The limit is the reviewer’s imagination: whether your paper gets accepted or not may well be a function of the reviewers’ mood, creativity, and implicit biases for or against your lab.

This leads me to the goal of the current blog post: Can we achieve objective peer review in psychological science? I don’t have an answer to this. But here, I aim to summarise the kinds of things that I generally pay attention to when I review a paper (or would like to pay more attention to in the future), and hope for some discussion (in the comments, on Twitter) about whether or not this constitutes as-close-to-objective-as-possible peer review.

Title and abstract
Here, I generally check whether the title and abstract reflect the results of the paper. This shouldn’t even be an issue, but I have reviewed papers where the analysis section described a clear null result, but the title and abstract implied that the effect was found.

Introduction
Here, I aim to check whether the review of the literature is complete and unbiased, to the best of my knowledge. As examples of issues that I would point out: the authors selectively cite studies with positive results (or, worse: cite studies with null results as if they had found positive results), or misattribute a finding or theory. As minor points, I note if I cannot follow the authors’ reasoning.
I also consider the a priori plausibility of the authors’ hypothesis. The idea is to try and pick up on instances of HARKing, or hypothesising after the results are known. If there is little published information on an effect, but the authors predict the exact pattern of results of a 3x2x2x5 ANOVA, I ask them to clarify whether the results were found in exploratory analyses, and if so, to rewrite the introduction accordingly. Exploratory results are valuable and should be published, but should not be phrased as confirmatory findings, I write.
It is always possible to list other relevant articles or other theories which the authors should cite in the introduction (e.g., more of the reviewer’s papers). Here, I try to hold back with suggestions: the reader will read the paper through the lens of their own research question, anyway, and if the results are relevant for their own hypothesis, they will be able to relate them without the authors writing a novel on all possible perspectives from which their paper could be interesting.

Methods
No experiment’s methods are perfect. Some imperfections make the results uninterpretable; other types of imperfection should be pointed out as limitations; and yet others are found in all papers using the paradigm, so it’s perfectly OK to have them in a published paper, unless they are a reviewer’s personal pet peeve. In some instances, it is even considered rude to point out certain imperfections. Sometimes, pointing out imperfections will just result in the authors citing some older papers which have the same imperfections. Some imperfections can be addressed with follow-up analyses (e.g., by including covariates), but in this case it’s not clear what the authors should do if they get ambiguous results or results that conflict with the original analyses.
Perhaps this is the section with which you can always sink a paper, if you want: if for no other reason, then, in most cases, because the experiment(s) are underpowered. It probably varies from topic to topic and from lab to lab what level of imperfection can be tolerated. I can’t think of any general rules or things that I look at when evaluating the adequacy of the experimental methods. If authors reported a priori power analyses, one could objectively scrutinise their proposed effect size. In practice, demanding power analyses in a review would be likely to lead to some post-hoc justifications on the part of the authors, which is not the point of power analyses.
So, perhaps the best thing is to simply ask the authors for the 21-word statement, proposed by Simmons et al., which includes a clarification about whether or not the sample size, analyses, and comparisons were determined a priori. I must admit that I don’t do this (but will start to do this in the future): so far, failing to include such a declaration, in my area, seems to fall into the category of “imperfections that are found in all papers, so it could be seen as rude to point them out”. But this is something that ought to change.
Even though the methods themselves may be difficult to review objectively, one can always focus on the presentation of the methods. Could the experiment be reproduced by someone wanting to replicate the study? It is always best if the materials (the actual stimuli that were used, and the scripts (Python, DMDX) that were used for item presentation) are available as appendices. For psycholinguistic experiments, I ask for a list of words with their descriptive statistics (frequency, orthographic neighbourhood, other linguistic variables that could be relevant).
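To make the earlier point about scrutinising a proposed effect size a bit more concrete, here is a minimal sketch of what an a priori power calculation boils down to. It is an illustration under stated assumptions, not a recommendation for any particular design: it assumes a two-sided test comparing two independent groups of equal size, an effect size expressed as Cohen’s d, and the normal approximation (which gives slightly smaller numbers than an exact t-based calculation). The function name `n_per_group` is made up for this example.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size_d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants needed per group (normal approximation).

    Assumes a two-sided test on two independent, equally sized groups,
    with the expected effect expressed as Cohen's d.
    """
    if effect_size_d == 0:
        raise ValueError("An effect size of zero cannot be detected at any sample size.")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = standard_normal.inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)


if __name__ == "__main__":
    # How the required sample size grows as the assumed effect shrinks.
    for d in (0.8, 0.5, 0.2):
        print(f"d = {d}: about {n_per_group(d)} participants per group")
```

The useful part for a reviewer is the dependence on d: halving the assumed effect size roughly quadruples the required sample, so an optimistic effect size estimate is the easiest way to make a small study look adequately powered.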

Results
In some ways, results sections are the most fun to review. (I think some people whose papers I reviewed would say that this is the section that is the least fun to get reviewed by me.) The first question I try to answer is: Is it likely that the authors are describing accidental patterns in random noise? As warning signs, I take a conjunction of small sample sizes, strange supposedly a priori hypotheses (see “Methods” section), multiple comparisons without corrections, and relatively large p-values for the critical effects.
Content-wise: Do the analyses and results reflect the authors’ hypotheses and conclusions? Statistics-wise: Are there any strange things? For example, are the degrees of freedom in order, or could there be some mistake during data processing?
There are other statistical things that one could look at, which I have not done to date, but perhaps should start doing. For example, are the descriptive statistics mathematically possible? One can use Nick Brown’s and James Heathers’ GRIM test for this. Is the distribution of the variables, as described by the means and standard deviations, plausible? If there are multiple experiments, are there suspiciously many significant p-values despite low experimental power? Uli Schimmack’s Incredibility Index can be used to check this. Doing such routine checks is very uncommon in peer review, as far as I know. Perhaps reviewers don’t want to include anything in their report that could be misconstrued (or correctly construed) as implying that there is some kind of fraud or misconduct involved. On the other hand, it should also be in authors’ best interests if reviewers manage to pick up potentially embarrassing honest mistakes. Yes, checking such things is a lot of work, but arguably it is the reviewer’s job to make sure that the paper is, objectively, correct, i.e., that the results are not due to some typo, trimming error or more serious issues. This is comparable to how reviewers of papers in mathematics have to reproduce all calculations, and how journals such as the Journal of Statistical Software verify the reproducibility of all code before sending a paper out for review.
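To give an idea of how little is needed for the first of these checks, here is a minimal sketch of the logic behind a GRIM-style test. It rests on a simple observation: if the raw data are integers (e.g., responses on a Likert scale), the sum of N responses must be an integer, so only certain means are possible for a given N. The sketch assumes integer-valued data and a mean reported to a fixed number of decimals; the function name `grim_consistent` is made up for this example, and this is not the authors’ own implementation. Because papers differ in how they round values ending in 5, the check accepts either rounding direction.

```python
import math
from decimal import Decimal, ROUND_HALF_DOWN, ROUND_HALF_UP


def grim_consistent(reported_mean: str, n: int) -> bool:
    """Check whether a reported mean is possible for n integer-valued responses.

    The mean is passed as a string (e.g. "5.19") so that the number of
    reported decimals, including trailing zeros, is preserved.
    """
    if n <= 0:
        raise ValueError("Sample size must be a positive integer.")
    mean = Decimal(reported_mean)
    decimals = max(0, -mean.as_tuple().exponent)  # decimals in the reported mean
    quantum = Decimal(1).scaleb(-decimals)         # e.g. 0.01 for two decimals

    # The true sum of the responses must be an integer close to mean * n.
    target = mean * n
    for total in range(math.floor(target) - 1, math.ceil(target) + 2):
        exact_mean = Decimal(total) / Decimal(n)
        for rounding in (ROUND_HALF_UP, ROUND_HALF_DOWN):
            if exact_mean.quantize(quantum, rounding=rounding) == mean:
                return True
    return False


if __name__ == "__main__":
    print(grim_consistent("5.18", 28))  # 145 / 28 rounds to 5.18
    print(grim_consistent("5.19", 28))  # no integer sum gives 5.19 with N = 28
```

A failed check does not by itself say what went wrong: a typo, an unreported exclusion, or a different N for that particular item would all produce the same inconsistency, which is why it is a prompt for a question to the authors rather than a verdict.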
And speaking of reproducibility: Ideally, the analyses and results of any paper should be reproducible. This means that the reviewers (or anyone else, for that matter) can take the same data, run the same analyses, and get the same results. In fact, this is more than an ideal scenario: running the same analyses and getting the same results, as opposed to getting results that are not at all compatible with the authors’ conclusions, is kind of a must. This requires that authors upload their data (unless this is impossible, e.g., due to privacy issues) and an analysis script.
The Peer Reviewers’ Openness (PRO) Initiative proposes that reviewers refuse to review any paper that does not provide the data, i.e., that fails to meet the minimum standard of reproducibility. I have signed this initiative, but admit that I still review papers, even when the data is not available. This is not because I don’t think it’s important to request transparency: I generally get overly excited when I’m asked to review a paper, and get halfway through the review before I remember that, as a signatory of the PRO, I shouldn’t be reviewing it at all until the authors provide the data or a reason why this is not possible. I compromise by including a major concern at the beginning of my reviews, stating that the data should be made available unless there are reasons for not making it public. So far, I think, I’ve succeeded only once in convincing the authors to actually upload their data, and a few editors have mentioned my request in their decision letter.

Discussion
Here, the main questions I ask are: Are the conclusions warranted by the data? Are any limitations clearly stated? Can I follow the authors’ reasoning? Is it sufficiently clear which conclusions follow from the data, and which are more speculative?
As with the introduction section, it’s always possible to suggest alternative hypotheses or theories for which the results may have relevance. Again, I try not to get too carried away with this, because I see it as the reader’s task to identify any links between the paper and their own research questions.

In conclusion
Peer review is a double-edged sword. Reviewers have the power to influence an editor’s decision, and should use it wisely. In order to be an unbiased gate-keeper to sift out bad science, a reviewer’s report ought to be as objective as possible. I did not aim to make this blog post about Open Science, but looking through what I wrote so far, making sure that the methods and results of a paper are openly available (if possible) and reproducible might be the major goal of an objective peer reviewer. After all, if all information is transparently presented, each reader has the information they need in order to decide for themselves whether they want to believe in the conclusions of the paper. The probability of your paper being accepted for publication would no longer depend on whether your particular reviewers happen to find your arguments and data convincing.  

I will finish the blog post with an open question: Is it possible, or desirable, to have completely objective peer review in psychological science?