Love it or hate it – peer review is likely to elicit
strong emotional reactions from researchers, at least at those times when they
receive an editorial letter with a set of reviews. Reviewer 1 is mildly
positive, Reviewer 2
says that the paper would be tolerable if you rewrote the title, abstract,
introduction, methods, results and discussion sections, and Reviewer 3 seems to
have missed the point of the paper altogether.
This is not going to be an
anti-peer-review blog post. In general, I like peer review (even if you might
get a different impression if you talk to me right after I get a paper
rejected). In principle, peer review is an opportunity to engage in academic
discussion with researchers whose research interests are similar to yours. I have
learned a lot from having my papers peer reviewed, and most of my papers have
been substantially improved by receiving feedback from reviewers, who often
have different perspectives on my research question.
In practice, unfortunately, the
peer review process is not a pleasant chat between two researchers who are
experts on the same topic. The reviewer has the power to influence whether the
paper will be rejected or accepted. With the pressure to publish, the reviewer
may well delay the graduation of a PhD student or impoverish a researcher’s
publication record just before an important deadline. That’s a different
matter, one that has more to do with the incentive system than with the peer review
system. The thing about peer review, though, is that it should be as objective
as possible, especially given that, in practice, a PhD student’s graduation may
depend on it.
Writing an objective peer review is
probably not a big issue for harder sciences and mathematics, where verifying
the calculations should often be enough to decide whether the conclusions of
the paper are warranted. In contrast, in soft sciences, one can always find
methodological flaws, think of potential confounds or alternative explanations
that could explain the results, or require stronger statistical evidence. The
limit is the reviewer’s imagination: whether your paper gets accepted or not
may well be a function of the reviewers’ mood, creativity, and implicit biases
for or against your lab.
This leads me to the goal of the
current blog post: Can we achieve objective peer review in psychological
science? I don’t have an answer to this. But here, I aim to summarise the kinds
of things that I generally pay attention to when I review a paper (or would
like to pay more attention to in the future), and hope for some discussion (in
the comments, on Twitter) about whether or not this constitutes
as-close-to-objective-as-possible peer review.
Title and abstract
Here, I generally check whether the
title and abstract reflect the results of the paper. This shouldn’t even be an
issue, but I have reviewed papers where the analysis section described a clear
null result, but the title and abstract implied that the effect was found.
Introduction
Here, I aim to check whether the
review of the literature is complete and unbiased, to the best of my knowledge.
As examples of issues that I would point out: the authors selectively cite
studies with positive results (or, worse, cite studies with null results as if they
had found positive results), or misattribute a finding or theory. As minor points, I note if I
cannot follow the authors’ reasoning.
I also consider the a priori plausibility of the authors’
hypothesis. The idea is to try and pick up on instances of HARKing,
or hypothesising after results are known. If there is little published
information on an effect, but the authors predict the exact pattern of results
of a 3x2x2x5 ANOVA, I ask them to clarify whether the results were found in
exploratory analyses, and if so, then to rewrite the introduction accordingly.
Exploratory results are valuable and should be published, but should not be
phrased as confirmatory findings, I write.
It is always possible to list other
relevant articles or other theories which the authors should cite in the
introduction (e.g., more of the reviewer’s papers). Here, I try to hold back
with suggestions: the reader will read the paper through the lens of their own
research question, anyway, and if the results are relevant for their own
hypothesis, they will be able to relate them without the authors writing a
novel on all possible perspectives from which their paper could be interesting.
Methods
No experiment’s methods are
perfect. Some imperfections make the results uninterpretable; others
should be pointed out as limitations; and yet others are
found in all papers using the paradigm, so it’s
perfectly OK to have them in a published paper, unless they are
a reviewer’s personal pet peeve. In some instances, it is even considered rude
to point out certain imperfections. Sometimes, pointing out imperfections will
just result in the authors citing some older papers which have the same
imperfections. Some imperfections can be addressed with follow-up analyses
(e.g., by including covariates), but in this case it’s not clear what the
authors should do if they get ambiguous results or results that conflict with
the original analyses.
Perhaps this is the section with
which you can always sink a paper, if you want: if for no other reason, then,
in most cases, because the experiment(s) are underpowered. It probably varies from
topic to topic and from lab to lab what level of imperfection can be tolerated.
I can’t think of any general rules or things that I look at when evaluating the adequacy of the experimental methods. If authors reported a
priori power analyses, one could objectively scrutinise their proposed
effect size. In practice, demanding power analyses in a review would be likely
to lead to some post-hoc justifications on the part of the authors, which is
not the point of power analyses.
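To make concrete what scrutinising a proposed effect size could look like, here is a minimal sketch of an a priori sample size calculation. It assumes a simple two-group comparison of means with a two-sided test, and it uses the normal approximation rather than an exact t-test calculation, so the numbers come out slightly smaller than those from dedicated power software. The function name and the example effect sizes are illustrative assumptions, not taken from any particular paper.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate participants needed per group for a two-sided,
    two-sample comparison of means.

    Normal approximation: n = 2 * ((z_{1 - alpha/2} + z_{power}) / d) ** 2.
    This slightly underestimates the n from an exact t-test calculation.
    """
    if effect_size_d <= 0:
        raise ValueError("effect size must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)


# Illustrative standardised effect sizes (Cohen's d)
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

The point of the sketch is how strongly the required sample size depends on the assumed effect size: halving d roughly quadruples the n per group, which is why the proposed effect size is the number worth scrutinising.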
So, perhaps the best thing is to simply
ask the authors for the 21-word
statement, proposed by Simmons et al., which includes a clarification about
whether or not the sample size, analyses, and comparisons were determined a priori. I must admit that I don’t do
this (but will start to do this in the future): so far, failing to include such
a declaration, in my area, seems to fall into the category of “imperfections
that are found in all papers, so it could be seen as rude to point them out”.
But this is something that ought to change.
Even though the methods themselves may be difficult to review objectively, one can always focus on the presentation of the
methods. Could the experiment be reproduced by someone wanting to replicate the
study? It is always best if the materials (actual stimuli that were used,
scripts (Python, DMDX) that were used for item presentation) are available as
appendices. For psycholinguistic experiments, I ask for a list of words with
their descriptive statistics (frequency, orthographic neighbourhood, other
linguistic variables that could be relevant).
Results
In some ways, results sections are
the most fun to review. (I think some people whose papers I have reviewed would say that this is the section that is the
least fun to get reviewed by me.) The first question I try to answer is: Is it
likely that the authors are describing accidental patterns in random noise? As
warning signs, I take a conjunction of small sample sizes, strange supposedly a priori hypotheses (see “Introduction”
section), multiple comparisons without corrections, and relatively large p-values for the critical effects.
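As a rough illustration of why uncorrected multiple comparisons are a warning sign, the sketch below computes the chance of at least one false positive across several tests, and applies the Holm-Bonferroni step-down correction to a set of p-values. The familywise calculation assumes independent tests of true null hypotheses, which is a simplification, and the p-values in the example are made up.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n_tests independent
    tests when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure.

    Returns one True/False flag per p-value, in the original order,
    where True means the null hypothesis is rejected after correction.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The (rank + 1)-th smallest p-value is compared with alpha / (m - rank)
        if p_values[idx] > alpha / (m - rank):
            break  # first failure: this and all larger p-values are retained
        reject[idx] = True
    return reject


# A four-factor ANOVA tests 15 main effects and interactions (2**4 - 1)
print(round(familywise_error_rate(15), 3))

# Hypothetical p-values for four critical comparisons
print(holm_reject([0.04, 0.03, 0.001, 0.20]))
```

In the made-up example, three of the four p-values are below .05, but only the smallest one survives the correction.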
Content-wise: Do the analyses and
results reflect the authors’ hypotheses and conclusions? Statistics-wise: Are
there any strange things? For example, are the degrees of freedom in order, or
could there be some mistake during data processing?
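A degrees-of-freedom check can be as simple as recomputing the expected value from the reported sample sizes. The sketch below covers only two textbook cases (a Student's independent-samples t-test and a one-way between-subjects ANOVA); the function names and numbers are hypothetical, and a mismatch is a reason to ask a question, not evidence of an error.

```python
def expected_df_independent_t(n1, n2):
    """Degrees of freedom for a Student's (equal-variance) independent-samples
    t-test. Welch's test gives a smaller, usually non-integer value."""
    return n1 + n2 - 2


def expected_df_oneway_anova(n_total, n_groups):
    """(numerator, denominator) degrees of freedom for a one-way
    between-subjects ANOVA."""
    return (n_groups - 1, n_total - n_groups)


# Hypothetical report: two groups of 24 participants, result given as t(45)
reported_df = 45
expected = expected_df_independent_t(24, 24)
if reported_df != expected:
    print(f"Reported df = {reported_df}, expected {expected}: "
          "was a participant excluded, or was a different test used?")
```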
There are other statistical things
that one could look at, which I have not done to date, but perhaps should start
doing. For example, are the descriptive statistics mathematically
possible? One can use Nick Brown’s and James Heathers’ GRIM test for this. Is the distribution
of the variables, as described by the means and standard deviations, plausible?
If there are multiple experiments, are there suspiciously many significant p-values despite low experimental power?
Uli Schimmack’s Incredibility
Index can be used to check this. Doing such routine checks is very uncommon
in peer review, as far as I know. Perhaps reviewers don’t want to include
anything in their report that could be misconstrued (or correctly construed) as
implying that there is some kind of fraud or misconduct involved. On the other hand, it should also be in authors’ best interests if reviewers manage to pick up potentially embarrassing honest mistakes. Yes, checking
such things is a lot of work, but arguably it is the reviewer’s job to make
sure that the paper is, objectively, correct, i.e., that the results are not
due to some typo, trimming error or more serious issues, just as reviewers of
papers in mathematics have to reproduce all calculations, and journals such as
the Journal of Statistical
Software verify the reproducibility of all code before sending a paper out
for review.
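Two of the checks mentioned above are easy to sketch in code. The first function illustrates the idea behind the GRIM test: if the raw data are integers (e.g., Likert responses), the mean times the number of observations must round back to a whole number, so some reported means are impossible for a given sample size. The second is a deliberately simplified excess-significance calculation: it assumes independent studies that all have the same true power, and it is not a reimplementation of the Incredibility Index itself. Function names and example numbers are illustrative assumptions.

```python
from decimal import Decimal, ROUND_HALF_DOWN, ROUND_HALF_UP
from math import comb


def grim_consistent(reported_mean, n, items=1):
    """Return True if a mean of integer-valued data could round to reported_mean.

    reported_mean is passed as a string (e.g. "5.19") so that the number of
    reported decimals is preserved; n is the sample size; items is the number
    of integer items averaged per participant.
    """
    target = Decimal(reported_mean)
    # Rounding unit implied by the reported decimals, e.g. Decimal("0.01")
    quantum = Decimal(1).scaleb(target.as_tuple().exponent)
    cells = n * items
    nearest_total = int((target * cells).to_integral_value())
    # Try the integer totals around mean * cells, allowing either rounding
    # convention for values that fall exactly halfway
    for total in (nearest_total - 1, nearest_total, nearest_total + 1):
        exact = Decimal(total) / Decimal(cells)
        for mode in (ROUND_HALF_UP, ROUND_HALF_DOWN):
            if exact.quantize(quantum, rounding=mode) == target:
                return True
    return False


def prob_at_least_k_significant(k, n_studies, power):
    """P(at least k of n_studies reach significance), assuming independent
    studies that all have the same true power."""
    return sum(
        comb(n_studies, j) * power**j * (1 - power) ** (n_studies - j)
        for j in range(k, n_studies + 1)
    )


# Hypothetical reported values: n = 28 integer responses
print(grim_consistent("5.19", 28))  # no integer total gives this mean
print(grim_consistent("5.18", 28))  # 145 / 28 rounds to 5.18

# Five out of five significant results if each study had 50% power
print(prob_at_least_k_significant(5, 5, 0.5))
```

Note that the GRIM idea is only informative for small samples: once the number of observations exceeds what the reported decimals can resolve (more than 100 observations for two decimals), every mean is possible.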
And speaking of reproducibility:
Ideally, the analyses and results of any paper should be reproducible. This
means that the reviewers (or anyone else, for that matter) can take the same
data, run the same analyses, and get the same results. In fact, this is more
than an ideal scenario: running the same analyses and getting the same
results, as opposed to getting results that are not at all compatible with the authors’
conclusions, is kind of a must. This requires that authors upload their data
(unless this is impossible, e.g., due to privacy issues), and an analysis
script.
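In its simplest form, a reproducibility check means recomputing a reported number from the shared data and seeing whether it matches within rounding. The sketch below does this for a mean and a standard deviation; the data and the "reported" values are invented for illustration, and a real check would run the authors' own analysis script.

```python
from statistics import mean, stdev


def matches_reported(recomputed, reported, decimals=2):
    """True if the recomputed value agrees with the reported one, allowing
    for rounding to the given number of decimals."""
    return abs(recomputed - reported) <= 0.5 * 10 ** (-decimals) + 1e-12


# Hypothetical raw data (e.g., reaction times in ms) and hypothetical
# values as they might appear in a manuscript
reaction_times = [512, 498, 530, 475, 505, 520]
reported_mean, reported_sd = 506.67, 19.14

print(matches_reported(mean(reaction_times), reported_mean))
print(matches_reported(stdev(reaction_times), reported_sd))
```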
The Peer Reviewers’ Openness (PRO) Initiative
proposes that reviewers refuse to review any paper that does not provide the
data, i.e., that fails to meet the minimum standard of reproducibility. I have
signed this initiative, but admit that I still review papers, even when the
data is not available. This is not because I don’t think it’s important to request transparency: I
generally get overly excited when I’m asked to review a paper, and get halfway
through the review before I remember that, as a signatory of the PRO, I
shouldn’t be reviewing it at all until the authors provide the data or a reason
why this is not possible. I compromise by including a major concern at the
beginning of my reviews, stating that the data should be made available unless
there are reasons for not making it public. So far, I think, I’ve succeeded
only once in convincing the authors to actually upload their data, and a few
editors have mentioned my request in their decision letter.
Discussion
Here, the main questions I ask are:
Are the conclusions warranted by the data? Are any limitations clearly stated?
Can I follow the authors’ reasoning? Is it sufficiently clear which conclusions
follow from the data, and which are more speculative?
As with the introduction section,
it’s always possible to suggest alternative hypotheses or theories for which
the results may have relevance. Again, I try not to get too carried away with
this, because I see it as the reader’s task to identify any links between the
paper and their own research questions.
In conclusion
Peer review is a double-edged
sword. Reviewers have the power to influence an editor’s decision, and should
use it wisely. In order to be an unbiased gate-keeper to sift out bad science,
a reviewer’s report ought to be as objective as possible. I did not aim to make
this blog post about Open Science, but looking through what I have written so far,
making sure that the methods and results of a paper are openly available (if
possible) and reproducible might be the major goal of an objective peer
reviewer. After all, if all information is transparently presented, each reader has the information they need in order to decide for themselves whether they want to believe in the conclusions of the paper. The probability of your paper being accepted for publication would no longer depend on whether your particular reviewers happen to find your arguments and data convincing.
I will finish the blog post with an
open question: Is it possible, or desirable, to have completely objective peer
review in psychological science?