A lot of good things start with “r”: Replicability, reproducibility, robustness, reading, running, relaxing, and, if you count German words as well, also “travelling” and “cycling”. Some of these r-words are more controversial than others. “Replicability” and “reproducibility” often occur alongside the word “crisis”, suggesting a negative connotation. On a more basic level, some r-words are better defined than others. While “running” is relatively well defined, everyone seems to insist on defining replicability and reproducibility in their own way. Perhaps this is one of the reasons why there is little consensus about whether or not there is a crisis, and arguably little progress in resolving a potential one.
In this blog post, I aim to take a step back and ask: What are replicability and reproducibility? Can we come up with a definition that generalises across fields and research steps? Perhaps, by re-thinking the terminology, we can get a tiny bit closer to narrowing down the issues and the reasons why these r-words are important for science.
How are the words “reproducibility” and “replicability” used? In my bubble, the most common usage follows the definitions of The Turing Way (The Turing Way Community, 2022): Replicability refers to studies that apply the same experimental methods as an already published article, collect new data, and assess whether the results are approximately in line with those in the published article. Reproducibility refers to analysing already existing data with identical methods. The implication is that reproducibility studies should obtain results identical to those of the original study, while we expect the results of replicability studies to vary due to sampling error.
There are two issues with the use of this terminology. The first is a lack of consensus across, and even within, fields. Famously, the study that arguably started the whole replication crisis debate referred to “reproducibility”, even though it was empirically estimating replicability. In fact, the title of the article was “Estimating the reproducibility of psychological science” (Open Science Collaboration, 2015). This is inconsistent with the definitions that were later proposed by The Turing Way. To complicate matters, other fields use different terminology. For example, in computational neuroscience, McDougal, Bulanova, and Lytton (2016) defined a replicable simulation as one that “can be repeated exactly (e.g., by rerunning the source code on the same computer)”, and a reproducible simulation as one that “can be independently reconstructed based on a description of the model” (p. 2). They further specify that “a replication should give precisely identical results, while a reproduction will give results which are similar but often not identical.” This is the opposite of how the terms are used in psychological science, where a replication is expected to yield results that are approximately similar to those of the original study, and a reproduction is expected to yield identical results.
This brings us to the second issue, which I will pose as an open question. Are there any useful features that can be used to distinguish between reproducibility and replicability on a more general level? Using The Turing Way definition, for example, reproducibility implies an exact repetition of a previous study’s processes. In a replicability study, some puzzle pieces are missing, and the replicator needs to re-create them; classically, this would be the collection of data using existing experimental methods and materials. However, this feature of exactness versus approximation is neither clear-cut nor generalisable across fields or research processes. Even in a reproducibility study, important information is often missing, and the reproducer needs to fill in the gaps, thus deviating from the concept of exactness (Seibold et al., 2021).
The Turing Way definition also maps neatly onto the process of data analysis, as this is the focus of that community. But how do we distinguish between reproducibility and replicability in fields that are less centred on collecting new data? For example, what do replicability and reproducibility mean when one considers a systematic review? We can probably all agree that, when someone does a systematic review, they should transparently document their decision steps and search procedure. But how do we map the concepts of reproducibility and replicability onto this research process? Is a reproducibility study possible, or even useful, given that, over time, newly published studies may need to be included in the output of the systematic search?
To resolve these issues, it may be worth re-thinking how the word “replicability” is used. While the focus of The Turing Way – and possibly that of the main chunk of the scientific community – is on the level of the data analysis, we could shift the focus from the data analysis to the process that a replicability study really wants to reproduce: the data collection process. This gives us a narrower definition: A replicability study is one that aims to reproduce the data collection process. In this case, we are using the word “reproducibility” to define “replicability”. Does this lead us down a rabbit hole? Or, vice versa, does it help us bring some clarity into what we actually mean when we use various r-words?
This may be one of those very rare occasions when we can improve things by subtraction rather than addition (Winter, Fischer, Scheepers, & Myachykov, 2023). What if we removed the word “replication” from our vocabulary? This would leave us with “reproducibility”. If we want to refer to what we now call “replicability”, we would simply specify: “reproducibility of the data collection process”. And if we want to talk about what we now call “reproducibility”, we would say: “reproducibility of the data analysis process”, or, to be even more specific, “reproducibility of the analysis script” or “reproducibility of the reported results”.
There would be some advantages to such a shift. First, my impression is that explaining the difference between reproducibility and replicability, say, in my Open Science workshops, is more complicated than it should be. The proposed change in terminology would simplify things. Second, we would create a more general terminology that could be used across all fields of science and research. This should allow for more fruitful discussion across fields, letting us learn from each other’s mistakes and solutions. By adding qualifiers and referring to the research steps we have in mind when we talk about reproducibility, we would not lose any clarity or specificity. Third, we might shift the focus of the replication crisis debate away from the single step of data analysis and consider other research processes where reproducibility may be equally important.
Important for what, you may ask? A more generic definition calls for a more generic answer to the question of what it is good for. Reproducibility exists on two levels: First, the researchers doing the original work should work in such a way that they document all relevant information. Second, reproducers ought to verify the original work. The obvious purpose of this is error detection. As much as everyone dislikes the idea of other people finding errors in one’s work, we can probably still agree that we don’t want to build on a research topic where the main finding reflects a banal coding error. The less obvious purpose is to identify alternative paths: For example, it may be clear that Researcher A inferred Y from X; Researcher B may question the validity of this inference and propose and test an alternative explanation. Also, and perhaps less obvious to more experienced researchers, there is value in working reproducibly at all stages of the research process so that others can learn from one’s work.
In summary, reproducibility is a good thing; terminological messes are not. The distinction between reproducibility and replicability may make matters overly complicated, and simplifying things by referring to “reproducibility” plus a specification of the research process may be a good way forward.
References
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251).
The Turing Way Community. (2022). The Turing Way: A handbook for reproducible, ethical and collaborative research (Version 1.0.2). Zenodo.
McDougal, R. A., Bulanova, A. S., & Lytton, W. W. (2016). Reproducibility in computational neuroscience models and simulations. IEEE Transactions on Biomedical Engineering, 63(10), 2021-2035.
Seibold, H., Czerny, S., Decke, S., Dieterle, R., Eder, T., Fohr, S., . . . Kopper, P. (2021). A computational reproducibility study of PLOS ONE articles featuring longitudinal data analyses. PloS One, 16(6), e0251194.
Winter, B., Fischer, M. H., Scheepers, C., & Myachykov, A. (2023). More is better: English language statistics are biased toward addition. Cognitive Science, 47(4), e13254.