Friday, July 25, 2025

What can we learn from unsuccessful theories?

We still have a long way to go when it comes to designing good theories, at least in psychological science. On the topic of 'we have a long way to go', psychology is often juxtaposed with physics, where one might get the impression that physics is doing better than psychology, as a science, on all counts. Embarrassingly for psychologists, Popper famously compared Einstein's and Freud's theories, holding up Einstein's theory of relativity as an example of good science because it made a risky prediction that had not yet been, but could be, empirically tested, and thus could have been proven wrong. Freud's theories, on the other hand, were unfalsifiable, as any observed phenomenon could be explained post hoc.

There are already many blog posts that discuss how we should be, or why we shouldn't be, more like physicists. Of course, physicists have advantages that we don't have, such as precise measurements and centuries of mathematical models to build on. Aspiring to improve is always a good thing, but maybe we can also look at the flip side of the coin: not only comparing ourselves to a gold standard and potentially discovering that we just can't match it, but also seeing where others went wrong, so that we can avoid repeating their mistakes.

With these thoughts in mind (or something along these lines), I started digging a bit into theories that didn't work out. So far, I haven't found that much, so I welcome any recommendations for reading on this topic! 

The first unsuccessful theory that I started googling was alchemy. It's often described as what came before chemistry, except the people back then didn't really know what matter was made of, so they did meticulous work and maybe even discovered some principles that are still relevant for modern chemistry, but mainly went on wild goose chases to achieve immortality or to turn base metals into gold. I came across a historical character who sounds pretty cool: Cleopatra the Alchemist (https://en.wikipedia.org/wiki/Cleopatra_the_Alchemist), not to be confused with the queen, who lived in the same country of Egypt but in a different century. Alas, it seems that back in the day, reproducible working was not a thing yet. Apparently, the writings of alchemists are difficult to decipher, because they often wrote in code (https://www.youtube.com/watch?v=gxiLuz9kHi0).

What can we learn from that? First, that we should work reproducibly, even when it comes to documenting our ideas and trains of thought. Second, there may be a broader message there, something about not letting our personal interests dominate our research. Achieving eternal life or unlimited wealth may be some people's ultimate aim in life, but perhaps it's more important to concentrate on the little steps and the scientific achievements that we make on the way there.

The second unsuccessful theory that came to my mind is Lamarckian evolution. This is a theory that, in my undergraduate biology course, was juxtaposed with Darwin's theory of evolution by natural selection. Lamarck built on the obvious observation that children are similar to their parents, and suggested that parents can pass on acquired traits. The example from the textbook was the giraffe's neck: a giraffe stretches its neck to get to the leaves at the top of the tree, and because this makes its neck longer, its children also have longer necks. The example on Wikipedia is a blacksmith, who acquires muscles through his work, and whose children then become physically stronger, too.

Interestingly, the Wikipedia page on Lamarckism (https://en.wikipedia.org/wiki/Lamarckism) has a whole section on "Textbook Lamarckism", criticising exactly what I described above: presenting Lamarckian and Darwinian evolution as a simple contrast, one being bad and the other one being good. Apparently, Darwin believed in the passing on of acquired traits, just as Lamarck did. What we learned in biology class was that Darwin's theory of evolution by natural selection stood the test of time because later research, namely the advent of genetics, provided support for a mechanism that could account for transmission across generations without involving the passing on of acquired traits. I think the lesson that we were supposed to learn from this juxtaposition was how important the specification of mechanisms is: undoubtedly, this is an important lesson for psychological scientists. What I personally found cool about Darwin's evolution by natural selection was its reliance on deductive reasoning: if there is variability between individuals of the same species, and this variability allows some individuals to survive with a higher probability than others, and these individual differences are passed on across generations, then those with the more successful variant will survive, leading to survival of the fittest and evolution by natural selection. For psychologists, achieving theorising based on deductive reasoning may be as utopian as achieving the measurement accuracy of physicists, who apparently throw out a tool if its test-retest reliability is less than 0.99. But it's nice to dream.

Speaking of physicists, the third theory is cold fusion. I learned about it only when I met my husband, who is a physicist working on hot fusion. With hot fusion, a nuclear reaction happens when matter is heated up to hundreds of millions of degrees. Cold fusion was supposed to work at room temperature. The first mess underlying this theory was on the empirical level: after the initial study claiming to demonstrate fusion at room temperature was published, other labs failed to replicate it. The original study was apparently difficult to replicate in the first place because the experiment was not well described, and when other labs did manage to repeat it, they were not able to find the excess heat that was supposed to be a product of the reaction. There is even talk of fraud in the original experiment. So, cold fusion quickly went out of fashion, and is not taken seriously by the overwhelming majority of physicists.

The celebratory conclusion of this whole fiasco is that replication studies can identify false positives: empirical phenomena that are just not there.  The focus of this blogpost, however, is on theories, not on empirical replicability. So, what can we learn from cold fusion about theory building? Well, apparently there wasn't that much theory behind it in the first place. So yes, the implication is: Even physics has a story about how researchers published a sexy, unbelievable finding based on a wishy-washy theory, and led a whole research community down a rabbit hole trying to reproduce their results -- including a relatively recent replication failure published in Nature: https://www.nature.com/articles/s41586-019-1256-6. 

What is the overall conclusion? Admittedly, I don't think we learn much from these three case studies that we didn't already know. These may be important insights, but they are already part of the mainstream discussions -- otherwise, I probably wouldn't know about them in the first place. In theory building, as with most other complex things, it's easier to do things wrong than to do things right. This is because, for a theory-building process to be right, all of the underlying steps have to be right. If one step is wrong, the theory is wrong. And there are many more ways to do things wrong than to do things right. Probabilistically, the odds are against us.
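
To make the probabilistic point concrete, here is a minimal sketch with made-up numbers, purely for illustration (the number of steps and the per-step success rate are assumptions, not estimates): if theory building involves, say, ten steps, and each step is done correctly 90% of the time, the whole chain comes out right only about a third of the time.

```python
# Illustrative only: the number of steps and the per-step probability are made up,
# and the steps are assumed to be independent.
n_steps = 10            # hypothetical number of steps in a theory-building process
p_step_correct = 0.90   # hypothetical probability that any single step is done right

# The theory is right only if every single step is right.
p_theory_right = p_step_correct ** n_steps
print(f"Probability that the whole chain is right: {p_theory_right:.2f}")  # ~0.35
```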

And yet, it's probably worth considering exactly what has gone wrong when we have come up with wrong theories in the past. By eliminating possible mistakes, we can increase the ratio of right to wrong theory-building processes. So, I'm looking forward to extending my collection of wrong theories! Please feel free to post any leads in the comments!

Monday, July 21, 2025

Redefining Reproducibility

A lot of good things start with “r”: Replicability, reproducibility, robustness, reading, running, relaxing, and if you include German words as well, then also travelling (Reisen) and cycling (Radfahren). Some of these r-words are more controversial than others. “Replicability” and “reproducibility” often occur together with the word “crisis”, suggesting a negative connotation. On a more basic level, some r-words are better defined than others. While “running” is relatively well defined, everyone seems to insist on defining replicability and reproducibility in different ways. Perhaps this is one of the reasons why there is little consensus about whether or not there is a crisis, and arguably little progress in resolving a potential one.

In this blogpost, I aim to take a step back and ask: What are replicability and reproducibility? Can we come up with a definition that is generalisable across fields and research steps? Perhaps, by re-thinking the terminology, we can get a tiny bit closer to narrowing down the issues and the reasons why these r-words are important for science.

 

How are the words “reproducibility” and “replicability” used? In my bubble, the most common way to use these words is as defined by The Turing Way (The Turing Way Community, 2022): Replicability refers to studies which apply the same experimental methods as an already published article, collect new data, and assess whether the results are approximately in line with those in the published article. Reproducibility refers to analysing already existing data with identical methods. The implication is that reproducibility studies should obtain results identical to those of the original study, while we expect the results of replicability studies to vary due to sampling error.

 

There are two issues with the use of this terminology. The first is a lack of consensus across and even within fields. Famously, the study that arguably started the whole replication crisis debate referred to “reproducibility”, even though it was empirically estimating replicability. In fact, the title of the article was “Estimating the reproducibility of psychological science” (Open Science Collaboration, 2015). This is inconsistent with the definitions that were later proposed by The Turing Way. To complicate matters, other fields use different terminology. For example, in computational neuroscience, McDougal, Bulanova, and Lytton (2016) defined a replicable simulation as one that “can be repeated exactly (e.g., by rerunning the source code on the same computer)”, and a reproducible simulation as one that “can be independently reconstructed based on a description of the model” (p. 2). They further specify that “a replication should give precisely identical results, while a reproduction will give results which are similar but often not identical.” This is in contrast to the use of these terms in psychological science, where a replication suggests that the results should be approximately similar between an original study and a replication, and a reproduction should yield identical results.

 

This brings us to the second issue, which I will pose as an open question. Are there any useful features that can be used to distinguish between reproducibility and replicability on a more general level? For example, using The Turing Way definition, reproducibility implies an exact repetition of a previous study’s processes. In a replicability study, some puzzle pieces are missing, which the replicator needs to re-create; classically, this would be the collection of data using existing experimental methods and materials. However, this feature of exactness versus approximateness is neither clear-cut nor generalisable across fields or research processes. For example, even in a reproducibility study, important information is often missing, and the reproducer needs to fill in the gaps, thus deviating from the concept of exactness (Seibold et al., 2021).

 

The Turing Way definition also applies neatly to the process of data analysis, as data analysis is the focus of that community. However, how do we distinguish between reproducibility and replicability if we want to describe fields that rely less on collecting new data? For example, what do replicability and reproducibility mean when one considers a systematic review? We can probably all agree that, when someone does a systematic review, they should transparently document their decision steps and search procedure. But how do we map the concepts of reproducibility and replicability onto this research process? Is a reproduction possible or useful, given that, over time, newly published studies may need to be included in the output of the systematic search?

 

To resolve these issues, it may be worth re-thinking how the word “replicability” is used. While the focus of The Turing Way – and possibly that of much of the scientific community – is on the level of the data analysis, we could consider shifting the focus from the data analysis to the process that a replicability study really wants to reproduce: the data collection process. This gives us a narrower definition: a replicability study is one that aims to reproduce the data collection process. In this case, we are using the word “reproducibility” to define “replicability”. Does this lead us down a rabbit hole? Or, vice versa, does it help us bring some clarity into what we actually mean when we use various r-words?

 

This may be one of those rare occasions when we can improve things by subtraction rather than addition (Winter, Fischer, Scheepers, & Myachykov, 2023). What if we remove the word “replication” from our vocabulary? This would leave us with “reproducibility”. If we want to refer to what we now call “replicability”, we would simply specify: “reproducibility of the data collection process”. And if we want to talk about what we now call “reproducibility”, we would say: “reproducibility of the data analysis process”, or, if we want to be even more specific, “reproducibility of the analysis script” or “reproducibility of the reported results”.

 

There would be some advantages to such a shift. First, my impression is that explaining the difference between reproducibility and replicability, say, in my Open Science workshops, is more complicated than it should be. The proposed change in terminology would simplify things. Second, we’d create a more general terminology that could be used across all fields in science and research. This should allow for more fruitful discussion across fields, allowing us to learn from each other’s mistakes and solutions. By using additional qualifiers and referring to the research steps that we have in mind when we talk about reproducibility, we wouldn’t lose any clarity or specificity. Third, we may shift the focus of the replication crisis debate away from the single step of data analysis and consider other research processes where reproducibility may be equally important.

 

Important for what, you may ask? A more generic definition calls for a more generic answer to the question of what it is good for. Reproducibility exists on two levels: first, the researchers doing the original work should work in such a way that they document all relevant information; second, reproducers ought to verify the original work. The obvious purpose of this is error detection. As much as everyone dislikes the idea of other people finding errors in one’s work, we can probably still agree that we don’t want to build on a research topic where the main finding reflects a banal coding error. The less obvious purpose is to identify alternative paths: for example, it may be clear that Researcher A inferred Y from X; Researcher B may question the validity of this inference and propose and test an alternative explanation. A further point, perhaps less obvious to more experienced researchers, is the value of working reproducibly at all stages of the research process, so that others can learn from one’s work.
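
To make the error-detection idea concrete, here is a minimal sketch of what “reproducibility of the reported results” could look like in practice. Everything in it is hypothetical: the data file, the reported value, and the analysis (a simple mean) are made up purely for illustration, not taken from any real study.

```python
import numpy as np

# Hypothetical reproducibility check: recompute a reported result from shared data.
reported_mean = 2.34  # value copied from the (hypothetical) published article

# Re-run the (hypothetical) original analysis on the shared data file.
data = np.loadtxt("original_data.csv", delimiter=",")
recomputed_mean = float(np.mean(data))

# A small tolerance allows for rounding in the reported value.
if abs(recomputed_mean - reported_mean) < 0.005:
    print("Reported result reproduced.")
else:
    print(f"Mismatch: recomputed {recomputed_mean:.3f} vs. reported {reported_mean:.2f}")
```

In practice, of course, such a check often fails not because of an error but because of missing information about which analysis was actually run, which is exactly the documentation problem described above.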

 

In summary, reproducibility is a good thing; terminological messes are not. The distinction between reproducibility and replicability may make matters overly complicated, and simplifying things by referring to “reproducibility” plus a specification of the research process in question may be a step forward.

 

References

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.

The Turing Way Community. (2022). The Turing Way: A handbook for reproducible, ethical and collaborative research (Version 1.0.2). Zenodo.

McDougal, R. A., Bulanova, A. S., & Lytton, W. W. (2016). Reproducibility in computational neuroscience models and simulations. IEEE Transactions on Biomedical Engineering, 63(10), 2021-2035.

Seibold, H., Czerny, S., Decke, S., Dieterle, R., Eder, T., Fohr, S., . . . Kopper, P. (2021). A computational reproducibility study of PLOS ONE articles featuring longitudinal data analyses. PLOS ONE, 16(6), e0251194.

Winter, B., Fischer, M. H., Scheepers, C., & Myachykov, A. (2023). More is better: English language statistics are biased toward addition. Cognitive Science, 47(4), e13254.

Wednesday, June 4, 2025

On ghost writing in academia

 

LinkedIn suggested a job as a perfect match for my skills: being a ghost writer. Curious, I clicked on the job description, thinking: “This can’t possibly be what it sounds like, right?” It was. According to the website, the ghost writing agency is based in Berlin. It features anonymised reviews by happy university students who’d received a high grade on an assignment or thesis thanks to work written by the ghost writers and submitted in the students’ names, as well as non-anonymised profiles of the ghost writers, along with their photos and their average customer ratings. Three thoughts came to my mind, I don’t remember in which order: “How is this legal?”, “Well, I do have the skills for that job!”, and “Can we expect students not to rely on the services of ghost writers if many academics do the same?”

 

This blog post is about the third thought. In the form of paper mills, ghost writing may be more common than you’d think, but it’s a topic I don’t know much about, and other people have written about it before (e.g., Anna Abalkina & Dorothy Bishop, https://osf.io/preprints/psyarxiv/2yf8z_v1, https://osf.io/preprints/psyarxiv/6mbgv_v1). Relying on the services of paper mills is clearly beyond any grey area, but there are other practices that I’d personally call ghost writing and that seem to be perfectly acceptable in some circles in academia.

 

The specific case I’m thinking of is grant writing. It’s not uncommon for a grant to be submitted in a professor’s name but written by PhD students or postdocs. Few cases are black or white, as early career researchers contribute to varying numbers of sections to varying extents, ranging from brainstorming specific ideas to actually writing the whole proposal. As far as I know, there is no consensus about what is actually acceptable. Of course, there are many advantages to involving early career researchers in the grant writing process. They learn a lot: both about the scientific processes involved in the steps between generating an idea and having a plan, and about the arguably less pleasant sides of the academic profession associated with the pressure of grant writing. Furthermore, the idea is often that, if the grant is successful, the student or postdoc will be hired to work on it, so it’s nice to give them the opportunity to contribute their own ideas.

 

In my own grant proposals, I have received very valuable input from my PhD students. Though I’m sure they would have been able, and genuinely happy, to contribute much more than I allowed or asked them to, I always had the idea in the back of my mind that I didn’t want to submit their work as my own. Depending on the grant, it’s sometimes possible to add PhD students as contributors, but at other times there are formal limitations, such as all (co-)applicants needing to have a PhD, or only one applicant being allowed (e.g., as is the case for some ERC grants). My own policy is to limit colleagues’ contributions to providing comments and discussions when I can’t add them as co-applicants, so I don’t end up submitting sentences or even whole sections as my own intellectual work when I haven’t actually written them. However, I’m sure that’s not the only defensible thing to do. Thus, I don’t propose that what I do is what everyone should do – rather, maybe we should start rethinking the whole process of determining intellectual contributions, acknowledging people’s work, and awarding research funding?

 

Regardless of where we draw the line: is grant ghost writing really that bad? Maybe it’s just common knowledge that a grant submitted by a professor was likely not actually written by the professor. I honestly don’t know how common this is – only that it’s common enough that, at a webinar about how to write academic CVs, there was a whole discussion about how to take credit for successful grant proposals that you wrote but are not listed on. This brings us to problem number one: early career researchers don’t get formal credit for their work, even though they need it most. The second problem that I see concerns the quality of the project: one would think (hope?) that the professor is more experienced and thus better able to write a high-quality proposal. By getting early career researchers too involved, the professor thus diminishes the chances of success. In my view, this puts the professor in a bad light: by offloading their work, they are decreasing the chance of getting the grant, and with it the chance of supporting the early career researchers on their team by extending their contracts.

 

In a way, I think that this problem with grant ghost writing is yet another manifestation of the academic system not changing as fast as the world is. There are two changes that the grant writing system doesn’t seem to consider. First, the change from the lone-genius idea to team science. Although it’s nice to have grants that specifically promote a promising researcher, it’s utopian, in most fields, to assume that a large project can be conceived, let alone executed, without intellectual contributions from others. Second, the change from getting a job via the good old boys’ club to fierce competition. One of the arguments for involving early career researchers in the grant writing process is that they would have a job if the proposal were successful. Maybe in the past, appeasing your professor by helping them write the grant proposal would raise your chances of getting a job. Maybe the chances of getting the grant were higher back then. And if the grant proposal was not successful, the professor was probably in a better position to get you a job anyway – either through his own funding, or by calling his old buddy and asking whether they had a suitable position in their lab. Maybe this still happens more often than we think. Still, from anecdotal observations, it is now much more important for an early career researcher to stand on their own two feet, both subjectively and objectively. Subjectively, one doesn’t want to be known as “Professor such-and-such’s PhD student”, even years after graduation. Objectively, one doesn’t want to – and simply cannot – rely on a single person’s good will to get a job. Although connections, of course, help in getting a job, the competition is with colleagues who have worked on their own projects, gained funding as principal investigators, and published first-author papers on topics that are not spin-offs of their professor’s interests.

 

Is grant ghost writing morally better than students submitting ghost-written assignments and theses? “Real” academia, as opposed to the bachelor thesis of a student without academic ambitions, is a joint venture, where it may simply be understood that a single piece of writing is not the intellectual property of whoever wrote it, but of everyone who contributed, directly and indirectly. Still, it’s not easy to pinpoint exactly what makes grant ghost writing better than ghost writing by students in university assignments. In both cases, the person in whose name the work is submitted gets an advantage – either a good grade or a degree that they don’t deserve, or funding and a stronger CV. It’s difficult to say if more is at stake in the former or the latter case. And what about the ghost writers? Well, I could do some further research to see if postdocs and PhD students get a better salary, on average, than the ghost writers at the Berlin-based company. But somehow that feels like it would be beside the point…