Wednesday, April 27, 2016

The power is in collaboration: Developing international networks to increase the reproducibility of science

This is an essay that I wrote for the Winnower's essay contest: "How do we ensure that research is reproducible?" 

The field of psychological science is in turmoil. With failures to replicate well-established effects, evidence for a skewed picture of science in the published literature, and media hype about the replication crisis – what is left for us to believe in these days?

Luckily, researchers have done what they do best – research – to try to establish the causes of, and possible solutions to, this replication crisis. A coherent picture has emerged. Three key factors seem to have led to the replication crisis: (1) underpowered studies, (2) publication bias, and (3) questionable research practices. Studies in psychology often test a small number of participants. As effects tend to be small and measures noisy, large samples are required to reliably detect an effect. An underpowered study, trying to find a small effect with a small sample, runs a high probability of not finding the effect, even if it is real (Button et al., 2013; Cohen, 1962; Gelman & Weakliem, 2009).
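To get a feel for how quickly power drops off, the relationship between effect size, sample size, and power can be sketched with a normal approximation to the two-sample t-test. The numbers below are illustrative, not taken from any particular study:

```python
from math import sqrt, erf

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def approx_power(d, n_per_group, z_crit=1.96):
    """Normal approximation to the power of a two-sample t-test
    (two-sided alpha = .05) for a true standardized effect size d.
    The negligible lower rejection tail is ignored."""
    noncentrality = d * sqrt(n_per_group / 2.0)
    return norm_cdf(noncentrality - z_crit)

# A "small" effect (d = 0.3) with 20 participants per group:
print(round(approx_power(0.3, 20), 2))   # → 0.16: worse than a coin toss
# The same effect with 200 participants per group:
print(round(approx_power(0.3, 200), 2))  # → 0.85: a well-powered design
```

With a typical small effect and a typical small sample, the chance of detecting a real effect is well below 50%; an order-of-magnitude larger sample brings power into the conventionally acceptable range.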

By itself, this would not be a problem, because a series of underpowered studies can be, in principle, combined in a meta-analysis to provide a more precise effect size estimate. However, there is also publication bias, as journals tend to prefer publishing articles which show positive results. Authors often do not even bother trying to submit papers with non-significant results, leading to a file-drawer problem (Rosenthal, 1979). As the majority of research papers are underpowered, the studies that do show a significant effect capture the outliers of a normal distribution around a true effect size (Ioannidis, 2005; Schmidt, 1992, 1996). This creates a biased literature: even if an effect is small or non-existent, a number of published studies can provide apparently consistent evidence for a large effect size.

The problems of low power and publication bias are further exacerbated by questionable research practices, where researchers – often unaware that they are doing something wrong – use little tricks to get their effects above a significance threshold, such as removing outliers until the threshold is reached, or including post-hoc covariates (John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011).

A lot of research and discussion already exists on how to fix publication bias and questionable research practices – fixes which mostly require a top-down change of the incentive structure. Here, I focus on the issue of underpowered studies, as this can be addressed by individual researchers. Increasing power is in everyone’s best interests: it strengthens science, and it gives the researcher a better chance to provide a meaningful answer to their question of interest.

On the surface, the solution to the problem of underpowered studies is very simple: we just have to run bigger studies. The simplicity is probably why this issue is not discussed very much. However, the solution is only simple if you have the resources to increase your sample sizes. Running participants takes time and money. Therefore, this simple solution poses another problem: the possibility of creating a Matthew effect, where the rich get richer by producing large quantities of high-quality research, while researchers with fewer resources can produce either very few good studies, or numerous underpowered experiments for which they will get little recognition*.

On the surface, the key to avoiding the Matthew effect is also simple: if the rich collaborate with the poor, even researchers with few resources can produce high-powered studies. However, in practice, there are few perceived incentives for the rich to reach out to the poor. There are also practical obstacles for the poor in approaching the rich. These issues can be addressed, and it takes very little effort from an average researcher to do so. Below, I describe why it is important to promote collaborations in order to improve replicability in social sciences, and how this could be achieved.

In order to ensure the feasibility of creating a large-scale collaboration network, it would be necessary to make the incentives for reaching out to the poor more visible. Collecting data for someone with fewer resources may seem like charity. However, I argue that it is a win-win situation. Receivers are likely to reciprocate. If they cannot collect a large amount of data for you, perhaps they can help you in other ways. For example, they could provide advice on a project with which you got stuck and which you had abandoned years ago; they could score the data that you never got around to looking at; or simply discuss new ideas, which could give you fresh insight into your topic. If they collect even a small amount of data, this could improve a dataset. In the case of international collaborations, you would be able to recruit a culturally diverse sample. This would ensure that our view of psychological processes generalises beyond a specific population (Henrich, Heine, & Norenzayan, 2010).

There are numerous ways in which researchers can reach out to each other. Perhaps one could create an online platform for this purpose, where anyone could post an entry for their study at any stage: it could be just an idea, or a quasi-finished project which needs only some additional analyses or tweaks before publication.

Anyone could browse a list of proposed projects by topic, and contact the author if they find something interesting. The two researchers could then discuss further arrangements: whether, together, they can execute the project; whether the input of the latter will be sufficient for co-authorship; or whether the former will be able to reciprocate by helping out with another project.

Similarly, if someone is conducting a large-scale study, and if they have time to spare in the experimental sessions, they could announce this in a complementary forum. They would provide a brief description of their participants, and offer to attach another task or two for anyone interested in studying this population.

To reach a wider audience, we could rely on social media. Perhaps a hashtag could be used on Twitter. Perhaps #LOOC (“LOOking for Collaborator”)**? One could tweet: “Testing 100 children, 6-10 yo. Could include another task up to 15 minutes. #LOOC”. Or: “Need more participants for a study on statistical learning and dyslexia. #LOOC”, and attach a screenshot or link with more information.

In summary, increasing sample sizes would break one of the three pillars of the replication crisis: large studies are more informative than underpowered studies, as they lead to less noisy and more precise effect size estimates. This can be achieved through collaboration, though only if researchers with resources are prepared to take on some additional work by offering to help others out. While this may be perceived as a sacrifice, in the long run it should be beneficial for all parties. It will make it easier to diversify one’s sample, and it will help researchers who study small, specific populations (e.g., a rare disorder) to collaborate with others to recruit enough participants to draw meaningful conclusions. It will provide a possibility to connect with researchers from all over the world with similar interests and possibly complementary expertise. And in addition, it will lead to an average increase in sample sizes, and to reported effects which can be replicated across labs.

Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafo, M. R. (2013). Confidence and precision increase with high statistical power. Nature Reviews Neuroscience, 14(8). doi:10.1038/nrn3475-c4
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65(3), 145-153.
Gelman, A., & Weakliem, D. (2009). Of beauty, sex and power: Too little attention has been paid to the statistical challenges in estimating small effects. American Scientist, 97(4), 310-316.
Henrich, J., Heine, S. J., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33(2-3), 61-83.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), 696-701. doi:10.1371/journal.pmed.0020124
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524-532.
Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86(3), 638-641.
Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47(10), 1173.
Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1(2), 115-129. doi:10.1037//1082-989x.1.2.115
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366.

* One may or may not consider this a problem – after all, the replication crisis would be solved.

** Urban dictionary tells me that “looc” means “Lame. Stupid. Wack. The opposite of cool. (Pronounced the same as Luke.)”

Monday, April 25, 2016

When success in academia comes at a high price

Last month, I published a blogpost on “Crossroads of an early career researcher”, lamenting that for an early career researcher, taking steps to be a successful academic often runs counter to the principles of being a good scientist (inspired by a table featured in this blogpost by Gerald Carter). My blogpost has gotten almost 2,000 views (4x more than my second-most-read post), suggesting that the idea resonates with many colleagues. The Times Higher Education even reposted it, although they changed the title to the slightly more provocative “Good scientist or successful academic? You can’t be both!”

The title did not elicit 100% positive reactions, as some people were opposed to the idea that it’s impossible to be good and successful at the same time. I agree – as I stressed in the blog post, I have been very fortunate to work with senior colleagues who are both successful academics and good scientists. Others pointed out that being a good versus a successful scientist is not a straightforward dichotomy: you have to be a bit of both; or, you have to spend your early years becoming a successful academic, and once your position is secured, you can focus on producing good science. While I understand the reasoning behind these suggestions, I disagree. In my view, taking steps to being a successful academic is fully justified – as long as they in no way compromise the quality of science. The path from a wide-eyed scientist, whose only ambition is to make the world a better place with their research, to a ruthless academic who publishes anything they get their hands on while being fully aware that it’s probably BS, may be a slippery slope.

Let’s imagine a post-doc who is pressured to publish one p-hacked study after another, because their contract is extended only if they have a certain number of publications. On a human level, one can sympathise with this post-doc. Finally, this post-doc gets a lecturing position. But their woes are not over: they need to publish to be promoted, to stop the university administration from forcing them into a part-time position, or from changing their position to a teaching one while expecting them to produce the same number of papers. These are all real scenarios, from real universities. Even for a full professor, there is pressure to continue receiving grants and produce results that make the university proud (quantity-wise, not necessarily quality-wise, that is). Thus, there is no point at which one can stop focussing on being a successful academic, and start doing good science*.

The desire to be a good scientist and the desire to be a successful academic are probably independent of each other: one can be high on both, low on both (in which case one would probably not stay in academia for very long), or high on one and low on the other. An academic who is high on the desire to be a successful academic, but doesn’t really give a sh*t about science (hereafter referred to as the Pimp) is both dangerous and unpleasant, and (as a symbolic figure) probably one of the main reasons for the replicability crisis we’ve found ourselves in. If the current system rewards academics such as the Pimp, it is clear that incentives need to be changed. While – again – there is nothing wrong with wanting to be successful, the Pimp, by definition, engages in practices that are harmful to science.

The Pimp, of course, is a hypothetical creature: most academics have a mixture of the desire to become a good scientist and the desire to be successful. I have never worked with a Pimp, but I have heard horror stories from many different people – at work, at home, at Friday night drinks, while travelling. Horror stories are told by the Pimp’s subordinates, who get terrifying insights into how the Pimp’s lab works – often resulting in their leaving academia (especially if they’ve never worked in a lab with better values and working conditions). What I know about Pimps is therefore based on their interactions with students and post-docs. Outside of their supervisory roles – who knows? – they may be perfectly lovely people.

There is a considerable amount of overlap in the horror stories that I’ve been told: across disciplines, across countries, genders, and languages. This makes me fear that Pimp-like creatures may not be that uncommon. Below, I put my mental image of a Pimp, a caricature, to paper. Again, this is based on stories I have heard from numerous people, so it does *not* describe a single person.

The Pimp
The Pimp fully relies on their inferiors to produce publishable data. Their perceived role as a supervisor is to summon their students or post-docs regularly into their office, and tell them to publish more. They don’t care what the students do with their data, as long as it’s publishable. A student or post-doc is more likely to be in trouble with the Pimp for obtaining non-significant results than for fabricating data. In isolated cases, the Pimp takes credit for their subordinates’ work, presenting it at conferences or even in publications, while conveniently forgetting to mention that the experiment was designed, executed, analysed, and written up by somebody else.

With minimal supervision, students often feel like they have been thrown in at the deep end, expected to do things without knowing what, or how. For example, this means that they are not told that practices such as post-hoc data trimming and using optional stopping rules to obtain a p < 0.05 are bad.

But isn’t academia all about overcoming adversity, and learning to work independently? If a student stays afloat, clearly they will go far (and the Pimp can squeeze publications out of them). Someone who drowns – well, they don’t really have what it takes, anyway (and they’re a bad investment for the Pimp).

Pimps often work hard (not smart), and expect their students to do the same. What, you’d rather spend the Christmas week with your spouse and children than in the lab? Clearly, you’re not a real scientist – I will bear this in mind for the next grant application. If you really love science, you should be prepared to sacrifice your hobbies, your family, even your mental and physical health. Or, as a Pimp repeatedly told a PhD student: You need to get out of your comfort zone.

A popular trick of the Pimp is to hire international students and post-docs. With a vague promise that they may receive something more permanent at some stage, and with different cultural expectations about what constitutes hard work, international subordinates will not complain about working 12 hours a day and 7 days a week. At least, not to authority. 


* Although one could possibly make a career by publishing a lot of BS throughout one’s early years, and spending the rest of one’s life refuting one’s own work.

Tuesday, April 5, 2016

How big can a dyslexic deficit be?

TL;DR: Thinking about effect size can help to plan experiments and evaluate existing studies. Bigger is not always better!

In grant applications, requests for ethics approval, article introductions, and pre-registered reports, the researcher is often expected to provide an estimate of the expected effect size. This should then be used to determine the sample size the researcher aims to recruit. At first glance, it may seem counter-intuitive that one has to specify an effect size before collecting the data – after all, the aim of (most) experiments is to establish what this effect size could be, and/or whether it is different from zero. At second glance, specifying an expected effect size may be one of the most important issues to consider when planning an experiment. Underpowered studies reduce the researcher’s chance of finding an effect, even if the effect exists in the population – running an experiment with 30% power gives one a worse chance of finding the effect than tossing a coin to determine the experiment’s outcome. Moreover, significant effects in underpowered studies overestimate the true effect size, because the z-score cut-off needed to reach significance is larger than the population effect (Gelman & Carlin, 2014; Gelman & Weakliem, 2009; Schmidt, 1992).

So, given that it is important to have an idea of the effect before running the study, what is the best way to come up with a reasonable a priori effect size estimate? One possibility is to run a pilot test; another is to specify the smallest effect size that would be of (theoretical and/or practical) significance; the third would be to consider everything we know about effects that are similar to the one we are interested in, and to use that to guide our expectations. All three have benefits and drawbacks – my aim is not to discuss these. My aim here is to focus on the third possibility: knowing how big the most stable effects in your field are can constrain your expectations – it is unlikely that your “new” effect, a high-hanging fruit, is bigger than the low-hanging fruit that was plucked decades ago. Specifically, I will try to provide an effect size estimate which could be used by researchers seeking to find potential deficits underlying developmental dyslexia, and discuss the limitations of this approach in this particular instance.

Case study: Developmental dyslexia
A large proportion of the literature on developmental dyslexia focuses on finding potentially causal deficits. Such studies generally test dyslexic and control participants on a task which is not directly related to reading, such as visual attention, implicit learning, balance, etc. If dyslexic participants perform significantly worse than control participants, it is suggested that the task might tap an underlying deficit which causes dyslexia, could be used as an early marker of dyslexia, and could be treated to improve reading skills. The abundance of such studies has led to an abundance of theories of developmental dyslexia, which has led to – well – a bit of a mess. I will not go into details here; the interested reader can scan the titles of papers from any reading- or dyslexia-related journal for some examples.

A problem arises now that we know that many published studies report Type I errors (a significant group difference in the sample that is absent in the population; see Ioannidis, 2005; Open Science Collaboration, 2015). Sifting through the vast and often contradictory literature to find out whether there is a sound case for a dyslexic deficit on each of the tasks that have been studied to date would be a mammoth task. Yet, this would be an important step towards an integrative theory of dyslexia. Making sense of the existing literature would not only constrain theory, but also prevent researchers from wasting resources on treatment studies which are unlikely to yield fruitful results.

Both to evaluate the existing literature, and to plan future studies, it would be helpful to have an idea of how big an effect can reasonably be expected. To this end, I decided to calculate an effect size estimate of the phonological awareness deficit. The phonological awareness deficit refers to the consistent finding that participants with dyslexia perform more poorly than controls on tasks involving the manipulation of phonemes and other sublexical spoken units (Melby-Lervåg, Lyster, & Hulme, 2012; Snowling, 2000; for a discussion of potential causality see Castles & Coltheart, 2004).

Studies which are not about phonological awareness often include phonological awareness tasks in their test battery when comparing dyslexic participants to controls. In these instances, phonological awareness is not the variable of interest, therefore there is little reason to suspect that there is any p-hacking (e.g., including covariates, removing outliers, or adding participants until the group difference becomes significant). For this reason, the effect size estimate is based only on studies where phonological awareness was not the critical experimental task. To this end, I looked through all papers that are lying around on my desk on the subject of statistical learning and dyslexia, and on magnocellular/dorsal functioning in dyslexia. I included, in the analysis, all papers which provided a table with group means and standard deviations on tasks of phonological awareness. This resulted in 15 papers, 6 testing children and 9 testing adults, from 5 different languages. Mostly, the phonological awareness measures were spoonerism tasks. A full list of studies with the tasks, participant characteristics, and full references to the papers can be found in the dropbox folder linked below.

I generated a forest plot with the metafor package for R (Viechtbauer, 2010). I used a random-effect model, and the Sidik-Jonkman method (Sidik & Jonkman, 2005; see Edit below). The R analysis script, and the datafile (including full references to all papers that were included) can be found here:
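For readers who want to see the machinery behind such a pooled estimate, a random-effects model can also be sketched by hand. The Python version below uses the standard DerSimonian-Laird estimator (not the Sidik-Jonkman or REML estimators from the actual metafor analysis) with made-up effect sizes, so the numbers are purely illustrative:

```python
from math import sqrt

def dl_random_effects(ds, ns):
    """DerSimonian-Laird random-effects pooling of standardized mean
    differences `ds`, with equal group sizes `ns` (per group, per study).
    Returns the pooled d and a 95% confidence interval."""
    # Within-study variance of d for two groups of size n:
    # (n1+n2)/(n1*n2) + d^2/(2*(n1+n2)), with n1 = n2 = n.
    v = [2.0 / n + d * d / (4.0 * n) for d, n in zip(ds, ns)]
    w = [1.0 / vi for vi in v]                                  # fixed-effect weights
    d_fixed = sum(wi * di for wi, di in zip(w, ds)) / sum(w)
    q = sum(wi * (di - d_fixed) ** 2 for wi, di in zip(w, ds))  # heterogeneity statistic
    denom = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(ds) - 1)) / denom)                # between-study variance
    w_star = [1.0 / (vi + tau2) for vi in v]                    # random-effects weights
    pooled = sum(wi * di for wi, di in zip(w_star, ds)) / sum(w_star)
    se = 1.0 / sqrt(sum(w_star))
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se

# Hypothetical group differences from three studies, 20 participants per group:
pooled, ci_lo, ci_hi = dl_random_effects([0.8, 1.2, 1.8], [20, 20, 20])
print(round(pooled, 2))  # → 1.24
```

The heterogeneity term tau² widens the weights when the studies disagree more than sampling error alone would predict; with homogeneous studies it shrinks to zero and the model reduces to a fixed-effect average.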

The results are shown in the figure below. All studies show a negative effect (i.e., worse performance on phonological awareness tasks for dyslexic compared to control groups). The effect size estimate is d = 1.24, with a relatively narrow confidence interval, 95% CI [0.95, 1.54].

Conclusions and limitations
When comparing dyslexic participants and controls on phonological awareness tasks, one can expect a large effect size of around d = 1.2. A phonological awareness deficit is the most established phenomenon in dyslexia research; given that all other effects (as far as I know) are contentious, and have been replicated by some labs but not by others, it is likely that a group difference on any other task (which is not directly related to reading) will be smaller. For researchers, peer reviewers, and editors, the obtained effect size of a new experiment can be used as a red flag: if a study examines whether or not there is a group difference in, say, chewing speed between dyslexic and control groups*, and obtains an effect size of d > 1, one should raise an eyebrow. This would be a clear indicator that one has obtained a magnitude error (Gelman & Carlin, 2014). Obtaining several effect sizes of this magnitude in a multi-experiment study could even be an indicator of p-hacking or worse (Schimmack, 2012).

There are two limitations to the current blog post. The first is methodological, the second theoretical. First, I deliberately chose to include studies where phonological awareness was not of primary interest. This means that the original authors would not have had an incentive to make this effect look bigger than it actually is. However, there is a potential alternative source of bias: it is possible that (at least some) researchers use phonological awareness as a manipulation check: given that this effect is well-established, a failure to find it in one’s sample could be taken to suggest that the research assistant who collected the data screwed up. This could lead to a file-drawer problem, where researchers discard all datasets in which the group difference in phonological awareness was not significant. It is also plausible that researchers would simply not report having tested this variable if it did not yield a significant difference, as the critical comparison in all papers was on other cognitive tasks. If other studies have obtained non-significant differences in phonological awareness but did not report them, the current analysis presents a hugely inflated effect size estimate.

The second issue relates less to the validity of the effect size, and more to its utility: the effect size estimate is really, really big. A researcher paying attention to effect sizes will be sceptical, anyway, if they see an effect size this big. Thus, in this case, the effect size of the most stable effect in the area cannot be used to constrain our expectation of potential effect sizes of other experiments. To guide the expectation of an effect size for future studies, one might therefore want to turn to alternative approaches. Pilot testing could be useful, but its limitation is that it is often difficult to recruit enough participants to get a meaningful pilot study plus a well-powered full experiment. At this stage, it would also be difficult to define a minimum effect size that would be of interest. This could change, however, if we develop models of reading and dyslexia that make quantitative predictions. (This is unlikely to happen anytime soon, though.) Currently, the most useful approach to determining whether or not there is an effect, given limited resources and no effect size estimate, seems to be optional stopping. This can be done if it is planned in advance; in the frequentist framework, alpha-levels need to be adjusted a priori (Lakens, 2014; for Bayesian approaches see Rouder, 2014; Schönbrodt, 2015).  
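The reason the alpha level must be adjusted in advance is that uncorrected peeking inflates the false-positive rate. A minimal simulation, with hypothetical numbers (a look every 10 participants, up to 50, with a true null effect), illustrates the inflation:

```python
import random
from math import sqrt

random.seed(0)

def peeking_study(n_max=50, look_every=10):
    """One study of a true null effect, tested at every interim look;
    returns True if any uncorrected z-test reaches p < .05."""
    total = 0.0
    for i in range(1, n_max + 1):
        total += random.gauss(0.0, 1.0)   # observations from a null effect
        if i % look_every == 0:
            z = abs(total / sqrt(i))      # z-test on the mean, known sd = 1
            if z > 1.96:
                return True               # stop early and "find" an effect
    return False

n_sims = 4000
false_positive_rate = sum(peeking_study() for _ in range(n_sims)) / n_sims
print(round(false_positive_rate, 2))  # well above the nominal .05
```

With five uncorrected looks, the long-run false-positive rate roughly triples; sequential methods such as those of Lakens (2014) keep it at the nominal level by spending a stricter alpha at each look.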

Despite the limitations of the current analysis, I hope the readers of this blog post will be encouraged to consider issues in estimating effect sizes, and will find some useful references below.  

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
Gelman, A., & Carlin, J. (2014). Beyond Power Calculations Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychological Science, 9(6), 641-651.
Gelman, A., & Weakliem, D. (2009). Of beauty, sex and power: Too little attention has been paid to the statistical challenges in estimating small effects. American Scientist, 97(4), 310-316.
Hartung, J., & Makambi, K. H. (2003). Reducing the number of unjustified significant results in meta-analysis. Communications in Statistics-Simulation and Computation, 32(4), 1179-1190.
IntHout, J., Ioannidis, J. P., & Borm, G. F. (2014). The Hartung-Knapp-Sidik-Jonkman method for random effects meta-analysis is straightforward and considerably outperforms the standard DerSimonian-Laird method. BMC Medical Research Methodology, 14(1), 1.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), 696-701. doi:10.1371/journal.pmed.0020124
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44(7), 701-710.
Melby-Lervåg, M., Lyster, S.-A. H., & Hulme, C. (2012). Phonological skills and their role in learning to read: a meta-analytic review. Psychological Bulletin, 138(2), 322.
Rouder, J. N. (2014). Optional stopping: No problem for Bayesians. Psychonomic Bulletin & Review, 21(2), 301-308.
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551.
Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47(10), 1173.
Schönbrodt, F. D. (2015). Sequential Hypothesis Testing with Bayes Factors: Efficiently Testing Mean Differences. Available at SSRN 2604513.
Sidik, K., & Jonkman, J. N. (2005). Simple heterogeneity variance estimation for meta-analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(2), 367-384.
Snowling, M. J. (2000). Dyslexia. Malden, MA: Blackwell Publishers.
Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1-48.


* It’s not easy coming up with a counterintuitive-sounding “dyslexic deficit” which hasn’t already been proposed!

Edit 5.4.16 #1: Added a new link to data (to OSF rather than dropbox).
Edit 5.4.16 #2: I used the "SJ" method for ES estimation, which I thought referred to the one recommended by IntHout et al., referenced above. Apparently, it doesn't. Using, instead, the REML method slightly reduces the effect size, to d = 1.2, 95% CI [1.0, 1.4]. The estimate is similarly large to the one described in the text, therefore it does not change the conclusions of the analyses. Thanks to Robert Ross for pointing this out!