Wednesday, December 20, 2017

Does action video gaming help against dyslexia?

TL;DR: Probably not.

Imagine there is a way to improve reading ability in children with dyslexia, which is fun and efficient. For parents of children with dyslexia this would be great: No more dragging your child to therapists, spending endless hours in the evening trying to get the child to practice their letter-sound rules or forcing them to sit down with a book. According to several recent papers, a fun and quick treatment to improve reading ability might be in sight, and every parent can apply this treatment in their own home: Action video gaming.

Action video games differ from other types of games, because they involve situations where the player has to quickly shift their attention from one visual stimulus to another. First-person shooter games are a good example: one might focus on one part of the screen, and then an “enemy” appears and one needs to direct the visual attention to him and shoot him1.

The idea that action video gaming could improve reading ability is not as random as might seem at first sight. Indeed, there is a large body of work, albeit very controversial, that suggests that children or adults with dyslexia might have problems with shifting visual attention. The idea that a visual deficit might underlie dyslexia originates from the early 1980s (Badcock et al., Galaburda et al.; references are in the articles linked below), thus it is not in any way novel or revolutionary. A summary of this work would warrant a separate blog post or academic publication, but for some (favourable) reviews, see Vidyasagar, T. R., & Pammer, K. (2010). Dyslexia: a deficit in visuo-spatial attention, not in phonological processing. Trends in Cognitive Sciences, 14(2), 57-63 (downloadable here) or Stein, J., & Walsh, V. (1997). To see but not to read; the magnocellular theory of dyslexia. Trends in neurosciences, 20(4), 147-152 (downloadable here), or (for a more agnostic review) Boden, C., & Giaschi, D. (2007). M-stream deficits and reading-related visual processes in developmental dyslexia. Psychological Bulletin, 133(2), 346 (downloadable here). It is worth noting that there is little consensus, amongst the proponents of this broad class of visual-attentional deficit theories, about the exact cognitive processes that are impaired and how they would lead to problems with reading.

The way research should proceed is clear: If there is a theoretical groundwork, based on experimental studies, to suggest that a certain type of treatment might work, one does a randomised controlled trial (RCT): A group of patients are randomly divided into two groups, one is subjected to the treatment in question, and the other to a control treatment, and we compare the improvement between pre- and post-measurement in the two groups. To date, there are three such studies:

Franceschini, S., Gori, S., Ruffino, M., Viola, S., Molteni, M., & Facoetti, A. (2013). Action video games make dyslexic children read better. Current Biology, 23(6), 462-466 (here)

Franceschini, S., Trevisan, P., Ronconi, L., Bertoni, S., Colmar, S., Double, K., ... & Gori, S. (2017). Action video games improve reading abilities and visual-to-auditory attentional shifting in English-speaking children with dyslexia. Scientific Reports, 7(1), 5863 (here), and

Gori, S., Seitz, A. R., Ronconi, L., Franceschini, S., & Facoetti, A. (2016). Multiple causal links between magnocellular–dorsal pathway deficit and developmental dyslexia. Cerebral Cortex, 26(11), 4356-4369 (here).

In writing the current critique, I am assuming no issues with the papers at stake, or with the research skills or integrity of the researchers. Rather, I would like to show that, under the above assumptions, the three studies may provide a highly misleading picture of the effect of video gaming on reading ability. The implications are clear and very important: Parents of children with dyslexia have access to many different sources of information, some of which provide only snake-oil treatments. From a quick google search for “How to cure dyslexia”, the first five links suggest modelling letters out of clay, early assessment, multi-sensory instructions, more clay sculptures, and teaching phonemic awareness. As reading researchers, we should not add to the confusion or divert resources from treatments that have actually been shown to work, by adding yet another “cure” to the list.

So, what is my gripe with these three papers? First, that there are only three such papers. As I mentioned above, the idea that there is a deficit in visual-attentional processing amongst people with dyslexia, and that this might be a cause of their poor reading ability, has been floating around for over 30 years. We know that the best way to establish causality is through a treatment study (RCT): We have known this for well over thirty years2. So, why didn’t more people conduct and publish RCTs on this topic?

The Mystery of Missing Data
Here is a hypothesis which, admittedly, is difficult to test: RCTs have been conducted for 30 years, but only three of them ever got published. This is a well-known phenomenon in scientific publishing: in general, studies which report positive findings are easier to publish. Studies which do not find a significant result tend to get stored in file-drawer archives. This is called the File-Drawer Problem, and has been discussed as early as 1979 (Rosenthal, R. (1979). The "File Drawer Problem" and Tolerance for Null Results. Psychological Bulletin, 86(3), 638-641, here). 

The reason this is a problem goes back to the very definition of the statistical test we generally use to establish significance: The p-value. p-values are considered “significant” if they are below 0.05, i.e., below 5%. The p-value is defined as the probability of obtaining the data or more extreme observations, under the assumption that the null hypothesis is true. The key is the second part. By rephrasing the definition, we get the following: When the effect is not there, the p-value tells us that it is there 5% of the time. This is a feature, not a bug, as it does exactly what the p-value was designed to do: It gives us a long-run error rate and allows us to keep it constant at 5% across a set of studies. But this desired property becomes invalidated in a world where we only publish positive results. In a scenario where the effect is not there, 5 in 100 studies will give us a significant p-value, on average. If only the five significant studies are published, we have a 100% rate of false positives (significant p-values in the absence of a true effect) in the literature. If we assume that the action video gaming effect is not there, then we would expect, on average, three false positives out of 60 studies3. Is it possible that in 30 years, there is an accumulation of studies which trained dyslexic children’s visual-attentional skills and observed no improvement?
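The long-run error rate can be illustrated with a small simulation. This is a minimal sketch in Python, under simplifying assumptions that are not taken from the published studies: each study has 16 participants, scores are normally distributed with a known standard deviation of 1, and significance is assessed with a two-sided z-test. The function name and the seed are illustrative.

```python
import random
from statistics import NormalDist, mean


def study_is_significant(rng, n=16, true_effect=0.0, alpha=0.05):
    """Simulate one study and test its mean against zero (two-sided z-test, SD = 1)."""
    scores = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    z = mean(scores) * n ** 0.5  # the standard error of the mean is 1 / sqrt(n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha


rng = random.Random(2017)  # fixed seed, only for reproducibility
n_studies = 10_000

# The true effect is zero in every study, so every significant result is a false positive.
n_significant = sum(study_is_significant(rng) for _ in range(n_studies))
false_positive_rate = n_significant / n_studies

print(f"Share of null studies with p < 0.05: {false_positive_rate:.1%}")
print(f"Expected false positives among 60 null studies: {60 * 0.05:.0f}")
```

If only the significant studies from such a simulation were "published", every published result would be a false positive, even though each individual test behaved exactly as designed.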

Magnitude Errors
The second issue in the currently published literature relates to the previous point, and extends to the possibility that there might be an effect of action video gaming on reading ability. So, for now, let’s assume the effect is there. Perhaps it is even a big effect, let’s say, it has a standardised effect size (Cohen’s d) of 0.3, which is considered to be a small-to-medium-size effect. Realistically, the effect of action video gaming on reading ability is very unlikely to be bigger, since the best-established treatment effects have shown effect sizes of around 0.3 (Galuschka et al., 2014; here).

We can simulate very easily (in R) what will happen in this scenario. We pick a sample of 16 participants (the number of dyslexic children assigned to the action video gaming group in Franceschini et al., 2017). Then, we calculate the average improvement across the 16 participants, in the standardised score:
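A minimal sketch of this simulation, written here in Python rather than R. It assumes that each child's standardised improvement is drawn from a normal distribution with mean 0.3 and standard deviation 1; the function name and the seed are illustrative.

```python
import random
from statistics import mean


def mean_improvement(rng, n=16, true_effect=0.3):
    """Average standardised improvement in one simulated sample of n children."""
    scores = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    return mean(scores)


rng = random.Random(16)  # fixed seed for reproducibility; remove it to get fresh draws

# Running the same study several times gives a different observed effect each time.
observed = [mean_improvement(rng) for _ in range(4)]
for run, value in enumerate(observed, start=1):
    print(f"Run {run}: mean improvement = {value:.2f}")
```

The exact numbers depend on the random draw, so each run of the unseeded code gives different values.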


The first time I run the code, I get a mean improvement of 0.24. Not bad. Then I run the code again, and get a whopping 0.44! Next time, not so lucky: 0.09. And then, we even get a negative effect, of -0.30.

This is just a brief illustration of the fact that, when you sample from the population, your observed effect will jump around the true population effect size due to random variation. This might seem trivial to some, but, unfortunately, this fact is often forgotten even by well-established researchers, who may go on to treat an observed effect size as a precise estimate.

When we sample, repeatedly, from a population, and plot a histogram of all the observed means, we get a normal distribution: A fair few observed means will be close to the true population mean, but some will not be at all.
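The repeated sampling can be sketched as follows, again in Python and again assuming normally distributed standardised scores with a true effect of 0.3, a standard deviation of 1, and 16 participants per sample. The text histogram is only a crude stand-in for a proper plot.

```python
import random
from statistics import mean, stdev

rng = random.Random(1)  # fixed seed, only for reproducibility
n, true_effect = 16, 0.3

# Draw many samples of n participants and keep the mean of each one.
sample_means = [
    mean([rng.gauss(true_effect, 1.0) for _ in range(n)])
    for _ in range(20_000)
]

print(f"Mean of the observed means: {mean(sample_means):.3f}")
print(f"SD of the observed means:   {stdev(sample_means):.3f}")
print(f"Theoretical SD, 1/sqrt(n):  {1 / n ** 0.5:.3f}")

# Crude text histogram with bins of width 0.25 (one '#' per 200 samples).
bin_width = 0.25
counts = {}
for m in sample_means:
    left_edge = (m // bin_width) * bin_width
    counts[left_edge] = counts.get(left_edge, 0) + 1
for left_edge in sorted(counts):
    print(f"{left_edge:+.2f} {'#' * (counts[left_edge] // 200)}")
```

With 16 participants the observed means spread around the true value with a standard deviation of about 0.25, which is almost as large as the assumed effect itself.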

We’re closing in on the point I want to make here: Just by chance, someone will eventually run an experiment and obtain an effect size of 0.7, even if the true effect is 0.5, 0.2, or even 0. Bigger observed effects, when all else is equal, will yield significant results while smaller observed effects will be non-significant. This means: If you run a study, and by chance you observe an effect size that is bigger than the population effect size, there will be a higher probability that it will be significant and get published. If your identical twin sibling runs an identical study but happens to obtain an effect size that is smaller than yours – even if it corresponds to the true effect size! – it may not be significant, and they will be forced to stow it in their file drawer.

Given that only the significant effects are published (or even if there is a disproportionate number of positive compared to negative outcomes), we end up with a skewed literature. In the first-case scenario, we considered the possibility that the effect might not be there at all. In the second scenario, we assume that the effect is there, but even so, the published studies, due to the presence of publication bias, may have captured effect sizes that are larger than the actual treatment effect. This has been called by Gelman & Carlin (2014, here) the “Magnitude Error”, and has been described, with an illustration that I like to use in talks, by Schmidt in 1992 (see Figure 2, here).
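The second scenario can be illustrated with a short simulation. This is a sketch under stated assumptions rather than the procedure used by Gelman and Carlin: the true effect is 0.3, each study has 16 participants, scores have a known standard deviation of 1, and a study counts as "published" only if a two-sided z-test at alpha = 0.05 is significant.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed, only for reproducibility
n, true_effect = 16, 0.3

# Smallest absolute sample mean that reaches p < 0.05 (two-sided z-test, SD = 1).
critical_mean = 1.96 / n ** 0.5

# Observed effect size of every simulated study, published or not.
all_effects = [
    mean([rng.gauss(true_effect, 1.0) for _ in range(n)])
    for _ in range(20_000)
]

# Publication bias: only the significant studies leave the file drawer.
published = [d for d in all_effects if abs(d) > critical_mean]

power = len(published) / len(all_effects)
mean_all = mean(all_effects)
mean_published = mean(published)

print(f"Share of studies that reach significance: {power:.1%}")
print(f"Average effect across all studies:        {mean_all:.2f}")
print(f"Average effect in the published studies:  {mean_published:.2f}")
```

Because a study of this size can only reach significance when the observed mean exceeds roughly 0.49, the published effects are necessarily larger than the true effect of 0.3, even though no individual study did anything wrong.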

Getting back to action video gaming and dyslexia: Maybe action video gaming improves dyslexia. We don’t know: Given only three studies, it is difficult to adjudicate between two possible scenarios (no effect + publication bias or small effect + publication bias).

So, let’s have a look at the effects reported in the three published papers. I will ignore the 2013 paper4, because it only provides the necessary descriptives in figures rather than tables, and the journal format hides the methods section with vital information about the number of participants god-knows-where. In the 2017 paper, Table 1 provides the pre- and post-measurement values of the experimental and control group, for word reading speed, word reading accuracy, phonological decoding (pseudoword reading) speed, and phonological decoding accuracy. The paper even reports the effect sizes: The action video game training had no effect on reading accuracy. For speed, the effect sizes are d = 0.27 and d = 0.45 for word and pseudoword reading, respectively. In the 2016 paper (Gori et al.), the effect size for the increase in speed for word reading (second row of the table) is 0.34, and for pseudoword reading ability, it is 0.58.

The effect sizes are thus comparable across studies. Putting the effect sizes into context: The 2017 study found an increase in speed, from 88 seconds to 76 seconds to read a list of words, and from 86 seconds to 69 seconds to read a list of pseudowords. For words, this translates to a reduction in reading time of about 14%: In practical terms, if it takes a child 100 hours to read a book before training, it would take the same child only 86 hours to read the same book after training.
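The arithmetic behind these percentages, as a small Python sketch using the reading times quoted from the 2017 study (the helper name is illustrative):

```python
def percent_time_saved(before_s, after_s):
    """Reduction in reading time, as a percentage of the pre-training time."""
    return 100 * (before_s - after_s) / before_s


words_saved = percent_time_saved(88, 76)        # word list: 88 s before, 76 s after
pseudowords_saved = percent_time_saved(86, 69)  # pseudoword list: 86 s before, 69 s after

# A book that took 100 hours before training, scaled by the same ratio as the word list.
hours_after_training = 100 * 76 / 88

print(f"Words:       {words_saved:.1f}% less reading time")
print(f"Pseudowords: {pseudowords_saved:.1f}% less reading time")
print(f"100-hour book after training: {hours_after_training:.0f} hours")
```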

In experimental terms, this is not a huge effect, but it competes with the effect sizes for well-established treatment methods such as phonics instruction (Hedges’ g’ = 0.32; Galuschka et al., 2014)5. Phonics instruction focuses on a proximal cause of poor reading: A deficit in mapping speech sounds onto print. We would expect a focus on proximal causes to have a stronger effect than a focus on distal causes, where there are many intermediate steps between a deficit and reading ability, as explained by McArthur and Castles (2017) here. In our case, the following things have to happen for a couple of weeks of action video gaming to improve reading ability:

- Playing first-person shooter games has to increase children’s ability to switch their attention rapidly,
- The type of attention switching during reading is the same as the attention switching to a stimulus which appears suddenly on the screen,
- Improving your visual attention leads to an increase in reading speed.

There are ifs and buts at each of these steps. The link between action video gaming and visual-attentional processing would be diluted by other things which train children’s visual-attentional skills, such as how often they read, played tennis, sight-read sheet music, or looked through “Where’s Wally” books during the training period.6 In between visual-attentional processing and reading ability, are other variables which affect reading ability and dilute this link: the amount of time they read at home, motivation and tiredness at the first versus the second testing time point, and many others. These other factors dilute the treatment effect by adding variability to the experiment that is not due to the treatment. This should lead to smaller effect sizes.

In short: There might be an effect of action video gaming on reading ability. But I’m willing to bet that it will be smaller than the effect reported in the published studies. I mean this literally: I will buy a good bottle of a drink of your choice for anyone who can convince me that the effect of 2 weeks of action video gaming on reading ability is in the vicinity of d = 0.3.

How to provide a convincing case for an effect of action video gaming on reading ability
The idea that something as simple as action video gaming can improve children’s ability to do one of the most complex tasks they learn at school is an incredible claim. Incredible claims require very strong evidence. Especially if the claim has practical implications.

To convince me, one would have to conduct a study which is (1) well-powered, and (2) pre-registered. Let’s assume that the effect is, indeed, d = 0.3. With g*power, we can easily calculate how many participants we would need to recruit for 80% power. Setting “Means: Difference between two dependent means (matched pairs)” in “Statistical test”, a one-tailed test (note that both of these decisions increase power, i.e., decrease the number of required participants), effect size of 0.3, alpha of 0.05 and power of 0.8, it shows that we need 71 children in a within-children design to have adequate power to detect such an effect.
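As a rough cross-check of this number, the required sample size can be approximated with the normal distribution. This is a sketch in Python, not a re-implementation of G*Power: G*Power works with the noncentral t distribution, so the normal approximation below lands a little under its answer. The function name is illustrative.

```python
from math import ceil
from statistics import NormalDist


def paired_n_normal_approx(d, alpha=0.05, power=0.80, one_tailed=True):
    """Approximate n for a paired (one-sample) test: n = ((z_alpha + z_beta) / d) ** 2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha if one_tailed else 1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(((z_alpha + z_beta) / d) ** 2)


n_one_tailed = paired_n_normal_approx(0.3)
n_two_tailed = paired_n_normal_approx(0.3, one_tailed=False)

print(f"One-tailed, d = 0.3: about {n_one_tailed} children")
print(f"Two-tailed, d = 0.3: about {n_two_tailed} children")
```

The approximation gives 69 children for the one-tailed test, close to the 71 from G*Power, and it also shows how much the one-tailed choice helps: a two-tailed test would need about 88.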

A study should also be pre-registered. This would remove the possibility of the authors tweaking the data, analysis and variables until they get significant results. This is important in reading research, because there are many different ways in which reading ability can be calculated. For example, Gori and colleagues (Table 3) present 6 different dependent variables that can be used as the outcome measure. The greater the number of variables one can possibly analyse, the greater the flexibility for conducting analyses until at least some contrast becomes significant (Simmons et al., 2011, here). Furthermore, pre-registration will reduce the overall effect of publication bias, because there will be a record of someone having started a given study.

In short: To make a convincing case that there is an effect of the magnitude reported in the published literature, we would need a pre-registered study with at least 71 participants in a within-subject design.

Some final recommendations
For researchers: I hope that I managed to illustrate how publication bias can lead to magnitude errors: the illusion that an effect is much bigger than it actually is (regardless of whether or not it exists). Your perfect study which you pre-registered and published with a significant result and without p-hacking might be interpreted very differently if we knew about all the unpublished studies that are hidden away. This is a pretty terrifying thought: As long as publication bias exists, you can be entirely wrong with the interpretation of your study, even if you do all the right things. We are quickly running out of excuses: We need to move towards pre-registration, especially for research questions such as the one I discussed here, which has strong practical implications. So, PLEASE PLEASE PLEASE, no more underpowered and non-registered studies of action video gaming on reading ability.

For funders: Unless a study on the effect of action video gaming on reading ability is pre-registered and adequately powered, it will not give us meaningful results. So, please don’t spend any more of the taxpayers’ money on studies that cannot be used to address the question they set out to answer. In case you have too much money and don’t know what to do with it: I am looking for funding for a project on GPC learning and automatisation in reading development and dyslexia.

For parents and teachers who want to find out what’s best for their child or student: I don’t know what to tell you. I hope we’ll sort out the publication bias thing soon. In the meantime, it’s best to focus on proximal causes of reading problems, as proposed by McArthur and Castles (2017) here.

1 I know absolutely nothing about shooter games, but from what I understand characters there tend to be males.
2 More like 300 years, Wikipedia informs me.
3 This assumes no questionable research practices: With questionable research practices, the false positive rate may inflate to 60%, meaning that we would need to assume the presence of only 2 unpublished studies which did not find a significant treatment effect (Simmons et al., 2011, here).
4 I can do this in a blog post, right?
5 And this is probably an over-estimation, given publication bias.
6 If playing action video games increases visual-attentional processing ability, then so should, surely, these other things?

Thursday, November 9, 2017

On the importance of studying things that don’t work

In our reading group, we discussed a landmark paper of Paul Meehl’s, “Why summaries of research on psychological theories are often uninterpretable” (1990). The paper ends with a very strong statement (p. 242), written by Meehl in italics for extra emphasis:

We should maturely and sophisticatedly accept the fact that some perfectly legitimate “empirical” scientific theories may not be strongly testable at a given time, and that it is neither good scientific strategy nor a legitimate use of the taxpayer’s dollar to pretend otherwise.

This statement should bring up all kinds of stages of grief in psychological researchers, including anger, denial, guilt, and depression. Are we really just wasting taxpayers’ money on studying things that are not studyable (yet)?

We sometimes have ideas, theories, or models, which cannot be tested given our current measurement devices. However, research is a process of incremental progress, and in order to make progress, we need to first understand if something works or not, and if not, why it doesn’t work. If we close our eyes towards all of the things that don’t work, we cannot progress. Even worse, if we find out that something doesn’t work, and don’t make any effort to publicise our results, other researchers are likely to get the same idea, at some point in time, and start using their resources in order to also find out that it doesn’t work.

To illustrate with a short example: For some reason or another, I decided to look at individual differences in the size of psycholinguistic marker effects. With the help of half a dozen colleagues, we have collected data from approximately 100 participants, tested individually in 1-hour sessions. The results so far suggest that this approach doesn’t work: there are no individual differences in psycholinguistic marker effects.

Was I the first one to find this out? Apparently not. When sharing my conclusion with some older colleagues, they said: “Well, I could have told you that. I have tried to use this approach for many years with the same results.” Could I have known this? Did I waste the time of my colleagues and the participants in pursuing something that everyone already knows? I think not. At least my colleagues and I were unaware of any potential problems with this approach. And finding out that it doesn’t work opens interesting new questions: Why doesn’t it work? Does it work in some other populations? Can we make it work?

All of these questions are important, even if the answer is that there is no hope to make this approach work. However, in the current academic reward system, studying things that may never work is not a good strategy. If one wants publications, a better strategy is to drop a study like a hot potato once you realise that it will not give a significant result: throw it into your file drawer and move on to something else, something that will be more likely to give you a significant p-value somewhere. This is a waste of taxpayers’ money.

Tuesday, August 8, 2017

Are predatory journals really that bad?

Tales of Algerian Princes, Exotic Beauties, Old Friends Stranded And In Need, and… Your Next Submission?

All academics know these pesky little emails that our spam folder is filled with. Occasionally, a real-looking one slips through the filter, and it takes us a few minutes to figure out that we are invited to submit a paper to the journal Psychological Sciences, rather than the prestigious (or rather, high-impact) journal Psychological Science, without the ‘s’ at the end.

Predatory journals, which pose as real, often open-access journals, offer to publish your papers for a processing fee, normally several thousand US dollars. Numerous researchers have demonstrated that the peer review process, which supposedly guarantees the high quality of your paper, is completely absent or very lax in these journals. The result of these demonstrations is a set of published pseudo-academic papers with varying degrees of absurdity; see here for Zen Faulkes' non-comprehensive compilation of the funniest publications.

I argue that such predatory journals are not worse than your average spammer – but, of course, they are no better, either. Charging money for a service one doesn’t provide is a crime, be it a shipping of gold, mail-order bride, or peer-review process. What I argue here is that, despite predatory journals receiving a lot of negative attention from the research community, I have not yet seen a convincing argument to suggest that they damage science.

Also, it is a separate question whether monopolising publicly funded research, putting it behind a paywall and charging gazillions for access, then suing the crap out of anyone who dares to disseminate the knowledge, is morally superior to predatory journals. But, two wrongs don’t make a right, and this blog post is not about that.

Predatory journals: A victimless crime?
Sometimes, a paper we write is just “unlucky”: it gets rejected by journal after journal, and eventually we shrug and realise that the paper will probably never be accepted for publication. Maybe the paper really isn’t our best piece of work: it could be a failed experiment, which does not advance our understanding, but publishing it would prevent other researchers from wasting time trying the same thing. A worse scenario is a paper which contradicts previously published and “well-established” work: it could keep getting blocked by editors and reviewers who are friends with the original authors or have themselves published papers that hinge on the assumptions that we are arguing against.

In such cases, making the paper public while avoiding a stringent peer-review process is justifiable. And, in principle, if you have money, if you know that you will be publishing in a journal with very low prestige, or rather, very high anti-prestige – why not? The Frontiers Journals, anecdotally speaking, are a popular outlet for such work, and until relatively recently, Frontiers was considered a respectable open-access journal with a high impact factor, which has published some good papers.

For the record, I don’t think it’s a good idea to publish “unlucky” papers in predatory journals, for the simple reason that preprint platforms give you the same service for free, and without the possibility of damaging your reputation. The format of a preprint also has other advantages: for example, the fact that your paper is not (yet) published may encourage your colleagues to provide useful feedback (which has happened to me both times I have uploaded a preprint so far). But, for those who really want to see their “unlucky” paper in the formatted journal version, the question is: is publishing in predatory journals a victimless crime?

Playing the game of boosting your CV
Some publications in predatory journals are probably by researchers who got scammed, and genuinely believed that they were paying money for a good peer-reviewed publication in a legit open-access journal. However, I would guess that the number of such fooled researchers is relatively small – at least, I have not heard of a single case. (To be fair, anyone who has realised that they have been tricked into paying money for a bogus publication would probably be embarrassed to admit it.)

The problem seems to be that some researchers take advantage of these predatory journals to boost their publication record. Anecdotally, this seems to be a problem in the non-Western world, where researchers are often pressured by their institutions to keep up with Western standards of publishing in international peer-reviewed journals, even though they often have fewer resources to produce the same amount of high-quality research and are sometimes limited by their English skills. Predatory journals allow them to publish a large quantity of low-quality papers, without having a strict English proficiency requirement. Here, the victims are honest researchers on the job market and applying for grants. Having to compete with someone who has an artificially inflated CV is unfair. On the other hand, I would argue, the problem here is not predatory journals, but rather an evaluation system that would prefer a researcher with a hundred random-text-generator papers compared to one with five good publications. Also, I would bet that, in practice, presenting a CV with hundreds of publications in predatory journals would not get a researcher very far on the international market (though I have heard of such researchers being unfairly advantaged by their home institutions).

In summary, while playing the publication game by publishing many low-quality articles in predatory journals is not a victimless crime, as it disadvantages honest researchers, I see it as a symptom of a broken evaluation system. If we did not evaluate researchers by quantity rather than quality, researchers just wanting to make their CV look bigger could publish all the gibberish they wanted, without causing any damage to their colleagues with less fragile egos.

Bad research posing as good research
The peer review process serves as a filter to ensure that the published literature is trustworthy. For researchers, science journalists, and the general public, this filtering process means that they can read papers with more confidence. It’s peer reviewed, therefore it’s true, one might be tempted to conclude. Having papers which appear to be peer reviewed but actually contain faulty methods, analyses or inferences would create and disseminate knowledge that is false. As the demonstrations which I linked above show, any text can be published under the apparent seal of peer-review. 

Except, we all know that peer review, even in "legit" journals, is not perfect. I would like to hear from anyone who has never seen a bad published paper in their field. Some papers are just sloppy, and draw conclusions that are not justified. Occasionally, a case of data fabrication or other types of fraud blows up, and papers published in very prestigious journals that have been peer-reviewed meticulously by genuine experts are retracted. Even a perfectly executed study may be reporting a false positive – after all, it’s possible that one runs an experiment and gets a p-value of 0.01, not knowing that fifty other labs have tried the same paradigm and not found a significant effect. Thus, we should not trust the results of a paper, just because it is peer reviewed. The trustworthiness of a paper should be determined by its quality, and by whether or not the results are replicable.

Perhaps predatory journals rarely or never publish good research. Theoretically, it is possible that some publications in predatory journals are “unlucky” papers of the type I described above, in which case they may well be worth reading. In fact, if we adopt a broad definition of predatory journals and include Frontiers, it is very likely that some of the papers are good. Be that as it may, it is undeniable that peer-reviewed journals at least sometimes publish rubbish. Thus, we should not rely on peer review as an ultimate seal of approval, anyway – regardless of the outlet where a paper was published, we should first skip to the methods and results section, and judge the paper on its own merit.

Damage to the Open Science movement
When I finally published one of my “unlucky” papers in Collabra, a friend (from a completely different area of research) told me: “I don’t want to disappoint you, but… I saw that the journal you published in is one of these open access journals.” As many of the predatory journals play the card of making your work freely accessible, there is some confusion about the distinction between “good” open-access journals and predatory journals. For example, Frontiers seems to be hovering in a grey area, with many respectable scientists on the editorial boards, but examples of very bad research getting published, and editors being pressured into accepting papers for the sake of increasing profit.

It is hard to argue against the benefit of making research freely accessible, both to fellow scientists and to the general public. Therefore, it is a pity that the Open Science movement loses some of the respect and support that it deserves, not due to convincing counter-arguments but due to confusion about whether or not it has a legit peer review process. Again, though, the problem here is not predatory publishing: rather, it is misconceptions about open access and its relation to the quality of peer review.

Predatory journals pose as academic, often open-access journals, and have been shown to publish, for a fee, any text with a very lax peer review process, or none at all. Predatory journals are annoying, because they spam researchers in an attempt to receive submissions, and they are immoral, because they may trick a researcher into paying money for the service of high-quality peer review which will not be provided.

There are other issues which may be argued to impede the progress of science. Allowing researchers to inflate their CVs by publishing a large quantity of low-quality work may disadvantage more honest researchers with fewer but better publications, who compete with them for jobs and funding. This would lead to the selection of bad scientists in high-level positions. Publishing low-quality papers as peer reviewed studies may confuse other researchers, science journalists and the general public, and would thus serve to disseminate facts that are not true. Finally, as they pose as open-access journals, predatory journals damage the reputation of other open-access journals, by spreading the misconception that open-access journals necessarily have a lax peer review process and publish anything to increase their financial profit.

I argue that the issues discussed in the previous paragraph – though they are real and important problems – are symptoms of an imperfect evaluation system, rather than caused by the presence of predatory journals. In an ideal world, researchers and papers would be evaluated on their own merit, rather than by a number representing the quantity of publications or impact factors. This is rather difficult to achieve, because it requires top-down changes from employers and funders. But, in this ideal world, publishing in a predatory journal would become nothing more than an auto-ego-stroking gesture. Also, myths about open access journals need to be dispelled, so that the negative publicity that predatory journals receive would not damage the open science movement. Many open access journals, such as Collabra and RIO, have the option of publishing the reviewers’ comments alongside the paper. This practice should dispel any doubts about the legitimacy of the peer review process. If the same was done for all journals, this could be used as an indicator for the journal’s quality, rather than the label of being open access, which is, in principle, orthogonal to the peer review process.

So, what should we do about the presence of predatory journals? Address the issues from the previous paragraph, somehow. And, in the meantime, treat emails from predatory journals the same way you treat any other spam: either delete them, or, for a slow day in the office, see here for some inspiration.

Sunday, July 23, 2017

Learning about nuclear fusion in the Balkans

Novi Sad is a city of 250,000 people, the second-largest in Serbia. It is located on the bank of the Danube river. On the other side of the river is a hill with a castle, and an amazing view of the city during sunset. It has a university with a green campus, located conveniently between the river and the city centre. The city centre is in central European style: a large, main square next to a majestic cathedral, narrow streets with cafés and bars. Last year, the city was also host to a nuclear fusion workshop, the Fusion Days@NS, an annual summer school for students wanting to learn more about nuclear fusion.
Nuclear fusion may well hold the key to sustainable energy production. It is the reaction that takes place in the sun, and that provides nearly all energy on earth. Now, scientists all over the world are trying to recreate this reaction on earth to produce energy. This involves heating up plasma to 200,000,000 degrees in order to get the atomic nuclei to fuse. If scientists succeed in creating a device that would produce more energy than it uses for heating, we would have a practically unlimited source of energy. In contrast to the atomic reactors which are currently in use, there will be much less radioactive waste and no danger of major accidents; in contrast to coal mining, a large amount of energy can be created from a small amount of matter; in contrast to renewable sources such as wind or solar power, the supply of energy would be continuous and reliable.
The Fusion Education Workshop lasted one week. The first three days were filled with talks, and the last two days were a hands-on workshop where students could conduct experiments, remotely, on the Golem Tokamak device in Prague. The students had the possibility to perform the experiment, analyse the data, and write a report. The two best reports got a prize: a two-week internship at the Tokamak Department of the Institute of Plasma Physics in Prague.
Among the students, the youngest was 16 years old, and the older students were already at the master's level. The participants for the experimental section were chosen to be gender balanced – interestingly, the two winners of the internship were girls. The talks were presented by world-leading researchers from Spain, the Netherlands, Belgium, Germany, and the Czech Republic. Between the talks by the big shots, PhD students and post-docs from seven different countries briefly presented their own stories and work, providing an insight into what one may expect if one goes on to do research in the area. Social events in the evenings allowed the workshop participants to get to know the early career researchers and ask any further questions about the academic pathway.
What does it take to organise such a workshop in a country that has no department of plasma physics or nuclear fusion? The answer is: five enthusiastic PhD students. The Fusion Education Network Team, Miloš Vlanić, Ana Kostić, Branka Vanovac, Vladica Nikolić, and Maša Šćepanović, are PhD students from the Balkan region, studying at various universities in Europe. On their own initiative, they decided to bring back the knowledge they acquired during their studies abroad. Not only did they take the initiative, but they also organised the entire event, including invited speakers, negotiations regarding the experiment on the Golem device and the internships in Prague, and fully funded the workshop out of their own pockets.
The workshop also took place in the preceding year, 2015. The organiser team keeps in contact with the participants from the previous years, and supports them in any academic endeavours. The next workshop is scheduled for September 2017, in Belgrade (Fusion Days@BG). For this year, the event is partly funded by crowdfunding. You can back the project here:

The workshop is an excellent example of what a small group of enthusiastic early career scientists is able to achieve. The workshops simultaneously support physics students who are thinking about a career in research, build up a scientific community in an area which is not well-established in the Balkan region, and encourage future researchers to focus on an area of study which aims at nothing less than an unlimited source of energy.

Monday, June 19, 2017

Should we increase our sample sizes, or keep them the same? We need to make up our minds

Amidst the outcries and discussions about the replication crisis, there is one point on which there is a general consensus: very often, studies in psychology are underpowered. An underpowered study is one which, under the assumption that the effect is real, runs a high risk of not detecting the effect at the significance threshold. The word that we need to run bigger studies has seeped through the layer of replication bullies to the general scientific population. Papers are increasingly often being rejected for having small sample sizes. If nothing else, that should be reason enough to care about this issue.

Despite the general consensus about the importance of properly-powered studies, there is no real consensus about what we should actually do about it, in practice. Of course, the solution, in theory, is simple – we need to run bigger studies. But this solution is only simple if you have the resources to do so. In practice, as I will discuss below, there are many issues that remain unaddressed. I argue that, despite the upwards trend in psychological science, drastic measures need to be taken to enable scientists (regardless of their background) to produce good science.

For those who believe that underpowered studies are not a problem
Meehl, Cohen, Schmidt, Gelman – they all explain the problem of underpowered studies much better than I ever could. The notion that underpowered studies give you misleading results is not an opinion – it’s a mathematical fact. But seeing is believing, and if you still believe that you can get useful information with small or medium-sized effects and 20 participants, the best way to convince you otherwise is to show you some simulations. If you haven’t tinkered around with simulating data, download R, copy-and-paste the code below, and see what happens. Do it. Now. It doesn't matter if you're an undergraduate student, professor, or lay person who somehow stumbled across this blog post. I’ll wait. 

*elevator music*

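# Simulating the populations (the population size of 10,000 is arbitrary)
pop1 <- rnorm(n = 10000, mean = 100, sd = 15)
pop2 <- rnorm(n = 10000, mean = 106, sd = 15)
# This gives us a true effect in the population of Cohen's d = 0.4.
# Sampling 20 participants from the population
sample1 <- sample(pop1, 20)
sample2 <- sample(pop2, 20)
# Calculating the means for the two samples
c(mean(sample1), mean(sample2))
# Note how the means vary with each time we run the simulation.
t.test(sample1, sample2)
# Note how many of the results give you a “significant” p-value.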

The populations that we are simulating have a mean (e.g., IQ) of 100 and 106, respectively, and a standard deviation of 15. The difference can be summarised as a Cohen’s d effect size of 0.4, a medium-sized effect. One may get an intuitive feeling for how strong an experimental manipulation would need to be to cause a true difference of 6 IQ points. The power (i.e., probability of obtaining a significant result, given that we know that the alternative hypothesis is true and we have an effect of Cohen’s d = 0.4 in the population) is 23% with 20 participants per cell (i.e., 40 altogether). You should see the observed means jumping around quite a lot, suggesting that if you care about quantifying the size of the effect you will get very unstable results. You should also see a large number of simulations returning non-significant effects, despite the fact that we know that there is an effect in the population, suggesting that if you want to make reject/accept H0 decisions based on a single study you will be wrong most of the time.
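To put a number on "most of the time", the simulation described above can be repeated many times and the significant results counted. What follows is only a minimal sketch under my own assumptions: it samples directly from normal distributions rather than from a finite simulated population, and the function name `simulate_p`, the seed, and the 5,000 repetitions are arbitrary choices made for illustration.

```r
# A minimal sketch: estimate power by repeating the two-group simulation.
# Assumptions (illustrative only): 5,000 repetitions, a fixed seed,
# and sampling directly from normal distributions.
set.seed(2017)

simulate_p <- function(n = 20, mean1 = 100, mean2 = 106, sd = 15) {
  sample1 <- rnorm(n, mean = mean1, sd = sd)
  sample2 <- rnorm(n, mean = mean2, sd = sd)
  t.test(sample1, sample2)$p.value
}

n_sims <- 5000
p_values <- replicate(n_sims, simulate_p())

# Proportion of simulated studies with p < .05: the simulated power
simulated_power <- mean(p_values < 0.05)
simulated_power

# Analytical power for the same design, for comparison
analytic_power <- power.t.test(n = 20, delta = 6, sd = 15, sig.level = 0.05)$power
analytic_power
```

Both numbers should land close to the 23% mentioned above. Changing `n` in the call to `simulate_p()` (and in `power.t.test()`) shows how many participants per group it takes before a significant result becomes the rule rather than the exception.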

For the professors who forgot what it’s like to be young
So, we need to increase our sample sizes if we study small-to-medium effects. What’s the problem? The problems are practical in nature. Maybe you are lucky enough to have gone through all stages of your career at a department that has a very active participant pool, unlimited resources for paying participants, and maybe even an army of bored research assistants just waiting to be assigned the task of going out and finding hundreds of participants. In this case, you can count yourself incredibly lucky. My PhD experience was similar to this. With a pool of keen undergraduates, enough funds to pay a practically unlimited number of participants, and modern booth labs where I could test up to 8 people in parallel, I once managed to collect enough data for a four-experiment paper within a month. I list the following subsequent experiences to respectfully remind colleagues that things aren’t always this easy. These are my experiences, of course – I don’t know how many people have similar stories. My guess is that I’m not alone. Especially early-career researchers and scientists from non-first-world countries, where giving funding to social sciences is not really a thing yet, probably have similar experiences. Or maybe I’m wrong about that, and I’m just unlucky. Either way, I would be interested to hear about others’ experiences in the comments.

-       Working in a small, stuffy lab with no windows and only one computer that takes about as long to start as it takes you to run a participant.
-       Relying on bachelor students to collect data. They have no resources for this. They can ask their friends and families, stop people in the corridor, and only their genuine interest and curiosity in the research question stops them from just sitting in the lab for ten hours and testing themselves over and over again, or learning how to write code for a random number generator to produce the data that is expected of them.
-       Paying for participants from your own pocket.
-       Commuting for two hours (one way) to a place with participants, with a 39-degree fever, then trying hard not to cough while the participants do tasks involving voice recording.
-       Pre-registering your study, then having your contract run out before you have managed to collect the number of participants you’d promised.
-       Trying to find free spots on the psychology department notice boards or toilet doors to plaster the flyer for your study between an abundance of other recruitment posters, and getting, on average, less than one participant per week, despite incessant spamming.
-       Raising the issue of participant recruitment with senior colleagues, but not being able to come up with a practically feasible way to recruit participants more efficiently.
-       Trying to find collaborators to help you with data collection, but learning that while people are happy to help, they rarely have spare resources they could use to recruit and test participants for you.
-       Writing to lecturers to ask if you can advertise your study in their lectures. Being told that so many students ask the same question that allowing everyone to present their study in class is just not feasible anymore.

I can consider myself lucky in the sense that I’m doing mostly behavioural studies with unselected samples of adults. If you are conducting imaging studies, the price of a single participant cannot be covered from your own pocket if the university decides not to pay. If you are studying a special population, such as people with a rare disease, finding seven participants in the entire country during your whole PhD or post-doc contract could already be an achievement. If you are conducting experiments with children, bureaucratic hurdles may prevent you from directly approaching your target population.

So, can we keep it small?
It’s all good and well, some people say, to make theoretical claims about the sample sizes that we need. But there are practical hurdles that make it impossible in many cases. So, can we ignore the armchair theoreticians’ hysteria about power and use practical feasibility to guide our sample sizes?

Well, in theory we can. But in order to allow science to progress, we, as a field, need to make some concessions:

-       Every study should be published, i.e., there should be no publication bias.
-       Every study should provide full data in a freely accessible online repository.
-       Every couple of years, someone needs to do a meta-analysis to synthesise the results from the existing small studies.
-       Replications (including direct replications) are not frowned upon.
-       We cannot, ever, draw conclusions from a single study. 

At this stage, none of these premises are satisfied. Therefore, if we continue to conduct small studies in the current system, those that show non-significant results will likely disappear in a file drawer. Ironically, the increased awareness of power amongst reviewers is increasing publication bias at the same time: reviewers who recommend rejection based on small sample sizes have good intentions, but this leads to an even larger amount of data that never see the light of day. In addition, studies that have marginally significant effects will be p-hacked beyond recognition. For meta-analyses, the published literature will then give us a completely skewed view of the world. And in the end, we’ve wasted a lot of resources and learned nothing.
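The skew is easy to see in a simulation. The sketch below is an illustration under my own assumptions, not an analysis of any real literature: it reuses the IQ example from above (true difference of 6 points, 20 participants per group), runs 5,000 small studies, and pretends that only those with p < .05 get published. The function name `run_study`, the seed, and the number of studies are arbitrary.

```r
# A minimal sketch of publication bias with small studies.
# Assumptions (illustrative only): 5,000 simulated studies, a fixed seed,
# and a "file drawer" that swallows every result with p >= .05.
set.seed(2017)

run_study <- function(n = 20, mean1 = 100, mean2 = 106, sd = 15) {
  sample1 <- rnorm(n, mean = mean1, sd = sd)
  sample2 <- rnorm(n, mean = mean2, sd = sd)
  c(difference = mean(sample2) - mean(sample1),
    p = t.test(sample1, sample2)$p.value)
}

# One row per study, with columns "difference" and "p"
studies <- t(replicate(5000, run_study()))
published <- studies[studies[, "p"] < 0.05, , drop = FALSE]

# Average observed difference across all studies (true difference: 6 points)
mean_all <- mean(studies[, "difference"])
# Average observed difference in the "published" studies only
mean_published <- mean(published[, "difference"])
c(all = mean_all, published = mean_published)
```

Averaged over all studies, the observed difference should sit close to the true 6 points; averaged over the "published" studies only, it should come out well above that. A meta-analysis that can only see the second set inherits the inflation.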

So, increasing sample size it is?
Unless we, as a field, tackle the issues described in the previous section, we will need to increase our sample sizes. There is no way around it. This solution will work, under a single premise:

-       Research is not for everyone: Publishable studies will be conducted by a handful of labs in elite universities, who have the funding to recruit hundreds of participants within weeks or months. These will be the labs that will produce high-quality research at a fast pace, which will result in them winning more grants and producing even more high-quality research. And those who don’t have the resources to conduct large studies from the beginning? Well, fuck ‘em. 

This is a valid view point, as a world where this is the norm would not have any of the problems associated with the small-study-world described above. And yet, I would say that such a world would be very bad. First, for individuals such as me (of course, I have some personal-interest-motivations in writing this blog post), who spend months and months, lugging around the testing laptop through trains and different departments in search of participants, while other researchers snap their fingers and get their research assistant to run the same study in a matter of weeks. Second, it disadvantages populations of researchers who may have systematically different views. As mentioned above, populations with fewer resources probably include younger researchers, and those from not-first-world countries. Reducing the opportunity for these researchers to contribute to their field of expertise will create a monotonous field, where scientific theories are based, to a large extent, on the musings of old white men. By this process, the field would lose an overwhelming amount of potential by locking out a majority of scholars.

In short, I argue that publishing only well-powered studies without consideration of practical issues that some researchers face will be bad for individual researchers, as well as the whole field. So, how can we increase power without creating a Matthew Effect, where the rich get richer and the poor get poorer? 

-       Collaborate more, as I’ve argued here.
-       Routinely use StudySwap to look for collaborators who help you to get the sample size you need, but also to collect data for other researchers if you happen to have some bored research assistants or lots of keen undergrads.
-       For the latter part of the last point, “rich” researchers will need to start sacrificing their own resources, which they could well use for a study of their own, that would have a chance of getting them another first-author publication instead of ending up as fifth out of seven authors on someone else’s paper.
-       As a logical consequence of the last point, researchers need to change their mindset, such that they prefer to publish fewer first-author papers and to spend more time collecting data, both for their own pet projects and for others'.
-       And why are we so obsessed with first-author publications in the first place? It’s our incentive system, of course. We, as a field, should stop giving scholarships, jobs, grants, and promotions to researchers with the most first-author publications.

And where to now?
Perhaps an ideal world would consist of large-scale studies, and small studies and meta-analyses, as it kind of does already. But in order to allow for the build-up of knowledge in such a system, to be able to separate true effects from crap in candy wrappers, we, as a field, need to fix all of the issues above.

And in the meantime, there are more questions than answers for individual researchers. Do I conduct a large study? Do I bank all of my resources on a single experiment, with a chance that, for whatever reason, it may not work out, and I will finish my contract without a single publication? Do I risk looking, in front of a prospective longish-term employer, like a dreamer, one who promises the moon but in the end fails to recruit enough participants? Or do I conduct small studies during my short-term contract? Do I risk that journals will reject all of my papers because they are underpowered? Do I run a small study, knowing that, most likely, the results will be uninterpretable? Knowing that I may face pressure to p-hack to get publishable results, from journals, collaborators, or the shrewd little devil sitting on my shoulder, reminding me that I won’t have a job if I don’t get publications?

Wednesday, April 19, 2017

How much statistics do psychological scientists need to know? Also, a reading list

TL;DR: As much as possible.

The question of how much statistics psychological scientists should know has been discussed numerous times on twitter and psychology method groups on facebook. The consensus seems to be that psychologists need to know some stats, but they don’t need to be statisticians. When it comes to specifics, though, there does not seem to be any consensus: some argue that knowing the basics of the tests that are useful for one’s specific field is enough, while others argue that a thorough understanding of the concepts is important.

Here, I argue, based on my own experience, that a thorough understanding of statistics substantially enhances the quality of one’s work. The reason why I think statistical knowledge is really important is that the amount of knowledge you have constrains the experiments you can conduct: If one’s only tool is ANOVA, there is only a limited set of possible experiments that fit within the mould of this statistical test. *

First, a little bit about my stats background. I don’t remember much from high school maths: I think I had a lot of motivation to repress any memories about it. One of the biggest mysteries of my undergraduate degree is how I even passed my statistics courses. I guess they had to scale everyone’s marks to avoid failing too many students. After these experiences, though, I have spent a lot of time learning about the tools that are the cornerstone of making sense of my experiments. As my supervisor told me: the best way to learn about statistical analyses is when you have some data that you care about. When I started my PhD, I dreaded the day when I would be asked to do anything more complex than a correlation matrix. But during the PhD, I learned, through trial and error and with a lot of guidance from experienced colleagues, to analyse data in R with linear mixed effect models and Bayes Factors. When I started my post-doc, driven by my curiosity about how it is possible that we can get two identical experiments with completely different results (i.e., with p-values on different sides of the significance threshold), I decided to learn more about how this stats thing actually works. My interest was further sparked by several papers I read on this topic, and a one-day workshop given by Daniël Lakens in Rovereto, which I happened to hear about via twitter. It culminated with my signing up for a part-time distance course, a graduate certificate in statistics, which I’m due to finish in June.

This learning process has taken a lot of time. Cynically speaking, I would not recommend it to early career researchers, who would probably maximise their chances of success in academia by focussing on publishing lots of papers (quantity is more important than quality, right?). If you have only a short-term contract (anywhere between 6 months and 2 years), you probably won’t have time to do both. Besides, you will never again want to do N=20 studies, and unless your department is rich, conducting a high-powered experiment might not be feasible during a short-term post-doc contract. Ideologically speaking, I would recommend this learning process to every social scientist who feels that they don’t know enough. In my experience, it’s worth putting one’s research on hold to learn about stats: moving from following a set of arbitrary conventions** to understanding why these conventions make sense is a liberating experience, not to mention the increase of the quality of your work, and the ability to design studies that maximise the chance of getting meaningful results.

Useful resources
For anyone who is reading this blog post because they would like to learn more about statistics, I have compiled a list of resources that I found useful. They contain both statistics-oriented material, and material which is more about philosophy of science. I see them as two sides of the same coin, so I don’t make a distinction between them below.

First of all: If you haven’t already done so, sign up on twitter, and follow people who tweet about stats. Read their blogs. I am pretty sure that I learned more about stats this way than I have during my undergraduate degree. Some people I’ve learned from (it’s not a comprehensive list, but they’re all interconnected: if you follow some, you’ll find others through their discussions): Daniel Lakens (@lakens), Dorothy Bishop (@deevybee), Andrew Gelman (@StatModeling), Hilda Bastian (@hildabast), Alexander Etz (@AlxEtz), Richard Morey (@richardmorey), Deborah Mayo (@learnfromerror) and Uli Schimmack (@R_Index). If you’re on facebook, join some psychological methods groups. I frequently lurk on PsycMAP and the Psychological Methods Discussion Group.

Below are some papers (again, a non-comprehensive and somewhat sporadic list) that I found useful. I tried to sort them in order of difficulty, but I didn’t do it in a very systematic way (also, some of those papers I read a long time ago, so I don’t remember how difficult they were). I think most papers should be readable to most people with some experience with statistical tests. As an aside: even those who don’t know much about statistics may be aware of frequent discussions and disagreements among experts about how to do stats. The readings below contain a mixture of views, some of which I agree with more than with others. However, all of them have been useful for me in the sense that they helped me understand some new concepts.

Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7-29.

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304.

Savalei, V., & Dunn, E. (2015). Is the call to abandon p-values the red herring of the replicability crisis? Frontiers in Psychology, 6, 245.

Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641-651.

Gelman, A., & Weakliem, D. (2009). Of beauty, sex and power: Too little attention has been paid to the statistical challenges in estimating small effects. American Scientist, 97(4), 310-316.

Lakens, D., & Evers, E. R. (2014). Sailing from the seas of chaos into the corridor of stability: Practical recommendations to increase the informational value of studies. Perspectives on Psychological Science, 9(3), 278-292.

Cramer, A. O., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R. P., ... & Wagenmakers, E. J. (2016). Hidden multiplicity in exploratory multiway ANOVA: Prevalence and remedies. Psychonomic Bulletin & Review, 23(2), 640-647.

Luck, S. J., & Gaspelin, N. (2017). How to get statistically significant effects in any ERP experiment (and why you shouldn't). Psychophysiology, 54(1), 146-157.

Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103-115.

Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66(1), 195-244.

Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47(10), 1173.

Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1(2), 115-129.

Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376.

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551.

Wagenmakers, E. J., Verhagen, J., Ly, A., Bakker, M., Lee, M. D., Matzke, D., ... & Morey, R. D. (2015). A power fallacy. Behavior Research Methods, 47(4), 913-917.

Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E. J. (2016). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 23(1), 103-123.

Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E. J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5), 1157-1164.

Kline, R. B. (2004). What's Wrong With Statistical Tests--And Where We Go From Here. (Chapter 3 from Beyond Significance Testing. Reforming data analysis methods in behavioural research. Washington, DC: APA Books.)

Royall, R. M. (1986). The effect of sample size on the meaning of significance tests. The American Statistician, 40(4), 313-315.

Schönbrodt, F. D., Wagenmakers, E. J., Zehetleitner, M., & Perugini, M. (2015). Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences. Psychological Methods.

Westfall, J., & Yarkoni, T. (2016). Statistically controlling for confounding constructs is harder than you think. PloS One, 11(3), e0152719.

Forstmeier, W., & Schielzeth, H. (2011). Cryptic multiple hypotheses testing in linear models: overestimated effect sizes and the winner's curse. Behavioral Ecology and Sociobiology, 65(1), 47-55.

In terms of books, I recommend Dienes’ “Understanding Psychology as a Science” and McElreath’s “Statistical Rethinking”.

Then there are some online courses and videos. First, there is Daniel Lakens’ Coursera course in statistical inferences. From a more theoretical perspective, I like this MIT Probability course by John Tsitsiklis, and Meehl’s Lectures. For something more serious, you could also try a university course, like a distance education graduate certificate in statistics. Here is a very positive review of the Sheffield University course which I am currently doing. However, at least if your maths skills are as bad as mine, I would not recommend doing it on top of a full-time job.

Learning stats is a long and never-ending road, but if you are interested in designing strong and informative studies and being flexible with what you can do with data, it is a worthwhile investment. There is always more to learn, and no matter how much I learn I continue to feel like I know less than I should. However, it’s a steep learning curve, so even investing a little bit of time and effort can already have beneficial effects. This is possible, even if you have only fifteen minutes to spare each day, through the resources that I tried to do justice to in my list above.

I should conclude, I think, by thanking all those who make these resources available, be it via published papers, lectures, blog posts, or discussions on social media.  

* To provide an example, I will do some shameless self-advertising: in a paper that came out of my PhD, we got data which seemed uninterpretable at first. However, thanks to collaboration with a colleague with a mathematics background, Serje Robidoux, we could make sense of the data with an optimisation procedure. While an ANOVA would not have given us anything useful, the optimisation procedure allowed us to conclude that readers use different sources of information when they read aloud unfamiliar words, and that there is individual variation in the relative degree to which they rely on these different sources of information. This is one of my favourite papers I’ve published so far, but it’s only been cited by myself to date. (*He-hem!*) Here is the reference:
Schmalz, X., Marinus, E., Robidoux, S., Palethorpe, S., Castles, A., & Coltheart, M. (2014). Quantifying the reliance on different sublexical correspondences in German and English. Journal of Cognitive Psychology, 26(8), 831-852.

** Conventions such as:
“Control for multiple comparisons.”
“Don’t interpret non-significant p-values as evidence for the null hypothesis.”
“If you have a marginally significant p-value, don’t collect more data to see if the p-value drops below the threshold.”
“The p-value relates to the probability of the data, not the hypothesis.”
“For Meehl’s sake, don’t mess up the exact wording of the definition of a confidence interval!”