TL;DR: Thinking about effect size can help to plan
experiments and evaluate existing studies. Bigger is not always better!
In grant applications, requests for ethics permission, article introductions or pre-registered reports it is often expected that the
researcher provides an estimate of the expected effect size. This should then
be used to determine the sample size that the researcher aims to recruit. At
first glance, it may seem counter-intuitive that one has to specify an effect
size before one collects the data – after all, the aim of (most) experiments is
to establish what this effect size could be, and/or whether it is different
from zero. At second glance, specifying an expected effect size may be one of
the most important issues to consider in the planning of an experiment.
Underpowered studies reduce the researcher’s chance to find an effect, even if
the effect exists in the population – running an experiment with 30% power
gives one a worse chance to find the effect than tossing a coin to determine
the experiment’s outcome. Moreover, significant effects in underpowered studies overestimate the true effect
size: only estimates that exceed the significance cut-off are declared significant, and in an underpowered study this
cut-off is larger than the population effect itself (Gelman & Carlin, 2014; Gelman & Weakliem, 2009; Schmidt, 1992).
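To make this concrete, here is a small simulation sketch in R. The values I use (a true effect of d = 0.45 and n = 20 per group, which gives roughly 30% power) are arbitrary choices for illustration, not estimates from any study discussed below:

```r
# Minimal simulation of the Type M (magnitude) error: with a true d of 0.45
# and n = 20 per group (roughly 30% power), the significant results alone
# overestimate the true effect. Parameter values are illustrative only.
set.seed(1)
true_d <- 0.45
n      <- 20

sim <- replicate(10000, {
  dys  <- rnorm(n, mean = 0)       # hypothetical "dyslexic" group
  ctrl <- rnorm(n, mean = true_d)  # hypothetical control group (scores higher)
  # observed Cohen's d (pooled SD) and p-value of a two-sample t-test
  d <- (mean(ctrl) - mean(dys)) / sqrt((var(ctrl) + var(dys)) / 2)
  p <- t.test(ctrl, dys)$p.value
  c(d = d, p = p)
})

significant <- sim["p", ] < .05
mean(significant)              # empirical power: roughly 0.3
mean(sim["d", ][significant])  # mean observed d among significant results:
                               # well above the true value of 0.45
```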
So, given that it is important to have an idea of the effect
before running the study, what is the best way to come up with a reasonable a priori effect size estimate? One
possibility is to run a pilot test; another is to specify the smallest effect
size that would be of (theoretical and/or practical) significance; the third
would be to consider everything we know about effects similar to the
one we are interested in, and to use that knowledge to guide our expectations. All three
have benefits and drawbacks, which I will not discuss here. Instead, I
focus on the third possibility: knowing how big the most stable effects
in your research area are can constrain your expectations – it is
unlikely that your “new” effect, a high-hanging fruit, is bigger than the
low-hanging fruit that was plucked decades ago. Specifically, I will try
to provide an effect size estimate which could be used by researchers seeking
to find potential deficits underlying developmental dyslexia, and discuss the
limitations of this approach in this particular instance.
A large proportion of the literature on developmental
dyslexia focuses on finding potentially causal deficits. Such studies generally
test dyslexic and control participants on a task which is not directly related
to reading, such as visual attention, implicit learning, or balance. If dyslexic participants perform significantly worse than control
participants, it is suggested that the task might
tap an underlying deficit which causes dyslexia, can be used as an early
marker of dyslexia, and can be treated to improve reading skills. The abundance
of such studies has led to an abundance of theories of developmental dyslexia,
which has led to – well – a bit of a mess. I will not go into details here; the
interested reader can scan the titles of papers from any reading- or
dyslexia-related journal for some examples.
A problem arises now that we know that many published
findings are Type I errors (i.e., significant group differences in the
sample that are absent in the population; see Ioannidis, 2005; Open Science
Collaboration, 2015). Sifting through the vast and often contradictory literature to find out whether there is a sound case for a
dyslexic deficit on each of the tasks that have been studied to date would be a
mammoth task. Yet, this would be an important step for an integrative theory of
dyslexia. Making sense of the existing literature would not only constrain
theory, but also prevent researchers from wasting resources on treatment
studies which are not likely to yield fruitful results.
Both to evaluate the existing literature, and to plan future
studies, it would be helpful to have an idea of how big an effect can be
reasonably expected. To this end, I decided to calculate an effect size
estimate of the phonological awareness deficit. The phonological awareness
deficit relates to the consistent finding that participants with dyslexia perform
more poorly than controls on tasks involving the manipulation of phonemes and
other sublexical spoken units (Melby-Lervåg, Lyster, & Hulme, 2012; Snowling, 2000; for a discussion
about potential causality see Castles & Coltheart, 2004).
Methods
Studies which are not about phonological awareness often
include phonological awareness tasks in their test battery when comparing
dyslexic participants to controls. In these instances, phonological awareness
is not the variable of interest, so there is little reason to suspect
p-hacking (e.g.,
including covariates, removing outliers, or adding participants until the group
difference becomes significant). For this reason, the effect size estimate is
based only on studies where phonological awareness was not the critical
experimental task. Specifically, I looked through all papers lying
around on my desk on the subject of statistical learning and dyslexia, and on
magnocellular/dorsal functioning in dyslexia. I included in the analysis all
papers which provided a table with group means and standard deviations on tasks
of phonological awareness. This resulted in 15 papers, 6 testing children and 9
testing adults, from 5 different languages. Mostly, the
phonological awareness measures were spoonerism tasks. A full list of studies, with the tasks, participant characteristics, and full references to the papers, can be found in the OSF repository linked below.
I generated a forest plot with the metafor package for R
(Viechtbauer, 2010). I used a random-effects model with the Sidik-Jonkman estimator (Sidik
& Jonkman, 2005; see Edit below).
The R analysis script and the data file (including full references to all
papers that were included) can be found here: osf.io/69a2p
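For readers who want to run a similar analysis themselves, the sketch below shows the general metafor workflow I used; the file and column names are placeholders of my own, not those of the actual data file on the OSF:

```r
library(metafor)

# One row per study, with group means, SDs and sample sizes.
# Column names below are placeholders, not those of the actual data file.
dat <- read.csv("phon_awareness.csv")

# Standardised mean differences (metafor's "SMD" applies the small-sample
# correction, i.e. Hedges' g)
dat <- escalc(measure = "SMD",
              m1i = m_dys,  sd1i = sd_dys,  n1i = n_dys,
              m2i = m_ctrl, sd2i = sd_ctrl, n2i = n_ctrl,
              data = dat)

# Random-effects model; method = "SJ" is the Sidik-Jonkman estimator used
# here (see the Edit below for the REML alternative)
res <- rma(yi, vi, data = dat, method = "SJ")
summary(res)

# Forest plot of the individual studies and the pooled estimate
forest(res)
```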
Results
The results are shown in the
figure below. All studies show a negative effect (i.e., worse performance on
phonological awareness tasks for dyslexic compared to control groups). The
effect size estimate is d = 1.24,
with a relatively narrow confidence interval, 95% CI [0.95, 1.54].
Conclusions and limitations
When comparing dyslexic participants and controls on
phonological awareness tasks, one can expect a large effect size of around d = 1.2. A phonological awareness
deficit is the most established phenomenon in dyslexia research; given that all
other effects (as far as I know) are contentious, having been
replicated by some labs but not by others, it is likely that a
group difference on any other task (which is not directly related to reading) will
be smaller. For researchers, peer reviewers, and editors, the obtained effect size of a new experiment can be used as a red flag: if a study examines whether or not there is
a group difference in, say, chewing speed between dyslexic and control groups*,
and obtains an effect size of d >
1, one should raise an eyebrow. This would be a clear indicator that one has
obtained a Type M (magnitude) error (Gelman & Carlin, 2014). Obtaining several effect
sizes of this magnitude in a multi-experiment study could even be an indicator
of p-hacking or worse (Schimmack, 2012).
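To put d ≈ 1.2 into perspective, here is a quick sample-size sketch using base R’s power.t.test; the smaller comparison effect sizes are arbitrary choices of mine:

```r
# Per-group sample size for 80% power, two-sided alpha = .05,
# two-sample t-test
power.t.test(delta = 1.2, power = 0.8)$n  # roughly 12 participants per group
power.t.test(delta = 0.5, power = 0.8)$n  # roughly 64 per group
power.t.test(delta = 0.3, power = 0.8)$n  # roughly 176 per group
```

In other words, a true effect of this size would show up reliably even in very small samples, which is part of what makes d > 1 for a novel, contentious deficit so implausible.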
There are two limitations to the current blog post. The first
is methodological; the second is theoretical. First, I have deliberately chosen
to include studies where phonological awareness was not of primary interest.
This means that the original authors would not have had an incentive to make
this effect look bigger than it actually is. However, there is a potential alternative source
of bias: it is possible that (at least some) researchers use phonological
awareness as a manipulation check. Given that this effect is well-established, a
failure to find it in one’s sample could be taken to suggest that the research
assistant who collected the data screwed up. This could lead to a file-drawer
problem, where researchers discard all datasets where the group difference in
phonological awareness was not significant. It is also plausible that
researchers would simply not report having tested this variable if it did not
yield a significant difference, as the critical comparison in all papers was on
other cognitive tasks. If other studies have obtained non-significant
differences in phonological awareness but did not report these, the current
analysis presents a hugely inflated effect size estimate.
The second issue relates less to the validity of the effect
size, and more to its utility: the effect size estimate is really, really big.
A researcher paying attention to effect sizes will be sceptical anyway if
they see an effect size this big. Thus, in this case, the effect size of the
most stable effect in the area cannot be used to constrain our expectation of
potential effect sizes of other experiments. To guide the expectation of an
effect size for future studies, one might therefore want to turn to alternative
approaches. Pilot testing could be useful, but its limitation is that it is
often difficult to recruit enough participants to get a meaningful pilot study plus a well-powered full experiment. At
this stage, it would also be difficult to define a minimum effect size that
would be of interest. This could change, however, if we develop models of
reading and dyslexia that make quantitative predictions. (This is unlikely to
happen anytime soon, though.) Currently, the most useful approach to
determining whether or not there is an effect, given limited resources and no effect
size estimate, seems to be optional stopping (i.e., sequential analysis). This can be done legitimately if it is planned
in advance; in the frequentist framework, alpha levels need to be adjusted a priori (Lakens, 2014; for Bayesian approaches see Rouder, 2014; Schönbrodt, 2015).
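As an illustration of the Bayesian variant, here is a sketch roughly in the spirit of Schönbrodt’s sequential Bayes factor design, using the BayesFactor package. The batch size, maximum sample size, stopping thresholds, and the get_batch helper are hypothetical choices of mine:

```r
library(BayesFactor)

# Sequential Bayes factor sketch: test participants in batches and stop once
# the Bayes factor clearly favours either hypothesis. get_batch is a
# hypothetical user-supplied function returning a list with $dys and $ctrl
# scores for the newly tested participants.
run_sequential <- function(get_batch, batch_size = 10, max_n = 100,
                           upper = 10, lower = 1 / 10) {
  dys <- ctrl <- numeric(0)
  repeat {
    new  <- get_batch(batch_size)
    dys  <- c(dys,  new$dys)
    ctrl <- c(ctrl, new$ctrl)
    # Default-prior Bayesian two-sample t-test
    bf <- extractBF(ttestBF(x = dys, y = ctrl))$bf
    if (bf > upper || bf < lower || length(dys) >= max_n) {
      return(list(n_per_group = length(dys), bf = bf))
    }
  }
}
```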
Despite the limitations of the current analysis, I hope the
readers of this blog post will be encouraged to consider issues in estimating effect sizes, and
will find some useful references below.
References
Open Science Collaboration. (2015). Estimating the
reproducibility of psychological science. Science,
349(6251), aac4716.
Gelman, A., & Carlin, J. (2014). Beyond
power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science, 9(6),
641-651.
Gelman, A., & Weakliem, D. (2009). Of
beauty, sex and power: Too little attention has been paid to the statistical
challenges in estimating small effects. American
Scientist, 97(4), 310-316.
Hartung, J., & Makambi, K. H. (2003).
Reducing the number of unjustified significant results in meta-analysis. Communications in Statistics-Simulation and
Computation, 32(4), 1179-1190.
IntHout, J., Ioannidis, J. P., & Borm,
G. F. (2014). The Hartung-Knapp-Sidik-Jonkman method for random effects meta-analysis
is straightforward and considerably outperforms the standard DerSimonian-Laird
method. BMC Medical Research Methodology,
14(1), 1.
Ioannidis, J. P. A. (2005). Why most
published research findings are false. PLoS
Medicine, 2(8), 696-701. doi:10.1371/journal.pmed.0020124
Lakens, D. (2014). Performing high‐powered
studies efficiently with sequential analyses. European Journal of Social Psychology, 44(7), 701-710.
Melby-Lervåg, M., Lyster, S.-A. H., &
Hulme, C. (2012). Phonological skills and their role in learning to read: a
meta-analytic review. Psychological
Bulletin, 138(2), 322.
Rouder, J. N. (2014). Optional stopping: No
problem for Bayesians. Psychonomic
Bulletin & Review, 21(2), 301-308.
Schimmack, U. (2012). The ironic effect of
significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551.
Schmidt, F. L. (1992). What do data really
mean? Research findings, meta-analysis, and cumulative knowledge in psychology.
American Psychologist, 47(10), 1173.
Schönbrodt, F. D. (2015). Sequential
Hypothesis Testing with Bayes Factors: Efficiently Testing Mean Differences. Available at SSRN 2604513.
Sidik, K., & Jonkman, J. N. (2005).
Simple heterogeneity variance estimation for meta‐analysis. Journal of the Royal Statistical Society:
Series C (Applied Statistics), 54(2), 367-384.
Snowling, M. J. (2000). Dyslexia. Malden, MA: Blackwell
Publishers.
Viechtbauer, W. (2010). Conducting
meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1-48. Retrieved from http://www.jstatsoft.org/v36/i03/.
*************************************************
* It’s not easy coming up with a counterintuitive-sounding
“dyslexic deficit” which hasn’t already been proposed!
*************************************************
Edit 5.4.16 #1: Added a new link to data (to OSF rather than dropbox).
Edit 5.4.16 #2: I used the "SJ" method for ES estimation, which I thought referred to the one recommended by IntHout et al., referenced above. Apparently, it doesn't. Using, instead, the REML method slightly reduces the effect size, to d = 1.2, 95% CI [1.0, 1.4]. The estimate is similarly large to the one described in the text, therefore it does not change the conclusions of the analyses. Thanks to Robert Ross for pointing this out!