TL;DR: Don’t dichotomise continuous variables.
So about a year ago, I designed an experiment. It is a
simple reading-aloud experiment: participants see one word at a time and need
to read it aloud as quickly and accurately as possible. I manipulated word
frequency (a continuous variable, with log frequencies between 0 and 2)
and predictability of the pronunciation (a binary variable I made up). I matched on
several psycholinguistic variables, such that any differences associated with
the manipulation could not be attributed to a correlated variable. I did this
by comparing the means across the four conditions with pairwise t-tests, and making
sure that the differences between conditions were not significant (to be
ultra-conservative, taking 0.1 as a cut-off). For example, I matched on
orthographic N (for a given word, the number of words that can be created by
exchanging a letter; a continuous variable). The average orthographic N values
(with SDs in brackets) across the four conditions are listed below:
|               | High frequency (>1) | Low frequency (<1) |
|---------------|---------------------|--------------------|
| Predictable   | 6.3 (3.9)           | 6.1 (4.5)          |
| Unpredictable | 5.7 (3.8)           | 5.1 (4.9)          |
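For concreteness, here is a minimal sketch of this kind of matching check in Python. The `items` DataFrame and its column names (`freq_bin`, `predictable`, `orth_N`) are hypothetical stand-ins for the actual item set, not the original analysis code:

```python
# A minimal sketch of the matching check described above. Assumes a
# pandas DataFrame `items` with hypothetical columns: 'freq_bin'
# ("high"/"low"), 'predictable' (True/False), and the covariate,
# e.g. 'orth_N'.
from itertools import combinations

import pandas as pd
from scipy import stats


def check_matching(items: pd.DataFrame, covariate: str, alpha: float = 0.1) -> None:
    """Pairwise t-tests on a covariate across the four cells of the design."""
    cells = {
        (freq, pred): group[covariate]
        for (freq, pred), group in items.groupby(["freq_bin", "predictable"])
    }
    for cell_a, cell_b in combinations(cells, 2):
        t, p = stats.ttest_ind(cells[cell_a], cells[cell_b])
        verdict = "mismatch!" if p < alpha else "ok"
        print(f"{cell_a} vs {cell_b}: t = {t:.2f}, p = {p:.3f} ({verdict})")


# usage: check_matching(items, "orth_N")
```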
So far, so good – pretty standard practice, I believe. But
here comes the complicating factor: in the analyses, we did not treat frequency
as a dichotomised variable, but as a continuum. I decided to make a descriptives
table for the manuscript that reflects this. Instead of presenting four columns
for the different conditions and separate rows for each potential confound
variable, I performed, for each covariate, a separate linear model (LM)
analysis, with the covariate as the dependent variable, predicted by the
manipulated variables (frequency, predictability, and their interaction). Here
is what I got:
| Potential covariate | Main effect of frequency | Main effect of predictability | Interaction of predictability and frequency | Overall average and standard deviation |
|---------------------|--------------------------|-------------------------------|---------------------------------------------|----------------------------------------|
| Orthographic N      | t = 2.02, p = 0.04 *     | t = 1.04, p = 0.30            | t = -0.23, p = 0.82                         | 5.82 (4.35)                            |
It looks like my beautifully matched item set isn’t
beautifully matched, after all. Far out. (It would have been far more
convenient to discover this prior to data collection, too.)
So, what happened? The critical difference is that the
pairwise t-tests treated frequency as
a dichotomy; in the LM analysis, it was treated as a continuum. As it turns
out, dichotomising naturally continuous variables decreases the power, meaning
that, on average, fewer analyses will yield statistically significant p-values if there is a true difference.
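As a concrete sketch, the per-covariate LM from the table above might look like this in Python; the `items` DataFrame and the column names `log_freq` and `predictable` are, again, hypothetical:

```python
# Sketch of the per-covariate LM: the covariate as dependent variable,
# predicted by continuous frequency, predictability, and their
# interaction.
import statsmodels.formula.api as smf


def covariate_lm(items, covariate: str):
    """Fit covariate ~ frequency * predictability; return t- and p-values."""
    fit = smf.ols(f"{covariate} ~ log_freq * predictable", data=items).fit()
    return fit.tvalues, fit.pvalues


# usage: t_values, p_values = covariate_lm(items, "orth_N")
```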
I would like to think that this was just my rookie error,
but it seems not to be generally known that dichotomising continuous
variables decreases power. For example, a meta-analysis of eight of my studies
failing to replicate an effect was rejected, partly because an anonymous
reviewer argued that I was reducing the power to find an effect by treating the
critical variable as a continuum rather than a dichotomy. The implications of
treating continuous variables as continua therefore go beyond mishaps in
matching items: increasing experimental power is a central issue in current
discussions of possible solutions to the replication crisis. For
psycholinguists, it should be – in many cases – relatively simple to increase
their power, simply by using slightly more complex models (more complex
compared to the traditional 2x2 ANOVA, that is), and treating continuous
variables as – well – continuous variables.
To provide an illustration that dichotomising variables
really does decrease power, I used the British Lexicon Project (Keuleers, Lacey, Rastle, & Brysbaert, 2012),
which has reading-aloud latencies for over 2,000 words and information on their
linguistic characteristics. I took 1,000 repeated samples at each of the sizes
20, 40, 60, …, 280, and 300 words. For each of these samples, I created LMs to
test for the main effects of frequency, length (number of letters),
orthographic N, and bigram frequency. It is pretty well established that the
frequency and length effects are real, while the orthographic N and bigram
frequency effects are still somewhat elusive. The question is how many of these
analyses yield p-values smaller than 0.05, and how this compares between the
continuous and the dichotomised models. Here is what happens when all variables
are treated as a continuum: in the figure, the x-axis shows the number of words
in the sample, and the y-axis the number of analyses with p < 0.05 out of
1,000. The red dashed line indicates 80% power, which is considered to be good.
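Roughly, the simulation can be sketched as follows; the `blp` DataFrame and its column names (`rt`, `log_freq`, `length`, `orth_N`, `bigram_freq`) are assumptions about how the BLP data might be loaded, not the actual analysis code:

```python
# Sketch of the power simulation with all predictors continuous.
import statsmodels.formula.api as smf


def power_continuous(blp, n_words: int, n_sims: int = 1000) -> float:
    """Proportion of samples in which the frequency effect has p < 0.05."""
    hits = 0
    for _ in range(n_sims):
        sample = blp.sample(n=n_words)
        fit = smf.ols("rt ~ log_freq + length + orth_N + bigram_freq",
                      data=sample).fit()
        hits += fit.pvalues["log_freq"] < 0.05
    return hits / n_sims


for n in range(20, 301, 20):
    print(n, power_continuous(blp, n))
```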
Next, I dichotomised frequency: for each sample, frequency
was coded as high (0.5) if its value was above the median for that particular
sample, and as low (-0.5) if it was below the median. The other predictors and
all other aspects of the analyses were identical to the models above. The
analyses showed that dichotomising frequency decreases power: at N = 20, the
power was 55.6%; at N = 40, 88.8%; at N = 80, 98.6%; and only at N = 100 did it
reach 100%. In comparison, the power using the continuous frequency measure was
73.1%, 96.9%, and 100% for N = 20, 40, and 60, respectively (see Figure). An
inspection of the average slopes showed that, for each set size, the slope
estimates for the effect of frequency were steeper when frequency was treated
as a dichotomy rather than as a continuum: for example, at N = 300, the average
slope was β = -55.05 for the dichotomous frequency measure and β = -35.1 for
the continuous measure. However, the standard deviations of the slopes also
differ across the two measures, with a consistently higher standard deviation
for the dichotomous compared to the continuous measure (e.g., for N = 300, the
standard deviations are 5.65 and 2.80, respectively). Thus, while dichotomising
a continuous variable, on average, increases the raw effect size, it also
increases the variability of the effect size estimate, resulting in lower
power.
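The corresponding sketch for the dichotomised variant, under the same assumptions about `blp` as above. (With ±0.5 coding, the slope is simply the difference between the high- and low-frequency group means, which is why the dichotomous slopes come out larger in absolute size.)

```python
# Sketch of the dichotomised variant: within each sample, frequency is
# median-split and coded +0.5 (high) / -0.5 (low); everything else is
# identical to the continuous model above.
import numpy as np
import statsmodels.formula.api as smf


def power_dichotomous(blp, n_words: int, n_sims: int = 1000) -> float:
    """Proportion of samples in which the split-frequency effect has p < 0.05."""
    hits = 0
    for _ in range(n_sims):
        sample = blp.sample(n=n_words).copy()
        median = sample["log_freq"].median()
        sample["freq_split"] = np.where(sample["log_freq"] > median, 0.5, -0.5)
        fit = smf.ols("rt ~ freq_split + length + orth_N + bigram_freq",
                      data=sample).fit()
        hits += fit.pvalues["freq_split"] < 0.05
    return hits / n_sims
```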
In summary, psycholinguists can take a step towards
increasing their experimental power and thus creating a more replicable science
by not dichotomising continuous variables. This potential change in analysis
methods has no drawbacks and only gains.
Reference
Keuleers, E., Lacey, P., Rastle, K., & Brysbaert, M. (2012). The British Lexicon Project: Lexical decision data for 28,730 monosyllabic and disyllabic English words. Behavior Research Methods, 44(1), 287-304. doi:10.3758/s13428-011-0118-4
*******************************************
Edit 1/3/15: Fixed a typo.