On a website called flexiblemeasures.com, Malte Elson
lists 156 dependent measures that have been used in the literature to quantify
performance on the Competitive Reaction Time Task. A task with this
many possible ways of calculating the outcome measure is, in a way, convenient
for researchers: without correcting for multiple comparisons, the probability
that the effect of interest will be significant in at least one of the measures
skyrockets.
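To put a number on this: assuming, for simplicity, that the 156 measures are tested independently at the conventional α = .05 (in reality they are correlated, which dampens the inflation but does not remove it), the probability of at least one false positive is essentially 1. A minimal sketch:

```python
# Family-wise error rate under the (simplifying) assumption of independence:
# the probability of at least one 'significant' result among k tests when
# the null hypothesis is true for every one of them.
alpha = 0.05  # nominal per-test Type-I error rate
k = 156       # candidate outcome measures for the Competitive Reaction Time Task

p_at_least_one = 1 - (1 - alpha) ** k
print(f"P(at least one false positive) = {p_at_least_one:.4f}")  # ~0.9997
```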
The probability that any given significant
result is a Type-I error (false positive) rises along with it. Such testing of multiple variables
and reporting only the one that gives a significant result is an instance of p-hacking. It becomes problematic when another researcher tries to establish whether there is good evidence for an effect: if one
performs a meta-analysis of the published analyses (using standardised effect
sizes to be able to compare the different outcome measures across tasks), one
can get a significant effect, even if each study reports only random noise and
one creatively calculated outcome variable that ‘worked’.
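This claim is easy to check with a minimal simulation (all numbers hypothetical, and the selection rule deliberately crude): each simulated study below measures nothing but noise on several outcome variables, reports only the variable with the smallest p-value as a standardised effect size (Cohen's d), and a naive average of the reported effects still lands well above zero:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2017)
n_studies, n_measures, n_per_group = 20, 8, 30

reported_d = []
for _ in range(n_studies):
    # Every candidate measure is pure noise: no true group difference exists.
    a = rng.normal(size=(n_measures, n_per_group))
    b = rng.normal(size=(n_measures, n_per_group))
    _, p = stats.ttest_ind(a, b, axis=1)
    best = np.argmin(p)  # selectively report the measure that 'worked'
    pooled_sd = np.sqrt((a[best].var(ddof=1) + b[best].var(ddof=1)) / 2)
    d = (a[best].mean() - b[best].mean()) / pooled_sd
    reported_d.append(abs(d))  # the direction that 'worked' becomes the story

print(f"naive meta-analytic mean |d| = {np.mean(reported_d):.2f}")
```

With eight candidate measures per study, the pooled effect looks respectable even though every study measured pure noise.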
Similarly, it becomes difficult for a researcher to
establish how reliable a task is. Take, for example, statistical learning.
Statistical learning, the cognitive ability to derive regularities from the
environment and apply them to future events, has been linked to everything from
language learning to autism. The concept of statistical learning ties to many
theoretically interesting and practically important questions, for example,
about how we learn, and what enables us to use an abstract, complex
system such as language before we even learn to tie a shoelace.
Unsurprisingly, many tasks have been developed that are
supposed to measure this cognitive ability, and researchers have correlated
performance on these tasks with various everyday skills. Let us set aside the
theoretical issues with the proposition that a statistical learning mechanism
underlies the learning of statistical regularities in the environment, and
concentrate on the way statistical learning is measured. This is an important
question for anyone who wants to study the statistical learning process:
before running an experiment, one would like to be sure that the experimental
task ‘works’.
As it turns out, statistical learning tasks don’t have
particularly good psychometric properties: when the same individuals perform
different tasks, the correlations between performance on different tasks are
rather low; the test-retest reliability varies across tasks, but ranges from
pretty good to pretty bad (Siegelman & Frost, 2015). For some tasks, performance
is not above chance for the majority of participants, meaning that these tasks
cannot be used as valid indicators of individual differences in statistical
learning skill. This raises questions
about why such a large proportion of published studies find that individual
differences in statistical learning are correlated with various life skills,
and it explains anecdotal reports, from colleagues and from my own experience,
of statistical learning experiments that just don't work, in the sense that there
is no evidence of statistical learning.* Relying on flexible outcome measures
increases the researcher’s chances of finding a significant effect or
correlation, which can be especially handy when the task has sub-optimal
psychometric properties (low reliability and validity reduce the statistical
power to find an effect if it exists). Rather than trying to improve the
validity or reliability of the task, it is easier to continue analysing different variables until
something becomes significant.
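The parenthetical point about power deserves unpacking: by Spearman's classic attenuation formula, the correlation one can observe between two measures is the true correlation shrunk by the square root of the product of their reliabilities. A quick illustration with made-up numbers:

```python
import math

def attenuated_r(r_true: float, rel_x: float, rel_y: float) -> float:
    """Spearman's attenuation: observed r = true r * sqrt(rel_x * rel_y)."""
    return r_true * math.sqrt(rel_x * rel_y)

# Hypothetical: a true correlation of .5 between statistical learning and a
# reading measure, with task reliabilities of .4 and .8 respectively.
print(round(attenuated_r(0.5, 0.4, 0.8), 2))  # 0.28
```

A study powered to detect r = .5 is badly underpowered for r = .28, which makes fishing among outcome measures all the more tempting.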
The first example of a statistical learning task is the
Serial Reaction Time Task. Here, the participants respond to a series of
stimuli, which appear on different positions on a screen. The participant
presses buttons which correspond to the location of the stimulus. Unbeknownst to
the participant, the sequence of locations repeats; as participants implicitly
pick up the sequence, their error rates and reaction times decrease. Towards the end of the experiment,
normally in the penultimate block, the order of the locations is scrambled,
meaning that the learned sequence is disrupted. Participants perform worse in
this scrambled block compared to the sequential ones. Possible outcome variables
(all of which can be found in the literature; see the sketch after this list) are:
- Comparison of accuracy in the scrambled block to the
preceding block
- Comparison of accuracy in the scrambled block to the
succeeding (final) block
- Comparison of accuracy in the scrambled block to the
average of the preceding and succeeding blocks
- The increase in accuracy across the sequential blocks
- Comparison of reaction times in the scrambled block to the
preceding block
- Comparison of reaction times in the scrambled block to the
succeeding (final) block
- Comparison of reaction times in the scrambled block to the
average of the preceding and succeeding blocks
- The decrease in reaction times across the sequential
blocks.
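To make the flexibility concrete, here is a minimal sketch of how all eight measures could be computed from one participant's block-level means; the data layout and numbers are invented for illustration, and each line is only one plausible operationalisation:

```python
import numpy as np

# Hypothetical per-block means: seven blocks, with the scrambled block in
# the penultimate position, as described above.
rt  = np.array([520., 505., 490., 480., 475., 530., 470.])   # mean RT (ms)
acc = np.array([0.91, 0.92, 0.94, 0.95, 0.95, 0.88, 0.96])   # mean accuracy
s = len(rt) - 2  # index of the scrambled (penultimate) block

outcomes = {
    "acc: scrambled vs. preceding":   acc[s - 1] - acc[s],
    "acc: scrambled vs. final":       acc[s + 1] - acc[s],
    "acc: scrambled vs. neighbours":  (acc[s - 1] + acc[s + 1]) / 2 - acc[s],
    "acc: gain across sequential":    acc[s - 1] - acc[0],
    "rt: scrambled vs. preceding":    rt[s] - rt[s - 1],
    "rt: scrambled vs. final":        rt[s] - rt[s + 1],
    "rt: scrambled vs. neighbours":   rt[s] - (rt[s - 1] + rt[s + 1]) / 2,
    "rt: decrease across sequential": rt[0] - rt[s - 1],
}
for name, value in outcomes.items():
    print(f"{name}: {value:.3f}")
```

Eight defensible-looking numbers from a single dataset; unless the choice is made before seeing the data, any one of them can be presented as 'the' measure of sequence learning.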
This can hardly compare to the 156 dependent variables from
the Competitive Reaction Time Task, but it already gives the researcher
increased flexibility in selectively reporting only the outcome measures that
‘worked’. As an example of how this can lead to conflicting conclusions about
the presence or absence of an effect: in a recent review, we discussed the
evidence for a statistical learning deficit in developmental dyslexia (Schmalz, Altoè, & Mulatti, in press). With regard to the Serial
Reaction Time Task, we concluded that there was insufficient evidence to decide
whether or not there are differences in performance on this task across
dyslexic participants and controls. Partly, this is because researchers tend to
report different variables (presumably the ones that 'worked'): as it is rare
for researchers to report the average reaction times and accuracy per block (or
to respond to requests for raw data), it was impossible to pick the same
dependent measure from all studies (say, the difference between the scrambled
block and the one that preceded it) and perform a meta-analysis on it. Today, I
stumbled across a meta-analysis on the same question: without taking into
account differences between experiments in the dependent variable, Lum, Ullman, and Conti-Ramsden (2013) conclude that there is
evidence for a statistical learning deficit in developmental dyslexia.
As a second example: in many statistical learning tasks,
participants are exposed to a stream of stimuli which contain regularities. In
a subsequent test phase, the participants then need to make decisions about
stimuli which either follow the same patterns or not. This task can take many
shapes, from a set of letter strings generated by a so-called artificial
grammar (Reber, 1967)
to strings of syllables with varying transitional probabilities (Saffran, Aslin, & Newport, 1996). It should be noted that both
the overall accuracy rates (i.e., the observed rates of learning) and the psychometric
properties vary across different variants of this task (see, e.g., Siegelman, Bogaerts, & Frost, 2016, who specifically aimed
to create a statistical learning task with good psychometric properties). In these tasks, accuracy is
normally too low to allow an analysis of reaction times; nevertheless, different
dependent variables can be used: overall accuracy, the accuracy of grammatical
items only, or the sensitivity index (d’). And, if there is imaging data, one can apparently interpret brain patterns in the complete absence of any evidence of learning on the behavioural level.
In summary, flexible measures could be an issue for
evaluating the statistical learning literature: both in finding out which tasks
are more likely to ‘work’, and in determining to what extent individual
differences in statistical learning may be related to everyday skills such as
language or reading. This does not mean that statistical learning does not
exist, or that all existing work on this topic is flawed. However, it is cause
for healthy scepticism about the published results, and it raises many interesting
questions and challenges for future research. Above all, the field would
benefit from increased awareness of issues such as flexible measures; this
would create pressure to increase the probability of getting a significant
result by maximising statistical power, i.e., by decreasing the Type-II error
rate (through larger sample sizes and more reliable and valid measures), rather
than by using tricks that inflate the Type-I error rate.
References

Lum, J. A., Ullman, M. T., & Conti-Ramsden, G. (2013). Procedural learning is impaired in dyslexia: Evidence from a meta-analysis of serial reaction time studies. Research in Developmental Disabilities, 34(10), 3460-3476.

Reber, A. S. (1967). Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 6(6), 855-863.

Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274(5294), 1926-1928.

Schmalz, X., Altoè, G., & Mulatti, C. (in press). Statistical learning and dyslexia: a systematic review. Annals of Dyslexia. doi:10.1007/s11881-016-0136-0

Siegelman, N., Bogaerts, L., & Frost, R. (2016). Measuring individual differences in statistical learning: Current pitfalls and possible solutions. Behavior Research Methods, 1-15.

Siegelman, N., & Frost, R. (2015). Statistical learning as an individual ability: Theoretical perspectives and empirical evidence. Journal of Memory and Language, 81, 105-120.
------------------------------------------------------------------
* In my case, it’s probably a lack of flair, actually.
Comments

Kuppu: Dear Xenia, thanks for this blog post; very interesting views. It is very much the issue that I have been trying to address in my present research, which examines statistical learning in language impairment. One way to counter this could be to give a clear description of statistical learning (as attempted by Siegelman and Frost, 2016) and to state it up front in the study (i.e., what is the nature of the statistical learning one is attempting to study). Another way is to use a task that allows for greater variability (unlike the Serial Reaction Time Task, which has a rigid statistical structure). We have adopted these two approaches in our present study and hope to see some reliable measure of statistical learning. Thanks again.

Xenia: Dear Kuppu, thanks for your comments; it's kind of nice to know that others are struggling with the same issues! I'm trying to figure out the nature of the relationship between statistical learning and reading acquisition, if any. Our idea is to pick a number of statistical learning tasks and create a latent construct from the outcome measures; this way, the hope is to pick out whatever variance is unique to the statistical learning construct rather than task artifacts. I'd be curious to hear more about your approach: what tasks would you classify as having greater variability?

Kuppu: Hi Xenia, thanks. We tried to embed a spectrum of regularities, starting from the easiest (transitional probability of 1, adjacent) to the most difficult (transitional probability of .3, non-adjacent), in a single task using a serial search paradigm. We piloted this on a small sample and found no clear learning pattern across all the regularities; however, we found that people do learn the lower end of the regularities. I will be happy to write to you when I have more results on this.