Thursday, November 24, 2016

Flexible measures and meta-analyses: The case of statistical learning

On his website, Malte Elson lists 156 dependent measures that have been used in the literature to quantify performance on the Competitive Reaction Time Task. A task with this many possible ways of calculating the outcome measure is, in a way, convenient for researchers: without correcting for multiple comparisons, the probability that the effect of interest will be significant in at least one of the measures skyrockets.

So does, of course, the probability that a significant result is a Type-I error (false positive). Testing multiple variables and reporting only the one that gives a significant result is an instance of p-hacking. It becomes problematic when another researcher tries to establish whether there is good evidence for an effect: a meta-analysis of the published analyses (using standardised effect sizes to make the different outcome measures comparable across tasks) can yield a significant effect even if each study reports nothing but random noise plus one creatively calculated outcome variable that ‘worked’.
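To put a number on the ‘skyrocketing’: if, for simplicity, the candidate measures are assumed to be independent and each is tested at α = .05, the chance of at least one spuriously significant result under the null is 1 − .95^k for k measures. A minimal sketch in Python:

```python
# Familywise false-positive rate when k measures are each tested at
# alpha = .05 and every null hypothesis is true (independence assumed).
def familywise_error_rate(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

for k in (1, 8, 156):
    print(f"{k:3d} measures -> {familywise_error_rate(k):.2f}")
```

With 156 candidate measures, the familywise rate is essentially 1. Real measures are correlated rather than independent, which dampens the inflation, but does not remove it.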

Similarly, it becomes difficult for a researcher to establish how reliable a task is. Take, for example, statistical learning. Statistical learning, the cognitive ability to derive regularities from the environment and apply them to future events, has been linked to everything from language learning to autism. The concept ties into many theoretically interesting and practically important questions: for example, how we learn, and what enables us to use an abstract, complex system such as language before we even learn to tie a shoelace.

Unsurprisingly, many tasks have been developed to measure this cognitive ability, and performance on these tasks has been correlated with various everyday skills. Let us set aside the theoretical issues with the proposition that a statistical learning mechanism underlies the learning of statistical regularities in the environment, and concentrate on the way statistical learning is measured. This is an important question for anyone who wants to study the statistical learning process: before running an experiment, one would like to be sure that the experimental task ‘works’.

As it turns out, statistical learning tasks don’t have particularly good psychometric properties: when the same individuals perform different tasks, the correlations between their performance across tasks are rather low, and test-retest reliability varies across tasks, ranging from pretty good to pretty bad (Siegelman & Frost, 2015). For some tasks, performance is not above chance for the majority of participants, meaning that they cannot be used as valid indicators of individual differences in statistical learning skill. This raises questions about why such a large proportion of published studies find that individual differences in statistical learning correlate with various life skills, and it fits the anecdotal experience of myself and colleagues of conducting statistical learning experiments that just don’t work, in the sense that there is no evidence of statistical learning.* Relying on flexible outcome measures increases the researcher’s chances of finding a significant effect or correlation, which is especially handy when the task has sub-optimal psychometric properties (low reliability and validity reduce the statistical power to find an effect if it exists). Rather than trying to improve the validity or reliability of the task, it is easier to keep analysing different variables until something becomes significant.
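The power point in the parenthesis can be made concrete with the classical attenuation formula: the correlation one can observe between two measures is the true correlation scaled down by the square root of the product of their reliabilities. A small sketch (the reliability and correlation values here are invented purely for illustration):

```python
import math

def attenuated_r(r_true, rel_x, rel_y):
    """Expected observed correlation, given the true correlation and
    the reliabilities of the two measures (classical attenuation)."""
    return r_true * math.sqrt(rel_x * rel_y)

# A hypothetical true correlation of .5 between statistical learning
# and, say, reading skill shrinks markedly if the SL task is unreliable:
print(round(attenuated_r(0.5, 0.4, 0.8), 2))  # -> 0.28
```

An observed correlation of .28 needs a considerably larger sample to detect than one of .5, which is exactly the power loss described above.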

The first example of a statistical learning task is the Serial Reaction Time Task. Here, participants respond to a series of stimuli which appear at different positions on a screen, pressing buttons that correspond to the location of the stimulus. Unbeknownst to the participants, the sequence of locations repeats; as they implicitly pick up the sequence, their error rates and reaction times decrease. Towards the end of the experiment, normally in the penultimate block, the order of the locations is scrambled, meaning that the learned sequence is disrupted. Participants perform worse in this scrambled block compared to the sequential ones. Possible outcome variables (all of which can be found in the literature) are:
- Comparison of accuracy in the scrambled block to the preceding block
- Comparison of accuracy in the scrambled block to the succeeding (final) block
- Comparison of accuracy in the scrambled block to the average of the preceding and succeeding blocks
- The increase in accuracy across the sequential blocks
- Comparison of reaction times in the scrambled block to the preceding block
- Comparison of reaction times in the scrambled block to the succeeding (final) block
- Comparison of reaction times in the scrambled block to the average of the preceding and succeeding blocks
- The decrease in reaction times across the sequential blocks.
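To get a feel for how much flexibility even eight candidate measures buy, one can simulate null experiments in which both groups are drawn from pure noise, and check how often at least one of the measures comes out ‘significant’. A rough sketch, with the simplifying (and unrealistic) assumption that the measures are independent:

```python
import random
import statistics

random.seed(1)

T_CRIT = 2.024  # two-tailed .05 critical t for df = 38 (two groups of 20)

def t_stat(a, b):
    """Pooled two-sample t statistic."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a) +
                  (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return ((statistics.mean(a) - statistics.mean(b)) /
            (pooled_var * (1 / na + 1 / nb)) ** 0.5)

def any_significant(n_measures=8, n=20):
    """One null experiment: both groups are pure noise, but the analyst
    gets n_measures attempts to find a 'significant' group difference."""
    return any(
        abs(t_stat([random.gauss(0, 1) for _ in range(n)],
                   [random.gauss(0, 1) for _ in range(n)])) > T_CRIT
        for _ in range(n_measures)
    )

rate = sum(any_significant() for _ in range(2000)) / 2000
print(rate)  # well above the nominal .05
```

In real Serial Reaction Time data the eight measures are correlated, so the true inflation is smaller than in this sketch, but picking whichever measure ‘worked’ still yields a Type-I error rate well above the nominal .05.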

This can hardly compare to the 156 dependent variables from the Competitive Reaction Time Task, but it already gives the researcher increased flexibility in selectively reporting only the outcome measures that ‘worked’. As an example of how this can lead to conflicting conclusions about the presence or absence of an effect: in a recent review, we discussed the evidence for a statistical learning deficit in developmental dyslexia (Schmalz, Altoè, & Mulatti, in press). With regard to the Serial Reaction Time Task, we concluded that there was insufficient evidence to decide whether or not performance on this task differs between dyslexic participants and controls. This is partly because researchers tend to report different variables (presumably the ones that ‘worked’): as it is rare for researchers to report the average reaction times and accuracy per block (or to respond to requests for raw data), it was impossible to pick the same dependent measure from all studies (say, the difference between the scrambled block and the one preceding it) and perform a meta-analysis on it. Today, I stumbled across a meta-analysis on the same question: without taking into account differences between experiments in the dependent variable, Lum, Ullman, and Conti-Ramsden (2013) conclude that there is evidence for a statistical learning deficit in developmental dyslexia.

As a second example: in many statistical learning tasks, participants are exposed to a stream of stimuli which contain regularities. In a subsequent test phase, the participants then need to make decisions about stimuli which either follow the same patterns or do not. This task can take many shapes, from sets of letter strings generated by a so-called artificial grammar (Reber, 1967) to strings of syllables with varying transitional probabilities (Saffran, Aslin, & Newport, 1996). It should be noted that both the overall accuracy rates (i.e., the observed rates of learning) and the psychometric properties vary across different variants of this task (see, e.g., Siegelman, Bogaerts, & Frost, 2016, who specifically aimed to create a statistical learning task with good psychometric properties). In these tasks, accuracy is normally too low to allow an analysis of reaction times; nevertheless, different dependent variables can be used: overall accuracy, the accuracy of grammatical items only, or the sensitivity index (d’). And, if there is imaging data, one can apparently interpret brain patterns in the complete absence of any evidence of learning on the behavioural level.
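The flexibility between overall accuracy and d’ is not innocuous, because d’ separates sensitivity from response bias: two participants with the same overall accuracy can have different d’ values. A minimal sketch, treating endorsement of grammatical items as hits and endorsement of ungrammatical items as false alarms:

```python
from statistics import NormalDist

def d_prime(hit_rate, fa_rate):
    """Sensitivity index: d' = z(hit rate) - z(false-alarm rate)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Both hypothetical participants are 60% correct overall, but the second
# has a bias towards responding 'grammatical':
print(round(d_prime(0.60, 0.40), 2))  # unbiased: d' = 0.51
print(round(d_prime(0.80, 0.60), 2))  # biased:   d' = 0.59
```

Whether a study reports accuracy or d’ can therefore nudge a borderline result over the significance threshold, which is precisely the kind of flexibility at issue.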

In summary, flexible measures could be an issue for evaluating the statistical learning literature: both in finding out which tasks are more likely to ‘work’, and in determining to what extent individual differences in statistical learning may be related to everyday skills such as language or reading. This does not mean that statistical learning does not exist, or that all existing work on this topic is flawed. However, it gives cause for healthy scepticism about the published results, and raises many interesting questions and challenges for future research. Above all, the field would benefit from increased awareness of issues such as flexible measures. This would create pressure to increase the probability of getting a significant result by maximising statistical power, i.e., decreasing the Type-II error rate (through larger sample sizes and more reliable and valid measures), rather than by using tricks that inflate the Type-I error rate.
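As a rough illustration of the legitimate route: for a smallish standardised effect (d = 0.3, a value picked purely for illustration), power climbs steeply with sample size. A back-of-the-envelope sketch using the normal approximation to the two-sample test:

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-tailed two-sample test for a
    standardised effect size d (ignores the negligible probability of a
    significant result in the wrong direction)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return 1 - nd.cdf(z_crit - d * (n_per_group / 2) ** 0.5)

for n in (20, 50, 200):
    print(f"n = {n:3d} per group: power = {approx_power(0.3, n):.2f}")
```

Since low reliability shrinks the effective effect size, better measurement and bigger samples pull in the same direction: more power, without touching the Type-I error rate.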

Lum, J. A., Ullman, M. T., & Conti-Ramsden, G. (2013). Procedural learning is impaired in dyslexia: Evidence from a meta-analysis of serial reaction time studies. Research in Developmental Disabilities, 34(10), 3460-3476.
Reber, A. S. (1967). Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 6(6), 855-863.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274(5294), 1926-1928.
Schmalz, X., Altoè, G., & Mulatti, C. (in press). Statistical learning and dyslexia: a systematic review. Annals of Dyslexia. doi:10.1007/s11881-016-0136-0
Siegelman, N., Bogaerts, L., & Frost, R. (2016). Measuring individual differences in statistical learning: Current pitfalls and possible solutions. Behavior Research Methods, 1-15.
Siegelman, N., & Frost, R. (2015). Statistical learning as an individual ability: Theoretical perspectives and empirical evidence. Journal of Memory and Language, 81, 105-120.

* In my case, it’s probably a lack of flair, actually.


  1. Dear Xenia, thanks for this blog. Very interesting views. It is very much the issue that I have been trying to address in my present research, which examines SL in language impairment. One way to counter this could be to have a clear description of SL (as attempted by Siegelman and Frost, 2016) and to state it up front in your study (i.e., what is the nature of the SL you are attempting to study). Another way is to have a task that allows for greater variability (unlike the SRT, which has a rigid statistical structure). We have adopted these two approaches in our present study and hope to see some reliable measure of SL. Thanks again.

    1. Dear Kuppu, thanks for your comments; it's kind of nice to know that others are struggling with the same issues! I'm trying to figure out the nature of the relationship between statistical learning and reading acquisition, if any. Our idea is to pick a number of SL tasks and create a latent construct from the outcome measures; the hope is to pick out whatever variance is unique to the statistical learning construct rather than to task artifacts. I'd be curious to hear more about your approach: what tasks would you classify as having greater variability?

    2. Hi Xenia
      Thanks. We tried to embed a spectrum of regularities in a single task using a serial search paradigm, ranging from the easiest (TP of 1, adjacent) to the most difficult (TP of .3, non-adjacent). We piloted this on a small sample and found no clear learning pattern across all the regularities; however, we did find that people learn the regularities at the lower end. I will be happy to write to you when I have more results on this.