Friday, August 2, 2019

Getting a precise RT estimate for single items in a reading aloud task

For Registered Reports, grant applications, ethics applications, and similar documents, researchers are expected to provide a power calculation. From my own experience, and from talking with colleagues in many different contexts, this is often a hurdle. Calculating power requires an effect size estimate. Sometimes, we try new things and have no idea what the size of the effect will be: even if we have some pilot data, we know that the observed effect size is variable when the sample size is small (Whitehead et al., 2016). We might have data from a previous study, but we also know that publication bias and questionable research practices lead to systematic over-estimation of the true effect size (Vasishth et al., 2018). The design of our study might be complex, and we don't really know which boxes to tick in G*Power. We might not even be sure what kind of effects we're looking for: if our study is more exploratory in nature, we will not know which statistical tests we will conduct, and conducting a formal power analysis would not make much sense anyway (Nosek & Lakens, 2014). Still, we need to find some way to justify our sample size to the reviewers.

In justifying our sample size, an alternative to a power analysis is to plan for a certain degree of precision (e.g., Kelley et al., 2003). For estimating precision, we use our a priori expectation of the standard deviation to calculate a confidence interval such that, in the long run, our observed estimate falls within an acceptable bound a specified proportion of the time. Again, we have some freedom in choosing the confidence level (e.g., 80%, 90%, 95%), and we need to have an estimate of the standard deviation.

In the current blog post, I'd like to answer a question that is relevant to me at the moment: When we do a reading aloud study, a number of participants see a number of words, and are asked to read them aloud as accurately and quickly as possible. The variable which is analysed is often the Reaction Time (RT): the number of milliseconds between the appearance of the item and the onset of the vocal response. The items are generally chosen to vary in some linguistic characteristic, and subsequent statistical analyses are conducted to see if the linguistic characteristics affect the RT.

In most cases, the data would be analysed using a Linear Mixed Effect model, where item- and participant-level characteristics can be included as predictor variables. More information about calculating power and required sample sizes for Linear Mixed Effect models can be found in Brysbaert and Stevens (2018) and Westfall et al. (2014); and a corresponding app can be found here. Here, I ask a different question: If we look at single items, how many participants do we need to obtain stable estimates?

On the surface, the logic behind this question is very simple. For each item, we can calculate the average RT across N participants. As N increases, the observed average should approach a hypothetical true value. If we want to see which item-level characteristics affect RTs, we should take care to have as precise an estimate as possible. If we have only a few participants responding to each item, the average observed RT is likely to change considerably if we ask a couple more participants to read aloud the same items.

As a complicating factor, the assumption that there is a true value for the average RTs is unreasonable. For example, familiarity with a given word will vary across participants: a psychology student is likely to respond faster to words that they encounter in their daily life, such as "depression", "diagnosis", "comorbidity", than someone who does not encounter these words on a regular basis (e.g., an economics student). Thus, the true RT is more likely to be a distribution rather than a single point.

Leaving this important caveat aside for a minute, we return to the basic principle that a larger number of observations should result in a more stable RT estimate. In a set of simulations, I decided to see what the trajectory of a given observed average RT is likely to look like, when we base it on the characteristics that we find, for various words, in the large-scale Lexicon projects. The English Lexicon Project (Balota et al., 2007) has responses for thousands of items, with up to 35 responses per item. In a first simulation, I focussed on the word "vanishes", which has 35 responses, and an average reading aloud RT of 743.4 ms (SD = 345.3), including only the correct responses. Based on the mean and SD, we can simulate the likely trajectories of the observed average RTs at different values of N. Using the item's mean and SD, we specify a normal distribution, and draw a single value from it: We have an RT for N = 1. Then we draw the next value and calculate the average of the first and second values: We have an average RT for N = 2. We can repeat this procedure, while always plotting the observed average RT for each N. Here, I did this for 35 participants: this gives a single "walk", where the average RT approaches the RT which we specified as a parameter for our normal distribution. Then, we repeat the whole procedure, to simulate more "walks". The figure below shows 100 such "walks".
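The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the figure: the function names (simulate_walk, spread_at) and the random seed are my own choices for this example, and only the mean, SD, number of participants, and number of walks come from the text. Note that a normal distribution can occasionally produce implausibly small or even negative RTs, which the sketch does not correct for.

```python
import random

def simulate_walk(mu, sigma, n_max, rng):
    """Return the running average RT after 1..n_max simulated responses."""
    total = 0.0
    walk = []
    for n in range(1, n_max + 1):
        total += rng.gauss(mu, sigma)  # one simulated RT drawn from Normal(mu, sigma)
        walk.append(total / n)         # observed average RT at this N
    return walk

def spread_at(walks, n):
    """Range (max - min) of the running averages across all walks at sample size n."""
    values = [walk[n - 1] for walk in walks]
    return max(values) - min(values)

rng = random.Random(2019)  # arbitrary seed, used only to make the run reproducible

# Mean and SD of "vanishes" as reported above; 100 walks of 35 participants each.
walks = [simulate_walk(743.4, 345.3, 35, rng) for _ in range(100)]

# How far apart are the walks early on, compared with the end of data collection?
print("spread at N = 5: ", round(spread_at(walks, 5)), "ms")
print("spread at N = 35:", round(spread_at(walks, 35)), "ms")
```

Plotting each element of walks against N = 1 to 35 would give a figure of the same kind as the one described here.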

As expected, the initial average RTs tend to be all over the place: if we were to stop our simulated data collection at N = 5, we might be unlucky enough to get an estimate of 400 ms, or an estimate of 1,200 ms. As the simulated data collection progresses, the variability between the "walks" diminishes, and at N = 30 we would expect the observed average RT to lie somewhere between 600 ms and 1,000 ms.

Analytically, the variability at different values of N can be quantified with confidence intervals: the range within which we expect the observed average RT to fall in a given proportion of samples (here, 95%), in the long run. The width of the confidence interval depends on (1) the confidence level that we'd like to have (fixed here at 95%), (2) the population standard deviation (σ), and (3) the number of participants. Now, we don't really know what σ is, but we can get a plausible range of σ-values by looking at the data from the English Lexicon Project. I first removed all RTs < 250 ms, which are likely to be miscoded. Then I generated a box-plot of the SDs for all items:

The SDs are not normally distributed, with quite a lot of very large values. However, we can calculate a median, which happens to be SDmedian ≈ 200 ms; a 20% quantile, SDlower ≈ 130 ms; an 80% quantile, SDupper ≈ 350 ms; and a pessimistic estimate, taken from the location of the upper bar in the boxplot above, SDpessimistic ≈ 600 ms. For each of these SD estimates, we can calculate the 95% confidence interval for different values of N, with the formula: CIupper = 1.96 * (σ / sqrt(N)); CIlower = -CIupper. To calculate the expected range of average RTs, we would add these values to the average RT. However, here we are more interested in the deviations from any hypothetical mean, so we can simply focus on the upper bound; the expected range of deviation is then CIupper * 2.
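The SD summary described above (trimming RTs below 250 ms, computing one SD per item, then taking quantiles) could look roughly like the sketch below. Everything except the 250 ms cut-off and the choice of quantiles is an assumption made for illustration: the function name summarise_item_sds is hypothetical, the RTs are synthetic values generated on the spot rather than English Lexicon Project data, and the "upper bar" is taken to be the usual boxplot whisker (the largest value within 1.5 times the interquartile range above the third quartile), which may not be the exact convention behind the original plot.

```python
import random
from statistics import quantiles, stdev

def summarise_item_sds(rts_by_item, min_rt=250):
    """Trim fast RTs, compute one SD per item, and summarise the SDs."""
    sds = []
    for rts in rts_by_item.values():
        kept = [rt for rt in rts if rt >= min_rt]  # drop RTs below 250 ms
        if len(kept) >= 2:                         # an SD needs at least two values
            sds.append(stdev(kept))
    q20, _, _, q80 = quantiles(sds, n=5)           # cut points at 20/40/60/80%
    q25, q50, q75 = quantiles(sds, n=4)            # quartiles
    # Assumed whisker rule: largest SD within 1.5 * IQR above the third quartile.
    fence = q75 + 1.5 * (q75 - q25)
    whisker = max(sd for sd in sds if sd <= fence)
    return {"lower": q20, "median": q50, "upper": q80, "pessimistic": whisker}

# Synthetic, right-skewed RTs for 200 made-up items (NOT real ELP data).
rng = random.Random(1)
toy_data = {
    f"item{i}": [rng.lognormvariate(6.5, 0.3) for _ in range(30)]
    for i in range(200)
}

summary = summarise_item_sds(toy_data)
for label, value in summary.items():
    print(f"SD{label}: {value:.0f} ms")
```

With real data, rts_by_item would map each word to its list of correct-response RTs; the printed values here only reflect the synthetic input.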

Next, I plotted CIupper as a function of N for the different SD estimates (low, median, high, and pessimistic):

So, if we have 50 participants, the expected range of deviation (CIupper * 2) is 72 ms for the low estimate, 110 ms for the median estimate, 194 ms for the upper estimate, and 332 ms for the pessimistic estimate. For 100 participants, the range reduces to 50 ms, 78 ms, 137 ms, and 235 ms, respectively.
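These ranges follow directly from the formula CIupper = 1.96 * (σ / sqrt(N)). The short sketch below recomputes them from the rounded SD estimates given earlier; the function names are my own. Because the SD estimates are rounded, the results agree with the values quoted above only to within about a millisecond.

```python
from math import sqrt

Z_95 = 1.96  # normal quantile for a two-sided 95% interval

def ci_upper(sd, n, z=Z_95):
    """Upper bound of the interval around the mean: z * (sd / sqrt(n))."""
    return z * sd / sqrt(n)

def expected_range(sd, n, z=Z_95):
    """Full width of the interval, i.e. CIupper * 2."""
    return 2 * ci_upper(sd, n, z)

# Rounded SD estimates (in ms) from the text.
sd_estimates = {"low": 130, "median": 200, "high": 350, "pessimistic": 600}

for n in (50, 100):
    ranges = {label: round(expected_range(sd, n), 1)
              for label, sd in sd_estimates.items()}
    print(f"N = {n}: {ranges}")
```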

What does all of this mean? Well, at the end of this blog post we are still left with the situation that the researcher needs to decide on an acceptable range of deviation. This is likely to be a trade-off between the precision one wants to achieve and practical considerations. However, the simulations and calculations should give a feeling of what number of observations is typically needed to achieve what level of precision, when we look at the average RTs of single items. The general take-home messages can be summarised as: (1) It could be fruitful to consider precision when planning psycholinguistic experiments, and (2) the more observations, the more stable the average RT estimate, i.e., the less likely it is to vary across samples.
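Once an acceptable range of deviation has been chosen, the same formula can be turned around to give the number of participants: solving CIupper * 2 = 2 * 1.96 * (σ / sqrt(N)) for N gives N = (2 * 1.96 * σ / range)^2, rounded up. The sketch below does this; the function name required_n and the 100 ms target are purely hypothetical choices for illustration, not recommendations.

```python
from math import ceil

def required_n(sd, max_range, z=1.96):
    """Smallest N for which 2 * z * sd / sqrt(N) does not exceed max_range."""
    return ceil((2 * z * sd / max_range) ** 2)

# Hypothetical target: the observed average should vary over no more than 100 ms.
for label, sd in {"low": 130, "median": 200, "high": 350, "pessimistic": 600}.items():
    print(f"{label} SD estimate ({sd} ms): N = {required_n(sd, 100)}")
```

As the output makes clear, the required N grows with the square of the SD, so the choice between the median and the pessimistic SD estimate matters a great deal for the planned sample size.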


Link to the analyses and simulations:


Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., ... & Treiman, R. (2007). The English lexicon project. Behavior Research Methods, 39(3), 445-459.

Brysbaert, M., & Stevens, M. (2018). Power analysis and effect size in mixed effects models: A tutorial. Journal of Cognition, 1(1).

Kelley, K., Maxwell, S. E., & Rausch, J. R. (2003). Obtaining power or obtaining precision: Delineating methods of sample-size planning. Evaluation & the Health Professions, 26(3), 258-287.

Nosek, B. A., & Lakens, D. (2014). Registered Reports: A method to increase the credibility of published results. Social Psychology, 45(3), 137-141.

Vasishth, S., Mertzen, D., Jäger, L. A., & Gelman, A. (2018). The statistical significance filter leads to overoptimistic expectations of replicability. Journal of Memory and Language, 103, 151-175.

Westfall, J., Kenny, D. A., & Judd, C. M. (2014). Statistical power and optimal design in experiments in which samples of participants respond to samples of stimuli. Journal of Experimental Psychology: General, 143(5), 2020-2045.

Whitehead, A. L., Julious, S. A., Cooper, C. L., & Campbell, M. J. (2016). Estimating the sample size for a pilot randomised trial to minimise the overall trial sample size for the external pilot and main trial for a continuous outcome variable. Statistical Methods in Medical Research, 25(3), 1057-1073.
