TL;DR: As much as possible.
The question of how much statistics psychological scientists
should know has been discussed numerous times on Twitter and in psychological methods
groups on Facebook. The consensus seems to be that psychologists need to know
some stats, but they don’t need to be statisticians. When it comes to
specifics, though, there does not seem to be any consensus: some argue that
knowing the basics of the tests that are useful for one’s specific field is
enough, while others argue that a thorough understanding of the concepts is
important.
Here, I argue, based on my own experience, that a thorough
understanding of statistics substantially enhances the quality of one’s work.
I think statistical knowledge is important because the amount of it that you have
constrains the experiments you can conduct: If one’s only tool is ANOVA, there
is only a limited set of possible experiments that fit within the mould of this
statistical test. *
First, a little bit about my stats background. I don’t
remember much from high school maths: I think I had a lot of motivation to
repress any memories about it. One of the biggest mysteries of my undergraduate
degree is how I even passed my statistics courses. I guess they had
to scale everyone’s marks to avoid failing too many students. After these
experiences, though, I have spent a lot of time learning about the tools that
are the cornerstone of making sense of my experiments. As my supervisor told
me: the best way to learn about statistical analyses is when you have some data
that you care about. When I started my PhD, I dreaded the day when I would
be asked to do anything more complex than a correlation matrix. But during the
PhD, I learned, through trial and error and with a lot of guidance from
experienced colleagues, to analyse data in R with linear mixed-effects models
and Bayes Factors. When I started my post-doc, driven by my curiosity about how
it is possible that we can get two identical experiments with completely
different results (i.e., with p-values
on different sides of the significance threshold), I decided to learn more
about how this stats thing actually works. My interest was further sparked by
several papers I read on this topic, and a one-day workshop given by Daniël
Lakens in Rovereto, which I happened to hear about via Twitter. It culminated
with my signing up for a part-time distance course, a graduate certificate in
statistics, which I’m due to finish in June.
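To make the puzzle of identical experiments with different results concrete, here is a minimal simulation sketch in Python. It is only an illustration under stated assumptions: the true effect (0.5 standard deviations), the group size (20 per group) and the use of a z-test with a known standard deviation of 1 are all chosen for simplicity, and the function names are made up for this example.

```python
import random
from statistics import NormalDist, mean


def two_sided_p(group_a, group_b, sd=1.0):
    """Two-sided p-value from a z-test, assuming a known common SD."""
    n_a, n_b = len(group_a), len(group_b)
    standard_error = sd * (1 / n_a + 1 / n_b) ** 0.5
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def run_experiment(rng, n_per_group=20, effect=0.5):
    """Simulate one two-group experiment in which the effect is real."""
    treatment = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    return two_sided_p(treatment, control)


def share_significant(n_experiments=2000, alpha=0.05, seed=1, **kwargs):
    """Proportion of exact replications that end up with p < alpha."""
    rng = random.Random(seed)
    p_values = [run_experiment(rng, **kwargs) for _ in range(n_experiments)]
    return sum(p < alpha for p in p_values) / n_experiments


if __name__ == "__main__":
    print(f"Share of significant replications: {share_significant():.2f}")
```

Under these assumed settings the theoretical power of the test is about 35%, so roughly one in three exact replications of the same true effect should fall below the threshold while the rest do not. Two identical experiments with p-values on different sides of .05 are therefore exactly what one should expect from sampling variation alone.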
This learning process has taken a lot of time. Cynically
speaking, I would not recommend it to early career researchers, who would
probably maximise their chances of success in academia by focussing on
publishing lots of papers (quantity is more important than quality, right?). If
you have only a short-term contract (anywhere between 6 months and 2 years),
you probably won’t have time to do both. Besides, you will never again want to do
N=20 studies, and unless your department is rich, conducting a high-powered
experiment might not be feasible during a short-term post-doc contract. Ideologically
speaking, I would recommend this learning process to every social scientist who
feels that they don’t know enough. In my experience, it’s worth putting one’s
research on hold to learn about stats: moving from following a set of arbitrary
conventions** to understanding why these conventions make sense is a liberating
experience, not to mention the increase in the quality of your work, and the
ability to design studies that maximise the chance of getting meaningful
results.
Useful resources
For anyone who is reading this blog post because they would
like to learn more about statistics, I have compiled a list of resources that I
found useful. They contain both statistics-oriented material, and material
which is more about philosophy of science. I see them as two sides of the same
coin, so I don’t make a distinction between them below.
First of all: If you haven’t already done so, sign up on
Twitter, and follow people who tweet about stats. Read their blogs. I am pretty
sure that I learned more about stats this way than I did during my
undergraduate degree. Some people I’ve learned from (it’s not a comprehensive
list, but they’re all interconnected: if you follow some, you’ll find others
through their discussions): Daniël Lakens (@lakens), Dorothy Bishop
(@deevybee), Andrew Gelman (@StatModeling), Hilda Bastian (@hildabast),
Alexander Etz (@AlxEtz), Richard Morey (@richardmorey), Deborah Mayo
(@learnfromerror) and Uli Schimmack (@R_Index). If you’re on Facebook, join
some psychological methods groups. I frequently lurk on PsycMAP and the
Psychological Methods Discussion Group.
Below are some papers (again, a non-comprehensive and
somewhat sporadic list) that I found useful. I tried to sort them in order of
difficulty, but I didn’t do it in a very systematic way (also, some of those
papers I read a long time ago, so I don’t remember how difficult they were). I
think most papers should be readable to most people with some experience with
statistical tests. As an aside: even those who don’t know much about statistics
may be aware of frequent discussions and disagreements among experts about how
to do stats. The readings below contain a mixture of views, some of which I agree
with more than with others. However, all of them have been useful for me in the
sense that they helped me understand some new concepts.
Cumming, G. (2014). The new statistics: Why and how. Psychological
Science, 25(1), 7-29.
Cohen, J. (1990). Things I have learned (so far). American
Psychologist, 45(12), 1304.
Savalei, V., & Dunn, E. (2015). Is the call to abandon
p-values the red herring of the replicability crisis? Frontiers in Psychology,
6, 245.
Gelman, A., & Carlin, J. (2014). Beyond power calculations:
Assessing type S (sign) and type M (magnitude) errors. Perspectives on
Psychological Science, 9(6), 641-651.
Gelman, A., & Weakliem, D. (2009). Of beauty, sex and
power: Too little attention has been paid to the statistical challenges in
estimating small effects. American Scientist, 97(4), 310-316.
Lakens, D., & Evers, E. R. (2014). Sailing from the seas
of chaos into the corridor of stability: Practical recommendations to increase
the informational value of studies. Perspectives on Psychological Science,
9(3), 278-292.
Cramer, A. O., van Ravenzwaaij, D., Matzke, D.,
Steingroever, H., Wetzels, R., Grasman, R. P., ... & Wagenmakers, E. J.
(2016). Hidden multiplicity in exploratory multiway ANOVA: Prevalence and
remedies. Psychonomic Bulletin & Review, 23(2), 640-647.
Luck, S. J., & Gaspelin, N. (2017). How to get
statistically significant effects in any ERP experiment (and why you
shouldn't). Psychophysiology, 54(1), 146-157.
Meehl, P. E. (1967). Theory-testing in psychology and
physics: A methodological paradox. Philosophy of Science, 34(2),
103-115.
Meehl, P. E. (1990). Why summaries of research on
psychological theories are often uninterpretable. Psychological Reports,
66(1), 195-244.
Schmidt, F. L. (1992). What do data really mean? Research
findings, meta-analysis, and cumulative knowledge in psychology. American
Psychologist, 47(10), 1173.
Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in
psychology: Implications for training of researchers. Psychological Methods,
1(2), 115-129.
Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A.,
Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: why
small sample size undermines the reliability of neuroscience. Nature Reviews
Neuroscience, 14(5), 365-376.
Schimmack, U. (2012). The ironic effect of significant
results on the credibility of multiple-study articles. Psychological Methods,
17(4), 551.
Wagenmakers, E. J., Verhagen, J., Ly, A., Bakker, M., Lee,
M. D., Matzke, D., ... & Morey, R. D. (2015). A power fallacy. Behavior
Research Methods, 47(4), 913-917.
Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., &
Wagenmakers, E. J. (2016). The fallacy of placing confidence in confidence
intervals. Psychonomic Bulletin & Review, 23(1), 103-123.
Hoekstra, R., Morey, R. D., Rouder, J. N., &
Wagenmakers, E. J. (2014). Robust misinterpretation of confidence intervals. Psychonomic
Bulletin & Review, 21(5), 1157-1164.
Kline, R. B. (2004). What's Wrong With Statistical Tests--And Where We Go
From Here. (Chapter 3 from Beyond Significance Testing: Reforming data
analysis methods in behavioural research. Washington, DC: APA Books.)
Royall, R. M. (1986). The effect of sample size on the
meaning of significance tests. The American Statistician, 40(4),
313-315.
Schönbrodt, F. D., Wagenmakers, E. J., Zehetleitner, M.,
& Perugini, M. (2015). Sequential hypothesis testing with Bayes factors:
Efficiently testing mean differences. Psychological Methods.
Westfall, J., & Yarkoni, T. (2016). Statistically
controlling for confounding constructs is harder than you think. PloS One,
11(3), e0152719.
Forstmeier, W., & Schielzeth, H. (2011). Cryptic
multiple hypotheses testing in linear models: overestimated effect sizes and
the winner's curse. Behavioral Ecology and Sociobiology, 65(1),
47-55.
In terms of books, I recommend Dienes’ “Understanding
Psychology as a Science” and McElreath’s “Statistical Rethinking”.
Then there are some online courses and videos. First, there
is Daniël Lakens’ Coursera course on statistical inferences. From a more
theoretical perspective, I like this MIT Probability course by John
Tsitsiklis, and Meehl’s Lectures. For something more serious, you could also
try a university course, like a distance education graduate certificate in
statistics. Here is a very positive review of the Sheffield University course
which I am currently doing. However, at least if your maths skills are as bad
as mine, I would not recommend doing it on top of a full-time job.
Conclusion
Learning stats is a long and never-ending road, but if you
are interested in designing strong and informative studies and being flexible
with what you can do with data, it is a worthwhile investment. There is always
more to learn, and no matter how much I learn I continue to feel like I know
less than I should. However, it’s a steep learning curve, so even investing a
little bit of time and effort can already have beneficial effects. This is
possible, even if you have only fifteen minutes to spare each day, through the
resources that I tried to do justice to in my list above.
I should conclude, I think, by thanking all those who make
these resources available, be it via published papers, lectures, blog posts, or
discussions on social media.
------------------------------------
* To provide an example, I will do some shameless
self-advertising: in a paper that came out of my PhD, we got data which seemed
uninterpretable at first. However, thanks to collaboration with a
colleague with a mathematics background, Serje Robidoux,
we could make sense of the data with an optimisation procedure. While an ANOVA
would not have given us anything useful, the optimisation procedure allowed us
to conclude that readers use different sources of information when they read
aloud unfamiliar words, and that there is individual variation in the relative
degree to which they rely on these different sources of information. This is
one of my favourite papers I’ve published so far, but it’s only been cited by
myself to date. (*He-hem!*) Here is the reference:
Schmalz, X., Marinus, E., Robidoux, S., Palethorpe, S.,
Castles, A., & Coltheart, M. (2014). Quantifying the reliance on different
sublexical correspondences in German and English. Journal of Cognitive
Psychology, 26(8), 831-852.
** Conventions such as:
“Control for multiple comparisons.”
“Don’t interpret non-significant p-values as evidence for the null hypothesis.”
“If you have a marginally significant p-value, don’t collect more data to see if the p-value drops below the threshold.”
“The p-value relates to the probability of the data, not the hypothesis.”
“For Meehl’s sake, don’t mess up the exact wording of the
definition of a confidence interval!”
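
To illustrate why one of these conventions makes sense, namely the one about not collecting more data after a marginally significant p-value, here is a small simulation sketch in Python. Everything in it is an assumption made for illustration: there is no true effect, each study starts with 20 participants per group, "marginal" means .05 <= p < .10, up to three extra batches of 10 per group are added, and the p-value comes from a z-test with a known standard deviation of 1. The function names are made up for this example.

```python
import random
from statistics import NormalDist, mean


def two_sided_p(group_a, group_b, sd=1.0):
    """Two-sided p-value from a z-test, assuming a known common SD."""
    n_a, n_b = len(group_a), len(group_b)
    standard_error = sd * (1 / n_a + 1 / n_b) ** 0.5
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(n_experiments=5000, n_start=20, n_extra=10,
                         max_extra_batches=3, alpha=0.05, marginal=0.10,
                         seed=1):
    """Return (fixed-N rate, optional-stopping rate) when the null is true."""
    rng = random.Random(seed)
    fixed_hits = 0
    optional_hits = 0
    for _ in range(n_experiments):
        # Both groups come from the same population: any "effect" is noise.
        a = [rng.gauss(0.0, 1.0) for _ in range(n_start)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_start)]
        p = two_sided_p(a, b)
        fixed_hits += p < alpha  # analysis at the planned sample size only

        # Optional stopping: top up the sample while p is "marginal".
        batches = 0
        while alpha <= p < marginal and batches < max_extra_batches:
            a += [rng.gauss(0.0, 1.0) for _ in range(n_extra)]
            b += [rng.gauss(0.0, 1.0) for _ in range(n_extra)]
            p = two_sided_p(a, b)
            batches += 1
        optional_hits += p < alpha
    return fixed_hits / n_experiments, optional_hits / n_experiments


if __name__ == "__main__":
    fixed, optional = false_positive_rates()
    print(f"Fixed N: {fixed:.3f}; with optional stopping: {optional:.3f}")
```

Because every study that is significant at the planned sample size stays significant, and some marginal studies cross the threshold after the top-up, the false positive rate with optional stopping can only be equal to or higher than the fixed-N rate. In this sketch the fixed-N rate should sit near the nominal 5%, while the optional-stopping rate ends up above it.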