TL;DR: As much as possible.
The question of how much statistics psychological scientists should know has been discussed numerous times on Twitter and in psychology methods groups on Facebook. The consensus seems to be that psychologists need to know some stats, but they don’t need to be statisticians. When it comes to specifics, though, opinions diverge: some argue that knowing the basics of the tests that are useful for one’s specific field is enough, while others argue that a thorough understanding of the concepts is important.
Here, I argue, based on my own experience, that a thorough understanding of statistics substantially enhances the quality of one’s work. The reason why I think statistical knowledge is really important is that the amount of knowledge you have constrains the experiments you can conduct: If one’s only tool is ANOVA, there is only a limited set of possible experiments that fit within the mould of this statistical test. *
First, a little bit about my stats background. I don’t remember much from high school maths: I think I had a lot of motivation to repress any memories about it. One of the biggest mysteries of my undergraduate degree is how I even passed my statistics courses. I guess they had to scale everyone’s marks to avoid failing too many students. After these experiences, though, I have spent a lot of time learning about the tools that are the cornerstone of making sense of my experiments. As my supervisor told me: the best time to learn about statistical analyses is when you have some data that you care about. When I started my PhD, I dreaded the day when I would be asked to do anything more complex than a correlation matrix. But during the PhD, I learned, through trial and error and with a lot of guidance from experienced colleagues, to analyse data in R with linear mixed-effects models and Bayes factors. When I started my post-doc, driven by my curiosity about how it is possible that two identical experiments can give completely different results (i.e., with p-values on different sides of the significance threshold), I decided to learn more about how this stats thing actually works. My interest was further sparked by several papers I read on this topic, and a one-day workshop given by Daniël Lakens in Rovereto, which I happened to hear about via Twitter. It culminated in my signing up for a part-time distance course, a graduate certificate in statistics, which I’m due to finish in June.
This learning process has taken a lot of time. Cynically speaking, I would not recommend it to early career researchers, who would probably maximise their chances of success in academia by focussing on publishing lots of papers (quantity is more important than quality, right?). If you have only a short-term contract (anywhere between 6 months and 2 years), you probably won’t have time to do both. Besides, once you understand statistical power, you will never again want to do N=20 studies, and unless your department is rich, conducting a high-powered experiment might not be feasible during a short-term post-doc contract. Ideologically speaking, I would recommend this learning process to every social scientist who feels that they don’t know enough. In my experience, it’s worth putting one’s research on hold to learn about stats: moving from following a set of arbitrary conventions** to understanding why these conventions make sense is a liberating experience, not to mention the increase in the quality of your work, and the ability to design studies that maximise the chance of getting meaningful results.
For anyone who is reading this blog post because they would like to learn more about statistics, I have compiled a list of resources that I found useful. They contain both statistics-oriented material, and material which is more about philosophy of science. I see them as two sides of the same coin, so I don’t make a distinction between them below.
First of all: If you haven’t already done so, sign up on Twitter, and follow people who tweet about stats. Read their blogs. I am pretty sure that I learned more about stats this way than I did during my undergraduate degree. Some people I’ve learned from (it’s not a comprehensive list, but they’re all interconnected: if you follow some, you’ll find others through their discussions): Daniël Lakens (@lakens), Dorothy Bishop (@deevybee), Andrew Gelman (@StatModeling), Hilda Bastian (@hildabast), Alexander Etz (@AlxEtz), Richard Morey (@richardmorey), Deborah Mayo (@learnfromerror) and Uli Schimmack (@R_Index). If you’re on Facebook, join some psychological methods groups. I frequently lurk on PsycMAP and the Psychological Methods Discussion Group.
Below are some papers (again, a non-comprehensive and somewhat sporadic list) that I found useful. I tried to sort them in order of difficulty, but I didn’t do it in a very systematic way (also, some of those papers I read a long time ago, so I don’t remember how difficult they were). I think most papers should be readable to most people with some experience with statistical tests. As an aside: even those who don’t know much about statistics may be aware of frequent discussions and disagreements among experts about how to do stats. The readings below contain a mixture of views, some of which I agree with more than with others. However, all of them have been useful for me in the sense that they helped me understand some new concepts.
Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7-29.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304.
Savalei, V., & Dunn, E. (2015). Is the call to abandon p-values the red herring of the replicability crisis? Frontiers in Psychology, 6, 245.
Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641-651.
Gelman, A., & Weakliem, D. (2009). Of beauty, sex and power: Too little attention has been paid to the statistical challenges in estimating small effects. American Scientist, 97(4), 310-316.
Lakens, D., & Evers, E. R. (2014). Sailing from the seas of chaos into the corridor of stability: Practical recommendations to increase the informational value of studies. Perspectives on Psychological Science, 9(3), 278-292.
Cramer, A. O., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R. P., ... & Wagenmakers, E. J. (2016). Hidden multiplicity in exploratory multiway ANOVA: Prevalence and remedies. Psychonomic Bulletin & Review, 23(2), 640-647.
Luck, S. J., & Gaspelin, N. (2017). How to get statistically significant effects in any ERP experiment (and why you shouldn't). Psychophysiology, 54(1), 146-157.
Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103-115.
Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66(1), 195-244.
Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47(10), 1173.
Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1(2), 115-129.
Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376.
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551.
Wagenmakers, E. J., Verhagen, J., Ly, A., Bakker, M., Lee, M. D., Matzke, D., ... & Morey, R. D. (2015). A power fallacy. Behavior Research Methods, 47(4), 913-917.
Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E. J. (2016). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 23(1), 103-123.
Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E. J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5), 1157-1164.
Kline, R. B. (2004). What's wrong with statistical tests--and where we go from here. (Chapter 3 of Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: APA Books.)
Royall, R. M. (1986). The effect of sample size on the meaning of significance tests. The American Statistician, 40(4), 313-315.
Schönbrodt, F. D., Wagenmakers, E. J., Zehetleitner, M., & Perugini, M. (2015). Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences. Psychological Methods.
Westfall, J., & Yarkoni, T. (2016). Statistically controlling for confounding constructs is harder than you think. PloS One, 11(3), e0152719.
Forstmeier, W., & Schielzeth, H. (2011). Cryptic multiple hypotheses testing in linear models: overestimated effect sizes and the winner's curse. Behavioral Ecology and Sociobiology, 65(1), 47-55.
In terms of books, I recommend Dienes’ “Understanding Psychology as a Science” and McElreath’s “Statistical Rethinking”.
Then there are some online courses and videos. First, there is Daniël Lakens’ Coursera course in statistical inferences. From a more theoretical perspective, I like this MIT Probability course by John Tsitsiklis, and Meehl’s Lectures. For something more serious, you could also try a university course, like a distance education graduate certificate in statistics. Here is a very positive review of the Sheffield University course which I am currently doing. However, at least if your maths skills are as bad as mine, I would not recommend doing it on top of a full-time job.
Learning stats is a long and never-ending road, but if you are interested in designing strong and informative studies and being flexible with what you can do with data, it is a worthwhile investment. There is always more to learn, and no matter how much I learn I continue to feel like I know less than I should. However, it’s a steep learning curve, so even investing a little bit of time and effort can already have beneficial effects. This is possible, even if you have only fifteen minutes to spare each day, through the resources that I tried to do justice to in my list above.
I should conclude, I think, by thanking all those who make these resources available, be it via published papers, lectures, blog posts, or discussions on social media.
* To provide an example, I will do some shameless self-advertising: in a paper that came out of my PhD, we got data which seemed uninterpretable at first. However, thanks to collaboration with a colleague with a mathematics background, Serje Robidoux, we could make sense of the data with an optimisation procedure. While an ANOVA would not have given us anything useful, the optimisation procedure allowed us to conclude that readers use different sources of information when they read aloud unfamiliar words, and that there is individual variation in the relative degree to which they rely on these different sources of information. This is one of my favourite papers I’ve published so far, but it’s only been cited by myself to date. (*He-hem!*) Here is the reference:
Schmalz, X., Marinus, E., Robidoux, S., Palethorpe, S., Castles, A., & Coltheart, M. (2014). Quantifying the reliance on different sublexical correspondences in German and English. Journal of Cognitive Psychology, 26(8), 831-852.
** Conventions such as:
“Control for multiple comparisons.”
“Don’t interpret non-significant p-values as evidence for the null hypothesis.”
“If you have a marginally significant p-value, don’t collect more data to see if the p-value drops below the threshold.”
“The p-value relates to the probability of the data, not the hypothesis.”
“For Meehl’s sake, don’t mess up the exact wording of the definition of a confidence interval!”
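To see why a convention like the one about marginally significant p-values is more than an arbitrary rule, a small simulation helps. The sketch below is my own illustration, not taken from any of the readings above, and it rests on simplifying assumptions: there is no true effect, the data are normally distributed with a known standard deviation of 1 (so a z-test is appropriate), the first sample has 20 observations, "marginally significant" means .05 <= p < .10, and in that case 20 more observations are added and the test is run again. The function names (two_sided_p, simulate) and all default values are hypothetical choices made for the example.

```python
import math
import random
from statistics import NormalDist, fmean


def two_sided_p(sample):
    """Two-sided p-value for a z-test of mean = 0, assuming a known SD of 1."""
    z = fmean(sample) * math.sqrt(len(sample))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_sims=20000, n_start=20, n_extra=20,
             alpha=0.05, marginal=0.10, seed=1):
    """Compare false positive rates of two stopping rules when the null is true.

    Fixed rule: test once after n_start observations.
    Flexible rule: same, but if alpha <= p < marginal, collect n_extra more
    observations and test again.
    Both rules are applied to the same simulated experiments.
    """
    rng = random.Random(seed)
    fixed_hits = 0
    flexible_hits = 0
    for _ in range(n_sims):
        # The null hypothesis is true: the population mean is exactly 0.
        sample = [rng.gauss(0, 1) for _ in range(n_start)]
        p = two_sided_p(sample)
        if p < alpha:
            # Significant straight away: a false positive under both rules.
            fixed_hits += 1
            flexible_hits += 1
        elif p < marginal:
            # "Marginally significant": only the flexible rule keeps going.
            sample += [rng.gauss(0, 1) for _ in range(n_extra)]
            if two_sided_p(sample) < alpha:
                flexible_hits += 1
    return fixed_hits / n_sims, flexible_hits / n_sims


if __name__ == "__main__":
    fixed_rate, flexible_rate = simulate()
    print(f"False positive rate, fixed sample size:    {fixed_rate:.3f}")
    print(f"False positive rate, topping up if marginal: {flexible_rate:.3f}")
```

Under these assumptions, the fixed rule should land close to the nominal 5%, while the flexible rule can only do worse: every experiment that is significant at the first look still counts, and some marginal ones cross the threshold after the top-up. A single extra look inflates the error rate only modestly, and repeated peeking would inflate it further.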