Learning about Bayes Factors can be a very liberating
experience (as pointed out by Evie Vergauwe at the ESCOP session about Bayesian
analyses and subsequently by Candice Morey in her blog). The main attraction of Bayes
Factors is that you can use them to argue for the null hypothesis, which you
can’t do with frequentist statistics. This means that you can go back to your file drawer and draw some conclusions from all of those non-significant results that have probably accumulated over the years (even if the conclusion might be that you can’t draw any conclusions given this data).
When I learned about Bayes Factors just before my thesis
submission, I thought: “Sweet! Now I can interpret all of my non-significant
results!” I reported Bayes Factors in addition to most of the frequentist
analyses, which indeed allowed me to draw some conclusions that would otherwise
have been impossible: I provided some evidence for null effects, and showed
that for some unexpected statistically significant effects, the Bayes Factor
provided equivocal evidence, suggesting that they should not be taken too
seriously. Soon I realised, however, that there is more to arguing for a null
hypothesis than just calculating a Bayes Factor.
Arguing for the null comes with some challenges that are less
relevant if the experiment is simply designed to maximise the chances of
getting a significant p-value. Therefore,
it is not always meaningful to add Bayes Factors to an analysis after the fact if the p-value did not come out as significant. When designing a study with the a priori intention of using Bayes Factor analyses, it is important
to ask the question: “Does my design maximise my chances of drawing a
meaningful conclusion, regardless of whether I get evidence for H1
or H0?” This is a shift in mindset from the traditional
question, “Does my design maximise my chances of getting a significant p-value?” The latter question is, of
course, problematic, because often, when such an experiment yields a
non-significant p-value, it can only
be discarded. This means that a carefully designed experiment addressing
theoretically interesting and practically important questions may well turn
into nothing more than a waste of the researcher’s resources and an impediment
to scientific progress.
Here, I discuss some practical considerations that I came
across in some attempts at arguing for the null. I hope this will be helpful to
others who are figuring out how to get the most out of Bayes Factors. If anyone
has any further suggestions, I would be very happy to get some feedback!
Sample sizes
Having a large sample size is always important, because
small samples are prone to be influenced by extreme chance events. Bayes Factor
analyses are somewhat less affected by this problem than frequentist
statistics: a frequentist analysis will give you a p-value no matter what, whereas a Bayes Factor also tells you how strong the evidence is. If the sample is small, Bayes Factor values are likely to hover around 1, providing very little evidence about whether the data is more compatible with H0 or H1.
Such an equivocal value tells you that you need more data if you want to draw
any conclusions.
However, small-sample studies should always be taken with a grain of salt, as the law of large numbers still applies. If the effect is small, a genuine group difference in the population may yield near-identical means in a small sample. Similarly, a numerical difference between two conditions in the absence of a population effect is more likely in a small sample than in a large one. Therefore, bigger is always better when it comes to sample size.
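To illustrate the point about equivocal Bayes Factors, here is a minimal sketch in R using the BayesFactor package; the simulated data, effect size, and sample sizes are my own illustrative assumptions rather than anything from a real study:

```r
# Minimal sketch (illustrative assumptions): simulate a modest true effect
# (d = 0.3) and compare the default Bayes Factor t-test for a small and a
# larger sample. With n = 15 per group the BF will often sit near 1
# (equivocal); with n = 150 per group it is much more likely to be diagnostic.
library(BayesFactor)
set.seed(1)

small_a <- rnorm(15,  mean = 0)
small_b <- rnorm(15,  mean = 0.3)
large_a <- rnorm(150, mean = 0)
large_b <- rnorm(150, mean = 0.3)

bf_small <- ttestBF(x = small_a, y = small_b)
bf_large <- ttestBF(x = large_a, y = large_b)

extractBF(bf_small)$bf   # often close to 1: little evidence either way
extractBF(bf_large)$bf   # typically further from 1: more informative
```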
Think about the expected effect size
The key point behind Bayes Factor analyses is that they compare the degree to which the data is consistent with the null hypothesis versus a pre-specified alternative hypothesis. This is good, because it forces the researcher to be specific about the kind of effect she expects. The drawback is that it allows critics to argue that the evidence for the null is due to the use of an unrealistically large prior scale for the H1 effect size, and that with a smaller prior the Bayes Factor would probably provide evidence for H1. This issue makes it theoretically impossible to be 100% confident that there is no effect. However, thinking about the minimum effect size that would still be of theoretical or practical interest allows the researcher to build this information into the H1 prior, and to argue that if an effect exists at all, it is likely to be even smaller than this minimum.
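As a concrete illustration, here is a minimal sketch in R of how the H1 prior scale enters a default Bayes Factor t-test in the BayesFactor package, and of a simple robustness check across several plausible scales; the data and the particular scale values are illustrative assumptions only:

```r
# Minimal sketch (illustrative assumptions): compute the same one-sample
# Bayes Factor under several prior scales for the H1 effect size. If the BF
# favours H0 across all reasonable scales, the evidence for the null is not
# just an artefact of one particular prior choice.
library(BayesFactor)
set.seed(2)

x <- rnorm(60, mean = 0.05)              # data with a negligible true effect
scales <- c(0.2, 0.5, sqrt(2) / 2, 1)    # from "small effects expected" to "large"

bf_by_scale <- sapply(scales, function(r) {
  extractBF(ttestBF(x = x, mu = 0, rscale = r))$bf
})

data.frame(rscale = scales, BF10 = bf_by_scale)
```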
Avoid confounds, if possible!
Confounding variables are a bigger problem when arguing for
H0 than when arguing for H1. For example, I would really
like to know whether statistical learning ability is associated with the ease
with which children learn print-to-speech correspondences (as measured, for example, by grapheme-phoneme correspondence, or GPC, knowledge). One way to go might be to recruit a large number of children and
measure them on an extensive battery of tests. One could even go all out and
design a longitudinal study, to see if statistical learning ability in
pre-school predicts GPC knowledge after the onset of reading instruction. We
know from previous research that statistical learning ability is correlated
with vocabulary knowledge and phonological awareness (Spencer, Kaschak, Jones, & Lonigan), which, in turn, are
well-known correlates of reading ability and GPC knowledge. So we want to test
vocabulary knowledge, phonological awareness, and word reading skills as well,
to be sure that a correlation between statistical learning ability and GPC
knowledge is not due to these confounding variables. We also want to test other
potentially correlated participant-level factors, such as age, intelligence, attention,
etc. We expect all of the variables that we measure to correlate with each
other (because this is generally the case with developmental data), so the only
meaningful result will be the partial regression coefficient for statistical learning ability in a model predicting GPC knowledge.
If the partial regression coefficient is significant (or, even better, if
the Bayes Factor provides evidence for a model including statistical learning
ability as well as all the covariates compared to a model including the
covariates only), I can conclude that statistical learning ability is associated with GPC knowledge over and above the covariates.
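In R, such a comparison might look like the following minimal sketch using lmBF from the BayesFactor package; the data frame and variable names (gpc, stat_learn, vocab, phon_aware, age) are hypothetical placeholders, and the simulated scores merely stand in for real test data:

```r
# Minimal sketch (hypothetical data and variable names): compare a regression
# model containing only the covariates with one that adds statistical learning
# ability, by dividing their Bayes Factors.
library(BayesFactor)
set.seed(3)

n <- 100
kids <- data.frame(
  vocab      = rnorm(n),   # placeholder vocabulary scores
  phon_aware = rnorm(n),   # placeholder phonological awareness scores
  age        = rnorm(n),
  stat_learn = rnorm(n)    # placeholder statistical learning scores
)
kids$gpc <- 0.4 * kids$phon_aware + 0.3 * kids$vocab + rnorm(n)

bf_covariates <- lmBF(gpc ~ vocab + phon_aware + age, data = kids)
bf_full       <- lmBF(gpc ~ vocab + phon_aware + age + stat_learn, data = kids)

# Bayes Factor for adding statistical learning over and above the covariates:
# values well above 1 favour the full model, values well below 1 favour the
# covariates-only model.
bf_full / bf_covariates
```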
If the regression coefficient is not significant, and even
if we get evidence for the base model including the covariates only over the
alternative model, drawing conclusions is trickier. It is possible that
statistical learning ability does not have a direct effect on GPC knowledge,
but that it affects vocabulary knowledge, which in turn affects phonological
awareness, which in turn affects GPC knowledge. Perhaps one could even
strengthen a case for such a potential causal pathway with one
of ‘em fancy Structural Equation Models. Given the inter-correlated nature
of all my independent variables, however, there would be a large number of
possible mediators, and it is likely that even sophisticated statistical
analyses cannot give me much useful information. For example, the relationship
between statistical learning ability and GPC knowledge may disappear once we
take into account phonological awareness. But then, the relationship between
statistical learning ability and phonological awareness may also disappear
after we take into account GPC knowledge. Thus, we won’t be able to conclude
that one mediates the other, because both causal pathways are equally plausible
– as are other causal pathways, such as the possibility that all three
variables are affected by yet another confound, such as the child’s attentional
capacity.
In short, such a large-scale experiment with inter-correlated variables is an example of a design that does not maximise the
researcher’s chances of being able to draw meaningful conclusions, regardless
of the outcome. This is unfortunate, because such large-scale studies are often
very time-consuming (for the researcher and participants) and expensive to run.
They could still be useful for exploratory purposes, but they are not the best
way to answer questions about the relationship between two variables.
Transparency
To convince an audience of a null result[1],
it might be especially important to be as transparent as possible: if the
experiment is a failure-to-replicate, desperate authors of the original study
may clutch at straws and accuse the replicator of creatively excluding
outliers, choosing the wrong priors, or being generally incompetent. Such
claims can be mostly counteracted by providing the raw data and the analysis
scripts – and, even better, by pre-registering the study and, in an ideal-case scenario, getting the original authors’ approval of the experimental design and analysis plan before any data is collected.
If the data and analysis scripts are available, and if
original-authors-as-reviewers want to make claims about flaws in the analyses,
they can (and should) show (1) how and why the replicator’s analyses are
problematic, and (2) that there is evidence for their original effect (or equivocal
evidence) once the analyses are done correctly.
Conclusion
In summary, I argue that it is important to design an
experiment in a way that maximises the chance of being able to draw meaningful conclusions, regardless of whether H1 or H0 is
supported. Reading through what I have written so far, it strikes me that all
of the issues described above apply to all experimental designs, really.
However, because some researchers continue to be more easily convinced by
relatively shaky evidence for an H1 than by relatively strong
evidence for H0,[2]
it is especially important to make sure to maximise the strength of an
experiment when there is a real possibility that H0 will be
supported. I listed four considerations which might help a researcher to design
a strong experiment: (1) Large sample sizes, (2) a careful consideration of the
expected effect size, (3) avoiding confounding variables, and (4) being as
transparent as possible by making the raw data available and by providing full
information about all analyses.
Reference
Spencer, M., Kaschak, M. P., Jones, J. L., &
Lonigan, C. J. Statistical learning is related to early literacy-related
skills. Reading and Writing, 1-24.
Comments
Another thing you can do to avoid accusations of choosing the "wrong" prior is to re-analyze it yourself and explicitly show that the evidence isn't qualitatively changed by using other reasonable priors.
Thanks, Alex, that's a great suggestion!
I actually have a question about this - maybe you know the answer or could direct me to a relevant source:
I have repeated the Bayes Factor analyses for a study where the default parameters provided evidence for the null, using smaller priors ("rscaleFixed = 0.1"). This has increased the error margin to +/-100%, making the results uninterpretable. Is there a way to change the prior that has less of an effect on the error margins?
As an alternative approach, a statistician once told me that one can do the analyses both with a large and with a small prior, and then compare the posteriors of the critical effect: if H0 is true, both values should be about equally close to zero. Would you know of any recommendations for drawing conclusions about whether or not the posterior estimates are similar to each other, or whether this is something that’s up to the researcher’s judgement?
Any advice would be appreciated!
"Is there a way to change the prior that has less of an effect on the error margins?" Probably not, but I bet you'd be ok with more sensitive data! Maybe ask Morey about it, he'd know more about it than me.
I wouldn't worry about the posteriors matching. Cauchy/g priors are designed to represent the information value of very few observations (they are t distributions with ~1 df), so with moderate amounts of data they wash out fast. Models with fat-tailed priors will generally do that. If you're using more informative priors, then naturally they will converge at a rate related to their relative informativeness and the information contained in the data.
But in general I don't recommend looking at posteriors to judge support for H0. Too ad hoc and vague. The only principled way to do it is with a Bayes factor :)