Thursday, September 12, 2019

Bayes Factors 101: Justifying prior parameters in JASP

Update (27.12.2021): The content of this blog post has been extended and published in the following article:

Schmalz, X., Biurrun Manresa, J., & Zhang, L. (2021). What is a Bayes factor? Psychological Methods. Preprint: https://osf.io/5geqt/.

 

TL;DR: Do you need a justification for your prior parameters in JASP? Scroll down to find fill-in-the-blank sentences which you can use, and a table where you can pick a range of effect sizes which you expect and the corresponding prior parameters.


----------------------------
With many psychologists turning Bayes-curious, softwares are appearing that make it easy to calculate Bayes Factors. JASP (Love et al., 2015) has a similar layout to SPSS, and allows the user to perform Bayesian analyses which are equivalent to a series of popular frequentist tests. Here, I will describe the priors which are implemented in JASP for two frequently used tests: the t-test and Pearson's correlation. I will also explain what we do when we change them. The aim is to provide the basis for a better understanding of what priors mean, and how we can justify our choice of prior parameters.

Both frequentist and Bayesian statistics rely on a series of underlying assumptions and calculations, which are important to understand in order to interpret the value that the software spits out (i.e., a p-value or a Bayes Factor). Given that very few psychologists have been schooled in Bayesian statistics, the assumptions underlying the Bayes Factor are often not intuitive.

One important difference between Bayesian and frequentist data analyses is the use of a prior. The prior represents the beliefs or knowledge that we have about our effect of interest, before we consider the data which we aim to analyse. The prior is a distribution which can be specified by the experimenter. This distribution becomes updated, once we have data, to give a posterior distribution. For calculating a Bayes Factor, we have two priors: one that describes one hypothesis (e.g., a null hypothesis: no difference between groups, or no correlation between two variables), and one that describes a different hypothesis. JASP then computes the probability of the observed data under each of these hypotheses, and divides one by the other to obtain the Bayes Factor: the degree to which the data is compatible with one hypothesis over the other.

To some extent, then, the inference depends on the prior. The degree to which the prior matters depends on how much data one has: when there is a lot of data, it “overrides” the prior, and the Bayes Factor becomes very similar across a wide range of plausible priors. Choosing an appropriate prior becomes more important, though, when (1) we do not have a lot of data, (2) when we need to justify why we use a particular prior (e.g., for a Registered Report or grant proposal), or (3) when we would just like to get a better idea of how the Bayes Factor is calculated. The aim of the current blog post is to provide an introduction to the default parameters of JASP, and what it means when we change them around, while assuming very little knowledge of probability and statistics from the reader.

T-tests
Let's start with t-tests. JASP has the option to do a Bayesian independent samples t-test. It also provides some toy data: here, I'm using the data set “Kitchen Rolls”. Perhaps we want to see if age differs as a function of sex (which makes no sense, theoretically, but we need one dichotomous and one continuous variable for the t-test). Below the fields where you specify the variables, you can adjust two prior parameters: (1) The hypothesis (two-tailed or directional), and (2) the prior (Cauchy prior width). Let's start with the Cauchy. The default parameter is set to 0.707. Contrary to what is often believed, this does not represent the size of the effect that we expect. To understand what it represents, we need to take a step back to explain what a Cauchy is.

A Cauchy is a probability distribution. (Wikipedia is a very good source for finding information about the properties of all kinds of distributions.) Probability distributions describe the probability of possible occurrences in an experiment. Each type of distribution takes a set of parameters, with which we can infer the exact shape of the distribution. The shape of our well-familiar normal distribution, for example, depends both on the mean and on the variance: if you look up the normal distribution on Wikipedia, you will indeed see in the box on the right that the two parameters for this distribution are μ and σ2. On the Wikipedia distribution pages, the top figure in the box shows how the shape of the distribution changes if we change around the parameters. Visually, the Cauchy distribution is similar to the normal distribution: it is also symmetrical and kind-of bell-shaped, but it has thicker tails. It also takes two parameters: the location parameter and a scale parameter. The location parameter determines where the mode of the distribution is. The scale parameter determines its width. The latter is what we're after: in the context of Cauchy priors, it is also often called the width parameter.

Back to JASP: when we change the Cauchy prior width, we don't change the mode of our distribution, but its width (i.e., the scale parameter): we are not saying that we are considering certain values to be more or less likely, but that we consider the range of likely effect sizes to be more or less narrow. The Cauchy, in JASP, is by default centred on zero, which gives us a bidirectional test. Overall, small effect sizes are considered to be more likely than large effect sizes (as shown by the general upside-down-U shape of the distribution). If we have a directional hypothesis, rather than shifting the location parameter, JASP allows us to pick which group we expect to have higher values (Group 1 > Group 2, or Group 1 < Group 2). This simply cuts the distribution in half. We can try this with our Kitchen Rolls data: If, under the section “Plots”, we tick “Prior and posterior”, we will see a figure, in addition to the Bayes Factor, which shows the prior for the alternative hypothesis, as well as the posterior (which we will ignore in the current blog post). The default settings show the following plot (note the symmetrical prior distribution):


When we anticipate that Group 1 will have higher values than Group 2, half of the prior distribution is cut:


And when we anticipate that Group 2 will have higher values than Group 1:


So, what do you do when you plan to use the Bayes Factor t-test for inference and the reviewer of the Registered Report asks you to justify your prior? What the Cauchy can tell us is how confident we are that the effect lies within a certain range. We might write something like:

The prior is described by a Cauchy distribution centred around zero and with a width parameter of x. This corresponds to a probability of P% that the effect size lies between -y and y. [Some literature to support that this is a reasonable expectation of the effect size.]”

So, how do you determine x, P, and y? For P, that's a matter of preference. For a registered report of mine, I chose 80%, but this is rather arbitrary. The y you pick in such a way that it describes what you believe about your effect size. If you think it cannot possibly be bigger than Cohen's d = 0.5, that could be your y. And once you've picked your y, you can calculate the x. This is the tricky part, though it can be done relatively easily in R. We want to find a the parameter x where we have an 80% probability of obtaining values between -y and y. To do this, we use the cumulative distribution function, which measures the area under the curve of a probability distribution (i.e., the cumulative probability of a range of values). The R function pcauchy takes the values of y, assuming a location parameter and a scale parameter, to get the probability that an observation randomly drawn from this distribution is greater than y. To get the probability that an observation randomly drawn from this distribution lies between y and -y, we type:

pcauchy(2,0,0.707) - pcauchy(-2,0,0.707)

This is for the default settings of JASP (location parameter = 0, scale parameter = 0.707). This gives us the following probability:

[1] 0.7836833

Thus, if we use the default JASP parameters, we could write (rounding the output up 0.78 to 80%):
The prior is described by a Cauchy distribution centred around zero and with a width parameter of 0.707. This corresponds to a probability of 80% that the effect size lies between -2 and 2. [Some literature to support that this is a reasonable expectation of the effect size.]”

An effect size of 2 is rather large for most psychology studies: we might be sure that we're looking for smaller effects than this. To check how we would need to change the scale parameter set to obtain an 80% probability (or any other value of P) to get the expected effect sizes, you can copy-and-paste the code above into R, change the effect size range (2 and -2) to your desired ys, and play around with the scale parameters until you get the output you like. Or, if you would like to stick with the 80% interval, you can pick the scale parameter for a set of effect size ranges from the table below (the percentage and the scale parameter are rounded):


Range of effect sizes (non-directional)
Range of effect sizes (directional)
Scale parameter required for 80% probability
-2 to 2
0 to 2 or -2 to 0
0.71 (default)
-1.5 to 1.5
0 to 1.5 or -1.5 to 0
0.47
-1.3 to 1.3
0 to 1.3 or -1.3 to 0
0.41
-1.1 to 1.1
0 to 1.1 or -1.1 to 0
0.35
-0.9 to 0.9
0 to 0.9 or -0.9 to 0
0.3
-0.7 to 0.7
0 to 0.7 or -0.7 to 0
0.22
-0.5 to 0.5
0 to 0.5 or -0.5 to 0
0.16
-0.3 to 0.3
0 to 0.3 or -0.3 to 0
0.1
The middle column shows what happens when we have a directional hypothesis. Basically, the probability of finding a range between 0 and y under the cut-in-half Cauchy is the same as the probability of finding a range between -y and y in the full Cauchy. I explain in a footnote1 why this is the case.

How does the choice of prior affect the results? In JASP, after you have collected your data, you can check this by ticking the “Bayes factor robustness check” box under “Plots”. Below is what this plot looks like for our age as a function of sex example. The grey dot marks the Bayes Factor value for the prior which we chose: here, I took the scale parameter of 0.1, corresponding to an 80% chance of effect sizes between -0.3 and 0.3.



After having played around with different parameters in R and doing the calculations above, E.J. Wagenmakers drew my attention to the fact that, when we choose the range width to be 50%, not 80%, the width parameter is equal to the range of values that we expect. So, if we are less confident about how big we expect the effect to be (and less keen to mess around with the different parameter values in R), we can simply write (below, I assume the default prior; if you have different expectations about the effect size, replace all mentions of the value “0.707” with your preferred effect size):

The prior is described by a Cauchy distribution centred around zero and with a width parameter of 0.707. This corresponds to a probability of 50% that the effect size lies between -0.707 and 0.707. [Some literature to support that this is a reasonable expectation of the effect size.]”

After having written most of the above, I also realised that I had not updated JASP for a while, and the newer version allows us to change the location parameter of the Cauchy, as well as its width. Thus, it is possible to change the mode of the distribution to the effect size that you consider the most likely. Then, you can calculate the new effect size range by taking the values from the table above, and adding the location parameter to the upper and lower bound, for example:
The prior is described by a Cauchy distribution centred around 0.707 and with a width parameter of 0.707. This corresponds to a probability of 50% that the effect size lies between 0 and 1.141. [Some literature to support that this is a reasonable expectation of the effect size.]”

You can find more information about location parameter shifting in Gronau, Q. F., Ly, A., & Wagenmakers, E.-J. (in press). Informed Bayesian t-tests. The American Statistician. https://arxiv.org/abs/1704.02479. For a step-by-step instruction, or in order to get hands-on experience with construction your own prior parameters, I also recommend going through this blogpost by Jeff Rouder: http://jeffrouder.blogspot.com/2016/01/what-priors-should-i-use-part-i.html.


Correlations
Now, let's move on to correlations. Again, our goal is to make the statement:

The prior is described by a beta-distribution centred around zero and with a width parameter of x. This corresponds to a probability of P% that the correlation coefficient lies between -y and y. [Some literature to support that this is a reasonable expectation of the effect size.]”

When you generate a Bayesian correlation matrix in JASP, it gives you two things: The Pearson's correlation coefficient (r) that we all know and love, and the Bayes Factor, which quantifies the degree to which the observed r is compatible with the presence of a correlation over the absence of a correlation. The prior for the alternative hypothesis is now described by a beta-distribution, not by a Cauchy. More details about the beta-distribution can be found in the footnote2. For the less maths-inclined people, suffice it to say that the statistical parameters of the distribution do not directly translate into the parameters that you input in JASP, but never fear: the table and text below explain how you can easily jump from one to the other, if you want to play around with the different parameters yourself.

The default parameter for the correlation alternative prior is 1. This corresponds to a flat line, and is identical to a so-called uniform distribution. Beware that describing this distribution as “All possible values of r are equally likely” will trigger anything from a long lecture to a condescending snort from maths nerds: as we're dealing with a continuous distribution, a single value does not have a probability associated with it. The mathematically correct way to put it is: “If we take any two intervals (A and B) of the same length from the continuous uniform distribution, the probability of the observation falling interval A will equal to the probability of the observation falling into interval B.Basically, if you have no idea what the correlation coefficient will be like, you can keep the prior as it is. As with the t-test, you can test directional hypotheses (r > 0 or r < 0).

Changing the parameter will either make the prior convex (U-shaped) or concave (upside-down-U-shaped). In the former case, you consider values closer to -1 and 1 (i.e., very strong correlations) to be more likely. Perhaps this could be useful, for example, if you want to show that a test has a very high test-retest correlation. In the latter case, you consider smaller correlation coefficients to be more likely. This is probably closer to the type of data that, as a psychologist, you'll be dealing with.

So, without further ado, here is the table from which you can pick the prior (first column), based on the effect size range (possible correlation coefficients) that you expect with 80% certainty:

JASP parameter (A)
Range of effect sizes (r)
Statistical parameters (a, b)
Statistical inputs (R)
1
-0.8 to 0.8
1, 1
0.1 to 0.9
1/3
-0.5 to 0.5
3, 3
0.25 to 0.75
1/13
-0.25 to 0.25
7, 7
0.375 to 0.625
 
--------------
Update (18.11.2020): After a comment by Katrin pointed out that the numbers in the table above don't add up, I had a look and indeed found some discrepancies, was not able to follow exactly how I got them, and it also took me a while to understand the text I'd written below. I changed the numbers in the table above, and rewrote the section below to make this more clear. 
-------------
 
The R code (if you want to play around with the parameters and ranges), for the first row, is:
pbeta(0.925,1,1) - pbeta(0.125,1,1)
 
If you have a range of correlations, centered on zero, that you are X% confident about, you can do the following: 
 
Step 1: Convert the effect sizes from your desired range (rl, ru) to the statistical inputs in the fourth row (Rl, Ru) with the formula: R = 0.5 + 0.5*r.
 
Step 2: Insert the values (Rl, Ru) into the code above, for the first parameter in the two pbeta-commands. The output should be a number between 0 and 1: This is your confidence level.
 
Step 3: The second and third parameter in the pbeta-command (a, b) should be identical (i.e., a = b): these are the parameters of the beta-distribution, which need to be the same to ensure that the distribution is symmetrical (as written above, JASP, at least in its current implementation, only contains symmetric distributions, and hence there is only one parameter that you can input to determine prior width). The statistical parameters for the R-function (a, b; third column) are not the same as the JASP parameter (A; 1st column), but can be obtained by the simple formula: A = 1/a. There must be a more elegant way to do this, but to get to the desired confidence level, you can change around the beta-parameter until you get to the confidence level you desire.
 
As with the t-test, when you chop the beta-distribution in half (i.e., when you test a directional hypothesis), the upper bound (y) should be identical to the upper bound in the non-directional prior.


Conclusion
The blogpost aims to provide a psychologist reader with a sufficiently deep understanding to justify the choice of prior, e.g., for a Registered Report. If you've worked your way through the text above (in which case: thank you for bearing with me!), you should now be able to choose a prior parameter in JASP in such as way that it translates directly to the expectations you have about possible effect size ranges.

To end the blogpost on a more general note, here are some random thoughts. The layout of JASP is based on SPSS, but unlike SPSS, JASP is open source and based on the programming language R. JASP aims to provide an easy way for researchers to switch from frequentist testing in SPSS to doing Bayesian analyses. Moving away from SPSS is always a good idea. However, due to the similar, easy-to-use layout, JASP inherits one of the problems of SPSS: it's possible to do analyses without actually understanding what the output means. Erik-Jan Wagenmakers (?) once wrote on Twitter (?) that JASP aims to provide “training wheels” for researchers moving away from frequentist statistics and SPSS, who will eventually move to more sophisticated analysis tools such as R. I hope that this blogpost will contribute a modest step to this goal, by giving a more thorough understanding of possible prior parameters in the context of Bayes Factor hypothesis testing.

-----------------------------
I thank E.J. Wagenmakers for his comments on an earlier version of this blog post. Any remaining errors are my own.
-----------------------------
Edit (17.9.2019): I changed the title of the blogpost, to mirror one that I wrote a few months ago: "P-values 101: An attempt at an intuitive but mathematically correct explanation".
-----------------------------
1 If we simply cut the Cauchy distribution in half, we no longer have a probability distribution: a probability distribution, by definition, needs to integrate to 1 across the range of all possible values. If we think about a discrete distribution (e.g., the outcome of a die toss), it's intuitive that the sum of all possible outcomes should be 1: that's how we can infer that the probability of throwing a 6 is 1/6, given a cubical die (because we also know that we have 6 possible, equiprobable outcomes). For continuous distributions, we have an infinite range of values, so we can't really sum them. Integrating is therefore the continuous-distribution-equivalent to summing. Anyhow: If we remove half of our Cauchy distribution, we end up with a distribution which integrates out to 0.5 (across the range from 0 to infinity). To change this back to a probability distribution, we need to multiply the function by a constant, in this case, by 2. If you look at the plots for the full Cauchy prior versus the directional priors, you will notice that, at x = 0, y ≈ 0.5 for the full Cauchy, and y ≈ 1 for the two truncated Cauchys. For calculating the probability of a certain range, this means that we need to multiply it by two. Which is easy for our case: We start off with a given range (-y to y) our full Cauchy and cut off half (so we get the range 0 to y), so we lose half of the area and need to divide the probability of getting values in this range by half. Then we multiply our function by 2, because we need to turn it back to a probability distribution: we also multiply the area between 0 and y by two, which gives us the same proportion that we started with in the first place.

2 A beta-distribution looks very different to a Cauchy or normal, and takes different parameters (on wikipedia, denoted α and β). When both of the parameters are equal (α = β), the distribution is symmetrical. In JASP, you can only adjust one number: the same value is then used for both parameters, so the prior distribution is always symmetrical. The number which you can adjust in the JASP box (let's call it A) does not equal to the true parameter that defines the beta-function (let's call it a), as it has an inverse relationship to it (A = 1/a). The other difference between the actual beta-distribution and the JASP prior is that the beta-distribution is defined for values between 0 and 1: the JASP prior is stretched between values of -1 and 1. Thus, when using the function to calculate the probabilities of different ranges of r under different parameters, we need to transform r to to a value between 0 and 1 before we can make a statement about the size of the correlations. I hope to make this clearer when I present the table with parameters and effect size ranges.

3 comments:

  1. Hi Xenia, thank you for this great post! I really appreciate it. I just have a question concerning the table about the parameters for the Bayesian correlation: if the equation 0.5+0.5*r=R holds, should it not be 0.875 instead of 0.925 in the first row? If that is true, I get a probability closer to 0.8 for kappa=0.85 (first row). Also, I get a better result for kappa=1/8 (row 4). Or am I missing anything? Best regards, Katrin

    ReplyDelete
    Replies
    1. Dear Katrin, thanks for your comment, and you are perfectly right, the numbers didn't add up with what I'd written. I've worked on fixing this now, and on making the description more detailed and transparent, so that it will be easier to identify any other mistakes and their source in the future. :) If you (or anyone else) notice some other errors, please let me know, I'll work on fixing them ASAP!
      Xenia

      Delete
  2. Hi Xenia, thank you, I'm really appreciate it and I'm so gratful with u!

    ReplyDelete