- A brief and practical introduction to fitting Bayesian multilevel models in R and Stan
- Using **brms** (**B**ayesian **R**egression **M**odels using **S**tan)
- Quick intro to Bayesian inference
- Mostly practical skills

- Know what a mixed-effects or multilevel model is
- A little experience with stats and/or data science in R
- Vague knowledge of what Bayesian stats are

- Knowing about the **lme4** package will help
- Knowing about **tidyverse** and **ggplot2** will help

- Slides and text version of lessons are online
- Fill in code in the worksheet (replace `...` with code)
- You can always copy and paste code from the text version of the lesson if you fall behind

At the end of this course, you will understand …

- The basics of Bayesian inference
- What a prior, likelihood, and posterior are
- The basics of how Markov Chain Monte Carlo works
- What a credible interval is

At the end of this course, you will be able to …

- Write **brms** code to fit a multilevel model with random intercepts and random slopes
- Diagnose and deal with convergence problems
- Interpret **brms** output
- Compare models with LOO information criteria
- Use Bayes factors to assess strength of evidence for effects
- Make plots of model parameters and predictions with credible intervals

A method of statistical inference that allows you to use information you already know to assign a prior probability to a hypothesis, then update the probability of that hypothesis as you get more information

- Used in many disciplines and fields
- We’re going to look at how to use it to estimate parameters of statistical models to analyze scientific data
- Powerful, user-friendly, open-source software is making it easier for everyone to go Bayesian

- Thomas Bayes, 1763
- Pierre-Simon Laplace, 1774

\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]

- How likely an event is to happen based on our prior knowledge about conditions related to that event
- The *conditional* probability of an event *A* occurring, given that another event *B* has occurred

\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]

The **probability of A being true given that B is true** (\(P(A|B)\))

is equal to the **probability that B is true given that A is true** (\(P(B|A)\))

times the **ratio of probabilities that A and B are true** (\(\frac{P(A)}{P(B)}\))

- Let’s say \(A\) is a statistical model (a hypothesis about the world)
- How probable is it that our hypothesis is true?
- \(P(A)\): the *prior probability* that we assign based on our subjective knowledge before we get any data

- We go out and get some data \(B\)
- \(P(B|A)\): the *likelihood*, the probability of observing that data if our model \(A\) is true
- Use the likelihood to update our estimate of the probability of our model
- \(P(A|B)\): the *posterior probability* that model \(A\) is true, given that we observed \(B\)

\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]

- What about \(P(B)\)? The *marginal probability*, the probability of the data
- Basically just a normalizing constant
- If we are comparing two models fit to the same data, the two \(P(B)\)s cancel out

\[P(model|data) \propto P(data|model)P(model)\]

\[posterior \propto likelihood \times prior\]

*what we believed before about the world (prior) × how much our new data changes our beliefs (likelihood) = what we believe now about the world (posterior)*

- Find a coin on the street. What is our prior estimate of the probability of flipping heads?
- Now we flip 10 times and get 8 heads. What is our belief now?
- Probably not by much: we have a strong prior, and the likelihood of the data is still reasonably high if the true probability is 0.5, even when we see 8/10 heads

- Shady character on the street shows us a coin and offers to flip it. He will pay $1 for each tails if we pay $1 for each heads
- What is our prior estimate of the probability?
- He flips 10 times and gets 8 heads. What’s our belief now?

- In classical “frequentist” analysis we cannot incorporate prior information into the analysis
- In each case our point estimate of the probability would be 0.8
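For these simple coin examples, the Bayesian posterior can be computed exactly with a conjugate beta-binomial update. A minimal sketch (the specific prior parameters are illustrative, not from the lesson):

```
# Conjugate beta-binomial update for the two coin examples
# (prior parameters are illustrative choices)
heads <- 8
tails <- 2

# Posterior is Beta(a + heads, b + tails) when the prior is Beta(a, b)
post_mean <- function(a, b) (a + heads) / (a + b + heads + tails)

post_mean(50, 50)  # found coin, strong prior near 0.5: ~0.53
post_mean(1, 1)    # shady coin, flat prior: 0.75
```

Compare both to the frequentist point estimate of 0.8: the strong prior barely moves, while the flat prior lets the data dominate.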

- \(P(data|model)\), the likelihood, is needed to get \(P(model|data)\), the posterior
- But the “model” is not just one parameter, it might be 100s or 1000s of parameters
- Need to integrate a probability distribution with 100s or 1000s of dimensions
- For many years, this was computationally not possible

- Class of algorithms for sampling from probability distributions
- The longer the chain runs, the closer it gets to the true distribution
- In Bayesian inference, we run multiple Markov chains for a preset number of samples
- Discard the initial samples (warmup)
- What remains is our estimate of the posterior distribution
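As a toy illustration of these ideas (not the algorithm Stan uses), here is a minimal random-walk Metropolis sampler for the coin-flip posterior, including a warmup discard:

```
# Toy Metropolis sampler for P(heads) after seeing 8 heads in 10 flips,
# with a flat prior on p. Stan uses the more sophisticated HMC algorithm.
set.seed(1)
n_iter <- 20000
p <- numeric(n_iter)
p[1] <- 0.5
log_post <- function(p) {
  if (p <= 0 || p >= 1) return(-Inf)          # outside (0, 1): probability zero
  dbinom(8, size = 10, prob = p, log = TRUE)  # flat prior adds only a constant
}
for (i in 2:n_iter) {
  proposal <- p[i - 1] + rnorm(1, sd = 0.1)   # random-walk proposal
  # Accept with probability min(1, posterior ratio)
  if (log(runif(1)) < log_post(proposal) - log_post(p[i - 1])) {
    p[i] <- proposal
  } else {
    p[i] <- p[i - 1]
  }
}
keep <- p[-(1:5000)]  # discard warmup samples
mean(keep)            # estimate of the posterior mean (analytically 0.75)
```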

- HMC (Hamiltonian Monte Carlo) is among the fastest and most efficient MCMC algorithms developed to date
- It’s implemented in software called Stan

- An easy way to fit Bayesian mixed models using Stan in R
- Syntax of **brms** models is just like **lme4**
- Runs a Stan model behind the scenes
- Automatically assigns sensible priors and does lots of tricks to speed up HMC convergence

- Some models just can’t be fit with frequentist maximum-likelihood methods
- Estimate how big effects are instead of yes-or-no framework of rejecting a null hypothesis
- We can say “the probability of something being between a and b is 95%” instead of “if we ran this experiment many times, 95% of the confidence intervals would contain this value.”

Let’s finally fit some Bayesian models!

Load packages.

Set plotting theme.

Set **brms** options (back-end and cores).

Read the simulated yield dataset from CSV on GitHub
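The setup steps above might look something like this (a sketch: package choices follow the lesson, but the theme, back-end, and core count are illustrative, and the data URL is elided):

```
# Load packages used throughout the lesson
library(brms)
library(tidyverse)
library(tidybayes)
library(emmeans)

# Set a plotting theme (choice is illustrative)
theme_set(theme_minimal())

# brms options: back-end and number of cores (values are illustrative)
options(brms.backend = 'cmdstanr', mc.cores = 4)

# Read the simulated yield data; the GitHub URL is elided here
yield_data <- read_csv('...')
```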

`field`
: text ID for each field where responses were measured (`'field1'` through `'field25'`)

`plot`
: text ID for each plot, 40 plots in each field (`'plot1'` through `'plot40'`)

`variety`
: text ID identifying variety (`'short'` and `'tall'`)

`rainfall`
: numeric variable, same for all plots in a given field

`soilN`
: numeric variable (soil nitrogen), unique value for each plot

`yield`
: numeric variable measured at plot level (outcome variable)

- Look at the relationship between `yield` and `soilN`
- I will not explain the **ggplot2** code for now
- Soil N values are jittered in the *x* direction

Add trendline.

Take multilevel structure of data into account (color by field).

Add field-level trendlines.
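The field-level trendline plot might be sketched like this (column names come from the data description; aesthetic details are illustrative):

```
# Yield vs. soil N, jittered in x, colored by field, with per-field trendlines
ggplot(yield_data, aes(x = soilN, y = yield, color = field)) +
  geom_point(position = position_jitter(width = 0.2), alpha = 0.5) +
  geom_smooth(method = lm, se = FALSE)
```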

For reference, this is the mixed model syntax from the **lme4** package:

- Dependent or response variable (`yield`) on the left side
- Tilde `~` separates dependent from independent variables
- Here the only fixed effect is the global intercept (`1`)
- Random effects specification (`(1 | field)`) has a *design* side (on the left hand) and a *group* side (on the right hand), separated by `|`
- In this case, the `1` on the design side means only fit random intercepts and no random slopes
- `field` on the group side means each field will get its own random intercept
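Put together, the equivalent **lme4** call (shown for reference only) would be:

```
# lme4 version of the random-intercept model (for reference, not run here)
library(lme4)
fit_lmer <- lmer(yield ~ 1 + (1 | field), data = yield_data)
```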

- Same formula as **lme4** but with some extra instructions for the HMC sampler:
  - Number of Markov chains
  - Iterations for each chain
  - How many iterations to discard as warmup
  - Random initial values
- No priors specified, so defaults are used
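A sketch of the corresponding `brm()` call (the object name and the specific chain/iteration counts are illustrative; the lesson only says they are preset):

```
# Intercept-only multilevel model; sampler settings are illustrative
fit_interceptonly <- brm(
  yield ~ 1 + (1 | field),
  data = yield_data,
  chains = 2,        # number of Markov chains
  iter = 200,        # total iterations per chain (warmup + sampling)
  warmup = 100,      # iterations discarded as warmup
  init = 'random',   # random initial values
  seed = 1           # for reproducibility
)
```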

- Warning about convergence: `Rhat > 1.05` for some parameters
- `Rhat` indicates convergence of the MCMC chains, approaching 1 at convergence
- `Rhat < 1.01` is ideal

Posterior distributions and trace plots for

- the fixed effect intercept (`b_Intercept`)
- the standard deviation of the random field intercepts (`sd_field__Intercept`)
- the standard deviation of the model’s residuals (`sigma`)

- Warning says either increase iterations or set stronger priors
- Increase iterations to 1000 warmup, 1000 sampling per chain (2000 total)
- `update()` lets us draw more samples without recompiling code
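For example (assuming the earlier intercept-only fit is stored as `fit_interceptonly`; names are illustrative):

```
# Draw more samples without recompiling: 1000 warmup + 1000 sampling per chain
fit_interceptonly_moresamples <- update(fit_interceptonly,
                                        iter = 2000, warmup = 1000)
```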

- What is the “95% CI” thing on the parameter summaries? A *credible interval*, not a confidence interval
- More direct interpretation than a confidence interval: we are 95% sure the parameter’s value is in the 95% credible interval
- Based on quantiles of the posterior distribution

Median and 90% quantile-based credible interval (QI) of the intercept

`as_draws_df()` gets all posterior samples for all parameters and puts them into a data frame
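For example (object name as used later in the lesson):

```
# Median and 90% quantile-based credible interval from the raw posterior draws
post <- as_draws_df(fit_interceptonly_moresamples)
quantile(post$b_Intercept, probs = c(0.05, 0.5, 0.95))
```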

- Literally anything in a Bayesian model has a posterior distribution, so anything can have a credible interval!
- In frequentist models, you have to do bootstrapping to get that kind of interval on most quantities

- Proportion of variation at different nested levels
- Calculate the ratio of variance within fields to between fields
- `variance_decomposition()` from the **performance** package also gives us a credible interval on the variance ratio
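The call itself is a one-liner (a sketch):

```
# Variance ratio with a credible interval
performance::variance_decomposition(fit_interceptonly_moresamples)
```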

- Variance ratio is much greater than zero so there is a need for a multilevel model
- But I would recommend using one anyway, if that’s the way your study was designed

- So far we have only calculated mean of yield and random variation by field
- But what factors influence yield at the plot level?
- Add first-level predictors (that vary by plot) as fixed effects
- variety
- soil N

- Fixed-effect part of the model formula is `1 + variety + soilN`
- No random slope (effect of variety and soil N on yield is the same in each field)
- Still using default priors

`seed` sets a random seed for reproducibility, and `file` creates a `.rds` file in the working directory so you can reload the model later without rerunning it.
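A sketch of this model’s `brm()` call (the object name, seed value, and sampler settings are illustrative):

```
# Random intercepts plus fixed effects for variety and soil N, default priors
fit_fixedslopes <- brm(
  yield ~ 1 + variety + soilN + (1 | field),
  data = yield_data,
  chains = 4, iter = 2000, warmup = 1000,
  seed = 2,                  # random seed for reproducibility
  file = 'fit_fixedslopes'   # saves fit_fixedslopes.rds for later reloading
)
```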

- Low Rhat (the model converged)
- Posterior distribution mass for fixed effects is well above zero
- `sigma` (SD of residuals) is smaller than before because we’re explaining more variation

`prior_summary()` shows what priors were used to fit the model

- t-distributions on the intercept, random effect SD, and residual SD (`sigma`)
) - mean of intercept prior is the mean of the data
- the priors on the variance parameters are centered at 0 but bounded below by 0 (half bell curves)

- By default, priors on the fixed-effect slopes are flat
- Assigns equal prior probability to *any* possible value
- 0 is as probable as 100000, which is as probable as -55555, etc.
- Not very plausible
- It is OK in this case because the model converged, but often it helps convergence to use priors that only allow “reasonable” values

- `normal(0, 5)` is a good choice
- Mean of 0 means we think that positive effects are just as likely as negative
- SD of 5 means we are still assigning pretty high probability to large effect sizes
- Use `prior()` to assign a prior to each class of parameters
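For example (a sketch, assuming the previous default-prior fit is stored as `fit_fixedslopes`):

```
# Refit with a normal(0, 5) prior on all fixed-effect slopes (class 'b')
fit_fixedslopes_priors <- update(fit_fixedslopes,
                                 prior = prior(normal(0, 5), class = b))
```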

- Basically no effect on the results or the performance of the HMC sampler
- But it’s something to be mindful of in the future!

`pp_check()` is a useful diagnostic for how well the model fits the data

- black line: density plot of observed data
- blue lines: density plot of predicted data from 10 random draws from the posterior distribution
- Because these are simulated data, it looks very good in this case
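The check itself is one line (`ndraws` is the argument name in recent **brms** versions):

```
# Compare density of observed data to 10 posterior predictive draws
pp_check(fit_fixedslopes_priors, ndraws = 10)
```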

- `summary()` only gives us the median and 95% credible interval
- We can work with the full uncertainty distribution (nothing special about 95%)
- Functions from **tidybayes** are used to make tables and plots
- `gather_draws()` makes a data frame from the posterior samples of parameters that we choose

`median_qi()` gives us the median and quantiles of the parameters
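A sketch combining the two (parameter names follow **brms** naming conventions for this model):

```
# Medians and quantile intervals for the two fixed-effect slopes
gather_draws(fit_fixedslopes_priors, b_varietytall, b_soilN) %>%
  median_qi(.width = c(.66, .95))
```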

- Special extensions to **ggplot2** for plotting quantiles of the posterior distribution
- I also include a dotted line at zero for comparison

- Variety and soil N vary by plot (first-level predictors)
- Rainfall is shared by all plots in the same field (second-level predictor)
- The same syntax is used
- Fixed-effect part is now `1 + variety + soilN + rainfall`
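A sketch of the call (the object name matches its later use with LOO; priors and sampler settings are illustrative):

```
# Add the field-level predictor rainfall to the fixed effects
fit_fixed12 <- brm(
  yield ~ 1 + variety + soilN + rainfall + (1 | field),
  data = yield_data,
  prior = prior(normal(0, 5), class = b),
  chains = 4, iter = 2000, warmup = 1000,
  seed = 3, file = 'fit_fixed12'
)
```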

Look at trace plots, posterior predictive check, and model summary.

- What do they show?

- You can add interaction terms separated by `:`
- Example: `1 + variety + soilN + rainfall + variety:soilN + soilN:rainfall`
- Interactions can be within level (like `variety:soilN`) or between levels (like `soilN:rainfall`)

- So far we’ve assumed any predictor’s effect is the same in every field (only the intercept varies, not the slope)
- Add a *random slope* term to allow both intercept and slope to vary
- Specify a random slope by adding the appropriate slope to the design side of the random effect specification
  - random intercept only: `(1 | field)`
  - random intercept and random slope with respect to soil N: `(1 + soilN | field)`
  - random intercept and random slope with respect to variety and soil N: `(1 + variety + soilN | field)`
- It doesn’t make sense to have a random slope for `rainfall` because all plots in the same field have the same rainfall value

Note: this model may take a minute or two to sample
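A sketch of the call (sampler settings are illustrative):

```
# Random intercepts plus random slopes for variety and soil N by field
fit_randomslopes <- brm(
  yield ~ 1 + variety + soilN + rainfall + (1 + variety + soilN | field),
  data = yield_data,
  prior = prior(normal(0, 5), class = b),
  chains = 4, iter = 2000, warmup = 1000,
  seed = 4, file = 'fit_randomslopes'
)
```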

- Look at trace plots, `pp_check()`, and the model summary
- We see some divergent transitions, maybe because we are fitting a model that is too complicated

- Standard deviation of random slopes for variety is very small
- We can probably omit the random slope for variety from the model
- To confirm, plot field-level slopes for variety (sum of fixed + random) with QI intervals

```
variety_slopes <- spread_draws(fit_randomslopes, b_varietytall, r_field[field, variable]) %>%
  filter(variable == 'varietytall') %>%
  mutate(slope = b_varietytall + r_field)

variety_slopes %>%
  median_qi(slope, .width = c(.66, .90, .95)) %>%
  ggplot(aes(y = field, x = slope, xmin = .lower, xmax = .upper)) +
  geom_interval() +
  geom_point(size = 2) +
  scale_color_brewer(palette = 'Blues')
```

- The field-level effects of variety are all very similar
- Refit the model without the random slope for variety, only for soil N
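A sketch of the simplified refit (sampler settings are illustrative):

```
# Keep the random slope for soil N only; the variety slope is fixed across fields
fit_soilNrandomslope <- brm(
  yield ~ 1 + variety + soilN + rainfall + (1 + soilN | field),
  data = yield_data,
  prior = prior(normal(0, 5), class = b),
  chains = 4, iter = 2000, warmup = 1000,
  seed = 5, file = 'fit_soilNrandomslope'
)
```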

- Leave-one-out (LOO) cross-validation compares models
- How well does a model fit to all data points but one predict the one remaining data point?
- First use `add_criterion()` to compute the LOO criterion for each model
- Then use `loo_compare()` to rank the models

```
fit_interceptonly_moresamples <- add_criterion(fit_interceptonly_moresamples, 'loo')
fit_fixedslopes_priors <- add_criterion(fit_fixedslopes_priors, 'loo')
fit_fixed12 <- add_criterion(fit_fixed12, 'loo')
fit_randomslopes <- add_criterion(fit_randomslopes, 'loo')
fit_soilNrandomslope <- add_criterion(fit_soilNrandomslope, 'loo')
```
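Then rank them (the comparison table is printed relative to the best model):

```
# Rank all five models by expected out-of-sample predictive performance
loo_compare(fit_interceptonly_moresamples, fit_fixedslopes_priors,
            fit_fixed12, fit_randomslopes, fit_soilNrandomslope)
```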

- Models ranked by ELPD (expected log pointwise predictive density)
- The best model is assigned an ELPD difference of 0, and the others are ranked relative to it
- The random slope models do equally well, so we can prefer the simpler one
- The random intercept models do quite a bit worse
- The model with no fixed effects does very poorly

- Bayesian analogue of a p-value
- Ratio of evidence between two models: \(\frac{P(model_1|data)}{P(model_2|data)}\)
- Ranges from 0 to infinity
- BF = 1 means equal evidence, BF > 1 means more evidence for model 1
- No “significance” threshold but BF > 10 is usually called strong evidence

- The R package **bayestestR** lets us compute a BF for each parameter
- Ratio of evidence for the posterior : evidence for the prior
- Our prior distributions were centered at 0 so they are like “null hypotheses”
- BF = 1 means we did not change our belief about the parameter at all from the prior, after seeing the data
- BF > 1 means we’ve changed our belief about the parameter after seeing the data
- BF < 1 means that, after seeing the data, our belief has shifted even closer to the prior (null) value
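The per-parameter BFs can be computed like this (a sketch; `bayesfactor_parameters()` compares posterior to prior density at the null value):

```
# Bayes factor for each parameter: posterior vs. prior evidence at the null
library(bayestestR)
bayesfactor_parameters(fit_soilNrandomslope)
```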

- BF < 1 for the intercept
- BF >> 1000 for all other fixed effects
- **WARNING**: BFs are very sensitive to your choice of prior

- To finish, here are some examples of plots you can make to visualize model output
- Posterior expected values given different values of the predictors

`conditional_effects()` is a quick way to plot all fixed effects with 95% credible intervals
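For example:

```
# Quick plots of each fixed effect with 95% credible intervals
conditional_effects(fit_soilNrandomslope)
```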

- We can also incorporate random effects into the prediction
- This plot shows the “shrinkage” effect of a mixed model

- Customized plots can be made with the packages **emmeans** and **tidybayes**
- `emmeans()` gives you the expected value at specific levels of fixed predictors

```
variety_emmeans <- emmeans(fit_soilNrandomslope, ~ variety)

gather_emmeans_draws(variety_emmeans) %>%
  ggplot(aes(x = .value, y = variety)) +
  stat_interval(.width = c(.66, .95, .99)) +
  stat_summary(fun = median, geom = 'point', size = 2) +
  scale_color_brewer(palette = 'Greens') +
  ggtitle('posterior expected value by variety', 'averaged across soil N and rainfall')
```

- `add_epred_draws()` gives you the expected value at levels of categorical and continuous fixed effects
- Trend of yield versus soil N for each variety, at the average value of rainfall
- Averages across field-level random effects

```
expand_grid(rainfall = mean(yield_data$rainfall),
            soilN = seq(2, 9, by = .5),
            variety = c('short', 'tall')) %>%
  add_epred_draws(fit_soilNrandomslope, re_formula = ~ 0) %>%
  ggplot(aes(x = soilN, y = .epred, group = variety, color = variety)) +
  stat_lineribbon(.width = c(.66, .95, .99)) +
  scale_fill_grey() +
  labs(y = 'posterior expected value of yield') +
  ggtitle('posterior expected value by soil N + variety', 'at average value of rainfall')
```

What did we learn? Let’s revisit the learning objectives!

You now understand…

- The basics of Bayesian inference
- Definition of prior, likelihood, and posterior
- How Markov Chain Monte Carlo works
- What a credible interval is

You now can…

- Write **brms** code to fit a multilevel model with random intercepts and random slopes
- Diagnose and deal with convergence problems
- Interpret **brms** output
- Compare models with LOO information criteria
- Use Bayes factors to assess strength of evidence for effects
- Make plots of model parameters and predictions with credible intervals

Congratulations, you are now Bayesians!

- See text version of lesson for further reading and useful resources
- Please send any feedback to `quentin.read@usda.gov`!