# A crash course in Bayesian mixed models with brms

## What is this class?

• A brief and practical introduction to fitting Bayesian multilevel models in R and Stan
• Using brms (Bayesian Regression Models using Stan)
• Quick intro to Bayesian inference
• Mostly practical skills

### Minimal prerequisites

• Know what mixed-effects or multilevel model is
• A little experience with stats and/or data science in R
• Vague knowledge of what Bayesian stats are

• Knowing about the lme4 package will help
• Knowing about tidyverse and ggplot2 will help

### How to follow the course

• Slides and text version of lessons are online
• Fill in code in the worksheet (replace ... with code)
• You can always copy and paste code from text version of lesson if you fall behind

### Conceptual learning objectives

At the end of this course, you will understand …

• The basics of Bayesian inference
• What a prior, likelihood, and posterior are
• The basics of how Markov Chain Monte Carlo works
• What a credible interval is

### Practical learning objectives

At the end of this course, you will be able to …

• Write brms code to fit a multilevel model with random intercepts and random slopes
• Diagnose and deal with convergence problems
• Interpret brms output
• Compare models with LOO information criteria
• Use Bayes factors to assess strength of evidence for effects
• Make plots of model parameters and predictions with credible intervals

## What is Bayesian inference?

A method of statistical inference that allows you to use information you already know to assign a prior probability to a hypothesis, then update the probability of that hypothesis as you get more information

• Used in many disciplines and fields
• We’re going to look at how to use it to estimate parameters of statistical models to analyze scientific data
• Powerful, user-friendly, open-source software is making it easier for everyone to go Bayesian

## Bayes’ Theorem

• Thomas Bayes, 1763
• Pierre-Simon Laplace, 1774

## Bayes’ Theorem

$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$

• How likely an event is to happen based on our prior knowledge about conditions related to that event
• The conditional probability of an event A occurring, conditioned on the probability of another event B occurring

## Bayes’ Theorem

$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$

The probability of A being true given that B is true ($P(A|B)$)
is equal to the probability that B is true given that A is true ($P(B|A)$)
times the ratio of probabilities that A and B are true ($\frac{P(A)}{P(B)}$)

### Bayes’ theorem and statistical inference

• Let’s say $A$ is a statistical model (a hypothesis about the world)
• How probable is it that our hypothesis is true?
• $P(A)$: prior probability that we assign based on our subjective knowledge before we get any data

### Bayes’ theorem and statistical inference

• We go out and get some data $B$
• $P(B|A)$: likelihood is the probability of observing that data if our model $A$ is true
• Use the likelihood to update our estimate of probability of our model
• $P(A|B)$: posterior probability that model $A$ is true, given that we observed $B$.

### Bayes’ theorem and statistical inference

$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$

• What about $P(B)$?
• marginal probability, the probability of the data
• Basically just a normalizing constant
• If we are comparing two models with the same data, the two $P(B)$s cancel out

### Restating Bayes’ theorem

$P(model|data) \propto P(data|model)P(model)$

$posterior = likelihood \times prior$

what we believed before about the world (prior) × how much our new data changes our beliefs (likelihood) = what we believe now about the world (posterior)

## Example

• Find a coin on the street. What is our prior estimate of the probability of flipping heads?
• Now we flip 10 times and get 8 heads. What is our belief now?
• Probably doesn’t change much because we have a strong prior and the likelihood of probability = 0.5 is still high enough even if we see 8/10 heads
• Shady character on the street shows us a coin and offers to flip it. He will pay $1 for each tails if we pay$1 for each heads
• What is our prior estimate of the probability?
• He flips 10 times and gets 8 heads. What’s our belief now?
• In classical “frequentist” analysis we cannot incorporate prior information into the analysis
• In each case our point estimate of the probability would be 0.8

## Bayes is computationally intensive

• $P(data|model)$, the likelihood, is needed to get $P(model|data)$, the posterior
• But the “model” is not just one parameter, it might be 100s or 1000s of parameters
• Need to integrate a probability distribution with 100s or 1000s of dimensions
• For many years, this was computationally not possible

## Markov Chain Monte Carlo (MCMC)

• Class of algorithms for sampling from probability distributions
• The longer the chain runs, the closer it gets to the true distribution
• In Bayesian inference, we run multiple Markov chains for a preset number of samples
• Discard the initial samples (warmup)
• What remains is our estimate of the posterior distribution

## Hamiltonian Monte Carlo (HMC) and Stan

• HMC is the fastest and most efficient MCMC algorithm that has ever been developed
• It’s implemented in software called Stan

## What is brms?

• An easy way to fit Bayesian mixed models using Stan in R
• Syntax of brms models is just like lme4
• Runs a Stan model behind the scenes
• Automatically assigns sensible priors and does lots of tricks to speed up HMC convergence

## Why use Bayes?

• Some models just can’t be fit with frequentist maximum-likelihood methods
• Estimate how big effects are instead of yes-or-no framework of rejecting a null hypothesis
• We can say “the probability of something being between a and b is 95%” instead of “if we ran this experiment many times, 95% of the confidence intervals would contain this value.”

Let’s finally fit some Bayesian models!

## Setup

library(brms)
library(tidyverse)
library(emmeans)
library(tidybayes)
library(easystats)
library(logspline)

Set plotting theme.

theme_set(theme_minimal())

Set brms options (back-end and cores).

options(brms.backend = 'cmdstanr', mc.cores = 4)

## The data

Read the simulated yield dataset from CSV on GitHub

yield_data <- read_csv('https://github.com/qdread/brms-crash-course/raw/main/data/yield_data.csv')

### Examine the data

• field: text ID for each field where responses were measured ('field1' through 'field25')
• plot: text ID for each plot, 40 plots in each field ('plot1' through 'plot40')
• variety: text ID identifying variety ('short' and 'tall')
• rainfall: numeric variable. Same for all plots in a given field
• soilN: numeric variable (soil nitrogen). Unique value for each plot
• yield: numeric variable measured at plot level (outcome variable)

## Exploratory plots

• Look at relationship between yield and soilN
• I will not explain ggplot2 code for now
• Soil N values are jittered in x direction
(yield_vs_N <- ggplot(data = yield_data, aes(x = soilN, y = yield)) +
geom_point(size = 1.2,
alpha = .8,
position = position_jitter(width = .2, height = 0)) +
ggtitle('Yield vs. soil N'))

yield_vs_N +
geom_smooth(method = lm, se = FALSE) +
ggtitle('Yield vs. soil N', subtitle = 'overall trendline')

Take multilevel structure of data into account (color by field).

(yield_vs_N_colored <- ggplot(data = yield_data, aes(x = soilN, y = yield, color = field, group = field)) +
geom_point(size = 1.2,
alpha = .8,
position = position_jitter(width = .2, height = 0)) +
theme(legend.position = 'none') +
ggtitle('Yield vs. soil N', 'points colored by field'))

yield_vs_N_colored +
geom_smooth(method = lm, se = FALSE, linewidth = 0.7, alpha = 0.8) +
ggtitle('Yield vs. soil N', 'trendline by field')

## Fitting models

For reference this is mixed model syntax from lme4 package:

lmer(yield ~ 1 + (1 | field), data = yield_data)
• Dependent or response variable (yield) on left side
• Tilde ~ separates dependent from independent variables
• Here the only fixed effect is the global intercept (1)
• Random effects specification ((1 | field)) has a design side (on the left hand) and group side (on the right hand) separated by |.
• In this case, the 1 on the design side means only fit random intercepts and no random slopes
• field on the group side means each field will get its own random intercept

### Our first Bayesian multilevel model!

fit_interceptonly <- brm(yield ~ 1 + (1 | field),
data = yield_data,
chains = 2,
iter = 200,
warmup = 100,
init = 'random')
• Same formula as lme4 but with some extra instructions for the HMC sampler
• Number of Markov chains
• Iterations for each chain
• How many iterations to discard as warmup
• Random initial values
• No priors specified, so defaults are used

### Model output

summary(fit_interceptonly)
• Rhat > 1.05 for some parameters
• Rhat indicates convergence of MCMC chains, approaching 1 at convergence
• Rhat < 1.01 is ideal
plot(fit_interceptonly)

Posterior distributions and trace plots for

• the fixed effect intercept (b_Intercept)
• the standard deviation of the random field intercepts (sd_field__Intercept)
• the standard deviation of the model’s residuals (sigma)

### Dealing with convergence problems

• Warning says either increase iterations or set stronger priors
• Increase iterations to 1000 warmup, 1000 sampling per chain (2000 total)
• update() lets us draw more samples without recompiling code
fit_interceptonly_moresamples <- update(fit_interceptonly, chains = 2, iter = 2000, warmup = 1000)

plot(fit_interceptonly_moresamples)

summary(fit_interceptonly_moresamples)

## Credible intervals

• What is the “95% CI” thing on the parameter summaries?
• credible interval, not confidence interval
• more direct interpretation than confidence interval:
• We are 95% sure the parameter’s value is in the 95% credible interval
• based on quantiles of the posterior distribution

### Calculating credible intervals

Median and 90% quantile-based credible interval (QI) of the intercept

post_samples <- as_draws_df(fit_interceptonly_moresamples)
post_samples_intercept <- post_samples$b_Intercept median(post_samples_intercept) quantile(post_samples_intercept, c(0.05, 0.95)) as_draws_df() gets all posterior samples for all parameters and puts them into a data frame • Literally anything in a Bayesian model has a posterior distribution, so anything can have a credible interval! • In frequentist models, you have to do bootstrapping to get that kind of interval on most quantities ## Variance decomposition • Proportion of variation at different nested levels • Calculate the ratio of variance within fields to between fields • variance_decomposition() from performance package also gives us a credible interval on the variance ratio variance_decomposition(fit_interceptonly_moresamples, ci = 0.99) • Variance ratio is much greater than zero so there is a need for a multilevel model • But I would recommend using one anyway, if that’s the way your study was designed ## Mixed-effects model with first-level predictors • So far we have only calculated mean of yield and random variation by field • But what factors influence yield at the plot level? • Add first-level predictors (that vary by plot) as fixed effects • variety • soil N • Fixed-effect part of model formula is 1 + variety + soilN • No random slope (effect of variety and soil N on yield is the same in each field) • Still using default priors fit_fixedslopes <- brm(yield ~ 1 + variety + soilN + (1 | field), data = yield_data, chains = 4, iter = 2000, warmup = 1000, seed = 807, file = 'fit_fixedslopes') seed sets a random seed for reproducibility, and file creates a .rds file in the working directory so you can reload the model later without rerunning. summary(fit_fixedslopes) • Low Rhat (the model converged) • Posterior distribution mass for fixed effects is well above zero • sigma (SD of residuals) is smaller than before because we’re explaining more variation ## Modifying priors • prior_summary() shows what priors were used to fit the model prior_summary(fit_fixedslopes) • t-distributions on intercept, random effect SD, and residual SD (sigma) • mean of intercept prior is the mean of the data • mean of the variance parameters is 0 but lower bound is 0 (half bell curves) ### Priors on fixed effect slopes • By default they are flat • Assigns equal prior probability to any possible value • 0 is as probable as 100000 which is as probable as -55555, etc. • Not very plausible • It is OK in this case because the model converged, but often it helps convergence to use priors that only allow “reasonable” values ### Refitting with reasonable fixed-effect priors • normal(0, 5) is a good choice • Mean of 0 means we think that positive effects are just as likely as negative • SD of 5 means we are still assigning pretty high probability to large effect sizes • Use prior() to assign a prior to each class of parameters fit_fixedslopes_priors <- brm(yield ~ 1 + variety + soilN + (1 | field), data = yield_data, prior = c( prior(normal(0,5), class = b) ), chains = 4, iter = 2000, warmup = 1000, seed = 811, file = 'fit_fixedslopes_priors') summary(fit_fixedslopes_priors) • Basically no effect on the results or the performance of the HMC sampler • But it’s something to be mindful of in the future! ### Posterior predictive check • pp_check() is a useful diagnostic for how well the model fits the data pp_check(fit_fixedslopes_priors) • black line: density plot of observed data • blue lines: density plot of predicted data from 10 random draws from the posterior distribution • Because these are simulated data, it looks very good in this case ### Plotting posterior estimates • summary() only gives us the median and 95% credible interval • We can work with the full uncertainty distribution (nothing special about 95%) • Functions from tidybayes used to make tables and plots • gather_draws() makes a data frame from the posterior samples of parameters that we choose posterior_slopes <- gather_draws(fit_fixedslopes_priors, b_varietytall, b_soilN) • median_qi() gives us median and quantiles of the parameters posterior_slopes %>% median_qi(.width = c(.66, .95, .99)) • Special extensions to ggplot2 for plotting quantiles of posterior distribution • I also include a dotted line at zero for comparison ggplot(posterior_slopes, aes(y = .variable, x = .value)) + stat_halfeye(.width = c(.8, .95)) + geom_vline(xintercept = 0, linetype = 'dashed', linewidth = 1) ggplot(posterior_slopes, aes(y = .variable, x = .value)) + stat_interval() + stat_summary(fun = median, geom = 'point', size = 2) + scale_color_brewer(palette = 'Blues') + geom_vline(xintercept = 0, linetype = 'dashed', linewidth = 1) ## Model with first-level and second-level predictors • Variety and soil N vary by plot (first-level predictors) • Rainfall is shared by all plots in the same field (second-level predictor) • The same syntax is used • Fixed-effect part is now 1 + variety + soilN + rainfall fit_fixed12 <- brm(yield ~ 1 + variety + soilN + rainfall + (1 | field), data = yield_data, prior = c( prior(normal(0, 5), class = b) ), chains = 4, iter = 2000, warmup = 1000, seed = 703, file = 'fit_fixed12') Look at trace plots, posterior predictive check, and model summary. • What do they show? ## Interactions between fixed effects • You can add interaction terms separated by : • example: 1 + variety + soilN + rainfall + variety:soilN + soilN:rainfall • Interactions can be within level (like variety:soilN) or between levels (like soilN:rainfall) ## Model with random slopes • So far we’ve assumed any predictor’s effect is the same in every field (only intercept varies, not slope) • Add random slope term to allow both intercept and slope to vary • Specify a random slope by adding the appropriate slope to the design side of the random effect specification • random intercept only (1 | field) • random intercept and random slope with respect to soil N (1 + soilN | field) • random intercept and random slope with respect to variety and soil N (1 + variety + soilN | field) • Doesn’t make sense to have a random slope for rainfall because all plots in the same field have the same fit_randomslopes <- brm(yield ~ 1 + variety + soilN + rainfall + (1 + variety + soilN | field), data = yield_data, prior = c( prior(normal(0, 5), class = b) ), chains = 4, iter = 4000, warmup = 3000, seed = 777, file = 'fit_randomslopes') Note: this model may take a minute or two to sample • Look at trace plots, pp_check, and model summary • We see some divergent transitions, maybe because we are fitting a model that is too complicated • Standard deviation of random slopes for variety is very small • We can probably omit the random slope for variety from the model • To confirm, plot field-level slopes for variety (sum of fixed + random) with QI intervals variety_slopes <- spread_draws(fit_randomslopes, b_varietytall, r_field[field,variable]) %>% filter(variable == 'varietytall') %>% mutate(slope = b_varietytall + r_field) variety_slopes %>% median_qi(slope, .width = c(.66, .90, .95)) %>% ggplot(aes(y = field, x = slope, xmin = .lower, xmax = .upper)) + geom_interval() + geom_point(size = 2) + scale_color_brewer(palette = 'Blues') • The field-level effects of variety are all very similar • Refit the model without the random slope for variety, only for soil N fit_soilNrandomslope <- brm(yield ~ 1 + variety + soilN + rainfall + (1 + soilN | field), data = yield_data, prior = c( prior(normal(0, 5), class = b) ), chains = 4, iter = 4000, warmup = 3000, seed = 888, file = 'fit_soilNrandomslope') ## Comparing models with information criteria • Leave-one-out (LOO) cross-validation compares models • How well does a model fit to all data points but one predict the one remaining data point? • First use add_criterion() to compute the LOO criterion for each model • Then use loo_compare() to rank the models fit_interceptonly_moresamples <- add_criterion(fit_interceptonly_moresamples, 'loo') fit_fixedslopes_priors <- add_criterion(fit_fixedslopes_priors, 'loo') fit_fixed12 <- add_criterion(fit_fixed12, 'loo') fit_randomslopes <- add_criterion(fit_randomslopes, 'loo') fit_soilNrandomslope <- add_criterion(fit_soilNrandomslope, 'loo') loo_compare(fit_interceptonly_moresamples, fit_fixedslopes_priors, fit_fixed12, fit_randomslopes, fit_soilNrandomslope) • Models ranked by ELPD (expected log pointwise predictive density) • The best one always has 0 and the others are ranked relative to it • The random slope models do equally well, so we can prefer the simpler one • The random intercept models do quite a bit worse • The model with no fixed effects does very poorly ## Assessing evidence with Bayes Factors • Bayesian analogue of a p-value • Ratio of evidence between two models: $\frac{P(model_1|data)}{P(model_2|data)}$ • Ranges from 0 to infinity • BF = 1 means equal evidence, BF > 1 means more evidence for model 1 • No “significance” threshold but BF > 10 is usually called strong evidence ### Bayes factors for each parameter • R package bayestestR lets us compute BF for each parameter • Ratio of evidence for posterior : evidence for prior • Our prior distributions were centered at 0 so they are like “null hypotheses” • BF = 1 means we did not change our belief about the parameter at all from the prior, after seeing the data • BF > 1 means we’ve changed our belief about the parameter after seeing the data • BF < 1 means we have even stronger evidence that the prior is true, after seeing the data bayesfactor_parameters(fit_soilNrandomslope) • BF < 1 for the intercept • BF >> 1000 for all other fixed effects • WARNING: BFs are very sensitive to your choice of prior ## Making prediction plots • To finish, here are some examples of plots you can make to visualize model output • Posterior expected values given different values of the predictors • conditional_effects() is a quick way to plot all fixed effects with 95% credible intervals conditional_effects(fit_extravrandomslope) • We can also incorporate random effects into the prediction • This plot shows the “shrinkage” effect of a mixed model plot(conditional_effects(fit_soilNrandomslope, effects="soilN:field", re_formula = NULL), line_args=list(linewidth=1.2, alpha = 0.2), theme = theme(legend.position = 'none')) • Customized plots can be made with packages emmeans and tidybayes • emmeans() gives you the expected value at specific levels of fixed predictors variety_emmeans <- emmeans(fit_soilNrandomslope, ~ variety) gather_emmeans_draws(variety_emmeans) %>% ggplot(aes(x = .value, y = variety)) + stat_interval(.width = c(.66, .95, .99)) + stat_summary(fun = median, geom = 'point', size = 2) + scale_color_brewer(palette = 'Greens') + ggtitle('posterior expected value by variety', 'averaged across soil N and rainfall') • add_epred_draws() gives you the expected value at levels of categorical and continuous fixed effects • Trend of yield versus soil N for each sex, at average value of rainfall • Averages across field-level random effects expand_grid(rainfall = mean(yield_data$rainfall),
soilN = seq(2, 9, by = .5),
variety = c('short', 'tall')) %>%
add_epred_draws(fit_soilNrandomslope, re_formula = ~ 0) %>%
ggplot(aes(x = soilN, y = .epred, group = variety, color = variety)) +
stat_lineribbon(.width = c(.66, .95, .99)) +
scale_fill_grey() +
labs(y = 'posterior expected value of yield') +
ggtitle('posterior expected value by soil N + variety', 'at average value of rainfall')

## Conclusion

What did we learn? Let’s revisit the learning objectives!

## Conceptual learning objectives

You now understand…

• The basics of Bayesian inference
• Definition of prior, likelihood, and posterior
• How Markov Chain Monte Carlo works
• What a credible interval is

## Practical skills

You now can…

• Write brms code to fit a multilevel model with random intercepts and random slopes
• Diagnose and deal with convergence problems
• Interpret brms output
• Compare models with LOO information criteria
• Use Bayes factors to assess strength of evidence for effects
• Make plots of model parameters and predictions with credible intervals

Congratulations, you are now Bayesians!

• See text version of lesson for further reading and useful resources
• Please send any feedback to quentin.read@usda.gov!