A crash course in Bayesian mixed models with brms

What is this class?

  • A brief and practical introduction to fitting Bayesian multilevel models in R and Stan
  • Using brms (Bayesian Regression Models using Stan)
  • Quick intro to Bayesian inference
  • Mostly practical skills

Minimal prerequisites

  • Know what mixed-effects or multilevel model is
  • A little experience with stats and/or data science in R
  • Vague knowledge of what Bayesian stats are

Advanced prerequisites

  • Knowing about the lme4 package will help
  • Knowing about tidyverse and ggplot2 will help

How to follow the course

  • Slides and text version of lessons are online
  • Fill in code in the worksheet (replace ... with code)
  • You can always copy and paste code from text version of lesson if you fall behind

Conceptual learning objectives

At the end of this course, you will understand …

  • The basics of Bayesian inference
  • What a prior, likelihood, and posterior are
  • The basics of how Markov Chain Monte Carlo works
  • What a credible interval is

Practical learning objectives

At the end of this course, you will be able to …

  • Write brms code to fit a multilevel model with random intercepts and random slopes
  • Diagnose and deal with convergence problems
  • Interpret brms output
  • Compare models with LOO information criteria
  • Use Bayes factors to assess strength of evidence for effects
  • Make plots of model parameters and predictions with credible intervals

What is Bayesian inference?

What is Bayesian inference?

A method of statistical inference that allows you to use information you already know to assign a prior probability to a hypothesis, then update the probability of that hypothesis as you get more information

  • Used in many disciplines and fields
  • We’re going to look at how to use it to estimate parameters of statistical models to analyze scientific data
  • Powerful, user-friendly, open-source software is making it easier for everyone to go Bayesian

Bayes’ Theorem

  • Thomas Bayes, 1763
  • Pierre-Simon Laplace, 1774

Bayes’ Theorem

\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]

  • How likely an event is to happen based on our prior knowledge about conditions related to that event
  • The conditional probability of an event A occurring, conditioned on the probability of another event B occurring

Bayes’ Theorem

\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]

The probability of A being true given that B is true (\(P(A|B)\))
is equal to the probability that B is true given that A is true (\(P(B|A)\))
times the ratio of probabilities that A and B are true (\(\frac{P(A)}{P(B)}\))

Bayes’ theorem and statistical inference

  • Let’s say \(A\) is a statistical model (a hypothesis about the world)
  • How probable is it that our hypothesis is true?
  • \(P(A)\): prior probability that we assign based on our subjective knowledge before we get any data

Bayes’ theorem and statistical inference

  • We go out and get some data \(B\)
  • \(P(B|A)\): likelihood is the probability of observing that data if our model \(A\) is true
  • Use the likelihood to update our estimate of probability of our model
  • \(P(A|B)\): posterior probability that model \(A\) is true, given that we observed \(B\).

Bayes’ theorem and statistical inference

\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]

  • What about \(P(B)\)?
  • marginal probability, the probability of the data
  • Basically just a normalizing constant
  • If we are comparing two models with the same data, the two \(P(B)\)s cancel out

Restating Bayes’ theorem

\[P(model|data) \propto P(data|model)P(model)\]

\[posterior = likelihood \times prior\]

what we believed before about the world (prior) × how much our new data changes our beliefs (likelihood) = what we believe now about the world (posterior)

Example

  • Find a coin on the street. What is our prior estimate of the probability of flipping heads?
  • Now we flip 10 times and get 8 heads. What is our belief now?
  • Probably doesn’t change much because we have a strong prior and the likelihood of probability = 0.5 is still high enough even if we see 8/10 heads
  • Shady character on the street shows us a coin and offers to flip it. He will pay $1 for each tails if we pay $1 for each heads
  • What is our prior estimate of the probability?
  • He flips 10 times and gets 8 heads. What’s our belief now?
  • In classical “frequentist” analysis we cannot incorporate prior information into the analysis
  • In each case our point estimate of the probability would be 0.8

Bayes is computationally intensive

  • \(P(data|model)\), the likelihood, is needed to get \(P(model|data)\), the posterior
  • But the “model” is not just one parameter, it might be 100s or 1000s of parameters
  • Need to integrate a probability distribution with 100s or 1000s of dimensions
  • For many years, this was computationally not possible

Markov Chain Monte Carlo (MCMC)

  • Class of algorithms for sampling from probability distributions
  • The longer the chain runs, the closer it gets to the true distribution
  • In Bayesian inference, we run multiple Markov chains for a preset number of samples
  • Discard the initial samples (warmup)
  • What remains is our estimate of the posterior distribution

Hamiltonian Monte Carlo (HMC) and Stan

  • HMC is the fastest and most efficient MCMC algorithm that has ever been developed
  • It’s implemented in software called Stan

What is brms?

  • An easy way to fit Bayesian mixed models using Stan in R
  • Syntax of brms models is just like lme4
  • Runs a Stan model behind the scenes
  • Automatically assigns sensible priors and does lots of tricks to speed up HMC convergence

Why use Bayes?

  • Some models just can’t be fit with frequentist maximum-likelihood methods
  • Estimate how big effects are instead of yes-or-no framework of rejecting a null hypothesis
  • We can say “the probability of something being between a and b is 95%” instead of “if we ran this experiment many times, 95% of the confidence intervals would contain this value.”

Let’s finally fit some Bayesian models!

Setup

Load packages.

library(brms)
library(tidyverse)
library(emmeans)
library(tidybayes)
library(easystats)
library(logspline)

Set plotting theme.

theme_set(theme_minimal())

Set brms options (back-end and cores).

options(brms.backend = 'cmdstanr', mc.cores = 4)

The data

Read the simulated yield dataset from CSV on GitHub

yield_data <- read_csv('https://github.com/qdread/brms-crash-course/raw/main/data/yield_data.csv')

Examine the data

  • field: text ID for each field where responses were measured ('field1' through 'field25')
  • plot: text ID for each plot, 40 plots in each field ('plot1' through 'plot40')
  • variety: text ID identifying variety ('short' and 'tall')
  • rainfall: numeric variable. Same for all plots in a given field
  • soilN: numeric variable (soil nitrogen). Unique value for each plot
  • yield: numeric variable measured at plot level (outcome variable)

Exploratory plots

  • Look at relationship between yield and soilN
  • I will not explain ggplot2 code for now
  • Soil N values are jittered in x direction
(yield_vs_N <- ggplot(data = yield_data, aes(x = soilN, y = yield)) +
  geom_point(size = 1.2,
             alpha = .8,
             position = position_jitter(width = .2, height = 0)) +
  ggtitle('Yield vs. soil N'))

Add trendline.

yield_vs_N +
  geom_smooth(method = lm, se = FALSE) +
  ggtitle('Yield vs. soil N', subtitle = 'overall trendline')

Take multilevel structure of data into account (color by field).

(yield_vs_N_colored <- ggplot(data = yield_data, aes(x = soilN, y = yield, color = field, group = field)) +
   geom_point(size = 1.2,
              alpha = .8,
              position = position_jitter(width = .2, height = 0)) +
   theme(legend.position = 'none') +
   ggtitle('Yield vs. soil N', 'points colored by field'))

Add field-level trendlines.

yield_vs_N_colored + 
  geom_smooth(method = lm, se = FALSE, linewidth = 0.7, alpha = 0.8) +
  ggtitle('Yield vs. soil N', 'trendline by field')

Fitting models

For reference this is mixed model syntax from lme4 package:

lmer(yield ~ 1 + (1 | field), data = yield_data)
  • Dependent or response variable (yield) on left side
  • Tilde ~ separates dependent from independent variables
  • Here the only fixed effect is the global intercept (1)
  • Random effects specification ((1 | field)) has a design side (on the left hand) and group side (on the right hand) separated by |.
  • In this case, the 1 on the design side means only fit random intercepts and no random slopes
  • field on the group side means each field will get its own random intercept

Our first Bayesian multilevel model!

fit_interceptonly <- brm(yield ~ 1 + (1 | field),
                         data = yield_data,
                         chains = 2,
                         iter = 200,
                         warmup = 100,
                         init = 'random')
  • Same formula as lme4 but with some extra instructions for the HMC sampler
    • Number of Markov chains
    • Iterations for each chain
    • How many iterations to discard as warmup
    • Random initial values
    • No priors specified, so defaults are used

Model output

summary(fit_interceptonly)
  • Warning about convergence
  • Rhat > 1.05 for some parameters
  • Rhat indicates convergence of MCMC chains, approaching 1 at convergence
  • Rhat < 1.01 is ideal
plot(fit_interceptonly)

Posterior distributions and trace plots for

  • the fixed effect intercept (b_Intercept)
  • the standard deviation of the random field intercepts (sd_field__Intercept)
  • the standard deviation of the model’s residuals (sigma)

Dealing with convergence problems

  • Warning says either increase iterations or set stronger priors
  • Increase iterations to 1000 warmup, 1000 sampling per chain (2000 total)
  • update() lets us draw more samples without recompiling code
fit_interceptonly_moresamples <- update(fit_interceptonly, chains = 2, iter = 2000, warmup = 1000)

plot(fit_interceptonly_moresamples)

summary(fit_interceptonly_moresamples)

Credible intervals

  • What is the “95% CI” thing on the parameter summaries?
  • credible interval, not confidence interval
  • more direct interpretation than confidence interval:
    • We are 95% sure the parameter’s value is in the 95% credible interval
  • based on quantiles of the posterior distribution

Calculating credible intervals

Median and 90% quantile-based credible interval (QI) of the intercept

post_samples <- as_draws_df(fit_interceptonly_moresamples)
post_samples_intercept <- post_samples$b_Intercept

median(post_samples_intercept)
quantile(post_samples_intercept, c(0.05, 0.95))

as_draws_df() gets all posterior samples for all parameters and puts them into a data frame

  • Literally anything in a Bayesian model has a posterior distribution, so anything can have a credible interval!
  • In frequentist models, you have to do bootstrapping to get that kind of interval on most quantities

Variance decomposition

  • Proportion of variation at different nested levels
  • Calculate the ratio of variance within fields to between fields
  • variance_decomposition() from performance package also gives us a credible interval on the variance ratio
variance_decomposition(fit_interceptonly_moresamples, ci = 0.99)
  • Variance ratio is much greater than zero so there is a need for a multilevel model
  • But I would recommend using one anyway, if that’s the way your study was designed

Mixed-effects model with first-level predictors

  • So far we have only calculated mean of yield and random variation by field
  • But what factors influence yield at the plot level?
  • Add first-level predictors (that vary by plot) as fixed effects
    • variety
    • soil N
  • Fixed-effect part of model formula is 1 + variety + soilN
  • No random slope (effect of variety and soil N on yield is the same in each field)
  • Still using default priors
fit_fixedslopes <- brm(yield ~ 1 + variety + soilN + (1 | field),  
                       data = yield_data, 
                       chains = 4, iter = 2000, warmup = 1000,
                       seed = 807, file = 'fit_fixedslopes')

seed sets a random seed for reproducibility, and file creates a .rds file in the working directory so you can reload the model later without rerunning.

summary(fit_fixedslopes)
  • Low Rhat (the model converged)
  • Posterior distribution mass for fixed effects is well above zero
  • sigma (SD of residuals) is smaller than before because we’re explaining more variation

Modifying priors

  • prior_summary() shows what priors were used to fit the model
prior_summary(fit_fixedslopes)
  • t-distributions on intercept, random effect SD, and residual SD (sigma)
  • mean of intercept prior is the mean of the data
  • mean of the variance parameters is 0 but lower bound is 0 (half bell curves)

Priors on fixed effect slopes

  • By default they are flat
  • Assigns equal prior probability to any possible value
  • 0 is as probable as 100000 which is as probable as -55555, etc.
  • Not very plausible
  • It is OK in this case because the model converged, but often it helps convergence to use priors that only allow “reasonable” values

Refitting with reasonable fixed-effect priors

  • normal(0, 5) is a good choice
  • Mean of 0 means we think that positive effects are just as likely as negative
  • SD of 5 means we are still assigning pretty high probability to large effect sizes
  • Use prior() to assign a prior to each class of parameters
fit_fixedslopes_priors <- brm(yield ~ 1 + variety + soilN + (1 | field),
                              data = yield_data,
                              prior = c(
                                prior(normal(0,5), class = b)
                              ),
                              chains = 4, iter = 2000, warmup = 1000,
                              seed = 811, file = 'fit_fixedslopes_priors')
summary(fit_fixedslopes_priors)
  • Basically no effect on the results or the performance of the HMC sampler
  • But it’s something to be mindful of in the future!

Posterior predictive check

  • pp_check() is a useful diagnostic for how well the model fits the data
pp_check(fit_fixedslopes_priors)
  • black line: density plot of observed data
  • blue lines: density plot of predicted data from 10 random draws from the posterior distribution
  • Because these are simulated data, it looks very good in this case

Plotting posterior estimates

  • summary() only gives us the median and 95% credible interval
  • We can work with the full uncertainty distribution (nothing special about 95%)
  • Functions from tidybayes used to make tables and plots
  • gather_draws() makes a data frame from the posterior samples of parameters that we choose
posterior_slopes <- gather_draws(fit_fixedslopes_priors, b_varietytall, b_soilN)
  • median_qi() gives us median and quantiles of the parameters
posterior_slopes %>%
  median_qi(.width = c(.66, .95, .99))
  • Special extensions to ggplot2 for plotting quantiles of posterior distribution
  • I also include a dotted line at zero for comparison
ggplot(posterior_slopes, aes(y = .variable, x = .value)) +
  stat_halfeye(.width = c(.8, .95)) +
  geom_vline(xintercept = 0, linetype = 'dashed', linewidth = 1)
ggplot(posterior_slopes, aes(y = .variable, x = .value)) +
  stat_interval() +
  stat_summary(fun = median, geom = 'point', size = 2) +
  scale_color_brewer(palette = 'Blues') +
  geom_vline(xintercept = 0, linetype = 'dashed', linewidth = 1)

Model with first-level and second-level predictors

  • Variety and soil N vary by plot (first-level predictors)
  • Rainfall is shared by all plots in the same field (second-level predictor)
  • The same syntax is used
  • Fixed-effect part is now 1 + variety + soilN + rainfall
fit_fixed12 <- brm(yield ~ 1 + variety + soilN + rainfall + (1 | field),
                   data = yield_data,
                   prior = c(
                     prior(normal(0, 5), class = b)
                   ),
                   chains = 4, iter = 2000, warmup = 1000,
                   seed = 703, file = 'fit_fixed12')

Look at trace plots, posterior predictive check, and model summary.

  • What do they show?

Interactions between fixed effects

  • You can add interaction terms separated by :
  • example: 1 + variety + soilN + rainfall + variety:soilN + soilN:rainfall
  • Interactions can be within level (like variety:soilN) or between levels (like soilN:rainfall)

Model with random slopes

  • So far we’ve assumed any predictor’s effect is the same in every field (only intercept varies, not slope)
  • Add random slope term to allow both intercept and slope to vary
  • Specify a random slope by adding the appropriate slope to the design side of the random effect specification
    • random intercept only (1 | field)
    • random intercept and random slope with respect to soil N (1 + soilN | field)
    • random intercept and random slope with respect to variety and soil N (1 + variety + soilN | field)
  • Doesn’t make sense to have a random slope for rainfall because all plots in the same field have the same
fit_randomslopes <- brm(yield ~ 1 + variety + soilN + rainfall + (1 + variety + soilN | field),
                        data = yield_data,
                        prior = c(
                          prior(normal(0, 5), class = b)
                        ),
                        chains = 4, iter = 4000, warmup = 3000,
                        seed = 777, file = 'fit_randomslopes')

Note: this model may take a minute or two to sample

  • Look at trace plots, pp_check, and model summary
  • We see some divergent transitions, maybe because we are fitting a model that is too complicated
  • Standard deviation of random slopes for variety is very small
  • We can probably omit the random slope for variety from the model
  • To confirm, plot field-level slopes for variety (sum of fixed + random) with QI intervals
variety_slopes <- spread_draws(fit_randomslopes, b_varietytall, r_field[field,variable]) %>%
  filter(variable == 'varietytall') %>%
  mutate(slope = b_varietytall + r_field)

variety_slopes %>% 
  median_qi(slope, .width = c(.66, .90, .95)) %>%
  ggplot(aes(y = field, x = slope, xmin = .lower, xmax = .upper)) +
  geom_interval() +
  geom_point(size = 2) +
  scale_color_brewer(palette = 'Blues')
  • The field-level effects of variety are all very similar
  • Refit the model without the random slope for variety, only for soil N
fit_soilNrandomslope <- brm(yield ~ 1 + variety + soilN + rainfall + (1 + soilN | field),
                            data = yield_data, 
                            prior = c(
                              prior(normal(0, 5), class = b)
                            ),
                            chains = 4, iter = 4000, warmup = 3000,
                            seed = 888, file = 'fit_soilNrandomslope')

Comparing models with information criteria

  • Leave-one-out (LOO) cross-validation compares models
  • How well does a model fit to all data points but one predict the one remaining data point?
  • First use add_criterion() to compute the LOO criterion for each model
  • Then use loo_compare() to rank the models
fit_interceptonly_moresamples <- add_criterion(fit_interceptonly_moresamples, 'loo')
fit_fixedslopes_priors <- add_criterion(fit_fixedslopes_priors, 'loo')
fit_fixed12 <- add_criterion(fit_fixed12, 'loo')
fit_randomslopes <- add_criterion(fit_randomslopes, 'loo')
fit_soilNrandomslope <- add_criterion(fit_soilNrandomslope, 'loo')
loo_compare(fit_interceptonly_moresamples, fit_fixedslopes_priors, fit_fixed12, fit_randomslopes, fit_soilNrandomslope)
  • Models ranked by ELPD (expected log pointwise predictive density)
  • The best one always has 0 and the others are ranked relative to it
  • The random slope models do equally well, so we can prefer the simpler one
  • The random intercept models do quite a bit worse
  • The model with no fixed effects does very poorly

Assessing evidence with Bayes Factors

  • Bayesian analogue of a p-value
  • Ratio of evidence between two models: \(\frac{P(model_1|data)}{P(model_2|data)}\)
  • Ranges from 0 to infinity
  • BF = 1 means equal evidence, BF > 1 means more evidence for model 1
  • No “significance” threshold but BF > 10 is usually called strong evidence

Bayes factors for each parameter

  • R package bayestestR lets us compute BF for each parameter
  • Ratio of evidence for posterior : evidence for prior
  • Our prior distributions were centered at 0 so they are like “null hypotheses”
  • BF = 1 means we did not change our belief about the parameter at all from the prior, after seeing the data
  • BF > 1 means we’ve changed our belief about the parameter after seeing the data
  • BF < 1 means we have even stronger evidence that the prior is true, after seeing the data
bayesfactor_parameters(fit_soilNrandomslope)
  • BF < 1 for the intercept
  • BF >> 1000 for all other fixed effects
  • WARNING: BFs are very sensitive to your choice of prior

Making prediction plots

  • To finish, here are some examples of plots you can make to visualize model output
  • Posterior expected values given different values of the predictors
  • conditional_effects() is a quick way to plot all fixed effects with 95% credible intervals
conditional_effects(fit_extravrandomslope)
  • We can also incorporate random effects into the prediction
  • This plot shows the “shrinkage” effect of a mixed model
plot(conditional_effects(fit_soilNrandomslope, effects="soilN:field", re_formula = NULL), 
     line_args=list(linewidth=1.2, alpha = 0.2), theme = theme(legend.position = 'none'))
  • Customized plots can be made with packages emmeans and tidybayes
  • emmeans() gives you the expected value at specific levels of fixed predictors
variety_emmeans <- emmeans(fit_soilNrandomslope, ~ variety)

gather_emmeans_draws(variety_emmeans) %>%
  ggplot(aes(x = .value, y = variety)) +
  stat_interval(.width = c(.66, .95, .99)) +
  stat_summary(fun = median, geom = 'point', size = 2) +
  scale_color_brewer(palette = 'Greens') +
  ggtitle('posterior expected value by variety', 'averaged across soil N and rainfall')
  • add_epred_draws() gives you the expected value at levels of categorical and continuous fixed effects
  • Trend of yield versus soil N for each sex, at average value of rainfall
  • Averages across field-level random effects
expand_grid(rainfall = mean(yield_data$rainfall),
            soilN = seq(2, 9, by = .5),
            variety = c('short', 'tall')) %>%
  add_epred_draws(fit_soilNrandomslope, re_formula = ~ 0) %>%
  ggplot(aes(x = soilN, y = .epred, group = variety, color = variety)) +
  stat_lineribbon(.width = c(.66, .95, .99)) +
  scale_fill_grey() +
  labs(y = 'posterior expected value of yield') +
  ggtitle('posterior expected value by soil N + variety', 'at average value of rainfall')

Conclusion

What did we learn? Let’s revisit the learning objectives!

Conceptual learning objectives

You now understand…

  • The basics of Bayesian inference
  • Definition of prior, likelihood, and posterior
  • How Markov Chain Monte Carlo works
  • What a credible interval is

Practical skills

You now can…

  • Write brms code to fit a multilevel model with random intercepts and random slopes
  • Diagnose and deal with convergence problems
  • Interpret brms output
  • Compare models with LOO information criteria
  • Use Bayes factors to assess strength of evidence for effects
  • Make plots of model parameters and predictions with credible intervals

Congratulations, you are now Bayesians!

  • See text version of lesson for further reading and useful resources
  • Please send any feedback to quentin.read@usda.gov!