Lesson 5: Generalized linear mixed models

Lesson 5 learning objectives

At the end of this lesson, students will …

  • Know what a generalized linear mixed model is and why you might want to use one.
  • Fit a generalized linear mixed model to binary outcome data.
  • Fit a generalized linear mixed model to count outcome data.

GLMMs

  • Taking the log transformation of the response variable won’t always work
  • The next step is to enter the world of generalized linear mixed models (GLMMs)
  • General linear mixed model and generalized linear mixed model begin with the same letters, so it’s a little bit confusing!

GLMM with binary response variable

  • Let’s say we have response data that is binary
  • For example, testing the ability of a vaccine to prevent some disease in animals
  • Control group was not inoculated and treatment group was inoculated
  • The response variable is whether or not the disease is present
  • The animals are raised in different pens of 10 animals each in a complete block design
  • We need to account for the effect of pen on disease using a random intercept

Load needed packages and dataset (simulated)

library(tidyverse)
library(lme4)
library(easystats)
disease <- read_csv('datasets/disease.csv')

Summarize the data

  • How many animals in each pen and treatment got the disease?
  • Use group_by() and summarize() to make a table
disease %>%
  group_by(inoc, pen) %>%
  summarize(no_disease = sum(disease == 0), disease = sum(disease == 1))
  • What if we tried to just use a linear mixed model as we have done before?
  • We code “disease absent” as 0 and “disease present” as 1 to make it numeric
lmm_disease <- lmer(disease ~ inoc + (1|pen), data = disease)
  • Seems reasonable at first glance
  • The intercept is 0.78, so the baseline rate is 78% disease
  • The coefficient on inoculation is -0.64, so inoculation treatment makes disease go down to 14%
  • What is wrong with that?
  • Let’s say we had a population where the baseline rate of disease was only 40%.
  • Our model would predict -24% disease when inoculated
  • It makes more sense to think about relative changes to odds.
  • If the baseline rate is very low, it can’t be reduced much more
  • If the baseline rate is high, it’s easier to reduce

Diagnostics of linear model

check_model(lmm_disease)
  • Residuals of this model do not really meet assumptions
  • Not normally distributed with homogeneous variance
  • Predicted values of the model go well outside the 0 to 1 range
  • Very poor match between the data (green line) and predictions (blue lines)

Solving the problem

  • Model binary response with a generalized linear mode
  • Use a link function to convert the predicted response value into a scale that can be normally distributed
  • Predict the probability that an individual from each pen with or without the inoculation treatment will get the disease
  • Probability ranges from 0 to 1, so it cannot really have normally distributed error

GLMM formula

Formula for predicting the probability of disease for individual \(i\) in pen \(j\):

\[\text{logit}~\hat{y}_{i,j} = \beta_0 + \beta_1 inoc_i + u_j\]

  • \(\hat{y}_{i,j}\): the predicted probability of disease for individual \(i\) in pen \(j\)
  • \(\beta_0\): overall intercept (fixed effect)
  • \(\beta_1\): treatment coefficient (fixed effect)
  • \(inoc_i\): binary x variable, 0 for uninoculated control and 1 for inoculated treatment
  • \(u_j\): intercept for each pen (random effect)

What is that “logit” thing???

\[\text{logit}~\hat{y}_{i,j} = \beta_0 + \beta_1 inoc_i + u_j\]

  • Logit, or log odds, is the link function

\[\text{logit}~p = \log \frac {p}{1-p}\]

  • Transforms the probability (ranges from 0 to 1) to log-odds scale (can take any value, + or -)
  • R function is qlogis()
  • Inverse is plogis() (converts log odds back to probability)

Graph of the logit function

  • Maps the values from 0 to 1 to negative to positive infinity.
  • More or less a straight line between about 0.25 and 0.75
  • Starts to get steep as you get close to the boundaries

Fit the logit model

  • Binomial (aka logistic) GLMM with a logit link function
  • Binomial means response variable can have two values (0 and 1, or no and yes)
  • Now we use glmer() instead of lmer()
  • Same model formula as with lmer()
  • New argument, family, refers to the “family” of error distributions
  • binomial(link = 'logit') used here, other links are possible
glmm_disease <- glmer(disease ~ inoc + (1|pen), data = disease, family = binomial(link = 'logit'))

Diagnostics of GLMM

check_model(glmm_disease)
  • Looks better, especially the predictive check
  • Model only predicts values between 0 and 1
  • Makes logical sense and matches the observations

Exploring model summary

summary(glmm_disease)
  • Intercept is 1.39 and treatment coefficient is -3.37. What does that mean???
  • The coefficients are on the log odds scale and need to be back-transformed to probability scale

Back-transforming intercept

  • 1.39 is the fitted log odds of disease in uninoculated control, averaged across the random effect of pen
  • Transform this log odds back to the probability scale ranging from 0 to 1
  • Use the inverse logit function, plogis()
plogis(1.39)
[1] 0.8005922
  • Population-level prediction: a random individual who isn’t inoculated has ~80% probability of disease

Back-transforming slope

  • -3.37 is the change in the log odds between uninoculated and inoculated
  • Add intercept and slope, then take the inverse logit
plogis(1.39 - 3.37)
[1] 0.1213188
  • Population-level prediction: random uninoculated individual has ~12% probability of disease
  • We will not have to do these predictions “by hand” – wait for next lesson!

Effects in GLMMs

  • GLMM can have all the same kinds of effects as “plain” LMMs
    • Continuous and categorical fixed effect predictors
    • Interaction effects
    • Crossed and nested random effects
    • Random intercepts and random slopes

GLMM with count response variable

  • Response variable can only be a non-negative integer
  • Discrete values, instead of continuous like a normal distribution
  • Model cannot predict negative counts
  • We can use GLMM with Poisson distribution with a log link function
  • Poisson distribution is bounded over non-negative integers
  • Ideal for count data (as long as there are not too many 0 values)

Example count dataset

  • From agridat package
  • Experiment conducted by George Stirret and colleagues in Canada in 1935
  • Use of fungal spores applied to corn plants to control European corn borer
  • Four levels of the fungal spore treatment (trt column in the dataset):
    • untreated control ("None")
    • early fungal treatment ("Early")
    • late fungal treatment ("Late")
    • fungal spores were applied both early and late ("Both")
  • 15 experimental blocks (block column), each containing four plots
  • Number of borers per plot was counted on two different dates
  • We will only consider the latest count (count2 column)

Load the dataset

data('stirret.borers', package = 'agridat')

If you are running R locally and didn’t/can’t install the agridat package, I included the CSV in the example datasets

stirret.borers <- read_csv('datasets/stirret.borers.csv')

Examine the data

Histogram of the data by treatment, with means as a line (don’t worry about the code for now)

borer_means <- stirret.borers %>%
  group_by(trt) %>%
  summarize(count2_mean = mean(count2))

ggplot(stirret.borers, aes(x = count2)) +
  geom_histogram(bins = 10) +
  geom_vline(aes(xintercept = count2_mean), data = borer_means, color = 'red') +
  facet_wrap(~ trt) +
  theme_bw()
  • Quite a few counts of zero or close to it
  • “Both” and “Late” treatments have the lowest mean counts, then “Early”, then “None”
  • Rearrange the factor levels of the trt column so control level `’None`` is first
  • It will be the reference level in the model, for easier interpretation
stirret.borers <- stirret.borers %>%
  mutate(trt = factor(trt, levels = c('None', 'Early', 'Late', 'Both')))

Fit as general linear model

  • For proof of concept
lmm_borers <- lmer(count2 ~ trt + (1|block), data = stirret.borers)

Fit as generalized linear model

  • Use family = poisson(link = 'log')
  • Log is the link function that works best with the Poisson distribution
glmm_borers <- glmer(count2 ~ trt + (1|block), data = stirret.borers, family = poisson(link = 'log'))

Diagnostic plots

Compare LMM and Poisson GLMM diagnostics

check_model(lmm_borers)
check_model(glmm_borers)
  • Residual diagnostics actually look acceptable for both models
  • The count data are really not that far off from normal
  • But LMM produces a lot of negative predictions
  • GLMM allows us to make more meaningful predictions

Coefficients of GLMM

summary(glmm_borers)
  • Back-transform coefficient estimates from the link function scale back to the original scale of the data
  • Coefficients on log scale so we use exponentiation (exp()) as the inverse
  • Intercept (3.42) is the estimated mean for the “None” treatment (control group) on a log scale
exp(3.42)
[1] 30.56942
  • For other treatment groups, add coefficients to intercept and then exponentiate
  • For example, “Early” treatment coefficient was -0.24
exp(3.42 - 0.24)
[1] 24.04675

Hey! What about …

  • we are barely scratching the surface of GLMMs
  • glmer() can only fit a small subset of GLMMs
  • For example, your response variable might be a categorical outcome with more than two possibilities (multinomial GLMM)
  • Or ordered categorical, such as disease ratings on a 1-5 scale

Other topics for future workshops

  • Repeated measures designs
  • Spatial autocorrelation
  • Overdispersion in count data
  • Zero-inflation
  • Different types of random effects (G-side versus R-side) and error structures

And what about Bayes?

  • In many real world cases of data analysis, the “frequentist” methods we have covered today simply will not work
  • The best alternative is to use a Bayesian approach
  • I recommend the Bayesian R modeling packages brms and rstanarm
  • I am planning a workshop on brms for this coming year