# Lesson 5: Generalized linear mixed models

## Lesson 5 learning objectives

At the end of this lesson, students will …

• Know what a generalized linear mixed model is and why you might want to use one.
• Fit a generalized linear mixed model to binary outcome data.
• Fit a generalized linear mixed model to count outcome data.

## GLMMs

• Taking the log transformation of the response variable won’t always work
• The next step is to enter the world of generalized linear mixed models (GLMMs)
• General linear mixed model and generalized linear mixed model begin with the same letters, so it’s a little bit confusing!

## GLMM with binary response variable

• Let’s say we have response data that is binary
• For example, testing the ability of a vaccine to prevent some disease in animals
• Control group was not inoculated and treatment group was inoculated
• The response variable is whether or not the disease is present
• The animals are raised in different pens of 10 animals each in a complete block design
• We need to account for the effect of pen on disease using a random intercept

Load needed packages and dataset (simulated)

library(tidyverse)
library(lme4)
library(easystats)
disease <- read_csv('datasets/disease.csv')

## Summarize the data

• How many animals in each pen and treatment got the disease?
• Use group_by() and summarize() to make a table
disease %>%
group_by(inoc, pen) %>%
summarize(no_disease = sum(disease == 0), disease = sum(disease == 1))
• What if we tried to just use a linear mixed model as we have done before?
• We code “disease absent” as 0 and “disease present” as 1 to make it numeric
lmm_disease <- lmer(disease ~ inoc + (1|pen), data = disease)
• Seems reasonable at first glance
• The intercept is 0.78, so the baseline rate is 78% disease
• The coefficient on inoculation is -0.64, so inoculation treatment makes disease go down to 14%
• What is wrong with that?
• Let’s say we had a population where the baseline rate of disease was only 40%.
• Our model would predict -24% disease when inoculated
• It makes more sense to think about relative changes to odds.
• If the baseline rate is very low, it can’t be reduced much more
• If the baseline rate is high, it’s easier to reduce

## Diagnostics of linear model

check_model(lmm_disease)
• Residuals of this model do not really meet assumptions
• Not normally distributed with homogeneous variance
• Predicted values of the model go well outside the 0 to 1 range
• Very poor match between the data (green line) and predictions (blue lines)

## Solving the problem

• Model binary response with a generalized linear mode
• Use a link function to convert the predicted response value into a scale that can be normally distributed
• Predict the probability that an individual from each pen with or without the inoculation treatment will get the disease
• Probability ranges from 0 to 1, so it cannot really have normally distributed error

## GLMM formula

Formula for predicting the probability of disease for individual $i$ in pen $j$:

$\text{logit}~\hat{y}_{i,j} = \beta_0 + \beta_1 inoc_i + u_j$

• $\hat{y}_{i,j}$: the predicted probability of disease for individual $i$ in pen $j$
• $\beta_0$: overall intercept (fixed effect)
• $\beta_1$: treatment coefficient (fixed effect)
• $inoc_i$: binary x variable, 0 for uninoculated control and 1 for inoculated treatment
• $u_j$: intercept for each pen (random effect)

## What is that “logit” thing???

$\text{logit}~\hat{y}_{i,j} = \beta_0 + \beta_1 inoc_i + u_j$

• Logit, or log odds, is the link function

$\text{logit}~p = \log \frac {p}{1-p}$

• Transforms the probability (ranges from 0 to 1) to log-odds scale (can take any value, + or -)
• R function is qlogis()
• Inverse is plogis() (converts log odds back to probability)

## Graph of the logit function

• Maps the values from 0 to 1 to negative to positive infinity.
• More or less a straight line between about 0.25 and 0.75
• Starts to get steep as you get close to the boundaries

## Fit the logit model

• Binomial (aka logistic) GLMM with a logit link function
• Binomial means response variable can have two values (0 and 1, or no and yes)
• Now we use glmer() instead of lmer()
• Same model formula as with lmer()
• New argument, family, refers to the “family” of error distributions
• binomial(link = 'logit') used here, other links are possible
glmm_disease <- glmer(disease ~ inoc + (1|pen), data = disease, family = binomial(link = 'logit'))

## Diagnostics of GLMM

check_model(glmm_disease)
• Looks better, especially the predictive check
• Model only predicts values between 0 and 1
• Makes logical sense and matches the observations

## Exploring model summary

summary(glmm_disease)
• Intercept is 1.39 and treatment coefficient is -3.37. What does that mean???
• The coefficients are on the log odds scale and need to be back-transformed to probability scale

## Back-transforming intercept

• 1.39 is the fitted log odds of disease in uninoculated control, averaged across the random effect of pen
• Transform this log odds back to the probability scale ranging from 0 to 1
• Use the inverse logit function, plogis()
plogis(1.39)
 0.8005922
• Population-level prediction: a random individual who isn’t inoculated has ~80% probability of disease

## Back-transforming slope

• -3.37 is the change in the log odds between uninoculated and inoculated
• Add intercept and slope, then take the inverse logit
plogis(1.39 - 3.37)
 0.1213188
• Population-level prediction: random uninoculated individual has ~12% probability of disease
• We will not have to do these predictions “by hand” – wait for next lesson!

## Effects in GLMMs

• GLMM can have all the same kinds of effects as “plain” LMMs
• Continuous and categorical fixed effect predictors
• Interaction effects
• Crossed and nested random effects
• Random intercepts and random slopes

## GLMM with count response variable

• Response variable can only be a non-negative integer
• Discrete values, instead of continuous like a normal distribution
• Model cannot predict negative counts
• We can use GLMM with Poisson distribution with a log link function
• Poisson distribution is bounded over non-negative integers
• Ideal for count data (as long as there are not too many 0 values)

## Example count dataset

• From agridat package
• Experiment conducted by George Stirret and colleagues in Canada in 1935
• Use of fungal spores applied to corn plants to control European corn borer
• Four levels of the fungal spore treatment (trt column in the dataset):
• untreated control ("None")
• early fungal treatment ("Early")
• late fungal treatment ("Late")
• fungal spores were applied both early and late ("Both")
• 15 experimental blocks (block column), each containing four plots
• Number of borers per plot was counted on two different dates
• We will only consider the latest count (count2 column)

data('stirret.borers', package = 'agridat')

If you are running R locally and didn’t/can’t install the agridat package, I included the CSV in the example datasets

stirret.borers <- read_csv('datasets/stirret.borers.csv')

## Examine the data

Histogram of the data by treatment, with means as a line (don’t worry about the code for now)

borer_means <- stirret.borers %>%
group_by(trt) %>%
summarize(count2_mean = mean(count2))

ggplot(stirret.borers, aes(x = count2)) +
geom_histogram(bins = 10) +
geom_vline(aes(xintercept = count2_mean), data = borer_means, color = 'red') +
facet_wrap(~ trt) +
theme_bw()
• Quite a few counts of zero or close to it
• “Both” and “Late” treatments have the lowest mean counts, then “Early”, then “None”
• Rearrange the factor levels of the trt column so control level ’None is first
• It will be the reference level in the model, for easier interpretation
stirret.borers <- stirret.borers %>%
mutate(trt = factor(trt, levels = c('None', 'Early', 'Late', 'Both')))

## Fit as general linear model

• For proof of concept
lmm_borers <- lmer(count2 ~ trt + (1|block), data = stirret.borers)

## Fit as generalized linear model

• Use family = poisson(link = 'log')
• Log is the link function that works best with the Poisson distribution
glmm_borers <- glmer(count2 ~ trt + (1|block), data = stirret.borers, family = poisson(link = 'log'))

## Diagnostic plots

Compare LMM and Poisson GLMM diagnostics

check_model(lmm_borers)
check_model(glmm_borers)
• Residual diagnostics actually look acceptable for both models
• The count data are really not that far off from normal
• But LMM produces a lot of negative predictions
• GLMM allows us to make more meaningful predictions

## Coefficients of GLMM

summary(glmm_borers)
• Back-transform coefficient estimates from the link function scale back to the original scale of the data
• Coefficients on log scale so we use exponentiation (exp()) as the inverse
• Intercept (3.42) is the estimated mean for the “None” treatment (control group) on a log scale
exp(3.42)
 30.56942
• For other treatment groups, add coefficients to intercept and then exponentiate
• For example, “Early” treatment coefficient was -0.24
exp(3.42 - 0.24)
 24.04675

• we are barely scratching the surface of GLMMs
• glmer()` can only fit a small subset of GLMMs
• For example, your response variable might be a categorical outcome with more than two possibilities (multinomial GLMM)
• Or ordered categorical, such as disease ratings on a 1-5 scale

## Other topics for future workshops

• Repeated measures designs
• Spatial autocorrelation
• Overdispersion in count data
• Zero-inflation
• Different types of random effects (G-side versus R-side) and error structures