At the end of this lesson, students will …

- Know what a generalized linear mixed model is and why you might want to use one.
- Fit a generalized linear mixed model to binary outcome data.
- Fit a generalized linear mixed model to count outcome data.

- Taking the log transformation of the response variable won’t always work
- The next step is to enter the world of general*ized* linear mixed models (GLMMs)
- General linear mixed model and generalized linear mixed model begin with the same letters, so it’s a little bit confusing!

- Let’s say we have response data that is binary
- For example, testing the ability of a vaccine to prevent some disease in animals
- Control group was not inoculated and treatment group was inoculated
- The response variable is whether or not the disease is present
- The animals are raised in different pens of 10 animals each in a complete block design
- We need to account for the effect of pen on disease using a random intercept

Load needed packages and dataset (simulated)
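A minimal setup sketch. The file name and column names used here and in later snippets (`pen`, `inoc`, `disease`) are assumptions, not the lesson's actual file:

```r
# Load packages used throughout this lesson
library(tidyverse)  # data manipulation and plotting
library(lme4)       # lmer() and glmer() for mixed models

# Read the simulated vaccine trial data (file name is an assumption)
dat <- read_csv('vaccine_sim.csv')
```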

- How many animals in each pen and treatment got the disease?
- Use `group_by()` and `summarize()` to make a table
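A sketch of that tabulation, assuming the column names introduced above:

```r
# Count how many animals in each pen and treatment got the disease
dat %>%
  group_by(pen, inoc) %>%
  summarize(n_disease = sum(disease == 'present'), .groups = 'drop')
```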

- What if we tried to just use a linear mixed model as we have done before?
- We code “disease absent” as 0 and “disease present” as 1 to make it numeric
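A sketch of the naive approach, again assuming the hypothetical column names from above:

```r
# Recode the response as numeric 0/1 and fit an ordinary linear mixed model
dat <- dat %>%
  mutate(disease01 = if_else(disease == 'present', 1, 0))

lmm_disease <- lmer(disease01 ~ inoc + (1 | pen), data = dat)
summary(lmm_disease)
```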

- Seems reasonable at first glance
- The intercept is 0.78, so the baseline rate is 78% disease
- The coefficient on inoculation is -0.64, so inoculation treatment makes disease go down to 14%
- What is wrong with that?

- Let’s say we had a population where the baseline rate of disease was only 40%.
- Our model would predict -24% disease when inoculated
- It makes more sense to think about relative changes to odds.
- If the baseline rate is very low, it can’t be reduced much more
- If the baseline rate is high, it’s easier to reduce

- Residuals of this model do not really meet assumptions
- Not normally distributed with homogeneous variance
- Predicted values of the model go well outside the 0 to 1 range
- Very poor match between the data (green line) and predictions (blue lines)

- Model binary response with a generalized linear model
- Use a link function to convert the predicted response value into a scale that can be normally distributed
- Predict the probability that an individual from each pen with or without the inoculation treatment will get the disease
- Probability ranges from 0 to 1, so it cannot really have normally distributed error

Formula for predicting the probability of disease for individual \(i\) in pen \(j\):

\[\text{logit}~\hat{y}_{i,j} = \beta_0 + \beta_1 inoc_i + u_j\]

- \(\hat{y}_{i,j}\): the predicted probability of disease for individual \(i\) in pen \(j\)
- \(\beta_0\): overall intercept (fixed effect)
- \(\beta_1\): treatment coefficient (fixed effect)
- \(inoc_i\): binary *x* variable, 0 for uninoculated control and 1 for inoculated treatment
- \(u_j\): intercept for each pen (random effect)

\[\text{logit}~\hat{y}_{i,j} = \beta_0 + \beta_1 inoc_i + u_j\]

- Logit, or log odds, is the link function

\[\text{logit}~p = \log \frac {p}{1-p}\]

- Transforms the probability (ranges from 0 to 1) to log-odds scale (can take any value, + or -)
- R function is `qlogis()`
- Inverse is `plogis()` (converts log odds back to probability)

- Maps the values from 0 to 1 to negative to positive infinity.
- More or less a straight line between about 0.25 and 0.75
- Starts to get steep as you get close to the boundaries
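A quick illustration of the link function and its inverse in R:

```r
# logit (log odds) and inverse logit
qlogis(0.5)   # probability 0.5 -> log odds 0
plogis(0)     # log odds 0 -> probability 0.5
qlogis(0.99)  # probabilities near the boundary map to large log odds (~4.6)
qlogis(0.01)  # ... or large negative log odds (~-4.6)
```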

- Binomial (aka logistic) GLMM with a logit link function
- Binomial means response variable can have two values (0 and 1, or no and yes)
- Now we use `glmer()` instead of `lmer()`
- Same model formula as with `lmer()`
- New argument, `family`, refers to the “family” of error distributions
- `binomial(link = 'logit')` used here, other links are possible
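A sketch of the binomial GLMM call, mirroring the model formula above (data frame and column names are the assumptions carried over from earlier):

```r
# Binomial (logistic) GLMM: fixed effect of inoculation, random intercept for pen
glmm_disease <- glmer(disease01 ~ inoc + (1 | pen), data = dat,
                      family = binomial(link = 'logit'))
summary(glmm_disease)
```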

- Looks better, especially the predictive check
- Model only predicts values between 0 and 1
- Makes logical sense and matches the observations

- Intercept is 1.39 and treatment coefficient is -3.37. What does that mean???
- The coefficients are on the log odds scale and need to be back-transformed to probability scale

- 1.39 is the fitted log odds of disease in uninoculated control, averaged across the random effect of pen
- Transform this log odds back to the probability scale ranging from 0 to 1
- Use the inverse logit function, `plogis()`
- Population-level prediction: a random individual who isn’t inoculated has ~80% probability of disease
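The back-transformation for the intercept:

```r
# Inverse logit of the intercept: predicted probability of disease without inoculation
plogis(1.39)  # ~0.80
```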

- -3.37 is the change in the log odds between uninoculated and inoculated
- Add intercept and slope, then take the inverse logit
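And the same calculation by hand for the inoculated group:

```r
# Inverse logit of intercept + treatment coefficient:
# predicted probability of disease with inoculation
plogis(1.39 + (-3.37))  # ~0.12
```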

- Population-level prediction: a random inoculated individual has ~12% probability of disease
- We will not have to do these predictions “by hand” – wait for next lesson!

- GLMM can have all the same kinds of effects as “plain” LMMs
- Continuous and categorical fixed effect predictors
- Interaction effects
- Crossed and nested random effects
- Random intercepts and random slopes

- Response variable can only be a non-negative integer
- Discrete values, instead of continuous like a normal distribution
- Model cannot predict negative counts
- We can use GLMM with Poisson distribution with a log link function
- Poisson distribution is bounded over non-negative integers
- Ideal for count data (as long as there are not too many 0 values)

- From the **agridat** package
- Experiment conducted by George Stirret and colleagues in Canada in 1935
- Use of fungal spores applied to corn plants to control European corn borer
- Four levels of the fungal spore treatment (`trt` column in the dataset):
  - untreated control (`"None"`)
  - early fungal treatment (`"Early"`)
  - late fungal treatment (`"Late"`)
  - fungal spores were applied both early and late (`"Both"`)

- 15 experimental blocks (`block` column), each containing four plots
- Number of borers per plot was counted on two different dates
- We will only consider the latest count (`count2` column)

If you are running R locally and didn’t/can’t install the **agridat** package, I included the CSV in the example datasets
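A sketch of loading the data either way (the dataset name `stirret.borers` and the CSV file name are assumptions; adjust to match your setup):

```r
# Load the corn borer data from agridat if installed, otherwise from the CSV copy
if (requireNamespace('agridat', quietly = TRUE)) {
  data('stirret.borers', package = 'agridat')
  borers <- stirret.borers
} else {
  borers <- read.csv('stirret.borers.csv')
}
```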

Histogram of the data by treatment, with means as a line (don’t worry about the code for now)

- Quite a few counts of zero or close to it
- “Both” and “Late” treatments have the lowest mean counts, then “Early”, then “None”

- Rearrange the factor levels of the `trt` column so the control level `"None"` is first (sketched below)
- It will be the reference level in the model, for easier interpretation
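A sketch of the releveling step (the data frame name `borers` is an assumption):

```r
# Make "None" the first (reference) level of the treatment factor
borers <- borers %>%
  mutate(trt = relevel(factor(trt), ref = 'None'))
```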

- For proof of concept, fit both a LMM and a Poisson GLMM to the count data (see the sketch below)
- Use `family = poisson(link = 'log')`
- Log is the link function that works best with the Poisson distribution
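A sketch of the two models being compared, assuming a treatment fixed effect and a random intercept for block (formula and object names are assumptions):

```r
# Ordinary LMM on the raw counts, for comparison
lmm_borers <- lmer(count2 ~ trt + (1 | block), data = borers)

# Poisson GLMM with log link
glmm_borers <- glmer(count2 ~ trt + (1 | block), data = borers,
                     family = poisson(link = 'log'))
```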

Compare LMM and Poisson GLMM diagnostics

- Residual diagnostics actually look acceptable for both models
- The count data are really not that far off from normal
- But LMM produces a lot of negative predictions
- GLMM allows us to make more meaningful predictions

- Back-transform coefficient estimates from the link function scale back to the original scale of the data
- Coefficients are on the log scale, so we use exponentiation (`exp()`) as the inverse

- Intercept (3.42) is the estimated mean for the “None” treatment (control group) on a log scale

- For other treatment groups, add coefficients to intercept and then exponentiate
- For example, “Early” treatment coefficient was -0.24
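The back-transformation by hand:

```r
# Back-transform from the log scale with exp()
exp(3.42)         # ~30.6 borers: estimated mean for the "None" control
exp(3.42 - 0.24)  # ~24.1 borers: estimated mean for the "Early" treatment
```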

- We are barely scratching the surface of GLMMs
- `glmer()` can only fit a small subset of GLMMs
- For example, your response variable might be a categorical outcome with more than two possibilities (multinomial GLMM)
- Or ordered categorical, such as disease ratings on a 1-5 scale

- Repeated measures designs
- Spatial autocorrelation
- Overdispersion in count data
- Zero-inflation
- Different types of random effects (G-side versus R-side) and error structures

- In many real world cases of data analysis, the “frequentist” methods we have covered today simply will not work
- The best alternative is to use a Bayesian approach
- I recommend the Bayesian R modeling packages **brms** and **rstanarm**
- I am planning a workshop on **brms** for this coming year