# Lesson 3: From linear model to linear mixed model

## Lesson 3 learning objectives

At the end of this lesson, students will …

• Understand the difference between a linear model and a linear mixed model.
• Understand the difference between random intercepts and random slopes.
• Be able to fit a mixed model with fixed effects, both continuous and categorical, and interactions.
• Be able to fit a mixed model with random effects, both random intercepts and random slopes.
• Examine diagnostic plots to ensure that model assumptions are met.
• Examine summary information from mixed model output, including ANOVA tables.

## Load the packages we need

library(tidyverse)
library(lme4)
library(lmerTest)
library(easystats)

## Read in some example data

• For this lesson, we will keep it simple and use simulated data
• Read the soilN_biomass.csv file to a data frame called soilN_biomass
• Simulates a study with five different fields with 20 plots each
• In each plot, soil nitrogen and plant biomass is measured
soilN_biomass <- read_csv('datasets/soilN_biomass.csv')

## Examine the data

• Use functions you learned in lesson 2
soilN_biomass
summary(soilN_biomass)
glimpse(soilN_biomass)
• field: character column to identify which field the measurement comes from
• soilN: numeric column for the soil N value
• biomass: numeric column for the plant biomass value

## Plot the data

• Plot with ggplot() (Don’t worry about the code used to make this plot for now.)
ggplot(soilN_biomass, aes(x = soilN, y = biomass, color = field)) +
geom_point(size = 1.5) +
theme_bw()
• The relationship between soil N and plant biomass is sort of unclear
• How can we fit a model to this?

## Model Syntax in R

• At minimum, you need
• model formula
• data frame that the variables in the model formula come from
• Formula has two sides: left and right, separated by tilde ~
• LHS ~ RHS

## LHS and RHS

• Left-hand side is the response variable (y)
• Right-hand side contains the predictor variables (x)
• Effects are separated by a plus sign +
• Interaction effects are denoted with a colon :
• The shorthand * indicates all possible combinations of interactions for a set of variables (I don’t recommend)

### Example of a formula

• weight ~ height + sex + height:sex
• In SAS this would be model weight = height sex height*sex
• “The expected value of weight is a linear combination of height, sex, and the product of height and sex.”

### Intercept

• An intercept is included by default
• You can be explicit by adding 1 to the RHS
• weight ~ 1 + height
• To fit a model without an intercept (force regression line through origin) replace 1 with 0
• weight ~ 0 + height

## Linear model with continuous predictor

• Relationship between soil nitrogen (x) and plant biomass (y) in a plot
• Fit linear regression with lm() (linear model)
• Model formula is just y ~ x
• data argument is the data frame containing columns biomass and soilN
lm_fit <- lm(biomass ~ soilN, data = soilN_biomass)

## Look at model output

• Use summary()
summary(lm_fit)
• Positive coefficient (~1.04) on soilN
• For every 1 unit increase in soil N, there is a 1.04 unit increase in biomass

If you just want the coefficients you can use the coef() function:

lm_coefs <- coef(lm_fit)

## Plot fitted trendline from regression

• Same plot as before but with a line using intercept and slope from model
ggplot(soilN_biomass, aes(x = soilN, y = biomass, color = field)) +
geom_point(size = 1.5) +
theme_bw() +
geom_abline(intercept = lm_coefs[1], slope = lm_coefs[2], size = 1)
• Relatively poor fit – Why?
• Linear regression assumes a single intercept and slope holds across all fields
• But fields might have different unmeasured environmental conditions causing different relationships
• Some fields might have higher soil phosphorus or better soil moisture conditions
• Or just environmental “noise” with no discernible cause
• There is a bigger problem with the simple linear regression
• The 20 measurements that come from the same field are not independent because they have a common cause
• The regression model has 98 degrees of freedom
• 100 data points - 2 parameters, the intercept and slope = 98
• But if we use one average value from each field we will only have 3 degrees of freedom
• Ideally we want something in between
• Measurements from the same field are partially but not completely dependent on one another

## Separate linear regressions

• So why not fit separate models for each field?
• This code is just for demonstration: fit five separate regressions, get intercept and slope for each field, then plot the lines
separate_lm_fits <- soilN_biomass %>%
group_by(field) %>%
group_map(~ lm(biomass ~ soilN, data = .))

separate_lm_coefs <- data.frame(field = c('a', 'b', 'c', 'd', 'e'),
intercept = map_dbl(separate_lm_fits, ~ coef(.)[1]),
slope = map_dbl(separate_lm_fits, ~ coef(.)[2]))

ggplot(soilN_biomass, aes(x = soilN, y = biomass, color = field)) +
geom_point(size = 1.5) +
theme_bw() +
geom_abline(aes(intercept = intercept, slope = slope, color = field), size = 1, data = separate_lm_coefs)
• We now have a separate relationship for each field
• But we might hypothesize that there is some overall tendency of soil N to affect plant biomass
• This process occurs in every field but has different outcomes because of the different environments in each field
• How do we account for this?

## Our first linear mixed model

• Goal of our scientific inference is to generalize what we find in a particular study to a larger population
• Separate regressions for each field doesn’t generalize because we aren’t estimating the overall relationship
• Linear mixed models (LMM) estimate the overall relationship and account for unknown variation between fields at the same time

### What do you call this model?

Mixed models meme courtesy of Chelsea Parlett-Pelleriti on Twitter

## Partial pooling

• complete pooling in simple linear regression (all observations share intercept and slope parameters)
• no pooling when we did separate regressions for each field (separate intercept and separate slope for each field)
• Mixed models do partial pooling
• some parameters shared by all groups (the fixed effects)
• some parameters are unique to each group (the random effects)
• called “mixed” model because there’s a “mix” of fixed and random effects

## Random intercept model

• We will use lme4 package to fit mixed models
• lmer() instead of lm()
mm_fit <- lmer(biomass ~ soilN + (1 | field), data = soilN_biomass)
• + (1 | field) – what’s that new thing?
• A separate random intercept will be fit to each field

## Syntax of random effect term

(1 | field)

• Contained within parentheses ()
• Has a vertical bar | in the middle of it.
• Left of | is the design component (1 indicates an intercept)
• Right of | are the grouping factors (here it is only field)

## Visualizing model fit of random intercept model

Plot model fit with ggplot2 (again don’t worry about code)

pred_grid <- expand.grid(soilN = c(0, 10), field = letters[1:5])
mm_pred_group <- cbind(pred_grid, biomass = predict(mm_fit, newdata = pred_grid))
mm_pred_population <- data.frame(soilN = c(0, 10), biomass = predict(mm_fit, newdata = data.frame(soilN = c(0, 10)), re.form = ~ 0))

ggplot(soilN_biomass, aes(x = soilN, y = biomass, color = field)) +
geom_point(size = 1.5) +
theme_bw() +
geom_line(data = mm_pred_group) +
geom_line(data = mm_pred_population, color = 'black', size = 1)
• Model now accounts for different “baseline” biomass levels in each field with different intercepts
• Thick black line is population-level expectation for the “average” field
• Mixed model treats intercepts as if they were drawn from a normal distribution
• Population-level intercept is the mean of that distribution
• But lines are all parallel (slope is the same for each field)

## Random slope model

• Slopes can also vary by field
• Change the design (left-hand side of random effects term) to (soilN | field)
• Slope with respect to soilN (and intercept) will be different for each field
mm_fit_randomslopes <- lmer(biomass ~ soilN + (soilN | field), data = soilN_biomass)

## Visualizing model fit of random slope model

mm_pred_group_randomslopes <- cbind(pred_grid, biomass = predict(mm_fit_randomslopes, newdata = pred_grid))
mm_pred_population_randomslopes <- data.frame(soilN = c(0, 10), biomass = predict(mm_fit_randomslopes, newdata = data.frame(soilN = c(0, 10)), re.form = ~ 0))

ggplot(soilN_biomass, aes(x = soilN, y = biomass, color = field)) +
geom_point(size = 1.5) +
theme_bw() +
geom_line(data = mm_pred_group_randomslopes) +
geom_line(data = mm_pred_population_randomslopes, color = 'black', size = 1)
• Group-level predictions for each field now have different slopes and intercepts

## Delving into mixed model output

• Diagnostic plots
• Model summaries

## Diagnostic plots

• Does the statistical model meet all assumptions?
• Residuals need to be roughly normally distributed and have no trend with respect to the fitted values
• Random effects need to be normally distributed
• check_model() from easystats package produces all the diagnostic plots you need
check_model(mm_fit_randomslopes)

## Residual diagnostics

• The plots look great! (because this is fake data)
• Linearity and homogeneity of variance plots show no trend
• Normal Q-Q plots for the overall residuals (bottom left) and the random effects (bottom right) both straight lines
• I recommend against formal tests on residuals, e.g. Shapiro-Wilk test
• Assumptions like “normal residuals” are only approximations
• A perfect normal distribution does not exist in nature, just like a perfect circle does not exist
• Just has to be “close enough” for the model to be valid

## The most important assumption of a linear model

• Are individual data points independent samples?
• Or if not, did you account for any dependence between them in your model?
• Mixed models allow us to account for the partial dependency of data points

## Model summaries

• summary() of the fitted model object gives us a lot of output
• Distributions of residuals and random effects
• Fixed effect coefficients with t-tests
summary(mm_fit_randomslopes)

## Getting parameters from the fit object

• ranef(): random effects
• fixef(): fixed effects
• coef(): coefficients
• Add the random effect for each group to the fixed effect
• Result is the group-level intercepts and slopes
ranef(mm_fit_randomslopes)
fixef(mm_fit_randomslopes)
coef(mm_fit_randomslopes)

## ANOVA

• Use the function anova()
• We need the lmerTest package for this
anova(mm_fit_randomslopes)
• Degrees of freedom for the F-test are estimated with an approximation
• Because linear mixed models’ degrees of freedom are “somewhere in between”

## Mixed models with categorical predictor

• Categorical predictor such as treatment variable with a control and one or more treatment levels
• New example data: still simulated but now each field has ten unfertilized control plots, ten low fertilizer plots, and ten high fertilizer plots
• Read data and then sort treatment factor levels to a logical order
fert_biomass <- read_csv('datasets/fertilizer_biomass.csv') %>%
mutate(treatment = factor(treatment, levels = c('control', 'low', 'high')))

## Look at the data

• Box plot for each treatment within each field using ggplot2
ggplot(fert_biomass, aes(x = field, fill = treatment, y = biomass)) +
geom_boxplot() + theme_bw()
• A lot of variation among fields
• Biomass means tend to increase from control to low to high.

## Fit a model

• Try the random intercept model first
fert_mm <- lmer(biomass ~ treatment + (1 | field), data = fert_biomass)
• What does the model summary show us?
summary(fert_mm)
• Three coefficients in the model: the intercept, treatmentlow, and treatmenthigh
• Three treatments (control, low, and high) but only two have their own coefficient
• This is because the mean value of one of the three treatments is set as the intercept
• We only need $n-1$ treatment coefficients for $n$ levels if we have an intercept
• Expected mean value of biomass in unfertilized control plots = ~18.1
• Coefficients on low and high are positive, so the expected biomass is greater than the control
• Expected value of biomass in the low fertilizer treatment = 18.1 + 2.5 = ~20.6
• There are built-in t-tests for each coefficient, comparing them to 0
• The t-test for the intercept is not really meaningful
• The other two t-tests are testing the null hypothesis that the difference between each other treatment and the control is zero
• But they don’t account for multiple comparisons
• And there is no t-test showing whether low and high differ from one another
• Typically do not report these default t-tests

## Random slopes model with categorical predictor

fert_mm_slopes <- lmer(biomass ~ treatment + (treatment | field), data = fert_biomass)
• We get a message that the fit is singular – what?!?

## Singular fit

• “Some components of the variance-covariance matrix of the random effects, or possibly one or more linear combinations of them, are either exactly zero or exactly one.”
• Still … what?!?
• The algorithm that fits the model parameters doesn’t have enough data to get a good estimate
• Often occurs when fitting (overly) complex models
• Or random effects are very small and cannot be distinguished from zero
• We still get output (it’s a note, not an error or warning)
• But we need to take a close look at the random effects and their variances
• Random slopes look fine
ranef(fert_mm_slopes)
• Look at the variance-covariance matrix of the random effects using VarCorr()
VarCorr(fert_mm_slopes)
• Correlations between the random intercepts and the random slopes are exactly 1 or exactly -1
• This means no additional information is being provided by the random slopes
• We should just stick with the random intercepts model in this case
• The random slopes model is probably still statistically valid, just a little too complex for the data
• This is not surprising here because I simulated the data to have no difference in slopes between fields

## Day 1 recap

### What have we done so far?

• Learned the basics of R
• Learned how to work with data frames in R
• Fit a linear model
• Fit linear mixed models with random intercepts and random slopes

Impressive!

### What will we do tomorrow?

• Data transformations in linear mixed models
• Crossed and nested random effects
• Generalized linear mixed models
• How to compare means and test specific hypotheses with contrasts

Excited to see you all then!!!