At the end of this lesson, students will …

- Understand the difference between a linear model and a linear mixed model.
- Understand the difference between random intercepts and random slopes.
- Be able to fit a mixed model with fixed effects, both continuous and categorical, and interactions.

- Be able to fit a mixed model with random effects, both random intercepts and random slopes.
- Examine diagnostic plots to ensure that model assumptions are met.
- Examine summary information from mixed model output, including ANOVA tables.

- For this lesson, we will keep it simple and use simulated data
- Read the
`soilN_biomass.csv`

file to a data frame called`soilN_biomass`

- Simulates a study with five different fields with 20 plots each
- In each plot, soil nitrogen and plant biomass is measured

- Use functions you learned in lesson 2

`field`

: character column to identify which field the measurement comes from`soilN`

: numeric column for the soil N value`biomass`

: numeric column for the plant biomass value

- Plot with
`ggplot()`

(Don’t worry about the code used to make this plot for now.)

- The relationship between soil N and plant biomass is sort of unclear
- How can we fit a model to this?

- At minimum, you need
- model
**formula** **data**frame that the variables in the model formula come from

- model
- Formula has two sides: left and right, separated by tilde
`~`

`LHS ~ RHS`

- Left-hand side is the response variable (
*y*) - Right-hand side contains the predictor variables (
*x*) - Effects are separated by a plus sign
`+`

- Interaction effects are denoted with a colon
`:`

- The shorthand
`*`

indicates all possible combinations of interactions for a set of variables (I don’t recommend)

`weight ~ height + sex + height:sex`

- In SAS this would be
`model weight = height sex height*sex`

- “The expected value of weight is a linear combination of height, sex, and the product of height and sex.”

- An intercept is included by default
- You can be explicit by adding
`1`

to the RHS`weight ~ 1 + height`

- To fit a model without an intercept (force regression line through origin) replace
`1`

with`0`

`weight ~ 0 + height`

- Relationship between soil nitrogen (
*x*) and plant biomass (*y*) in a plot - Fit linear regression with
`lm()`

(linear model) - Model formula is just
`y ~ x`

`data`

argument is the data frame containing columns`biomass`

and`soilN`

- Use
`summary()`

- Positive coefficient (~1.04) on
`soilN`

- For every 1 unit increase in soil N, there is a 1.04 unit increase in biomass

If you just want the coefficients you can use the `coef()`

function:

- Same plot as before but with a line using intercept and slope from model

- Relatively poor fit – Why?

- Linear regression assumes a single intercept and slope holds across all fields
- But fields might have different unmeasured environmental conditions causing different relationships
- Some fields might have higher soil phosphorus or better soil moisture conditions
- Or just environmental “noise” with no discernible cause

- There is a bigger problem with the simple linear regression
- The 20 measurements that come from the same field are not independent because they have a common cause
- The regression model has 98 degrees of freedom
- 100 data points - 2 parameters, the intercept and slope = 98

- But if we use one average value from each field we will only have 3 degrees of freedom
- Ideally we want something in between
- Measurements from the same field are
*partially*but not completely dependent on one another

- So why not fit separate models for each field?
- This code is just for demonstration: fit five separate regressions, get intercept and slope for each field, then plot the lines

```
separate_lm_fits <- soilN_biomass %>%
group_by(field) %>%
group_map(~ lm(biomass ~ soilN, data = .))
separate_lm_coefs <- data.frame(field = c('a', 'b', 'c', 'd', 'e'),
intercept = map_dbl(separate_lm_fits, ~ coef(.)[1]),
slope = map_dbl(separate_lm_fits, ~ coef(.)[2]))
ggplot(soilN_biomass, aes(x = soilN, y = biomass, color = field)) +
geom_point(size = 1.5) +
theme_bw() +
geom_abline(aes(intercept = intercept, slope = slope, color = field), size = 1, data = separate_lm_coefs)
```

- We now have a separate relationship for each field
- But we might hypothesize that there is some overall tendency of soil N to affect plant biomass
- This process occurs in every field but has different outcomes because of the different environments in each field
- How do we account for this?

- Goal of our scientific inference is to generalize what we find in a particular study to a larger population
- Separate regressions for each field doesn’t generalize because we aren’t estimating the overall relationship
**Linear mixed models**(LMM) estimate the overall relationship and account for unknown variation between fields at the same time

*complete pooling*in simple linear regression (all observations share intercept and slope parameters)*no pooling*when we did separate regressions for each field (separate intercept and separate slope for each field)- Mixed models do
*partial pooling*- some parameters shared by all groups (the fixed effects)
- some parameters are unique to each group (the random effects)
- called “mixed” model because there’s a “mix” of fixed and random effects

- We will use
**lme4**package to fit mixed models `lmer()`

instead of`lm()`

`+ (1 | field)`

– what’s that new thing?- A separate
*random intercept*will be fit to each field

`(1 | field)`

- Contained within parentheses
`()`

- Has a vertical bar
`|`

in the middle of it. - Left of
`|`

is the*design*component (`1`

indicates an intercept) - Right of
`|`

are the*grouping factors*(here it is only`field`

)

Plot model fit with **ggplot2** (again don’t worry about code)

```
pred_grid <- expand.grid(soilN = c(0, 10), field = letters[1:5])
mm_pred_group <- cbind(pred_grid, biomass = predict(mm_fit, newdata = pred_grid))
mm_pred_population <- data.frame(soilN = c(0, 10), biomass = predict(mm_fit, newdata = data.frame(soilN = c(0, 10)), re.form = ~ 0))
ggplot(soilN_biomass, aes(x = soilN, y = biomass, color = field)) +
geom_point(size = 1.5) +
theme_bw() +
geom_line(data = mm_pred_group) +
geom_line(data = mm_pred_population, color = 'black', size = 1)
```

- Model now accounts for different “baseline” biomass levels in each field with different intercepts
- Thick black line is population-level expectation for the “average” field
- Mixed model treats intercepts as if they were drawn from a normal distribution
- Population-level intercept is the mean of that distribution
- But lines are all parallel (slope is the same for each field)

- Slopes can also vary by field
- Change the design (left-hand side of random effects term) to
`(soilN | field)`

- Slope with respect to
`soilN`

(and intercept) will be different for each field

```
mm_pred_group_randomslopes <- cbind(pred_grid, biomass = predict(mm_fit_randomslopes, newdata = pred_grid))
mm_pred_population_randomslopes <- data.frame(soilN = c(0, 10), biomass = predict(mm_fit_randomslopes, newdata = data.frame(soilN = c(0, 10)), re.form = ~ 0))
ggplot(soilN_biomass, aes(x = soilN, y = biomass, color = field)) +
geom_point(size = 1.5) +
theme_bw() +
geom_line(data = mm_pred_group_randomslopes) +
geom_line(data = mm_pred_population_randomslopes, color = 'black', size = 1)
```

- Group-level predictions for each field now have different slopes and intercepts

- Diagnostic plots
- Model summaries

- Does the statistical model meet all assumptions?
- Residuals need to be roughly normally distributed and have no trend with respect to the fitted values
- Random effects need to be normally distributed
`check_model()`

from**easystats**package produces all the diagnostic plots you need

- The plots look great! (because this is fake data)
- Linearity and homogeneity of variance plots show no trend
- Normal Q-Q plots for the overall residuals (bottom left) and the random effects (bottom right) both straight lines

- I recommend against formal tests on residuals, e.g. Shapiro-Wilk test
- Assumptions like “normal residuals” are only approximations
- A perfect normal distribution does not exist in nature, just like a perfect circle does not exist
- Just has to be “close enough” for the model to be valid

- Are individual data points
*independent samples*? - Or if not, did you account for any dependence between them in your model?
- Mixed models allow us to account for the
*partial*dependency of data points

`summary()`

of the fitted model object gives us a lot of output- Distributions of residuals and random effects
- Fixed effect coefficients with t-tests

`ranef()`

: random effects`fixef()`

: fixed effects`coef()`

: coefficients- Add the random effect for each group to the fixed effect
- Result is the group-level intercepts and slopes

- Use the function
`anova()`

- We need the
**lmerTest**package for this

- Degrees of freedom for the F-test are estimated with an approximation
- Because linear mixed models’ degrees of freedom are “somewhere in between”

- Categorical predictor such as treatment variable with a control and one or more treatment levels
- New example data: still simulated but now each field has ten unfertilized control plots, ten low fertilizer plots, and ten high fertilizer plots
- Read data and then sort
`treatment`

factor levels to a logical order

- Box plot for each treatment within each field using
**ggplot2**

- A lot of variation among fields
- Biomass means tend to increase from control to low to high.

- Try the random intercept model first

- What does the model summary show us?

- Three coefficients in the model: the intercept,
`treatmentlow`

, and`treatmenthigh`

- Three treatments (control, low, and high) but only two have their own coefficient
- This is because the mean value of one of the three treatments is set as the intercept
- We only need \(n-1\) treatment coefficients for \(n\) levels if we have an intercept

- Expected mean value of biomass in unfertilized control plots = ~18.1
- Coefficients on low and high are positive, so the expected biomass is greater than the control
- Expected value of biomass in the low fertilizer treatment = 18.1 + 2.5 = ~20.6

- There are built-in t-tests for each coefficient, comparing them to 0
- The t-test for the intercept is not really meaningful
- The other two t-tests are testing the null hypothesis that the difference between each other treatment and the control is zero
- But they don’t account for multiple comparisons
- And there is no t-test showing whether low and high differ from one another
- Typically do not report these default t-tests

- We get a message that the fit is singular – what?!?

*“Some components of the variance-covariance matrix of the random effects, or possibly one or more linear combinations of them, are either exactly zero or exactly one.”*- Still … what?!?
- The algorithm that fits the model parameters doesn’t have enough data to get a good estimate
- Often occurs when fitting (overly) complex models
- Or random effects are very small and cannot be distinguished from zero
- We still get output (it’s a note, not an error or warning)
- But we need to take a close look at the random effects and their variances

- Random slopes look fine

- Look at the variance-covariance matrix of the random effects using
`VarCorr()`

- Correlations between the random intercepts and the random slopes are exactly 1 or exactly -1
- This means no additional information is being provided by the random slopes
- We should just stick with the random intercepts model in this case
- The random slopes model is probably still statistically valid, just a little too complex for the data
- This is not surprising here because I simulated the data to have no difference in slopes between fields

- Learned the basics of R
- Learned how to work with data frames in R
- Fit a linear model
- Fit linear mixed models with random intercepts and random slopes

Impressive!

- Data transformations in linear mixed models
- Crossed and nested random effects
- Generalized linear mixed models
- How to compare means and test specific hypotheses with contrasts

Excited to see you all then!!!