Create a new dataset or data frame with only the states in the Southeast region

In SAS we use the data step

data se_soybeans; set nass_soybeans;
where state in ('Alabama', 'Arkansas', 'Florida', 'Georgia', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee');
run;

Data frame is first argument, computation is the second argument

Assign the result back to the original data frame, overwriting it

Combining data-wrangling operations: SAS

data se_soybeans; set nass_soybeans;
where state in ('Alabama', 'Arkansas', 'Florida', 'Georgia', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee');
total_yield = acres * yield;
run;

In SAS the row subsetting and new column calculation can be part of the same data step

We end up with more rows than we started with (missing cells now have rows)

R is better at renaming the columns sensibly than SAS is

Make exploratory graphs: SAS

proc sgplot data = se_soybeans;
where state in ('Arkansas', 'Tennessee', 'North Carolina', 'Georgia');
series x = year y = yield / group = state;
yaxis label='yield (bu/ac)';
run;

Time series for four states as different lines on the same plot

Make exploratory graphs: R

fourstates <-filter(se_soybeans, state %in%c('Arkansas', 'Tennessee', 'North Carolina', 'Georgia'))ggplot(fourstates, aes(x = year, y = yield, color = state)) +geom_line(linewidth =1) +theme_bw() +scale_y_continuous(name ='yield (bu/ac)') +theme(legend.position =c(0.2, 0.8))

Check out my ggplot2 lesson to learn more

Make multi-panel version of the plot: SAS

proc sgpanel data=se_soybeans;
where state in ('Arkansas', 'Tennessee', 'North Carolina', 'Georgia');
panelby state;
series x = year y = yield;
rowaxis label = 'yield (bu/ac)';
run;

Same code as before except we removed color = state and added facet_wrap(~ state)

Make tables of summary statistics

Many ways to do it in SAS but I like proc sql

In R we will use a combination of group_by() and summarize() with pipes

Summary statistics to calculate for all states for every 10th year and put into a table:

total acreage harvested

total yield in bushels

weighted mean of yield per acre

Table of summary statistics: SAS

proc sql;
select
year,
sum(acres) as grand_total_acres,
sum(total_yield) as grand_total_yield,
sum(yield * acres) / sum(acres) as mean_yield
from se_soybeans
where mod(year, 10) = 0
group by year;
quit;

Piped statement with filter(), group_by(), and summarize()

group_by() identifies column or columns by which to split the data into groups of rows

summarize() includes a list of summary statistics separated by commas

PROTIP: Notice the similarities between proc sql and the R tidyverse code. That’s because the tidyverse syntax was partially inspired by SQL.

PROTIP 2: In SAS, a single equal sign = is used to test whether two values are equal. In R (and in many other languages such as C and Python) you use the double equal sign ==.

Fit statistical models

Both SAS and R have tons of different options for model fitting

SAS code is sometimes more concise (spits out lots of output automagically)

R code usually needs to be explicit about what output you want from the model

This has both pros and cons

Simple linear regression: SAS

proc reg data = se_soybeans;
model yield = year;
run;

Is there a linear trend over time in yield per acre?

SAS uses proc reg for this

Simple linear regression: R

yield_fit <-lm(yield ~ year, data = se_soybeans)

Use lm()

Model formula y ~ x1 + x2 …

data argument says which data frame the variables come from

summary() gives us model fit statistics and parameter estimates

anova() shows us F-test for the slope

check_model() gives us nice regression diagnostics (from easystats)

Mixed models

Simple linear regression assumes all data points are independent

But yield values from the same state in different years aren’t independent

We have “multilevel” or “nested” data (repeated measures)

Mixed models: SAS

proc glimmix data = se_soybeans plots = residualpanel;
class state;
model yield = year / solution;
random intercept / subject = state;
random year / type = ar(1) subject = state;
run;

proc glimmix with appropriate random statement

Random intercepts and slopes, with AR1 error structure

Temporal autocorrelation within states

Mixed models: R

yield_fit_lme <-lme(fixed = yield ~ year, random =~1+ year | state, correlation =corAR1(), data = se_soybeans)

Use lme(): same components as proc glimmix but different syntax

random component:

model terms on left-hand side: 1 + year is intercept and slope

grouping factor on right-hand side of |

correlation = corAR1() argument is similar to type = ar(1) in SAS