R for SAS users

What is this workshop?

  • For SAS users who want to learn R
  • Practicing researchers who have a decent working knowledge of SAS and of basic statistical analysis
  • SAS code is provided for comparison purposes
  • Lesson 1 in a series, more soon!

Conceptual learning objectives

During this workshop, you will …

  • Learn the similarities and differences between SAS and R, both different tools for the job
  • Get introduced to what R packages are, in particular the “tidyverse”
  • Learn which common SAS statistical procedures correspond to which R packages/functions

Practical skills

“Data to doc” pipeline, including …

  • Import the data from a CSV file
  • Clean and reshape the data
  • Calculate some summary statistics and make some exploratory plots
  • Fit a linear model and a linear mixed-effects model
  • Make plots and tables of results

How to follow along with this workshop

  • Slides and text version of lessons are online
    • quentinread.com/SEAStats
  • Fill in R code in the worksheet (replace ... with code)
  • You can always copy and paste code from text version of lesson if you fall behind

Background: R and SAS

  • SAS created in the late 1960s as a “statistical analysis system” for agricultural researchers at North Carolina State University
  • Spun off as an independent business in 1976


SAS Hall at NC State University. There is no R Hall … yet!

  • R first released in 1993, created by statisticians Ross Ihaka and Robert Gentleman, University of Auckland, New Zealand

Pros and cons of R

  • R is free and open-source
  • Anyone can contribute a package (set of functions serving a common purpose)
  • Lots of different capabilities
    • GIS and geospatial analysis
    • High-performance computing
    • Machine learning
  • Lots of different user communities
    • Social sciences
    • Ecology
    • Economics
    • Pharmacology
    • Linguistics
    • and more …


Just a few of the R packages out there

Top-down versus crowdsourced

  • SAS is more top-down and centralized than R
  • Usually in SAS, one kind of statistical model = one procedure
  • R may have dozens of ways to fit a particular stat model
  • This is both good and bad

Transition from SAS to R

  • SAS may be winding down support for its desktop products, moving to cloud-based services
  • This is good for big corporations but not great for government and academic researchers
  • R is more reliable for the future
  • Python is another option

Tools for a job

  • R and SAS are both good tools, neither is perfect
  • I still recommend R for stats beginners
  • I’m not trying to convert you from SAS to R, just expose SAS users to a few of R’s capabilities

SAS procedures vs. R packages/functions: an overview table

task SAS R
import data proc import, data step read.csv (base R), read_csv (readr package in tidyverse), read_xlsx (readxl package)
clean, manipulate, and sort data proc transpose, data step dplyr and tidyr packages in tidyverse
make graphs proc sgplot, etc. plot (base R), ggplot (ggplot2 package in tidyverse)
calculate summary statistics proc means, proc freq table (base R), group_by and summarize (dplyr)
fit statistical models (frequentist) proc reg, proc glm, proc mixed, proc glimmix, etc. lm, glm (base R), lme4 package, nlme package, etc.
fit statistical models (Bayesian) proc bglimm brms
look at model diagnostics diagnostics produced within model fitting procedures some diagnostics produced by model fitting packages, also see easystats package
extract predictions from models lsmeans statement, also see proc plm emmeans, marginaleffects

Data to doc pipeline

  • Import the data from a file
  • Clean, manipulate, and sort the data
  • Make exploratory graphs and tables
  • Fit statistical models
  • Look at model diagnostics to make sure our models are valid
  • Extract predictions from the statistical models
  • Make graphs and tables of model predictions
  • Put results into a document

Load R packages

  • tidyverse for reading, manipulating, and plotting data
  • nlme for fitting linear mixed models
  • easystats for making model diagnostic plots
    • (tidyverse and easystats are actually collections of several packages)
library(tidyverse)
library(nlme)
library(easystats)

Import the data from a file

  • The example data for this lesson are hosted on GitHub
  • We are importing the data directly from a URL (usually you read data you have saved locally)

PROTIP: Use CSV not XLSX!

Import the data: SAS

I recommend using proc import but some use the data step

filename csvFile url "https://github.com/qdread/R-for-SAS-users/raw/main/data/NASS_soybean.csv";
proc import datafile = csvFile out = nass_soybeans replace dbms = csv; guessingrows = 2000; run;

Import the data: R

nass_soybeans <- read_csv('https://github.com/qdread/R-for-SAS-users/raw/main/data/NASS_soybean.csv')
  • Use read_csv() function
  • variable is created using the syntax variable <- value: “variable gets value”
  • functions are called using the syntax function(argument)

Examining the data: SAS and R

SAS

proc print data = nass_soybeans(obs = 10); run;

ods select variables;
proc contents data = nass_soybeans; run;

R

head(nass_soybeans, 10)
summary(nass_soybeans)
glimpse(nass_soybeans)

Subsetting the data: SAS

  • Create a new dataset or data frame with only the states in the Southeast region
  • In SAS we use the data step
data se_soybeans; set nass_soybeans;
    where state in ('Alabama', 'Arkansas', 'Florida', 'Georgia', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee');
run;

Subsetting the data: R

se_states <- c('Alabama', 'Arkansas', 'Florida', 'Georgia', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee')

se_soybeans <- filter(nass_soybeans, state %in% se_states)
  • In R tidyverse, we use filter()
  • Create character vector of state names separately
  • filter() function has data frame as first argument, condition as second argument
  • Keep only rows that match the condition
  • Assign the result to a new data frame

Doing calculations on the data: SAS

  • Calculate a new column from existing columns
  • Multiply acreage by yield (bushels per acre) to get total yield in bushels
  • SAS uses data step
data se_soybeans; set se_soybeans;
    total_yield = acres * yield;
run;

Doing calculations on the data: R

se_soybeans <- mutate(se_soybeans, total_yield = acres * yield)
  • R tidyverse: mutate()
  • Data frame is first argument, computation is the second argument
  • Assign the result back to the original data frame, overwriting it

Combining data-wrangling operations: SAS

data se_soybeans; set nass_soybeans;
  where state in ('Alabama', 'Arkansas', 'Florida', 'Georgia', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee');
    total_yield = acres * yield;
run;
  • In SAS the row subsetting and new column calculation can be part of the same data step

Combining data-wrangling operations: R

se_soybeans <- nass_soybeans %>%
  filter(state %in% se_states) %>%
  mutate(total_yield = acres * yield)
  • Use the “pipe” operator %>% to “chain” operations
  • The %>% passes the result of one line to the next line
  • Here we first pass nass_soybeans unmodified to filter()
  • Then pass the result of filter() to mutate()

Sorting the data: SAS

proc sort data = se_soybeans;
    by year state;
run;
  • Sort by year and then state using proc sort

Sorting the data: R

se_soybeans <- arrange(se_soybeans, year, state)
  • Use arrange() in R tidyverse
  • First argument is data frame, following arguments are the column names to sort on

Even longer pipes

  • filter(), mutate(), and arrange() in the same pipe statement
se_soybeans <- nass_soybeans %>%
  filter(state %in% se_states) %>%
  mutate(total_yield = acres * yield) %>%
  arrange(year, state)

Reshaping the data

  • Each row is a unique observation
    • Identifier columns are year and state
    • Data columns are acres and yield
  • This is sometimes called “tidy” data
  • For visualization, we might want to reshape to a wide format instead of long and skinny


animation by Garrick Aden-Buie

Reshaping the data: SAS

proc transpose data = se_soybeans out = total_yield_wide;
    by year; id state; var total_yield;
run;
  • We get wide form data, row for each year and column for each state
  • proc transpose specifying by, id, and var
  • Create a new dataset with the reshaped result

Reshaping the data: R

total_yield_wide <- se_soybeans %>%
  pivot_wider(id_cols = year, names_from = state, values_from = total_yield)
  • pivot_wider() from tidyverse
  • Pass the column names as arguments
  • id_cols is for row IDs, confusingly in SAS id is for column IDs

Reshaping wide to long

SAS

proc transpose data = total_yield_wide out = total_yield_long;
    by year;
run;

R

total_yield_long <- total_yield_wide %>%
  pivot_longer(-year, names_to = 'state', values_to = 'total_yield')
  • We end up with more rows than we started with (missing cells now have rows)
  • R is better at renaming the columns sensibly than SAS is

Make exploratory graphs: SAS

proc sgplot data = se_soybeans;
    where state in ('Arkansas', 'Tennessee', 'North Carolina', 'Georgia');
    series x = year y = yield / group = state; 
    yaxis label='yield (bu/ac)';
run;
  • Time series for four states as different lines on the same plot

Make exploratory graphs: R

fourstates <- filter(se_soybeans, state %in% c('Arkansas', 'Tennessee', 'North Carolina', 'Georgia'))

ggplot(fourstates, aes(x = year, y = yield, color = state)) +
  geom_line(linewidth = 1) +
  theme_bw() +
  scale_y_continuous(name = 'yield (bu/ac)') +
  theme(legend.position = c(0.2, 0.8))
  • Check out my ggplot2 lesson to learn more

Make multi-panel version of the plot: SAS

proc sgpanel data=se_soybeans;
    where state in ('Arkansas', 'Tennessee', 'North Carolina', 'Georgia');
    panelby state;
    series x = year y = yield;
    rowaxis label = 'yield (bu/ac)';
run;
  • A separate procedure is needed (sgpanel)

Make multi-panel version of the plot: R

ggplot(fourstates, aes(x = year, y = yield)) +
  facet_wrap(~ state) +
  geom_line(linewidth = 1) +
  theme_bw() +
  scale_y_continuous(name = 'yield (bu/ac)')
  • Same code as before except we removed color = state and added facet_wrap(~ state)

Make tables of summary statistics

  • Many ways to do it in SAS but I like proc sql
  • In R we will use a combination of group_by() and summarize() with pipes
  • Summary statistics to calculate for all states for every 10th year and put into a table:
    • total acreage harvested
    • total yield in bushels
    • weighted mean of yield per acre

Table of summary statistics: SAS

proc sql;
    select 
        year,
        sum(acres) as grand_total_acres,
        sum(total_yield) as grand_total_yield,
        sum(yield * acres) / sum(acres) as mean_yield
    from se_soybeans
    where mod(year, 10) = 0
    group by year;
quit;

Table of summary statistics: R

se_soybeans %>%
  filter(year %% 10 == 0) %>%
  group_by(year) %>%
  summarize(
    grand_total_acres = sum(acres),
    grand_total_yield = sum(total_yield),
    mean_yield = weighted.mean(yield, acres)
  )
  • Piped statement with filter(), group_by(), and summarize()
  • group_by() identifies column or columns by which to split the data into groups of rows
  • summarize() includes a list of summary statistics separated by commas

PROTIP: Notice the similarities between proc sql and the R tidyverse code. That’s because the tidyverse syntax was partially inspired by SQL.

PROTIP 2: In SAS, a single equal sign = is used to test whether two values are equal. In R (and in many other languages such as C and Python) you use the double equal sign ==.

Fit statistical models

  • Both SAS and R have tons of different options for model fitting
  • SAS code is sometimes more concise (spits out lots of output automagically)
  • R code usually needs to be explicit about what output you want from the model
    • This has both pros and cons

Simple linear regression: SAS

proc reg data = se_soybeans;
    model yield = year;
run;    
  • Is there a linear trend over time in yield per acre?
  • SAS uses proc reg for this

Simple linear regression: R

yield_fit <- lm(yield ~ year, data = se_soybeans)
  • Use lm()
  • Model formula y ~ x1 + x2
  • data argument says which data frame the variables come from
summary(yield_fit)
anova(yield_fit)
check_model(yield_fit)
  • summary() gives us model fit statistics and parameter estimates
  • anova() shows us F-test for the slope
  • check_model() gives us nice regression diagnostics (from easystats)

Mixed models

  • Simple linear regression assumes all data points are independent
  • But yield values from the same state in different years aren’t independent
  • We have “multilevel” or “nested” data (repeated measures)

Mixed models: SAS

proc glimmix data = se_soybeans plots = residualpanel;
    class state;
    model yield = year / solution;
    random intercept / subject = state;
    random year / type = ar(1) subject = state;
run;
  • proc glimmix with appropriate random statement
  • Random intercepts and slopes, with AR1 error structure
  • Temporal autocorrelation within states

Mixed models: R

yield_fit_lme <- lme(fixed = yield ~ year, 
                     random = ~ 1 + year | state, 
                     correlation = corAR1(), 
                     data = se_soybeans)
  • Use lme(): same components as proc glimmix but different syntax
  • random component:
    • model terms on left-hand side: 1 + year is intercept and slope
    • grouping factor on right-hand side of |
  • correlation = corAR1() argument is similar to type = ar(1) in SAS
check_model(yield_fit_lme)
summary(yield_fit_lme)
anova(yield_fit_lme)
coef(yield_fit_lme)
  • Similar to the lm() output functions
  • coef() gives us a table of intercepts and slopes for each state
  • Shows relatively small difference between states (all ~0.28 more bushels per acre per year)

Make graph of model predictions

  • Model predictions for individual states and population-level expectation
yield_pred_bystate <- expand_grid(year = c(1924, 2011), state = se_states) %>%
  mutate(yield = as.numeric(predict(yield_fit_lme, newdata = .)))

yield_pred_overall <- data.frame(state = 'overall', year = c(1924, 2011)) %>% 
  mutate(yield = as.numeric(predict(yield_fit_lme, newdata = ., level = 0)))
ggplot(mapping = aes(x = year, y = yield, color = state, group = state)) +
  geom_line(data = se_soybeans, alpha = 0.3) +
  geom_line(data = yield_pred_bystate, linewidth = 0.7) +
  geom_line(data = yield_pred_overall, color = 'black', linewidth = 1.2) +
  theme_bw() +
  ggtitle('soybean yields by state, observations and modeled trends, 1924-2011',
          'black line is overall modeled trend')
  • Observed and modeled on the same plot
  • Linear trend is reasonably good model
  • I have no idea how to make this plot with SAS

What did you learn today?

  • The pros and cons of R and SAS
  • How to create a data frame in R by reading data from an external file
  • How to clean, manipulate, sort, and reshape your data frame
  • How to calculate summary statistics from a data frame
  • How to fit a linear model and a linear mixed model

Impressive!

  • See text version of lesson for further reading and useful resources
  • Fill out Google Forms survey
  • Contact me at quentin.read@usda.gov!