Lesson 1: R basics

Course introduction: mixed models in R

  • Follow along with lesson slides and lesson text
  • Each lesson has a worksheet
  • Fill in the ... in the worksheet with code
  • Typing in the code yourself is better than copy+paste!
  • Optional exercises at the end
  • This is a practical skills course, not “Principles of Statistics 101!”

Day 1 Schedule

Time                  Activity
9:00-9:15 AM          Introductions, troubleshooting
9:15-10:15 AM         Lesson 1: R Boot Camp: the very basics
10:15-10:45 AM        break
10:45-11:30 AM        Lesson 2: R Boot Camp: working with data frames
11:30-11:45 AM        break
11:45 AM-12:45 PM     Lesson 3: From linear model to linear mixed model
12:45-2:00 PM         lunch break
2:00-4:00 PM          office hours

(Day 2 schedule has the same format)

Lesson 1 learning objectives

At the end of this lesson, students will …

  • Know what R is and what it can do.
  • Use the R console to interactively issue R commands.
  • Know the most common data types in R.
  • Know how statistical distributions work in R.
  • Know what R packages are and how to install and load them.

Introduction to R and RStudio

R and RStudio are software tools to help you work with and analyze your data.

What is R?

  • A statistical programming language
  • Users contribute packages
  • Free and open-source

What is RStudio?

  • A tool to help you write and run code in R
  • RStudio is not R; it is an interface for R (you also need to have R installed to run RStudio)
  • We will access RStudio through Posit Cloud for this course
  • Or you can run RStudio locally if you prefer

RStudio panes

  • Console: Enter individual lines of code, see output
  • Scripts: Edit and run scripts (text files containing code)
  • Environment: Shows variables that you have created
  • Files/Plots/Help: Includes several tabs
    • Files: navigate your filesystem
    • Plots: display images generated by your R code
    • Packages: view and install R packages
    • Help: documentation for functions and packages

The basic moving parts of R

  • variable: a structure that holds data. Examples:
    • a numeric vector of whole numbers c(1, 2, 3)
    • a character string "USDA"
    • a data frame with 1000 rows and 10 columns

The basic moving parts of R

  • function: something that takes arguments as input, does something, and returns output.
    • log(10): takes a numeric value as input and returns a numeric value as output
    • c(1, 5, 6): The function c() takes multiple values as input and returns a vector as output.
    • read.csv('myfile.csv'): takes a character string as input and returns a data frame as output.
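
As a small sketch of the input-to-output idea, the lines below call two of the functions listed above and then inspect what came back. The variable names log_value and my_vector are made up for this illustration.

```r
# Each call passes input to a function and stores the returned output.
log_value <- log(10)       # numeric value in, numeric value out
my_vector <- c(1, 5, 6)    # several values in, one vector out

class(log_value)    # "numeric"
length(my_vector)   # 3
```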

How to R

  • Let’s start writing our first R code!
  • Enter the example code in the console

Using R as a calculator

  • Use operators: +, -, *, /, ^ to use R as a calculator
2 + 3
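
Here are a few more calculator-style lines you could try, one for each remaining operator. Parentheses control the order of operations, just as in ordinary arithmetic.

```r
7 - 2         # subtraction: 5
4 * 2.5       # multiplication: 10
9 / 2         # division: 4.5
2 ^ 3         # exponentiation: 8
(2 + 3) * 4   # parentheses are evaluated first: 20
```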

The assignment operator

  • The assignment operator <- is used to create a new variable and give it a value. The syntax is variable <- <value>.
  • Variable names can contain . or _ but can’t contain spaces or start with a number.
  • You can also use = as an assignment operator but we will use <- in this workshop. Consistent code is readable code!
x <- 2 + 3
y = 3.5
  • Entering the name of a variable prints that variable’s value to the console.
  • If you assign a value to a new variable, nothing will print to the console. But the variable is now defined in your environment and can be used later.
x
x + y
x * 4

x <- x + 1
z <- x * 4
z

Comments

  • Any text after # on a line is a comment and will not be evaluated.
# This is a comment

Functions with arguments

  • To call a function, write its name followed by an argument in parentheses (), like function(<value>). The function takes the value as input and returns some output
log(1000)

sin(pi)
  • Functions can take multiple arguments separated by commas ,
  • You can use either 'single quotes' or "double quotes"
my_name <- "Quentin"

paste('Hello,', my_name)
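
A few more calls with multiple arguments, as a quick sketch. Arguments can be matched by position or by name; base, sep, and digits are real argument names of these base R functions.

```r
log(1000, base = 10)               # second argument sets the base: 3
paste('USDA', 'ARS', sep = '-')    # sep controls the separator: "USDA-ARS"
round(3.14159, digits = 2)         # digits controls the rounding: 3.14
```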

Getting help

  • Use ? to get help about a function
?paste
  • Use ?? to search all help documentation for a term
??sequence

Types of output

  • Usually output prints to the console unless assigned to a variable
  • Some code produces other output as a “side effect,” such as a plot
plot(mpg ~ hp, data = mtcars)

Errors, warnings and notes

Code can produce messages instead of or in addition to output:

  • Errors
  • Warnings
  • Notes

Errors

  • Indicates something went wrong
  • No output is produced
sin(pi))

Warnings

  • Indicates the result may not be what you expected
  • Code still runs and produces output
log(-5)

Notes

  • Just a note. Everything is still fine!
rep(0, 100000)

Data types in R

  • The [1] at the start of the output from earlier is the index of the first element printed on that line. Even a single value in R is a vector of length 1
  • Vectors are sequences of one or more elements of the same data type
    • numeric
    • character
    • factor
    • logical
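
One way to check which of these types you are dealing with is the function class(). The short vectors below are made up for illustration.

```r
class(c(1.5, 2, 3))               # "numeric"
class(c('a', 'b'))                # "character"
class(factor(c('low', 'high')))   # "factor"
class(c(TRUE, FALSE))             # "logical"
```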

Numeric

  • Here are two ways to make a numeric vector with a sequence of integers 1 to 100
  • The first way uses a function seq() with three named arguments
  • Separate arguments with ,
  • The notation with : is shorthand
seq(from = 1, to = 100, by = 1)

1:100
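
The named arguments of seq() can be changed to get other sequences. As a brief sketch, by sets the step size and length.out (another real argument of seq()) sets how many values to return instead.

```r
seq(from = 0, to = 10, by = 2)          # 0 2 4 6 8 10
seq(from = 0, to = 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00
10:1                                    # the : shorthand can also count down
```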

Character

  • Text values
  • Use single quotes ' or double quotes " to create character vectors
  • We can index vectors with brackets [] containing one or more integer values
c('a', 'b', 'c', 'd', 'e', 'f', 'g')

letters[1:7]

letters[c(1, 18, 19)]

c('USDA', 'ARS', 'SEA')

Issues with numeric and character data types

  • Wrong data type often results in an error
log('hello')
  • Combination of numeric and character is forced to character
  • This is a common problem when reading data from a spreadsheet
c(100, 5.323, 'missing value', 12)
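
Here is one possible way to recover the numbers from a vector like this, assuming the text entry really does stand for a missing value. The names mixed and converted are made up for this sketch. as.numeric() turns anything it cannot read as a number into NA and issues a warning to tell you so.

```r
mixed <- c(100, 5.323, 'missing value', 12)
class(mixed)                      # "character"

# Convert back to numeric; the text entry becomes NA, with a warning
converted <- as.numeric(mixed)
converted                         # 100.000 5.323 NA 12.000

# na.rm = TRUE tells mean() to skip the missing value
mean(converted, na.rm = TRUE)
```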

Factor

  • Looks like character but can only contain predefined values (levels)
  • Levels are sorted in a specific order
  • Used for categorical variables in models
  • The first level is usually treated as the reference level (the control or baseline) in models
treatment <- factor(c('low', 'low', 'medium', 'medium', 'high', 'high'))

treatment

Sorting factor levels

  • Default order is alphabetical
  • We can sort the levels in a logical order instead of alphabetical
treatment <- factor(treatment, levels = c('low', 'medium', 'high'))

treatment
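
To see the effect of reordering, you can inspect the levels directly with levels(). This sketch uses two separate variables, treatment_default and treatment_ordered (names made up for illustration), so the two orderings can be compared side by side.

```r
treatment_default <- factor(c('low', 'low', 'medium', 'medium', 'high', 'high'))
levels(treatment_default)       # "high" "low" "medium" (alphabetical)

treatment_ordered <- factor(treatment_default, levels = c('low', 'medium', 'high'))
levels(treatment_ordered)       # "low" "medium" "high"

# Behind the scenes, each value is stored as the integer position of its level
as.integer(treatment_ordered)   # 1 1 2 2 3 3
```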

Logical

  • Can take two values, TRUE and FALSE
  • The result of a comparison is a logical vector
  • Logical operators in R:
    • x == y: is x equal to y?
    • x != y: is x not equal to y?
    • x > y: is x greater than y?
    • x >= y: is x greater than or equal to y?
    • x < y: is x less than y?
    • x <= y: is x less than or equal to y?
    • x > y & x < z: is x greater than y and less than z?
    • x > y | x < z: is x greater than y or less than z?

Examples of comparisons with logical operators

x <- 1:5

x > 4

x <= 2

x == 3

x != 2

x > 1 & x < 5

x <= 1 | x >= 5

The ! operator

  • ! is the negation operator
  • Converts all TRUE values to FALSE and vice versa.
!(x == 3)

The %in% operator

  • %in% is an operator comparing two vectors
  • Goes through the vector on the left-hand side and returns TRUE for the values that appear anywhere in the vector on the right-hand side, and FALSE otherwise
c(1, 5, 6, 7) %in% x

x %in% c(1, 5, 6, 7)

Functions that take vectors as input

  • Some functions take a vector as input and return a vector of the same length.
    • exp(): the exponential of each element in the vector
set.seed(123)

random_numbers <- rnorm(n = 1000, mean = 0, sd = 1)

head(exp(random_numbers))

PROTIP: set.seed() ensures the code produces the same random numbers each time it is run, and head() returns only the first few values of a result

  • Other functions take a vector as input and return only one or a few values
  • length(), mean(), median(), and sd() return a single value.
length(random_numbers)
mean(random_numbers)
median(random_numbers)
sd(random_numbers)
  • range() returns a vector of two values, the minimum and maximum of the vector
  • quantile() takes two vectors as input.
    • First argument is the vector we want the quantiles from
    • The second vector, probs, contains the probabilities we want to calculate the quantiles for
    • The function returns a vector with the same length as probs containing the corresponding quantiles
range(random_numbers)
quantile(random_numbers, probs = c(0.025, 0.5, 0.975))

Statistical distributions

  • R has a lot of built-in statistical distributions
  • All of them have four functions beginning with r, d, p, and q and followed by the (abbreviated) name of the distribution.
    • r: random draws from the distribution
    • d: probability density function (what is the y-value of the function given x?)
    • p: cumulative distribution function (what is the cumulative probability given x?)
    • q: quantile (what is the x-value given the cumulative probability?); q is the inverse of p.
  • For example, the functions for the normal distribution are rnorm(), dnorm(), pnorm(), and qnorm()
  • These functions default to the standard normal distribution with mean = 0 and sd = 1
  • You can change those parameters by modifying the mean and sd arguments
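
As a sketch of changing those parameters, the code below draws from a normal distribution with a mean of 50 and a standard deviation of 10. These values, and the variable name draws, are arbitrary choices for illustration.

```r
set.seed(123)
draws <- rnorm(n = 1000, mean = 50, sd = 10)

mean(draws)   # close to 50
sd(draws)     # close to 10

# Probability of a value below 60, which is one sd above the mean
pnorm(60, mean = 50, sd = 10)   # about 0.841, the same as pnorm(1)
```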

What does dnorm do?
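
One way to get a feel for dnorm() is to evaluate it at a few x-values. For the standard normal distribution, the density is highest at 0 and falls off symmetrically on either side. The vector x_values is made up for this sketch.

```r
x_values <- c(-2, -1, 0, 1, 2)

# Height of the standard normal curve at each x-value
dnorm(x_values)

dnorm(0)          # the peak of the curve: about 0.399
dnorm(c(-1, 1))   # the two values match because the curve is symmetric
```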

What does qnorm do?
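
A brief sketch of qnorm() in action: you give it a cumulative probability, and it returns the x-value below which that much of the distribution lies. The probabilities used here are common examples chosen for illustration.

```r
qnorm(0.5)                 # 0: half of the probability lies below the mean
qnorm(c(0.025, 0.975))     # about -1.96 and 1.96

# qnorm() undoes pnorm(), and vice versa
pnorm(qnorm(0.9))          # returns 0.9, up to rounding
```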

Other distributions you might work with

  • Binomial (rbinom(), dbinom(), pbinom(), qbinom())
  • Uniform (runif(), dunif(), punif(), qunif())
  • Student’s t (rt(), dt(), pt(), qt())
  • The list goes on …

Type ?Distributions in your console to see help documentation about all the built-in distributions.
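
The same r, d, p, q pattern carries over to the other distributions; only the parameter arguments change. Here is a short sketch with arbitrary parameter values (the variable name coin_flips is made up).

```r
set.seed(1)

# Binomial: number of successes in 10 trials, each with probability 0.5
coin_flips <- rbinom(n = 5, size = 10, prob = 0.5)
coin_flips
dbinom(5, size = 10, prob = 0.5)   # probability of exactly 5 successes: about 0.246

# Uniform: defaults to the interval from 0 to 1
punif(0.3)                         # 0.3

# Student's t: requires the degrees of freedom, df
qt(0.975, df = 10)                 # about 2.228
```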

Common pitfalls

If you get an error or your code doesn’t work, here are some things to check.

  • Punctuation: close all parentheses, brackets, and quotation marks.
(5+3))/2 # Nope

(5+3)/2 # Yep
  • Spelling: are the functions and variables spelled correctly?
my_variable <- 100000

myvariable
  • Spaces
    • Spaces are good for making code more readable
    • Compare x<-log(500,base=2) and x <- log(500, base = 2)
    • But you can’t put spaces in the middle of the name of a function or variable
some_numbers <- 1:5

( some_numbers + 3 ) ^ 2

(some_numbers+3)^2

(some numbers + 3)^2
  • Case: R is CASE-SENSITIVE (unlike SAS)
sum(1:10)
Sum(1:10)

R packages

  • So far we have only used code from “base R.”
  • But most real-world R scripts rely on one or more packages
  • Packages are sets of functions contributed by R users that are available for download on CRAN

Installing a package

  • Install a package for the first time either via the RStudio dialog or with the function install.packages()
  • This only needs to be done once!
install.packages('cowsay')

PROTIP: You can specify the location of the library the package will install into. This means you can specify one that doesn’t require administrator level access.
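
A minimal sketch of how that might look. The function .libPaths() shows the libraries R currently searches. The folder '~/R/library' is a hypothetical path, so the install and load lines are commented out; replace the path with a folder you can write to before running them.

```r
# Show the library locations R currently knows about
.libPaths()

# Hypothetical folder that does not need administrator access
# my_library <- '~/R/library'

# Install into that folder, then load the package from it
# install.packages('cowsay', lib = my_library)
# library(cowsay, lib.loc = my_library)
```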

Loading and using a package

  • Load an installed package from your package library using the function library()
  • This needs to be done every time you start a new R session!
library(cowsay)
say('USDA statisticians are the best!', by = 'cow')

 ----- 
USDA statisticians are the best! 
 ------ 
    \   ^__^ 
     \  (oo)\ ________ 
        (__)\         )\ /\ 
             ||------w|
             ||      ||
  • You can also use the package name followed by :: to be explicit
cowsay::say("Don't forget to close your parentheses", by = 'chicken')

 ----- 
Don't forget to close your parentheses 
 ------ 
    \   
     \
         _
       _/ }
      `>' \
      `|   \
       |   /'-.     .-.
        \'     ';`--' .'
         \'.    `'-./
          '.`-..-;`
            `;-..'
            _| _|
            /` /` [nosig]
  
  • To access all the help documentation for a package, use help(package = 'packagename').
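
The :: notation also works for packages that ship with R, such as stats and utils, so you can try it without installing anything. A quick sketch:

```r
# sd() lives in the stats package and head() in the utils package
stats::sd(c(2, 4, 6))      # 2
utils::head(letters, 3)    # "a" "b" "c"
```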

Learning R best practices

How do I get help?

  • Google is your friend (copy and paste your error message)
  • StackOverflow is your friend too
  • stats.stackexchange.com if you have a question about stats that isn’t specific to R programming

Console versus script editor

  • Typing and running individual lines of code is great for exploring
  • It is not as good when you are doing complex data wrangling and analysis
  • You can save scripts (text files of code) to run again later
  • Run individual lines or selected blocks of code from the script editor by pressing Ctrl+Enter (Win) or Cmd+Enter (Mac)

Hey! What about … ?

  • Functions
  • Lists
  • Flow control (if, else, for)

Those are really important things but we aren’t going to cover them in this lesson. I strongly encourage you to explore the R resources I’ve provided to learn more. And maybe I’ll discuss them in a future workshop.

Exercises

Go to the lesson page and try out the exercises!