# Lesson 2: Working with data frames

## Lesson 2 learning objectives

At the end of this lesson, students will …

• Read a CSV file of data into the R workspace.
• Know what a data frame is and understand what constitutes “tidy” data.
• Do basic manipulations and reshaping of data frames.
• Calculate summary statistics by group.

## What is a data frame?

• A data frame corresponds to a table or spreadsheet of data
• Each column has its own data type
• Ideally, each row will correspond to an individual observation
• Workhorses of data manipulation and analysis

• This lesson features functions from the tidyverse group of packages
• tidyverse is are a set of packages, designed to work together, that make it easy to wrangle and manipulate data in R
• Alternatives would be using base R code or data.table
• tidyverse is probably easiest for beginners and very widespread
• I recommend starting here and learning other ways as you get better at R
library(tidyverse)

• If you are using RStudio on the cloud, the file small_mammals.csv is included in the datasets folder
• If running locally, replace datasets with the file path
• For example C:/Users/yourname/Documents/ r_workshop_datasets/small_mammals.csv
• Notice that / (forward slash) is used, even on Windows!
• Use read_csv() from the tidyverse package readr
• This returns a “tibble,” tidyverse’s name for a special class of data frame
dat <- read_csv('datasets/small_mammals.csv')

## Small mammal dataset

• Body measurements of small mammals from the NEON ecological observatory network

## Summary information about a data frame

• Type the name of the data frame into our console
• Prints the first few rows and some information about dimensions
dat
• We can print all the values of dat if we explicitly tell R to do this
• Or print a specified number of rows
print(dat, n = 50)

## Other ways to summarize data frames

• summary() and str() from base R
• glimpse() from tidyverse
• Try calling them and see what output you get
• We can see we have 1629 rows and 7 columns.
• Some character columns (chr) and numeric columns (dbl for double-precision)
• summary(dat) shows counts of missing (NA) values

## Subsetting values from a data frame

• Subset by indexing rows and columns
• Use square brackets []
• Syntax is dataframe[row, column]
• This code returns the 2nd row, 3rd column (actually a 1x1 data frame with that value)
dat[2, 3]

## Subsetting entire rows and columns

• Get an entire row by leaving the column part blank
• Get entire column by leaving the row part blank
• This gives us the 5th row:
dat[5, ]
• This gives us the 2nd column:
dat[, 2]

## Subsetting with ranges

• We can also subset with ranges
• For example this gives us the 6th through 10th rows, including all columns:
dat[6:10, ]
• This gives us the first 5 rows, but only columns 1 and 2.
dat[1:5, 1:2]

## Subsetting columns by name

• You can also get columns by name
• Pass the vector of column names using the c() function, and put quotes around the names
• This gives us the 20th row and the columns called taxonID and sex.
dat[20, c('taxonID', 'sex')]

## The $operator • To extract a single column from a data frame as a vector, you can use the $ operator.
dat$siteName • If the data frame column name contains special characters, surround the name with backticks like dat$column 10.

## Passing data frame columns to functions

• Use the $ operator to pass a single data frame column to a function as input • For example, let’s look at all unique values in the siteName column using the unique() function unique(dat$siteName)
• table() gives you the number of values of each category within a data frame column
table(dat\$sex)

## Subsetting with tidyverse functions: filter()

• The tidyverse function filter() gives you subsets of rows
• A common use of filter() is to subset rows by a condition
• For example we can use filter() to get all rows where weight is less than 10
filter(dat, weight < 10)

## Subsetting with tidyverse functions: select()

• select() gives you subsets of columns
• We do not need to use quotes on column names within select()
select(dat, taxonID, weight)

## Calculating summary statistics

• Next we’ll see how to get summary statistics for entire data frames and subgroups of data
• First we need to learn about how to deal with missing values

## Dealing with missing data

• Summary methods for data frames tell you how many missing values there are
• Many functions that calculate summary statistics return a NA value if there are any missing values
x <- c(5.3, 12.2, -8.6)

max(x)

x <- NA

max(x)

## Ignoring missing values

• Where there’s data, there’s missing data
• There are lots of ways to deal with missing data; the simplest is ignoring it
• Functions mean(), min(), max(), and sum() have an argument na.rm
• If na.rm = TRUE it removes NA values before computing the summary statistic
• Default is FALSE so you need to explicitly set it to TRUE
max(x, na.rm = TRUE)

## The pipe operator

• The pipe operator, or %>%, “pipes” the output of one function into the input of the next function
• Useful for composing expressions with many functions that are complicated but still readable

## Nested versus piped

Nested function call:

sqrt(log(abs(x)))

Piped version of the same function call:

x %>%
abs() %>%
log() %>%
sqrt()
• Can be split across multiple lines
• Order of operations read from top to bottom instead of inside to outside
• A lot easier for people, especially “future you,” to read

## Using pipe to calculate summary statistics

• Summary statistics for entire column of data frame
• The pipe %>% passes dat to the next line of code
dat %>%
summarize(mean_weight = mean(weight),
sd_weight = sd(weight))
• We get only NA because there is at least one missing value in the weight column

### Remove missing values with filter()

• We can use filter() to keep only rows where weight is not missing
• is.na(x) returns a vector of logical values, TRUE if x is missing and FALSE if it is not
• We negate this with the ! operator to get a vector that is TRUE if weight is not missing
• Use the %>% pipe to do two operations on a data frame sequentially.
dat %>%
filter(!is.na(weight)) %>%
summarize(mean_weight = mean(weight),
sd_weight = sd(weight))

### Remove missing values with na.rm

• na.rm argument will also remove missing values
• Result will be identical to the previous
dat %>%
summarize(mean_weight = mean(weight, na.rm = TRUE),
sd_weight = sd(weight, na.rm = TRUE))

## Calculating summary statistics by group

• Use group_by() before summarize()
• Here we remove missing values, group by the taxonID column, and get mean and SD of body weight for each taxon
dat %>%
filter(!is.na(weight)) %>%
group_by(taxonID) %>%
summarize(mean_weight = mean(weight),
sd_weight = sd(weight))

## More complex operations

• filter() by more than one condition and group_by() more than one column
• Put many summary statistics in summarize()
• n() function means number of values
• Example: all non-missing weight values for PELE (Peromyscus leucopus, the white-footed mouse), grouped by sex and life stage.
dat %>%
filter(!is.na(weight), taxonID == 'PELE') %>%
group_by(sex, lifeStage) %>%
summarize(n_individuals = n(),
mean_weight = mean(weight),
sd_weight = sd(weight))
• Notice NA values in the grouping columns (missing value treated as group)
• sd_weight is NA if you try to calculate SD of a single value (undefined)

### More pipe!

Pipe %>% can chain arbitrarily many functions together

dat %>%
filter(!is.na(weight), taxonID == 'PELE') %>%
mutate(weight_mg = weight * 1000) %>%
group_by(siteName) %>%
summarize(mean_weight = mean(weight_mg)) %>%
arrange(mean_weight)

This is equivalent to arrange(summarize(group_by(mutate(filter(dat))))) but much more readable

• mutate() changes the value of a column or creates a new column.
• Here we convert the units of weight, making a new column, weight_mg
• mutate(weight_mg = weight * 1000)
• arrange() sorts the data frame in ascending order by a given column or columns
• arrange(mean_weight) for ascending order
• arrange(-mean_weight) for descending order

## Tidy data and reshaping data frames

Two more functions to learn:

• pivot_longer()

• pivot_wider()

• Next example borrowed from Hadley Wickham’s book R for Data Science (Chapter 12: Tidy data)

## What is tidy data?

• If a dataset is tidy,
• Each observation has its own row
• Each variable has its own column
• Each value has its own cell
• Most model fitting functions in R require the data to be in this format (and many SAS procedures)

### Which of these data frames contain tidy data?

• Example datasets from the tidyr package (part of tidyverse)
• All contain the same data: number of cases of a disease and total population, for three different countries (Afghanistan, Brazil, and China), in each of two years (1999 and 2000)
table1
table2
table3

• Table 1 is the only “tidy” data frame
• Count of disease cases and total population in a country in one specific year is one observation
• Each row of Table 1 includes both grouping columns and both variable columns from one observation

## Reshaping data: pivot_wider()

• pivot_wider() will turn table2 into table1
• Three arguments needed to reshape data from long to wide
• id_cols column(s): identifying information for each observation (country and year). Use c() and unquoted names.
• names_from column(s): the labels that will become column names in wide data (type)
• values_from column(s): the data that will be spread across multiple columns
table2_tidied <- pivot_wider(table2, id_cols = c(country, year), names_from = type, values_from = count)
• all.equal() confirms that we have recreated table1
all.equal(table2_tidied, table1)
• Reshaping table3 to table1 requires more complex operations that we will not cover today

## Reshaping data: pivot_longer()

• Lots of data begins as wide form in a paper data sheet or Excel spreadsheet
• Easier to enter, but needs to be reshaped to long form for analysis
• Example: Billboard top 100 song rankings by week for the year 2000 (also an example dataset from the tidyr package)
• Use the data frame summary tools you now know to explore this dataset
billboard

## Tidying the wide form data

• We want tidy data where each row is an observation:
• observation = a song
• identifying columns: artist name, track name, the date it entered the top 100 ranking, and the week at which the ranking was observed
• data value: the ranking between 1 and 100

## Arguments of pivot_longer()

• cols: columns that need to be pivoted to a longer format
• Shorthand wk1:wk76: all columns from wk1 to wk76
• Or use - sign to specify which columns will not be pivoted (the identifying columns for each observation)
• names_to: the name of the column into which the column names will be put, as a quoted character string
• Defaults to name if not specified
• values_to: the name of the column into which the data values in the cells will be put, as a quoted character string
• Defaults to value if not specified

Putting it together, …

pivot_longer(billboard, cols = c(wk1:wk76), names_to = 'week', values_to = 'ranking')

Alternative, specifying columns not to pivot

pivot_longer(billboard, cols = -c(artist, track, date.entered), names_to = 'week', values_to = 'ranking')