Lesson 2: Working with data frames

Lesson 2 learning objectives

At the end of this lesson, students will …

  • Read a CSV file of data into the R workspace.
  • Know what a data frame is and understand what constitutes “tidy” data.
  • Do basic manipulations and reshaping of data frames.
  • Calculate summary statistics by group.

What is a data frame?

  • A data frame corresponds to a table or spreadsheet of data
  • Each column has its own data type
  • Ideally, each row will correspond to an individual observation
  • Workhorses of data manipulation and analysis

Note about tidyverse

  • This lesson features functions from the tidyverse group of packages
  • tidyverse is are a set of packages, designed to work together, that make it easy to wrangle and manipulate data in R
  • Alternatives would be using base R code or data.table
  • tidyverse is probably easiest for beginners and very widespread
  • I recommend starting here and learning other ways as you get better at R
library(tidyverse)

Reading in data

  • If you are using RStudio on the cloud, the file small_mammals.csv is included in the datasets folder
  • If running locally, replace datasets with the file path
  • For example C:/Users/yourname/Documents/ r_workshop_datasets/small_mammals.csv
  • Notice that / (forward slash) is used, even on Windows!
  • Use read_csv() from the tidyverse package readr
  • This returns a “tibble,” tidyverse’s name for a special class of data frame
dat <- read_csv('datasets/small_mammals.csv')

Small mammal dataset

  • Body measurements of small mammals from the NEON ecological observatory network

Summary information about a data frame

  • Type the name of the data frame into our console
  • Prints the first few rows and some information about dimensions
dat
  • We can print all the values of dat if we explicitly tell R to do this
  • Or print a specified number of rows
print(dat, n = 50)

Other ways to summarize data frames

  • summary() and str() from base R
  • glimpse() from tidyverse
  • Try calling them and see what output you get
  • We can see we have 1629 rows and 7 columns.
  • Some character columns (chr) and numeric columns (dbl for double-precision)
  • summary(dat) shows counts of missing (NA) values

Subsetting values from a data frame

  • Subset by indexing rows and columns
  • Use square brackets []
  • Syntax is dataframe[row, column]
  • This code returns the 2nd row, 3rd column (actually a 1x1 data frame with that value)
dat[2, 3]

Subsetting entire rows and columns

  • Get an entire row by leaving the column part blank
  • Get entire column by leaving the row part blank
  • This gives us the 5th row:
dat[5, ]
  • This gives us the 2nd column:
dat[, 2]

Subsetting with ranges

  • We can also subset with ranges
  • For example this gives us the 6th through 10th rows, including all columns:
dat[6:10, ]
  • This gives us the first 5 rows, but only columns 1 and 2.
dat[1:5, 1:2]

Subsetting columns by name

  • You can also get columns by name
  • Pass the vector of column names using the c() function, and put quotes around the names
  • This gives us the 20th row and the columns called taxonID and sex.
dat[20, c('taxonID', 'sex')]

The $ operator

  • To extract a single column from a data frame as a vector, you can use the $ operator.
dat$siteName
  • If the data frame column name contains special characters, surround the name with backticks like dat$`column 10`.

Passing data frame columns to functions

  • Use the $ operator to pass a single data frame column to a function as input
  • For example, let’s look at all unique values in the siteName column using the unique() function
unique(dat$siteName)
  • table() gives you the number of values of each category within a data frame column
table(dat$sex)

Subsetting with tidyverse functions: filter()

  • The tidyverse function filter() gives you subsets of rows
  • A common use of filter() is to subset rows by a condition
  • For example we can use filter() to get all rows where weight is less than 10
filter(dat, weight < 10)

Subsetting with tidyverse functions: select()

  • select() gives you subsets of columns
  • We do not need to use quotes on column names within select()
select(dat, taxonID, weight)

Calculating summary statistics

  • Next we’ll see how to get summary statistics for entire data frames and subgroups of data
  • First we need to learn about how to deal with missing values

Dealing with missing data

  • Summary methods for data frames tell you how many missing values there are
  • Many functions that calculate summary statistics return a NA value if there are any missing values
x <- c(5.3, 12.2, -8.6)

max(x)

x[2] <- NA

max(x)

Ignoring missing values

  • Where there’s data, there’s missing data
  • There are lots of ways to deal with missing data; the simplest is ignoring it
  • Functions mean(), min(), max(), and sum() have an argument na.rm
  • If na.rm = TRUE it removes NA values before computing the summary statistic
  • Default is FALSE so you need to explicitly set it to TRUE
max(x, na.rm = TRUE)

The pipe operator

  • The pipe operator, or %>%, “pipes” the output of one function into the input of the next function
  • Useful for composing expressions with many functions that are complicated but still readable

Nested versus piped

Nested function call:

sqrt(log(abs(x)))

Piped version of the same function call:

x %>%
  abs() %>%
  log() %>%
  sqrt()
  • Can be split across multiple lines
  • Order of operations read from top to bottom instead of inside to outside
  • A lot easier for people, especially “future you,” to read

Using pipe to calculate summary statistics

  • Summary statistics for entire column of data frame
  • The pipe %>% passes dat to the next line of code
dat %>%
  summarize(mean_weight = mean(weight), 
            sd_weight = sd(weight))
  • We get only NA because there is at least one missing value in the weight column

Remove missing values with filter()

  • We can use filter() to keep only rows where weight is not missing
  • is.na(x) returns a vector of logical values, TRUE if x is missing and FALSE if it is not
  • We negate this with the ! operator to get a vector that is TRUE if weight is not missing
  • Use the %>% pipe to do two operations on a data frame sequentially.
dat %>%
  filter(!is.na(weight)) %>%
  summarize(mean_weight = mean(weight), 
            sd_weight = sd(weight))

Remove missing values with na.rm

  • na.rm argument will also remove missing values
  • Result will be identical to the previous
dat %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE), 
            sd_weight = sd(weight, na.rm = TRUE))

Calculating summary statistics by group

  • Use group_by() before summarize()
  • Here we remove missing values, group by the taxonID column, and get mean and SD of body weight for each taxon
dat %>%
  filter(!is.na(weight)) %>%
  group_by(taxonID) %>%
  summarize(mean_weight = mean(weight), 
            sd_weight = sd(weight))

More complex operations

  • filter() by more than one condition and group_by() more than one column
  • Put many summary statistics in summarize()
  • n() function means number of values
  • Example: all non-missing weight values for PELE (Peromyscus leucopus, the white-footed mouse), grouped by sex and life stage.
dat %>%
  filter(!is.na(weight), taxonID == 'PELE') %>%
  group_by(sex, lifeStage) %>%
  summarize(n_individuals = n(),
            mean_weight = mean(weight), 
            sd_weight = sd(weight))
  • Notice NA values in the grouping columns (missing value treated as group)
  • sd_weight is NA if you try to calculate SD of a single value (undefined)

More pipe!

Pipe %>% can chain arbitrarily many functions together

dat %>%
  filter(!is.na(weight), taxonID == 'PELE') %>%
  mutate(weight_mg = weight * 1000) %>%
  group_by(siteName) %>%
  summarize(mean_weight = mean(weight_mg)) %>%
  arrange(mean_weight)

This is equivalent to arrange(summarize(group_by(mutate(filter(dat))))) but much more readable

  • mutate() changes the value of a column or creates a new column.
    • Here we convert the units of weight, making a new column, weight_mg
    • mutate(weight_mg = weight * 1000)
  • arrange() sorts the data frame in ascending order by a given column or columns
    • arrange(mean_weight) for ascending order
    • arrange(-mean_weight) for descending order

Tidy data and reshaping data frames

Two more functions to learn:

  • pivot_longer()

  • pivot_wider()

  • Next example borrowed from Hadley Wickham’s book R for Data Science (Chapter 12: Tidy data)

What is tidy data?

  • If a dataset is tidy,
    • Each observation has its own row
    • Each variable has its own column
    • Each value has its own cell
  • Most model fitting functions in R require the data to be in this format (and many SAS procedures)

Which of these data frames contain tidy data?

  • Example datasets from the tidyr package (part of tidyverse)
  • All contain the same data: number of cases of a disease and total population, for three different countries (Afghanistan, Brazil, and China), in each of two years (1999 and 2000)
table1
table2
table3

Answer

  • Table 1 is the only “tidy” data frame
  • Count of disease cases and total population in a country in one specific year is one observation
  • Each row of Table 1 includes both grouping columns and both variable columns from one observation

Reshaping data: pivot_wider()

  • pivot_wider() will turn table2 into table1
  • Three arguments needed to reshape data from long to wide
    • id_cols column(s): identifying information for each observation (country and year). Use c() and unquoted names.
    • names_from column(s): the labels that will become column names in wide data (type)
    • values_from column(s): the data that will be spread across multiple columns
table2_tidied <- pivot_wider(table2, id_cols = c(country, year), names_from = type, values_from = count)
  • all.equal() confirms that we have recreated table1
all.equal(table2_tidied, table1)
  • Reshaping table3 to table1 requires more complex operations that we will not cover today

Reshaping data: pivot_longer()

  • Lots of data begins as wide form in a paper data sheet or Excel spreadsheet
  • Easier to enter, but needs to be reshaped to long form for analysis
  • Example: Billboard top 100 song rankings by week for the year 2000 (also an example dataset from the tidyr package)
  • Use the data frame summary tools you now know to explore this dataset
billboard

Tidying the wide form data

  • We want tidy data where each row is an observation:
    • observation = a song
    • identifying columns: artist name, track name, the date it entered the top 100 ranking, and the week at which the ranking was observed
    • data value: the ranking between 1 and 100

Arguments of pivot_longer()

  • cols: columns that need to be pivoted to a longer format
    • Shorthand wk1:wk76: all columns from wk1 to wk76
    • Or use - sign to specify which columns will not be pivoted (the identifying columns for each observation)
  • names_to: the name of the column into which the column names will be put, as a quoted character string
    • Defaults to name if not specified
  • values_to: the name of the column into which the data values in the cells will be put, as a quoted character string
    • Defaults to value if not specified

Putting it together, …

pivot_longer(billboard, cols = c(wk1:wk76), names_to = 'week', values_to = 'ranking')

Alternative, specifying columns not to pivot

pivot_longer(billboard, cols = -c(artist, track, date.entered), names_to = 'week', values_to = 'ranking')

Hey! What about …

  • Joining two or more data frames together
  • Operations on lists of data frames or data frames containing lists

Check out the resources I posted on the workshop site, especially R for Data Science, to learn these essential parts of data wrangling.