At the end of this lesson, students will …

- Read a CSV file of data into the R workspace.
- Know what a data frame is and understand what constitutes “tidy” data.
- Do basic manipulations and reshaping of data frames.
- Calculate summary statistics by group.

- A data frame corresponds to a table or spreadsheet of data
- Each column has its own data type
- Ideally, each row will correspond to an individual observation
- Workhorses of data manipulation and analysis

- This lesson features functions from the
**tidyverse**group of packages - tidyverse is are a set of packages, designed to work together, that make it easy to wrangle and manipulate data in R
- Alternatives would be using base R code or
**data.table** - tidyverse is probably easiest for beginners and very widespread
- I recommend starting here and learning other ways as you get better at R

- If you are using RStudio on the cloud, the file
`small_mammals.csv`

is included in the`datasets`

folder - If running locally, replace
`datasets`

with the file path - For example
`C:/Users/yourname/Documents/`

`r_workshop_datasets/small_mammals.csv`

- Notice that
`/`

(forward slash) is used, even on Windows! - Use
`read_csv()`

from the tidyverse package**readr** - This returns a “tibble,” tidyverse’s name for a special class of data frame

- Body measurements of small mammals from the NEON ecological observatory network

- Type the name of the data frame into our console
- Prints the first few rows and some information about dimensions

- We can print all the values of
`dat`

if we explicitly tell R to do this - Or print a specified number of rows

`summary()`

and`str()`

from base R`glimpse()`

from tidyverse- Try calling them and see what output you get
- We can see we have 1629 rows and 7 columns.
- Some character columns (
`chr`

) and numeric columns (`dbl`

for double-precision) `summary(dat)`

shows counts of missing (`NA`

) values

- Subset by indexing rows and columns
- Use square brackets
`[]`

- Syntax is
`dataframe[row, column]`

- This code returns the 2nd row, 3rd column (actually a 1x1 data frame with that value)

- Get an entire row by leaving the
`column`

part blank - Get entire column by leaving the
`row`

part blank - This gives us the 5th row:

- This gives us the 2nd column:

- We can also subset with ranges
- For example this gives us the 6th through 10th rows, including all columns:

- This gives us the first 5 rows, but only columns 1 and 2.

- You can also get columns by name
- Pass the vector of column names using the
`c()`

function, and put quotes around the names - This gives us the 20th row and the columns called
`taxonID`

and`sex`

.

- To extract a single column from a data frame as a vector, you can use the
`$`

operator.

- If the data frame column name contains special characters, surround the name with backticks like
`dat$`column 10``

.

- Use the
`$`

operator to pass a single data frame column to a function as input - For example, let’s look at all unique values in the
`siteName`

column using the`unique()`

function

`table()`

gives you the number of values of each category within a data frame column

- The tidyverse function
`filter()`

gives you subsets of rows - A common use of
`filter()`

is to subset rows by a condition - For example we can use
`filter()`

to get all rows where`weight`

is less than 10

`select()`

gives you subsets of columns- We do not need to use quotes on column names within
`select()`

- Next we’ll see how to get summary statistics for entire data frames and subgroups of data
- First we need to learn about how to deal with missing values

- Summary methods for data frames tell you how many missing values there are
- Many functions that calculate summary statistics return a
`NA`

value if there are any missing values

- Where there’s data, there’s missing data
- There are lots of ways to deal with missing data; the simplest is ignoring it
- Functions
`mean()`

,`min()`

,`max()`

, and`sum()`

have an argument`na.rm`

- If
`na.rm = TRUE`

it removes`NA`

values before computing the summary statistic - Default is
`FALSE`

so you need to explicitly set it to`TRUE`

- The pipe operator, or
`%>%`

, “pipes” the output of one function into the input of the next function - Useful for composing expressions with many functions that are complicated but still readable

Nested function call:

`sqrt(log(abs(x)))`

Piped version of the same function call:

```
x %>%
abs() %>%
log() %>%
sqrt()
```

- Can be split across multiple lines
- Order of operations read from top to bottom instead of inside to outside
- A lot easier for people, especially “future you,” to read

- Summary statistics for entire column of data frame
- The pipe
`%>%`

passes`dat`

to the next line of code

- We get only
`NA`

because there is at least one missing value in the`weight`

column

- We can use
`filter()`

to keep only rows where weight is not missing `is.na(x)`

returns a vector of logical values,`TRUE`

if`x`

is missing and`FALSE`

if it is not- We negate this with the
`!`

operator to get a vector that is`TRUE`

if`weight`

is not missing - Use the
`%>%`

pipe to do two operations on a data frame sequentially.

`na.rm`

argument will also remove missing values- Result will be identical to the previous

- Use
`group_by()`

before`summarize()`

- Here we remove missing values, group by the
`taxonID`

column, and get mean and SD of body weight for each taxon

`filter()`

by more than one condition and`group_by()`

more than one column- Put many summary statistics in
`summarize()`

`n()`

function means number of values- Example: all non-missing weight values for
`PELE`

(*Peromyscus leucopus*, the white-footed mouse), grouped by sex and life stage.

- Notice
`NA`

values in the grouping columns (missing value treated as group) `sd_weight`

is`NA`

if you try to calculate SD of a single value (undefined)

Pipe `%>%`

can chain arbitrarily many functions together

This is equivalent to `arrange(summarize(group_by(mutate(filter(dat)))))`

but much more readable

`mutate()`

changes the value of a column or creates a new column.- Here we convert the units of weight, making a new column,
`weight_mg`

`mutate(weight_mg = weight * 1000)`

- Here we convert the units of weight, making a new column,
`arrange()`

sorts the data frame in ascending order by a given column or columns`arrange(mean_weight)`

for ascending order`arrange(-mean_weight)`

for descending order

Two more functions to learn:

`pivot_longer()`

`pivot_wider()`

Next example borrowed from Hadley Wickham’s book

*R for Data Science*(Chapter 12: Tidy data)

- If a dataset is tidy,
- Each observation has its own row
- Each variable has its own column
- Each value has its own cell

- Most model fitting functions in R require the data to be in this format (and many SAS procedures)

- Example datasets from the
**tidyr**package (part of tidyverse) - All contain the same data: number of cases of a disease and total population, for three different countries (Afghanistan, Brazil, and China), in each of two years (1999 and 2000)

- Table 1 is the only “tidy” data frame
- Count of disease cases and total population in a country in one specific year is one observation
- Each row of Table 1 includes both grouping columns and both variable columns from one observation

`pivot_wider()`

will turn`table2`

into`table1`

- Three arguments needed to reshape data from long to wide
`id_cols`

column(s): identifying information for each observation (country and year). Use`c()`

and unquoted names.`names_from`

column(s): the labels that will become column names in wide data (type)`values_from`

column(s): the data that will be spread across multiple columns

`all.equal()`

confirms that we have recreated`table1`

- Reshaping
`table3`

to`table1`

requires more complex operations that we will not cover today

- Lots of data begins as wide form in a paper data sheet or Excel spreadsheet
- Easier to enter, but needs to be reshaped to long form for analysis

- Example: Billboard top 100 song rankings by week for the year 2000 (also an example dataset from the
**tidyr**package) - Use the data frame summary tools you now know to explore this dataset

- We want tidy data where each row is an observation:
- observation = a song
- identifying columns: artist name, track name, the date it entered the top 100 ranking, and the week at which the ranking was observed
- data value: the ranking between 1 and 100

`cols`

: columns that need to be pivoted to a longer format- Shorthand
`wk1:wk76`

: all columns from`wk1`

to`wk76`

- Or use
`-`

sign to specify which columns will*not*be pivoted (the identifying columns for each observation)

- Shorthand
`names_to`

: the name of the column into which the column names will be put, as a quoted character string- Defaults to
`name`

if not specified

- Defaults to
`values_to`

: the name of the column into which the data values in the cells will be put, as a quoted character string- Defaults to
`value`

if not specified

- Defaults to

Putting it together, …

Alternative, specifying columns *not* to pivot

- Joining two or more data frames together
- Operations on lists of data frames or data frames containing lists

Check out the resources I posted on the workshop site, especially *R for Data Science*, to learn these essential parts of data wrangling.