# Data viz basics with R and ggplot2

## Introduction

• How to visualize your data with the R package ggplot2
• Originally developed by Hadley Wickham, now developed and maintained by a team of experts
• Good for everything from quick exploratory plots to carefully formatted publication-quality graphics
• This lesson assumes beginner-level knowledge of R (functions, data frames)

## Learning objectives

At the end of this course, you will know …

• what the “grammar of graphics” is and how ggplot2 uses it
• how to map a variable in a data frame to a graphical element in a plot using aes
• how to use different geoms to make scatterplots, boxplots, histograms, density plots, and barplots
• how to add trendlines to a plot
• how to compute summary statistics and plot them with stat functions
• how to make plots for different subsets of your data using facets
• how to change the style and appearance of your plots

## What is the grammar of graphics?

• Theory underlying ggplot2 is the “grammar of graphics” (Leland Wilkinson)
• Formal way of mapping variables in a dataset to graphical elements of a plot
• For example, a dataset has age, weight, and sex of many individuals. Make a scatterplot:
• age variable in the data maps to the x axis of the dataset
• weight variable maps to the y axis
• sex variable maps to the color of the points

## The data

Three datasets from Kaggle

• We will use only the ggplot2 package in this tutorial
• Use read.csv() to read in each of the three datasets from the web
library(ggplot2)

olympics <- read.csv('https://github.com/qdread/data-viz-basics/raw/main/datasets/olympics.csv')
• You can use head(), summary(), or str() to examine each dataset

## Our first ggplot

• Scatterplot: does money buy happiness?
• WHR dataset has a row for each country
• Plot GDP per capita on the x axis and happiness score on the y axis
• Start by calling the ggplot() function with data argument
ggplot(data = WHR)

• Specify which columns of the data will map to the x and y elements of the plot
ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score))
• We see the two axes and coordinate system
• But no data appears

ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) + geom_point()
• Notice we use + to add each new piece of the plotting code.

### Modifying the plot: changing the geom

• Changing the geom plots the same data in a different way
• Try replacing geom_point() with geom_line()
ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) + geom_line()
• Doesn’t make a lot of sense in this case but it is great for time series data
• We can add multiple geoms if we want
• Plot a smoothing trendline (geom_smooth()) overlaid on the scatterplot
ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) +
geom_point() +
geom_smooth()
• The geoms are drawn in the order they are added to the plot

Putting each piece on its own line makes the code easier to read!

### Linear trendline

• By default, geom_smooth() plots a locally weighted regression with standard error as a shaded area
• Make trend linear by specifying method = lm
• Get rid of the standard error shading by specifying se = FALSE
ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) +
geom_point() +
geom_smooth(method = lm, se = FALSE)

### Modifying the plot: changing the aes

• Adding to or changing the aes argument modifies what data are used to plot
• Let’s add a color aesthetic to the point plot to color each country’s point by continent
ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score, color = Continent)) +
geom_point() +
geom_smooth(method = lm, se = FALSE)
• We automatically get a legend. But this also automatically groups the trendline by region as well
• If we have multiple geoms, we can add an aes argument to a single geom
• For example if we want the points to be colored by continent but a single overall trendline, add aes(color = Continent) inside geom_point()
ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) +
geom_point(aes(color = Continent)) +
geom_smooth(method = lm, se = FALSE)
• If we add arguments to the geoms outside aes(), it will modify their appearance without mapping back to the data
• For example bigger points (size = 2), and black trendline (color = 'black')
ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) +
geom_point(aes(color = Continent), size = 2) +
geom_smooth(method = lm, se = FALSE, color = 'black')

The default for geom_point is size = 1

### Modifying the plot: changing the theme

• Themes change the overall look of the plot
happy_gdp_plot <- ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) +
geom_point(aes(color = Continent), size = 2) +
geom_smooth(method = lm, se = FALSE, color = 'black')

happy_gdp_plot + theme_bw()
• theme_bw() changes the default theme to a black-and-white theme.
• Note: we can assign the result of ggplot() to an object, in this case happy_gdp_plot
• We can add things to an existing ggplot object and print it
• You can set a global theme for all plots created in your current R session by using theme_set()
• Do this now so we don’t have to look at the ugly gray default theme!
theme_set(theme_bw())
• Use theme() to add specific theme arguments to a plot
• Here we move the legend to the bottom, remove the gridlines, and change appearance of different text elements
happy_gdp_plot <- happy_gdp_plot +
theme(panel.grid = element_blank(),
legend.position = 'bottom',
axis.text = element_text(color = 'black', size = 14),
axis.title = element_text(face = 'bold', size = 14))

happy_gdp_plot

### Modifying the plot: changing scales

• Scales modify how data are mapped to graphics
• For example, change the range, breaks, and labels of the x and y axes
• Or change the color palette used to color the points
• You can optionally add a scale for each aes mapping in your plot
• Add x scale with a title and specific labeled breaks
happy_gdp_plot +
scale_x_continuous(name = 'GDP per capita', breaks = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5))
• Add y scale with a title and specific range
happy_gdp_plot +
scale_x_continuous(name = 'GDP per capita', breaks = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5)) +
scale_y_continuous(name = 'happiness score', limits = c(2, 8))
• Change the color palette for continents
happy_gdp_plot +
scale_x_continuous(name = 'GDP per capita', breaks = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5)) +
scale_y_continuous(name = 'happiness score', limits = c(2, 8)) +
scale_color_viridis_d()

scale_color_viridis_d() is a color scheme that can be distinguished by colorblind people.

## Boxplots

• Visualize distributions grouped by discrete categories
• geom_boxplot() to see distribution of happiness by continent
ggplot(data = WHR, aes(x = Continent, y = Happiness.Score)) +
geom_boxplot()
• Modify appearance of boxplots
• Fill color (fill), line color (color), line thickness (size = 1.5 where the default is 1)
ggplot(data = WHR, aes(x = Continent, y = Happiness.Score)) +
geom_boxplot(fill = 'forestgreen', color = 'gray25', size = 1.5)
• Modify the outlier points
• Bigger (outlier.size = 2) and 70% transparent (outlier.alpha = 0.7)
ggplot(data = WHR, aes(x = Continent, y = Happiness.Score)) +
geom_boxplot(fill = 'forestgreen', color = 'gray25', size = 1.5,
outlier.size = 2, outlier.alpha = 0.7) 

## Histograms and density plots

• Now using the cereal dataset
• Distribution of grams of sugar per serving in each type of breakfast cereal (sugars)
• We only need to map an x aesthetic
• y value computed internally by ggplot()
ggplot(data = cereal, aes(x = sugars)) +
geom_histogram()
• Reduce the number of bins to 10 (default is 30)
ggplot(data = cereal, aes(x = sugars)) +
geom_histogram(bins = 10)
• By default, the y axis has a small gap between the highest and lowest value and the edge of the plot
• Great for scatterplots but doesn’t look good for histograms
• Change this by adding a y axis scale with an expand argument
ggplot(data = cereal, aes(x = sugars)) +
geom_histogram(bins = 10) +
scale_y_continuous(expand = expansion(add = c(0, 1)))
• expansion(add = c(0, 1)) indicates 0 units of padding at the low end of the axis, and 1 unit at the high end
• Change the fill color of the histogram bar
• Add a title and subtitle to the plot with ggtitle()
sugar_hist <- ggplot(data = cereal, aes(x = sugars)) +
geom_histogram(bins = 10, fill = 'slateblue') +
scale_y_continuous(expand = expansion(add = c(0, 1))) +
ggtitle('Distribution of grams of sugar per serving', 'for 77 brands of breakfast cereal')

sugar_hist
• Use labs() function to change axis label(s) without having to specify the entire scale
sugar_hist +
labs(x = 'sugar (g/serving)')

### Kernel density plot

• An alternative to the histogram is a smoothed kernel density plot
• Look at the distribution of calories per serving with a density plot
ggplot(data = cereal, aes(x = calories)) +
stat_density()

Functions beginning stat_*() will compute some summary statistic or function based on the data and plot it.

• Tweak the bandwidth parameter (adjust) of the density smoothing algorithm
• Higher adjust (default is 1) gives you a smoother curve
ggplot(data = cereal, aes(x = calories)) +
stat_density(adjust = 2)
• We can specify the geom to be used for stat_density()
• Default is 'polygon' but we can change it to 'line'
ggplot(data = cereal, aes(x = calories)) +
stat_density(adjust = 2, geom = 'line')
• Get rid of the gaps on either end of the x-axis and below the 0 line on the y-axis
• Change the width and color of the line
• Add a title for the plot and x-axis
ggplot(data = cereal, aes(x = calories)) +
stat_density(adjust = 2, geom = 'line', linewidth = 1.2, color = 'forestgreen') +
scale_y_continuous(expand = expansion(mult = c(0, 0.02))) +
scale_x_continuous(expand = expansion(mult = c(0, 0)), name = 'calories per serving') +
ggtitle('Distribution of calories per serving', 'for 77 brands of breakfast cereal')

## Facets

• ggplot2 can easily group data and create multiple subplots
• Subset of the olympics track and field medals dataset (four countries)
olympics_best <- subset(olympics, Country %in% c('USA', 'GBR', 'URS', 'JAM'))
• Make a bar plot of the three types of medals awarded
• geom_bar() function used to compute counts within each category
ggplot(olympics_best, aes(x = Medal)) +
geom_bar(stat = 'count')
• Use facet_wrap() function with a one-sided formula ~ Country to make a separate plot for each country
ggplot(olympics_best, aes(x = Medal)) +
geom_bar(stat = 'count') +
facet_wrap(~ Country)
• By default, all the facets have the same axis limits
• We can make each facet have its own limits by setting scales = 'free_y'
• (You can also use scales = 'free_x', or scales = 'free' to let both x and y limits vary)
ggplot(olympics_best, aes(x = Medal)) +
geom_bar(stat = 'count') +
facet_wrap(~ Country, scales = 'free_y')
• Facet in both directions using facet_grid() with a two-sided formula
• We can show the medal tally for each country by gender in a 4x2 plot:
ggplot(olympics_best, aes(x = Medal)) +
geom_bar(stat = 'count') +
facet_grid(Country ~ Gender, scales = 'free_y')
• To change order of medals from alphabetical to Bronze, Silver, Gold, we change the underlying data
• Make Medal into a factor and specify order of factor levels
• Now when you plot, the bars will appear in the order we set
olympics_best$Medal <- factor(olympics_best$Medal, levels = c('Bronze', 'Silver', 'Gold'))

ggplot(olympics_best, aes(x = Medal)) +
geom_bar(stat = 'count') +
facet_grid(Country ~ Gender, scales = 'free_y')
• Add theme elements, title, and subtitle
• Add custom fill scale to fill each medal bar with the appropriate color
ggplot(olympics_best, aes(x = Medal, fill = Medal)) +
geom_bar(stat = 'count') +
facet_grid(Country ~ Gender, scales = 'free_y') +
ggtitle('Track and field medal tallies for selected countries, 1896-2014', subtitle = 'data source: The Guardian') +
scale_y_continuous(expand = expansion(add = c(0, 2))) +
scale_fill_manual(values = c(Bronze = 'chocolate4', Silver = 'gray60', Gold = 'goldenrod')) +
theme(legend.position = 'none',
strip.background = element_blank(),
panel.grid = element_blank())

## Writing your ggplot to a file

• ggsave() writes your plot to a file.
• Automatically determines the file type based on extension on the file name
• Sets size and resolution of the image by default
# Not run
ggsave('~/plots/sugar_histogram.png', sugar_hist)
• Specify the resolution in dpi (dots per inch)
• Size of image using height and width
• Default units are inches but we can change this
# Not run
ggsave('~/plots/sugar_histogram.png', sugar_hist, dpi = 400, height = 4, width = 4)
ggsave('~/plots/sugar_histogram.png', sugar_hist, dpi = 400, height = 9, width = 9, units = 'cm')

We can use other file types such as PDF (vector graphics, so you don’t supply a resolution)

# Not run
ggsave('~/plots/sugar_histogram.pdf', sugar_hist, height = 4, width = 4)

If you don’t specify a plot object, ggsave saves the last ggplot you plotted in the plotting window to the file

ggsave('~/plots/my_plot.pdf', height = 4, width = 4)

## Going further

• We’ve only scratched the surface of ggplot2
• ggplot2 itself does not support every possible visualization
• open-source ecosystem of extension packages people have written as add-ons

### ggthemes

• package with a bunch of themes not included in base ggplot2
• example: draw our sugar histogram plot in the style of The Economist, FiveThirtyEight.com, and Edward Tufte
library(ggthemes)

sugar_hist + theme_economist()
sugar_hist + theme_fivethirtyeight()
sugar_hist + theme_tufte()

### gghighlight

• Highlight outliers or other specific bits of our data automatically
• example: happiness versus GDP plot, “happiest” and “saddest” countries highlighted and labeled
library(gghighlight)

ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) +
geom_point() +
gghighlight(Happiness.Score < 3 | Happiness.Score > 7.55, label_key = Country, label_params = list(size = 5))

### ggpubr

• simplifies the production of publication-ready plots
• example: happiness by continent boxplot, with raw data as jittered points, and manually added significance indicators
library(ggpubr)

ggboxplot(data = WHR, x = 'Continent', y = 'Happiness.Score', color = 'Continent', add = 'jitter',
palette = unname(palette.colors(4)),
ylab = 'Happiness Score', ylim = c(2.5, 8.5)) +
geom_bracket(xmin = 'Europe', xmax = 'Africa', label = '*', y.position = 8.3, label.size = 8) +
geom_bracket(xmin = 'Americas', xmax = 'Africa', label = '*', y.position = 7.7, label.size = 8)

## Conclusion

What did we just learn? Let’s revisit the learning objectives!

• what the “grammar of graphics” is and how ggplot2 uses it
• how to map a variable in a data frame to a graphical element in a plot using aes
• how to use different geoms to make scatterplots, boxplots, histograms, density plots, and barplots
• how to add trendlines to a plot
• how to compute summary statistics and plot them with stat functions
• how to make plots for different subsets of your data using facets
• how to change the style and appearance of your plots

Now get out there and visualize some data!