Exploring Data with tidyverse

Data Manipulation and Basic Charts

1. Data Description

We will use two data sets for illustration.

1.1 Chicago Redlining

Insurance redlining refers to the practice of refusing to issue insurance to certain types of people or within some geographic area. The name comes from the act of drawing a red line around an area on a map. Now few would quibble with an insurance company refusing to sell auto insurance to a frequent drunk driver, but other forms of discrimination would be unacceptable.

In the late 1970s, the US Commission on Civil Rights examined charges by several Chicago community organizations that insurance companies were redlining their neighborhoods. Because comprehensive information about individuals being refused homeowners insurance was not available, the number of FAIR plan policies written and renewed in Chicago by ZIP code for the months of December 1977 through May 1978 was recorded. The FAIR plan was offered by the city of Chicago as a default policy to homeowners who had been rejected by the voluntary market. Information on other variables that might affect insurance writing such as fire and theft rates was also collected at the ZIP code level. The variables are:

  • involact new FAIR plan policies and renewals per 100 housing units,
  • fire fires per 100 housing units,
  • theft thefts per 1000 population,
  • race racial composition in percentage of minority,
  • age (of housing) percentage of housing units built before 1939,
  • income median family income in thousands of dollars,
  • side north or south side of Chicago.

The variable involact acts as a measure of insurance availability in the voluntary market, since most FAIR plan policyholders secure such coverage only after they have been rejected by the voluntary market. Insurance companies claim to reject insurances based on their past losses (captured in variables theft and fire). The U.S. Commission on Civil Rights in 1979 was interested in how much income, age of housing, and in particular race affect insurance availability.

library(faraway)
data(chredlin, package="faraway") # attaches the data from the faraway package
head(chredlin)
      race fire theft  age involact income side
60626 10.0  6.2    29 60.4      0.0 11.744    n
60640 22.2  9.5    44 76.5      0.1  9.323    n
60613 19.6 10.5    36 73.5      1.2  9.948    n
60657 17.3  7.7    37 66.9      0.5 10.656    n
60614 24.5  8.6    53 81.4      0.7  9.730    n
60610 54.0 34.1    68 52.6      0.3  8.231    n

1.2 Flights

The following data set contains about 1/4 million flights that departed from New York City in 2014.

library(data.table)
flights <- fread("https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv")
# 1/4 million rows, so subsample

The variables are self-explanatory in this case

Since the data has so many rows, let’s sub-sample it.

# flights <- flights[sample(1:dim(flights)[1], 5000),] # base R version of sub-sampling
library(tidyverse)
flights <- flights %>%
  slice_sample(n=5000)                               # tidyverse version of sub-sampling
head(flights)
   year month day dep_delay arr_delay carrier origin dest air_time distance
1: 2014     7  26        -6       -11      MQ    LGA  CMH       71      479
2: 2014     8   3        24        17      UA    EWR  ORD      100      719
3: 2014     1   7        66        54      DL    JFK  LAS      318     2248
4: 2014    10  23        -2       -26      AA    LGA  ORD      106      733
5: 2014     1  19         9       -10      B6    EWR  FLL      141     1065
6: 2014    10  20        49        41      MQ    LGA  ORF       55      296
   hour
1:    8
2:   10
3:   21
4:   17
5:   13
6:   16

2. Data Manipulation

The basic functions which will allow us to do most of the data manipulation tasks are

  • filter picks observations by chosen values
    • select picks variables (by their names), hence is a poor “transpose” of filter
  • mutate creates new variables as functions of existing ones
  • arrange orders the observations
  • group_by allows all of the above to work locally
    • summarize collapses values down to a single summary – only useful together with group_by

First, let us drop the non-numerical variables. We can either select all but what we want to drop or, more conveniently, use the minus-notation to specify directly what we want to drop.

flights <- select(flights, -carrier, -origin, -dest)
head(flights)
   year month day dep_delay arr_delay air_time distance hour
1: 2014     7  26        -6       -11       71      479    8
2: 2014     8   3        24        17      100      719   10
3: 2014     1   7        66        54      318     2248   21
4: 2014    10  23        -2       -26      106      733   17
5: 2014     1  19         9       -10      141     1065   13
6: 2014    10  20        49        41       55      296   16

As an example, let us filter flights that departed on the April Fool’s day:

fools <- filter(flights, month == 4, day == 1) # all flights on April 1st
summarize(fools, n()) # in our sub-sample, we have 17 flights on April 1st
  n()
1  17

When working with tidyverse, it is customary to use the “pipe” operator %>%. Basically, f(x,y) is equivalent to x %>% f(y). This is useful when chaining operations (composing functions), e.g.,

g(f(x,y),z)  # is equivalent to
x %>% f(y) %>% g(z)

While mathematically we are composing functions, for data manipulation it is often useful to think about chaining operations, i.e., using the pipe %>%.

For example, the two lines of code above can be tweaked to not store the fools variable by either composing functions or chaining operations:

summarize(filter(flights, month == 4, day == 1), n()) # or
flights %>% filter(month ==4, day == 1) %>% summarize(n())

What if we want to know the number of flights on any given day?

flights %>% 
  group_by(month, day) %>% 
  summarize(n()) %>% 
  head()
# A tibble: 6 × 3
# Groups:   month [1]
  month   day `n()`
  <int> <int> <int>
1     1     1     9
2     1     2    11
3     1     3     5
4     1     4     9
5     1     5    16
6     1     6    12

The original data set was arranged by month, day and hour, but we have lost this ordering by sub-sampling.

head(flights)
   year month day dep_delay arr_delay air_time distance hour
1: 2014     7  26        -6       -11       71      479    8
2: 2014     8   3        24        17      100      719   10
3: 2014     1   7        66        54      318     2248   21
4: 2014    10  23        -2       -26      106      733   17
5: 2014     1  19         9       -10      141     1065   13
6: 2014    10  20        49        41       55      296   16

So let’s fix this by arrange().

flights <- flights %>% 
  arrange(month, day, hour)
head(flights)
   year month day dep_delay arr_delay air_time distance hour
1: 2014     1   1        -2         3      145      937    8
2: 2014     1   1         0        -8       96      529    9
3: 2014     1   1        -4       -16      247     1504    9
4: 2014     1   1         0        22      144      711   12
5: 2014     1   1         3         1       39      187   12
6: 2014     1   1         8        -3      347     2475   18

Let’s read and sub-sample the data again

flights <- fread("https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv")
flights <- flights %>%
  slice_sample(n=5000)
str(flights)
Classes 'data.table' and 'data.frame':  5000 obs. of  11 variables:
 $ year     : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
 $ month    : int  2 9 7 7 6 6 1 4 1 10 ...
 $ day      : int  27 26 5 1 28 11 23 10 28 21 ...
 $ dep_delay: int  -2 -3 63 -6 -7 72 65 3 64 13 ...
 $ arr_delay: int  -12 -18 49 10 -23 60 38 -1 54 -1 ...
 $ carrier  : chr  "US" "DL" "B6" "MQ" ...
 $ origin   : chr  "EWR" "EWR" "JFK" "LGA" ...
 $ dest     : chr  "PHX" "DTW" "SFO" "CMH" ...
 $ air_time : int  301 77 320 87 113 190 225 152 116 96 ...
 $ distance : int  2133 488 2586 479 799 1598 1620 1069 719 647 ...
 $ hour     : int  10 5 21 17 15 21 17 6 19 13 ...
 - attr(*, ".internal.selfref")=<externalptr> 

We may notice that carrier, origin and dest are characters, but they should actually be factors. We can use mutate() to rectify that:

flights <- flights %>%
  mutate(carrier = as.factor(carrier),
          origin = as.factor(origin),
            dest = as.factor(dest))
str(flights) # to see the overall structure of the object flights
Classes 'data.table' and 'data.frame':  5000 obs. of  11 variables:
 $ year     : int  2014 2014 2014 2014 2014 2014 2014 2014 2014 2014 ...
 $ month    : int  2 9 7 7 6 6 1 4 1 10 ...
 $ day      : int  27 26 5 1 28 11 23 10 28 21 ...
 $ dep_delay: int  -2 -3 63 -6 -7 72 65 3 64 13 ...
 $ arr_delay: int  -12 -18 49 10 -23 60 38 -1 54 -1 ...
 $ carrier  : Factor w/ 14 levels "AA","AS","B6",..: 12 4 3 9 5 4 14 3 11 9 ...
 $ origin   : Factor w/ 3 levels "EWR","JFK","LGA": 1 1 2 3 1 2 3 2 1 3 ...
 $ dest     : Factor w/ 99 levels "ABQ","ACK","ALB",..: 69 32 85 25 57 87 29 34 64 98 ...
 $ air_time : int  301 77 320 87 113 190 225 152 116 96 ...
 $ distance : int  2133 488 2586 479 799 1598 1620 1069 719 647 ...
 $ hour     : int  10 5 21 17 15 21 17 6 19 13 ...
 - attr(*, ".internal.selfref")=<externalptr> 

The verb mutate() can also be used to create new variables from existing variables. For example, we might want to calculate what would the air-time be if there were no delays, or we might want to create a single departure time column using day, month and hour.

flights <- flights %>%
  mutate(no_delay_air_time = air_time + dep_delay - arr_delay,
         HHDDMMYYYY = 10^8*hour + 10^6*day + 10^4*month + year)

Note: the way we created the HHDDMMYYYY is only for illustration purposes, there are better ways to handle time variables!

3. Basic Charts

3.1 Histograms

Basic bar plot showing number of flights for different carriers:

ggplot(data = flights) + 
  geom_bar(mapping = aes(x = carrier))

We can split every bar by departure airport (i.e., variable origin) by adding the fill argument.

ggplot(data = flights) + 
  geom_bar(mapping = aes(x = carrier, fill = origin))

If we wanted one histogram for each NY airport instead, we could use facet_wrap:

ggplot(data = flights) + 
  geom_bar(mapping = aes(x = carrier)) +
  facet_wrap(~ origin)

What if we wanted to add numbers denoting the bin sizes?

ggplot(data = flights) + 
  geom_bar(mapping = aes(x = carrier)) +
  stat_count(mapping = aes(x = carrier, label=..count.., y=..count.. + 40), geom="text", size=4, color="blue")

To show labels for split bars above, one just adds group=origin in the stat_count’s aes, but it is impossible to vertically align the text properly. We bypass it by creating a new variable veralign, which controls the vertical alignment.

n_flights <- flights %>%
  group_by(origin, carrier) %>%
  summarise(count = n()) %>% # now we have a table with the hist cell counts
  ungroup() %>%
  group_by(carrier) %>%
  arrange(desc(origin)) %>% # needed because histogram ordered alphabetically from the top
  mutate(veralign = cumsum(count) - count/2) %>% # here we calculate the vertical alignment and append it to the data set
  ungroup() # drop the grouping

ggplot() + 
  geom_bar(data = flights, mapping = aes(x = carrier, fill = origin)) +
  geom_text(data=n_flights, mapping = aes(x = carrier, label=count, y=veralign), size=4)

As long as we are plotting frequencies of occurrence, there is little difference between bar plots and histograms. If the variable on the x-axis is categorical, we speak of bar plots, while if the variable is numerical and some binning has to be done, we speak of histograms. For example:

library(viridis)
flights %>%
  ggplot( aes(x=arr_delay_log, fill=origin)) +
    geom_histogram( color="#e9ecef", alpha=0.6) +
    scale_fill_viridis(discrete = TRUE) +
    labs(fill="")

But we may also scale the y-axis to show probabilities (total area equal to one), instead of frequencies (total area equal to the number of observations). In such a case, histogram can be considered as an estimator of density function (more on that later in the semester). Hence histograms are often overlaid with other density estimators. We will see that later.

3.2 Scatterplots

Here we take a look again at the insurance redlining data, where we are interested in the effect of race on the response variable. Let’s start with the marginal relationship of the response on race, which is what we are mostly interested in. We can distinguish

ggplot(data = chredlin,
       mapping = aes(x = race, y = involact, color = side, shape = side)) +
  geom_point() + scale_color_manual(values = c("blue","red")) +
  scale_shape_manual(values = c(1,2))

ggplot(data = chredlin, mapping = aes(x = race, y = involact,
                                      shape = side, color = side)) +
  geom_point() + # adds a layer to the empty plot above
  stat_smooth(method=lm) +# adds a regression line too 
  scale_color_manual(values = c("blue","red")) +
  scale_shape_manual(values = c(1,2))

There is probably no difference between north and south w.r.t. the relationship between the response and race, but it might be a different story for fire

ggplot(data = chredlin, mapping = aes(x = fire, y = involact, shape = side, color = side)) +
  geom_point() + # adds a layer to the empty plot above
  stat_smooth(method=lm)+ # adds a regression line too
  scale_color_manual(values = c("blue","red")) +
  scale_shape_manual(values = c(1,2))

What if I want only a single regression line but ability to distinguish between points at the same time? I could either use different mappings for every layer

ggplot(data = chredlin) +
  geom_point(mapping = aes(x = fire, y = involact, shape = side, color = side)) +
  stat_smooth(method=lm, mapping=aes(x = fire, y = involact), color = "purple")+
  scale_color_manual(values = c("blue","red")) +
  scale_shape_manual(values = c(1,2))

or, equivalently, use one global mapping in ggplot (the most general one) and then specify it a bit more for some layers

ggplot(data = chredlin, mapping=aes(x = fire, y = involact)) +
  geom_point(mapping = aes(shape = side, color = side)) +
  stat_smooth(method=lm, color = "purple")

If we want to split into multiple plots, use the facet_wrap function

ggplot(data = chredlin, mapping = aes(x = fire, y = involact, shape = side, color = side)) +
  geom_point() +
  stat_smooth(method=lm, color = "purple") +
  facet_wrap(~ side) # split by a value of a factor

For faceting by two factor variables, we can use facet_grid instead, such as if we wanted to create scatterplots for our flights data, split by carrier and origin (we filter only some carriers so we don’t have too many plots).

flights %>%
  filter(carrier %in% c("AA", "UA", "DL", "US")) %>%
  ggplot(mapping = aes(x = dep_delay, y = arr_delay)) +
  geom_point() +
  stat_smooth(method=lm) +
  facet_grid(origin ~ carrier)

Exercise: create histograms of dep_delay for all three origins and seven consecutive days in the flights data set.

Back to the chredlin data set, we take a closer look at some observations in the scatterplot of involact and fire, that might be outliers or leverage points:

outliers <- c(3,6,35)
leverage_pts <- c(7,24)
outliers <- chredlin[outliers,]
leverage_pts <- chredlin[leverage_pts,]

Let’s now make some scatterplots with the outliers and leverage points highlighted by overploting.

ggplot() +
  geom_point(data = chredlin, mapping = aes(x = fire, y = involact)) +
  geom_point(data = outliers, mapping = aes(x = fire, y = involact), col = "blue", pch = 17, size=3) +
  geom_point(data = leverage_pts, mapping = aes(x = fire, y = involact), col = "red", pch = 18, size=3)

Alternatively, we could incorporate the information about which points are outliers and which are leverage points into the data set as a new variable and use it for plotting. The code is quite ugly (notice how we start with booleans, then turn them into numericals, and then re-type them into factors), but it does the job.

outliers <- I(1:nrow(chredlin) %in% c(3,6,35))
leverage_pts <- I(1:nrow(chredlin) %in% c(7,24))
chredlin <- chredlin %>%
  mutate(out_lev =  as.factor(outliers + 2*leverage_pts))
levels(chredlin$out_lev) <- c("normal obs", "outlier", "leverage point")

ggplot(data = chredlin) +
  geom_point(mapping = aes(x = fire, y = involact, shape = out_lev, color = out_lev, size = out_lev)) +
  scale_size_manual(values = c(2,4,4))

Another alternative would be to use fortify() to add residuals and Cook’s distances from a linear model fit directly to the data set. Then we could use mutate() more naturally to decide which observations are outliers and leverage points.

3.3 Boxplots

Similarly to histograms, boxplots can give us some idea about distributions. Unlike histograms, boxplots do not really capture an underlying density shape, but only visually give several summary statistics and points in the tails.

flights %>%
  filter(carrier %in% c("AA", "B6", "DL", "EV")) %>%
  ggplot( aes(x=carrier, y= arr_delay_log, fill=carrier)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE) +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Boxplot by factor") +
    xlab("Carrier")

Using the fill argument here creates a grouped boxplot, where grouping is given by the fill variable. To reduce the size of the plot, we filter only four carriers.

flights %>%
  filter(carrier %in% c("AA", "B6", "DL", "EV")) %>%
  ggplot( aes(x=carrier, y= arr_delay_log, fill=origin)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE) +
    theme(
      plot.title = element_text(size=11)
    ) +
    ggtitle("Boxplot by two factors") +
    xlab("Carrier")

3.4 Multiple Plots

There are numerous ways how to display multiple plots. The default is to set par(mfrow = c(n_rows, n_cols)) before plotting, an alternative is the lattice package, but these do not work with ggplot. With ggplot, one can utilize ggarrange() from the ggpubr package. However, these will require you to do a lot of copy-pasting since you still need to create plots individually, but sometimes we want to create numerous plots similar in nature. To this point, the ggplot2’s facet_wrap function allows you to split into numerous plots by a value of a factor.

But what if we wish to create one plot for every variable? This is doable with facet_wrap provided we have:

  • all the values (in our data frame) given as a single variable
  • another variable linking the original variables to the values.

This is exactly what we get by using pivot_longer() below. The argument everything() specifies we want to keep all the variables, otherwise we could use the variable names like with select() (just here, multiple variables have to be wrapped in c() due to the function’s syntax) to keep only some or discard some variables. pivot_longer() will give us a long-format data frame with two columns: value and name. Then we plot value and wrap the facets based on name.

data(chredlin, package="faraway")
chredlin %>% mutate(side = as.numeric(side)) %>% 
  pivot_longer(everything()) %>%
  ggplot(aes(value)) + facet_wrap(~ name, scales = "free") + geom_histogram()

The complement of pivot_longer() is pivot_wider(). We may want to use it, e.g., after group_by() and summarize(), which naturally lead to a long table. For example, assume we want to produce a heatmap of the average departure delay of flights on different days (x-axis) and months (y-axis), in order to see average departure delay in a calendar-like fashion. While this is not terribly useful, it is done below.

new_dat <- flights %>% 
  group_by(month, day) %>%
  summarize(count = mean(dep_delay)) %>%
  pivot_wider(names_from = day, values_from = count) %>%
  ungroup() %>% select(-month)

library(lattice)
library(viridisLite)
coul <- plasma(100)
levelplot(t(as.matrix(new_dat)), col.regions = rev(coul), xlab="Day", ylab="Month", main="Departure delay")

While we could create a heatmap using ggplot2’s geom_tile(), I personally find levelplot() of the lattice package more useful for quick plotting. One can do great heatmaps with ggplot2 and geom_tile(), but it requires some work to set up the color schemes and to magnify the colorkey.

While the previous heatmap is not so useful, we will later see that heatmaps are quite handy e.g. when probing performance of a certain method based on two parameters.

3.5 Marginal Scatterplots with Regression Lines and Outliers

If we fit a linear model to the response variable involact with all available variables (except side) as covariates, we notice that there are some outliers and leverage points (looking at the Cook’s distance). Wouldn’t it be nice to add these into the marginal scatterplots directly to see which scatter points belong to the outliers?

Firstly, If we wish to have scatterplots with the response variable involact on the y-axes, we naturally need one more column with involact values in the long format. This is actually done automatically by dropping involact from pivoting.

chredlin %>% mutate(side = as.numeric(side), income=log(income)) %>% 
  pivot_longer(-involact) %>%
  ggplot(aes(y = involact, x = value)) + facet_wrap(~ name, scales = "free") +
  geom_point() + stat_smooth(method=lm)

Secondly, let’s denote the outliers. We will use the second extra-variable approach (as opposed to the over-plotting approach). We create an additional variable which tells us, which of the observations are outliers.

outliers <- I(1:nrow(chredlin) %in% c(3,6,35))
leverage_pts <- I(1:nrow(chredlin) %in% c(7,24))
chredlin <- chredlin %>%
  mutate(out_lev =  as.factor(outliers + 2*leverage_pts))
levels(chredlin$out_lev) <- c("normal obs", "outlier", "leverage point")

chredlin %>% mutate(side = as.numeric(side), income=log(income)) %>% 
  pivot_longer(cols=c(-involact, -out_lev)) %>%
  ggplot() + facet_wrap(~ name, scales = "free") +
  geom_point(mapping = aes(y = involact, x = value, color = out_lev)) +
  stat_smooth(mapping = aes(y = involact, x = value), method=lm)

References

Wickham & Grolemund (2016) R for data science

Faraway (2016) Linear Models with R (2nd Edition)