18  Data Wrangling with the tidyverse

18.1 Topics Covered

  1. Basic tidyverse philosophy and syntax
  2. Using the forward-pipe operator
  3. Reading data with readr
  4. Working with dplyr
    • Renaming columns
    • Selecting columns and filtering/querying rows
    • Obtaining global and group-based summarizations
    • Mutating tables and adding new columns
    • Random sampling and shuffling
    • Merging by row or column
    • Comparing tables
    • Table joins
  5. Manipulating strings with stringr
  6. Manipulating factors with forcats

18.2 Introduction

18.2.1 About the tidyverse

Now that we have discussed base R syntax, we will turn our attention to the tidyverse. The tidyverse is a set of R packages that share a common syntax and can be used for data preparation, cleaning, querying, summarization, and wrangling. These packages simplify data science and data manipulation using R, and they are often conceptualized as a more modern way to work with data in the R environment. We feel strongly that the tidyverse is a great addition to R. The common syntax and philosophy make it easy to use these packages collectively. The following are the core packages that constitute the tidyverse:

  • tibble: implement the tibble data model, an extension of the base R data frame
  • readr: read tabular data
  • dplyr: data cleaning, querying, summarization, wrangling, and manipulation
  • tidyr: prepare, clean, and tidy data
  • ggplot2: graphing and data visualization
  • stringr: work with character strings
  • forcats: work with factors
  • lubridate: work with date/time data
  • purrr: work with functions and vectors; implement functional programming

There are other packages that are associated with the tidyverse; for example, glue is used to combine/concatenate data and strings. It serves a similar purpose as f-strings in Python or template literals in JavaScript.

Here, we primarily explore dplyr, stringr, and forcats since our focus is specifically data query and manipulation. When the tidyverse is installed (e.g., install.packages("tidyverse")), all of the core packages are installed. So, there is no need to install them separately. Similarly, when the library is loaded (library(tidyverse)), as in the code block below, all core packages are loaded, and they do not need to be loaded one-by-one.

Throughout this chapter, we use the gt() function from the gt package. This is not necessary. We primarily use it here to improve the rendering of tables when generating webpages. The gt package for table generation is discussed in Chapter 22.

18.2.2 Forward Pipe Operators

Before beginning to explore our examples, we will provide a brief explanation of the forward-pipe operator. The original forward-pipe operator, written as %>%, is provided by the magrittr package. However, R version 4.1.0 introduced a native forward-pipe operator, written as |>. We use the base R version in this text.

The purpose of this operator is very simple: it passes the object on its left-hand side as the first argument of the function on its right-hand side, so the output of one function flows into the next. For example, v1 |> mean(na.rm=TRUE) is the same as mean(v1, na.rm=TRUE). The key feature is that this allows you to chain a series of operations together as opposed to running them one-by-one or nesting function calls. This, in our opinion, results in much more readable code. We demonstrate the use of this operator throughout this chapter and, more generally, throughout the entire text.
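As a quick illustration, consider the following minimal sketch, where v1 is a small numeric vector created for this example:

v1 <- c(2, 4, 6, NA)

# Nested base R calls must be read inside-out
round(sqrt(mean(v1, na.rm=TRUE)), 1)
[1] 2

# The equivalent pipe chain reads left-to-right
v1 |> 
  mean(na.rm=TRUE) |> 
  sqrt() |> 
  round(1)
[1] 2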

18.2.3 Read in Data with readr

In this chapter, we use some county-level summary data for the contiguous United States (us_county_data.csv) that was created by the authors. These data have been provided as a CSV file. We have also provided a shapefile version that includes the polygon geometry; however, we just use the CSV version in this chapter. We have also provided a PDF file explaining the provided columns and the sources of these data: us_county_data_description.pdf.

We begin by loading in the data using read_csv() from readr as opposed to the base R read.csv() function. We are also using the str_glue() function from stringr to concatenate the folder path and file name. A version of this function is also available in the glue package. Within str_glue(), variables are placed inside of {} while regular character strings are not. The entire string is placed inside of quotes. You may need to update the folder path to reflect the path on your local machine.

The file is read in as a tibble, as defined by the tibble package, as opposed to a base R data frame. Since character columns are not converted to factors by default, we use mutate_if() to convert all character columns to factors. We discuss data mutation with dplyr later in this chapter. Lastly, we use the tibble glimpse() function to preview the data table. This serves a similar purpose as the base R str() function. The tibble contains 3,104 records, or counties, and 37 columns, each representing different information about the counties.

Note the use of the forward-pipe operator to pass the read table to the mutate_if() function.

fldrPth <- "gslrData/chpt18/data/"

cntyD <- read_csv(str_glue("{fldrPth}us_county_data.csv")) |> 
  mutate_if(is.character, as.factor)

glimpse(cntyD)
Rows: 3,104
Columns: 37
$ NAME            <fct> Autauga County, Baldwin County, Barbour County, Bibb C…
$ STATE_NAME      <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Alabama, …
$ STATE_ABBR      <fct> AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL…
$ STATE_FIPS      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ COUNTY_FIP      <dbl> 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29,…
$ FIPS            <dbl> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, 1017, …
$ POPULATION      <dbl> 58805, 231767, 25223, 22293, 59134, 10357, 19051, 1164…
$ POP_SQMI        <dbl> 97.3, 141.9, 27.9, 35.6, 90.9, 16.6, 24.5, 190.2, 57.7…
$ SQMI            <dbl> 604.37, 1633.14, 904.52, 626.17, 650.63, 625.14, 777.8…
$ SUB_REGION      <fct> E S Cen, E S Cen, E S Cen, E S Cen, E S Cen, E S Cen, …
$ med_income      <dbl> 57982, 61756, 34990, 51721, 48922, 33866, 44850, 50128…
$ households      <dbl> 21559, 84047, 9322, 7259, 21205, 3429, 6649, 44572, 13…
$ per_desk_lap    <dbl> 74.13609, 77.73865, 52.45655, 56.17854, 66.86631, 55.2…
$ per_smartphone  <dbl> 81.00561, 84.63598, 68.69770, 73.61896, 76.43952, 69.2…
$ per_no_comp     <dbl> 8.571826, 8.239437, 20.220983, 16.779171, 14.963452, 2…
$ per_internet    <dbl> 82.79605, 85.52358, 64.99678, 76.16752, 80.03301, 62.7…
$ per_broadband   <dbl> 82.70792, 85.06907, 64.63205, 76.12619, 79.62273, 60.6…
$ per_no_internet <dbl> 15.33930, 11.91952, 28.96374, 20.93952, 18.48621, 30.3…
$ dem             <dbl> 105.61176, 31.40693, 115.25758, 125.28409, 246.04255, …
$ precip          <dbl> 1412.167, 1712.929, 1400.435, 1422.227, 1492.224, 1397…
$ tempmn          <dbl> 17.95529, 19.31788, 18.39159, 17.46602, 16.07128, 18.0…
$ tempmin         <dbl> 11.758941, 13.763203, 11.941061, 11.060341, 10.058404,…
$ tempmax         <dbl> 24.15424, 24.87489, 24.84402, 23.87443, 22.08702, 24.4…
$ dptmn           <dbl> 12.50141, 14.37255, 12.73129, 12.08955, 11.05468, 12.4…
$ vpdmin          <dbl> 0.7703529, 0.6897403, 0.8700758, 0.7509091, 0.5855319,…
$ vpdmax          <dbl> 16.62553, 15.45636, 18.27205, 16.81307, 14.45106, 17.4…
$ sol             <dbl> 15.63682, 16.28429, 15.98689, 15.46614, 15.07543, 15.9…
$ per_for         <dbl> 49.71597, 33.13072, 56.58767, 70.28501, 54.95027, 54.6…
$ per_dev         <dbl> 6.815302, 9.449728, 4.227699, 5.258896, 8.577787, 2.98…
$ per_wet         <dbl> 10.8697204, 30.6357571, 8.8715967, 6.5578020, 0.634172…
$ per_crop        <dbl> 3.14176012, 9.19548790, 4.53634268, 0.06282062, 1.1841…
$ per_past_grass  <dbl> 20.795712, 10.806887, 12.823742, 10.078160, 30.432964,…
$ strm_length     <dbl> 49.599790, 74.558900, 57.307449, 45.351825, 4.118788, …
$ strm_den        <dbl> 0.031686857, 0.017626996, 0.024462179, 0.027964351, 0.…
$ per_karst       <dbl> 0.00000000, 0.00000000, 14.91452991, 11.05620754, 18.1…
$ rail_den        <dbl> 0.09007046, 0.02088778, 0.04713053, 0.09748651, 0.0374…
$ road_den        <dbl> 1.890543, 1.784091, 1.090589, 1.792521, 1.935772, 1.24…

18.3 Data Wrangling with dplyr

In this section, we focus on basic data wrangling using dplyr. We specifically discuss:

  1. Renaming columns
  2. Selecting a subset of columns
  3. Filtering and querying rows
  4. Slicing
  5. Global and group-based summarization
  6. Mutating tables
  7. Random sampling
  8. Binding data
  9. Comparing tables
  10. Table joins

18.3.1 Rename Columns

The rename() dplyr function is used to rename columns. In our example, the “dem” column, which represents the mean county elevation in meters, is being renamed to “ELEV” while the “per_wet” column, which represents the percentage of the county’s land area that is wetland, is renamed to “pWetland”. It is not necessary to place quotes around the new or old column names unless there are spaces in the names. The syntax reads as NEW NAME = OLD NAME.

cntyD2 <- cntyD |>
  rename(ELEV = dem,
         pWetland = per_wet)

18.3.2 Select Columns

A subset of columns can be extracted to a new tibble, in this case cntyD2, using the dplyr select() function, which accepts the names of the columns that we want to maintain. In the first example, we are maintaining only four columns. It is also possible to drop columns by placing - in front of the column name. In the second example, we are removing two columns. It is also possible to use helper functions to select columns (see the additional sketch after the code blocks below). In the last example, we use the where() and is.factor() functions to select only columns that have the factor data type. Again, there are many is.*() functions that operate as logical tests in R. The statement within where() must evaluate to TRUE or FALSE.

cntyD2 <- cntyD |> 
  select(NAME, STATE_NAME, POPULATION, SUB_REGION)

cntyD2 <- cntyD |> 
  select(-FIPS, -STATE_ABBR)

cntyD2 <- cntyD |>
  select(where(is.factor))
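Helper functions from the tidyselect package, such as starts_with(), ends_with(), and contains(), can also be used within select(). As a brief sketch, all of the percentage columns in this dataset share the “per_” prefix and could be selected as follows:

cntyD2 <- cntyD |> 
  select(NAME, STATE_ABBR, starts_with("per_"))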

18.3.3 Filter and Query Rows

Selecting a subset of rows is generally termed querying or filtering the data. In other words, only rows or records that meet certain criteria are returned to the new object. With dplyr, this is accomplished with the filter() function. In our first example, we are filtering the data to create a subset of counties that have a mean elevation greater than 1,000 meters (dem > 1000). These counties are written to a new tibble called cntyD1000. We then calculate the percentage of counties that have a mean elevation greater than 1,000 meters by dividing the number of rows in the new tibble by the number of rows in the original tibble using the base R nrow() function. We also use round() to round off the percentage to 1 decimal place. 11.1% of the counties have a mean elevation greater than 1,000 meters.

select() is used to subset out columns while filter() is used to subset or query rows.

cntyD1000 <- cntyD |> 
  filter(dem > 1000) 
round((nrow(cntyD1000)/nrow(cntyD))*100, 1)
[1] 11.1

In the next example, we implement a compound query using & to filter out all counties that have an elevation greater than 1,000 meters (dem > 1000) AND that have a percent forest cover greater than 75% (per_for > 75). Only 0.4% of the counties meet these two criteria.

cntyD2 <- cntyD |> 
  filter(dem > 1000 & per_for > 75) 
round((nrow(cntyD2)/nrow(cntyD))*100,1)
[1] 0.4

Using the forward-pipe operator, it is possible to perform a query using filter() and select the columns to include in the final output using select() as opposed to including all of the original columns. If you use select() prior to using filter(), only the selected columns will be available for use in the query statement. In our example, the order would not matter since the columns used in the query are included in the selection. So, we could have used select() before filter().

cntyD2 <- cntyD |> 
  filter(dem > 1000 & per_for > 75) |> 
  select(NAME, STATE_NAME, dem, per_for)
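As a quick sketch, here is the reversed order, which yields the same result in this case because the columns used in the query are retained by select():

cntyD2 <- cntyD |> 
  select(NAME, STATE_NAME, dem, per_for) |> 
  filter(dem > 1000 & per_for > 75)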

The next example repeats the prior example but using OR (|) as opposed to AND (&). This results in 16.3% of the counties being returned to the new object. In other words, 16.3% of the counties had a mean elevation greater than 1,000 m, OR a percent forest cover greater than 75%, OR both a mean elevation greater than 1,000 m AND a forest cover greater than 75%. The available operators to combine multiple queries are as follows:

  • &: A AND B (meets both criteria)
  • |: A OR B (meets at least one of the criteria)
  • !: NOT; negates a criterion (e.g., A & !B returns rows that meet the first criterion but not the second)
  • xor(): A OR B but NOT both A AND B (meets exactly one of the criteria)

cntyD2 <- cntyD |> 
  filter(dem > 1000 | per_for > 75) 
round((nrow(cntyD2)/nrow(cntyD))*100, 1)
[1] 16.3
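The NOT (!) and xor() operators are not demonstrated elsewhere in this chapter, so the following short sketch applies them to the same two criteria:

# High mean elevation but NOT heavily forested
cntyDNot <- cntyD |> 
  filter(dem > 1000 & !(per_for > 75))

# Meets exactly one of the two criteria, but not both
cntyDXor <- cntyD |> 
  filter(xor(dem > 1000, per_for > 75))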

The %in% operator can be used to test whether a character string, number, date, or logical occurs within a vector of options. In the example below, we are extracting all Lincoln, Washington, and Jefferson counties by testing whether the county name from the “NAME” field occurs in a vector of these three county names. This is much more concise than using a series of OR (|) statements.

cntyNmSub <- c("Lincoln County", 
               "Washington County", 
               "Jefferson County")
cntyD |> 
  filter(NAME %in% cntyNmSub) |> 
  count() |> gt()
n
78

In this example, we are creating a more complex query that requires the use of parentheses to clarify the order of operations or logic. Specifically, counties within the state of Utah OR Colorado are selected if they have a mean elevation greater than 1,000 meters OR a percent forest cover greater than 75%. 93 counties meet these criteria.

cntyD2 <- cntyD |> 
  filter((dem > 1000 | per_for > 75) & 
           (STATE_ABBR %in% c("UT", "CO")))
  
cntyD2 |> count() |> gt()
n
93

18.3.4 Slicing

Slicing can be used to select a given number of features or rows with the largest or smallest values relative to an attribute. In this first example, we obtain the county with the highest mean elevation using slice_max(). The county with the highest mean elevation is San Juan County, Colorado, with a mean elevation of 3,478 meters. Similar to slice_max(), top_n() can be used to select the n records with the largest values for a specific attribute.

cntyD |> 
  slice_max(dem, n=1) |> 
  select(NAME, STATE_ABBR, dem) |> gt()
NAME STATE_ABBR dem
San Juan County CO 3478.088
cntyD |> 
  top_n(dem, n=5) |> 
  select(NAME, STATE_ABBR, dem) |> gt()
NAME STATE_ABBR dem
Hinsdale County CO 3351.181
Lake County CO 3328.862
Mineral County CO 3197.444
San Juan County CO 3478.088
Summit County CO 3208.059
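Note that top_n() has since been superseded in dplyr; slice_max() accepts larger values of n as well and additionally sorts the result in descending order. A sketch that returns the same five counties, highest elevation first:

cntyD |> 
  slice_max(dem, n=5) |> 
  select(NAME, STATE_ABBR, dem) |> gt()
NAME STATE_ABBR dem
San Juan County CO 3478.088
Hinsdale County CO 3351.181
Lake County CO 3328.862
Summit County CO 3208.059
Mineral County CO 3197.444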

To obtain records with the smallest values, you can use slice_min(). In the example below, we first obtain the county with the lowest population (Loving County, Texas, with a population of 64) followed by the five counties with the lowest populations. arrange() is used to sort the counties in ascending order based on population, while arrange() combined with desc() sorts the data in descending order.

cntyD |> 
  slice_min(POPULATION, n=1) |> 
  select(NAME, STATE_ABBR, POPULATION) |> gt()
NAME STATE_ABBR POPULATION
Loving County TX 64
cntyD |> 
  slice_min(POPULATION, n=5) |> 
  select(NAME, STATE_ABBR, POPULATION) |>
  arrange(POPULATION) |> gt()
NAME STATE_ABBR POPULATION
Loving County TX 64
King County TX 265
Kenedy County TX 350
McPherson County NE 399
Blaine County NE 431
cntyD |> 
  slice_min(POPULATION, n=5) |> 
  select(NAME, STATE_ABBR, POPULATION) |>
  arrange(desc(POPULATION)) |> gt()
NAME STATE_ABBR POPULATION
Blaine County NE 431
McPherson County NE 399
Kenedy County TX 350
King County TX 265
Loving County TX 64

18.3.5 Data Summarization with dplyr

18.3.5.1 Global Summarization

A variety of summary statistics can be calculated using the dplyr summarize() function combined with base R statistical functions, such as mean(), median(), min(), max(), sd(), and sum().

If there are missing values, this can impact some of the summary calculations. To avoid this issue, you can use the na.rm=TRUE argument, which indicates that missing data should be ignored in the calculations.
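For example, mean() returns NA if any missing values are present unless na.rm=TRUE is specified:

mean(c(1, 2, NA))
[1] NA
mean(c(1, 2, NA), na.rm=TRUE)
[1] 1.5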

Below, we demonstrate how to calculate the mean population using the mean() base R function and the dplyr summarize() function. This results in a single mean that aggregates the data. It is not necessary to use na.rm=TRUE here since there are no missing population values in the dataset.

The tidyverse uses both American and British spellings. So, summarization can be performed using the equivalent summarize() or summarise() functions.

cntyD |> 
  summarize(pop = mean(POPULATION)) |> gt()
pop
106055.8

It is also possible to calculate multiple summary metrics and/or summary metrics for multiple columns within a single call to summarize() as demonstrated below. We also use round() to round off the generated metrics. Note that the syntax for each calculation is NEW COLUMN NAME = STAT FUNCTION(COLUMN NAME, ADDITIONAL ARGUMENTS AS NEEDED).

cntySummary <- cntyD |> 
  summarize(pop = round(mean(POPULATION), 0),
            dem = round(mean(dem), 1),
            perCrop = round(mean(per_crop), 1),
            perFor = round(mean(per_for), 1))
cntySummary |> gt()
pop dem perCrop perFor
106056 440.4 22.3 30.2

It is also possible to filter or query the data prior to calculating summary metrics so that the summary is obtained for only a subset of the data. In this case, the filtering must be applied before the summarization.

cntySummary <- cntyD |> 
  filter(STATE_ABBR %in% c("NE", "KS", "OK", "MS")) |>
  summarize(pop = round(mean(POPULATION), 0),
            dem = round(mean(dem), 1),
            perCrop = round(mean(per_crop), 1),
            perFor = round(mean(per_for), 1))
cntySummary |> gt()
pop dem perCrop perFor
33109 440.9 32.1 16.9

18.3.5.2 Summarize by Group

Instead of obtaining a single, global summary statistic, you may want to obtain summaries for each group as defined by a grouping variable. The grouping variable is generally a factor column. In our first example, we are obtaining a count of counties in each sub-region by first grouping the data using group_by() and then using count() to obtain the count of records by group. Once group_by() is used, all following operations occur at the group level. The grouping information will be maintained in any resulting tibbles. This grouping information can be removed using ungroup().

cntyD |> 
  group_by(SUB_REGION) |>
  count() |>
  ungroup() |> gt()
SUB_REGION n
E N Cen 437
E S Cen 364
Mid Atl 150
Mtn 281
N Eng 67
Pacific 133
S Atl 584
W N Cen 618
W S Cen 470

In this second example, we obtain total population counts by sub-region using group_by() and summarize(). We then print the data in descending order relative to total population. The South Atlantic sub-region has the highest total population while New England has the lowest. The second code block below demonstrates obtaining multiple grouped summary metrics.

cntyD |> 
  group_by(SUB_REGION) |>
  summarize(totalPop = sum(POPULATION)) |>
  arrange(desc(totalPop))|>
  ungroup() |> gt()
SUB_REGION totalPop
S Atl 66026391
Pacific 51480760
E N Cen 47368533
Mid Atl 42492943
W S Cen 40774139
Mtn 24919150
W N Cen 21616921
E S Cen 19402234
N Eng 15116205
cntyD |> 
  group_by(SUB_REGION) |>
  summarize(totalPop = round(sum(POPULATION), 0),
            avgPop = round(mean(POPULATION), 0),
            totalArea = round(sum(SQMI), 1))|>
  ungroup() |> gt()
SUB_REGION totalPop avgPop totalArea
E N Cen 47368533 108395 247927.6
E S Cen 19402234 53303 181826.6
Mid Atl 42492943 283286 101316.4
Mtn 24919150 88680 863563.8
N Eng 15116205 225615 65456.0
Pacific 51480760 387073 322419.6
S Atl 66026391 113059 271336.7
W N Cen 21616921 34979 517800.9
W S Cen 40774139 86753 433058.9

18.3.6 Mutate

The mutate() function from dplyr allows you to add new columns to a tibble or manipulate existing columns. In contrast, transmute(), which is not demonstrated here, performs a calculation or generates a new column using existing columns then removes the columns on which the calculation was based. In our first example, we calculate the county area in square kilometers (km2) using the area in square miles (mi2), which is already included in the table, and the required conversion factor. We also calculate the total percentage of the county mapped to low vegetation land cover by adding the crop and pasture/grass percentages. Since we used mutate() here instead of transmute(), the original columns are maintained.

cntyD <- cntyD |>
  mutate(SQKM = SQMI*2.58999)

cntyD <- cntyD |>
  mutate(per_low_veg = per_crop + per_past_grass)

In this example, we are calculating the forest area by converting the percent forest estimate to a proportion and then multiplying by the county land area. Once this is accomplished, the data are grouped by sub-region using group_by(), and the total forest area per sub-region is calculated. The result is printed in descending order based on total forest area.

cntyD |> 
  mutate(forArea = (per_for/100)*SQMI) |>
  group_by(SUB_REGION) |>
  summarize(totalForest = sum(forArea)) |> 
  arrange(desc(totalForest)) |>
  ungroup() |> gt()
SUB_REGION totalForest
Mtn 157567.00
S Atl 113851.75
Pacific 97132.54
E S Cen 89329.46
W S Cen 76039.24
E N Cen 67518.49
Mid Atl 55930.47
W N Cen 54214.86
N Eng 43165.03

Mutation can be generalized using mutate_if(), where mutation is performed for all columns that meet a query. This is how we converted all character columns to a factor data type when reading in the data at the beginning of the chapter, which is replicated below.

Another generalization of mutate() is mutate_at(). This allows you to list a set of columns for which you want to perform a mutation. Here, we convert all land cover percentages to proportions, and the original columns are overwritten.

cntyD <- read_csv(str_glue("{fldrPth}us_county_data.csv")) |> 
  mutate_if(is.character, as.factor)

cntyD2 <- cntyD  |> 
  mutate_at(vars(per_for, per_dev, per_wet, per_crop, per_past_grass), ~  (./100))
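As an aside, newer versions of dplyr generalize mutate_if() and mutate_at() through the across() function, which is used inside of a regular mutate() call. A minimal sketch replicating both mutations above:

cntyD2 <- cntyD |> 
  mutate(across(where(is.character), as.factor)) |> 
  mutate(across(c(per_for, per_dev, per_wet, per_crop, per_past_grass), ~ .x/100))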

18.3.7 Random Sampling

A random sample can be drawn from a dataset using one of the following functions:

  • sample_n(): select a given number of random samples
  • sample_frac(): randomly select a given proportion of the rows

If replace=FALSE is used, random sampling without replacement is implemented where a single record or row cannot be sampled more than once. In contrast, random sampling with replacement is implemented when replace=TRUE, which allows for a single record or row to be potentially sampled multiple times. set.seed() can be used to make the random selection deterministic and reproducible.

set.seed(42)
cntyDSamp1 <- cntyD |>
  sample_n(size=100, replace=FALSE)
set.seed(42)
cntyDSamp2 <- cntyD |>
  sample_frac(size=.2, replace=FALSE)

To implement stratified random sampling, with or without replacement, you can combine group_by() and sample_n(). The variable used for the stratification must define the groupings. In the example below, ten random counties are drawn without replacement from each sub-region.

set.seed(42)
cntyDSRS <- cntyD |>
  group_by(SUB_REGION) |>
  sample_n(size=10, replace=FALSE) |>
  ungroup()

We sometimes use sample_frac() to shuffle data. This may be desired to reduce autocorrelation in the data. Using size=1 and replace=FALSE, all rows will be returned but in a new order.

set.seed(42)
cntyDSamp3 <- cntyD |>
  sample_frac(size=1, replace=FALSE)
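Note that sample_n() and sample_frac() have been superseded in newer versions of dplyr by slice_sample(). A minimal sketch replicating the examples above (the object names here are ours):

set.seed(42)
# 100 random rows without replacement
cntySampN <- cntyD |> 
  slice_sample(n=100)

# 20% of the rows without replacement
cntySampProp <- cntyD |> 
  slice_sample(prop=.2)

# All rows returned in a shuffled order
cntyShuffle <- cntyD |> 
  slice_sample(prop=1)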

18.3.8 Binding Data

Multiple tibbles or data frames can be merged to a single object via binding. bind_cols() binds along columns. For example, a table with 100 records and 15 columns could be combined with a table with 100 records and 20 different columns to create a table with 100 records/rows and 35 columns. Note that bind_cols() assumes that the records are in the same order. In other words, binding is based on row indices only. If the rows may not be in the same order, then a table join is more appropriate. Table joins are discussed below. If the same column names are used in both data sets, new names will be generated to differentiate them.

This example demonstrates the use of bind_cols(). We first create subsets of the columns using select(). The two subsets are then merged using bind_cols(). Again, it is assumed that the records are in the same order in the two tibbles. This is true in this case since select() does not alter the order of the rows.

colSub1 <- cntyD |>
  select(NAME, STATE_NAME, STATE_ABBR, FIPS)
colSub2 <- cntyD |>
  select(per_for, per_dev, per_wet, per_crop, per_past_grass)

colSubMerge <- bind_cols(colSub1, colSub2)

bind_rows() binds along rows. For example, a table with 50 records could be combined with another table of 50 records that has the same columns or attributes to generate a table with 100 records.

In this example, we randomly select 100 rows to include in a new tibble called cntyDSamp1. We then randomly select 20% of the rows to include in another tibble called cntyDSamp2. We next merge these subsets together using bind_rows(). Since some of the samples may have been included in both datasets, we use distinct() to remove duplicate rows.

cntyDSamp1 <- cntyD |>
  sample_n(size=100, replace=FALSE)

cntyDSamp2 <- cntyD |>
  sample_frac(size=.2, replace=FALSE)

cntyDSamp4 <- bind_rows(cntyDSamp1, cntyDSamp2) |> 
  distinct()

18.3.9 Intersect, Union, and SetDiff

It is possible to compare two tibbles to return a subset of the records. Specifically:

  • intersect(): maintains only rows that occur in both tables
  • union(): maintains rows that occur in Table 1 or Table 2 but removes duplicates
  • setdiff(): returns rows that occur in Table 1 but not Table 2

Below, we demonstrate the use of setdiff(). We first randomly select 30 counties from each sub-region. We next find the counties that were not selected by comparing the subset to the full dataset using setdiff().

selectedDF <- cntyD |>
  group_by(SUB_REGION) |>
  sample_n(30, replace=FALSE) |>
  ungroup()

notSelectedDF <- setdiff(cntyD, selectedDF)
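intersect() and union() are applied in the same manner. As a short sketch using the two random subsets created in the binding example above, intersect() would return the counties drawn in both samples, while union() would return all sampled counties with duplicates removed (equivalent to the bind_rows() plus distinct() result, ignoring row order):

cntyOverlap <- intersect(cntyDSamp1, cntyDSamp2)
cntyCombined <- union(cntyDSamp1, cntyDSamp2)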

18.3.10 Table Joins

Table joins allow for merging data based on a common or shared attribute as opposed to the order of the records, as is the case with bind_cols(). There are several types of joins implemented within dplyr:

  • left_join(): maintains all records from Table 1 even if they have no match in Table 2
  • right_join(): maintains all records from Table 2 even if they have no match in Table 1
  • full_join(): maintains all records from both tables even if they have no match in the other table
  • inner_join(): maintains only matching records between both tables

To demonstrate a table join, we first create subsets of the columns stored in two separate tibbles. Note that both tibbles contain the “FIPS” column, which will be used as a unique ID. We next perform an inner join where rows are matched based on the shared or common attribute (by="FIPS"). Again, rows are matched based on a shared attribute as opposed to their order in the table; as a result, it is not required that the rows be in the same order in both tables. Since we are using an inner join, only rows that occur in both tables are maintained. Since all counties were included in this case, all four types of joins would yield the same result. In the second example, we create a random subset of rows to include in one of the tibbles. Since this subset is the first table in the join and a left join is used, only the matching 100 records are returned to the new object.

colSub1 <- cntyD |>
  select(NAME, 
         STATE_NAME,
         STATE_ABBR,
         FIPS)

colSub2 <- cntyD |>
  select(FIPS, 
         per_for, 
         per_dev, 
         per_wet, 
         per_crop, 
         per_past_grass)

colSubMerge <- inner_join(colSub1, 
                          colSub2, 
                          by="FIPS")
set.seed(42)
colSub1 <- cntyD |>
  select(NAME, 
         STATE_NAME, 
         STATE_ABBR, 
         FIPS) |>
  sample_n(size=100, 
           replace=FALSE)

colSub2 <- cntyD |>
  select(FIPS, 
         per_for, 
         per_dev, 
         per_wet, 
         per_crop, 
         per_past_grass)

colSubMerge <- left_join(colSub1, 
                         colSub2, 
                         by="FIPS")

18.4 Manipulating Strings with stringr

We have primarily focused on data preparation, cleaning, manipulation, summarization, wrangling, and querying with dplyr in this chapter. In the last two sections of this chapter, we explore character string manipulation with stringr followed by factor manipulation with forcats.

The stringr package provides functions for working with and manipulating character strings. In the first example, we are creating a new column named “NAME2” and populating it with a lower case version of the county names from the “NAME” column. Note that stringr functions are prefixed with “str_”.

cntyD2 <- cntyD

cntyD2$NAME2 <- cntyD2$NAME |> 
  str_to_lower()

In this example, we are removing the word “County” from the county names using str_replace() where the first argument is the character string to be replaced while the second argument is the character string to replace it with: in this case, an empty string. Note that we include the space before “County” so that the space is also removed. The result is county names without “ County” included.

cntyD2$NAME <- cntyD2$NAME |> 
  str_replace(" County", "")

In the example below, we use str_length() to calculate the length of the county name. We then select out the county with the longest name, which is St. John the Baptist Parish. The count includes spaces.

cntyD2 |>
  mutate(nameLength = str_length(NAME)) |>
  select(NAME, nameLength) |>
  slice_max(nameLength, n=1) |> gt()
NAME nameLength
St. John the Baptist Parish 27

When there are potentially multiple occurrences of a character string to be replaced, you can use str_replace_all() as opposed to str_replace(). Here, we use this function to replace all spaces in the sub-region names. We convert the result to a factor data type and print the levels to check the result.

cntyD2$SUB_REGION <- cntyD2$SUB_REGION |>
  str_replace_all(" ", "") |> 
  as.factor()
levels(cntyD2$SUB_REGION)
[1] "ENCen"   "ESCen"   "MidAtl"  "Mtn"     "NEng"    "Pacific" "SAtl"   
[8] "WNCen"   "WSCen"  

What if we wanted to find the longest county name excluding spaces? This could be accomplished by combining str_length() and str_replace_all() as demonstrated below. The same county is still selected but with a length of 23 characters as opposed to 27.

cntyD2 |>
  mutate(nameLength = str_length(str_replace_all(NAME, " ", ""))) |>
  select(NAME, nameLength) |>
  slice_max(nameLength, n=1) |> gt()
NAME nameLength
St. John the Baptist Parish 23

str_detect() is used to detect a provided sequence of characters within a string. In our example, we are searching for the string “Lincoln”. This results in a logical output, so counties that have “Lincoln” in their name yield TRUE while all others yield FALSE. The logical data are written to a new column named “hasLincoln”. We then filter out and count the number of records that contain “Lincoln”. 24 counties have the character string “Lincoln” included in their name.

cntyD2 |> 
  mutate(hasLincoln = str_detect(NAME, "Lincoln")) |> 
  filter(hasLincoln == TRUE) |> 
  count() |> gt()
n
24
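As an aside, the intermediate logical column is not strictly necessary; the same count can be obtained by placing the test directly inside of filter():

cntyD2 |> 
  filter(str_detect(NAME, "Lincoln")) |> 
  count() |> gt()
n
24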

As discussed above, str_glue() is used to concatenate character strings. We demonstrate this again below using file paths. str_glue_data() allows for performing concatenations for each record or row in a table. In our example, we are generating a new column, “NAME3”, with the following format: “NAME County, STATE_NAME”. {} indicates data arguments or column names.

fldPath1 <- "C:/Data/tidyverse/"
fName <- "dataset1.csv"
str_glue("{fldPath1}{fName}")
C:/Data/tidyverse/dataset1.csv
cntyD2$NAME3 <- cntyD2 |> str_glue_data("{NAME} County, {STATE_NAME}")

18.5 Manipulating Factors with forcats

The forcats package is used to manipulate factors and associated factor levels. Remember that the base R levels() function returns the factor levels.

cntyD2 <- cntyD

cntyD2$SUB_REGION |> levels()
[1] "E N Cen" "E S Cen" "Mid Atl" "Mtn"     "N Eng"   "Pacific" "S Atl"  
[8] "W N Cen" "W S Cen"

All forcats functions are prefixed with “fct_”. For example, the fct_count() function counts the number of rows or records associated with each factor level.

cntyD2$SUB_REGION |> fct_count() |> gt()
f n
E N Cen 437
E S Cen 364
Mid Atl 150
Mtn 281
N Eng 67
Pacific 133
S Atl 584
W N Cen 618
W S Cen 470

Remember that all factor levels are maintained even if the data are filtered such that some levels no longer have any associated records in the subset. This is demonstrated below.

cntyD3 <- cntyD2 |>
  filter(SUB_REGION %in% c("N Eng", "S Atl", "Mid Atl"))
cntyD3$SUB_REGION |> levels()
[1] "E N Cen" "E S Cen" "Mid Atl" "Mtn"     "N Eng"   "Pacific" "S Atl"  
[8] "W N Cen" "W S Cen"

Similar to the base R droplevels() function, fct_drop() is used to drop unused factor levels.

cntyD3 <- cntyD2 |>
  filter(SUB_REGION %in% c("N Eng", "S Atl", "Mid Atl"))

cntyD3$SUB_REGION |> levels()
[1] "E N Cen" "E S Cen" "Mid Atl" "Mtn"     "N Eng"   "Pacific" "S Atl"  
[8] "W N Cen" "W S Cen"
cntyD3$SUB_REGION <- cntyD3$SUB_REGION |> fct_drop()

cntyD3$SUB_REGION |> levels()
[1] "Mid Atl" "N Eng"   "S Atl"  

Factor levels can be renamed using fct_recode(). The structure is NEW NAME = OLD NAME. Quotes are only necessary if there are spaces in the names. Below, we have renamed the factor levels to use the full sub-region names as opposed to abbreviations.

cntyD3$SUB_REGION <- cntyD3$SUB_REGION |>
  fct_recode("South Atlantic" = "S Atl", 
               "Mid-Atlantic" = "Mid Atl", 
               "New England" = "N Eng")
cntyD3$SUB_REGION |> levels()
[1] "Mid-Atlantic"   "New England"    "South Atlantic"

Similarly, fct_collapse() can be used to merge factor levels into a smaller set with or without renaming. Below, we have merged the factor levels into the categories: “Central”, “East”, and “West”.

cntyD2$SUB_REGION <- cntyD2$SUB_REGION |>
  fct_collapse("Central" = c("E N Cen", "E S Cen", "W N Cen", "W S Cen"),
               "East" = c("Mid Atl", "N Eng", "S Atl"),
               "West" = c("Mtn", "Pacific"))

cntyD2$SUB_REGION |> 
  levels()
[1] "Central" "East"    "West"   

18.6 Concluding Remarks

We use the tidyverse for data preparation, cleaning, querying, summarization, and wrangling throughout this text. We opted to use the tidyverse and the forward-pipe operator since we feel that the code is often more concise and readable in comparison to base R syntax that accomplishes similar tasks. If you struggle with tidyverse syntax as you engage with this text, you may find it useful to refer to this chapter or the tidyverse documentation. Since the tidyverse has a large user base, there is also a lot of help available online. You may also want to check out the cheatsheets made available by Posit.

18.7 Questions

  1. Explain the difference between filter() and select().
  2. Explain the difference between bind_rows() and bind_cols().
  3. Explain the difference between transmute() and mutate().
  4. Explain the difference between left_join(), inner_join(), and full_join().
  5. Explain the difference between intersect(), union(), and setdiff().
  6. What is the purpose of the na.rm parameter in summarization functions, such as mean()?
  7. What is the purpose of the ungroup() function?
  8. What is the purpose of the fct_collapse() function from forcats?

18.8 Exercises

You have been provided with an urban tree dataset for Portland, Oregon in the exercise folder for the chapter. Use the tidyverse to complete the following tasks and answer the following questions.

  1. Read in portlandTrees.csv
  2. Select out the following columns: “Common_nam”, “Genus_spec”, “Family”, “Genus”, “DBH”, “TreeHeight”, “CrownWidth”, “CrownBaseH”, and “Condition”
  3. Convert all character columns to factors
  4. Rename columns as follows:
    • “Common_nam” = “Common”
    • “Genus_spec” = “Scientific”
    • “Family” = “Family”
    • “Genus” = “Genus”
    • “DBH” = “DBH”
    • “TreeHeight” = “Height”
    • “CrownWidth” = “CrownWidth”
    • “CrownBaseH” = “CrownBaseHeight”
    • “Condition” = “Condition”
  5. How many different families are included in the dataset?
  6. How many different genera are included in the dataset?
  7. Which family has the most unique genera included in the dataset?
  8. How many different species from the Quercus genus are included?
  9. Within the Quercus genus, which species has the largest number of records?
  10. What is the median DBH of all Quercus trees included in the dataset?
  11. Which genus has a larger average crown base height: Quercus or Acer?
  12. How many trees total from the Tilia, Pinus, or Ulmus genera have a “Dead” or “Poor” condition?
  13. How many trees from the Quercus, Pinus, Acer, or Ulmus genera have a height larger than 100, a DBH larger than 15, and a crown width larger than 25?
  14. Generate code to create a random sample without replacement of 15 trees from each of the following genera (60 trees total): Quercus, Pinus, Acer, and Ulmus.
  15. How many trees in the dataset have a common name that includes the word “red” (ignoring case)?
  16. How many trees in the dataset have a common name that includes the word “western” (ignoring case)?
  17. How many trees have a common name longer than 30 characters ignoring spaces?
  18. How many trees have a common name that is shorter than its scientific name, ignoring spaces?
  19. Which species has the largest range of heights in the dataset?
  20. Which species has the largest count of “Poor” condition trees in the dataset?