18 Data Wrangling with the tidyverse
18.1 Topics Covered
- Basic tidyverse philosophy and syntax
- Using the forward-pipe operator
- Reading data with readr
- Working with dplyr
- Renaming columns
- Selecting columns and filtering/querying rows
- Obtaining global and group-based summarizations
- Mutating tables and adding new columns
- Random sampling and shuffling
- Merging by row or column
- Comparing tables
- Table joins
- Manipulating strings with stringr
- Manipulating factors with forcats
18.2 Introduction
18.2.1 About the tidyverse
Now that we have discussed base R syntax, we will turn our attention to the tidyverse. The tidyverse is a set of R packages that share a common syntax and can be used for data preparation, cleaning, querying, summarization, and wrangling. These packages simplify data science and data manipulation using R, and they are often conceptualized as a more modern way to work with data in the R environment. We feel strongly that the tidyverse is a great addition to R; the common syntax and philosophy make it easy to use these packages collectively. The following are the core packages that constitute the tidyverse:
- tibble: expand upon the base R data frame and implement the tibble data model
- readr: read tabulated data
- dplyr: data cleaning, querying, summarization, wrangling, and manipulation
- tidyr: prepare, clean, and tidy data
- ggplot2: graphing and data visualization
- stringr: work with character strings
- forcats: work with factors
- lubridate: work with date/time data
- purrr: work with functions and vectors; implement functional programming
There are other packages that are associated with the tidyverse; for example, glue is used to combine/concatenate data and strings, serving a similar purpose to f-strings in Python or template literals in JavaScript.
Here, we primarily explore dplyr, stringr, and forcats since our focus is specifically data query and manipulation. When the tidyverse is installed (e.g., install.packages("tidyverse")), all of the core packages are installed, so there is no need to install them separately. Similarly, when the library is loaded with library(tidyverse), all core packages are loaded and do not need to be loaded one-by-one.
Throughout this chapter, we use the gt() function from the gt package. This is not necessary; we primarily use it here to improve the rendering of tables when generating webpages. The gt package for table generation is discussed in Chapter 22.
18.2.2 Forward Pipe Operators
Before beginning to explore our examples, we will provide a brief explanation of the forward-pipe operator. The original forward-pipe operator, written as %>%, is provided by the magrittr package. However, R version 4.1.0 introduced a native forward-pipe operator, written as |>. We use the base R version in this text.
The purpose of this operator is very simple: it passes the object on its left-hand side as the first argument of the function on its right-hand side, so the output of one function flows directly into the next. For example, v1 |> mean(na.rm=TRUE) is the same as mean(v1, na.rm=TRUE). The key feature is that this allows you to string a series of operations together as opposed to running them one-by-one, which, in our opinion, results in much more readable code. We demonstrate the use of this operator throughout this chapter and, more generally, throughout the entire text.
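As a minimal illustration (v1 is a hypothetical numeric vector, not part of the dataset used below), the following statements are equivalent, and the piped form extends naturally to longer chains:
v1 <- c(2, 4, NA, 8)
mean(v1, na.rm=TRUE)                          # standard call
v1 |> mean(na.rm=TRUE)                        # piped equivalent
v1 |> sqrt() |> mean(na.rm=TRUE) |> round(2)  # chained operations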
18.2.3 Read in Data with readr
In this chapter, we use some county-level summary data for the contiguous United States (us_county_data.csv) that was created by the authors. These data have been provided as a CSV file. We have also provided a shapefile version that includes the polygon geometry; however, we just use the CSV version in this chapter. We have also provided a PDF file explaining the provided columns and the sources of these data: us_county_data_description.pdf.
We begin by loading the data using read_csv() from readr as opposed to the base R read.csv() function. We also use the str_glue() function from stringr to concatenate the folder path and file name; a version of this function is also available in the glue package. Within str_glue(), variables are placed inside of {} while regular character strings are not, and the entire string is placed inside of quotes. You may need to update the folder path to reflect the path on your local machine.
The file is read in as a tibble, as defined by the tibble package, as opposed to a base R data frame. Since character columns are not converted to factors by default, we use mutate_if() to convert all character columns to factors. We discuss data mutation with dplyr later in this chapter. Lastly, we use the tibble glimpse() function to preview the data table; it serves a similar purpose as the base R str() function. The tibble contains 3,104 records, or counties, and 37 columns, each representing different information about the counties.
Note the use of the forward-pipe operator to pass the read table to the mutate_if() function.
fldrPth <- "gslrData/chpt18/data/"
cntyD <- read_csv(str_glue("{fldrPth}us_county_data.csv")) |>
mutate_if(is.character, as.factor)
glimpse(cntyD)
Rows: 3,104
Columns: 37
$ NAME <fct> Autauga County, Baldwin County, Barbour County, Bibb C…
$ STATE_NAME <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Alabama, …
$ STATE_ABBR <fct> AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL…
$ STATE_FIPS <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ COUNTY_FIP <dbl> 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29,…
$ FIPS <dbl> 1001, 1003, 1005, 1007, 1009, 1011, 1013, 1015, 1017, …
$ POPULATION <dbl> 58805, 231767, 25223, 22293, 59134, 10357, 19051, 1164…
$ POP_SQMI <dbl> 97.3, 141.9, 27.9, 35.6, 90.9, 16.6, 24.5, 190.2, 57.7…
$ SQMI <dbl> 604.37, 1633.14, 904.52, 626.17, 650.63, 625.14, 777.8…
$ SUB_REGION <fct> E S Cen, E S Cen, E S Cen, E S Cen, E S Cen, E S Cen, …
$ med_income <dbl> 57982, 61756, 34990, 51721, 48922, 33866, 44850, 50128…
$ households <dbl> 21559, 84047, 9322, 7259, 21205, 3429, 6649, 44572, 13…
$ per_desk_lap <dbl> 74.13609, 77.73865, 52.45655, 56.17854, 66.86631, 55.2…
$ per_smartphone <dbl> 81.00561, 84.63598, 68.69770, 73.61896, 76.43952, 69.2…
$ per_no_comp <dbl> 8.571826, 8.239437, 20.220983, 16.779171, 14.963452, 2…
$ per_internet <dbl> 82.79605, 85.52358, 64.99678, 76.16752, 80.03301, 62.7…
$ per_broadband <dbl> 82.70792, 85.06907, 64.63205, 76.12619, 79.62273, 60.6…
$ per_no_internet <dbl> 15.33930, 11.91952, 28.96374, 20.93952, 18.48621, 30.3…
$ dem <dbl> 105.61176, 31.40693, 115.25758, 125.28409, 246.04255, …
$ precip <dbl> 1412.167, 1712.929, 1400.435, 1422.227, 1492.224, 1397…
$ tempmn <dbl> 17.95529, 19.31788, 18.39159, 17.46602, 16.07128, 18.0…
$ tempmin <dbl> 11.758941, 13.763203, 11.941061, 11.060341, 10.058404,…
$ tempmax <dbl> 24.15424, 24.87489, 24.84402, 23.87443, 22.08702, 24.4…
$ dptmn <dbl> 12.50141, 14.37255, 12.73129, 12.08955, 11.05468, 12.4…
$ vpdmin <dbl> 0.7703529, 0.6897403, 0.8700758, 0.7509091, 0.5855319,…
$ vpdmax <dbl> 16.62553, 15.45636, 18.27205, 16.81307, 14.45106, 17.4…
$ sol <dbl> 15.63682, 16.28429, 15.98689, 15.46614, 15.07543, 15.9…
$ per_for <dbl> 49.71597, 33.13072, 56.58767, 70.28501, 54.95027, 54.6…
$ per_dev <dbl> 6.815302, 9.449728, 4.227699, 5.258896, 8.577787, 2.98…
$ per_wet <dbl> 10.8697204, 30.6357571, 8.8715967, 6.5578020, 0.634172…
$ per_crop <dbl> 3.14176012, 9.19548790, 4.53634268, 0.06282062, 1.1841…
$ per_past_grass <dbl> 20.795712, 10.806887, 12.823742, 10.078160, 30.432964,…
$ strm_length <dbl> 49.599790, 74.558900, 57.307449, 45.351825, 4.118788, …
$ strm_den <dbl> 0.031686857, 0.017626996, 0.024462179, 0.027964351, 0.…
$ per_karst <dbl> 0.00000000, 0.00000000, 14.91452991, 11.05620754, 18.1…
$ rail_den <dbl> 0.09007046, 0.02088778, 0.04713053, 0.09748651, 0.0374…
$ road_den <dbl> 1.890543, 1.784091, 1.090589, 1.792521, 1.935772, 1.24…
18.3 Data Wrangling with dplyr
In this section, we focus on basic data wrangling using dplyr. We specifically discuss:
- Renaming columns
- Selecting a subset of columns
- Filtering and querying rows
18.3.1 Rename Columns
The rename() function from dplyr is used to rename columns. In our example, the “dem” column, which represents the mean county elevation in meters, is renamed to “ELEV”, while the “per_wet” column, which represents the percentage of the county’s land area that is wetland, is renamed to “pWetland”. It is not necessary to place quotes around the new or old column names unless there are spaces in the names. The syntax reads as NEW NAME = OLD NAME.
cntyD2 <- cntyD |>
rename(ELEV = dem,
pWetland = per_wet)
18.3.2 Select Columns
A subset of columns can be extracted to a new tibble, in this case cntyD2, using the dplyr select() function, which accepts the names of the columns that we want to maintain. In the first example, we maintain only four columns. It is also possible to drop columns by placing - in front of the column name; in the second example, we remove two columns. Helper functions can also be used to select columns. In the last example, we use the where() and is.factor() functions to select only columns that have the factor data type. Again, there are many is.*() functions that operate as tests in R; the statement within where() must evaluate to TRUE or FALSE.
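Below is a minimal sketch of the three selections just described; the specific columns chosen and the output object names for the second and third examples are ours for illustration:
# keep four columns
cntyD2 <- cntyD |>
  select(NAME, STATE_NAME, POPULATION, dem)
# drop two columns
cntySub2 <- cntyD |>
  select(-STATE_FIPS, -COUNTY_FIP)
# keep only factor columns
cntySub3 <- cntyD |>
  select(where(is.factor))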
18.3.3 Filter and Query Rows
Selecting a subset of rows is generally termed querying or filtering the data. In other words, only rows or records that meet certain criteria are returned to the new object. With dplyr, this is accomplished with the filter() function. In our first example, we filter the data to create a subset of counties that have a mean elevation greater than 1,000 meters (dem > 1000). These counties are written to a new tibble called cntyD1000. We then calculate the percentage of counties that have a mean elevation greater than 1,000 meters by dividing the number of rows in the new tibble by the number of rows in the original tibble using the base R nrow() function. We also use round() to round off the percentage to 1 decimal place. 11.1% of the counties have a mean elevation greater than 1,000 meters.
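A minimal sketch of this query and the percentage calculation:
cntyD1000 <- cntyD |>
  filter(dem > 1000)
round(nrow(cntyD1000)/nrow(cntyD)*100, 1)
[1] 11.1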
In the next example, we implement a compound query using & to filter out all counties that have an elevation greater than 1,000 meters (dem > 1000) AND a percent forest cover greater than 75% (per_for > 75). Only 0.4% of the counties meet these two criteria.
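A sketch of the compound query (the output object name is ours):
cntyDElevFor <- cntyD |>
  filter(dem > 1000 & per_for > 75)
round(nrow(cntyDElevFor)/nrow(cntyD)*100, 1)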
[1] 0.4
Using the forward-pipe operator, it is possible to perform a query using filter() and select the columns to include in the final output using select(), as opposed to including all of the original columns. If you use select() prior to using filter(), only the selected columns will be available for use in the query statement. In our example, the order would not matter since the columns used in the query are included in the selection; so, we could have used select() before filter().
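A sketch combining the two verbs; the columns retained are our own choice:
cntyD |>
  filter(dem > 1000 & per_for > 75) |>
  select(NAME, STATE_ABBR, dem, per_for)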
The next example repeats the prior example but uses OR (|) as opposed to AND (&). This results in 16.3% of the counties being returned to the new object, as sketched below. In other words, 16.3% of the counties had a mean elevation greater than 1,000 m, a percent forest cover greater than 75%, or both. The available operators for combining multiple queries are as follows:
- &: A AND B (meets both criteria)
- |: A OR B (meets at least one of the criteria)
- !: NOT (negates a criterion; for example, A & !B returns rows that meet the first criterion but not the second)
- xor(): A OR B but NOT A AND B (meets exactly one of the criteria)
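A sketch of the OR query (the output object name is ours):
cntyDOr <- cntyD |>
  filter(dem > 1000 | per_for > 75)
round(nrow(cntyDOr)/nrow(cntyD)*100, 1)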
[1] 16.3
The %in% operator can be used to test whether a character string, number, date, or logical occurs within a vector of options. In the example below, we extract all Lincoln, Washington, and Jefferson counties by testing whether the name of the county from the “NAME” field occurs in a vector of these three county names. This is much more concise than using a series of OR (|) statements.
cntyNmSub <- c("Lincoln County",
"Washington County",
"Jefferson County")
cntyD |>
filter(NAME %in% cntyNmSub) |>
count() |> gt()
n |
---|
78 |
In this example, we create a more complex query that requires the use of parentheses to clarify the order of operations. Specifically, counties within the state of Utah OR Colorado are selected if they have a mean elevation greater than 1,000 meters OR a percent forest cover greater than 75%. 93 counties meet these criteria.
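A sketch of this query, assuming the states are identified via the “STATE_NAME” column:
cntyD |>
  filter((STATE_NAME == "Utah" | STATE_NAME == "Colorado") &
           (dem > 1000 | per_for > 75)) |>
  count() |> gt()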
18.3.4 Slicing
Slicing can be used to select a given number of features or rows with the largest or smallest values relative to an attribute. In this first example, we obtain the county with the highest mean elevation using slice_max(). The county with the highest mean elevation is San Juan County, Colorado, with a mean elevation of 3,478 meters. Similar to slice_max(), top_n() can be used to select the n records with the largest values for a specific attribute.
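A sketch of both selections; the columns retained for printing are our own choice:
# county with the highest mean elevation
cntyD |>
  slice_max(dem, n=1) |>
  select(NAME, STATE_ABBR, dem) |> gt()
# five counties with the largest mean elevations
cntyD |>
  top_n(5, dem) |>
  select(NAME, STATE_ABBR, dem) |> gt()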
NAME | STATE_ABBR | dem |
---|---|---|
San Juan County | CO | 3478.088 |
NAME | STATE_ABBR | dem |
---|---|---|
Hinsdale County | CO | 3351.181 |
Lake County | CO | 3328.862 |
Mineral County | CO | 3197.444 |
San Juan County | CO | 3478.088 |
Summit County | CO | 3208.059 |
To obtain records with the smallest values, you can use slice_min(). In the example below, we first obtain the county with the lowest population (Loving County, Texas, with a population of 64) followed by the five counties with the lowest populations. arrange() is used to sort the counties ascending based on population, while arrange() combined with desc() sorts the data descending.
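A sketch of these selections:
# county with the lowest population
cntyD |>
  slice_min(POPULATION, n=1) |>
  select(NAME, STATE_ABBR, POPULATION) |> gt()
# five lowest-population counties, sorted ascending
cntyD |>
  slice_min(POPULATION, n=5) |>
  arrange(POPULATION) |>
  select(NAME, STATE_ABBR, POPULATION) |> gt()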
NAME | STATE_ABBR | POPULATION |
---|---|---|
Loving County | TX | 64 |
18.3.5 Data Summarization with dplyr
18.3.5.1 Global Summarization
A variety of summary statistics can be calculated using the dplyr summarize() function combined with base R functions such as mean(), median(), sum(), min(), max(), and sd(). If there are missing values, this can impact some of the summary calculations. To avoid this issue, you can use the na.rm=TRUE argument, which indicates that missing data should be ignored in the calculations.
Below, we demonstrate how to calculate the mean population using the base R mean() function and the dplyr summarize() function. This results in a single mean that aggregates the data. It is not necessary to use na.rm=TRUE here since there are no missing population values in the dataset.
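A minimal sketch (the output column name meanPop is our choice):
cntyD |>
  summarize(meanPop = mean(POPULATION)) |> gt()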
The tidyverse uses both American and British spellings, so summarization can be performed using the equivalent summarize() or summarise() functions.
It is also possible to calculate multiple summary metrics and/or summary metrics for multiple columns within a single call to summarize(), as demonstrated below. We also use round() to round off the generated metrics. Note that the syntax for each calculation is NEW COLUMN NAME = STAT FUNCTION(COLUMN NAME, ADDITIONAL ARGUMENTS AS NEEDED).
cntySummary <- cntyD |>
summarize(pop = round(mean(POPULATION), 0),
dem = round(mean(dem), 1),
perCrop = round(mean(per_crop), 1),
perFor = round(mean(per_for), 1))
cntySummary |> gt()
pop | dem | perCrop | perFor |
---|---|---|---|
106056 | 440.4 | 22.3 | 30.2 |
It is also possible to filter or query the data prior to calculating summary metrics so that the summary is obtained for only a subset of the data. In this case, the filtering must be applied before the summarization.
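For example, the following sketch summarizes mean elevation for a single state; the Colorado subset is our own choice for illustration:
cntyD |>
  filter(STATE_NAME == "Colorado") |>
  summarize(meanElev = round(mean(dem), 1)) |> gt()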
18.3.5.2 Summarize by Group
Instead of obtaining a single, global summary statistic, you may want to obtain summaries for each group as defined by a grouping variable, which is generally a factor column. In our first example, we obtain a count of counties in each sub-region by first grouping the data using group_by() and then using count() to obtain the count of records by group. Once group_by() is used, all following operations occur at the group level. The grouping information will be maintained in any resulting tibbles; it can be removed using ungroup().
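A sketch of the grouped count:
cntyD |>
  group_by(SUB_REGION) |>
  count() |>
  ungroup() |> gt()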
SUB_REGION | n |
---|---|
E N Cen | 437 |
E S Cen | 364 |
Mid Atl | 150 |
Mtn | 281 |
N Eng | 67 |
Pacific | 133 |
S Atl | 584 |
W N Cen | 618 |
W S Cen | 470 |
In this second example, we obtain total population counts by sub-region using group_by() and summarize(). We then print the data in descending order relative to total population. The South Atlantic sub-region has the highest total population while New England has the lowest. The second code block below demonstrates obtaining multiple grouped summary metrics.
cntyD |>
group_by(SUB_REGION) |>
summarize(totalPop = sum(POPULATION)) |>
arrange(desc(totalPop))|>
ungroup() |> gt()
SUB_REGION | totalPop |
---|---|
S Atl | 66026391 |
Pacific | 51480760 |
E N Cen | 47368533 |
Mid Atl | 42492943 |
W S Cen | 40774139 |
Mtn | 24919150 |
W N Cen | 21616921 |
E S Cen | 19402234 |
N Eng | 15116205 |
cntyD |>
group_by(SUB_REGION) |>
summarize(totalPop = round(sum(POPULATION), 0),
avgPop = round(mean(POPULATION), 0),
totalArea = round(sum(SQMI), 1))|>
ungroup() |> gt()
SUB_REGION | totalPop | avgPop | totalArea |
---|---|---|---|
E N Cen | 47368533 | 108395 | 247927.6 |
E S Cen | 19402234 | 53303 | 181826.6 |
Mid Atl | 42492943 | 283286 | 101316.4 |
Mtn | 24919150 | 88680 | 863563.8 |
N Eng | 15116205 | 225615 | 65456.0 |
Pacific | 51480760 | 387073 | 322419.6 |
S Atl | 66026391 | 113059 | 271336.7 |
W N Cen | 21616921 | 34979 | 517800.9 |
W S Cen | 40774139 | 86753 | 433058.9 |
18.3.6 Mutate
The mutate() function from dplyr allows you to add new columns to a tibble or manipulate existing columns. In contrast, transmute(), which is not demonstrated here, performs a calculation or generates a new column using existing columns and then removes the columns on which the calculation was based. In our first example, sketched below, we calculate the county area in square kilometers (km2) using the area in square miles (mi2), which is already included in the table, and the required conversion factor. We also calculate the total percentage of the county mapped to low-vegetation land cover by adding the crop and pasture/grass percentages. Since we use mutate() here instead of transmute(), the original columns are maintained.
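A sketch of these calculations; the new column names and the mi2-to-km2 conversion factor of 2.58999 are our own:
cntyD |>
  mutate(SQKM = SQMI * 2.58999,
         per_lowVeg = per_crop + per_past_grass)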
In this example, we calculate the forest area by converting the percent forest estimate to a proportion and then multiplying by the county land area. Once this is accomplished, the data are grouped by sub-region using group_by(), and the total forest area per sub-region is calculated. The result is printed in descending order based on total forest area.
cntyD |>
mutate(forArea = (per_for/100)*SQMI) |>
group_by(SUB_REGION) |>
summarize(totalForest = sum(forArea)) |>
arrange(desc(totalForest)) |>
ungroup() |> gt()
SUB_REGION | totalForest |
---|---|
Mtn | 157567.00 |
S Atl | 113851.75 |
Pacific | 97132.54 |
E S Cen | 89329.46 |
W S Cen | 76039.24 |
E N Cen | 67518.49 |
Mid Atl | 55930.47 |
W N Cen | 54214.86 |
N Eng | 43165.03 |
Mutation can be generalized using mutate_if(), where mutation is performed for all columns that meet a condition. This is how we converted all character columns to the factor data type when reading in the data at the beginning of the chapter, which is replicated below.
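A minimal sketch of that conversion applied to the existing tibble:
cntyD |>
  mutate_if(is.character, as.factor)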
Another generalization of mutate() is mutate_at(), which allows you to list a set of columns for which you want to perform a mutation. Here, we convert all land cover percentages to proportions, as sketched below; the original columns are overwritten.
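A sketch, assuming the land cover percentage columns are the five listed here:
cntyD |>
  mutate_at(vars(per_for, per_dev, per_wet, per_crop, per_past_grass),
            ~ .x / 100)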
18.3.7 Random Sampling
A random sample can be drawn from a dataset using one of the following functions:
- sample_n(): select a given number of random samples
- sample_frac(): randomly select a given proportion of the samples
If replace=FALSE is used, random sampling without replacement is implemented, where a single record or row cannot be sampled more than once. In contrast, random sampling with replacement is implemented when replace=TRUE, which allows a single record or row to potentially be sampled multiple times. set.seed() can be used to make the random selection deterministic and reproducible.
set.seed(42)
cntyDSamp2 <- cntyD |>
sample_frac(size=.2, replace=FALSE)
To implement stratified random sampling with or without replacement, you can combine group_by() and sample_n(). The variable used for the stratification must define the groupings. In the example below, ten random counties are drawn without replacement from each sub-region.
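A sketch of the stratified sample (the output object name is ours):
set.seed(42)
cntyDStrat <- cntyD |>
  group_by(SUB_REGION) |>
  sample_n(size=10, replace=FALSE) |>
  ungroup()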
We sometimes use sample_frac() to shuffle data, which may be desired to reduce autocorrelation in the data. Using size=1 and replace=FALSE, all rows will be returned but in a new order.
set.seed(42)
cntyDSamp3 <- cntyD |>
sample_frac(size=1, replace=FALSE)
18.3.8 Binding Data
Multiple tibbles or data frames can be merged into a single object via binding. bind_cols() binds along columns. For example, a table with 100 records and 15 columns could be combined with a table with 100 records and 20 different columns to create a table with 100 records/rows and 35 columns. Note that bind_cols() assumes that the records are in the same order; in other words, binding is based on row indices only. If the rows may not be in the same order, then a table join is more appropriate; table joins are discussed below. If the same column names are used in both datasets, new names will be generated to differentiate them.
This example demonstrates the use of bind_cols(). We first create subsets of the columns using select(). The two subsets are then merged using bind_cols(). Again, it is assumed that the records are in the same order in the two tibbles, which is true in this case since select() does not alter the order of the rows.
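A sketch of this workflow; the subset columns and object names are our own:
colsA <- cntyD |>
  select(NAME, STATE_NAME, FIPS)
colsB <- cntyD |>
  select(POPULATION, SQMI)
colsBound <- bind_cols(colsA, colsB)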
bind_rows() binds along rows. For example, a table with 50 records could be combined with a table with another 50 records, but with the same columns or attributes, to generate a table with 100 records.
In this example, we randomly select 100 rows to include in a new tibble called cntyDSamp1. We then randomly select 20% of the rows to include in another tibble called cntyDSamp2. We next merge these subsets together using bind_rows(). Since some of the samples may have been included in both subsets, we use distinct() to remove duplicate rows.
cntyDSamp1 <- cntyD |>
sample_n(size=100, replace=FALSE)
cntyDSamp2 <- cntyD |>
sample_frac(size=.2, replace=FALSE)
cntyDSamp4 <- bind_rows(cntyDSamp1, cntyDSamp2) |>
distinct()
18.3.9 Intersect, Union, and SetDiff
It is possible to compare two tibbles to return a subset of the records. Specifically:
- intersect(): maintains only rows that occur in both tables
- union(): maintains rows that occur in Table 1 or Table 2 but removes duplicates
- setdiff(): returns rows that occur in Table 1 but not Table 2
Below, we demonstrate the use of setdiff(). We first select 30 counties from each sub-region. We next find the counties that were not selected by comparing the subset to the full dataset using setdiff().
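A sketch of this comparison (the object names are ours):
set.seed(42)
cntySel30 <- cntyD |>
  group_by(SUB_REGION) |>
  sample_n(size=30, replace=FALSE) |>
  ungroup()
cntyNotSel <- setdiff(cntyD, cntySel30)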
18.3.10 Table Joins
Table joins allow for merging data based on a common or shared attribute as opposed to the order of the records, as is the case with bind_cols(). There are several types of joins implemented within dplyr:
- left_join(): maintains all records from Table 1 even if they have no match in Table 2
- right_join(): maintains all records from Table 2 even if they have no match in Table 1
- full_join(): maintains all records from both tables
- inner_join(): maintains only matching records between both tables
To demonstrate a table join, we first create subsets of the columns stored in two separate tibbles. Note that both tibbles contain the “FIPS” column, which will be used as a unique ID. We next perform an inner join where rows are matched based on the shared or common attribute (by="FIPS"). Again, rows are matched based on a shared attribute as opposed to their order in the table; as a result, it is not required that the rows be in the same order in both tables. Since we are using an inner join, only rows that occur in both tables are maintained. Because all counties are included in this case, all four types of joins would yield the same result. In the second example, sketched after the code below, we create a random subset of rows to include in one of the tibbles. Since this subset is the first table in the join and a left join is used, only the matching 100 records are returned to the new object.
colSub1 <- cntyD |>
select(NAME,
STATE_NAME,
STATE_ABBR,
FIPS)
colSub2 <- cntyD |>
select(FIPS,
per_for,
per_dev,
per_wet,
per_crop,
per_past_grass)
colSubMerge <- inner_join(colSub1,
colSub2,
by="FIPS")
18.4 Manipulating Strings with stringr
We have primarily focused on data preparation, cleaning, manipulation, summarization, wrangling, and querying with dplyr in this chapter. In the last two sections of this chapter, we explore character string manipulation with stringr followed by factor manipulation with forcats.
The stringr package provides functions for working with and manipulating character strings. In the first example, we are creating a new column named “NAME2” and populating it with a lower case version of the county names from the “NAME” column. Note that stringr functions are prefixed with “str_”.
cntyD2 <- cntyD
cntyD2$NAME2 <- cntyD2$NAME |>
str_to_lower()
In this example, we remove the word “County” from the county names using str_replace(), where the first argument is the character string to be replaced and the second argument is the character string to replace it with, in this case an empty string. Note that we include the space before “County” so that the space is also removed. The result is county names without “ County” included.
cntyD2$NAME <- cntyD2$NAME |>
str_replace(" County", "")
In the example below, we use str_length() to calculate the length of each county name. We then select the county with the longest name, which is St. John the Baptist Parish. The count includes spaces.
cntyD2 |>
mutate(nameLength = str_length(NAME)) |>
select(NAME, nameLength) |>
slice_max(nameLength, n=1) |> gt()
NAME | nameLength |
---|---|
St. John the Baptist Parish | 27 |
When there are potentially multiple occurrences of a character string to be replaced, you can use str_replace_all() as opposed to str_replace(). Here, we use this function to replace all spaces in the sub-region names. We convert the result to a factor data type and print the levels to check the result.
cntyD2$SUB_REGION <- cntyD2$SUB_REGION |>
str_replace_all(" ", "") |>
as.factor()
levels(cntyD2$SUB_REGION)
[1] "ENCen" "ESCen" "MidAtl" "Mtn" "NEng" "Pacific" "SAtl"
[8] "WNCen" "WSCen"
What if we wanted to find the longest county name excluding spaces? This can be accomplished by combining str_length() and str_replace_all(), as demonstrated below. The same county is still selected but with a length of 23 characters as opposed to 27.
cntyD2 |>
mutate(nameLength = str_length(str_replace_all(NAME, " ", ""))) |>
select(NAME, nameLength) |>
slice_max(nameLength, n=1) |> gt()
NAME | nameLength |
---|---|
St. John the Baptist Parish | 23 |
str_detect() is used to detect a provided sequence of characters within a string. In our example, we search for the string “Lincoln”. This results in a logical output, so counties that have “Lincoln” in their name yield TRUE while all others yield FALSE. The logical data are written to a new column named “hasLincoln”. We then filter and count the records that contain “Lincoln”; 24 counties have the character string “Lincoln” included in their name.
cntyD2 |>
mutate(hasLincoln = str_detect(NAME, "Lincoln")) |>
filter(hasLincoln == TRUE) |>
count() |> gt()
n |
---|
24 |
As discussed above, str_glue() is used to concatenate character strings; we demonstrate this again below using file paths. str_glue_data() allows for performing concatenations for each record or row in a table. In our example, we generate a new column, “NAME3”, with the following format: “NAME COUNTY, STATE_NAME”. {} indicates data arguments or column names.
fldPath1 <- "C:/Data/tidyverse/"
fName <- "dataset1.csv"
str_glue("{fldPath1}{fName}")
C:/Data/tidyverse/dataset1.csv
cntyD2$NAME3 <- cntyD2 |> str_glue_data("{NAME} County, {STATE_NAME}")
18.5 Manipulating Factors with forcats
The forcats package is used to manipulate factors and associated factor levels. Remember that the base R levels() function returns the factor levels.
cntyD2 <- cntyD
cntyD2$SUB_REGION |> levels()
[1] "E N Cen" "E S Cen" "Mid Atl" "Mtn" "N Eng" "Pacific" "S Atl"
[8] "W N Cen" "W S Cen"
All forcats functions are prefixed with “fct_”. For example, the fct_count() function counts the number of rows or records associated with each factor level, as sketched below.
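A sketch of this count:
cntyD2$SUB_REGION |>
  fct_count() |> gt()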
f | n |
---|---|
E N Cen | 437 |
E S Cen | 364 |
Mid Atl | 150 |
Mtn | 281 |
N Eng | 67 |
Pacific | 133 |
S Atl | 584 |
W N Cen | 618 |
W S Cen | 470 |
Remember that all factor levels are maintained even if the data are filtered such that some levels have no records in the subset. This is demonstrated below.
cntyD3 <- cntyD2 |>
filter(SUB_REGION %in% c("N Eng", "SAtl", "Mid Atl"))
cntyD3$SUB_REGION |> levels()
[1] "E N Cen" "E S Cen" "Mid Atl" "Mtn" "N Eng" "Pacific" "S Atl"
[8] "W N Cen" "W S Cen"
Similar to the base R droplevels() function, fct_drop() is used to drop unused factor levels.
cntyD3 <- cntyD2 |>
filter(SUB_REGION %in% c("N Eng", "S Atl", "Mid Atl"))
cntyD3$SUB_REGION |> levels()
[1] "E N Cen" "E S Cen" "Mid Atl" "Mtn" "N Eng" "Pacific" "S Atl"
[8] "W N Cen" "W S Cen"
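# drop the unused levels and re-check (this step produces the three levels shown below)
cntyD3$SUB_REGION <- cntyD3$SUB_REGION |>
  fct_drop()
cntyD3$SUB_REGION |> levels()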
[1] "Mid Atl" "N Eng" "S Atl"
Factor levels can be renamed using fct_recode(). The structure is NEW NAME = OLD NAME; quotes are only necessary if there are spaces in the names. Below, we rename the factor levels to use the full sub-region names as opposed to abbreviations.
cntyD3$SUB_REGION <- cntyD3$SUB_REGION |>
fct_recode("South Atlantic" = "S Atl",
"Mid-Atlantic" = "Mid Atl",
"New England" = "N Eng")
cntyD3$SUB_REGION |> levels()
[1] "Mid-Atlantic" "New England" "South Atlantic"
Similarly, fct_collapse() can be used to merge factor levels into a smaller set, with or without renaming. Below, we merge the factor levels into the categories “Central”, “East”, and “West”.
cntyD2$SUB_REGION <- cntyD2$SUB_REGION |>
fct_collapse("Central" = c("E N Cen", "E S Cen", "W N Cen", "W S Cen"),
"East" = c("Mid Atl", "N Eng", "S Atl"),
"West" = c("Mtn", "Pacific"))
cntyD2$SUB_REGION |>
levels()
[1] "Central" "East" "West"
18.6 Concluding Remarks
We use the tidyverse for data preparation, cleaning, querying, summarization, and wrangling throughout this text. We opted to use the tidyverse and the forward-pipe operator since we feel that the code is often more concise and readable in comparison to base R syntax that accomplishes similar tasks. If you struggle with tidyverse syntax as you engage with this text, you may find it useful to refer to this chapter or the tidyverse documentation. Since the tidyverse has a large user base, there is also a lot of help available online. You may also want to check out the cheatsheets made available by Posit.
18.7 Questions
- Explain the difference between filter() and select().
- Explain the difference between bind_rows() and bind_cols().
- Explain the difference between transmute() and mutate().
- Explain the difference between left_join(), inner_join(), and full_join().
- Explain the difference between intersect(), union(), and setdiff().
- What is the purpose of the na.rm parameter in summarization functions, such as mean()?
- What is the purpose of the ungroup() function?
- What is the purpose of the fct_collapse() function from forcats?
18.8 Exercises
You have been provided with an urban tree dataset for Portland, Oregon (portlandTrees.csv) in the exercise folder for the chapter. Use the tidyverse to complete the following tasks and answer the following questions.
- Read in portlandTrees.csv
- Select out the following columns: “Common_nam”, “Genus_spec”, “Family”, “Genus”, “DBH”, “TreeHeight”, “CrownWidth”, “CrownBaseH”, and “Condition”
- Convert all character columns to factors
- Rename the columns as follows (OLD NAME = NEW NAME):
  - “Common_nam” = “Common”
  - “Genus_spec” = “Scientific”
  - “Family” = “Family”
  - “Genus” = “Genus”
  - “DBH” = “DBH”
  - “TreeHeight” = “Height”
  - “CrownWidth” = “CrownWidth”
  - “CrownBaseH” = “CrownBaseHeight”
  - “Condition” = “Condition”
- How many different families are included in the dataset?
- How many different genera are included in the dataset?
- Which family has the most unique genera included in the dataset?
- How many different species from the Quercus genus are included?
- Within the Quercus genus, which species has the largest number of records?
- What is the median DBH of all Quercus trees included in the dataset?
- Which genus has a larger average crown base height: Quercus or Acer?
- How many trees total from the Tilia, Pinus, or Ulmus genera have a “Dead” or “Poor” condition?
- How many trees are from the Quercus, Pinus, Acer, or Ulmus genera, have a height larger than 100, a DBH larger than 15, and a crown width larger than 25?
- Generate code to create a random sample without replacement of 15 trees from each of the following genera (60 trees total): Quercus, Pinus, Acer, or Ulmus.
- How many trees in the dataset have a common name that includes the word “red” (ignoring case)?
- How many trees in the dataset have a common name that includes the word “western” (ignoring case)?
- How many trees have a common name longer than 30 characters ignoring spaces?
- How many trees have a common name that is shorter than its scientific name, ignoring spaces?
- Which species has the largest range of heights in the dataset?
- Which species has the largest count of “Poor” condition trees in the dataset?