Working with Strings and Factors
Objectives
- Manipulate and work with string data using the stringr package
- Manipulate and work with factors using the forcats package
Overview
In this short section, I will provide some additional demonstrations for working with string and factor data in R using the stringr and forcats packages. These packages are part of the tidyverse and make working with string and factor data much easier than using the base R methods. Cheat sheets for stringr and forcats can be found here. The link at the bottom of the page provides the example data and R Markdown file used to generate this module.
library(stringr)
library(forcats)
library(dplyr)
Working with Strings
Before I can manipulate strings, I need to read in some data that contains strings. To demonstrate, I will read in matts_movies.csv. Using summary(), you can see that six columns are provided: “Movie.Name”, “Director”, “Release.Year”, “My.Rating”, “Genre”, and “Own”. I have set the stringsAsFactors argument to FALSE so that all text columns are read in as character data as opposed to factors. Later, I will convert specific columns to factors as needed. For this exercise, we will treat the “Movie.Name” and “Director” columns as character strings and the “Genre” column as a factor.
setwd("D:/mydata/strings_and_factors")
movies <- read.csv("matts_movies.csv", sep=",", header=TRUE, stringsAsFactors=FALSE)
summary(movies)
  Movie.Name          Director          Release.Year    My.Rating
 Length:1852        Length:1852        Min.   :1921   Min.   :0.670
 Class :character   Class :character   1st Qu.:1995   1st Qu.:6.320
 Mode  :character   Mode  :character   Median :2004   Median :7.020
                                       Mean   :1999   Mean   :6.953
                                       3rd Qu.:2009   3rd Qu.:7.850
                                       Max.   :2014   Max.   :9.990
    Genre               Own
 Length:1852        Length:1852
 Class :character   Class :character
 Mode  :character   Mode  :character
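To make the effect of stringsAsFactors concrete, here is a minimal sketch that reads a small, made-up CSV supplied inline through the text argument of read.csv(). The two rows and the object names (csv_text, as_char, as_fact) are hypothetical and are not taken from matts_movies.csv.

```r
# Hypothetical two-row CSV supplied inline via the text argument
csv_text <- "Movie.Name,Genre\nAnnie Hall,Comedy\nMystic River,Drama"

# Same data read twice, changing only stringsAsFactors
as_char <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fact <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_char$Genre)  # "character"
class(as_fact$Genre)  # "factor"
```

Reading text columns as character first and converting selected columns later keeps the choice of which columns become factors explicit.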
Let’s start by finding all titles that contain the word “River”. This can be accomplished using the str_detect() function, which returns TRUE when the title contains “River” and FALSE when it does not. Instead of creating a new data object, I am saving the result as a new column in the table. I then use that column to keep only the titles that contain this word.
movies$river <- str_detect(movies$Movie.Name, "River")
movies %>% filter(river==TRUE)
                    Movie.Name       Director Release.Year My.Rating
1                 Mystic River Clint Eastwood         2003      9.56
2                 Frozen River  Courtney Hunt         2008      7.69
3      A River Runs Through It Robert Redford         1992      7.66
4 Joan Rivers: A Piece of Work    Ricki Stern         2010      6.43
5               The River Wild  Curtis Hanson         1994      5.12
        Genre Own river
1       Drama Yes  TRUE
2 Independent  No  TRUE
3       Drama Yes  TRUE
4 Documentary  No  TRUE
5    Thriller  No  TRUE
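One detail worth knowing is that str_detect() is case sensitive by default. The sketch below uses a small, made-up vector of titles (not the full data set) and the regex() helper from stringr with ignore_case = TRUE to relax that. The object names are only for illustration.

```r
library(stringr)

# Hypothetical titles used only for illustration
titles <- c("Mystic River", "Frozen River", "Annie Hall")

# Lower-case pattern: no matches, because matching is case sensitive
exact_case <- str_detect(titles, "river")
exact_case   # FALSE FALSE FALSE

# Wrapping the pattern in regex() allows case to be ignored
any_case <- str_detect(titles, regex("river", ignore_case = TRUE))
any_case     # TRUE TRUE FALSE

# The logical vector can be used directly to subset the titles
titles[any_case]
```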
The str_length() function returns the length of a string. As demonstrated in the result below, this count includes spaces. In the second example, I am obtaining the length of each title as a new column. I am then using the result to count the titles that have a length greater than 20. 324 titles have a length greater than 20.
str_length("A B C D E")
[1] 9
movies$len <- str_length(movies$Movie.Name)
movies %>% filter(len > 20) %>% count()
    n
1 324
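If you want a count that ignores spaces, one option is to remove them first. This is a small sketch using str_remove_all() from stringr on the same example string. The variable names are just placeholders.

```r
library(stringr)

x <- "A B C D E"

# Spaces are counted as characters
with_spaces <- str_length(x)
with_spaces     # 9

# Remove every space, then count what is left
without_spaces <- str_length(str_remove_all(x, " "))
without_spaces  # 5
```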
str_to_upper() will convert the string to all upper case as demonstrated in the next example. There are similar operations available for converting to lower case and title case.
head(str_to_upper(movies$Movie.Name))
[1] "ALMOST FAMOUS"            "THE SHAWSHANK REDEMPTION"
[3] "GROUNDHOG DAY"            "DONNIE DARKO"
[5] "CHILDREN OF MEN"          "ANNIE HALL"
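The lower case and title case counterparts are str_to_lower() and str_to_title(). Here is a brief sketch applied to two of the titles shown above, typed in by hand rather than pulled from the data frame.

```r
library(stringr)

# Two titles typed in by hand for illustration
shouting <- c("GROUNDHOG DAY", "THE SHAWSHANK REDEMPTION")

lower <- str_to_lower(shouting)
lower  # "groundhog day" "the shawshank redemption"

# Title case capitalizes the first letter of every word
title <- str_to_title(shouting)
title  # "Groundhog Day" "The Shawshank Redemption"
```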
One complexity of string manipulation in R, and in other programming languages, is that some characters, such as “\”, have special meaning and cannot be written directly within a string. An escape character must be used so that the symbol is interpreted as plain text. In R, “\” is itself the escape character, so a literal backslash is written as “\\”. Doubling the backslashes is therefore another way to write a Windows file path, in addition to converting backslashes to forward slashes (e.g., “C:/Data” and “C:\\Data” are both acceptable).
#This will yield an error.
#path <- "C:\R_Examples"
#This will not.
path <- "C:\\R_Examples"
print(path)
[1] "C:\\R_Examples"
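Note that print() shows the escaped form of the string, which can make it look as though two backslashes were stored. The sketch below uses writeLines() and nchar() to show that only one is. The raw string syntax in the last step assumes R 4.0.0 or later, and the name raw_path is just a placeholder.

```r
path <- "C:\\R_Examples"

# print() displays the escaped representation
print(path)

# writeLines() displays the actual characters: C:\R_Examples
writeLines(path)

# Only one backslash is stored, so the string has 13 characters
n_chars <- nchar(path)
n_chars  # 13

# Assumes R >= 4.0.0: a raw string needs no escaping at all
raw_path <- r"(C:\R_Examples)"
identical(path, raw_path)  # TRUE
```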
Working with Factors
Remember that factors in R are actually represented by integer codes. These codes are then linked to the text labels, or levels. In the example below I have used the fct_count() function from forcats to count the number of movies in each defined genre; head() limits the output to the first six levels. Note that I first have to convert the column from a character to a factor data type. Using nlevels() I obtain the number of defined levels for the factor. Matt has differentiated 18 genres, although the output also shows that “comedy” and “Comedy” are treated as separate levels because level names are case sensitive.
movies$Genre <- as.factor(movies$Genre)
head(fct_count(movies$Genre))
# A tibble: 6 x 2
  f               n
  <fct>       <int>
1 Action        197
2 Classic        47
3 comedy          1
4 Comedy        275
5 Documentary    78
6 Drama         321
nlevels(movies$Genre)
[1] 18
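To see the codes behind a factor, you can inspect a small example directly. This sketch uses a made-up vector of four genres; levels() lists the labels and as.integer() exposes the underlying codes.

```r
# Hypothetical genre values used only for illustration
genre <- factor(c("Drama", "Comedy", "Drama", "Action"))

# Levels are sorted alphabetically by default
lv <- levels(genre)
lv     # "Action" "Comedy" "Drama"

# Each value is stored as the position of its level
codes <- as.integer(genre)
codes  # 3 2 3 1

# Indexing the levels by the codes recovers the original text
lv[codes]
```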
The fct_infreq() function allows you to reorder factors based on frequency. Here, you can see that the most common genres were drama, comedy, and action.
common_genres <- fct_infreq(movies$Genre)
head(fct_count(common_genres))
# A tibble: 6 x 2
  f               n
  <fct>       <int>
1 Drama         321
2 Comedy        275
3 Action        197
4 Independent   190
5 Thriller      164
6 Foreign       157
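The reordering changes only the order of the levels, not the data values themselves. A small sketch with made-up values shows this; the counts here are invented so that each level has a different frequency.

```r
library(forcats)

# Hypothetical values: Drama x3, Comedy x2, Action x1
genre <- factor(c("Drama", "Comedy", "Drama", "Action", "Drama", "Comedy"))

levels(genre)    # "Action" "Comedy" "Drama" (alphabetical)

# Reorder the levels from most to least frequent
by_freq <- fct_infreq(genre)
levels(by_freq)  # "Drama" "Comedy" "Action"

# The values themselves are unchanged
identical(as.character(genre), as.character(by_freq))  # TRUE
```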
In the example below, I have kept only the movies in three genres: Drama, Documentary, and Comedy. However, if I call nlevels(), there are still 18 levels defined, because filtering rows does not remove levels from a factor. So, I will need to drop unused levels. This can be accomplished with fct_drop(). Printing the number of levels after applying this function will confirm that unused levels have been removed.
movies2 <- movies %>% filter(Genre == "Drama" | Genre == "Documentary" | Genre == "Comedy")
nlevels(movies2$Genre)
[1] 18
movies2$Genre <- fct_drop(movies2$Genre)
nlevels(movies2$Genre)
[1] 3
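The same behavior can be reproduced on a tiny, made-up factor, which makes it easier to see why the extra step is needed. This sketch uses base subsetting rather than filter() to keep it short.

```r
library(forcats)

# Hypothetical values used only for illustration
genre <- factor(c("Drama", "Comedy", "Action", "Drama"))

# Subsetting removes the Action value but keeps the Action level
kept <- genre[genre != "Action"]
n_before <- nlevels(kept)
n_before  # 3

# fct_drop() removes levels that no longer appear in the data
kept <- fct_drop(kept)
n_after <- nlevels(kept)
n_after   # 2
```

Base R offers droplevels() for the same purpose.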
Factor levels can be recoded and/or combined using fct_collapse(). Here, I am combining the “Drama” and “Comedy” levels into a new level called “fiction” and recoding “Documentary” to “nonfiction”. Recoding can also be accomplished using the recode() function from dplyr.
movies2$Genre <- fct_collapse(movies2$Genre, fiction = c("Drama", "Comedy"), nonfiction = c("Documentary"))
nlevels(movies2$Genre)
[1] 2
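Here is the same collapse applied to a short, made-up factor so that the effect on individual values is visible. In fct_collapse(), each argument name is the new level and the character vector lists the old levels it absorbs.

```r
library(forcats)

# Hypothetical values used only for illustration
genre <- factor(c("Drama", "Comedy", "Documentary", "Drama"))

# new_level = c(old levels to merge)
grouped <- fct_collapse(genre,
                        fiction = c("Drama", "Comedy"),
                        nonfiction = "Documentary")

as.character(grouped)  # "fiction" "fiction" "nonfiction" "fiction"
nlevels(grouped)       # 2
```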
My preferred method for changing the names of factor levels is the recode() function from dplyr, which is demonstrated below.
movies2$Genre <- recode(movies2$Genre, fiction = "F", nonfiction = "NF")
levels(movies2$Genre)
[1] "F"  "NF"
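For comparison, forcats provides fct_recode(), which does the same job but with the argument order reversed: recode() takes old = "new", while fct_recode() takes new = "old". This sketch uses a made-up factor and is offered as an alternative, not as a replacement for the approach above.

```r
library(forcats)

# Hypothetical values used only for illustration
genre <- factor(c("fiction", "nonfiction", "fiction"))

# fct_recode() uses new_name = "old_name"
short <- fct_recode(genre, F = "fiction", NF = "nonfiction")

levels(short)        # "F" "NF"
as.character(short)  # "F" "NF" "F"
```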
Regular Expressions
If you work with character or string data, it is worth learning about regular expressions, a language for describing patterns in strings. The stringr cheat sheet includes a page devoted to regular expressions. Regular expressions are also used for searching strings in other computational environments, such as Python.
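As a small taste, recall that the earlier search for “River” also returned the Joan Rivers documentary. The sketch below, using three of those titles typed in by hand, shows how the word boundary token \b (written "\\b" inside an R string) and the start-of-string anchor ^ narrow a match.

```r
library(stringr)

# Three titles typed in by hand for illustration
titles <- c("Mystic River", "Joan Rivers: A Piece of Work", "The River Wild")

# Plain substring match: "Rivers" also matches
plain <- str_detect(titles, "River")
plain       # TRUE TRUE TRUE

# \b marks a word boundary, so "Rivers" no longer matches
whole_word <- str_detect(titles, "\\bRiver\\b")
whole_word  # TRUE FALSE TRUE

# ^ anchors the pattern to the start of the string
starts_the <- str_detect(titles, "^The")
starts_the  # FALSE FALSE TRUE
```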
Concluding Remarks
There are many additional functions, analyses, and tasks that can be applied to strings and factors in R. If you are interested in this topic, please consult the stringr and forcats documentation for additional use cases and examples.