Working with Strings and Factors

Objectives

Manipulate and work with string data using the stringr package
Manipulate and work with factors using the forcats package

Overview

In this short section, I will provide some additional demonstrations for working with string and factor data in R using the stringr and forcats packages. These packages are part of the tidyverse and make working with string and factor data much easier than using the base R methods. Cheat sheets for stringr and forcats can be found here. The link at the bottom of the page provides the example data and R Markdown file used to generate this module.

library(stringr)
library(forcats)
library(dplyr)

Working with Strings

Before I can manipulate strings, I need to read in some data that contains strings. To demonstrate, I will read in matts_movies.csv. Using summary(), you can see that six columns are provided: “Movie.Name”, “Director”, “Release.Year”, “My.Rating”, “Genre”, and “Own”. I have set the stringsAsFactors argument to FALSE so that all text columns are read in as character data as opposed to factors. Later, I will convert specific columns to factors as needed. For this exercise, we will treat the “Movie.Name” and “Director” columns as character strings and the “Genre” column as a factor.

setwd("D:/mydata/strings_and_factors")
movies <- read.csv("matts_movies.csv", sep=",", header=TRUE, stringsAsFactors=FALSE)
summary(movies)
  Movie.Name          Director          Release.Year    My.Rating    
 Length:1852        Length:1852        Min.   :1921   Min.   :0.670  
 Class :character   Class :character   1st Qu.:1995   1st Qu.:6.320  
 Mode  :character   Mode  :character   Median :2004   Median :7.020  
                                       Mean   :1999   Mean   :6.953  
                                       3rd Qu.:2009   3rd Qu.:7.850  
                                       Max.   :2014   Max.   :9.990  
    Genre               Own           
 Length:1852        Length:1852       
 Class :character   Class :character  
 Mode  :character   Mode  :character
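As a small, self-contained sketch of what the stringsAsFactors argument controls, the example below reads a tiny table from an inline string instead of from matts_movies.csv. The column names and rows are made up for illustration and are not part of the movie data.

```r
# A tiny made-up table supplied as text, so no file is needed.
csv_text <- "Title,Genre\nFilm A,Drama\nFilm B,Comedy\nFilm C,Drama"

# With stringsAsFactors = FALSE, text columns stay as character vectors.
as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(as_chr$Genre)

# With stringsAsFactors = TRUE, text columns are converted to factors on import.
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(as_fct$Genre)
levels(as_fct$Genre)
```

In R 4.0.0 and later, FALSE is the default for this argument, so setting it explicitly mainly documents the intent and keeps the behavior the same on older versions.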

Let’s start by finding all titles that contain the text “River”. This can be accomplished using the str_detect() function, which returns TRUE if the title contains “River” and FALSE if it does not. Instead of creating a new data object, I am saving the result as a new column in the table. I then use this column to keep only the titles that contain the pattern. Note that the match is on the sequence of characters rather than on the whole word, so “Joan Rivers: A Piece of Work” is also returned.

movies$river <- (str_detect(movies$Movie.Name, "River"))
movies %>% filter(river==TRUE)
                    Movie.Name       Director Release.Year My.Rating
1                 Mystic River Clint Eastwood         2003      9.56
2                 Frozen River  Courtney Hunt         2008      7.69
3      A River Runs Through It Robert Redford         1992      7.66
4 Joan Rivers: A Piece of Work    Ricki Stern         2010      6.43
5               The River Wild  Curtis Hanson         1994      5.12
        Genre Own river
1       Drama Yes  TRUE
2 Independent  No  TRUE
3       Drama Yes  TRUE
4 Documentary  No  TRUE
5    Thriller  No  TRUE
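If only the stand-alone word is wanted, one option is to add word boundaries to the pattern. The sketch below uses three of the titles from the output above in a small vector of its own, so it runs without the movie data; the word-boundary pattern is my suggestion and is not part of the original workflow.

```r
library(stringr)

# Three titles copied from the filtered output above.
titles <- c("Mystic River", "Joan Rivers: A Piece of Work", "The River Wild")

# A plain pattern matches anywhere in the string, including inside "Rivers".
str_detect(titles, "River")

# \\b marks a word boundary in a regular expression, so only the
# stand-alone word "River" matches.
str_detect(titles, "\\bRiver\\b")
```

The second call returns FALSE for the Joan Rivers title because the "s" that follows "River" is a word character, so there is no boundary at that position.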

The str_length() function returns the length of a string. As demonstrated in the result below, this count includes spaces. In the second example, I am obtaining the length of each title as a new column. I am then using the result to count all titles that have a length greater than 20. 324 titles have a length greater than 20.

str_length("A B C D E")
[1] 9

movies$len <- str_length(movies$Movie.Name)
movies %>% filter(len > 20) %>% count()
    n
1 324
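The same idea can be tried on a plain character vector without a data frame. This is a minimal sketch using three titles from the earlier output; the variable names are my own.

```r
library(stringr)

titles <- c("Mystic River",
            "A River Runs Through It",
            "Joan Rivers: A Piece of Work")

# Number of characters in each title, with spaces and punctuation included.
len <- str_length(titles)
len

# Count how many titles are longer than 20 characters.
sum(len > 20)

# Keep only those titles.
titles[len > 20]
```

Because str_length() is vectorized, it returns one length per element, which is why it can be assigned directly to a new column in the table.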

str_to_upper() will convert the string to all upper case as demonstrated in the next example. There are similar operations available for converting to lower case and title case.

head(str_to_upper(movies$Movie.Name))
[1] "ALMOST FAMOUS"            "THE SHAWSHANK REDEMPTION"
[3] "GROUNDHOG DAY"            "DONNIE DARKO"            
[5] "CHILDREN OF MEN"          "ANNIE HALL"
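For completeness, here is a short sketch of the lower case and title case functions mentioned above, applied to a single string with deliberately mixed case.

```r
library(stringr)

title <- "the Shawshank REDEMPTION"

str_to_lower(title)   # every letter in lower case
str_to_title(title)   # first letter of each word in upper case, the rest lower
str_to_upper(title)   # every letter in upper case
```

Standardizing case in this way is often a useful first step before comparing strings or converting a text column to a factor.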

One complexity of string manipulation in R, and also in other coding languages, is that some characters, such as “\”, have special meaning and cannot be typed directly within a string. So, an escape character must be used to make sure the symbol is interpreted as plain text. In R, “\” is itself the escape character, so a literal backslash is written as “\\”. Doubling the backslashes is another way to write a file path in Windows, other than converting backslashes to forward slashes (e.g., “C:/Data” and “C:\\Data” are both acceptable). Note that print() displays the string in its escaped form, which is why two backslashes appear in the output below even though the string contains only one.

#This will yield an error. 
#path <- "C:\R_Examples"
#This will not. 
path <- "C:\\R_Examples"
print(path)
[1] "C:\\R_Examples"
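To see the characters that are actually stored, writeLines() can be used in place of print(). The sketch below also shows a raw string, which assumes R 4.0.0 or later; the raw_path name is mine.

```r
library(stringr)

path <- "C:\\R_Examples"

# print() shows the escaped form; writeLines() shows the actual characters.
writeLines(path)

# The doubled backslash in the code is stored as a single character.
str_length(path)

# Raw strings (assumes R 4.0.0 or later) avoid the escaping altogether.
raw_path <- r"(C:\R_Examples)"
identical(path, raw_path)
```

Both forms produce the same string, so the choice between them is mostly a matter of readability.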

Working with Factors

Remember that factors in R are actually represented by integer codes. These codes are then linked to the text labels, or levels. In the example below I have used the fct_count() function from forcats to count the number of movies in each defined genre. Note that I first have to convert the column from a character to a factor data type. Because of head(), only the first six levels are printed. Using nlevels() I obtain the number of defined levels for the factor, which is 18. Factor levels are case sensitive, so “comedy” and “Comedy” are counted as separate levels.

movies$Genre <- as.factor(movies$Genre)
head(fct_count(movies$Genre))
# A tibble: 6 x 2
  f               n
  <fct>       <int>
1 Action        197
2 Classic        47
3 comedy          1
4 Comedy        275
5 Documentary    78
6 Drama         321
nlevels(movies$Genre)
[1] 18
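The link between codes and labels, and the effect of inconsistent capitalization, can be seen on a small vector. This is a minimal sketch with made-up values; standardizing the case with str_to_title() before creating the factor is one possible cleanup and is not a step taken in this module.

```r
library(stringr)

genre <- c("Drama", "Comedy", "comedy", "Drama", "Action")
f <- factor(genre)

levels(f)       # the text labels, one per distinct value
as.integer(f)   # the integer code stored for each element

# Indexing the levels by the codes recovers the original text.
levels(f)[as.integer(f)]

# Standardizing the case first merges "comedy" into "Comedy".
f_clean <- factor(str_to_title(genre))
levels(f_clean)
```

The order in which “comedy” and “Comedy” are sorted as levels can depend on the locale, so I have not shown the expected order here.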

The fct_infreq() function allows you to reorder the levels of a factor based on frequency, from most to least common. Here, you can see that the most common genres were drama, comedy, and action.

common_genres <- fct_infreq(movies$Genre)
head(fct_count(common_genres))
# A tibble: 6 x 2
  f               n
  <fct>       <int>
1 Drama         321
2 Comedy        275
3 Action        197
4 Independent   190
5 Thriller      164
6 Foreign       157
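Here is a small sketch of the reordering on a made-up factor, which makes it easier to compare the level order before and after. The sort argument of fct_count() is shown as an alternative way to get counts in descending order.

```r
library(forcats)

genre <- factor(c("Drama", "Comedy", "Drama", "Action", "Drama", "Comedy"))

levels(genre)               # alphabetical by default
levels(fct_infreq(genre))   # most frequent level first

# fct_count() can also sort its output by count directly.
fct_count(genre, sort = TRUE)
```

Only the order of the levels changes. The values stored in the factor are the same before and after.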

In the example below, I have kept only the movies from three genres: drama, documentary, and comedy. However, if I call nlevels(), there are still 18 levels defined, because filtering rows does not remove levels that are no longer used. So, I will need to drop unused levels. This can be accomplished with fct_drop(). Printing the number of levels after applying this function will confirm that unused levels have been removed.

movies2 <- movies %>% filter(Genre == "Drama" | Genre == "Documentary" | Genre == "Comedy")
nlevels(movies2$Genre)
[1] 18
movies2$Genre <- fct_drop(movies2$Genre)
nlevels(movies2$Genre)
[1] 3
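The same behavior can be reproduced on a small made-up data frame. In this sketch I use %in% as a more compact way to test membership in a set of genres; the df and kept names are hypothetical.

```r
library(dplyr)
library(forcats)

df <- data.frame(genre = factor(c("Drama", "Comedy", "Action", "Drama")))

# Keep two of the three genres.
kept <- df %>% filter(genre %in% c("Drama", "Comedy"))
nlevels(kept$genre)   # still 3: "Action" remains as an empty level

# Drop levels that no longer occur in the data.
kept$genre <- fct_drop(kept$genre)
levels(kept$genre)
```

Base R provides droplevels() for the same purpose, so either function can be used here.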

Factor levels can be recoded and/or combined using fct_collapse(). Here, I am combining the “Drama” and “Comedy” levels into a new level called “fiction” and recoding “Documentary” to “nonfiction”.

movies2$Genre <- fct_collapse(movies2$Genre, fiction = c("Drama", "Comedy"), nonfiction = c("Documentary"))
nlevels(movies2$Genre)
[1] 2
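To check which original values ended up in which group, it can help to view the original and collapsed factors side by side. This is a minimal sketch on a made-up vector; the genre and grouped names are my own.

```r
library(forcats)

genre <- factor(c("Drama", "Comedy", "Documentary", "Drama"))

# Each argument has the form new_level = c("old_level", ...).
grouped <- fct_collapse(genre,
                        fiction = c("Drama", "Comedy"),
                        nonfiction = "Documentary")

# Compare the original and collapsed values row by row.
data.frame(genre, grouped)

# Count the observations in each new level.
table(grouped)
```

Any level that is not named in the call is left unchanged, so it is worth confirming the result with levels() or table() afterward.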

My preferred method for changing the names of factor levels is the recode() function from dplyr, which takes arguments of the form old name = "new name", as demonstrated below.

movies2$Genre <- recode(movies2$Genre, fiction = "F", nonfiction = "NF")
levels(movies2$Genre)
[1] "F"  "NF"
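The forcats package offers fct_recode() for the same task. The sketch below applies it to a made-up factor; note that the argument order is the reverse of recode(), with the new name on the left.

```r
library(forcats)

grp <- factor(c("fiction", "nonfiction", "fiction"))

# fct_recode() uses new_name = "old_name", the reverse of dplyr::recode().
short <- fct_recode(grp, F = "fiction", NF = "nonfiction")
levels(short)
```

Which function to use is largely a matter of preference, but mixing the two conventions in one script is an easy way to introduce mistakes.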

Regular Expressions

If you work with character or string data, it is worth learning about regular expressions, which are a language for describing patterns in strings. The stringr R cheat sheet includes a page devoted to regular expressions. This language is also used for searching strings in other computational environments, such as Python.
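As a brief taste, the sketch below uses a few common regular expression elements with stringr functions. The first two titles appear in the movie data; the third is invented so that there is a digit to extract.

```r
library(stringr)

# The third title is invented for illustration.
titles <- c("The River Wild", "Mystic River", "Example Sequel 3")

str_detect(titles, "^The ")        # ^ anchors the match to the start
str_detect(titles, "River$")       # $ anchors the match to the end
str_extract(titles, "\\d+")        # \\d+ matches one or more digits
str_replace(titles, "^The ", "")   # remove a leading "The "
```

str_extract() returns NA for strings with no match, and each backslash in a pattern must be doubled because it passes through R's string escaping first.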

Concluding Remarks

There are many additional functions, analyses, and tasks that can be applied to strings and factors in R. If you are interested in this topic, please consult the stringr and forcats documentation for additional use cases and examples.