Working with Strings and Factors

Objectives

  1. Manipulate and work with string data using the stringr package.
  2. Manipulate and work with factors using the forcats package.

Overview

In this short section, I provide some additional demonstrations for working with string and factor data in R using the stringr and forcats packages. Both packages are part of the tidyverse and make working with string and factor data much easier than the base R methods do. Cheat sheets for stringr and forcats can be found here.

Working with Strings

Before I can manipulate strings, I need to read in some data that contains strings. To demonstrate, I will read in matts_movies.csv, which was used in a prior module. Using summary(), you can see that five columns are provided: “Name”, “Director”, “Year”, “Genre”, and “Rating”. All of the text columns have been read in as factors, so I need to convert columns with as.character() if I want to manipulate them as strings. For this exercise, I will convert the “Name” and “Director” columns to character strings and leave the “Genre” column as a factor.
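The conversion step can be sketched as follows. Because matts_movies.csv is not reproduced here, the small data frame is a hypothetical stand-in with invented values; only the column names come from the text. Note that in R 4.0 and later, read.csv() leaves text columns as character unless stringsAsFactors = TRUE is set, so this step may not be needed on a current installation.

```r
# Hypothetical stand-in for matts_movies.csv (all values are invented)
movies <- data.frame(
  Name = c("The Long River", "Night Train", "Harbor Lights"),
  Director = c("Director A", "Director B", "Director A"),
  Year = c(1992, 2003, 1996),
  Genre = c("Drama", "Comedy", "Documentary"),
  Rating = c(7.5, 6.0, 8.0),
  stringsAsFactors = TRUE  # mimic text columns being read in as factors
)

# Convert two of the factor columns to character strings
movies$Name <- as.character(movies$Name)
movies$Director <- as.character(movies$Director)

# Genre is left as a factor
summary(movies)
```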

Let’s start by finding all titles that contain the word “River”. This can be accomplished using the str_detect() function, which returns TRUE if the title contains “River” and FALSE if it does not. Instead of creating a new data object, I am saving the result as a new column in the table. I then use that column to filter the table down to the titles that contain this word.
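A minimal sketch of this pattern, using a few invented titles rather than the actual movie table. The column name river and the object name river_titles are my own choices for illustration.

```r
library(stringr)
library(dplyr)

# Invented titles standing in for the "Name" column
movies <- data.frame(
  Name = c("The Long River", "Night Train", "River of Dust", "Harbor Lights"),
  stringsAsFactors = FALSE
)

# Store the TRUE/FALSE result as a new column
movies$river <- str_detect(movies$Name, "River")

# Keep only the rows where the title contains "River"
river_titles <- movies %>% filter(river == TRUE)
river_titles$Name
```

str_detect() is case sensitive by default, so a title containing only “river” in lower case would not be matched by this pattern.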

The str_length() function returns the number of characters in a string. Note that this count includes spaces. I can also obtain the length of each title as a new column and then use the result to find all titles that have a length greater than 20. 472 titles have a length greater than 20.
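Both uses of str_length() can be sketched with invented titles. The column name len is an assumption for illustration, and the counts here will not match the 472 reported for the full table.

```r
library(stringr)
library(dplyr)

# A single string: the space is counted as a character
str_length("Night Train")  # 11

# Invented titles standing in for the "Name" column
movies <- data.frame(
  Name = c("The Long River", "Night Train", "A Very Long Title About Rivers"),
  stringsAsFactors = FALSE
)

# Title length as a new column, then keep titles longer than 20 characters
movies$len <- str_length(movies$Name)
long_titles <- movies %>% filter(len > 20)
nrow(long_titles)
```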

str_to_upper() converts a string to all upper case. Similar functions are available for converting to lower case and title case.
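As a short illustration on a single invented title, the three case conversion functions in stringr are str_to_upper(), str_to_lower(), and str_to_title():

```r
library(stringr)

title <- "Night Train"  # invented example string

upper <- str_to_upper(title)          # "NIGHT TRAIN"
lower <- str_to_lower(title)          # "night train"
titled <- str_to_title("night train") # "Night Train"

c(upper, lower, titled)
```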

One complexity of string manipulation in R, and also in other coding languages, is that some characters, such as “\”, have special meaning and cannot be typed directly within a string. Instead, they must be written as escape sequences. For example, a literal backslash is written as “\\”.
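The sketch below shows one way this issue can appear. The path string is a made-up example and is not taken from the course data. Inside an R string, a backslash is typed as two backslashes, and matching it with a regular expression requires four, because the pattern is escaped once for R and once more for the regular expression engine.

```r
library(stringr)

# Typed with two backslashes, stored as one
path <- "C:\\data"       # hypothetical example string
writeLines(path)         # shows the string as stored: C:\data
n_chars <- str_length(path)  # 7 characters, since the backslash counts once

# Matching a literal backslash with a regular expression needs "\\\\"
has_slash_regex <- str_detect(path, "\\\\")

# fixed() skips regular expression interpretation, so "\\" is enough
has_slash_fixed <- str_detect(path, fixed("\\"))
```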

Working with Factors

Remember that factors in R are actually represented by integer codes, and these codes are linked to text labels called levels. Here, I use the fct_count() function from forcats to count the number of movies in each defined genre. Using nlevels(), I obtain the number of defined levels for the factor. Matt has differentiated 454 different genres.
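To make the link between codes and levels concrete, here is a small sketch with an invented genre vector of three levels, in place of the 454 in the full table.

```r
library(forcats)

# Invented genre values standing in for the "Genre" column
genre <- factor(c("Drama", "Comedy", "Drama", "Documentary", "Drama", "Comedy"))

# The underlying integer codes and the levels they point to
codes <- as.integer(genre)  # 3 1 3 2 3 1
levels(genre)               # "Comedy" "Documentary" "Drama"

# Count of observations per level, returned as a table with columns f and n
counts <- fct_count(genre)
counts

# Number of defined levels
nlevels(genre)
```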

The fct_infreq() function allows you to reorder factor levels based on frequency, from most to least common. Applied to the “Genre” column, it shows that the most common genres were drama, documentary, and comedy.
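A minimal sketch of the reordering, again with an invented genre vector. The frequencies here differ from those in the full table.

```r
library(forcats)

# Invented genre values: Drama appears 3 times, Comedy 2, Documentary 1
genre <- factor(c("Drama", "Comedy", "Drama", "Documentary", "Drama", "Comedy"))

levels(genre)  # default alphabetical order

# Reorder the levels from most to least frequent
genre_freq <- fct_infreq(genre)
levels(genre_freq)

# The counts now list the most common level first
fct_count(genre_freq)
```

Only the order of the levels changes; the values stored in the factor are untouched.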

Here, I have kept only the movies that are in the three most common genres. However, if I call nlevels(), there are still 454 levels defined, because filtering rows does not remove levels from a factor. So, I need to drop the unused levels, which can be accomplished with fct_drop(). Printing the number of levels after applying this function confirms that the unused levels were removed.
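The same behavior can be reproduced on a small scale. The data frame below is invented and has five genre levels, not 454, and the object name top is my own.

```r
library(dplyr)
library(forcats)

# Invented data with five genre levels
movies <- data.frame(
  Genre = factor(c("Drama", "Comedy", "Horror", "Documentary", "Drama", "Western"))
)

# Keep only rows in three chosen genres
top <- movies %>% filter(Genre %in% c("Drama", "Comedy", "Documentary"))

# The unused levels are still defined after filtering
levels_before <- nlevels(top$Genre)  # 5

# Drop levels that no longer occur in the data
top$Genre <- fct_drop(top$Genre)
levels_after <- nlevels(top$Genre)   # 3
```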

Factor levels can be recoded and/or combined using fct_collapse(). Here, I am combining the drama and comedy levels into a new level called fiction and recoding documentary to nonfiction. Recoding can also be accomplished using the recode() function from dplyr.
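A sketch of this collapse on an invented genre vector. The capitalized level names are an assumption, since the exact spelling used in the data is not shown here.

```r
library(forcats)

# Invented genre values; the real level spellings may differ
genre <- factor(c("Drama", "Comedy", "Documentary", "Drama"))

# New level name on the left, old level(s) on the right
genre_type <- fct_collapse(
  genre,
  fiction = c("Drama", "Comedy"),
  nonfiction = "Documentary"
)

fct_count(genre_type)
```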

My preferred method for changing the names of factor levels is the recode() function from dplyr.
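A minimal sketch of the recode() approach, using an invented genre vector with assumed level spellings. Unlike fct_collapse(), recode() takes the old name on the left and the new name on the right. Recent dplyr releases mark recode() as superseded, though it still works.

```r
library(dplyr)

# Invented genre values; the real level spellings may differ
genre <- factor(c("Drama", "Comedy", "Documentary", "Drama"))

# old_level = "new_level"; two old levels may share one new name
genre_type <- recode(
  genre,
  Drama = "fiction",
  Comedy = "fiction",
  Documentary = "nonfiction"
)

table(genre_type)
```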
