Data Manipulation in R

Data Manipulation in R

Objectives

  1. Perform mathematical calculations and logical operations in R
  2. Query, filter, and subset data
  3. Recode and classify data
  4. Deal with missing values
  5. Generate random samples
  6. Join data objects
  7. Use base R and the dplyr package to manipulate and prepare data for an analysis

Overview

Before you can perform any analyses in R, you will likely need to clean or manipulate your data to prep it for the desired experimentation. I find that I spend a lot of time finding and preparing data. In fact, this is the most time consuming component of most of my projects. Luckily, R has a variety of great functions and techniques for manipulating data. In fact, it is one of my preferred environments to perform these tasks.

We will start off with a discussion of data manipulation methods available in base R. We will then move on to discuss the methods made available in the dplyr package. The link at the bottom of the page provides the example data and R Markdown file used to generate this module.

Data Manipulation Using Base R

Math and Logic

Once numeric data are read into R (for example, as a vector or a column in a data frame), a variety of arithmetic operations can be performed. In the example, I have provided a series of operations on a numeric vector (a). The specific operation will be performed for each data point in the vector separately or independently. So, it is very easy to perform simple arithmetic on vectors and columns in data frames, even those that contain many data points. Here is a list of arithmetic operators in R:

  • Addition: +
  • Subtraction: -
  • Multiplication: *
  • Division: /
  • Exponentiation: ^ or **
  • Modulus (remainder after division): %%

These operators are very similar to those used in other languages, such as Python. For a list of mathematical functions in R, have a look at this reference card.

a <- sample(1:255, 14, replace=TRUE)
print(a)
 [1] 207  15  16 244 166  37  10  65  40 206  13  49 145 237
b <- a + 3
print(b)
 [1] 210  18  19 247 169  40  13  68  43 209  16  52 148 240
c <- a - 2
print(c)
 [1] 205  13  14 242 164  35   8  63  38 204  11  47 143 235
d <- a/4
print(d)
 [1] 51.75  3.75  4.00 61.00 41.50  9.25  2.50 16.25 10.00 51.50  3.25 12.25
[13] 36.25 59.25
e <- a^3
print(e)
 [1]  8869743     3375     4096 14526784  4574296    50653     1000   274625
 [9]    64000  8741816     2197   117649  3048625 13312053
f <- a%%2
print(f)
 [1] 1 1 0 0 0 1 0 1 0 0 1 1 1 1

Logical operators are also available. The result of a logical operation will be TRUE or FALSE. So, you can think of this as a test. I have provided some example use cases below.

a <- sample(1:255, 14, replace=TRUE)
print(a)
 [1]  40  89 251  43 187  37 174  52 249 249 198 159   2 207
b <- a > 50
print(b)
 [1] FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[13] FALSE  TRUE
c <- a == 42
print(c)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE
d <- a != 42 
print(d)
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
e = a >= 101 & a <= 211
print(e)
 [1] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE
[13] FALSE  TRUE

Here are the common logical operators used in R:

  • <: Less than
  • <=: Less than or equal to
  • >: Greater than
  • >=: Greater than or equal to
  • ==: Equal to
  • !=: Not equal to
  • &: x AND y
  • |: x OR y, including x AND y
  • xor: x or y, excluding x AND y
  • !x: Not x

You may have noticed that == is used as opposed to = for “equal to”. This is because = can be used for variable assignment, similar to <-, so it cannot be used as a logical operator. You may have also noticed that you got a different result than I did for the example vector (a). This is because you used the sample() function to generate random numbers (in this case 14 random values between 1 and 255 with replacement). So, you likely obtained a different random set. If you run this again, you will get a new set of numbers.

Working with Strings

It is also commonly necessary to work with and manipulate strings. Similar to other coding languages, R has many functions for string manipulation. Here are a few examples. First, I create a vector of characters. I then convert them to upper case and lower case. I use the paste() function to add “Topic:” to each element in the vector. This is know as concatenation. You can use paste0() if you don’t want to include a space between the concatenated strings. There are many other options for working with strings. Specifically, I would suggest checking out the stringr package.

str1 <- c("GIS", "Remote Sensing", "Spatial Analysis", "Spatial Modeling", "Spatial Analytics")
print(toupper(str1))
[1] "GIS"               "REMOTE SENSING"    "SPATIAL ANALYSIS" 
[4] "SPATIAL MODELING"  "SPATIAL ANALYTICS"
print(tolower(str1))
[1] "gis"               "remote sensing"    "spatial analysis" 
[4] "spatial modeling"  "spatial analytics"
print(paste("Topic:", str1, sep= " "))
[1] "Topic: GIS"               "Topic: Remote Sensing"   
[3] "Topic: Spatial Analysis"  "Topic: Spatial Modeling" 
[5] "Topic: Spatial Analytics"

I find string manipulation to be particularly useful when I am working with file names and file paths. For example, the list.files() function can be used to obtain a list of files in a folder or directory. It will return a vector of all file names. If you just want to list files of a defined type, this can be accomplished by providing a pattern argument. Here, I just want to list the TIFF files in the folder. Next, I use the paste0() function to concatenate the path and the file name. Lastly, I use the file_path_sans_ext() function from the tools package and the basename() function to extract just the file name without the extension.

#You won't be able to run this since I didn't provide this folder. 
setwd("D:/some_grids")
file_list <- list.files(path = "D:/some_grids")
print(file_list)
 [1] "KY_Adams_708051_1971_24000_geo.tif"       
 [2] "KY_Ano_708081_1979_24000_geo.tif"         
 [3] "KY_Argillite_708084_1972_24000_geo.tif"   
 [4] "KY_Artemus_708079_1974_24000_geo.tif"     
 [5] "KY_Ault_708101_1962_24000_geo.tif"        
 [6] "KY_Balkan_803302_1974_24000_geo.tif"      
 [7] "KY_Barbourville_708122_1974_24000_geo.tif"
 [8] "KY_Barcreek_708125_1979_24000_geo.tif"    
 [9] "KY_Barthell_803308_1954_24000_geo.tif"    
[10] "KY_Beattyville_708136_1961_24000_geo.tif" 
[11] "KY_Belfry_708145_1992_24000_geo.tif"      
[12] "KY_Belfry_708147_1978_24000_geo.tif"      
[13] "KY_Bernstadt_708158_1969_24000_geo.tif"   
[14] "KY_Beverly_708161_1979_24000_geo.tif"     
[15] "KY_Billows_708177_1952_24000_geo.tif"     
[16] "KY_Blackey_708183_1954_24000_geo.tif"     
[17] "KY_Blackwater_708188_1979_24000_geo.tif"  
[18] "KY_Blaine_708191_1953_24000_geo.tif"      
[19] "KY_Bledsoe_708195_1980_24000_geo.tif"     
[20] "KY_Boltsfork_708203_1971_24000_geo.tif"   
[21] "KY_Booneville_708207_1961_24000_geo.tif"  
[22] "KY_Booneville_708209_1961_24000_geo.tif"  
[23] "KY_Brushart_708255_1949_24000_geo.tif"    
[24] "KY_Buckhorn_708264_1961_24000_geo.tif"    
raster_list <- list.files(path = "D:/some_grids", pattern = "\\.tif$")
print(raster_list)
 [1] "KY_Adams_708051_1971_24000_geo.tif"       
 [2] "KY_Ano_708081_1979_24000_geo.tif"         
 [3] "KY_Argillite_708084_1972_24000_geo.tif"   
 [4] "KY_Artemus_708079_1974_24000_geo.tif"     
 [5] "KY_Ault_708101_1962_24000_geo.tif"        
 [6] "KY_Balkan_803302_1974_24000_geo.tif"      
 [7] "KY_Barbourville_708122_1974_24000_geo.tif"
 [8] "KY_Barcreek_708125_1979_24000_geo.tif"    
 [9] "KY_Barthell_803308_1954_24000_geo.tif"    
[10] "KY_Beattyville_708136_1961_24000_geo.tif" 
[11] "KY_Belfry_708145_1992_24000_geo.tif"      
[12] "KY_Belfry_708147_1978_24000_geo.tif"      
[13] "KY_Bernstadt_708158_1969_24000_geo.tif"   
[14] "KY_Beverly_708161_1979_24000_geo.tif"     
[15] "KY_Billows_708177_1952_24000_geo.tif"     
[16] "KY_Blackey_708183_1954_24000_geo.tif"     
[17] "KY_Blackwater_708188_1979_24000_geo.tif"  
[18] "KY_Blaine_708191_1953_24000_geo.tif"      
[19] "KY_Bledsoe_708195_1980_24000_geo.tif"     
[20] "KY_Boltsfork_708203_1971_24000_geo.tif"   
[21] "KY_Booneville_708207_1961_24000_geo.tif"  
[22] "KY_Booneville_708209_1961_24000_geo.tif"  
[23] "KY_Brushart_708255_1949_24000_geo.tif"    
[24] "KY_Buckhorn_708264_1961_24000_geo.tif"    
full_path <- paste0("D:/some_grids/", raster_list)
print(full_path)
 [1] "D:/some_grids/KY_Adams_708051_1971_24000_geo.tif"       
 [2] "D:/some_grids/KY_Ano_708081_1979_24000_geo.tif"         
 [3] "D:/some_grids/KY_Argillite_708084_1972_24000_geo.tif"   
 [4] "D:/some_grids/KY_Artemus_708079_1974_24000_geo.tif"     
 [5] "D:/some_grids/KY_Ault_708101_1962_24000_geo.tif"        
 [6] "D:/some_grids/KY_Balkan_803302_1974_24000_geo.tif"      
 [7] "D:/some_grids/KY_Barbourville_708122_1974_24000_geo.tif"
 [8] "D:/some_grids/KY_Barcreek_708125_1979_24000_geo.tif"    
 [9] "D:/some_grids/KY_Barthell_803308_1954_24000_geo.tif"    
[10] "D:/some_grids/KY_Beattyville_708136_1961_24000_geo.tif" 
[11] "D:/some_grids/KY_Belfry_708145_1992_24000_geo.tif"      
[12] "D:/some_grids/KY_Belfry_708147_1978_24000_geo.tif"      
[13] "D:/some_grids/KY_Bernstadt_708158_1969_24000_geo.tif"   
[14] "D:/some_grids/KY_Beverly_708161_1979_24000_geo.tif"     
[15] "D:/some_grids/KY_Billows_708177_1952_24000_geo.tif"     
[16] "D:/some_grids/KY_Blackey_708183_1954_24000_geo.tif"     
[17] "D:/some_grids/KY_Blackwater_708188_1979_24000_geo.tif"  
[18] "D:/some_grids/KY_Blaine_708191_1953_24000_geo.tif"      
[19] "D:/some_grids/KY_Bledsoe_708195_1980_24000_geo.tif"     
[20] "D:/some_grids/KY_Boltsfork_708203_1971_24000_geo.tif"   
[21] "D:/some_grids/KY_Booneville_708207_1961_24000_geo.tif"  
[22] "D:/some_grids/KY_Booneville_708209_1961_24000_geo.tif"  
[23] "D:/some_grids/KY_Brushart_708255_1949_24000_geo.tif"    
[24] "D:/some_grids/KY_Buckhorn_708264_1961_24000_geo.tif"    
just_name <- tools::file_path_sans_ext(basename(full_path))
print(just_name)
 [1] "KY_Adams_708051_1971_24000_geo"       
 [2] "KY_Ano_708081_1979_24000_geo"         
 [3] "KY_Argillite_708084_1972_24000_geo"   
 [4] "KY_Artemus_708079_1974_24000_geo"     
 [5] "KY_Ault_708101_1962_24000_geo"        
 [6] "KY_Balkan_803302_1974_24000_geo"      
 [7] "KY_Barbourville_708122_1974_24000_geo"
 [8] "KY_Barcreek_708125_1979_24000_geo"    
 [9] "KY_Barthell_803308_1954_24000_geo"    
[10] "KY_Beattyville_708136_1961_24000_geo" 
[11] "KY_Belfry_708145_1992_24000_geo"      
[12] "KY_Belfry_708147_1978_24000_geo"      
[13] "KY_Bernstadt_708158_1969_24000_geo"   
[14] "KY_Beverly_708161_1979_24000_geo"     
[15] "KY_Billows_708177_1952_24000_geo"     
[16] "KY_Blackey_708183_1954_24000_geo"     
[17] "KY_Blackwater_708188_1979_24000_geo"  
[18] "KY_Blaine_708191_1953_24000_geo"      
[19] "KY_Bledsoe_708195_1980_24000_geo"     
[20] "KY_Boltsfork_708203_1971_24000_geo"   
[21] "KY_Booneville_708207_1961_24000_geo"  
[22] "KY_Booneville_708209_1961_24000_geo"  
[23] "KY_Brushart_708255_1949_24000_geo"    
[24] "KY_Buckhorn_708264_1961_24000_geo"    

These types of techniques are particularly useful if you want to process a large number of files. For example, you could perform the same set of operations on all tables stored in a folder using the list of files and a for loop, which we will discuss in a later section.

Data Selection and Subsetting

R has a set of data querying methods, which we will now explore. First, I read in a CSV file (us_county_data.csv) using the read.csv() function. This file has a header and commas are used to separate data values. I have used the sringAsFactors argument to read in all character fields as factors. I then use the str() function to inspect the data.

#You will need to set your own working directory here. 
setwd("D:/mydata/data_manipulation")
counties <- read.csv("us_county_data.csv", header=TRUE, sep=",", stringsAsFactors=TRUE)
str(counties)
'data.frame':   3109 obs. of  16 variables:
 $ STATE     : Factor w/ 49 levels "AL","AR","AZ",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ COUNTY    : Factor w/ 1818 levels "Abbeville","Acadia",..: 79 97 146 161 221 231 289 311 321 325 ...
 $ FIPS      : int  1001 1005 1007 1009 1011 1013 1017 1019 1021 1023 ...
 $ winner12  : Factor w/ 2 levels "democrat","republican": 2 1 2 2 1 2 2 2 2 2 ...
 $ winner08  : Factor w/ 2 levels "democrat","republican": 2 2 2 2 1 2 2 2 2 2 ...
 $ per_dem   : num  25.8 49 26.6 14.5 74.1 43.1 45.5 23.7 20.7 46.1 ...
 $ per_rep   : num  73.6 50.4 72.4 84 25.7 56.5 53.9 74.9 78.5 53.5 ...
 $ per_other : num  0.6 0.6 1 1.5 0.2 0.4 0.6 1.5 0.9 0.4 ...
 $ pop_den   : num  90.5 30.5 36.6 87.8 17.4 ...
 $ per_gt55  : num  0.227 0.274 0.246 0.274 0.264 ...
 $ per_notw  : num  0.2147 0.52 0.2415 0.0742 0.7703 ...
 $ per_vac   : num  0.0865 0.1698 0.1145 0.0967 0.1665 ...
 $ med_income: int  51622 30896 41076 46086 26980 31449 35614 38028 40292 30728 ...
 $ per_pov   : num  10.7 24.5 18.5 13.1 33.6 22.3 18.7 17.7 17.1 22.9 ...
 $ STATE_NAME: Factor w/ 49 levels "Alabama","Arizona",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ SUB_REGION: Factor w/ 9 levels "E N Cen","E S Cen",..: 2 2 2 2 2 2 2 2 2 2 ...

Let’s begin by selecting out a subset of columns from the entire data set. This can be accomplished using square bracket notation and providing a vector of column names.

counties_sub <- counties[,c("STATE", "COUNTY", "FIPS", "winner12", "winner08", "per_dem", "per_rep", "per_other", "pop_den", "per_gt55", "per_vac", "med_income", "per_pov", "STATE_NAME", "SUB_REGION")]

Next, I select all rows or records (in this case counties) in which the democratic presidential candidate won in 2008. There are multiple ways to accomplish this. Here, I will use the subset() function. It requires an input object and a query statement. The select argument allows you to provide a list of columns to include in the output. Here, we only want the county name. Lastly, I use the nrow() function to print the number of rows that were selected. So, the democratic candidate won in 871 counties. I then perform some simple math to get the percentage of counties that the democratic candidate won.

dem_2008 <- subset(counties_sub, winner08=="democrat",
                  select=c("COUNTY"))
nrow(dem_2008)
[1] 871
(nrow(dem_2008)/nrow(counties_sub))*100
[1] 28.01544

It is possible to query using more than one parameter. The provided example shows how to extract all counties in which the democratic candidate won in 2008 and 2012.

dem_08_12 <- subset(counties_sub, winner08=="democrat" & winner12=="democrat", select=c("COUNTY"))
nrow(dem_08_12)
[1] 669
(nrow(dem_08_12)/nrow(counties_sub))*100
[1] 21.51817

This example shows how to select all counties that voted democrat and had a median income higher than $40,000.

dem_08_mi <- subset(counties_sub, winner08=="democrat" & med_income > 40000, select=c("COUNTY"))
nrow(dem_08_mi)
[1] 589
(nrow(dem_08_mi)/nrow(counties_sub))*100
[1] 18.945

As one last example of data subsampling, the code below will select all counties in California or Oregon that voted democrat and have a percent poverty rate greater than 12%. Note the need for parenthesis to define the order of operations.

subset1 <- subset(counties_sub, (STATE=="CA" | STATE == "OR") & winner08=="democrat" & per_pov > 12,
                  select=c("COUNTY"))
nrow(subset1)
[1] 31
(nrow(subset1)/nrow(counties_sub))*100
[1] 0.9971052

Drop Columns

Dropping columns from a data set is also petty easy and can be accomplished using syntax similar to that shown in the code block below. Here, I am removing the “winner08”” and “winner12”” columns from the counties_sub data frame. First, I create a vector that contains the names of the columns I want to remove. Then I use square brackets to extract all columns that are not in the list. Note that ! implies negation (or, not these columns). The %in% operator is used to determine if an element belongs to or occurs within a vector, in this case the list of fields to delete.

col_to_del <- c("winner08", "winner12")
counties_sub2 <- counties_sub[,!(names(counties_sub) %in% col_to_del)]

Data Recoding

It is also possible to create or alter columns based on data in other columns. In our county data, the percent over 55 (“per_gt55”) data are actually stored as proportions, so we could multiply them by 100 to obtain percentages.

print(mean(counties_sub$per_gt55))
[1] 0.291498
counties_sub$per_gt55 <- counties_sub$per_gt55*100
print(mean(counties_sub$per_gt55))
[1] 29.1498

In the above example, I wrote over or replaced the data in an existing column. It is also possible to write data to a new column. Here, I will categorize the states into different poverty ranges using the percent poverty field as follows:

  • Less than 5% = Low
  • 5% to 15% = Medium
  • Greater than 15% = High

In the example, note that the categories defined are mutually exclusive. Also, the “p_rank”” field did not need to exist before using it; it is created dynamically. Once the field is created, I use factor() to create an ordered factor so that the categories are correctly ordered. I then used the table() function to print the number of records per category.

counties_sub$p_rank[counties_sub$per_pov < 5] <- "Low"
counties_sub$p_rank[counties_sub$per_pov >= 5 & counties_sub$per_pov <= 15] <- "Medium"
counties_sub$p_rank[counties_sub$per_pov > 15] <- "High"
counties_sub$p_rank <- factor(counties_sub$p_rank, ordered=TRUE, levels = c("Low", "Medium", "High"))
print(table(counties_sub$p_rank))

   Low Medium   High 
    34   1673   1402 

Dealing with Missing Data

It is very common to obtain data sets that are incomplete. Also, different software tools commonly define missing data different (for example, -1, -9999, NULL, NODATA, or NA). In R, NA is used to represent missing data. So, there is a need for methods to find and remove NA values or recode a value to NA. Let’s start by generating some false missing values since our county-level data is complete. In this first set of code I am choosing random rows from the percent poverty and median income columns to convert to -9999, an example of a missing data assignment.

counties_na <- counties_sub
na_per_pov <- sample(1:nrow(counties_na), 157, replace=FALSE)
na_med_income <- sample(1:nrow(counties_na), 97, replace=FALSE)
counties_na$per_pov[na_per_pov] <- -9999
counties_na$med_income[na_med_income] <- -9999

Next, I recode -9999 to NA so that R recognizes that this value actually indicates missing.

counties_na$per_pov[counties_na$per_pov==-9999] <- NA
counties_na$med_income[counties_na$med_income==-9999] <- NA

You can find all rows with missing values in specific columns using code similar to that presented in the code block below.

na_list <- subset(counties_na, is.na(med_income) | is.na(per_pov))
print(nrow(na_list))
[1] 251

You can also remove any rows that have NA in any column using the complete.cases() function and square brackets.

print(nrow(counties_na))
[1] 3109
counties_complete <- counties_na[complete.cases(counties_na), ]
print(nrow(counties_complete))
[1] 2858

There are methods available for estimating missing values in a data set. For example, you could fill the missing values with the mean of the available data as shown here. The na.rm argument is used to specify whether missing values should be ignored. If TRUE, NA data points are not used.

mn_pp <- mean(counties_na$per_pov, na.rm=TRUE)
counties_na$per_pov[is.na(counties_na$per_pov)] <- mn_pp

There are more complicated methods for estimating missing values that will not be discussed here.

Data Manipulation with dplyr

In the second half of this module, we will explore the dplyr package, which provides a variety of functions and techniques for data manipulation or wrangling. This package is part of the tidyverse. The tidyverse consists of several R packages for data manipulation, exploration, and visualization that share a common philosophy. Packages in the tidyverse include tibble, tidyr, purrr, magrittr, dplyr, stringr, ggplot2, and many others. More information about the tidyverse can be found here. Many of these packages have cheat sheets to help you use the functions available. For example, the cheat sheet for dplyr can be found here. Alternatively, it can be accessed in RStudio under Help -> Cheat Sheets.

I have found dplyr to be very powerful and intuitive. It is my preferred tool for data prep and manipulation.

Piping

dplyr makes heavy use of the forward-pipe operator (%>%) made available by the magrittr package. This operator allows you to “pipe” an object forward into a function call or expression. I find this operator to be very useful as it makes code more concise and readable. I tend to read it as “then.” Do this, then this, then this. You will see this operator used throughout the remainder of this section and course.

Selecting Rows and Columns

The filter() function in dplyr is similar to the subset() function in base R. Here is an example use of this function to perform one of the queries from above. Here I am “piping” the data frame into the filter() function. Note that I obtain the same results returned above. :: is used to specify the package from which a function is being called and is useful if the same function name is used in multiple packages, in which case the desired function can be ambiguous.

library(dplyr)
subset_row <- counties_sub %>% dplyr::filter((STATE=="CA" | STATE == "OR") & winner08=="democrat" & per_pov > 12)
nrow(subset_row)
[1] 31
(nrow(subset_row)/nrow(counties_sub))*100
[1] 0.9971052

The select() function can be used to select columns. Here is an example in which three columns are selected from the county data.

library(dplyr)
subset_col <- counties_sub %>% dplyr::select("STATE", "COUNTY", "med_income")
ncol(subset_col)
[1] 3

We could then combine these processes to select desired rows and columns.

library(dplyr)
subset_row_col <- counties_sub %>% dplyr::filter((STATE=="CA" | STATE == "OR") & winner08=="democrat" & per_pov > 12) %>% dplyr::select("STATE", "COUNTY", "med_income")
nrow(subset_row_col)
[1] 31
ncol(subset_row_col)
[1] 3

Summarize and Count

Data summarization can also be performed using dplyr. We will explore this again in a later section when we talk about summary statistics. Below is an example of obtaining summary statistics for median income for all the counties in the data set.

Note that the output is stored as a tibble, which is similar to a data frame and is defined by the tibble package, which is part of the tidyverse. Tibbles can be created using the tibble() function, and an object can be converted to a tibble using as_tibble(). Tibbles can be converted to standard data frames using as.data.frame(). Tibbles build on top of traditional data frames. In this course, we will primarily work with traditional data frames except when using tidyverse packages. You can read more about tibbles here.

mn_med_income <- counties_sub %>% dplyr::summarise(avg = mean(med_income), stdev = sd(med_income))
print(mn_med_income)
       avg    stdev
1 44044.95 11389.91

You can also obtain summary statistics by group using the group_by() function. In the example below I am obtaining summary statistics for each sub-region in the country. The result is also being converted to a data frame.

subreg_stats <- counties_sub %>% group_by(SUB_REGION) %>% dplyr::summarise(avg = mean(med_income), stdev = sd(med_income))
subreg_stats_df <- as.data.frame(subreg_stats)
print(subreg_stats_df)
  SUB_REGION      avg     stdev
1    E N Cen 46987.51  8884.485
2    E S Cen 37013.97  9051.496
3    Mid Atl 51969.16 14627.691
4        Mtn 45925.80 11647.884
5      N Eng 55049.88 12176.320
6    Pacific 49964.55 11783.643
7      S Atl 44223.54 13543.091
8    W N Cen 43641.07  8354.907
9    W S Cen 40164.12  8998.863

It is also possible to obtain counts by group as shown below. In this case, the count would represent the number of counties in each sub-region.

subreg_county_cnt <- counties_sub %>% group_by(SUB_REGION) %>% dplyr::count()
print(subreg_county_cnt)
# A tibble: 9 x 2
# Groups:   SUB_REGION [9]
  SUB_REGION     n
  <fct>      <int>
1 E N Cen      437
2 E S Cen      364
3 Mid Atl      150
4 Mtn          281
5 N Eng         67
6 Pacific      133
7 S Atl        589
8 W N Cen      618
9 W S Cen      470

Mutate and Transmute

mutate() is used to compute and append new columns to an object. In contrast, transmute() will create new columns and drop the originals. In the provided example, I am calculating the percentage of voters that did not vote democrat by adding the republican and other votes.

counties_vote <- counties %>% dplyr::select(STATE, COUNTY, FIPS, per_dem, per_rep, per_other)
head(counties_vote %>% dplyr::mutate(per_not_dem = per_rep + per_other))
  STATE  COUNTY FIPS per_dem per_rep per_other per_not_dem
1    AL Autauga 1001    25.8    73.6       0.6        74.2
2    AL Barbour 1005    49.0    50.4       0.6        51.0
3    AL    Bibb 1007    26.6    72.4       1.0        73.4
4    AL  Blount 1009    14.5    84.0       1.5        85.5
5    AL Bullock 1011    74.1    25.7       0.2        25.9
6    AL  Butler 1013    43.1    56.5       0.4        56.9

Random Sampling

I prefer to do any random sampling using dplyr. This package allows for both random and stratified random sampling. The first example shows a random sample of 150 counties without replacement from all counties using sample_n(), in which the number of desired samples is specified. The second example demonstrates a random selection of a defined fraction or percentage of the data (20%) with replacement using sample_frac(). When replacement is used, the same sample could potentially be selected multiple times. Lastly, we combine group_by() and sample_n() to obtain a stratified random sample without replacement of 15 counties per sub-region.

counties_rswor <- counties_sub %>% sample_n(150, replace=FALSE)
nrow(counties_rswor)
[1] 150
counties_rswr <- counties_sub %>% sample_frac(0.2, replace=TRUE)
nrow(counties_rswr)/(nrow(counties_sub))
[1] 0.2000643
counties_srswor <- counties_sub %>% group_by(SUB_REGION) %>% sample_n(15, replace=FALSE)
counties_srswor %>% group_by(SUB_REGION) %>% dplyr::count()
# A tibble: 9 x 2
# Groups:   SUB_REGION [9]
  SUB_REGION     n
  <fct>      <int>
1 E N Cen       15
2 E S Cen       15
3 Mid Atl       15
4 Mtn           15
5 N Eng         15
6 Pacific       15
7 S Atl         15
8 W N Cen       15
9 W S Cen       15

Binding

So far, the data manipulations that we have explored have only used one input table. However, dplyr provides functions for comparing or combining data tables. For example, bind_rows() and bind_cols() can be used to combine data based on rows or columns. In the first example, I have selected out all counties in Maine and all counties in Vermont. I then bind the selected rows together into a new data frame. Note that the different data sets should have the same number of columns and the same column names. In the second example, I have selected columns from the counties data then merged them back to a new data set. Note that each data set should have the same number of rows. This method matches rows by position as opposed to using a common field, so be careful when applying it.

maine_counties <- counties_sub %>% dplyr::filter(STATE=="ME")
vermont_counties <- counties_sub %>% dplyr::filter(STATE=="VT")
me_vt_counties <- dplyr::bind_rows(maine_counties, vermont_counties)
nrow(maine_counties)
[1] 16
nrow(vermont_counties)
[1] 14
nrow(me_vt_counties)
[1] 30
counties_names <- counties_sub %>% select(COUNTY, FIPS)
counties_data <- counties_sub %>% dplyr::select(med_income, per_gt55, per_pov)
counties_colbind <- dplyr::bind_cols(counties_names, counties_data)
print(head(counties_names))
   COUNTY FIPS
1 Autauga 1001
2 Barbour 1005
3    Bibb 1007
4  Blount 1009
5 Bullock 1011
6  Butler 1013
print(head(counties_data))
  med_income per_gt55 per_pov
1      51622 22.73369    10.7
2      30896 27.40649    24.5
3      41076 24.61706    18.5
4      46086 27.41530    13.1
5      26980 26.35148    33.6
6      31449 30.14274    22.3
print(head(counties_colbind))
   COUNTY FIPS med_income per_gt55 per_pov
1 Autauga 1001      51622 22.73369    10.7
2 Barbour 1005      30896 27.40649    24.5
3    Bibb 1007      41076 24.61706    18.5
4  Blount 1009      46086 27.41530    13.1
5 Bullock 1011      26980 26.35148    33.6
6  Butler 1013      31449 30.14274    22.3

Intersect, Union, and Setdiff

intersect(), union(), and setdiff() can be used to compare data sets. intersect() will return the rows that appear in both data sets while union() will return rows that appear in either or both data sets (i.e., all rows in both tables). setdiff() is used to extract rows that appear in one data set but not the other. When preparing for machine learning predictions, which will be demonstrated in later modules, I commonly use setdiff() to create separate training and test data sets from an input table by (1) extracting a training data set using sample_n() or sample_frac() then (2) extracting the remaining rows to a test set using setdiff(). This is demonstrated in the example below where 75% of the counties in each sub-region are extracted to a training set and the remaining, non-overlapping 25% are extracted to a test set.

train <- counties_sub %>% group_by(SUB_REGION) %>% sample_frac(0.75, replace=TRUE)
test <- setdiff(counties_sub, train)
nrow(train)
[1] 2332
nrow(test)
[1] 1478

Joins

bind_cols() allows you to combine columns based on the order of the rows. If you are not certain if the rows are in the same order, it is best to use a join instead. This is the same as a table join in ArcGIS in which a shared or common field is used to associate the data. If you work with relational databases, this is similar to the concepts of primary and foreign keys. There are a variety of join methods available in dplyr.

  • left_join(): join matching rows in second table to first
  • right_join(): join matching rows in first table to second
  • inner_join(): retain only rows in both tables
  • full_join(): retain all rows from both tables
  • semi_join(): retain all rows in first table that have a match in second
  • anti_join(): retain all rows in first table that do not have a match in the second

The example below is using a left_join() to complete the same process from above that used bind_cols(). In this case the “FIPS”” code is used as the common field since it occurs in both tables and is unique to each county.

counties_names <- counties_sub %>% select(COUNTY, FIPS)
counties_data <- counties_sub %>% dplyr::select(FIPS, med_income, per_gt55, per_pov)
counties_joined <- left_join(counties_names, counties_data, by="FIPS")
head(counties_joined)
   COUNTY FIPS med_income per_gt55 per_pov
1 Autauga 1001      51622 22.73369    10.7
2 Barbour 1005      30896 27.40649    24.5
3    Bibb 1007      41076 24.61706    18.5
4  Blount 1009      46086 27.41530    13.1
5 Bullock 1011      26980 26.35148    33.6
6  Butler 1013      31449 30.14274    22.3

Concluding Remarks

I hope you would agree that R is a very powerful tool for manipulating and wrangling data. In the next, short section, I will provide some examples of using the stringr and forcats packages, which are part of the tidyverse, for working with characters and factors. Then we will explore some other very useful components of the R language including functions, loops, and if…else statements. Following that, we will explore data summarization and simple statistics.