## Objectives

1. Peform mathematical calculations and logical operations in R
2. Query, filter, and subset data
3. recode and classify data
4. Deal with missing values
5. Generate random samples
6. Join data objects
7. Use base R and the dplyr package to manipulate and prepare data for an analysis

# Data Manipulation in Base R

Before you can peform any analyses in R, you will likely need to clean or manipulate your data to prep it for the desired experimentation. I find that I spend a lot of time finding and preparing data. In fact, this is the most time consuming portion of most of my projects. Luckily, R has a variety of great functions and techniques for manipulating data. In fact, it is one of my preferred environments to perform these tasks.

We will start off with a discussion of data manipulation methods availabe in base R. We will then move on to discuss the methods made available in the dplyr package.

## Math and Logic

Once numeric data are read into R (for example, as a vector or a column in a data frame), a variety of arithmetic operations can be peformed. In the example, I have provided a series of operations on a numeric vector (a). Note that the specific operation will be performed for each data point in the vector separately or independently. So, it is very easy to perform simple arithmetic on vectors and columns in data frames, even those that contain many data points. Here is a list of arithhmetic operators in R:

• Subtraction: -
• Multiplication: *
• Division: /
• Exponentation: ^ or **
• Modulus (remainder after division): %%

These operators are very similar to those used in other languages, such as Python. For a list of mathematical functions in R, have a look at this reference card.

a <- sample(1:255, 14, replace=TRUE)
print(a)
 101 170 243 251  86 138 198 212 184  81  36  38 120 216
b <- a + 3
print(b)
 104 173 246 254  89 141 201 215 187  84  39  41 123 219
c <- a - 2
print(c)
  99 168 241 249  84 136 196 210 182  79  34  36 118 214
d <- a/4
print(d)
 25.25 42.50 60.75 62.75 21.50 34.50 49.50 53.00 46.00 20.25  9.00
  9.50 30.00 54.00
e <- a^3
print(e)
  1030301  4913000 14348907 15813251   636056  2628072  7762392
  9528128  6229504   531441    46656    54872  1728000 10077696
f <- a%%2
print(f)
 1 0 1 1 0 0 0 0 0 1 0 0 0 0

Logical operators are also available. The result of a logical operation will be TRUE or FALSE. So, you can think of this as a test. I have provided some example use cases below.

a <- sample(1:255, 14, replace=TRUE)
print(a)
 146 127 204 195 100 236 109  87  19  74 233 171  45  14
b <- a > 50
print(b)
  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
  TRUE FALSE FALSE
c <- a == 42
print(c)
 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 FALSE FALSE FALSE
d <- a != 42
print(d)
 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
e = a >= 101 & a <= 211
print(e)
  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
  TRUE FALSE FALSE

Here are the common logical operators used in R:

• <: Less than
• <=: Less than or equal to
• >: Greater than
• >=: Greater than or equal to
• ==: Equal to
• !=: Not equal to
• &: x AND y
• |: x OR y, including x AND y
• xor: x or y, excluding x AND y
• !x: Not x

You may have noticed that == is used as opposed to = for “equal to”. This is because = can be used for variable assignment, similar to <-, so it cannot be used as a logical operator. You may have also noticed that you got a different result than I did for the example vector (a). This is because we used the sample() function to generate random numbers (in this case 14 random values between 1 and 255 with replacement). So, you likely obtained a different random set. If you run this again, you will get a new set of numbers.

## Working with Strings

It is also commonly necessary to work with and manipulate strings. Similar to other coding languages, R has many functions for string manipulation. Here are a few exanmples. First, I create a vector of characters. I then convert them to upper case and lower case. I use the paste() function to add “Topic:” to each element in the vector. This is know as concatenation. You can use paste0() if you don’t want to include a space between the concatenated strings. There are many other options for working with strings. Specifically, I would suggest checking out the stringr package.

str1 <- c("GIS", "Remote Sensing", "Spatial Analysis", "Spatial Modeling", "Spatial Analytics")
print(toupper(str1))
 "GIS"               "REMOTE SENSING"    "SPATIAL ANALYSIS"
 "SPATIAL MODELING"  "SPATIAL ANALYTICS"
print(tolower(str1))
 "gis"               "remote sensing"    "spatial analysis"
 "spatial modeling"  "spatial analytics"
print(paste("Topic:", str1, sep= " "))
 "Topic: GIS"               "Topic: Remote Sensing"
 "Topic: Spatial Analysis"  "Topic: Spatial Modeling"
 "Topic: Spatial Analytics"

I find string manipulation to be particularily useful when I am working with files names and file paths. For example, the list.files() function can be used to obtain a list of files in a folder or directory. It will return a vector of all files. If you just want to list files of a defined type, this can be accomplished by providing a pattern argument. Here, I just want to list the TIFF files in the folder. Next, I use the paste0() function to concatenate the path and the file name. Lastly, I use the file_path_sans_ext() function from the tools package and the basename() function to extract just the file name without the extension.

#You won't be able to run this since I didn't provide this folder.
setwd("D:/some_grids")
file_list <- list.files(path = "D:/some_grids")
print(file_list)
 "tile0.aux"          "tile0.tfw"          "tile0.tif"
 "tile0.tif.aux.xml"  "tile0.tif.ovr"      "tile0.tif.xml"
 "tile1.aux"          "tile1.tfw"          "tile1.tif"
 "tile1.tif.aux.xml"  "tile1.tif.ovr"      "tile1.tif.xml"
 "tile10.aux"         "tile10.tfw"         "tile10.tif"
 "tile10.tif.aux.xml" "tile10.tif.ovr"     "tile10.tif.xml"
 "tile2.aux"          "tile2.tfw"          "tile2.tif"
 "tile2.tif.aux.xml"  "tile2.tif.ovr"      "tile2.tif.xml"
 "tile3.aux"          "tile3.tfw"          "tile3.tif"
 "tile3.tif.aux.xml"  "tile3.tif.ovr"      "tile3.tif.xml"
 "tile4.aux"          "tile4.tfw"          "tile4.tif"
 "tile4.tif.aux.xml"  "tile4.tif.ovr"      "tile4.tif.xml"
 "tile5.aux"          "tile5.tfw"          "tile5.tif"
 "tile5.tif.aux.xml"  "tile5.tif.ovr"      "tile5.tif.xml"
 "tile6.aux"          "tile6.tfw"          "tile6.tif"
 "tile6.tif.aux.xml"  "tile6.tif.ovr"      "tile6.tif.xml"
 "tile7.aux"          "tile7.tfw"          "tile7.tif"
 "tile7.tif.aux.xml"  "tile7.tif.ovr"      "tile7.tif.xml"
 "tile8.aux"          "tile8.tfw"          "tile8.tif"
 "tile8.tif.aux.xml"  "tile8.tif.ovr"      "tile8.tif.xml"
 "tile9.aux"          "tile9.tfw"          "tile9.tif"
 "tile9.tif.aux.xml"  "tile9.tif.ovr"      "tile9.tif.xml"
raster_list <- list.files(path = "D:/some_grids", pattern = "\\.tif$") print(raster_list)  "tile0.tif" "tile1.tif" "tile10.tif" "tile2.tif" "tile3.tif"  "tile4.tif" "tile5.tif" "tile6.tif" "tile7.tif" "tile8.tif"  "tile9.tif" full_path <- paste0("D:/some_grids/", raster_list) print(full_path)  "D:/some_grids/tile0.tif" "D:/some_grids/tile1.tif"  "D:/some_grids/tile10.tif" "D:/some_grids/tile2.tif"  "D:/some_grids/tile3.tif" "D:/some_grids/tile4.tif"  "D:/some_grids/tile5.tif" "D:/some_grids/tile6.tif"  "D:/some_grids/tile7.tif" "D:/some_grids/tile8.tif"  "D:/some_grids/tile9.tif" just_name <- tools::file_path_sans_ext(basename(full_path)) print(just_name)  "tile0" "tile1" "tile10" "tile2" "tile3" "tile4" "tile5"  "tile6" "tile7" "tile8" "tile9"  These types of techniques are particularily useful if you want to process a large number of files. For example, you could peform the same set of operations on all raster grids in a folder using the list of files and a for loop, which we will discuss in a later next section. ## Data Selection and Subsetting In GIScience a common task is to query data to select a subset of features. So, you might be used to performing these types of operations in ArcGIS using Select by Attribute. R also has a set of data querying methods, which we will now explore. First, we will need to read in a CSV file (us_county_data.csv) using the read.csv() function. Note that this file has a header and commas are used to separate data values. I then use the str() function to inspect the data. #You will need to set your own working directory here. setwd("ENTER YOUR FILE PATH HERE") counties <- read.csv("us_county_data.csv", header=TRUE, sep=",") str(counties) 'data.frame': 3109 obs. of 68 variables:$ FID         : int  0 1 2 3 4 5 6 7 8 9 ...
$STATE : Factor w/ 49 levels "AL","AR","AZ",..: 1 1 1 1 1 1 1 1 1 1 ...$ STATE_FIPS  : int  1 1 1 1 1 1 1 1 1 1 ...
$COUNTY : Factor w/ 1818 levels "Abbeville","Acadia",..: 79 97 146 161 221 231 289 311 321 325 ...$ FIPS        : int  1001 1005 1007 1009 1011 1013 1017 1019 1021 1023 ...
$FIPSx : int 1001 1005 1007 1009 1011 1013 1017 1019 1021 1023 ...$ FIPS_1      : int  1001 1005 1007 1009 1011 1013 1017 1019 1021 1023 ...
$winner12 : Factor w/ 2 levels "democrat","republican": 2 1 2 2 1 2 2 2 2 2 ...$ winner08    : Factor w/ 2 levels "democrat","republican": 2 2 2 2 1 2 2 2 2 2 ...
$per_dem : num 25.8 49 26.6 14.5 74.1 43.1 45.5 23.7 20.7 46.1 ...$ per_rep     : num  73.6 50.4 72.4 84 25.7 56.5 53.9 74.9 78.5 53.5 ...
$per_other : num 0.6 0.6 1 1.5 0.2 0.4 0.6 1.5 0.9 0.4 ...$ pop_den     : num  90.5 30.5 36.6 87.8 17.4 ...
$per_gt55 : num 0.227 0.274 0.246 0.274 0.264 ...$ per_notw    : num  0.2147 0.52 0.2415 0.0742 0.7703 ...
$per_vac : num 0.0865 0.1698 0.1145 0.0967 0.1665 ...$ med_income  : int  51622 30896 41076 46086 26980 31449 35614 38028 40292 30728 ...
$per_pov : num 10.7 24.5 18.5 13.1 33.6 22.3 18.7 17.7 17.1 22.9 ...$ rel         : num  677 550 499 652 577 ...
$FID_1 : int 41 41 41 41 41 41 41 41 41 41 ...$ AREA        : num  51716 51716 51716 51716 51716 ...
$STATE_NAME : Factor w/ 49 levels "Alabama","Arizona",..: 1 1 1 1 1 1 1 1 1 1 ...$ STATE_FIPS_1: int  1 1 1 1 1 1 1 1 1 1 ...
$SUB_REGION : Factor w/ 9 levels "E N Cen","E S Cen",..: 2 2 2 2 2 2 2 2 2 2 ...$ STATE_ABBR  : Factor w/ 49 levels "AL","AR","AZ",..: 1 1 1 1 1 1 1 1 1 1 ...
$POP2000 : int 4447100 4447100 4447100 4447100 4447100 4447100 4447100 4447100 4447100 4447100 ...$ POP2001     : int  4478538 4478538 4478538 4478538 4478538 4478538 4478538 4478538 4478538 4478538 ...
$POP00_SQMI : int 86 86 86 86 86 86 86 86 86 86 ...$ WHITE       : int  3162808 3162808 3162808 3162808 3162808 3162808 3162808 3162808 3162808 3162808 ...
$BLACK : int 1155930 1155930 1155930 1155930 1155930 1155930 1155930 1155930 1155930 1155930 ...$ AMERI_ES    : int  22430 22430 22430 22430 22430 22430 22430 22430 22430 22430 ...
$ASIAN : int 31346 31346 31346 31346 31346 31346 31346 31346 31346 31346 ...$ HAWN_PI     : int  1409 1409 1409 1409 1409 1409 1409 1409 1409 1409 ...
$OTHER : int 28998 28998 28998 28998 28998 28998 28998 28998 28998 28998 ...$ MULT_RACE   : int  44179 44179 44179 44179 44179 44179 44179 44179 44179 44179 ...
$HISPANIC : int 75830 75830 75830 75830 75830 75830 75830 75830 75830 75830 ...$ MALES       : int  2146504 2146504 2146504 2146504 2146504 2146504 2146504 2146504 2146504 2146504 ...
$FEMALES : int 2300596 2300596 2300596 2300596 2300596 2300596 2300596 2300596 2300596 2300596 ...$ AGE_UNDER5  : int  295992 295992 295992 295992 295992 295992 295992 295992 295992 295992 ...
$AGE_5_17 : int 827430 827430 827430 827430 827430 827430 827430 827430 827430 827430 ...$ AGE_18_21   : int  262507 262507 262507 262507 262507 262507 262507 262507 262507 262507 ...
$AGE_22_29 : int 478301 478301 478301 478301 478301 478301 478301 478301 478301 478301 ...$ AGE_30_39   : int  642119 642119 642119 642119 642119 642119 642119 642119 642119 642119 ...
$AGE_40_49 : int 660385 660385 660385 660385 660385 660385 660385 660385 660385 660385 ...$ AGE_50_64   : int  700568 700568 700568 700568 700568 700568 700568 700568 700568 700568 ...
$AGE_65_UP : int 579798 579798 579798 579798 579798 579798 579798 579798 579798 579798 ...$ MED_AGE     : num  35.8 35.8 35.8 35.8 35.8 35.8 35.8 35.8 35.8 35.8 ...
$MED_AGE_M : num 34.4 34.4 34.4 34.4 34.4 34.4 34.4 34.4 34.4 34.4 ...$ MED_AGE_F   : num  37.2 37.2 37.2 37.2 37.2 37.2 37.2 37.2 37.2 37.2 ...
$HOUSEHOLDS : int 1737080 1737080 1737080 1737080 1737080 1737080 1737080 1737080 1737080 1737080 ...$ AVE_HH_SZ   : num  2.49 2.49 2.49 2.49 2.49 2.49 2.49 2.49 2.49 2.49 ...
$HSEHLD_1_M : int 192309 192309 192309 192309 192309 192309 192309 192309 192309 192309 ...$ HSEHLD_1_F  : int  261589 261589 261589 261589 261589 261589 261589 261589 261589 261589 ...
$MARHH_CHD : int 391185 391185 391185 391185 391185 391185 391185 391185 391185 391185 ...$ MARHH_NO_C  : int  515731 515731 515731 515731 515731 515731 515731 515731 515731 515731 ...
$MHH_CHILD : int 29216 29216 29216 29216 29216 29216 29216 29216 29216 29216 ...$ FHH_CHILD   : int  141057 141057 141057 141057 141057 141057 141057 141057 141057 141057 ...
$FAMILIES : int 1215968 1215968 1215968 1215968 1215968 1215968 1215968 1215968 1215968 1215968 ...$ AVE_FAM_SZ  : num  3.01 3.01 3.01 3.01 3.01 3.01 3.01 3.01 3.01 3.01 ...
$HSE_UNITS : int 1963711 1963711 1963711 1963711 1963711 1963711 1963711 1963711 1963711 1963711 ...$ VACANT      : int  226631 226631 226631 226631 226631 226631 226631 226631 226631 226631 ...
$OWNER_OCC : int 1258705 1258705 1258705 1258705 1258705 1258705 1258705 1258705 1258705 1258705 ...$ RENTER_OCC  : int  478375 478375 478375 478375 478375 478375 478375 478375 478375 478375 ...
$NO_FARMS97 : int 41384 41384 41384 41384 41384 41384 41384 41384 41384 41384 ...$ AVG_SIZE97  : int  210 210 210 210 210 210 210 210 210 210 ...
$CROP_ACR97 : int 4197670 4197670 4197670 4197670 4197670 4197670 4197670 4197670 4197670 4197670 ...$ AVG_SALE97  : num  74.9 74.9 74.9 74.9 74.9 ...
$z : num 0.297 0.297 0.297 0.297 0.297 ... Let’s begin by selecting out a subset of columns from the entire data set. This can be accomplished using square bracket notation and providing a vector list of column names.  counties_sub <- counties[,c("STATE", "STATE_FIPS", "COUNTY", "SUB_REGION", "FIPS", "winner12", "winner08", "per_dem", "per_rep", "per_other", "pop_den", "per_gt55", "med_income", "per_pov")] Next, I will select all rows or records (in this case counties) that voted for the democratic presidential canidate in 2008. There are multiple ways to accomplish this. Here, I will use the subset() function. It requires an input object and a query statement. The select argument allows you to provide a list of columns to include in the output. Here, we only want the county name. Lastly, I use the nrow() function to print the number of rows that were selected. So, 871 counties voted democrat. I then perform some simple math to get the percentage of counties that voted democrat. dem_2008 <- subset(counties_sub, winner08=="democrat", select=c("COUNTY")) nrow(dem_2008)  871 (nrow(dem_2008)/nrow(counties_sub))*100  28.01544 It is possible to query using more than one parameter. The provided example shows how to extract all counties that voted democrat in 2008 and 2012. dem_08_12 <- subset(counties_sub, winner08=="democrat" & winner12=="democrat", select=c("COUNTY")) nrow(dem_08_12)  669 (nrow(dem_08_12)/nrow(counties_sub))*100  21.51817 This examples shows how to select all counties that voted democrat and have a median income higher than$40,000.

dem_08_mi <- subset(counties_sub, winner08=="democrat" & med_income > 40000, select=c("COUNTY"))
nrow(dem_08_mi)
 589
(nrow(dem_08_mi)/nrow(counties_sub))*100
 18.945

As one last example of data subsampling, the code below will select all counties in California or Oregon that voted democrat and have a percent poverty rate greater than 12%. Note the need for parenthesis to define the order of operations.

subset1 <- subset(counties_sub, (STATE=="CA" | STATE == "OR") & winner08=="democrat" & per_pov > 12,
select=c("COUNTY"))
nrow(subset1)
 31
(nrow(subset1)/nrow(counties_sub))*100
 0.9971052

## Drop Columns

Dropping colummns from a data set is also petty easy and can be accomplished using syntax similar to that shown in the code block below. Here, I am removing the “winner08”" and “winner12”" columns from the counties_sub data frame. First, I create a vector that contains the names of the columns I want to remove. Then I use square brackets to exract all columns that are not in the list. Note that ! implies negation (or, not these columns). The %in% operator is used to determine if an element belongs to a vector, in this case the list of fields to delete.

col_to_del <- c("winner08", "winner12")
counties_sub2 <- counties_sub[,!(names(counties_sub) %in% col_to_del)]

## Data Recoding

It is also possible to create or alter columns based on data in other columns. In our county data, the percent over 55 (“per_gt55”) data are actually stored as proportions, so we could multiply them by 100 to obtain percentages.

print(mean(counties_sub$per_gt55))  0.291498 counties_sub$per_gt55 <- counties_sub$per_gt55*100 print(mean(counties_sub$per_gt55))
 29.1498

In the above example, I wrote over or replaced the data in an existing column. It is also possible to write data to a new column. Here, I will categorize the states into different poverty ranges using the percent poverty field as follows:

• Less than 5% = Low
• 5% to 15% = Medium
• Greater than 15% = High

In the example, note that the categories defined are mutually exclusive. Also, the “p_rank”" field did not need to exist before using it; it is created dynamically. Once the field is created, I use factor() to create an ordered factor so that the categories are correctly ordered. I then used the table() function to print the number of records per category.

counties_sub$p_rank[counties_sub$per_pov < 5] <- "Low"
counties_sub$p_rank[counties_sub$per_pov >= 5 & counties_sub$per_pov <= 15] <- "Medium" counties_sub$p_rank[counties_sub$per_pov > 15] <- "High" counties_sub$p_rank <- factor(counties_sub$p_rank, ordered=TRUE, levels = c("Low", "Medium", "High")) print(table(counties_sub$p_rank))

Low Medium   High
34   1673   1402 

## Dealing with Missing Data

It is very common to obtain data sets that are incomplete. Also, different software tools commonly define missing data different (for example, -1, -9999, NULL, NODATA, or NA). In R, NA is used to represent missing data. So, there is a need for methods to find and remove NA values or recode a value to NA. Let’s start by generating some false missing values since our county-level data is complete. In this first set of code I am choosing random rows from the percent poverty and median income columns to convert to -9999, an example of a missing data assignment.

counties_na <- counties_sub
na_per_pov <- sample(1:nrow(counties_na), 157, replace=FALSE)
na_med_income <- sample(1:nrow(counties_na), 97, replace=FALSE)
counties_na$per_pov[na_per_pov] <- -9999 counties_na$med_income[na_med_income] <- -9999

Next, I recode -9999 to NA so that R recognizes that this value actually indicate missing.

counties_na$per_pov[counties_na$per_pov==-9999] <- NA
counties_na$med_income[counties_na$med_income==-9999] <- NA

You can find all rows with missing values in specific columns using code similar that presented in the code block below.

na_list <- subset(counties_na, is.na(med_income) | is.na(per_pov))
print(nrow(na_list))
 250

We can also remove any rows that have NA in any column using the complete.cases() function and square brackets.

print(nrow(counties_na))
 3109
counties_complete <- counties_na[complete.cases(counties_na), ]
print(nrow(counties_complete))
 2859

There are methods available for estimating missing values in a data set. For example, you could fill the missing values with the mean of the available data as shown here. The na.rm argument is is used to specify whether missing values should be ignored. If TRUE, NA data points are not used.

mn_pp <- mean(counties_na$per_pov, na.rm=TRUE) counties_na$per_pov[is.na(counties_na\$per_pov)] <- mn_pp

There are more complicated methods for estimating missing values that will not be discussed here.

# Data Manipulation with dplyr

In the second half of this module, we will explore the dplyr package, which provides a variety of functions and techniques for data manipulation or wrangling. This package is part of the tidyverse. The tidyverse consists of several R packages for data manipulation, exploration, and visualization that share a common philosophy. Packages in the tidyverse include tibble, tidyr, purrr, magrittr, dplyr, stringr, ggplot2, and many others. More information about the tidyverse can be found here. Many of these packages have cheat sheets to help you use the functions available. For example, the cheat sheet for dplyr can be found here. Alternatively, it can be accessed in RStudio under Help -> Cheatsheets.

I have found dplyr to be very powerful and intuitive. It is my preferred tool for data prep and manipulation.

## Piping

dplyr makes heavy use of the forward-pipe operator (%>%) made available by the magrittr package. This operator allows you to “pipe” an object forward into a function call or expression. I find this operator to be very useful as it makes code more concise and readable. I tend to read it as “then.” Do this, then this, then this. You will see this operator used throughout the remainder of this section and course.

## Selecting Rows and Columns

The filter() function in dplyr is similar to the subset() function in base R. Here is an example use of this function to perform one of the queries from above. Here I am “piping” the data frame into the filter function. Note that I obtain the same results obtained above. :: is used to specify the package from which a function is being called and is useful if the same function name is used in multiple packages, in which case the desired function can be ambiguous.

library(dplyr)
subset_row <- counties_sub %>% dplyr::filter((STATE=="CA" | STATE == "OR") & winner08=="democrat" & per_pov > 12)
nrow(subset_row)
 31
(nrow(subset_row)/nrow(counties_sub))*100
 0.9971052

The select() function can be used to select columns. Here is an example in which three columns are selected from the county data.

library(dplyr)
subset_col <- counties_sub %>% dplyr::select("STATE", "COUNTY", "med_income")
ncol(subset_col)
 3

We could then combine these processes to select both the desired rows and columns.

library(dplyr)
subset_row_col <- counties_sub %>% dplyr::filter((STATE=="CA" | STATE == "OR") & winner08=="democrat" & per_pov > 12) %>% dplyr::select("STATE", "COUNTY", "med_income")
nrow(subset_row_col)
 31
ncol(subset_row_col)
 3

## Summarize and Count

Data summarization can also be performed using dplyr. We will explore this again in a later section when we talk about summary statistics. Below is an example of obtaining summary statistics for median income for all the counties in the data set. Note that the output is stored as a tibble, which is similar to a data frame and is defined by the tibble package. Tibbles can be converted to data frames using as.data.frame().

mn_med_income <- counties_sub %>% dplyr::summarise(avg = mean(med_income), stdev = sd(med_income))
print(mn_med_income)
avg    stdev
1 44044.95 11389.91

You can also obtain summary statistics by group using the group_by() function. In the example below I am obtaining summary statistics for each sub-region in the country. The result is also being converted to a data frame.

subreg_stats <- counties_sub %>% group_by(SUB_REGION) %>% dplyr::summarise(avg = mean(med_income), stdev = sd(med_income))
subreg_stats_df <- as.data.frame(subreg_stats)
print(subreg_stats_df)
SUB_REGION      avg     stdev
1    E N Cen 46987.51  8884.485
2    E S Cen 37013.97  9051.496
3    Mid Atl 51969.16 14627.691
4        Mtn 45925.80 11647.884
5      N Eng 55049.88 12176.320
6    Pacific 49964.55 11783.643
7      S Atl 44223.54 13543.091
8    W N Cen 43641.07  8354.907
9    W S Cen 40164.12  8998.863

It is also possibe to obtain counts by group as shown below. In this case, the count would represent the number of counties in each sub-region.

subreg_county_cnt <- counties_sub %>% group_by(SUB_REGION) %>% dplyr::count()
print(subreg_county_cnt)
# A tibble: 9 x 2
# Groups:   SUB_REGION 
SUB_REGION     n
<fct>      <int>
1 E N Cen      437
2 E S Cen      364
3 Mid Atl      150
4 Mtn          281
5 N Eng         67
6 Pacific      133
7 S Atl        589
8 W N Cen      618
9 W S Cen      470

## Mutate and Transmute

mutate() is used to compute and append new columns to an object. In contrast, transmute() will create new columns and drop the originals. In the provided example, we are calculating the percentage of voters that did not vote democrat by adding the republican and other votes.

counties_vote <- counties %>% dplyr::select(STATE, STATE_FIPS, COUNTY, FIPS, per_dem, per_rep, per_other)
head(counties_vote %>% dplyr::mutate(per_not_dem = per_rep + per_other))
STATE STATE_FIPS  COUNTY FIPS per_dem per_rep per_other per_not_dem
1    AL          1 Autauga 1001    25.8    73.6       0.6        74.2
2    AL          1 Barbour 1005    49.0    50.4       0.6        51.0
3    AL          1    Bibb 1007    26.6    72.4       1.0        73.4
4    AL          1  Blount 1009    14.5    84.0       1.5        85.5
5    AL          1 Bullock 1011    74.1    25.7       0.2        25.9
6    AL          1  Butler 1013    43.1    56.5       0.4        56.9

## Random Sampling

I prefer to do any random sampling using dplyr. This package allows for both random and stratified random sampling. The first example shows a random sample of 150 counties without replacement from all counties using sample_n(), in which the number of desired samples is specified. The second example demonstrates a random selection of a defined fraction or percentage of the data (20%) with replacement using sample_frac(). When replacement is used, the same sample could potentially be selected multiple times. Lastly, we combine group_by() and sample_n() to obtain a stratified random smaple without replacement of 15 counties per sub-region.

counties_rswor <- counties_sub %>% sample_n(150, replace=FALSE)
nrow(counties_rswor)
 150
counties_rswr <- counties_sub %>% sample_frac(0.2, replace=TRUE)
nrow(counties_rswr)/(nrow(counties_sub))
 0.2000643
counties_srswor <- counties_sub %>% group_by(SUB_REGION) %>% sample_n(15, replace=FALSE)
counties_srswor %>% group_by(SUB_REGION) %>% dplyr::count()
# A tibble: 9 x 2
# Groups:   SUB_REGION 
SUB_REGION     n
<fct>      <int>
1 E N Cen       15
2 E S Cen       15
3 Mid Atl       15
4 Mtn           15
5 N Eng         15
6 Pacific       15
7 S Atl         15
8 W N Cen       15
9 W S Cen       15

## Binding

So far, the data manipulations that we have explored have only used one input table. However, dplyr provides functions for comparing or combining data tables. For example, bind_rows() and bind_cols() can be used to combine data based on rows or columns. In the first example, I have selected out all counties in Maine and all counties in Vermont. I then bind the select rows together into a new data frame. Note that the different data sets should have the same number of columns and column names. In the second example, I have selected columms from the counties data then merged them back to a new data set. Note that each data set should have the same number of rows. This method matches rows by position as opposed to using a common field.

maine_counties <- counties_sub %>% dplyr::filter(STATE=="ME")
vermont_counties <- counties_sub %>% dplyr::filter(STATE=="VT")
me_vt_counties <- dplyr::bind_rows(maine_counties, vermont_counties)
nrow(maine_counties)
 16
nrow(vermont_counties)
 14
nrow(me_vt_counties)
 30
counties_names <- counties_sub %>% select(COUNTY, FIPS)
counties_data <- counties_sub %>% dplyr::select(med_income, per_gt55, per_pov)
counties_colbind <- dplyr::bind_cols(counties_names, counties_data)
COUNTY FIPS
1 Autauga 1001
2 Barbour 1005
3    Bibb 1007
4  Blount 1009
5 Bullock 1011
6  Butler 1013
med_income per_gt55 per_pov
1      51622 22.73369    10.7
2      30896 27.40649    24.5
3      41076 24.61706    18.5
4      46086 27.41530    13.1
5      26980 26.35148    33.6
6      31449 30.14274    22.3
COUNTY FIPS med_income per_gt55 per_pov
1 Autauga 1001      51622 22.73369    10.7
2 Barbour 1005      30896 27.40649    24.5
3    Bibb 1007      41076 24.61706    18.5
4  Blount 1009      46086 27.41530    13.1
5 Bullock 1011      26980 26.35148    33.6
6  Butler 1013      31449 30.14274    22.3

## Intersect, Union, and Setdiff

intersect(), union(), and setdiff() can be used to compare data sets. intersect() will return the rows that appear in both data sets while union() will return rows that appear in either or both data sets. setdiff() is used to extract rows that appear in one data set but not the other. When preparing for machine learning predictions, which will be demonstrated in later modules, I commonly use setdiff() to create separate training and test data sets from an input table by (1) extracting a training data set then (2) extracting the remaining rows to a test set. This is demonstrated in the example below where 75% of the counties in each sub-region are extracted to a training set and the remaining, non-overlapping 25% are extracted to a test set.

train <- counties_sub %>% group_by(SUB_REGION) %>% sample_frac(0.75, replace=TRUE)
test <- setdiff(counties_sub, train)
nrow(train)
 2332
nrow(test)
 1455

## Joins

bind_cols() allows you to combine columns based on the order of the rows. If you are not certain if the rows are in the same order, it is best to use a join instead. This is the same as a table join in ArcGIS in which a shared or common field is used to associate the data. If you work with relational databases, this is simlar to the concept of primary and foriegn keys. There are a variety of join methods available in dplyr.

• left_join(): join matching rows in second table to first
• right_join(): join matching rows in first table to second
• innner_join(): retain only rows in both tables
• full_join(): retain all rows from both tables
• semi_join(): retain all rows in first table that have a match in second
• anti_join(): retain all rows in first table that do not have a match in the second

The example below is using a left_join() to complete the same process from above that used bind_cols(). In this case the “FIPS”" code is used as the common field since it occurs in both tables and is unique to each county.

counties_names <- counties_sub %>% select(COUNTY, FIPS)
counties_data <- counties_sub %>% dplyr::select(FIPS, med_income, per_gt55, per_pov)
counties_joined <- left_join(counties_names, counties_data, by="FIPS")
COUNTY FIPS med_income per_gt55 per_pov
1 Autauga 1001      51622 22.73369    10.7
2 Barbour 1005      30896 27.40649    24.5
3    Bibb 1007      41076 24.61706    18.5
4  Blount 1009      46086 27.41530    13.1
5 Bullock 1011      26980 26.35148    33.6
6  Butler 1013      31449 30.14274    22.3

I hope you would agree that R is a very powerful tool for manipulating and wrangling data. In later sections we will apply some of these functions and techniques to geospatial data. In the next section, we will explore some other very useful component of the R language including functions, loops, and if…else statements. Following that, we will explore data summarization and simple statistics.

Back to Course Page

Back to WV View