Data Manipulation in R

Objectives

  1. Perform mathematical calculations and logical operations in R
  2. Query, filter, and subset data
  3. Recode and classify data
  4. Deal with missing values
  5. Generate random samples
  6. Join data objects
  7. Use base R and the dplyr package to manipulate and prepare data for an analysis

Overview

Data Manipulation in Base R

Before you can perform any analyses in R, you will likely need to clean or manipulate your data to prepare it for the desired analysis. I find that I spend a lot of time finding and preparing data. In fact, this is the most time-consuming portion of most of my projects. Luckily, R has a variety of great functions and techniques for manipulating data, and it is one of my preferred environments for these tasks.

We will start off with a discussion of data manipulation methods available in base R. We will then move on to discuss the methods made available in the dplyr package.

Math and Logic

Once numeric data are read into R (for example, as a vector or a column in a data frame), a variety of arithmetic operations can be performed. In the example below, I have provided a series of operations on a numeric vector (a). Note that the specific operation will be performed on each data point in the vector independently. So, it is very easy to perform simple arithmetic on vectors and columns in data frames, even those that contain many data points. Here is a list of arithmetic operators in R:

  • Addition: +
  • Subtraction: -
  • Multiplication: *
  • Division: /
  • Exponentiation: ^ or **
  • Modulus (remainder after division): %%

These operators are very similar to those used in other languages, such as Python. For a list of mathematical functions in R, have a look at this reference card.
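
For example (your values will differ, since sample() draws a random set of 14 values between 1 and 255 with replacement; the specific operations shown here are just a sampling):

#Generate 14 random values between 1 and 255 with replacement
a <- sample(1:255, 14, replace = TRUE)
a + 10   #Add 10 to each element
a - 10   #Subtract 10 from each element
a * 2    #Multiply each element by 2
a / 2    #Divide each element by 2
a ^ 2    #Square each element
a %% 2   #Remainder after dividing each element by 2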

Logical operators are also available. The result of a logical operation will be TRUE or FALSE. So, you can think of this as a test. I have provided some example use cases below.

Here are the common logical operators used in R:

  • <: Less than
  • <=: Less than or equal to
  • >: Greater than
  • >=: Greater than or equal to
  • ==: Equal to
  • !=: Not equal to
  • &: x AND y
  • |: x OR y (inclusive: TRUE if either or both are TRUE)
  • xor(): x OR y (exclusive: TRUE if exactly one is TRUE)
  • !x: NOT x
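
Here are a few example tests on the vector a from above; each returns a logical vector with one TRUE or FALSE per element (the specific thresholds used here are arbitrary):

a > 100           #Is each value greater than 100?
a <= 50           #Is each value less than or equal to 50?
a == 255          #Is each value equal to 255?
a != 255          #Is each value not equal to 255?
a > 50 & a < 200  #Is each value between 50 and 200?
a < 50 | a > 200  #Is each value less than 50 or greater than 200?
xor(a < 50, a %% 2 == 0)  #Is each value less than 50 or even, but not both?
!(a > 100)        #Is each value not greater than 100?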

You may have noticed that == is used as opposed to = for “equal to”. This is because = can be used for variable assignment, similar to <-, so it cannot be used as a logical operator. You may have also noticed that you got a different result than I did for the example vector (a). This is because we used the sample() function to generate random numbers (in this case 14 random values between 1 and 255 with replacement). So, you likely obtained a different random set. If you run this again, you will get a new set of numbers.

Working with Strings

It is also commonly necessary to work with and manipulate strings. Similar to other coding languages, R has many functions for string manipulation. Here are a few examples. First, I create a vector of characters. I then convert them to upper case and lower case. I use the paste() function to add “Topic:” to each element in the vector. This is known as concatenation. You can use paste0() if you don’t want to include a space between the concatenated strings. There are many other options for working with strings. Specifically, I would suggest checking out the stringr package.
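
Something like the following (the contents of the character vector here are hypothetical):

#Create a vector of characters
topics <- c("cartography", "remote sensing", "spatial analysis")
toupper(topics)           #Convert to upper case
tolower(topics)           #Convert to lower case
paste("Topic:", topics)   #Concatenate with a separating space
paste0("Topic:", topics)  #Concatenate with no separating space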

I find string manipulation to be particularly useful when I am working with file names and file paths. For example, the list.files() function can be used to obtain a list of files in a folder or directory. It will return a vector of all files. If you just want to list files of a defined type, this can be accomplished by providing a pattern argument. Here, I just want to list the TIFF files in the folder. Next, I use the paste0() function to concatenate the path and the file name. Lastly, I use the file_path_sans_ext() function from the tools package and the basename() function to extract just the file name without the extension.

#You won't be able to run this since I didn't provide this folder. 
setwd("D:/some_grids")
file_list <- list.files(path = "D:/some_grids")
print(file_list)
 [1] "tile0.aux"          "tile0.tfw"          "tile0.tif"         
 [4] "tile0.tif.aux.xml"  "tile0.tif.ovr"      "tile0.tif.xml"     
 [7] "tile1.aux"          "tile1.tfw"          "tile1.tif"         
[10] "tile1.tif.aux.xml"  "tile1.tif.ovr"      "tile1.tif.xml"     
[13] "tile10.aux"         "tile10.tfw"         "tile10.tif"        
[16] "tile10.tif.aux.xml" "tile10.tif.ovr"     "tile10.tif.xml"    
[19] "tile2.aux"          "tile2.tfw"          "tile2.tif"         
[22] "tile2.tif.aux.xml"  "tile2.tif.ovr"      "tile2.tif.xml"     
[25] "tile3.aux"          "tile3.tfw"          "tile3.tif"         
[28] "tile3.tif.aux.xml"  "tile3.tif.ovr"      "tile3.tif.xml"     
[31] "tile4.aux"          "tile4.tfw"          "tile4.tif"         
[34] "tile4.tif.aux.xml"  "tile4.tif.ovr"      "tile4.tif.xml"     
[37] "tile5.aux"          "tile5.tfw"          "tile5.tif"         
[40] "tile5.tif.aux.xml"  "tile5.tif.ovr"      "tile5.tif.xml"     
[43] "tile6.aux"          "tile6.tfw"          "tile6.tif"         
[46] "tile6.tif.aux.xml"  "tile6.tif.ovr"      "tile6.tif.xml"     
[49] "tile7.aux"          "tile7.tfw"          "tile7.tif"         
[52] "tile7.tif.aux.xml"  "tile7.tif.ovr"      "tile7.tif.xml"     
[55] "tile8.aux"          "tile8.tfw"          "tile8.tif"         
[58] "tile8.tif.aux.xml"  "tile8.tif.ovr"      "tile8.tif.xml"     
[61] "tile9.aux"          "tile9.tfw"          "tile9.tif"         
[64] "tile9.tif.aux.xml"  "tile9.tif.ovr"      "tile9.tif.xml"     
raster_list <- list.files(path = "D:/some_grids", pattern = "\\.tif$")
print(raster_list)
 [1] "tile0.tif"  "tile1.tif"  "tile10.tif" "tile2.tif"  "tile3.tif" 
 [6] "tile4.tif"  "tile5.tif"  "tile6.tif"  "tile7.tif"  "tile8.tif" 
[11] "tile9.tif" 
full_path <- paste0("D:/some_grids/", raster_list)
print(full_path)
 [1] "D:/some_grids/tile0.tif"  "D:/some_grids/tile1.tif" 
 [3] "D:/some_grids/tile10.tif" "D:/some_grids/tile2.tif" 
 [5] "D:/some_grids/tile3.tif"  "D:/some_grids/tile4.tif" 
 [7] "D:/some_grids/tile5.tif"  "D:/some_grids/tile6.tif" 
 [9] "D:/some_grids/tile7.tif"  "D:/some_grids/tile8.tif" 
[11] "D:/some_grids/tile9.tif" 
just_name <- tools::file_path_sans_ext(basename(full_path))
print(just_name)
 [1] "tile0"  "tile1"  "tile10" "tile2"  "tile3"  "tile4"  "tile5" 
 [8] "tile6"  "tile7"  "tile8"  "tile9" 

These types of techniques are particularly useful if you want to process a large number of files. For example, you could perform the same set of operations on all raster grids in a folder using the list of files and a for loop, which we will discuss in a later section.

Data Selection and Subsetting

In GIScience a common task is to query data to select a subset of features. So, you might be used to performing these types of operations in ArcGIS using Select by Attribute. R also has a set of data querying methods, which we will now explore. First, we will need to read in a CSV file (us_county_data.csv) using the read.csv() function. Note that this file has a header and commas are used to separate data values. I then use the str() function to inspect the data.

#You will need to set your own working directory here. 
setwd("ENTER YOUR FILE PATH HERE")
counties <- read.csv("us_county_data.csv", header=TRUE, sep=",")
str(counties)
'data.frame':   3109 obs. of  68 variables:
 $ FID         : int  0 1 2 3 4 5 6 7 8 9 ...
 $ STATE       : Factor w/ 49 levels "AL","AR","AZ",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ STATE_FIPS  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ COUNTY      : Factor w/ 1818 levels "Abbeville","Acadia",..: 79 97 146 161 221 231 289 311 321 325 ...
 $ FIPS        : int  1001 1005 1007 1009 1011 1013 1017 1019 1021 1023 ...
 $ FIPSx       : int  1001 1005 1007 1009 1011 1013 1017 1019 1021 1023 ...
 $ FIPS_1      : int  1001 1005 1007 1009 1011 1013 1017 1019 1021 1023 ...
 $ winner12    : Factor w/ 2 levels "democrat","republican": 2 1 2 2 1 2 2 2 2 2 ...
 $ winner08    : Factor w/ 2 levels "democrat","republican": 2 2 2 2 1 2 2 2 2 2 ...
 $ per_dem     : num  25.8 49 26.6 14.5 74.1 43.1 45.5 23.7 20.7 46.1 ...
 $ per_rep     : num  73.6 50.4 72.4 84 25.7 56.5 53.9 74.9 78.5 53.5 ...
 $ per_other   : num  0.6 0.6 1 1.5 0.2 0.4 0.6 1.5 0.9 0.4 ...
 $ pop_den     : num  90.5 30.5 36.6 87.8 17.4 ...
 $ per_gt55    : num  0.227 0.274 0.246 0.274 0.264 ...
 $ per_notw    : num  0.2147 0.52 0.2415 0.0742 0.7703 ...
 $ per_vac     : num  0.0865 0.1698 0.1145 0.0967 0.1665 ...
 $ med_income  : int  51622 30896 41076 46086 26980 31449 35614 38028 40292 30728 ...
 $ per_pov     : num  10.7 24.5 18.5 13.1 33.6 22.3 18.7 17.7 17.1 22.9 ...
 $ rel         : num  677 550 499 652 577 ...
 $ FID_1       : int  41 41 41 41 41 41 41 41 41 41 ...
 $ AREA        : num  51716 51716 51716 51716 51716 ...
 $ STATE_NAME  : Factor w/ 49 levels "Alabama","Arizona",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ STATE_FIPS_1: int  1 1 1 1 1 1 1 1 1 1 ...
 $ SUB_REGION  : Factor w/ 9 levels "E N Cen","E S Cen",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ STATE_ABBR  : Factor w/ 49 levels "AL","AR","AZ",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ POP2000     : int  4447100 4447100 4447100 4447100 4447100 4447100 4447100 4447100 4447100 4447100 ...
 $ POP2001     : int  4478538 4478538 4478538 4478538 4478538 4478538 4478538 4478538 4478538 4478538 ...
 $ POP00_SQMI  : int  86 86 86 86 86 86 86 86 86 86 ...
 $ WHITE       : int  3162808 3162808 3162808 3162808 3162808 3162808 3162808 3162808 3162808 3162808 ...
 $ BLACK       : int  1155930 1155930 1155930 1155930 1155930 1155930 1155930 1155930 1155930 1155930 ...
 $ AMERI_ES    : int  22430 22430 22430 22430 22430 22430 22430 22430 22430 22430 ...
 $ ASIAN       : int  31346 31346 31346 31346 31346 31346 31346 31346 31346 31346 ...
 $ HAWN_PI     : int  1409 1409 1409 1409 1409 1409 1409 1409 1409 1409 ...
 $ OTHER       : int  28998 28998 28998 28998 28998 28998 28998 28998 28998 28998 ...
 $ MULT_RACE   : int  44179 44179 44179 44179 44179 44179 44179 44179 44179 44179 ...
 $ HISPANIC    : int  75830 75830 75830 75830 75830 75830 75830 75830 75830 75830 ...
 $ MALES       : int  2146504 2146504 2146504 2146504 2146504 2146504 2146504 2146504 2146504 2146504 ...
 $ FEMALES     : int  2300596 2300596 2300596 2300596 2300596 2300596 2300596 2300596 2300596 2300596 ...
 $ AGE_UNDER5  : int  295992 295992 295992 295992 295992 295992 295992 295992 295992 295992 ...
 $ AGE_5_17    : int  827430 827430 827430 827430 827430 827430 827430 827430 827430 827430 ...
 $ AGE_18_21   : int  262507 262507 262507 262507 262507 262507 262507 262507 262507 262507 ...
 $ AGE_22_29   : int  478301 478301 478301 478301 478301 478301 478301 478301 478301 478301 ...
 $ AGE_30_39   : int  642119 642119 642119 642119 642119 642119 642119 642119 642119 642119 ...
 $ AGE_40_49   : int  660385 660385 660385 660385 660385 660385 660385 660385 660385 660385 ...
 $ AGE_50_64   : int  700568 700568 700568 700568 700568 700568 700568 700568 700568 700568 ...
 $ AGE_65_UP   : int  579798 579798 579798 579798 579798 579798 579798 579798 579798 579798 ...
 $ MED_AGE     : num  35.8 35.8 35.8 35.8 35.8 35.8 35.8 35.8 35.8 35.8 ...
 $ MED_AGE_M   : num  34.4 34.4 34.4 34.4 34.4 34.4 34.4 34.4 34.4 34.4 ...
 $ MED_AGE_F   : num  37.2 37.2 37.2 37.2 37.2 37.2 37.2 37.2 37.2 37.2 ...
 $ HOUSEHOLDS  : int  1737080 1737080 1737080 1737080 1737080 1737080 1737080 1737080 1737080 1737080 ...
 $ AVE_HH_SZ   : num  2.49 2.49 2.49 2.49 2.49 2.49 2.49 2.49 2.49 2.49 ...
 $ HSEHLD_1_M  : int  192309 192309 192309 192309 192309 192309 192309 192309 192309 192309 ...
 $ HSEHLD_1_F  : int  261589 261589 261589 261589 261589 261589 261589 261589 261589 261589 ...
 $ MARHH_CHD   : int  391185 391185 391185 391185 391185 391185 391185 391185 391185 391185 ...
 $ MARHH_NO_C  : int  515731 515731 515731 515731 515731 515731 515731 515731 515731 515731 ...
 $ MHH_CHILD   : int  29216 29216 29216 29216 29216 29216 29216 29216 29216 29216 ...
 $ FHH_CHILD   : int  141057 141057 141057 141057 141057 141057 141057 141057 141057 141057 ...
 $ FAMILIES    : int  1215968 1215968 1215968 1215968 1215968 1215968 1215968 1215968 1215968 1215968 ...
 $ AVE_FAM_SZ  : num  3.01 3.01 3.01 3.01 3.01 3.01 3.01 3.01 3.01 3.01 ...
 $ HSE_UNITS   : int  1963711 1963711 1963711 1963711 1963711 1963711 1963711 1963711 1963711 1963711 ...
 $ VACANT      : int  226631 226631 226631 226631 226631 226631 226631 226631 226631 226631 ...
 $ OWNER_OCC   : int  1258705 1258705 1258705 1258705 1258705 1258705 1258705 1258705 1258705 1258705 ...
 $ RENTER_OCC  : int  478375 478375 478375 478375 478375 478375 478375 478375 478375 478375 ...
 $ NO_FARMS97  : int  41384 41384 41384 41384 41384 41384 41384 41384 41384 41384 ...
 $ AVG_SIZE97  : int  210 210 210 210 210 210 210 210 210 210 ...
 $ CROP_ACR97  : int  4197670 4197670 4197670 4197670 4197670 4197670 4197670 4197670 4197670 4197670 ...
 $ AVG_SALE97  : num  74.9 74.9 74.9 74.9 74.9 ...
 $ z           : num  0.297 0.297 0.297 0.297 0.297 ...

Let’s begin by selecting out a subset of columns from the entire data set. This can be accomplished using square bracket notation and providing a vector of column names.
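
For example (the specific columns kept here are one reasonable choice, including the election and income fields used below):

counties_sub <- counties[, c("STATE", "COUNTY", "winner08", "winner12", "per_pov", "med_income")]
str(counties_sub)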

Next, I will select all rows or records (in this case counties) that voted for the democratic presidential candidate in 2008. There are multiple ways to accomplish this. Here, I will use the subset() function. It requires an input object and a query statement. The select argument allows you to provide a list of columns to include in the output. Here, we only want the county name. Lastly, I use the nrow() function to print the number of rows that were selected. So, 871 counties voted democrat. I then perform some simple math to get the percentage of counties that voted democrat.
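
A sketch of this query, using the columns from the str() output above:

dem08 <- subset(counties, winner08 == "democrat", select = COUNTY)
nrow(dem08)                            #Number of counties that voted democrat
(nrow(dem08) / nrow(counties)) * 100   #Percentage of all counties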

It is possible to query using more than one parameter. The provided example shows how to extract all counties that voted democrat in 2008 and 2012.
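
One way to write this query:

dem_both <- subset(counties, winner08 == "democrat" & winner12 == "democrat")
nrow(dem_both)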

This example shows how to select all counties that voted democrat and have a median income higher than $40,000.
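
A sketch of this query (using the 2008 result; the election year is not specified above, so that part is an assumption):

dem_income <- subset(counties, winner08 == "democrat" & med_income > 40000)
nrow(dem_income)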

As one last example of data subsetting, the code below will select all counties in California or Oregon that voted democrat and have a percent poverty rate greater than 12%. Note the need for parentheses to define the order of operations.
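
Something like the following (again assuming the 2008 result and the state abbreviations stored in the STATE field):

ca_or_dem <- subset(counties, (STATE == "CA" | STATE == "OR") & winner08 == "democrat" & per_pov > 12)
nrow(ca_or_dem)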

Drop Columns

Dropping columns from a data set is also pretty easy and can be accomplished using syntax similar to that shown in the code block below. Here, I am removing the “winner08” and “winner12” columns from the counties_sub data frame. First, I create a vector that contains the names of the columns I want to remove. Then I use square brackets to extract all columns that are not in the list. Note that ! implies negation (or, not these columns). The %in% operator is used to determine if an element belongs to a vector, in this case the list of fields to delete.
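
A sketch of that syntax (the vector name drop_cols is arbitrary):

#Columns to remove
drop_cols <- c("winner08", "winner12")
counties_sub <- counties_sub[, !(names(counties_sub) %in% drop_cols)]
names(counties_sub)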

Data Recoding

It is also possible to create or alter columns based on data in other columns. In our county data, the percent over 55 (“per_gt55”) data are actually stored as proportions, so we could multiply them by 100 to obtain percentages.
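
That conversion is a one-liner:

#Convert proportions to percentages by writing over the existing column
counties$per_gt55 <- counties$per_gt55 * 100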

In the above example, I wrote over or replaced the data in an existing column. It is also possible to write data to a new column. Here, I will categorize the counties into different poverty ranges using the percent poverty field as follows:

  • Less than 5% = Low
  • 5% to 15% = Medium
  • Greater than 15% = High
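
One way to implement this (ifelse() is one of several reasonable approaches here):

#Assign each county to a poverty class; the conditions are mutually exclusive
counties$p_rank <- ifelse(counties$per_pov < 5, "Low",
                   ifelse(counties$per_pov <= 15, "Medium", "High"))
#Convert to an ordered factor so the categories sort correctly
counties$p_rank <- factor(counties$p_rank, levels = c("Low", "Medium", "High"), ordered = TRUE)
#Count the number of records per category
table(counties$p_rank)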

In the example, note that the categories defined are mutually exclusive. Also, the “p_rank” field did not need to exist before using it; it is created dynamically. Once the field is created, I use factor() to create an ordered factor so that the categories are correctly ordered. I then use the table() function to print the number of records per category.

Dealing with Missing Data

It is very common to obtain data sets that are incomplete. Also, different software tools commonly define missing data differently (for example, -1, -9999, NULL, NODATA, or NA). In R, NA is used to represent missing data. So, there is a need for methods to find and remove NA values or recode a value to NA. Let’s start by generating some artificial missing values, since our county-level data are complete. In this first set of code I am choosing random rows from the percent poverty and median income columns to convert to -9999, an example of a missing data assignment.
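
A sketch of this step (the number of rows converted per column, 100, is arbitrary):

#Convert randomly chosen values to -9999 to simulate missing data
counties$per_pov[sample(1:nrow(counties), 100)] <- -9999
counties$med_income[sample(1:nrow(counties), 100)] <- -9999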

Next, I recode -9999 to NA so that R recognizes that this value actually indicates missing data.
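
The recode itself:

counties$per_pov[counties$per_pov == -9999] <- NA
counties$med_income[counties$med_income == -9999] <- NA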

You can find all rows with missing values in specific columns using code similar to that presented in the code block below.
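
For example (the columns printed alongside per_pov are just for context):

#Rows where percent poverty is missing
counties[is.na(counties$per_pov), c("STATE", "COUNTY", "per_pov")]
#Count of missing values in each column
sum(is.na(counties$per_pov))
sum(is.na(counties$med_income))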

We can also remove any rows that have NA in any column using the complete.cases() function and square brackets.
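
A sketch of that filter:

#Keep only rows with no NA in any column
counties_complete <- counties[complete.cases(counties), ]
nrow(counties_complete)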

There are methods available for estimating missing values in a data set. For example, you could fill the missing values with the mean of the available data as shown here. The na.rm argument is used to specify whether missing values should be ignored. If TRUE, NA data points are not used.
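
For example, filling both columns with their respective means:

#Replace NA values with the mean of the non-missing values
counties$per_pov[is.na(counties$per_pov)] <- mean(counties$per_pov, na.rm = TRUE)
counties$med_income[is.na(counties$med_income)] <- mean(counties$med_income, na.rm = TRUE)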

There are more complicated methods for estimating missing values that will not be discussed here.

Data Manipulation with dplyr

In the second half of this module, we will explore the dplyr package, which provides a variety of functions and techniques for data manipulation or wrangling. This package is part of the tidyverse. The tidyverse consists of several R packages for data manipulation, exploration, and visualization that share a common philosophy. Packages in the tidyverse include tibble, tidyr, purrr, magrittr, dplyr, stringr, ggplot2, and many others. More information about the tidyverse can be found here. Many of these packages have cheat sheets to help you use the functions available. For example, the cheat sheet for dplyr can be found here. Alternatively, it can be accessed in RStudio under Help -> Cheatsheets.

I have found dplyr to be very powerful and intuitive. It is my preferred tool for data prep and manipulation.

Piping

dplyr makes heavy use of the forward-pipe operator (%>%) made available by the magrittr package. This operator allows you to “pipe” an object forward into a function call or expression. I find this operator to be very useful as it makes code more concise and readable. I tend to read it as “then.” Do this, then this, then this. You will see this operator used throughout the remainder of this section and course.
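
A quick illustration of the difference (the poverty threshold here is arbitrary):

library(dplyr)
#Without piping: the calls nest inside one another
nrow(filter(counties, per_pov > 15))
#With piping: take counties, then filter it, then count the rows
counties %>% filter(per_pov > 15) %>% nrow()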

Selecting Rows and Columns

The filter() function in dplyr is similar to the subset() function in base R. Here is an example use of this function to perform one of the queries from above. Here I am “piping” the data frame into the filter() function. Note that I obtain the same result as above. :: is used to specify the package from which a function is being called. This is useful when the same function name is used in multiple packages and the desired function would otherwise be ambiguous (for example, the stats package also defines a filter() function).
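
A sketch, repeating the 2008 democrat query from above:

dem08_dplyr <- counties %>% dplyr::filter(winner08 == "democrat")
nrow(dem08_dplyr)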

The select() function can be used to select columns. Here is an example in which three columns are selected from the county data.
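
For example (which three columns were used originally is an assumption):

counties %>% dplyr::select(STATE, COUNTY, med_income) %>% str()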

We could then combine these processes to select both the desired rows and columns.
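
Combining the two:

dem08_cols <- counties %>%
  dplyr::filter(winner08 == "democrat") %>%
  dplyr::select(STATE, COUNTY, med_income)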

Summarize and Count

Data summarization can also be performed using dplyr. We will explore this again in a later section when we talk about summary statistics. Below is an example of obtaining summary statistics for median income for all the counties in the data set. Note that the output is stored as a tibble, which is similar to a data frame and is defined by the tibble package. Tibbles can be converted to data frames using as.data.frame().
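
A sketch of such a summary (the specific statistics computed here are a reasonable guess):

counties %>% summarize(min_inc = min(med_income),
                       mean_inc = mean(med_income),
                       median_inc = median(med_income),
                       max_inc = max(med_income))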

You can also obtain summary statistics by group using the group_by() function. In the example below I am obtaining summary statistics for each sub-region in the country. The result is also being converted to a data frame.
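
Something like the following (the statistics chosen are assumptions):

region_stats <- counties %>%
  group_by(SUB_REGION) %>%
  summarize(mean_inc = mean(med_income), mean_pov = mean(per_pov)) %>%
  as.data.frame()
region_stats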

It is also possible to obtain counts by group as shown below. In this case, the count would represent the number of counties in each sub-region.
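
For example, using n() within summarize():

counties %>% group_by(SUB_REGION) %>% summarize(count = n()) %>% as.data.frame()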

Random Sampling

I prefer to do any random sampling using dplyr. This package allows for both random and stratified random sampling. The first example shows a random sample of 150 counties without replacement from all counties using sample_n(), in which the number of desired samples is specified. The second example demonstrates a random selection of a defined fraction or percentage of the data (20%) with replacement using sample_frac(). When replacement is used, the same sample could potentially be selected multiple times. Lastly, we combine group_by() and sample_n() to obtain a stratified random sample without replacement of 15 counties per sub-region.
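
A sketch of all three:

#Random sample of 150 counties without replacement
samp1 <- counties %>% sample_n(150)
#Random sample of 20% of the counties with replacement
samp2 <- counties %>% sample_frac(0.2, replace = TRUE)
#Stratified random sample: 15 counties per sub-region without replacement
samp3 <- counties %>% group_by(SUB_REGION) %>% sample_n(15)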

Binding

So far, the data manipulations that we have explored have only used one input table. However, dplyr provides functions for comparing or combining data tables. For example, bind_rows() and bind_cols() can be used to combine data based on rows or columns. In the first example, I have selected out all counties in Maine and all counties in Vermont. I then bind the selected rows together into a new data frame. Note that the different data sets should have the same number of columns and column names. In the second example, I have selected columns from the counties data then merged them back into a new data set. Note that each data set should have the same number of rows. This method matches rows by position as opposed to using a common field.
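
A sketch of both examples (the column splits in the bind_cols() example are arbitrary, and the state abbreviations are assumed to match the STATE field):

#Bind rows: stack the Maine and Vermont counties
maine <- counties %>% filter(STATE == "ME")
vermont <- counties %>% filter(STATE == "VT")
me_vt <- bind_rows(maine, vermont)
nrow(me_vt)
#Bind columns: split off two sets of columns, then recombine by row position
left_part <- counties %>% select(FIPS, STATE, COUNTY)
right_part <- counties %>% select(med_income, per_pov)
recombined <- bind_cols(left_part, right_part)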

Intersect, Union, and Setdiff

intersect(), union(), and setdiff() can be used to compare data sets. intersect() will return the rows that appear in both data sets while union() will return rows that appear in either or both data sets. setdiff() is used to extract rows that appear in one data set but not the other. When preparing for machine learning predictions, which will be demonstrated in later modules, I commonly use setdiff() to create separate training and test data sets from an input table by (1) extracting a training data set then (2) extracting the remaining rows to a test set. This is demonstrated in the example below where 75% of the counties in each sub-region are extracted to a training set and the remaining, non-overlapping 25% are extracted to a test set.
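
A sketch of that workflow:

#(1) Extract 75% of the counties in each sub-region to a training set
train <- counties %>% group_by(SUB_REGION) %>% sample_frac(0.75) %>% ungroup()
#(2) Extract the remaining, non-overlapping rows to a test set
#setdiff() returns the rows of counties that do not appear in train
test <- setdiff(counties, train)
nrow(train)
nrow(test)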

Joins

bind_cols() allows you to combine columns based on the order of the rows. If you are not certain that the rows are in the same order, it is best to use a join instead. This is the same as a table join in ArcGIS in which a shared or common field is used to associate the data. If you work with relational databases, this is similar to the concept of primary and foreign keys. There are a variety of join methods available in dplyr.

  • left_join(): join matching rows in second table to first
  • right_join(): join matching rows in first table to second
  • inner_join(): retain only rows in both tables
  • full_join(): retain all rows from both tables
  • semi_join(): retain all rows in first table that have a match in second
  • anti_join(): retain all rows in first table that do not have a match in the second

The example below is using a left_join() to complete the same process from above that used bind_cols(). In this case the “FIPS” code is used as the common field since it occurs in both tables and is unique to each county.
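
A sketch, reusing the column splits from the binding example but keeping FIPS in both tables:

left_part <- counties %>% select(FIPS, STATE, COUNTY)
right_part <- counties %>% select(FIPS, med_income, per_pov)
joined <- left_join(left_part, right_part, by = "FIPS")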

I hope you would agree that R is a very powerful tool for manipulating and wrangling data. In later sections we will apply some of these functions and techniques to geospatial data. In the next section, we will explore some other very useful components of the R language, including functions, loops, and if…else statements. Following that, we will explore data summarization and simple statistics.
