Vector-Based Spatial Analysis

Objectives

  1. Use the sf package to read in and prepare vector data
  2. Filter rows, select columns, and randomly sample data using dplyr
  3. Summarize attributes and compare groups
  4. Join tables and perform table calculations
  5. Perform a variety of geoprocessing tasks including dissolve, clip, intersect, union, erase, and symmetrical difference
  6. Summarize point, line, and polygon data relative to polygons
  7. Simplify geometries
  8. Create tessellations

Overview

In this section we will explore a wide variety of techniques for working with and analyzing vector data in R. Specifically, we will focus on methods made available in the sf package. Note there are additional libraries for reading in and working with vector data in R, such as sp. However, I prefer sf. If you work at all with PostgreSQL and PostGIS you will observe lots of similarities. For example, functions designed to work on spatial data in sf and PostGIS are all prefixed with “st_”. Also, they both make use of an object-based vector data model where the geometry and attributes are stored in the same table.

You will need to load in the following packages to execute the provided examples. We will use tmap to visualize the results of our analyses.

We will use a variety of different data layers in this module. I have provided a description of the layers below.

  • circle.shp: a circle (shapefile)
  • Triangles.shp: a triangle (shapefile)
  • extent.shp: bounding box extent of circle and triangle (shapefile)
  • interstates.shp: interstates in West Virginia as lines (shapefile)
  • rivers.shp: major rivers in West Virginia as lines (shapefile)
  • towns.shp: point features of large towns in West Virginia (shapefile)
  • wv_counties.shp: West Virginia county boundaries as polygons (shapefile)
  • harvest.csv: table of deer harvest data by county from the West Virginia Division of Natural Resources (WVDNR) (CSV)
  • tornadoes: point features of tornadoes across US (feature class)
  • cities: point features of cities in the US (feature class)
  • interstates: interstates in US as lines (feature class)
  • Us_eco_ref_l3: ecological regions of the US as polygons (feature class)
  • states: US state boundaries (feature class)

All vector files are read in using the st_read() function from sf. This will generate a data frame with a column for each attribute and an additional column to store the geographic information. The table is loaded using read.csv() from base R.

Data Summarization and Querying

Let’s start by mapping the ecological region boundaries using tmap. I am using color to show the Level 1 ecoregion name as a qualitative variable. I am also plotting the legend outside of the map space.

Since sf objects are data frames, we can work with and manipulate them using a variety of functions that accept data frames. In the next code block I am using dplyr to count the number of polygons belonging to each Level 1 ecoregion. The result is a tibble. You can see that the eastern temperate forest has the largest number of polygons associated with it.

This is no different than working with any data frame. This is one of the benefits of using sf to query, manipulate, and analyze vector data in R.

Records, rows, or geospatial features can be extracted using the filter() function from dplyr. In the next code block I am extracting all features that are within the Great Plains Level 1 ecoregion. I then map the extracted features and use color to differentiate the Level 2 ecoregions. Note that I use droplevels() here to remove any Level 2 names that are not in the extracted Level 1 ecoregion. This is common practice when subsetting factors, as you might remember.
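As a minimal sketch of this filter-then-droplevels pattern, the block below uses a made-up three-feature layer rather than the course data. The object names (eco, gp) and the L1 and L2 column names are assumptions for illustration only.

```r
library(sf)
library(dplyr)

# Toy stand-in for the ecoregion layer; names and values are made up
eco <- st_sf(
  L1 = factor(c("GREAT PLAINS", "GREAT PLAINS", "EASTERN TEMPERATE FORESTS")),
  L2 = factor(c("PRAIRIE A", "PRAIRIE B", "FOREST C")),
  geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 0)), st_point(c(2, 0)))
)

# Keep only the features in one Level 1 class
gp <- eco %>% filter(L1 == "GREAT PLAINS")

# filter() keeps every original factor level; droplevels() removes unused ones
levels(gp$L2)
gp$L2 <- droplevels(gp$L2)
levels(gp$L2)
```

Without droplevels(), a map legend built from the L2 factor would still list the classes that were filtered out.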

Again, since sf objects are just data frames you can explore the tabulated data using a variety of functions and methods. As an example, I am using ggplot2 to compare county-level population by sub-region of the US. Since the geographic information is not needed here, it is simply ignored.

As a similar example, I am using dplyr to obtain the mean county population for each sub-region. Note that the geographic data are automatically included even though we did not specify this. The results are stored as a tibble.

Each tornado point has a Fujita scale attribute, a measure of tornado intensity. In this example I am counting the number of tornadoes by Fujita category. As expected, severe tornadoes occur much less frequently. Note again that the geometric information is automatically added to the result.

In this example, I am extracting all tornadoes that have a Fujita value larger than or equal to 3 using dplyr. I then plot them as points using tmap.

It is also possible to subset columns or attributes using the select() function from dplyr. Using this method, I extract out only the state name, sub-region, and population attributes in the code block below. Note again that the geometric information is automatically included.

Table Joins

Tabulated data can be joined to spatial features if they share a common field. This is known as a table join. In the example, I am joining the deer harvest data to the counties. First, I change the second column’s name to “NAME” as opposed to “COUNTY” so that the column name matches that of the polygon features. I then use left_join() from dplyr to associate the data using the common “NAME” field.
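The join described above can be sketched with toy data. The county names, the ID column, and the harvest values below are invented; only the “NAME”, “COUNTY”, and “TOTALDEER” field names come from the text.

```r
library(sf)
library(dplyr)

# Hypothetical counties (points stand in for polygons) and a harvest table
counties <- st_sf(
  NAME = c("Alpha", "Beta", "Gamma"),
  geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 0)), st_point(c(2, 0)))
)
harvest <- data.frame(
  ID = 1:2,
  COUNTY = c("Alpha", "Beta"),
  TOTALDEER = c(1000, 3000)
)

# Rename the second column so the key matches the spatial layer
names(harvest)[2] <- "NAME"

# The sf object goes first so the result keeps its geometry
counties_harv <- left_join(counties, harvest, by = "NAME")
```

Because this is a left join, counties with no match in the table (here “Gamma”) are kept and receive NA in the joined columns.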

I then map the total number of deer harvested by county.

New attribute columns can be generated using the mutate() function from dplyr. Using this function, I calculate the density of deer harvest as deer per square mile using the “TOTALDEER” and “SQUARE_MIL” fields. The density is then written to a new column called “DEN”.

Using the filter() function, I extract out and map all counties that had a deer harvest density greater than five deer per square mile.
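A small sketch of the mutate() and filter() steps, assuming a layer that already holds the “TOTALDEER” and “SQUARE_MIL” fields. The county names and numbers are made up.

```r
library(sf)
library(dplyr)

# Toy layer with invented harvest totals and areas in square miles
counties <- st_sf(
  NAME = c("Alpha", "Beta", "Gamma"),
  TOTALDEER = c(1000, 3000, 600),
  SQUARE_MIL = c(400, 500, 300),
  geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 0)), st_point(c(2, 0)))
)

# New column: deer harvested per square mile
counties <- counties %>% mutate(DEN = TOTALDEER / SQUARE_MIL)

# Keep only the counties above five deer per square mile
high_den <- counties %>% filter(DEN > 5)
```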

The st_area() function is used to calculate area for polygon features. The area is returned in squared map units, in this case square meters since the data are projected to a UTM coordinate system. Because one hectare is 10,000 square meters, dividing the result by 10,000 gives hectares, which I write to a new column called “hec”.

I then calculate density again using hectares as opposed to square miles.
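The area calculation might look like the sketch below. It assumes planar coordinates in meters; no CRS is assigned to the toy squares, so as.numeric() is used to get plain numbers whether or not st_area() attaches units. The helper sq() and the “DEN_HEC” column name are hypothetical.

```r
library(sf)

# Hypothetical helper: square polygon with lower-left corner (x, y) and side s
sq <- function(x, y, s) {
  st_polygon(list(rbind(c(x, y), c(x + s, y), c(x + s, y + s),
                        c(x, y + s), c(x, y))))
}

# Two toy counties: 1,000 m and 2,000 m squares with invented harvest totals
counties <- st_sf(
  NAME = c("Alpha", "Beta"),
  TOTALDEER = c(500, 1000),
  geometry = st_sfc(sq(0, 0, 1000), sq(1000, 0, 2000))
)

# Square meters to hectares (1 ha = 10,000 square meters)
counties$hec <- as.numeric(st_area(counties)) / 10000

# Deer harvested per hectare
counties$DEN_HEC <- counties$TOTALDEER / counties$hec
```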

Bounding Box and Centroids

The st_bbox() function is used to extract the bounding box coordinates for a layer. It does not create a spatial feature. To convert the bounding box coordinates to a spatial feature, I use the st_make_grid() function to create a grid that covers the extent of the bounding box with a single grid cell. This is a bit of a trick, but it works well.

In the map, the bounding box for West Virginia is shown in red.
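Here is a minimal sketch of the single-cell grid trick on three toy points. It also shows st_as_sfc() applied to the bounding box, which is another route to the same rectangle and is not the method described above.

```r
library(sf)

# Three toy points; their bounding box spans x 0..4 and y 0..3
pts <- st_sfc(st_point(c(0, 0)), st_point(c(4, 2)), st_point(c(1, 3)))

# Coordinates only (xmin, ymin, xmax, ymax), not a spatial feature
bb <- st_bbox(pts)

# One grid cell covering the whole extent gives the box as a polygon
box_grid <- st_make_grid(pts, n = c(1, 1))

# Alternative: convert the bbox object directly to a polygon geometry
box_sfc <- st_as_sfc(bb)
```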

st_centroid() is used to extract the centroid of polygon features and returns a point geometry. Note that the centroid is not always inside of the polygon that it is the center of (for example, a C-shaped island or a donut). st_point_on_surface() provides an alternative in which a point within the polygon will be returned.
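The difference between the two functions is easy to see with a C-shaped polygon. The coordinates below are made up so that the centroid falls in the notch of the C.

```r
library(sf)

# C-shaped polygon: a 3 x 3 square with a notch cut into its right side
c_shape <- st_sfc(st_polygon(list(rbind(
  c(0, 0), c(3, 0), c(3, 1), c(1, 1), c(1, 2),
  c(3, 2), c(3, 3), c(0, 3), c(0, 0)
))))

cen <- st_centroid(c_shape)          # geometric center of mass
pos <- st_point_on_surface(c_shape)  # a point on the polygon itself

# Does each point fall on the polygon?
cen_inside <- st_intersects(cen, c_shape, sparse = FALSE)[1, 1]
pos_inside <- st_intersects(pos, c_shape, sparse = FALSE)[1, 1]
```

Here cen_inside is FALSE because the centroid lands in the notch, while pos_inside is TRUE.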

When points are generated from polygons, all the attributes are also copied, as demonstrated by calling the column names.

Voronoi or Thiessen polygons represent the areas that are closer to one point feature than any other feature in a data set. st_voronoi() can be used to generate these polygons. However, this only works when points are aggregated to multipoints, so I use the st_union() function to make this conversion. The result will be a GEOMETRYCOLLECTION, which can be converted to individual polygons using st_cast(). I then extract out the polygons only within the extent of the state using st_intersection(). We will talk about these additional methods later in this module. So, this offers a means to convert points to polygons.
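A minimal sketch of this workflow on four toy points follows. Note one substitution: it uses st_collection_extract() to pull the polygons out of the GEOMETRYCOLLECTION instead of the st_cast() route mentioned above. The 4 x 4 boundary box stands in for the state outline.

```r
library(sf)

# Four toy points and a square that stands in for the state boundary
pts <- st_sfc(st_point(c(1, 1)), st_point(c(3, 1.5)),
              st_point(c(2, 3)), st_point(c(0.5, 2.5)))
boundary <- st_as_sfc(st_bbox(c(xmin = 0, ymin = 0, xmax = 4, ymax = 4)))

# Merge the points to one MULTIPOINT, then build the diagram
vor <- st_voronoi(st_union(pts))

# Pull the individual polygons out of the GEOMETRYCOLLECTION
vor_polys <- st_collection_extract(vor, "POLYGON")

# Clip the polygons to the boundary
vor_clip <- st_intersection(vor_polys, boundary)
```

The polygons are not guaranteed to come back in the same order as the input points, so a spatial join is needed if point attributes must be attached to them.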

A convex hull is a bounding geometry that encompasses features without any concave angles. Think of this as stretching a rubber band around the margin points in a data set. A convex hull can be generated using the st_convex_hull() function. Similar to st_voronoi(), points must first be converted to multipoint.
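As a quick sketch with made-up points: four corner points and one interior point produce a square hull, and the interior point has no effect on the result.

```r
library(sf)

# Four corners of a 4 x 4 square plus one interior point
pts <- st_sfc(st_point(c(0, 0)), st_point(c(4, 0)), st_point(c(4, 4)),
              st_point(c(0, 4)), st_point(c(2, 2)))

# Merge to a single MULTIPOINT first; otherwise a hull is computed per point
hull <- st_convex_hull(st_union(pts))

hull_area <- as.numeric(st_area(hull))
```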

Spatial Join

In contrast to a table join, a spatial join associates features based on spatial co-occurrence as opposed to a shared or common attribute. In the next code block I am using st_join() to spatially join the cities and states. The result will be a point geometry with all the attributes from each city and the state in which it occurs.

To demonstrate this, I am symbolizing the points using the “SUB_REGION” field, which was initially attributed to the states layer.
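A spatial join can be sketched with two toy states and three toy cities. All names and coordinates are invented, and sq() is a hypothetical helper for building squares.

```r
library(sf)

# Hypothetical helper: square polygon with lower-left corner (x, y) and side s
sq <- function(x, y, s) {
  st_polygon(list(rbind(c(x, y), c(x + s, y), c(x + s, y + s),
                        c(x, y + s), c(x, y))))
}

states <- st_sf(
  STATE_NAME = c("West", "East"),
  SUB_REGION = c("W", "E"),
  geometry = st_sfc(sq(0, 0, 2), sq(2, 0, 2))
)
cities <- st_sf(
  CITY = c("a", "b", "c"),
  geometry = st_sfc(st_point(c(0.5, 1)), st_point(c(3, 1)),
                    st_point(c(1.5, 0.5)))
)

# Points first: the result keeps the point geometry and gains the attributes
# of the polygon each point falls in (the default predicate is st_intersects)
cities_states <- st_join(cities, states)
```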

Random Sampling

Randomly sampling spatial features using dplyr is the same as randomly sampling records or rows from any data frame. In the first example, I am demonstrating simple random sampling without replacement. The replace argument is set to FALSE to indicate that I don’t want features to be selected multiple times.

The result is 100 randomly selected cities. Since this is random, you will get a different result each time you run the code.

It is also possible to stratify the sampling using group_by(). Here, I am selecting 10 random cities per sub-region. To confirm this, I create a table that provides a count of selected cities by sub-region.
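Both sampling approaches can be sketched as follows, assuming dplyr's sample_n() is the sampling function (the text names only the replace argument). The 40 toy cities and two sub-regions are made up, and set.seed() is added only so the sketch is repeatable.

```r
library(sf)
library(dplyr)

set.seed(42)  # only so that this sketch gives the same sample each run

# 40 toy cities, 20 in each of two invented sub-regions
cities <- st_sf(
  SUB_REGION = rep(c("N", "S"), each = 20),
  geometry = st_sfc(lapply(1:40, function(i) st_point(c(i, i))))
)

# Simple random sample of 10 features without replacement
samp <- cities %>% sample_n(10, replace = FALSE)

# Stratified sample: 5 features per sub-region
strat <- cities %>%
  group_by(SUB_REGION) %>%
  sample_n(5, replace = FALSE) %>%
  ungroup()

# Confirm the number selected per sub-region
strat_counts <- strat %>% st_drop_geometry() %>% count(SUB_REGION)
```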

Dissolve

Dissolve is used for boundary generalization. For example, you can dissolve county boundaries to state boundaries or state boundaries to the country boundary.

In the example, I am obtaining sub-region boundaries from the state boundaries. If you would like to include a summary of some attributes relative to the dissolved features, you can include a summarize() function. As an example, I am summing the population of all the states in each sub-region and writing the result to a column called “totpop”.

So, boundary generalization can be performed using only dplyr functions.
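A dissolve with group_by() and summarize() might look like this sketch. The three toy states, their populations, and the “POP” column name are assumptions; sq() is a hypothetical helper.

```r
library(sf)
library(dplyr)

# Hypothetical helper: square polygon with lower-left corner (x, y) and side s
sq <- function(x, y, s) {
  st_polygon(list(rbind(c(x, y), c(x + s, y), c(x + s, y + s),
                        c(x, y + s), c(x, y))))
}

# Three toy states; the first two are adjacent and share a sub-region
states <- st_sf(
  SUB_REGION = c("N", "N", "S"),
  POP = c(10, 20, 5),
  geometry = st_sfc(sq(0, 0, 1), sq(1, 0, 1), sq(0, -1, 1))
)

# Grouping plus summarize() merges the geometries within each group
subregions <- states %>%
  group_by(SUB_REGION) %>%
  summarize(totpop = sum(POP))

n_row <- subregions[subregions$SUB_REGION == "N", ]
```

The two “N” squares come back as one feature with the shared boundary removed and a totpop of 30.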

This is another example of dissolve where I have generated the West Virginia state boundary from the county boundaries. Since all counties have the same value in the “STATE” field, a single feature is returned. I have also summed the population across all counties, which results in the total state population.

Buffer

Proximity analysis relates to finding features that are near to other features. A buffer represents the area that is within a defined distance of an input feature. Buffers can be generated for points, lines, and polygons. The result will always be a polygon or area.

In the example, I have created polygon buffers that encompass all areas within 20 km of the mapped interstates using the st_buffer() function. Note that the buffer distance is defined in the map units, in this case meters.
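As a sketch of the buffer step, the block below buffers one made-up 100 km line by 20 km. It assumes planar coordinates in meters; no CRS is assigned to the toy data.

```r
library(sf)

# A toy 100 km "interstate" running along the x axis, coordinates in meters
road <- st_sfc(st_linestring(rbind(c(0, 0), c(100000, 0))))

# 20 km buffer; the distance is given in the same units as the coordinates
buf20 <- st_buffer(road, dist = 20000)

# One point 19 km from the line and one 21 km from it
near <- st_sfc(st_point(c(50000, 19000)))
far  <- st_sfc(st_point(c(50000, 21000)))

near_in <- st_intersects(near, buf20, sparse = FALSE)[1, 1]
far_in  <- st_intersects(far, buf20, sparse = FALSE)[1, 1]
```

If the data were in geographic coordinates (degrees), the layer would normally be projected first so that the distance can be given in meters.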

Clip and Intersect

Clip and intersect are used to extract features within another feature or return the overlapping extent of multiple features. Think of this as a spatial AND.

In this example I am intersecting the circle and triangle. The result will be the area that occurs in both the circle and the triangle (shown in red).

All attributes from both features will be returned, as demonstrated by calling glimpse() on the output. The table contains both the “C” and “T” fields, which came from the circle and triangle, respectively.
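A minimal sketch of the intersection, using a generated circle and a hand-built triangle in place of the shapefiles. The “C” and “T” field names follow the text; the coordinates are made up. Setting agr = "constant" only declares the attributes as spatially constant, which avoids a warning from st_intersection().

```r
library(sf)

# Circle of radius 2 around the origin, carrying a "C" attribute
circle <- st_sf(
  C = 1,
  geometry = st_buffer(st_sfc(st_point(c(0, 0))), dist = 2),
  agr = "constant"
)

# Triangle that partly overlaps the circle, carrying a "T" attribute
triangle <- st_sf(
  T = 1,
  geometry = st_sfc(st_polygon(list(rbind(
    c(0, -1), c(4, -1), c(2, 3), c(0, -1)
  )))),
  agr = "constant"
)

# Area common to both features, with the attributes of both
both <- st_intersection(circle, triangle)
names(both)
```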

So, st_intersection() is equivalent to the Intersect Tool in ArcGIS. However, it can also be used to perform a clip, as the geometric result is the same.

In this example, I am using st_intersection(), the interstate buffer, and the tornado points to find all tornadoes that occurred within 20 km of an interstate. So, buffer and intersection can be combined to find features within a specified distance of other features.

Union, Erase, and Symmetrical Difference

st_union() from sf is a bit different from the Union Tool in ArcGIS. Instead of returning the unioned area with all boundaries, the features are merged to a single feature with no internal boundaries maintained. In the example, the full spatial extent of the circle and triangle is returned as a single feature. Similar to st_intersection(), all attributes are maintained. Think of this as a spatial OR.

st_difference() will return the portion of the first geometric feature that is not in the second. This is similar to the Erase Tool in ArcGIS. This is a spatial NOT: Circle NOT Triangle. The result is the portion of the circle in the dotted red extent displayed in the map below.

Symmetrical difference is a spatial XOR: Circle OR Triangle, excluding Circle AND Triangle. So, the result will be the area in either the circle or the triangle, but not the area in both as demonstrated in the example below.
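The four set operations are easiest to check with two overlapping squares, since the expected areas can be worked out by hand. The squares a and b below stand in for the circle and triangle; sq() is a hypothetical helper, and st_sym_difference() is the sf function for symmetrical difference.

```r
library(sf)

# Hypothetical helper: square polygon with lower-left corner (x, y) and side s
sq <- function(x, y, s) {
  st_polygon(list(rbind(c(x, y), c(x + s, y), c(x + s, y + s),
                        c(x, y + s), c(x, y))))
}

# Two 2 x 2 squares that overlap in a 1 x 1 corner
a <- st_sfc(sq(0, 0, 2))
b <- st_sfc(sq(1, 1, 2))

and_ab <- st_intersection(a, b)    # a AND b: area 1
or_ab  <- st_union(a, b)           # a OR b:  area 4 + 4 - 1 = 7
not_ab <- st_difference(a, b)      # a NOT b: area 4 - 1 = 3
xor_ab <- st_sym_difference(a, b)  # a XOR b: area 7 - 1 = 6

areas <- sapply(list(and_ab, or_ab, not_ab, xor_ab),
                function(g) as.numeric(st_area(g)))
```

Note that st_difference() is not symmetric: swapping the arguments returns the part of b that is not in a.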

Points in Polygons

You can count the number of point features occurring in polygons using sf and dplyr. In the example, I am counting the number of cities in each state using the method outlined below.

  1. First, I use st_join() to join the cities and states using a spatial join. This will result in a point geometry where each point has the city and state attributes.
  2. I then use group_by() and count() from dplyr to count the number of cities by state.
  3. I use st_drop_geometry() to remove the geometric information from the output and return just the aspatial data.
  4. I join the table back to the state boundaries using the common “STATE_NAME” field and a table join.

Once this process is complete, I then map the resulting counts. Note that this entire process is completed using only sf and dplyr.
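The steps above can be sketched with toy data. One difference from the listed order: the geometry is dropped before counting, so count() works on a plain table and no point geometries have to be merged. The layer contents and the “n_cities” column name are made up.

```r
library(sf)
library(dplyr)

# Hypothetical helper: square polygon with lower-left corner (x, y) and side s
sq <- function(x, y, s) {
  st_polygon(list(rbind(c(x, y), c(x + s, y), c(x + s, y + s),
                        c(x, y + s), c(x, y))))
}

states <- st_sf(
  STATE_NAME = c("West", "East", "Empty"),
  geometry = st_sfc(sq(0, 0, 2), sq(2, 0, 2), sq(4, 0, 2))
)
cities <- st_sf(
  CITY = c("a", "b", "c"),
  geometry = st_sfc(st_point(c(0.5, 1)), st_point(c(1.5, 0.5)),
                    st_point(c(3, 1)))
)

# Spatial join, drop the geometry, then count cities per state
city_counts <- st_join(cities, states) %>%
  st_drop_geometry() %>%
  count(STATE_NAME, name = "n_cities")

# Table join back to the polygons; states with no cities come back as NA
states_n <- left_join(states, city_counts, by = "STATE_NAME")
states_n$n_cities[is.na(states_n$n_cities)] <- 0L
```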

Length of Lines in Polygons

You can also sum the length of lines by polygon. In the example, I am summing the length of interstates by state.

  1. First, I intersect the interstates and states using st_intersection() to obtain a line geometry where the interstates have been split along state boundaries and now have the state attributes.
  2. I then use st_length() to calculate the length of each line segment. I cannot use the original length field because this will not reflect the length after the interstates are split along state borders. I divide by 1,000 to obtain the measure in kilometers as opposed to meters.
  3. I then group the lines by state and sum the length using dplyr.
  4. I remove the geometry from the result.
  5. I join the result back to the state boundaries using left_join() and the common “STATE_NAME” field.

I then map the results using tmap. Again, this was accomplished using only sf and dplyr.

For a fairer comparison, it would make sense to calculate the density of interstates as length of interstates in kilometers per square kilometer. Again, this can be accomplished using only sf and dplyr.
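A sketch of the line-length workflow, including the density calculation, is shown below for one toy road crossing two toy states. It assumes planar coordinates in meters with no CRS assigned; all names other than “STATE_NAME” are invented, and the geometry is dropped before summing rather than after.

```r
library(sf)
library(dplyr)

# Hypothetical helper: square polygon with lower-left corner (x, y) and side s
sq <- function(x, y, s) {
  st_polygon(list(rbind(c(x, y), c(x + s, y), c(x + s, y + s),
                        c(x, y + s), c(x, y))))
}

# Two 10 km x 10 km states and one road crossing their shared border
states <- st_sf(
  STATE_NAME = c("West", "East"),
  geometry = st_sfc(sq(0, 0, 10000), sq(10000, 0, 10000)),
  agr = "constant"
)
roads <- st_sf(
  ROUTE = "I-1",
  geometry = st_sfc(st_linestring(rbind(c(2000, 5000), c(15000, 5000)))),
  agr = "constant"
)

# Split the road at the state border and recompute each piece's length in km
pieces <- st_intersection(roads, states)
pieces$km <- as.numeric(st_length(pieces)) / 1000

# Sum by state on the plain table, then join back to the polygons
len_by_state <- pieces %>%
  st_drop_geometry() %>%
  group_by(STATE_NAME) %>%
  summarize(km = sum(km))
states_len <- left_join(states, len_by_state, by = "STATE_NAME")

# Density: km of road per square km of state
states_len$km_per_sqkm <- states_len$km / (as.numeric(st_area(states_len)) / 1e6)
```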

Area of Polygons within Polygons

Calculating the area of each category of polygons within other polygons is a bit more complicated. In this example, I will calculate the area of each Level 1 ecoregion by state.

First, I will dissolve the Level 3 ecoregions to Level 1 ecoregions using group_by() and summarize().

I then intersect the Level 1 ecoregions and states using st_intersection().

Since area measures may no longer be correct due to the intersections, I must recalculate the area using st_area(). To obtain a total area of each Level 1 ecoregion by state, I then dissolve using group_by() and the state and Level 1 ecoregion attributes. I also summarize the area field.

I now have a land area for each Level 1 ecoregion by state. However, the data are not in the correct shape. So, I use the spread() function from tidyr to transform the table so that the columns are the ecoregions, the rows are the states, and the data are the areas.

Next, I replace any NA records with 0.

I then calculate the land area for each state and join the results back to the states.

To convert the results to percentages, I loop through each ecoregion column using a for loop. In the loop, I divide the area of the ecoregion by the state area and multiply by 100 to obtain a percentage for each ecoregion in each state. This is a bit complicated because I need to use variable names within the mutate() function.
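One way to use variable names inside mutate() is the .data pronoun together with the := operator, as in the sketch below. This is an illustration on a made-up table, not necessarily the approach used in the original code; the column names and the “_pct” suffix are assumptions.

```r
library(dplyr)

# Toy wide table: one row per state, one column per ecoregion area
areas <- data.frame(
  STATE_NAME = c("A", "B"),
  FOREST = c(30, 0),
  PLAINS = c(70, 50),
  state_area = c(100, 50)
)

eco_cols <- c("FOREST", "PLAINS")

for (col in eco_cols) {
  # .data[[col]] looks a column up by its name stored in a string;
  # !!paste0(...) := builds the name of the new column
  areas <- areas %>%
    mutate(!!paste0(col, "_pct") := .data[[col]] / state_area * 100)
}
```

After the loop, the table has one extra percentage column per ecoregion.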

This map shows percent eastern temperate forest by state to visualize the results.

Simplify Polygons

The st_simplify() function can be used to simplify or generalize features. Higher values for dTolerance will result in more simplification. Topology can be preserved using the preserveTopology argument.

In this second example, topology is not preserved.

This is a simplification of the West Virginia boundary using a tolerance of 10,000 meters with topology preserved.
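As a small sketch of how dTolerance works, the block below simplifies a made-up zigzag line. The tolerance is in the same units as the coordinates, and every wiggle here is smaller than the tolerance.

```r
library(sf)

# A nearly straight line with small wiggles (at most 0.1 units off the axis)
line <- st_sfc(st_linestring(rbind(
  c(0, 0), c(1, 0.1), c(2, -0.1), c(3, 0.1), c(4, 0)
)))

# Vertices that deviate less than the tolerance are candidates for removal
simp <- st_simplify(line, preserveTopology = TRUE, dTolerance = 1)

n_before <- nrow(st_coordinates(line))
n_after  <- nrow(st_coordinates(simp))
```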

Random Points in Polygons

Random points can be generated within polygons using the st_sample() function. In the example, I am generating 400 random points across the country. Since this is random, you will get a different result if you re-execute the code. You can also set the type argument to “regular” to obtain a regular as opposed to random pattern.
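A minimal sketch of both sampling types on a toy 10 x 10 square follows. The square stands in for the country boundary, and set.seed() is added only so the sketch is repeatable.

```r
library(sf)

set.seed(1)  # only so that this sketch gives the same points each run

# Toy 10 x 10 square standing in for the country boundary
poly <- st_sfc(st_polygon(list(rbind(
  c(0, 0), c(10, 0), c(10, 10), c(0, 10), c(0, 0)
))))

# Request 400 random points inside the polygon
rand_pts <- st_sample(poly, size = 400)

# Request roughly 100 points on a regular pattern instead
reg_pts <- st_sample(poly, size = 100, type = "regular")
```

With type = "regular" the number of points returned may differ somewhat from the size requested.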

By combining st_sample() with group_by() you can stratify the sample based on a category. In the example, I am extracting 10 random points in each state.

Tessellation

st_make_grid() can be used to produce a rectangular grid tessellation over an extent. Here, I am extracting the bounding box for the states data. I then create a rectangle from the bounding box using st_make_grid(), as already demonstrated above. I then create a new grid of 10 by 10 cells within this extent also using st_make_grid().

A hexagonal tessellation can be obtained by setting the square argument in st_make_grid() to FALSE.

The tessellation result can then be clipped to the extent of the polygon data using st_intersection(). In this example, I am using the country boundary, created by dissolving the state boundaries, to clip the tessellation.
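The grid and clip steps can be sketched on a toy region. A generated circle stands in for the country boundary here, and the grid is built directly over the region's extent rather than over a separate bounding-box rectangle.

```r
library(sf)

# Toy region: a circle of radius 5 whose bounding box is about 10 x 10
region <- st_buffer(st_sfc(st_point(c(5, 5))), dist = 5)

# 10 x 10 rectangular tessellation over the region's bounding box
grid_sq <- st_make_grid(region, n = c(10, 10))

# Hexagonal tessellation over the same extent
grid_hex <- st_make_grid(region, n = c(10, 10), square = FALSE)

# Clip the square grid to the region; cells fully outside are dropped
grid_clip <- st_intersection(grid_sq, region)

clip_area <- sum(as.numeric(st_area(grid_clip)))
```

The clipped cells together cover exactly the region, so clip_area matches the area of the circle polygon.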

Again, I’ve tried to focus on common geoprocessing and vector analysis techniques here. You will likely run into specific issues that will require different techniques. However, I think you will find that the examples provided can offer a starting point for other analyses.

I also focused on dplyr and sf. However, there are other packages available for working with vector data. Many rely on sp, so you need to know how to convert between sp and sf types. You can save the results of your analyses to permanent files using the methods discussed in the spatial data module.

Now that you have studied vector-based spatial analysis, you are ready to investigate working with and analyzing raster data in R.
