Reading in Spatial Data

Objectives

  1. Read in vector data using the sp and sf packages.
  2. Convert between the sp and sf data models.
  3. Define and transform datums and projections for vector data.
  4. Access and work with attribute data associated with vector features.
  5. Read in raster data using the raster package.
  6. Define and transform datums and projections for raster data.

Overview

We are now ready to begin discussing loading, visualizing, and analyzing spatial data in R. In this section specifically, we will discuss reading in vector and raster data, working with datums and projections, and saving spatial output to disk. I will assume that you have prior knowledge of spatial data but have not worked with these data in R. If you have trouble with the theoretical components of GIScience and spatial data, we provide an additional course here focused on these topics. This course covers a variety to GIScience topics including data models, datums and projections, digital cartography, spatial analysis, and spatial modeling. The lab exercises make use of ArcGIS Pro.

Since vector and raster spatial data have fundamentally different data structures, different methods and packages are available for working with them in R. We will start with vector data.

Vector Data

Vector data, including points, lines, and polygons, have a more complex data structure than raster data, which is effectively an array. Within R, we need to be able to store the geographic, attribute, and coordinate system components of vector data, and there are different means to accomplish this. Initially, vector data were read into R using the sp package. More recently, the sf package has provided an alternative. I will discuss both methods here, but we will primarily focus on sf in this course.

sp

The sp package provides a variety of options for storing vector data including the following:

  • SpatialPoints: stores points but no attributes
  • SpatialPointsDataFrame: stores points and attributes
  • SpatialLines: stores lines but no attributes
  • SpatialLinesDataFrame: stores lines and attributes
  • SpatialPolygons: stores polygons but no attributes
  • SpatialPolygonsDataFrame: stores polygons and attributes

These data models are all subclasses of the Spatial class. This class makes use of the s4 class, in which data components are stored in slots. Let’s start by exploring this structure.

Anytime I work with spatial data in R, I load the rgdal library, which provides bindings to the Geospatial Data Abstraction (GDAL) and PROJ.4 libraries. When you install these R packages the GDAL and PROJ.4 libraries will install to your computer outside of R. Here is a link to GDAL and PROJ.4. I am also using the tmap package here to visualize the data in map space. We will discuss this in detail in the next section, so don’t worry if you don’t understand the syntax yet.

Let’s build a SpatialPoints object from simulated data. First, I create a series of x, y coordinate pairs representing longitude and latitude. I then convert this to a SpatialPoints object using the SpatialPoints() function.

When I call summary() on the new SpatialPoints object, you can see that coordinates are now provided, but there is no coordinate system information defined.

The is.projected() function can be used to test whether a projected coordinate system is defined for an object of the Spatial class. This test confirms that the projection information is missing, as NA is returned. You can use the proj4string() function on the SpatialPoints object to define a projection using a PROJ.4 string. You can also define the projection using the European Petroleum Survey Group (EPSG) code as demonstrated in the line that is commented out. Calling summary() on the layer confirms that spatial reference information is now present. You will notice that “Is projected” is FALSE. This is because this layer is not technically “projected” since we did not use a projected coordinate system. Instead, a geographic coordinate system is defined. So, the data have been referenced to an ellipsoidal model of the Earth as opposed to a projected coordinate system.

There are many geographic and projected coordinate systems. So, you probably won’t know the EPSG code or PROJ.4 string for a specific projection. However, you can look them up on this website. This PDF provides a great explanation of coordinate systems in R.

We now have a SpatialPoints object. However, there is no attribute information. This is because this is not a SpatialPointsDataFrame. The next block of code demonstrates how to generate a SpatialPointsDataframe by combining SpatialPoints and a data frame. Calling summary() confirms that attributes are now present.

Generally, you will not build your spatial data from scratch. Instead, you will read spatial data into R. I will provide examples for reading in shapefile data of counties, rivers, towns, and interstates in West Virginia. This can be accomplished using the readOGR() function from the rgdal package. the dsn argument here is set to the subfolder within the working directory that houses the data. This could also be a geodatabase if you are reading in a feature class.

I then use the tmap package to plot the data in map space. Again, we will discuss the specifics of this in the next section.

Let’s explore the data structure of the town points layer in greater detail. The s4 class and the associated Spatial subclass contains many slots, and the structure is very similar to that of a list. In the first line of code, I am printing the mean population by county. I have to specify the SpatialPointsDataFrame object (towns), the slot (data), and the column of interest (“POPULATION”"). Note the the data slot stores the attribute information.

The coords slot holds the coordinate pairs that define the data point locations relative to the coordinate reference system.

As expected, the spatial information is more complex for a polygon. Here is an example of printing the first five coordinates that define a single polygon in the counties data set. Additional slots are available for the the ring direction, to note if holes are present, and for the area of the feature.

Here is an example of extracting the spatial reference information for the towns data. You can see that the spatial reference is NAD83 UTM Zone 17N.

This example shows how to extract the bounding box coordinates for the data.

sf

More recently, a second method has been developed for reading vector data into R. The sf package (Simple Features for R) uses a very different data structure than that used by sp as I will demonstrate here.

First, I will read in the same data layers using the st_read() function from sf.

In the next code block, I run is.data.frame() on the towns_sf object, and it returns TRUE, indicating that this object is indeed a data frame. That’s weird. Actually, this object is a data frame, but how is this possible?

If we summarize the towns_sf object you will see that each attribute is represented as a column, and that the last column is called geometry. So, the geometric or spatial information exists as a column in the table.

summary(towns_sf)
      AREA     PERIMETER     TOWN_          TOWN_ID          TOWNS_     
 Min.   :0   Min.   :0   Min.   : 17.0   Min.   : 17.0   Min.   : 17.0  
 1st Qu.:0   1st Qu.:0   1st Qu.: 62.5   1st Qu.: 62.5   1st Qu.: 62.5  
 Median :0   Median :0   Median :162.0   Median :162.0   Median :162.0  
 Mean   :0   Mean   :0   Mean   :147.1   Mean   :147.1   Mean   :147.1  
 3rd Qu.:0   3rd Qu.:0   3rd Qu.:228.0   3rd Qu.:228.0   3rd Qu.:228.0  
 Max.   :0   Max.   :0   Max.   :270.0   Max.   :270.0   Max.   :270.0  
                                                                        
    TOWNS_ID        PLACEFP     TYPE         INTPTLAT        INTPTLNG 
 Min.   : 17.0   05332  : 1   CDP : 1   +37260462: 1   -077969283: 1  
 1st Qu.: 62.5   08524  : 1   CITY:15   +37787480: 1   -079949772: 1  
 Median :162.0   14600  : 1             +38350550: 1   -080146087: 1  
 Mean   :147.1   15628  : 1             +38356390: 1   -080331821: 1  
 3rd Qu.:228.0   19108  : 1             +38378576: 1   -080564122: 1  
 Max.   :270.0   26452  : 1             +38412950: 1   -080696995: 1  
                 (Other):10             (Other)  :10   (Other)   :10  
      LAT             LONG                 NAME      POPULATION   
 Min.   :37.26   Min.   :-82.43   BECKLEY    : 1   Min.   :10753  
 1st Qu.:38.37   1st Qu.:-81.65   BLUEFIELD  : 1   1st Qu.:12366  
 Median :39.27   Median :-81.20   CHARLESTON : 1   Median :18178  
 Mean   :38.99   Mean   :-80.95   CLARKSBURG : 1   Mean   :23100  
 3rd Qu.:39.52   3rd Qu.:-80.51   CROSS LANES: 1   3rd Qu.:27875  
 Max.   :40.40   Max.   :-77.97   FAIRMONT   : 1   Max.   :57287  
                                  (Other)    :10                  
    FAMILIES       HOUSEHOLDS       X_COORD          Y_COORD       
 Min.   : 2873   Min.   : 4211   Min.   :374822   Min.   :4123586  
 1st Qu.: 3453   1st Qu.: 5141   1st Qu.:443236   1st Qu.:4247294  
 Median : 4628   Median : 7896   Median :482122   Median :4346950  
 Mean   : 6083   Mean   : 9791   Mean   :503578   Mean   :4315852  
 3rd Qu.: 7202   3rd Qu.:10807   3rd Qu.:542150   3rd Qu.:4375934  
 Max.   :15231   Max.   :25306   Max.   :760757   Max.   :4472140  
                                                                   
          geometry 
 POINT        :16  
 epsg:26917   : 0  
 +proj=utm ...: 0  
                   
                   
                   
                   

If I print the information in this column for a single town, you can see that all the geometric information, including the geometry type, dimensions, bounding box coordinates, spatial reference information, and actual coordinates, are present. So, when using sf, vector features are read in as a data frames with a special geometry column that stores the geometric information in a list structure. This is very similar to an object-based vector data model, where the attributes and spatial information are stored in the same file. For example, a feature class in a geodatabase stores the geometric information in a table column as a binary large object (BLOB).

One benefit of using sf is that functions that accept data frames can also be applied to these spatial objects, since they are data frames. You will see many examples of this in the following modules.

It is possible to remove the geographic information from the data frame using st_drop_geometry(), as demonstrated in the next code block.

Conversion

Since there are two separate data models available in R for vector data, which should you use? In this course, we will primarily work with sf; however, there are many packages available for analyzing spatial data in R, and not all are able to use sf data. So, it is important to be familiar with sp and also know how to convert between types. We will discuss conversion methods.

I am reading in the towns data again using sp and sf.

In this example, I am converting the sf object to sp using the as() function from sp. setting Class to “Spatial” indicates that we want the result to be an object of class Spatial. The class() function is used to confirm this.

Using the st_as_sf() function from sf, you can convert an object of class Spatial to a data frame, as defined by sf.

Working with Vector Coordinate Systems

When working with spatial data in R, it is important that coordinate reference information be provided and that it is correctly defined. I would recommend checking the spatial reference information every time you read data into R. Using sf, projection information can be printed, written to an object, removed, or defined using the st_crs() function. In this first example, I am simply printing the projection information for the counties_sf object, which will return the EPSG code and PROJ.4 string.

The coordinate reference information can also be saved as a list, and individual components can be saved to vectors.

In this example, I am removing the projection information by setting it to NA. I am then redefining the projection using the PROJ.4 string and EPSG code.

What if you would like to transform the datum or projection? This can be accomplished with the st_transform() function. You will need to provide the EPSG code or PROJ.4 string for the desired output projection. As an example, I am converting the county data from NAD83 UTM Zone 17N to the WGS84 geographic coordinate system.

Save to Disk

The result of your analyses in R can be saved to disk using both rgdal and sf. rgdal provides the writeORG() function which is demonstrated below. Note that you can save to a variety of vector formats; however, you must provide an object of class Spatial. So, here I am converting the data frame to the Spatial class within the writeORG() function. If you do not want to save these files to your machine, do not execute these lines of code.

Data can also be exported using the st_write() function from sf.

For point data, you can also save the result as a CSV file with the coordinates written to the table as demonstrated below.

Again, I prefer to use sf to read and write vector data files in R. However, you may find sp and/or rgdal to be necessary for some tasks, so it is good to be familiar with all three packages and the associated data models.

Raster Data

As mentioned above, raster data have a simpler data structure than vector data, as these data are represented simply as an array of values. A single-band raster is similar to a matrix in R while a multi-band raster is similar to a 3-dimensional array. The primary difference between spatial raster data and other gridded or array data sets, is that spatial reference information is associated with the data so that it will draw at the correct location within the map space and each cell or pixel will correspond to a certain land area. For example, a cell may cover 30 by 30 meters or land area.

Unlike vector data, there is one primary package used to work with raster grids in R: the raster package. Note that the raster package also makes use of rgdal.

Raster Data Stucture

To gain a better understanding of raster data, let’s build a raster grid from scratch. Using the raster() function from the raster package, I first generate an empty raster by defining the number of rows, number of columns, and the minimum and maximum coordinates in the x and y directions. I am using large numbers here because I will eventually define a UTM projection for this raster, so, easting will be reported in meters relative to the central meridan in the zone and northing will be the distance from the equator in meters. Since the grid has 10 rows and 10 columns, there will be 100 total cells. I then create 100 random values between 0 and 255 to mimic 8-bit data using the sample() function. I set replacement equal to TRUE so that a value can be selected more than once. I then fill the empty raster with the random values using the values() function also from the raster package. The next step is to define the projection using the projection() function, again from the raster package. Here, I am using the EPSG code for NAD83 UTM Zone 17N. Lastly, I plot the raster using tmap.

By simply typing the name of the raster into the console, information about the data is returned. It is always good to inspect the data to make sure there are no issues. The class of this object is a RasterLayer, which is defined by the raster package. It has 10 rows, 10 columns, and 100 total cells. The resolution is 10 meters by 10 meters. This is because we defined the extent as 100 meters of easting and northing, so each cell would need to measure 10 meters to fill the spatial extent. The EPSG code and PROJ.4 string are provided to define the projection. The data are currently stored in memory, and the values in the cells range from 3 to 244. Note that you will probably have different values since the data were generated randomly.

Read in Single-Band Raster Data

Similar to vector data, you probably won’t created raster grids from scratch. Instead, you will read in files. We will begin with a discussion of how to read in single-band raster data.

This can be accomplished using the raster() function and calling the file name. Note that you will have to provide the entire path if the raster grid is not in your working directory. I am reading in two raster grids in the code block below. I then call the objects to inspect them. Both are projected into NAD83 UTM Zone 17N, have a cell size of 30 by 30 meters, 926 rows, 995 columns, and 921,370 total cells.

The dem layer is a digital elevation model or DEM while the lc layer is a land cover grid representing a portion of the National Land Cover Database (NLCD) 2011. The DEM is an example of a continous raster grid where each cell holds a numeric measurement, in this case elevation in meters. The land cover data is an example of a categorical raster grid where each cells holds a number that is a stand in for a category, in this case a land cover class.

I am now using tmap to visualize the data in map space. Note that we will talk about refining these maps in the next section. Here, I am just inspecting the data.

Read in Multi-Band Raster Data

To call in multi-band raster data, such as multispectral imagery, you cannot use the raster() function, as this is intended for single-band data only. Instead, you must use either brick() or stack(). A RasterBrick can only reference a single file while a RasterStack can reference multiple files or a subset of bands from one file. You can use either method to call in a multi-band file, as demonstrated in the following two examples where a Sentinel-2 multispectral image of Vancouver, British Columbia is read into R.

Note that I am using the plotRGB() function from the raster package to visualize the multi-band data as opposed to using tmap. This function allows you to specify which bands to display with each color channel and allows for contrast stretching. In the examples, the red channel is showing NIR, green is showing red, and blue is showing green. So, this is a standard false color composite where vegetated areas appear red (since they reflect a larger percentage of incoming NIR radiation).

RasterBrick and RasterStack yield the same results in this case. However, if you want to reference multiple files, you cannot use a RasterBrick.

It is generally recommended to use RasterBricks when referencing a single file, as this in more computationally efficient and can speed up processing. I have found that RasterStacks may fail if all the referenced files do not have the exact same extent, number of rows, number of columns, cell size, and spatial reference system. So, I generally just stack or composite raster grids outside of R before reading them in. This is just a personal preference.

This is an example of displaying the image using histogram equalization as opposed to a linear stretch.

Here is an example of a different composite in which the red channel is showing SWIR, the green channel is showing NIR, and the blue channel is showing green.

As a side note, the raster(), brick(), and stack() functions are capable of reading in any raster layer even if it does not represent geospatial data. Here is an example in which I have read in a picture of my cats: Freddy and Peri.

Working with Raster Coordinate Systems

Working with raster coordinate systems is pretty similar to working with those for vector data. You can use the crs() function to obtain the projection information for a RasterLayer, RasterBrick, or RasterStack. In the example, you can see that the Sentinel-2 image is reference to WGS84 UTM Zone 10N.

Similar to the example for vector data above, in this code block I am removing the spatial reference information for the Sentinel-2 data by setting it to NA. I then reapply the projection using the appropriate PROJ.4 string. By calling crs() on this object, the projection information in returned.

Lastly, the projectRaster() function can be used to reproject the data into a different geographic or projected coordinate system. In the code block below I am transforming the Sentinel-2 image to the WGS84 geographic coordinate system using the corresponding EPSG code. Note that this can be slow for large data sets.

Save to Disk

Raster data can be written to disk as a permanent file using the the writeRaster() function. I generally prefer to use TIFF or IMG files when reading data in or writing data out from R. However, a variety of formats are available and can be specified using the format argument.

Now that you know how to read in geospatial data in R and the characteristics of the data models used to represent vector and raster data, you can start using these data in your analyses. In the next section, we will explore visualizing spatial data using and tmap. In later sections we will explore vector- and raster-based spatial analysis.

Back to Course Page

Back to WV View

Download Data