Geospatial Supervised Learning using R
Preface
This web text can be viewed in a light mode or a dark mode. A toggle is provided in the sidebar.
Goals of Text
Supervised learning, or learning from labeled data and predictor variables, has a variety of applications in remote sensing and geospatial science including image classification, land cover mapping, change detection, object detection, digital soil mapping, geohazard mapping and assessment, and habitat suitability modeling. The primary goal of this text is to provide practical examples and guidance on how to implement geospatial supervised learning in the R language and computational environment for the generation of spatial models and map products.
This text serves as a companion to Supervised Learning in Remote Sensing and Geospatial Science: Theory and Practice by Maxwell, Ramezan, and He, published by Elsevier. During the production of that text, we purposely refrained from including specific code examples. Instead, author Maxwell, with assistance from his graduate students (Sarah Farhadpour and Behnam Solouki), generated this online text to accompany the book, with the goal of making updates on a regular basis, which is much more difficult for a published and printed text. Supervised Learning in Remote Sensing and Geospatial Science: Theory and Practice provides detailed coverage of the theory underpinning the topics presented in this online text, which instead focuses on implementing the techniques in the R language.
Audience
This text was written for several key audiences.
- Geospatial professionals interested in developing or further developing their supervised learning skills and learning how to implement supervised learning workflows
- Researchers needing to apply geospatial supervised learning to problems that interest them
- General data scientists and analysts needing to learn more about supervised learning and working with geospatial data
The material covered in this text may be appropriate for an advanced undergraduate course or a graduate-level course. Since the focus is on code implementation and practical considerations, a more theoretical companion text may be necessary.
The first three sections of the text, as described below, assume the user is familiar with the R language and the tidyverse. If you are new to R or need a refresher, the fourth, appendix section of the text introduces R and the tidyverse and assumes no prior knowledge of the R language specifically or coding in general. That section also introduces reading, visualizing, and working with geospatial data in R.
Structure of Text
This text consists of 25 chapters broken into four sections. Each chapter consists of examples with associated code and explanations. The covered topics are listed at the beginning of each chapter in the Topics Covered section. At the end of each chapter, we provide review questions and exercise(s). All chapters have questions, but a few do not have exercises. The data needed to complete the exercises are provided or linked, as described below.
The first section focuses on data preprocessing and feature engineering. Since input data for geospatial predictive modeling are primarily provided as raster grids, Chapter 1 discusses general raster data handling and processing using the terra package. Continuing the theme of Chapter 1, Chapter 2 explores processing multispectral imagery, including creating normalized difference band ratios (e.g., the normalized difference vegetation index (NDVI)), performing a tasseled cap transformation, implementing principal component analysis (PCA) for feature reduction, and generating spatial enhancements using convolution filters or kernels. This chapter also discusses creating land surface parameters (LSPs) from digital terrain models (DTMs). Chapter 3 discusses data preprocessing more generally with a focus on data visualization, exploratory data analysis, and creating preprocessing pipelines using the recipes package. Chapter 4 discusses the processing and rasterization of light detection and ranging (lidar) point cloud data. Lastly, Chapter 5 discusses model accuracy assessment methods and metrics with a specific focus on the yardstick package. These methods and metrics are used throughout the following two sections of the text.
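As a preview of the raster processing covered in these chapters, here is a minimal sketch of computing NDVI with terra; the file path and band order (red = band 3, NIR = band 4) are hypothetical and would need to be adjusted for your data.

```r
library(terra)

# Hypothetical multispectral image; adjust the path and band order for your data
img <- rast("gslrData/chpt2/data/multispectral.tif")
red <- img[[3]]  # assumed red band
nir <- img[[4]]  # assumed near-infrared band

# Normalized difference vegetation index (NDVI)
ndvi <- (nir - red) / (nir + red)
plot(ndvi)
```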
The second section of the text focuses on linear models and machine learning methods. Chapter 6 covers the process of training and evaluating linear models, including the interpretation of learned coefficients and ancillary output and testing model assumptions. Chapter 7 explores the random forest algorithm, as implemented with the randomForest package, for probabilistic spatial predictive modeling. It also discusses means to assess the importance of predictor variables using random forest and Shapley additive explanations (SHAP) values, a model-agnostic method. Chapter 8 and Chapter 9 explain the process of implementing machine learning using the tidymodels metapackage, which includes several key packages: parsnip, recipes, rsample, tune, dials, yardstick, and workflows. By presenting and implementing a series of workflows, we explain how to train, tune, validate, compare, and use models generated with the tidymodels workflow.
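To give a flavor of the random forest material, below is a minimal sketch of fitting a classifier with the randomForest package; the built-in iris data stand in here for a geospatial training table and are not a dataset from the text.

```r
library(randomForest)

set.seed(42)  # for reproducible results
# iris stands in for a table of training samples and predictor variables
rfModel <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)

print(rfModel)       # out-of-bag (OOB) error estimate and confusion matrix
varImpPlot(rfModel)  # plot variable importance estimates
```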
The third section of the text focuses on deep learning using the torch package, which provides a means to implement deep learning in R without connecting to a Python/PyTorch environment. Throughout this section, the training of deep learning algorithms is implemented using luz, which greatly simplifies the error-prone training process. Chapter 10 explains the tensor data model used to represent data for deep learning as multidimensional arrays that can be processed on both a CPU and a GPU and for which the operations performed to create them can be tracked as a computational graph, which is required for performing backpropagation during the training process. In Chapter 11, you learn to build, train, and evaluate a fully connected artificial neural network (ANN) architecture for a scene classification task (i.e., labeling an entire image to a single class), while Chapter 12 explores the same task using convolutional neural network (CNN) architectures. Chapter 13 explores a variety of methods to potentially improve deep learning models by augmenting the architecture and/or training process. The final three chapters of this section focus on geospatial semantic segmentation, or pixel-level labeling. In Chapter 14, you learn to build UNet architectures from scratch. In Chapter 15 and Chapter 16, you learn to use the geodl package, which builds on terra, torch, and luz to implement geospatial semantic segmentation workflows.
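The tensor and computational graph ideas can be previewed in a few lines; this is a minimal sketch, not code from the text, and the variable names are our own.

```r
library(torch)

# Create a 2x3 tensor from an R matrix
x <- torch_tensor(matrix(1:6, nrow = 2, ncol = 3), dtype = torch_float())

# requires_grad = TRUE tells torch to track operations in the computational graph
w <- torch_randn(3, 1, requires_grad = TRUE)

y <- torch_mm(x, w)   # matrix multiplication
loss <- torch_sum(y)  # reduce to a scalar to differentiate

loss$backward()  # backpropagation populates the gradient
w$grad           # gradient of the loss with respect to w
```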
The fourth and final section of the text serves as an appendix to introduce using R for general data science tasks and for working with geospatial data. If you are already familiar with R, this section may not be useful or necessary. However, for those new to R, still learning the language, or looking for a refresher, this section, or some chapters from it, may be of interest. The goal of this appendix section is to support users in learning R for general data science, GIS, and spatial mapping and modeling tasks. Chapter 17 covers the basics of the R language while Chapter 18 covers data manipulation, wrangling, and summarization using the tidyverse packages (e.g., readr, tibble, dplyr, stringr, and forcats). Chapter 19 further explores the R language and discusses functions, loops, and control flow. Chapter 20 and Chapter 21 explore the use of ggplot2 for building data visualizations and graphs. Chapter 22 introduces the gt package for building and refining tables. The final set of chapters focuses on geospatial data specifically. Chapter 23 introduces mapping and spatial data visualization with tmap while Chapter 24 focuses on working with and analyzing vector geospatial data with the sf package. The final chapter, Chapter 25, introduces building interactive web maps using the leaflet package.
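As a hypothetical preview of the style of these appendix chapters (the built-in mpg data and the column summary below are not from the text), a tidyverse workflow chains data wrangling directly into a ggplot2 graphic:

```r
library(tidyverse)

# Summarize mean highway fuel economy by vehicle class, then graph the result
mpg |>
  group_by(class) |>
  summarize(meanHwy = mean(hwy)) |>
  ggplot(aes(x = class, y = meanHwy)) +
  geom_col()
```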
Data and Source Files
This book was written using Quarto. To learn more about Quarto, we recommend Quarto for Scientists by Nicholas Tierney, which was also written using Quarto and has been made available free of charge. All the Quarto (.qmd) files that constitute this book can be downloaded from the text's landing page.
We have also provided the required source data and Quarto files to execute the provided code. These data can be downloaded using the download options in the sidebar. Separate downloads have been provided for each chapter. You will need to uncompress the data before use. If you open a Quarto file from the downloaded directory, the relative file paths used in the code should work without needing to be updated. In some cases, the files used are large or already available on the web. In such cases, we note the source at the beginning of the chapter in which they are used. We have also provided trained models for use in the deep learning section if you would like to execute some of the example code but not the associated training loop, which can take hours in some cases.
All data are housed in the gslrData folder. Each chapter has an associated folder in this directory (e.g., chpt1). Within each chapter directory, the data needed to execute the chapter code are provided in the data folder, while files for the associated exercise(s) are in the exercise folder. We also include initially empty output and scratch folders in each chapter folder for use when the code is executed.
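As a hypothetical illustration of this folder convention (the file names below are not from the text), a chapter script might read from the data folder and write to the output folder:

```r
library(terra)

# Read an input raster from the chapter's data folder (hypothetical file name)
dem <- rast("gslrData/chpt1/data/dem.tif")

# Derive slope in degrees from the digital terrain model
slp <- terrain(dem, v = "slope", unit = "degrees")

# Write the result to the chapter's initially empty output folder
writeRaster(slp, "gslrData/chpt1/output/slope.tif", overwrite = TRUE)
```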
We generally recommend using an integrated development environment (IDE). We suggest RStudio by Posit, which is freely available.
Installing R packages is generally easy. In code, this can be accomplished with install.packages("INSERT PACKAGE NAME"). In RStudio, packages can be installed using Tools > Install Packages. Setting up torch for deep learning in R can be tricky, especially if you plan to use a GPU. Use of a GPU is required to implement the CNNs and semantic segmentation architectures presented in the text. This page describes how to set up torch in R with and without GPU support.
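For instance, the core metapackages and a few standalone packages used in the text could be installed as follows; this list is illustrative rather than exhaustive, and the linked page above remains the authoritative guide for torch setup.

```r
# Install the core metapackages plus a few standalone packages (illustrative list)
install.packages(c("tidyverse", "tidymodels", "terra", "sf", "tmap", "torch"))

# torch requires an additional step to download its C++ backend libraries
library(torch)
install_torch()
```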
Thanks
This online text was partially funded by AmericaView. AmericaView is a nationwide, university-based, and state-implemented network that advances Earth observation education through remote sensing science, applied research, workforce development, technology transfer, and community outreach. AmericaView is supported by the United States Geological Survey (USGS) under Grant/Cooperative Agreement No. G23AP00683. The views and conclusions contained in this text are those of the authors and should not be interpreted as representing the opinions or policies of the U.S. Geological Survey. Mention of trade names or commercial products does not constitute their endorsement by the U.S. Geological Survey.
Partial funding was also provided by the National Science Foundation (NSF) (Federal Award ID No. 2046059: “CAREER: Mapping Anthropocene Geomorphology with Deep Learning, Big Data Spatial Analytics, and LiDAR”). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Error Statement
The authors have tried their best to provide accurate and complete information. Any errors are unintentional. If you would like to comment on any material in this text, please feel free to contact the lead author, Aaron Maxwell, at Aaron.Maxwell@mail.wvu.edu.
Nomenclature
We use the following type standards throughout the text:
- Bold text is used to highlight key or important terms at first use in each chapter (e.g., Landsat).
- Underlined text is used for R package or metapackage names (e.g., terra).
- Italic text is used for file names, folder paths/directories, and book or journal titles (e.g., rasterStack.tif).
- Table, data frame, or tibble column names are written in “quotes” and italicized (e.g., “Band1”).
- “Quotes” are placed around the names of journal articles.
- In-text code snippets and variable names are written as fixed-width text (e.g., var1 = c(1,2,3,4)).
- Callout blocks are used to highlight key information, offer practical advice, or relate concepts across chapters.
This is an example callout block.
Concluding Remarks
We hope you are excited to start learning more about supervised learning as implemented in R. As noted above, we begin the discussion with processing raster geospatial data since this is how we commonly represent data for use in the modeling workflow.
Appendix: Package List
Below is a list of the key packages used in this text with a brief description of their use and links to their documentation.
Package | Metapackage | Description
---|---|---
Data prep and wrangling | |
readr (site) | tidyverse | Read tabular data
tibble (site) | tidyverse | Define the tibble data type
dplyr (site) | tidyverse | General data wrangling, filtering, querying, and manipulation
stringr (site) | tidyverse | Work with string/character data
forcats (site) | tidyverse | Work with factor data
ggplot2 (site) | tidyverse | Data visualization and graphing
gt (site) | - | Create data tables
Accuracy assessment | |
yardstick (site) | tidymodels | Accuracy assessment
micer (site) | - | Map image classification efficacy (MICE) and related metrics
pROC (site) | - | Receiver operating characteristic (ROC) curves
Machine learning | |
parsnip (site) | tidymodels | Implement a variety of ML algorithms using consistent syntax
recipes (site) | tidymodels | Preprocessing and feature engineering
rsample (site) | tidymodels | Data splitting and resampling
tune (site) | tidymodels | Model optimization and hyperparameter tuning
dials (site) | tidymodels | Define and manage hyperparameter settings to test
workflows (site) | tidymodels | Define and implement a unified workflow consisting of pre-processing, tuning, training, and post-processing components
workflowsets (site) | tidymodels | Combine multiple workflows
randomForest (site) | - | Random forest algorithm implementation
ranger (site) | - | Random forest algorithm implementation
kernlab (site) | - | Support vector machine algorithm implementation
permimp (site) | - | Variable importance estimation with random forest
Boruta (site) | - | Variable importance estimation and feature selection with random forest
fastshap (site) | - | SHAP values
shapviz (site) | - | SHAP value visualization
Deep learning | |
torch (site) | - | Deep learning R/C++ implementation
luz (site) | - | Simplifies the torch training loop
torchvision (site) | - | Adds computer vision functionality to torch
geodl (site) | - | Geospatial semantic segmentation using torch, luz, and terra
Geospatial | |
terra (site) | - | Raster geospatial data reading and processing
sf (site) | - | Vector geospatial data reading and processing
tmap (site) | - | Thematic map creation
leaflet (site) | - | Interactive web map creation
spatialEco (site) | - | Additional functionality for processing geospatial data with a focus on spatial ecology applications
RStoolbox (site) | - | Additional tools for working with remotely sensed data
MultiscaleDTM (site) | - | Calculate land surface parameters at multiple scales using moving windows/kernels