Geospatial Machine Learning

There are several platforms for implementing and experimenting with machine learning algorithms. For example, the R caret and tidymodels packages provide a common syntax for training and applying a wide variety of algorithms. If you are interested in machine learning using R, please take a look at our Open-Source Spatial Analytics (R) course.

In this course, we will explore scikit-learn (also referred to as sklearn). This is a toolkit that extends the functionality of SciPy, a library for scientific computing in Python used extensively within mathematics, engineering, and the physical sciences. A large number of toolkits are available for extending SciPy; a full list can be found here.

Scikit-learn specifically allows for:

  • preprocessing data for use in predictive modeling
  • training a variety of machine learning algorithms
  • optimizing hyperparameters
  • predicting to new data and assessing model performance
  • visualizing models and results

Why do we care about machine learning as geospatial professionals and scientists? The primary reasons are that machine learning algorithms have been shown to be powerful tools for discovering patterns in data, mapping features, and modeling or predicting processes. They allow for a variety of prediction types including:

  • Categorical: predict discrete categories, such as types of land cover or types of forests
  • Binary/Probabilistic: differentiate two categories and/or predict the probability of a sample belonging to a certain class; for example, what is the likelihood that a location might experience a landslide in the near future?
  • Continuous: predict continuous measures, such as depth to the water table or percent clay content of soil; these are regression-type problems.

In this section, I will provide examples relating to a binary classification problem. Specifically, we will differentiate wetlands from uplands to produce an estimate of the probability of wetland occurrence at locations on the landscape based on local topographic variables.

After working through this module you will be able to:

  1. prepare data for input into machine learning models.
  2. optimize, train, and assess machine learning models.
  3. compare feature spaces and different algorithms with regard to model performance.
  4. apply results to raster-based spatial data to make spatial predictions.

We will not cover deep learning techniques in this course or the PyTorch library. We are working toward producing a second course focused on that topic.

We have provided some lecture material along with this course that was created for our Open-Source Spatial Analytics (R) course that provides a conceptual background of machine learning applied to geospatial data. Although the code examples are based in R, this module is useful as a general background on machine learning prior to working through this practical example.

Optimize, Train, and Predict using Random Forest

As usual, we will begin by reading in the required libraries.

  1. Pandas: read in our training and validation data tables and store them as DataFrames
  2. NumPy: work with data arrays
  3. Matplotlib/Seaborn: visualize data
  4. Rasterio: read in raster data
  5. Pyspatialml: apply machine learning models to predict to raster grids
  6. Scikit-learn: implement machine learning in Python including data preprocessing, hyperparameter optimization, training, and assessment

In order to simplify the code, I import specific tools from scikit-learn. Here is a brief description of each.

  1. RandomForestClassifier(): implementation of random forest (RF) algorithm for classification
  2. KNeighborsClassifier(): implementation of k-Nearest Neighbor (k-NN) algorithm for classification
  3. SVC(): implementation of support vector machines (SVM) for classification
  4. confusion_matrix(): return confusion matrix for model assessment
  5. classification_report(): return several classification accuracy assessment metrics
  6. roc_curve(): calculate receiver operating characteristics curve (ROC) for probabilistic assessment
  7. roc_auc_score(): calculate area under the ROC curve (AUC)
  8. StandardScaler(): standardize predictor variables
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import rasterio as rio
import pyspatialml as pml
%matplotlib inline

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler

Get the Data

I begin by reading in the training and testing data, which will be used to train the models and assess the performance, respectively. Sometimes the test set is referred to as the validation data. This is accomplished with Pandas, since the data are stored in CSV files. There are tools available in Python to extract training data using vector and raster data. However, I have found that this is generally easier in a GIS software tool, such as QGIS or ArcGIS Pro. So, I would recommend doing your data preparation in a GIS environment. For example, QGIS and ArcGIS Pro have tools available to extract raster values at point locations.
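
If you prefer to stay in Python, raster values can also be sampled at point locations with Rasterio's sample() method. The sketch below is a minimal, hypothetical example: the file paths and the use of GeoPandas to read a point layer are assumptions for illustration and are not part of the data used in this module.

import geopandas as gpd
import rasterio as rio

# Hypothetical inputs: a point layer of training locations and a single-band raster.
pts = gpd.read_file("D:/data/ml/training_points.shp")
coords = [(geom.x, geom.y) for geom in pts.geometry]

with rio.open("D:/data/ml/slp_d.tif") as src:
    # sample() yields one array per point containing a value for each band.
    pts["slp_d"] = [vals[0] for vals in src.sample(coords)]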

The "class" field represents the category being predicted:

  • "wet": example of a wetland
  • "not": example of not a wetland

All of the predictor variables represent different measures that quantify the conditions of the local terrain surface.

  1. "cost": distance from water bodies weighted by topographic slope
  2. "crv_arc": topographic curvature
  3. "crv_plane": plan curvature
  4. "crv_pro": profile curvature
  5. "ctmi": compound topographic moisture index (CTMI)
  6. "diss_5": terrain dissection (5 m radius)
  7. "diss_10": terrain dissection (10 m radius)
  8. "diss_20": terrain dissection (20 m radius)
  9. "diss_25": terrain dissection (25 m radius)
  10. "diss_a": terrain dissection average
  11. "rough_5": terrain roughness (5 m radius)
  12. "rough_10": terrain roughness (10 m radius)
  13. "rough_20": terrain roughness (20 m radius)
  14. "rough_25": terrain roughness (25 m radius)
  15. "rough_a": terrain roughness average
  16. "slp_d": topographic slope in degrees
  17. "sp_5": slope position (5 m radius)
  18. "sp_10": slope position (10 m radius)
  19. "sp_20": slope position (20 m radius)
  20. "sp_25" slope position (25 m radius)
  21. "sp_a": slope position average

It is common for examples to be provided as a single table. This will require the user to randomly split the data into training and testing sets. This can be accomplished using model_selection.train_test_split(). This function also allows for stratified random sampling based on a categorical or grouping variable.
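
A minimal sketch of such a split is shown below; the table name ("df") and the 70/30 proportion are illustrative assumptions.

from sklearn.model_selection import train_test_split

# Hypothetical: 'df' is a single table holding the "class" column and all predictors.
# Stratifying on "class" keeps the class proportions similar in both partitions.
train, test = train_test_split(df, test_size=0.3, stratify=df['class'], random_state=42)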

train = pd.read_csv('D:/data/ml/training.csv')
test = pd.read_csv('D:/data/ml/validation.csv')
print(train.head())
  class        cost     ctmi   crv_arc  crv_plane   crv_pro   diss_10  \
0   not  404.444000  9.72181 -0.376308  -0.270172  0.106136  0.596801   
1   not   19.365499  7.81020 -0.501769  -0.095858  0.405911  0.127337   
2   not  907.374023  5.83340  1.889750   0.630867 -1.258890  0.882726   
3   not  407.984985  6.05410  0.668900   0.152399 -0.516501  0.490594   
4   not  360.717987  6.85992 -0.878002  -0.237463  0.640539  0.421594

    diss_20   diss_25    diss_5  ...  rough_25   rough_5  rough_a    slp_d  \
0  0.423932  0.390846  0.559673  ...   3.54012  1.430060  2.59963  4.11556   
1  0.070422  0.067852  0.150943  ...   3.12002  0.944582  2.02356  2.78634   
2  0.776313  0.746689  0.974049  ...   2.67275  1.538140  2.21755  5.02015   
3  0.504769  0.486941  0.592066  ...   3.29430  1.741610  2.53107  8.01975   
4  0.338215  0.330672  0.464568  ...   3.46882  1.948190  2.76768  8.94219

      sp_10     sp_20     sp_25      sp_5      sp_a       dem_m  
0  0.823059  0.356201  -0.31958 -0.115417  0.186066  544.406982  
1 -2.120790 -6.384640  -9.49066 -0.599976 -4.649020  579.661987  
2  4.615170  6.065980   5.46094  2.332760  4.618710  676.323975  
3  0.584595 -2.148440  -3.32928  1.069950 -0.955795  620.539002  
4 -3.208800 -9.338870 -10.89050 -0.280945 -5.929780  585.283997

[5 rows x 23 columns]
print(test.head())
  class      cost      ctmi   crv_arc  crv_plane   crv_pro   diss_10  \
0   wet  0.000000  19.30950 -0.041820     0.0000  0.041820  0.070967   
1   wet  0.000000   9.93015 -0.961718     0.0000  0.961718  0.091834   
2   wet  0.656767  11.06300 -0.041820     0.0000  0.041820  0.580664   
3   wet  0.171550  12.10270  0.000000     0.0000  0.000000  0.288487   
4   wet  0.000000   9.09360 -0.209177    -0.0295  0.179677  0.159990

    diss_20   diss_25    diss_5  ...  rough_25   rough_5   rough_a     slp_d  \
0  0.049314  0.042373  0.069436  ...   2.75714  1.060430  1.932930  0.114367   
1  0.054215  0.044775  0.072723  ...   2.20201  1.240590  1.796470  0.836766   
2  0.128384  0.117649  0.500000  ...   1.22992  0.318957  0.740689  0.053913   
3  0.084509  0.047502  0.000000  ...   1.48042  0.397072  0.863066  0.038122   
4  0.049411  0.048077  0.199988  ...   1.99719  0.556684  1.194620  1.158880

      sp_10     sp_20    sp_25      sp_5      sp_a       dem_m  
0 -2.442140 -6.259220 -8.46277 -0.819946 -4.496020  526.085022  
1 -2.483700 -4.499270 -5.17249 -1.204160 -3.339900  627.888000  
2  0.079712 -0.511047 -1.09052 -0.005432 -0.381821  628.497986  
3 -0.202637 -0.728271 -1.38171 -0.112488 -0.606277  628.497986  
4 -0.319031 -1.730900 -2.81525 -0.224976 -1.272540  628.903992

[5 rows x 23 columns]

Next, I separate out the dependent variable ("class") from the predictor variables to create a "train_y" object. To create the predictor-variable object, I drop the "class" column. I also drop "dem_m", which represents the elevation and will not be used as a predictor variable. This process is then repeated for the test data.

train_y = train['class']
train_x = train.drop(columns=['class', 'dem_m'])
test_y = test['class']
test_x = test.drop(columns=['class', 'dem_m'])
print(train_y.head())
print(train_x.head())
0    not
1    not
2    not
3    not
4    not
Name: class, dtype: object
         cost     ctmi   crv_arc  crv_plane   crv_pro   diss_10   diss_20  \
0  404.444000  9.72181 -0.376308  -0.270172  0.106136  0.596801  0.423932   
1   19.365499  7.81020 -0.501769  -0.095858  0.405911  0.127337  0.070422   
2  907.374023  5.83340  1.889750   0.630867 -1.258890  0.882726  0.776313   
3  407.984985  6.05410  0.668900   0.152399 -0.516501  0.490594  0.504769   
4  360.717987  6.85992 -0.878002  -0.237463  0.640539  0.421594  0.338215

    diss_25    diss_5    diss_a  ...  rough_20  rough_25   rough_5  rough_a  \
0  0.390846  0.559673  0.492813  ...   3.21397   3.54012  1.430060  2.59963   
1  0.067852  0.150943  0.104138  ...   2.61464   3.12002  0.944582  2.02356   
2  0.746689  0.974049  0.844944  ...   2.59621   2.67275  1.538140  2.21755   
3  0.486941  0.592066  0.518593  ...   2.88848   3.29430  1.741610  2.53107   
4  0.330672  0.464568  0.388762  ...   3.20197   3.46882  1.948190  2.76768

     slp_d     sp_10     sp_20     sp_25      sp_5      sp_a  
0  4.11556  0.823059  0.356201  -0.31958 -0.115417  0.186066  
1  2.78634 -2.120790 -6.384640  -9.49066 -0.599976 -4.649020  
2  5.02015  4.615170  6.065980   5.46094  2.332760  4.618710  
3  8.01975  0.584595 -2.148440  -3.32928  1.069950 -0.955795  
4  8.94219 -3.208800 -9.338870 -10.89050 -0.280945 -5.929780

[5 rows x 21 columns]

Machine learning algorithms commonly have parameters that are not learned during training and must instead be set by the user. These are known as hyperparameters. Default options are provided, but optimizing these settings may improve model performance.

How do you know what values to use? To select values or settings for these hyperparameters, it is necessary to test a variety of combinations. In this example, I use a grid search with k-fold cross-validation to assess model performance for every combination of a set of candidate values for each hyperparameter.

This requires me to:

  1. Create a dictionary with the parameter names and the values to test, provided as a list within the dictionary.
  2. Initiate a model, RF in this case.
  3. Define the grid search parameters; here I am specifically using 5-fold cross validation.

If you decide to execute this code, note that it may take some time as a large number of models have to be generated and assessed.

param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000]
}
rf = RandomForestClassifier()
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 5, n_jobs = -1, verbose = 2)

Now, I perform the grid search. A total of 1,440 models are fitted and evaluated using cross validation.

# Fit the grid search to the data
grid_search.fit(train_x, train_y)
Fitting 5 folds for each of 288 candidates, totalling 1440 fits

GridSearchCV(cv=5, estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'bootstrap': [True], 'max_depth': [80, 90, 100, 110],
                         'max_features': [2, 3], 'min_samples_leaf': [3, 4, 5],
                         'min_samples_split': [8, 10, 12],
                         'n_estimators': [100, 200, 300, 1000]},
             verbose=2)
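
As an aside, if the full grid search is too slow, a randomized search over the same candidate values is a common lighter-weight alternative. A minimal sketch is shown below (the 30 sampled combinations is an arbitrary choice); it is not used for the results in this module.

from sklearn.model_selection import RandomizedSearchCV

# Sample 30 hyperparameter combinations from param_grid rather than testing all 288.
rand_search = RandomizedSearchCV(estimator=RandomForestClassifier(),
                                 param_distributions=param_grid,
                                 n_iter=30, cv=5, n_jobs=-1,
                                 random_state=42, verbose=2)
rand_search.fit(train_x, train_y)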

Once the grid search is complete, I obtain the selected, optimal parameters from the combinations that were tested. I then retrieve the model refit with those parameters (best_estimator_) and apply it to predict the test data. When using the model to make a prediction, you only have to provide the predictor variables, "test_x" in this case.

grid_search.best_params_
{'bootstrap': True,
 'max_depth': 80,
 'max_features': 3,
 'min_samples_leaf': 3,
 'min_samples_split': 8,
 'n_estimators': 200}
rf_mod = grid_search.best_estimator_
print(rf_mod)
rfc_pred = rf_mod.predict(test_x)
RandomForestClassifier(max_depth=80, max_features=3, min_samples_leaf=3,
                       min_samples_split=8, n_estimators=200)

Once predictions are made for the new, withheld test data, assessment metrics can be generated by comparing the predictions to the actual or correct classifications.

First, I generate a confusion matrix. Of the 2,000 test points, only 74 were misclassified, suggesting strong model performance. Next, I generate a variety of assessment metrics using the classification_report() function. Calculated metrics include precision, recall, and the F1 score.

If you run this code, you may get a slightly different answer. This is because there is some randomness in the model. To ensure reproducibility, you can set a random seed.
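
For example, recreating the classifier with a fixed random_state (the value 42 is an arbitrary choice) before running the grid search makes the random components of the workflow repeatable:

# A fixed random_state makes bootstrap sampling and feature selection in the
# forest repeatable, so re-running the workflow gives the same confusion matrix.
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=2)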

print(confusion_matrix(test_y,rfc_pred))
[[943  57]
 [ 17 983]]
print(classification_report(test_y, rfc_pred))
              precision    recall  f1-score   support

         not       0.98      0.94      0.96      1000
         wet       0.95      0.98      0.96      1000

    accuracy                           0.96      2000
   macro avg       0.96      0.96      0.96      2000
weighted avg       0.96      0.96      0.96      2000

In order to assess the model using ROC and AUC, I need to obtain the predicted class probabilities as opposed to the "hard" or binary classification. This can be accomplished using predict_proba(), which returns a probability for both classes ("not" and "wet") that sums to 1 for each sample. Larger probabilities for the "wet" class indicate that the algorithm predicted that the site was likely to be a wetland based on its local topographic characteristics.

rfc_prob = rf_mod.predict_proba(test_x)

An ROC curve is easiest to calculate when the positive class is coded as 1 and the absence or negative class is coded as 0. So, I create a new version of "test_y" where "wet" has been coded to 1 and "not" has been coded to 0. This is accomplished using list comprehension.

In the last line of code, I calculate the AUC score using the probabilities from the second column (index 1). This is because we are interested in the probabilities for the positive class as opposed to the negative or absence class.

A large AUC of 0.99 is obtained, suggesting strong performance.

test_y2 = test_y.to_numpy()
my_dict = {"not":0, "wet":1}
test_y3 = np.asarray([my_dict[zi] for zi in test_y2])
roc_auc_score(test_y3, rfc_prob[:,1])
0.9911509999999999

Next, I calculate the ROC curve and plot the result using matplotlib. The roc_curve() function produces multiple outputs, which can be accessed separately using tuple unpacking.

fpr, tpr, _ = roc_curve(test_y3,  rfc_prob[:, 1])
plt.plot(fpr,tpr, color ='red', lw=2, label="All Variables")
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([-.05, 1.0])
plt.ylim([-.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.show()

[Figure: ROC curve for the random forest model trained on all terrain variables]

Compare Feature Spaces

Now that we have worked through the entire process, I will provide some examples of comparing different models and different algorithms. In this example specifically, we will assess three different models that use different sets of local terrain features:

  1. All terrain variables
  2. Just topographic slope
  3. Topographic slope + Slope position

First, I create the data subsets. When creating the data for just slope, I have to explicitly define the output as a DataFrame. Since only one column is present, the default is to output a Series.

train_y = train['class']
train_x = train.drop(columns=['class', 'dem_m'])
test_y = test['class']
test_x = test.drop(columns=['class', 'dem_m'])

train_x_slp = pd.DataFrame(train['slp_d'])
test_x_slp = pd.DataFrame(test['slp_d'])

train_x_slp_sp = train[['slp_d', 'sp_a']]
test_x_slp_sp = test[['slp_d', 'sp_a']]

Next, I fit the three different models. I am not optimizing the hyperparameters for the sake of time; however, for a more robust experiment, each model should be optimized separately (see the sketch after the next code block).

rf1 = RandomForestClassifier(n_estimators=200)
rf_model1 = rf1.fit(train_x, train_y)
rf2 = RandomForestClassifier(n_estimators=200)
rf_model2 = rf2.fit(train_x_slp, train_y)
rf3 = RandomForestClassifier(n_estimators=200)
rf_model3 = rf3.fit(train_x_slp_sp, train_y)
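
For reference, if you did want to tune each feature set separately, each fit could be wrapped in its own grid search. The sketch below uses a small, illustrative grid, and these tuned models are not used for the results that follow.

# Illustrative per-model tuning; the grid values are arbitrary examples.
small_grid = {'n_estimators': [200, 500], 'max_features': ['sqrt', None]}
gs_all = GridSearchCV(RandomForestClassifier(), small_grid, cv=5, n_jobs=-1).fit(train_x, train_y)
gs_slp = GridSearchCV(RandomForestClassifier(), small_grid, cv=5, n_jobs=-1).fit(train_x_slp, train_y)
gs_slp_sp = GridSearchCV(RandomForestClassifier(), small_grid, cv=5, n_jobs=-1).fit(train_x_slp_sp, train_y)
# Each best_estimator_ is the model refit with the selected settings.
rf_model1_tuned = gs_all.best_estimator_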

Once the models are fitted, I use them to predict the test data and assess the results using the classification metrics. The model with all the terrain parameters performs best, followed by slope + slope position. The slope-only model is the poorest; however, precision, recall, and the F1 score are all still above 0.85, suggesting that topographic slope is an important predictor of wetland occurrence. The inclusion of additional predictor variables does improve performance; adding only slope position increases the overall accuracy by six percentage points (0.86 to 0.92).

pred_model1= rf_model1.predict(test_x)
pred_model2 = rf_model2.predict(test_x_slp)
pred_model3 = rf_model3.predict(test_x_slp_sp)
print(classification_report(test_y, pred_model1))
print(classification_report(test_y, pred_model2))
print(classification_report(test_y, pred_model3))
              precision    recall  f1-score   support

         not       0.98      0.95      0.96      1000
         wet       0.95      0.98      0.96      1000

    accuracy                           0.96      2000
   macro avg       0.96      0.96      0.96      2000
weighted avg       0.96      0.96      0.96      2000

              precision    recall  f1-score   support

         not       0.87      0.86      0.86      1000
         wet       0.86      0.87      0.86      1000

    accuracy                           0.86      2000
   macro avg       0.86      0.86      0.86      2000
weighted avg       0.86      0.86      0.86      2000

              precision    recall  f1-score   support

         not       0.92      0.93      0.92      1000
         wet       0.93      0.92      0.92      1000

    accuracy                           0.92      2000
   macro avg       0.92      0.92      0.92      2000
weighted avg       0.92      0.92      0.92      2000

In order to further explore and compare the results, I calculate the AUC measures and the ROC curves. Using matplotlib, I plot all the curves in the same graph space for comparison.

Again, these metrics suggest that using all variables provides the strongest performance followed by slope + slope position.

test_y2 = test_y.to_numpy()
my_dict = {"not":0, "wet":1}
test_y3 = np.asarray([my_dict[zi] for zi in test_y2])

rfc_prob1 = rf_model1.predict_proba(test_x)
rfc_prob2 = rf_model2.predict_proba(test_x_slp)
rfc_prob3 = rf_model3.predict_proba(test_x_slp_sp)

print("All Variables: " + str(roc_auc_score(test_y3, rfc_prob1[:,1])))
print("Just Slope: " + str(roc_auc_score(test_y3, rfc_prob2[:,1])))
print("Slope + Slope Position: " + str(roc_auc_score(test_y3, rfc_prob3[:,1])))                       
All Variables: 0.9905974999999999
Just Slope: 0.9295705
Slope + Slope Position: 0.9676729999999999
print(rfc_prob1)
[[0.005 0.995]
 [0.02  0.98 ]
 [0.005 0.995]
 ...
 [0.995 0.005]
 [0.95  0.05 ]
 [1.    0.   ]]
fpr1, tpr1, _ = roc_curve(test_y3,  rfc_prob1[:, 1])
fpr2, tpr2, _ = roc_curve(test_y3,  rfc_prob2[:, 1])
fpr3, tpr3, _ = roc_curve(test_y3,  rfc_prob3[:, 1])
plt.plot(fpr1,tpr1, color ='red', lw=2, label="All Variables")
plt.plot(fpr2,tpr2, color ='green', lw=2, label="Slope")
plt.plot(fpr3,tpr3, color ='blue', lw=2, label="Slope + Slope Position")
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([-.05, 1.0])
plt.ylim([-.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Compare Feature Space')
plt.legend(loc="lower right")
plt.show()

[Figure: ROC curves comparing the three feature spaces (all variables, slope only, and slope + slope position)]

Compare Algorithms

We will now compare three different machine learning algorithms: RF, k-NN, and SVM. Specifically, we will only use the topographic slope and slope position predictor variables.

In this first set of code, I prepare the training and testing data. RF is generally not sensitive to the scale of the predictor variables, which is also true of other tree-based models, such as single decision trees and boosted decision trees. However, SVM, a kernel-based method, and k-NN, a distance-based method, are affected by scale. So, it is necessary to rescale the data prior to using them in the prediction.

This is accomplished using StandardScaler(). The scaler is fit to the training data only, and the resulting parameters are used to rescale both the training and testing data. This prevents the model from being biased by seeing a representation of the testing data prior to training and ensures that the data sets share a common transformation.

All subsequent modeling is conducted on the rescaled data.

train_x_sub = train[['slp_d', 'sp_a']]
test_x_sub = test[['slp_d', 'sp_a']]

scaler = StandardScaler()
scale_fit = scaler.fit(train_x_sub)
train_x2 = pd.DataFrame(scale_fit.transform(train_x_sub))
train_x2.columns = ['slp_d', 'sp_a']
test_x2 = pd.DataFrame(scale_fit.transform(test_x_sub))
test_x2.columns = ['slp_d', 'sp_a']
print(train_x_sub.head())
print(train_x2.head())
     slp_d      sp_a
0  4.11556  0.186066
1  2.78634 -4.649020
2  5.02015  4.618710
3  8.01975 -0.955795
4  8.94219 -5.929780
      slp_d      sp_a
0 -0.520301  0.237190
1 -0.667147 -0.580154
2 -0.420365  0.986504
3 -0.088983  0.044165
4  0.012924 -0.796660
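
As an aside, an alternative to rescaling by hand is scikit-learn's Pipeline, which bundles the scaler and a classifier so that the scaler is always fit only on the data the pipeline is trained on. A minimal sketch, not used for the results below:

from sklearn.pipeline import make_pipeline

# The pipeline fits StandardScaler on the training data and reuses that fit when
# transforming any data passed to predict(), which avoids test-set leakage.
svm_pipe = make_pipeline(StandardScaler(), SVC(probability=True))
svm_pipe.fit(train_x_sub, train_y)
pipe_pred = svm_pipe.predict(test_x_sub)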

Once the data are prepared, I initialize the three models and then train them using the rescaled training data. Again, the models should be optimized for a more rigorous experiment; however, I am not doing so here to save time.

rf_comp = RandomForestClassifier(n_estimators=200)
knn_comp = KNeighborsClassifier(n_neighbors=7)
# probability=True is required so that SVC supports predict_proba() below
svm_comp = SVC(probability=True)

rf_compM = rf_comp.fit(train_x2, train_y)
knn_compM = knn_comp.fit(train_x2, train_y)
svm_compM = svm_comp.fit(train_x2, train_y)

I then predict to the rescaled test data using all three models and compare them using classification metrics (accuracy, precision, recall, and the F1 score) and ROC curves.

The results suggest that all three models have similar performance for this specific problem.

pred_rf_comp= rf_compM.predict(test_x2)
pred_knn_comp = knn_compM.predict(test_x2)
pred_svm_comp = svm_compM.predict(test_x2)
print(classification_report(test_y, pred_rf_comp))
print(classification_report(test_y, pred_knn_comp))
print(classification_report(test_y, pred_svm_comp))
              precision    recall  f1-score   support

         not       0.92      0.92      0.92      1000
         wet       0.92      0.92      0.92      1000

    accuracy                           0.92      2000
   macro avg       0.92      0.92      0.92      2000
weighted avg       0.92      0.92      0.92      2000

              precision    recall  f1-score   support

         not       0.93      0.91      0.92      1000
         wet       0.91      0.93      0.92      1000

    accuracy                           0.92      2000
   macro avg       0.92      0.92      0.92      2000
weighted avg       0.92      0.92      0.92      2000

              precision    recall  f1-score   support

         not       0.92      0.92      0.92      1000
         wet       0.92      0.92      0.92      1000

    accuracy                           0.92      2000
   macro avg       0.92      0.92      0.92      2000
weighted avg       0.92      0.92      0.92      2000
test_y2 = test_y.to_numpy()
my_dict = {"not":0, "wet":1}
test_y3 = np.asarray([my_dict[zi] for zi in test_y2])

rf_comp_prob = rf_compM.predict_proba(test_x2)
knn_comp_prob = knn_compM.predict_proba(test_x2)
svm_comp_prob = svm_compM.predict_proba(test_x2)

print("RF: " + str(roc_auc_score(test_y3, rf_comp_prob[:,1])))
print("K-NN: " + str(roc_auc_score(test_y3, knn_comp_prob[:,1])))
print("SVM: " + str(roc_auc_score(test_y3, svm_comp_prob[:,1])))   
RF: 0.966986
K-NN: 0.9650399999999999
SVM: 0.966986
fpr1, tpr1, _ = roc_curve(test_y3,  rf_comp_prob[:, 1])
fpr2, tpr2, _ = roc_curve(test_y3,  knn_comp_prob[:, 1])
fpr3, tpr3, _ = roc_curve(test_y3,  svm_comp_prob[:, 1])
plt.plot(fpr1,tpr1, color ='red', lw=2, label="RF")
plt.plot(fpr2,tpr2, color ='green', lw=2, label="K-NN")
plt.plot(fpr3,tpr3, color ='blue', lw=2, label="SVM")
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([-.05, 1.0])
plt.ylim([-.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Compare Algorithms')
plt.legend(loc="lower right")
plt.show()

[Figure: ROC curves comparing the RF, k-NN, and SVM models]

Predict to Raster Data

Since this class focuses on geospatial data analysis, it is important to discuss how trained models can be applied to geospatial data. Again, I recommend that you do your data preparation in GIS software, such as making your training and testing data tables. However, you will need to apply the trained models to your spatial data in Python.

There are several ways to accomplish this. Here, I am relying on Pyspatialml. Currently, this is an experimental package that is only available on GitHub.

Using the Raster() function from this package, I read in a grid stack of the predictor variables. I have generally found that it is easier to feed in the predictor variables as a single grid stack, prepared in GIS software, to avoid issues with cell misalignment or differing numbers of rows and columns; these mismatches are more difficult to resolve in Python.

I then use the names property to list the band names. Since the code needs to know which band corresponds to which predictor variable from the training table, I rename the bands to match the table. This is accomplished using the Pyspatialml method rename(), which accepts a dictionary of current and new name pairs. This requires you to know which band corresponds to which predictor variable, so keep this in mind when you are generating grid stacks.

r_preds = pml.Raster("D:/data/ml/predictors.img")
print(r_preds.names)
r_preds.rename({'predictors_1':"cost", 
               'predictors_2':"crv_arc", 
               'predictors_3':"crv_plane", 
               'predictors_4': "crv_pro", 
               'predictors_5':"ctmi", 
               'predictors_6':"diss_5", 
               'predictors_7':"diss_10", 
               'predictors_8':"diss_20", 
               'predictors_9':"diss_25", 
               'predictors_10':"diss_a", 
               'predictors_11':"rough_5", 
               'predictors_12':"rough_10", 
               'predictors_13':"rough_20", 
               'predictors_14':"rough_25", 
               'predictors_15':"rough_a", 
               'predictors_16':"slp_d", 
               'predictors_17':"sp_5", 
               'predictors_18':"sp_10", 
               'predictors_19':"sp_20", 
               'predictors_20':"sp_25", 
               'predictors_21':"sp_a"})
print(r_preds.names)
dict_keys(['predictors_1', 'predictors_2', 'predictors_3', 'predictors_4', 'predictors_5', 'predictors_6', 'predictors_7', 'predictors_8', 'predictors_9', 'predictors_10', 'predictors_11', 'predictors_12', 'predictors_13', 'predictors_14', 'predictors_15', 'predictors_16', 'predictors_17', 'predictors_18', 'predictors_19', 'predictors_20', 'predictors_21'])
dict_keys(['predictors_1', 'predictors_2', 'predictors_3', 'predictors_4', 'predictors_5', 'predictors_6', 'predictors_7', 'predictors_8', 'predictors_9', 'predictors_10', 'predictors_11', 'predictors_12', 'predictors_13', 'predictors_14', 'predictors_15', 'predictors_16', 'predictors_17', 'predictors_18', 'predictors_19', 'predictors_20', 'predictors_21'])

Once the raster stack is prepared, I predict the probability at each grid cell using the RF model trained with all the predictor variables. Since I want a probabilistic output, I use predict_proba() as opposed to predict().

Once the result is obtained, I write it to disk.

result = r_preds.predict_proba(estimator=rf_mod)
result.write("D:/data/out/wet_pred.tif")
Raster Object Containing 2 Layers
    attribute                                             values
0       names                                   [prob_0, prob_1]
1       files  [D:/data/out/wet_pred.tif, D:/data/out/wet_pre...
2        rows                                               1881
3        cols                                               1787
4         res                                         (9.0, 9.0)
5  nodatavals  [-3.4028234663852886e+38, -3.4028234663852886e...

Lastly, I read the file back in with Rasterio and plot the result using Matplotlib.

This prediction is specifically for palustrine forested (PFO) wetlands, so it would be good to mask the results so that predictions are only made within forested extents. This can be accomplished using GIS software and an overlay with a forest-extent data set (a Python alternative is sketched after the figure below).

m_result = rio.open("D:/data/out/wet_pred.tif")
m_result_arr = m_result.read(2)
plt.rcParams['figure.figsize'] = [10, 8]
plt.imshow(m_result_arr, cmap="Blues", vmin=0, vmax=1)
<matplotlib.image.AxesImage at 0x1cd9cb50910>

[Figure: mapped probability of wetland occurrence, plotted with the Blues color map]
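
If you would rather apply the forest mask in Python, the sketch below uses Rasterio and NumPy; the forest-extent raster path and its 1/0 coding are assumptions, and the mask is assumed to share the same grid as the prediction.

# Hypothetical forest mask: 1 = forest, 0 = non-forest, on the same grid as the prediction.
with rio.open("D:/data/out/wet_pred.tif") as pred_src, \
     rio.open("D:/data/masks/forest.tif") as mask_src:
    prob_wet = pred_src.read(2)   # band 2 = probability of "wet"
    forest = mask_src.read(1)
    profile = pred_src.profile
    nodata = pred_src.nodata

# Keep predictions only within the forest extent; everything else becomes NoData.
masked = np.where(forest == 1, prob_wet, nodata).astype('float32')

profile.update(count=1)
with rio.open("D:/data/out/wet_pred_masked.tif", 'w', **profile) as dst:
    dst.write(masked, 1)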

Concluding Remarks

My goal here was to provide a practical introduction to using scikit-learn for machine learning-based predictive modeling. You should now have a general understanding of how to prepare data, optimize algorithms, train models, and assess model performance. If you need to predict back to raster grids, I would suggest Pyspatialml, which is available on GitHub.