3  General Data Preprocessing

3.1 Topics Covered

  1. Exploratory data analysis via summarization and graphing
  2. Feature engineering with the recipes package
  3. Principal component analysis (PCA) on tabulated data
  4. Exploration of PCA results

3.2 Introduction

The first two chapters focused on preprocessing geospatial raster data specifically. We now turn our attention to more general data preprocessing and feature engineering. In the first section of the chapter, we focus on general data summarization, visualization, and exploration. Next, we introduce the recipes package, which is part of the tidymodels metapackage, for general data preprocessing. Lastly, we more generally explore principal component analysis (PCA).

For general data wrangling, we load in the tidyverse and reshape2 packages. recipes is loaded for data preprocessing, corrplot and GGally for creating correlation visualizations, and gt for printing tables. factoextra is used for visualizing principal component analysis results.
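
The corresponding library() calls are shown below as a minimal sketch; this assumes each of the packages named above has already been installed.

# Load the packages used throughout this chapter
library(tidyverse)
library(reshape2)
library(recipes)
library(corrplot)
library(GGally)
library(gt)
library(factoextra)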

The fourth section of the text, the appendix, includes an entire chapter devoted to data wrangling with dplyr (Chapter 18), two chapters devoted to making graphs with ggplot2 (Chapters 20 and 21), and a chapter on building tables with gt (Chapter 22). If you find the examples in this chapter difficult to follow, we recommend reviewing or working through the appendix material first.

In this chapter we make use of the EuroSat dataset, which is available on GitHub and Kaggle and was introduced in an associated research paper. The version of the dataset used here consists of 27,597 samples labeled into 10 categories: annual crop, forest, herbaceous vegetation, highway, industrial, pasture, permanent crop, residential, river, and sea/lake. The input data are 64-by-64 pixel image chips generated from the Sentinel-2 Multispectral Instrument (MSI) sensor, and we use 10 of the 13 bands: blue, green, red, red edge 1, red edge 2, red edge 3, near infrared (NIR), NIR (narrow), shortwave infrared 1 (SWIR1), and SWIR2. The imagery has a spatial resolution of 10 m, so each chip covers an area of 640-by-640 m. All of the data are from European countries. Since we are working with tabulated data in this chapter, each 64-by-64 pixel chip has been aggregated to 10 band means.

We begin by (1) saving our folder path to a variable, (2) defining a vector of colors using hex codes for visualizing the different land cover classes, and (3) reading in the tabulated version of the EuroSat dataset. These data have already been partitioned into training, validation, and test subsets and are read using read_csv() from readr. We also convert all character columns to factors and drop unneeded columns. After these processing steps, each tibble consists of a factor column that provides the land cover class for each sample (“class”) and 10 columns representing the image band means for the blue, green, red, red edge 1, red edge 2, red edge 3, NIR, narrow NIR, SWIR1, and SWIR2 bands.

fldPth <- "gslrData/chpt3/data/"

classColors <- c("#ff7f0e","#2ca02c","#8c564b","#d62728","#9467bd","#bcbd22","#e377c2","#7f7f7f","#1f77b4","#17becf")

euroSatTrain <- read_csv(str_glue("{fldPth}train_aggregated.csv")) |> 
  mutate_if(is.character, as.factor) |>
  select(-1, -code)

euroSatTest <- read_csv(str_glue("{fldPth}test_aggregated.csv")) |> 
  mutate_if(is.character, as.factor) |>
  select(-1, -code)

euroSatVal <- read_csv(str_glue("{fldPth}val_aggregated.csv")) |> 
  mutate_if(is.character, as.factor) |>
  select(-1, -code)

3.3 Explore Data

It is generally a good idea to explore your data prior to performing any preprocessing or feature engineering and before using them in a modeling workflow. For classification tasks, and especially multiclass classification, class imbalance is often a concern. Using dplyr, we count the number of records or rows belonging to each class. The number of samples per class ranges from 1,200 (pasture) to 2,158 (sea/lake), so there does not appear to be a high degree of class imbalance. However, if class imbalance were an issue, methods to combat it could be considered, such as oversampling the minority class, generating synthetic samples, or using a class-weighted loss metric.

If you are incorporating categorical predictor variables, it is also good practice to count the number of occurrences of each factor level for those variables. If there are a large number of levels and/or some factor levels have a small number of occurrences, you may want to recode the variable. Also, some algorithms, such as artificial neural networks (ANNs) and support vector machines (SVMs), cannot natively accept categorical predictor variables. As a result, such variables need to be converted to a different representation, such as dummy variables. In contrast, some algorithms, such as tree-based methods, can natively accept categorical predictor variables.

library(dplyr)
euroSatTrain |>
  group_by(class) |>
  summarize(cnt = n()) |>
  ungroup() |> gt()
class cnt
AnnualCrop 1800
Forest 1800
HerbaceousVegetation 1800
Highway 1500
Industrial 1500
Pasture 1200
PermanentCrop 1500
Residential 1800
River 1500
SeaLake 2158
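
Although imbalance is mild here, one of the mitigations mentioned above is a class-weighted loss. The sketch below computes hypothetical inverse-frequency ("balanced") class weights; the weights are illustrative only and are not used elsewhere in this chapter.

# Hypothetical balanced class weights: total samples / (number of classes * class count)
classWeights <- euroSatTrain |>
  count(class) |>
  mutate(weight = sum(n) / (n() * n))
classWeights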

It is also good to explore the distribution of each continuous predictor variable. If the data are heavily skewed, you may choose to perform a data transformation, such as a logarithmic or square root transform. The need for this will vary based on the learning algorithm being used.
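
If a transformation were warranted, it could be applied directly with dplyr. The sketch below applies a natural log transform to all numeric band columns; it is illustrative only, and the transformed table is not used elsewhere in this chapter.

# Illustrative natural log transform of all numeric band columns (band means are positive)
euroSatTrainLog <- euroSatTrain |>
  mutate(across(where(is.numeric), log))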

Below, we have created histograms to show the distribution of each image band. Since the tibble is not in the correct shape to create the graph, melt() from reshape2 is used to convert the table into a long format with the band values stored in the “value” column, the name of the band stored in the “variable” column, and the class name retained in the “class” column, which is specified as an ID variable via the id.vars argument of melt().

euroSatTrain |> melt(id.vars = "class") |>
  ggplot(aes(x=value))+
  geom_histogram()+
  facet_wrap(vars(variable))+
  theme_bw(base_size=12)

We can also explore the distribution of each predictor variable by class. In our example, we are making use of a box plot via geom_boxplot(). It is also possible to use a violin plot (geom_violin()). Assigning the class to the x-axis position allows for differentiating each class to generate grouped box plots. Since the data are not in the correct shape, melt() is used again. Generally, the visualization suggests that there are differences in the distribution and combination of channel digital number (DN) values between the different classes; as a result, supervised learning algorithms may be able to accurately differentiate the classes of interest.

euroSatTrain |> melt(id.vars = "class") |>
  ggplot(aes(x=class, y=value, fill=class))+
  geom_boxplot()+
  facet_wrap(vars(variable), ncol=3)+
  scale_fill_manual(values = classColors)+
  theme_bw(base_size=10)+
  theme(axis.text.x = element_text(size=8, 
                                   color="gray20", 
                                   angle=90, 
                                   hjust=1))

Correlation coefficients can be used to characterize the degree of correlation between pairs of numeric variables. As configured below, the cor() function returns the Pearson correlation coefficient, which is a measure of linear correlation. This function can also calculate Kendall and Spearman correlations, which use ranks and assess monotonic correlation more generally. The corrplot package provides a variety of plots for visualizing pair-wise correlation. As configured, our example shows the Pearson correlation coefficient in the lower triangle and a circular symbol representing the correlation in the upper triangle. Darker shades of blue indicate increasing positive linear correlation while darker shades of red indicate increasing negative linear correlation. The size of the symbol indicates the magnitude of the correlation but not its direction, which is conveyed by the color. This article provides additional examples of correlation plots generated using corrplot.

cor(euroSatTrain[,2:11]) |> 
  data.frame() |> 
  rownames_to_column("band") |>
  mutate(across(where(is.numeric), ~ round(.x, digits = 3))) |>
  gt()
band blue green red red_edge1 red_edge2 red_edge3 NIR NIR_Narrow swir1 swir2
blue 1.000 0.932 0.885 0.777 0.263 0.152 0.133 0.036 0.703 0.131
green 0.932 1.000 0.959 0.919 0.476 0.368 0.351 0.211 0.824 0.346
red 0.885 0.959 1.000 0.942 0.452 0.342 0.330 0.213 0.888 0.330
red_edge1 0.777 0.919 0.942 1.000 0.695 0.596 0.591 0.471 0.924 0.587
red_edge2 0.263 0.476 0.452 0.695 1.000 0.989 0.987 0.758 0.643 0.984
red_edge3 0.152 0.368 0.342 0.596 0.989 1.000 0.998 0.741 0.548 0.997
NIR 0.133 0.351 0.330 0.591 0.987 0.998 1.000 0.774 0.544 0.998
NIR_Narrow 0.036 0.211 0.213 0.471 0.758 0.741 0.774 1.000 0.446 0.744
swir1 0.703 0.824 0.888 0.924 0.643 0.548 0.544 0.446 1.000 0.543
swir2 0.131 0.346 0.330 0.587 0.984 0.997 0.998 0.744 0.543 1.000
eurosatCorr <- cor(euroSatTrain[,2:11])
corrplot.mixed(eurosatCorr, order = 'AOE')
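
For comparison, a rank-based correlation matrix can be obtained by changing the method argument of cor(); the short sketch below computes Spearman correlations for the same bands.

# Spearman (rank-based) correlations for the 10 band columns
cor(euroSatTrain[,2:11], method = "spearman") |> round(3)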

Another option for visualizing pair-wise variable correlation is a scatter plot matrix. The ggpairs() function from the GGally package allows for creating a scatter plot matrix using ggplot2. The lower triangle provides the scatter plot for each pair while the upper triangle provides the associated Pearson correlation coefficient. The main diagonal characterizes the distribution of each variable using a kernel density plot.

set.seed(42)
euroSatTrain |> 
  sample_n(200) |>
  ggpairs(columns = c(2:11),
          axisLabels = "none")+
  theme_bw(base_size=18)

Generally, the correlation measures and visualizations suggest a high degree of correlation between the image bands. For example, the red edge 2 and 3 bands are highly correlated with each other and also with the NIR band. More generally, bands tend to be positively correlated.

It may also be informative to calculate summary statistics for each numeric predictor. In the next example, we calculate the mean, standard deviation, median, and interquartile range (IQR) for all image bands and reshape the result with pivot_longer() so that each band is summarized on its own row.

euroSatTrain |>
  summarise(
    across(
      where(is.numeric),
      list(mean = ~ mean(.x, na.rm = TRUE),
           sd   = ~ sd(.x,   na.rm = TRUE),
           median = ~ median(.x, na.rm = TRUE),
           iqr = ~ IQR(.x, na.rm = TRUE)),
      .names = "{.col}_{.fn}"
    )
  ) |>
  pivot_longer(everything(),
               names_to = c("band", ".value"),
               names_pattern = "(.*)_(mean|sd|median|iqr)") |>
  gt()
band mean sd median iqr
blue 1116.076 255.5433 1082.944 361.6446
green 1035.312 312.2988 996.797 395.2883
red 937.8773 479.8577 853.9935 675.3386
red_edge1 1184.292 500.9336 1163.053 621.9081
red_edge2 1971.095 779.6658 2099.205 822.2243
red_edge3 2334.381 976.8461 2464.24 1015.767
NIR 2262.04 970.0852 2394.394 1018.417
NIR_Narrow 724.9527 384.4926 708.1434 444.3475
swir1 1103.834 670.8869 1080.137 943.1074
swir2 2552.018 1120.9 2713.485 1164.891

Missing data can also be an issue. Our code below confirms that there are no missing values in the table. If there were missing values, it may be necessary to remove rows with missing data or attempt to fill or impute the missing values.

euroSatTrain |>
  summarise(
    across(
      where(is.numeric),
      ~ sum(is.na(.x)),
      .names = "{.col}_missing"
    )
  ) |> gt()
blue_missing green_missing red_missing red_edge1_missing red_edge2_missing red_edge3_missing NIR_missing NIR_Narrow_missing swir1_missing swir2_missing
0 0 0 0 0 0 0 0 0 0
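
If missing values had been present, one simple option would be to drop incomplete rows with tidyr, as sketched below; imputation with a recipes step (discussed in the next section) is an alternative.

# Hypothetical handling of missing data: drop any rows containing NA values
euroSatTrainComplete <- euroSatTrain |>
  drop_na()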

3.4 Recipes

The recipes package is part of the tidymodels metapackage, and its primary purpose is to support data preprocessing and feature engineering within machine learning workflows. We have found it to be generally very useful for data manipulation and use it regularly even outside of tidymodels.

It implements a variety of functions for manipulating single numeric variables, single categorical/nominal predictor variables, and multiple numeric predictor variables. It also provides functions for performing imputation, or estimating missing values. Table 3.1 lists some of the provided functions along with brief explanations of their uses. As a few examples, a logarithmic transformation can be applied to a single predictor variable using a user-defined base via the step_log() function. Non-negative numeric predictor variables can be transformed to a more normal distribution via a Box-Cox transformation (step_BoxCox()) while positive and negative values can be transformed using a Yeo-Johnson transformation (step_YeoJohnson()). step_normalize() rescales data to have a mean of zero and a standard deviation of one. For categorical data, dummy variables can be created using step_dummy(). Feature reduction can be implemented using principal component analysis (PCA) (step_pca()), kernel-PCA (kPCA) (step_kpca()), or independent component analysis (ICA) (step_ica()).

Variables with no variance (step_zv()) or near zero variance (step_nzv()) can be removed, along with variables highly correlated with other variables (step_corr()). Lastly, a variety of imputation methods for missing data are provided, including bagged trees (step_impute_bag()), k-nearest neighbor (step_impute_knn()), and linear models (step_impute_linear()). Alternatively, missing values can be filled with the mean (step_impute_mean()), median (step_impute_median()), or mode (step_impute_mode()) of the available values.

In addition to these functions, a variety of helper functions are provided to select the variables on which to perform transformations. For example, all_numeric() selects all numeric variables while all_numeric_predictors() selects only numeric predictor variables. All predictor variables, regardless of their data type, can be selected with all_predictors(). All predictor variables with a factor data type can be selected using all_factor_predictors().
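
As a quick, illustrative sketch of how the selectors are used, the recipe below normalizes only the numeric predictors and then removes any zero-variance predictors; it is not part of the examples that follow.

# Illustrative use of selector helpers within a recipe
recipe(class ~ ., data = euroSatTrain) |>
  step_normalize(all_numeric_predictors()) |>
  step_zv(all_predictors())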

If you are interested in learning more about preprocessing and feature engineering using recipes, please consult the package documentation, which includes examples along with descriptions of all provided functions.

Table 3.1. Preprocessing/feature engineering functions from the recipes package.
Individual Numeric Variable
  step_center(): Center to a mean of 0
  step_normalize(): Convert to a mean of 0 and standard deviation of 1
  step_range(): Convert data to a desired range (min-max rescaling)
  step_log(): Logarithmic transform using a user-defined base
  step_inverse(): Take the inverse of values
  step_sqrt(): Take the square root of values
  step_BoxCox(): Used for non-negative values; makes data more normally distributed
  step_YeoJohnson(): Used for negative and positive values; makes data more normally distributed

Categorical/Nominal Variable
  step_dummy(): Create dummy variables
  step_num2factor(): Convert numeric data to a factor data type
  step_ordinalscore(): Convert ordinal data to numeric scores
  step_string2factor(): Convert character/string data to a factor data type

Multiple Numeric Variables
  step_interact(): Generate interaction variables
  step_pca(): Convert to principal components
  step_kpca(): Perform kernel-PCA
  step_ica(): Perform independent component analysis
  step_pls(): Perform partial least squares feature extraction
  step_spatialsign(): Perform spatial sign processing
  step_corr(): Remove highly correlated variables
  step_zv(): Remove variables with zero variance
  step_nzv(): Remove variables with near zero variance
  step_lincomb(): Remove variables that are linear combinations of other variables

Imputation
  step_filter_missing(): Remove predictor variables with missing values
  step_impute_bag(): Estimate missing values using bagged trees
  step_impute_knn(): Estimate missing values using k-nearest neighbor
  step_impute_linear(): Estimate missing values using a linear model
  step_impute_mean(): Fill missing values with the mean of existing values
  step_impute_median(): Fill missing values with the median of existing values
  step_impute_mode(): Fill missing values with the mode of existing values

A data preprocessing pipeline is generally implemented using recipes as follows:

  1. The modeling problem and pipeline are defined using recipe().
  2. The recipe is prepared using prep().
  3. The prepared recipe is applied to data using bake().

We have provided three example workflows below.

3.4.1 Example 1

The goal of this first workflow is simply to rescale all numeric predictor variables to a range of zero to one via min-max rescaling, which is implemented in recipes via step_range(). The first step is to define the recipe. The model or problem is defined using the following formula syntax: class ~ . This reads as “class is predicted by all other columns in the data table.” We also specify the data table (data=euroSatTrain). It is necessary to define the formula so that the role of each variable is clear. In other words, defining the model indicates that “class” is the response or dependent variable while all other columns are predictor variables. Next, we indicate that min-max rescaling should be performed using step_range() for all numeric predictor variables using all_numeric_predictors(). Since all of the predictor variables are numeric, this yields the same result as all_predictors() in this case. We also specify to rescale to a range of zero to one.

Once the recipe is defined, it must be prepared using prep(). It is important to note that we are only using the training set here. Data transformations should be specified using only the training set. Using all the data, or the validation or test samples, is considered a data leak since using these data during the preprocessing stage allows the model to learn something about the withheld data.

myRecipe <- recipe(class ~ ., data = euroSatTrain) |>
  step_range(all_numeric_predictors(), min=0, max=1)

myRecipePrep <- prep(myRecipe, training=euroSatTrain)

Once the recipe is prepared, it can be applied to each dataset using bake(). If new_data=NULL, the data used to define the recipe, in this case the training set, are processed.

trainProcessed <- bake(myRecipePrep, new_data=NULL)
valProcessed <- bake(myRecipePrep, new_data=euroSatVal)
testProcessed  <- bake(myRecipePrep, new_data=euroSatTest)

To verify that the data have been rescaled, we replicate the histograms created above. This confirms that all bands are now scaled between zero and one.

trainProcessed |> melt(id.vars = "class") |>
  ggplot(aes(x=value))+
  geom_histogram()+
  facet_wrap(vars(variable))+
  theme_bw(base_size=12)
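
As an additional numeric check, the minimum and maximum of each processed band can be summarized directly; this sketch reuses the across() pattern shown earlier in the chapter.

# Confirm the processed bands now range from 0 to 1
trainProcessed |>
  summarise(
    across(
      where(is.numeric),
      list(min = ~ min(.x), max = ~ max(.x)),
      .names = "{.col}_{.fn}"
    )
  )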

3.4.2 Example 2

Adding to the last example, we now perform min-max rescaling followed by principal component analysis, which is implemented with step_pca(). There are a total of 10 image bands or predictor variables; as a result, the maximum number of principal components that can be generated is 10. Here, we specify to extract the first five principal components (num_comp=5). The transformations are applied to only the numeric predictor variables using all_numeric_predictors(). The required eigenvectors are computed when the recipe is prepared with prep(). Again, only the training data are used to avoid a data leak. Lastly, all three tables are processed using bake().

Printing the first six rows of the table, you can see that the processed data now consist of the dependent variable (“class”) and five principal components. The second block of code displays the samples within a space defined by the first two principal component values.

myRecipe2 <- recipe(class ~ ., data = euroSatTrain) |>
  step_range(all_numeric_predictors(), min=0, max=1) |>
  step_pca(all_numeric_predictors(), num_comp = 5)

myRecipePrep2 <- prep(myRecipe2, training=euroSatTrain)

trainProcessed2 <- bake(myRecipePrep2, new_data=NULL)
valProcessed2 <- bake(myRecipePrep2, new_data=euroSatVal)
testProcessed2  <- bake(myRecipePrep2, new_data=euroSatTest)

trainProcessed2 |> head() |> gt()
class PC1 PC2 PC3 PC4 PC5
AnnualCrop -1.444476 -0.200027534 -0.053225523 -0.04677407 0.12586640
AnnualCrop -1.180831 -0.007249725 0.019563326 -0.03126833 0.06238193
AnnualCrop -1.740357 -0.197465575 -0.102684014 -0.05571302 0.16976016
AnnualCrop -1.551906 0.061390630 -0.137713366 -0.12116799 0.03138270
AnnualCrop -1.661165 -0.191002264 0.017887586 -0.03739950 0.18641233
AnnualCrop -1.744396 -0.307589852 -0.008958601 -0.14514547 -0.02851730
trainProcessed2 |> 
  ggplot(aes(x=PC1, y=PC2, color=class))+
  geom_point()+
  scale_color_manual(values = classColors)+
  theme_bw(base_size=11)

3.4.3 Example 3

As a final example using recipes, we convert the class labels into dummy variables. This is not generally necessary for the dependent variable; we are just using this as an example of processing categorical/nominal data. all_outcomes() is used to specify that we want to apply the dummy variable transformation to the dependent variable (i.e., outcome). As with our prior two examples, once the recipe is defined and prepared, it can be applied to each table using bake().

Printing the first six rows, we can see that dummy variables have been created and added to the end of the table. There are a total of n-1 new columns because, if membership in nine of the 10 classes is known, the remaining one can be determined. A row is coded to one only in the column representing the class to which it belongs and to zero in all other columns; a row with zeros in all nine dummy columns must belong to the omitted class (annual crop in this example). Creating only n-1 dummy variables also avoids perfect collinearity, since a full set of n columns would always sum to one, which is an issue for some modeling methods.

myRecipe3 <- recipe(class ~ ., data = euroSatTrain) |>
  step_dummy(all_outcomes())

myRecipePrep3 <- prep(myRecipe3, training=euroSatTrain)

trainProcessed3 <- bake(myRecipePrep3, new_data=NULL)
valProcessed3 <- bake(myRecipePrep3, new_data=euroSatVal)
testProcessed3  <- bake(myRecipePrep3, new_data=euroSatTest)

trainProcessed3 |> head() |> gt()
blue green red red_edge1 red_edge2 red_edge3 NIR NIR_Narrow swir1 swir2 class_Forest class_HerbaceousVegetation class_Highway class_Industrial class_Pasture class_PermanentCrop class_Residential class_River class_SeaLake
1301.216 1310.900 1690.943 1882.683 2300.972 2693.981 2679.530 674.3467 1461.916 3229.571 0 0 0 0 0 0 0 0 0
1042.025 1032.954 1146.716 1371.954 2059.983 2428.848 2392.777 758.6294 1189.457 2714.708 0 0 0 0 0 0 0 0 0
1411.485 1490.846 1949.925 2199.880 2840.266 3354.370 3278.168 761.7170 1621.122 3900.247 0 0 0 0 0 0 0 0 0
1188.904 1186.154 1303.843 1526.593 2638.546 3538.587 3379.391 638.5325 1469.149 3986.078 0 0 0 0 0 0 0 0 0
1313.137 1403.036 1935.128 2199.130 2620.265 3031.684 3112.547 948.0181 1656.649 3609.165 0 0 0 0 0 0 0 0 0
1494.215 1542.500 1762.494 1988.496 2771.243 3218.373 3092.701 730.0217 2437.478 3556.718 0 0 0 0 0 0 0 0 0
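
If a full set of columns, one per class, is preferred (one-hot encoding), step_dummy() accepts a one_hot argument. A minimal sketch is shown below; it is not used elsewhere in this chapter.

# One-hot encoding: one column per class (10 columns rather than 9)
myRecipeOneHot <- recipe(class ~ ., data = euroSatTrain) |>
  step_dummy(all_outcomes(), one_hot = TRUE)

prep(myRecipeOneHot, training = euroSatTrain) |>
  bake(new_data = NULL) |>
  head() |> gt()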

3.5 Principal Component Analysis

We discussed principal component analysis (PCA) as applied to image data in the last chapter. In this final section of the chapter we will further explore PCA as applied to tabulated data.

3.6 Calculate Principal Components

There are multiple implementations of PCA available in R. Here, we use the prcomp() function from the stats package, which is included with base R.

For our example, we use the data that have been rescaled from 0 to 1. We only consider the numeric predictor variables, not the categorical response variable, since PCA is not applicable to categorical data. When scale=TRUE, the data are rescaled to have unit variance (i.e., a standard deviation of one), which is generally recommended.

Once the PCA executes, summary() can be applied to the resulting object to obtain the proportion of variance explained by each component and the associated cumulative variance. The results suggest that the first four principal components capture ~99% of the variance in the original image bands. The first principal component captures ~66% while the second captures ~27%. In other words, ~93% of the variance in the original data are captured by only two new variables. Thus, due to the correlation between the original image bands, PCA offers a significant feature reduction.

pcaOut <- prcomp(trainProcessed[,1:10], scale = TRUE)
summary(pcaOut)
Importance of components:
                          PC1    PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     2.5678 1.6490 0.61350 0.45025 0.26931 0.14352 0.10266
Proportion of Variance 0.6593 0.2719 0.03764 0.02027 0.00725 0.00206 0.00105
Cumulative Proportion  0.6593 0.9313 0.96892 0.98919 0.99644 0.99850 0.99956
                           PC8     PC9    PC10
Standard deviation     0.05716 0.02977 0.01672
Proportion of Variance 0.00033 0.00009 0.00003
Cumulative Proportion  0.99988 0.99997 1.00000
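
The proportions reported by summary() can also be reproduced directly from the component standard deviations stored in the prcomp object, as sketched below.

# Proportion and cumulative proportion of variance from the component standard deviations
propVar <- pcaOut$sdev^2 / sum(pcaOut$sdev^2)
round(propVar, 4)
round(cumsum(propVar), 4)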

3.7 Explore PCA Results

The eigenvectors are stored in the rotation element of the list returned by the prcomp() function. These values are the coefficients by which the 10 original (centered and scaled) image bands are multiplied to obtain each component. Negative values indicate that the original variable is negatively correlated with the component while positive values indicate that it is positively correlated with the component. Larger magnitude values indicate that the input contributes more to the component.

pcaOut$rotation |> as.data.frame() |> rownames_to_column("band") |> gt()
band PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
blue -0.2332634 -0.4499598 -0.09592833 -0.55918371 -0.543608333 0.3291772660 0.12301040 -0.001778931 -0.04701750 -0.0047617102
green -0.3022350 -0.3656105 -0.08623443 -0.27708812 0.225995289 -0.7519375536 -0.19483653 0.174788435 0.07597545 0.0004594956
red -0.3007377 -0.3732250 0.03542181 0.18152349 0.374574593 0.4406563880 -0.55950411 -0.265823234 0.13360553 -0.0036017235
red_edge1 -0.3592516 -0.2110618 0.09649944 0.12167750 0.465114161 0.1719181056 0.68421544 0.180699463 -0.23473280 0.0220022517
red_edge2 -0.3544881 0.2400222 -0.17453459 -0.03672366 -0.026899167 -0.1322480621 0.31263076 -0.653070559 0.48852054 -0.0617412527
red_edge3 -0.3307472 0.3037438 -0.26486505 -0.04236619 -0.038152324 -0.0529464411 -0.16993956 -0.226867213 -0.63654051 0.4864223404
NIR -0.3295254 0.3152591 -0.18440194 -0.04018152 -0.014196488 0.0622893750 -0.14702228 0.162590327 -0.23712850 -0.8050114813
NIR_Narrow -0.2626938 0.3036496 0.85553680 -0.30337990 -0.003373832 -0.0001312174 -0.10444142 0.001153149 0.01871505 0.0594584771
swir1 -0.3388080 -0.2013612 0.20855286 0.68248338 -0.541717226 -0.1943846527 0.00137486 0.056490989 -0.02986709 -0.0075382556
swir2 -0.3275605 0.3130695 -0.25047381 0.01409093 -0.030482109 0.2013457311 -0.08104717 0.598776660 0.46664633 0.3277586720
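
As a brief check of this interpretation, the component scores stored in pcaOut$x can be reproduced by multiplying the centered and scaled band values by the rotation matrix; a sketch is shown below.

# Reproduce the first few component scores from the rotation (eigenvector) matrix
bandMat <- scale(as.matrix(trainProcessed[, 1:10]),
                 center = pcaOut$center,
                 scale = pcaOut$scale)
head(bandMat %*% pcaOut$rotation)
head(pcaOut$x)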

The fviz_screeplot() function from the factoextra package generates a scree plot, which shows the proportion of the variance explained by each principal component. Again, the majority of the variance in the original data is captured by the first two principal components.

fviz_screeplot(pcaOut, addlabels = TRUE)+
  theme_bw(base_size=11)

A variable correlation plot, generated here with fviz_pca_var(), visualizes how the input predictor variables correlate with the first two principal components. The magnitude of an arrow in the x direction relates to the contribution of the variable to the first principal component while the magnitude in the y direction relates to its contribution to the second principal component. The direction relates to the sign of the eigenvector loadings for the two principal components. Vectors in the right portion of the plot correspond to input variables that are positively correlated with the first principal component while those on the left are negatively correlated. In our example, all of the input predictor variables are negatively correlated with the first principal component; all of the loadings for the first principal component are negative, as is evident in the table above. Vectors in the top portion of the plot correspond to input variables that are positively correlated with the second principal component while those in the lower portion are negatively correlated. Some of the input channels, such as the NIR band, are positively correlated with the second principal component while others, such as the red band, are negatively correlated with it.

fviz_pca_var(pcaOut)+
  theme_bw(base_size=11)
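
A full biplot overlaying the samples on the variable vectors can also be generated with factoextra; the sketch below colors the points by land cover class. The geom.ind, habillage, and palette arguments are standard fviz_pca_biplot() options, though this exact figure is not part of the original workflow.

# Biplot of samples and variable vectors, with points colored by land cover class
fviz_pca_biplot(pcaOut, geom.ind = "point",
                habillage = trainProcessed$class,
                palette = classColors) +
  theme_bw(base_size = 11)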

3.8 Concluding Remarks

We have now covered preprocessing considerations and methods for raster data in general and multispectral imagery and digital terrain models more specifically. This chapter focused on data preprocessing not specific to raster geospatial data. The next chapter is our final chapter associated with data preprocessing, and we discuss processing light detection and ranging (lidar) point clouds using the lidR package.

3.9 Questions

  1. Explain the difference between one-hot encoding and dummy variables.
  2. Explain how the Box-Cox transformation is applied.
  3. Explain how the Yeo-Johnson transformation is applied.
  4. Explain the relative impact of a natural log transformation on numbers based on their magnitude. In other words, does a natural log transformation result in compressing ranges of larger values or ranges of smaller values? Why might this be desirable?
  5. Explain the relative impact of calculating the square of numbers based on their magnitude. In other words, does a square transformation result in compressing ranges of larger values or ranges of smaller values? Why might this be desirable?
  6. From the recipes package, explain the difference between the step_center(), step_normalize(), step_range(), and step_scale() functions.
  7. What is the difference between principal component analysis (PCA) and kernel PCA?
  8. What is the difference between principal component analysis (PCA) and independent component analysis (ICA)?

3.10 Exercises

You have been provided two CSV files, training.csv and test.csv, in the exercise folder for the chapter. Both represent data for an object-based image classification. Multiple data layers have been summarized relative to image objects, including red, green, blue, and near infrared (NIR) image bands derived from National Agriculture Imagery Program (NAIP) orthophotography; feature heights, return intensity, and topographic slope derived from lidar; density of man-made structures; and density of roads. Your task is to perform preprocessing to prepare the tables for subsequent analysis.

  1. Read in both tables
  2. For both tables, complete the following tasks using the tidyverse:
    • Extract out the “class” column and all other columns beginning with the word “Mean” (should result in 16 columns)
    • Convert all character columns to factors
    • Drop any rows with missing data in any column
    • Some of the rows encode missing values as a very small numeric placeholder (< 1e-30) rather than NA. Remove rows that have a value lower than 1e-30 in any numeric column
    • Remove “Mean” from all extracted column names
  3. Count the number of records within each class differentiated by the “class” column in the training table
  4. Use ggpairs() to compare the correlation between all 15 extracted predictor variables (all columns other than “class”) in the training table; to speed this up, extract a random subset of 200 samples prior to generating the correlation plot
  5. Create grouped box plots using the “class” column and training set for the following predictor variables:
    • “nDSM” (object-level mean normalized digital surface model; mean height above ground)
    • “NDVI” (object-level mean normalized difference vegetation index)
    • “Int” (object-level mean first return lidar intensity)
    • “Slp” (object-level mean topographic slope)
    • “Str250” (object-level mean structure density using a 250 m circular radius for kernel density calculation)
    • “Roads250” (object-level mean road density using a 250 m circular radius for kernel density calculation)
  6. Use recipes to perform the following preprocessing steps:
    • Rescale all the predictor variables to a range of 0 to 1
    • Convert the 15 predictor variables into 8 principal components
    • Prepare the recipe using the training data
    • Apply the prepared recipe to both the training and test tables
  7. Using the results of your exploratory analysis, generate a short write up that discusses the following:
    • Issues of class imbalance
    • Issues of correlation between predictor variables
    • How well specific classes are differentiated by the graphed predictor variables