3 General Data Preprocessing
3.1 Topics Covered
- Exploratory data analysis via summarization and graphing
- Feature engineering with the recipes package
- Principal component analysis (PCA) on tabulated data
- Exploration of PCA results
3.2 Introduction
The first two chapters focused specifically on preprocessing geospatial raster data. We now turn our attention to more general data preprocessing and feature engineering. In the first section of the chapter, we focus on data summarization, visualization, and exploration. Next, we introduce the recipes package, which is part of the tidymodels metapackage, for general data preprocessing and feature engineering. Lastly, we explore principal component analysis (PCA) in more detail.
For general data wrangling, we load in the tidyverse and reshape2 packages. recipes is loaded for data preprocessing, corrplot and GGally for creating correlation visualizations, and gt for printing tables. factoextra is used for visualizing principal component analysis results.
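The corresponding library() calls are not reproduced here; the following is a minimal sketch consistent with the packages named above.
library(tidyverse)   # data wrangling (dplyr, readr, stringr) and graphing (ggplot2)
library(reshape2)    # melt() for reshaping tables to a tall format
library(recipes)     # data preprocessing and feature engineering
library(corrplot)    # correlation plots
library(GGally)      # scatter plot matrices via ggpairs()
library(gt)          # formatted tables
library(factoextra)  # visualizing PCA results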
The fourth section of the text, the appendix, includes an entire chapter devoted to data wrangling with dplyr (Chapter 18), two chapters devoted to making graphs with ggplot2 (Chapters 20 and 21), and a chapter on building tables with gt (Chapter 22). If you find the examples in this chapter difficult to follow, we recommend reviewing or working through the appendix material first.
In this chapter we make use of the EuroSAT dataset, which is available on GitHub and Kaggle and was introduced in an associated research paper. The version of the dataset used here consists of 27,597 samples labeled to 10 categories: annual crop, forest, herbaceous vegetation, highway, industrial, pasture, permanent crop, residential, river, and sea/lake. The input data are 64-by-64 pixel image chips generated from the Sentinel-2 Multispectral Instrument (MSI) sensor, and we use 10 of the 13 bands: blue, green, red, red edge 1, red edge 2, red edge 3, near infrared (NIR), narrow NIR, shortwave infrared 1 (SWIR1), and SWIR2. The imagery has a spatial resolution of 10 m, so each chip covers an area of 640-by-640 m. All of the data are from European countries. Since we are working with tabulated data in this chapter, each 64-by-64 pixel chip has been aggregated to 10 band means.
We begin by (1) saving our folder path to a variable, (2) defining a vector of colors using hex codes for visualizing the different land cover classes, and (3) reading in the tabulated version of the EuroSAT dataset. These data have already been partitioned into training, validation, and test subsets. These data are read using read_csv() from readr. We also convert all character columns to factors and drop unneeded columns. After these processing steps, each tibble consists of a factor column that provides the land cover class for each sample (“class”) and 10 columns representing the image band means for the blue, green, red, red edge 1, red edge 2, red edge 3, NIR, narrow NIR, SWIR1, and SWIR2 bands.
fldPth <- "gslrData/chpt3/data/"
classColors <- c("#ff7f0e", "#2ca02c", "#8c564b", "#d62728", "#9467bd",
                 "#bcbd22", "#e377c2", "#7f7f7f", "#1f77b4", "#17becf")

euroSatTrain <- read_csv(str_glue("{fldPth}train_aggregated.csv")) |>
  mutate_if(is.character, as.factor) |>
  select(-1, -code)

euroSatTest <- read_csv(str_glue("{fldPth}test_aggregated.csv")) |>
  mutate_if(is.character, as.factor) |>
  select(-1, -code)

euroSatVal <- read_csv(str_glue("{fldPth}val_aggregated.csv")) |>
  mutate_if(is.character, as.factor) |>
  select(-1, -code)
3.3 Explore Data
It is generally a good idea to explore your data prior to performing any preprocessing or feature engineering or before using them in a modeling workflow. For classification tasks, and especially multiclass classification, class imbalance is often a concern. Using dplyr, we count the number of records, or rows, belonging to each class. The number of samples per class ranges from 1,200 (pasture) to 2,158 (sea/lake), so there does not appear to be a high degree of class imbalance. However, if class imbalance were an issue, some means to combat it could be considered, such as oversampling the minority classes, generating synthetic samples, or using a class-weighted loss metric.
If you are incorporating categorical predictor variables, it is also good practice to count the number of occurrences of each factor level for those variables. If there are a large number of levels and/or if some factor levels have a small number of occurrences, you may want to recode the variable. Also, some algorithms, such as artificial neural networks (ANNs) and support vector machines (SVMs), cannot natively accept categorical predictor variables. As a result, such variables need to be converted to a different representation, such as dummy variables. In contrast, some algorithms, such as tree-based methods, can natively accept categorical predictor variables.
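The code that produced the class counts shown below is not reproduced in the text; the following is a minimal dplyr sketch, where the column name cnt matches the table header but is otherwise an assumption.
euroSatTrain |>
  count(class, name = "cnt") |>  # tally the number of rows per land cover class
  gt()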
class | cnt |
---|---|
AnnualCrop | 1800 |
Forest | 1800 |
HerbaceousVegetation | 1800 |
Highway | 1500 |
Industrial | 1500 |
Pasture | 1200 |
PermanentCrop | 1500 |
Residential | 1800 |
River | 1500 |
SeaLake | 2158 |
It is also good practice to explore the distribution of each continuous predictor variable. If the data are heavily skewed, you may choose to perform a data transformation, such as a logarithmic or square root transform. The need for this will vary based on the learning algorithm being used.
Below, we have created histograms to show the distribution of each image band. Since the tibble is not in the correct shape to create the graph, melt() from reshape2 is used to convert the table into a tall format with the band values stored in the “value” column, the name of the band stored in the “variable” column, and the class name stored in the “class” column, which is specified as an ID variable via the id.vars argument of melt().
euroSatTrain |>
  melt(id.vars = "class") |>
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(vars(variable)) +
  theme_bw(base_size = 12)
We can also explore the distribution of each predictor variable by class. In our example, we make use of a box plot via geom_boxplot(); it is also possible to use a violin plot via geom_violin(). Assigning the class to the x-axis position differentiates each class and generates grouped box plots. Since the data are not in the correct shape, melt() is used again. Generally, the visualization suggests that there are differences in the distribution and combination of channel digital number (DN) values between the different classes; as a result, supervised learning algorithms may be able to accurately differentiate the classes of interest.
euroSatTrain |>
  melt(id.vars = "class") |>
  ggplot(aes(x = class, y = value, fill = class)) +
  geom_boxplot() +
  facet_wrap(vars(variable), ncol = 3) +
  scale_fill_manual(values = classColors) +
  theme_bw(base_size = 10) +
  theme(axis.text.x = element_text(size = 8,
                                   color = "gray20",
                                   angle = 90,
                                   hjust = 1))
Correlation coefficients can be used to characterize the degree of correlation between pairs of numeric variables. As configured below, the cor() function returns the Pearson correlation coefficient, which is a measure of linear correlation. This function can also calculate Kendall and Spearman correlations, which use ranks and assess monotonic correlation more generally. The corrplot package provides a variety of plots for visualizing pair-wise correlation. As configured, our example shows the Pearson correlation coefficient in the lower triangle and a circular symbol representing the correlation in the upper triangle. Darker shades of blue indicate increasing positive linear correlation while darker shades of red indicate increasing negative linear correlation. The size of the symbol indicates the magnitude of the correlation, while the color indicates its direction. This article provides additional examples of correlation plots generated using corrplot.
cor(euroSatTrain[, 2:11]) |>
  data.frame() |>
  mutate(across(where(is.numeric), ~ round(.x, digits = 3))) |>
  gt()
blue | green | red | red_edge1 | red_edge2 | red_edge3 | NIR | NIR_Narrow | swir1 | swir2 |
---|---|---|---|---|---|---|---|---|---|
1.000 | 0.932 | 0.885 | 0.777 | 0.263 | 0.152 | 0.133 | 0.036 | 0.703 | 0.131 |
0.932 | 1.000 | 0.959 | 0.919 | 0.476 | 0.368 | 0.351 | 0.211 | 0.824 | 0.346 |
0.885 | 0.959 | 1.000 | 0.942 | 0.452 | 0.342 | 0.330 | 0.213 | 0.888 | 0.330 |
0.777 | 0.919 | 0.942 | 1.000 | 0.695 | 0.596 | 0.591 | 0.471 | 0.924 | 0.587 |
0.263 | 0.476 | 0.452 | 0.695 | 1.000 | 0.989 | 0.987 | 0.758 | 0.643 | 0.984 |
0.152 | 0.368 | 0.342 | 0.596 | 0.989 | 1.000 | 0.998 | 0.741 | 0.548 | 0.997 |
0.133 | 0.351 | 0.330 | 0.591 | 0.987 | 0.998 | 1.000 | 0.774 | 0.544 | 0.998 |
0.036 | 0.211 | 0.213 | 0.471 | 0.758 | 0.741 | 0.774 | 1.000 | 0.446 | 0.744 |
0.703 | 0.824 | 0.888 | 0.924 | 0.643 | 0.548 | 0.544 | 0.446 | 1.000 | 0.543 |
0.131 | 0.346 | 0.330 | 0.587 | 0.984 | 0.997 | 0.998 | 0.744 | 0.543 | 1.000 |
eurosatCorr <- cor(euroSatTrain[,2:11])
corrplot.mixed(eurosatCorr, order = 'AOE')
Another option for visualizing pair-wise variable correlation is a scatter plot matrix. The ggpairs() function from the GGally package allows for creating a scatter plot matrix using ggplot2. The lower triangle provides the scatter plot for each pair of variables while the upper triangle provides the associated Pearson correlation coefficient. The main diagonal characterizes the distribution of each variable using a kernel density plot.
set.seed(42)
euroSatTrain |>
  sample_n(200) |>
  ggpairs(columns = c(2:11),
          axisLabels = "none") +
  theme_bw(base_size = 18)
Generally, the correlation measures and visualizations suggest a high degree of correlation between the image bands. For example, the red edge 2 and 3 bands are highly correlated with each other and also with the NIR band. More generally, bands tend to be positively correlated.
It may also be informative to calculate summary statistics for each numeric predictor. In the next example, we calculate the mean, standard deviation, median, and interquartile range (IQR) for all image bands.
euroSatTrain |>
  summarise(
    across(
      where(is.numeric),
      list(mean = ~ mean(.x, na.rm = TRUE),
           sd = ~ sd(.x, na.rm = TRUE),
           median = ~ median(.x, na.rm = TRUE),
           iqr = ~ IQR(.x, na.rm = TRUE)),
      .names = "{.col}_{.fn}"
    )
  ) |>
  gt()
band | mean | sd | median | iqr |
---|---|---|---|---|
blue | 1116.076 | 255.5433 | 1082.944 | 361.6446 |
green | 1035.312 | 312.2988 | 996.797 | 395.2883 |
red | 937.8773 | 479.8577 | 853.9935 | 675.3386 |
red_edge1 | 1184.292 | 500.9336 | 1163.053 | 621.9081 |
red_edge2 | 1971.095 | 779.6658 | 2099.205 | 822.2243 |
red_edge3 | 2334.381 | 976.8461 | 2464.24 | 1015.767 |
NIR | 2262.04 | 970.0852 | 2394.394 | 1018.417 |
NIR_Narrow | 724.9527 | 384.4926 | 708.1434 | 444.3475 |
swir1 | 1103.834 | 670.8869 | 1080.137 | 943.1074 |
swir2 | 2552.018 | 1120.9 | 2713.485 | 1164.891 |
Missing data can also be an issue. Our code below confirms that there are no missing values in the table. If there were missing values, it may be necessary to remove rows with missing data or attempt to fill or impute the missing values.
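The original check is not shown; the following is a minimal sketch that counts the NA values in every column (all counts are zero for these data).
euroSatTrain |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>  # number of missing values per column
  gt()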
3.4 Recipes
The recipes package is part of the tidymodels metapackage, and its primary purpose is to support data preprocessing and feature engineering within machine learning workflows. We have found it to be very useful for data manipulation and use it regularly, even outside of tidymodels.
It implements a variety of functions for manipulating single numeric predictor variables, single categorical/nominal predictor variables, and multiple numeric predictor variables. It also provides functions for performing imputation, or estimating missing values. Table 3.1 lists some of the functions provided and briefly explains their uses. As a few examples, a logarithmic transformation can be applied to a single predictor variable using a user-defined base via the step_log() function. Non-negative numeric predictor variables can be transformed to a more normal distribution via a Box-Cox transformation (step_BoxCox()) while variables containing both positive and negative values can be transformed using a Yeo-Johnson transformation (step_YeoJohnson()). step_normalize() rescales data to have a mean of zero and a standard deviation of one. For categorical data, dummy variables can be created using step_dummy(). Feature reduction can be implemented using principal component analysis (PCA) (step_pca()), kernel PCA (kPCA) (step_kpca()), or independent component analysis (ICA) (step_ica()).

Variables with zero variance (step_zv()) or near-zero variance (step_nzv()) can be removed, along with variables that are highly correlated with other variables (step_corr()). Lastly, a variety of imputation methods for missing data are provided, including bagged trees (step_impute_bag()), k-nearest neighbor (step_impute_knn()), and linear models (step_impute_linear()). Alternatively, missing values can be filled with the mean (step_impute_mean()), median (step_impute_median()), or mode (step_impute_mode()) of the available values.
In addition to these functions, a variety of helper functions are provided for selecting the variables on which to perform transformations. For example, all_numeric() selects all numeric variables while all_numeric_predictors() selects only numeric predictor variables. All predictor variables, regardless of their data type, can be selected with all_predictors(), and all predictor variables with a factor data type can be selected using all_factor_predictors().
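As a brief illustration of how these steps and selector helpers are chained together, the following sketch defines (but does not yet prepare or apply) a recipe combining median imputation, a Yeo-Johnson transformation, and normalization; the object name demoRecipe and the particular combination of steps are arbitrary choices for demonstration, and preparing and applying recipes is covered below.
demoRecipe <- recipe(class ~ ., data = euroSatTrain) |>
  step_impute_median(all_numeric_predictors()) |>  # fill any missing values with the column median
  step_YeoJohnson(all_numeric_predictors()) |>     # reduce skew in the numeric predictors
  step_normalize(all_numeric_predictors())         # center to mean 0 and standard deviation 1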
If you are interested in learning more about preprocessing and feature engineering using recipes, please consult the package documentation, which includes examples along with descriptions of all provided functions.
Table 3.1: Selected recipes functions and their uses.

Function | Description |
---|---|
Individual Numeric Variable | |
step_center() | Center to a mean of 0 |
step_normalize() | Convert to a mean of 0 and standard deviation of 1 |
step_range() | Convert data to a desired data range (min-max rescaling) |
step_log() | Logarithmic transform using a user-defined base |
step_inverse() | Take the inverse of values |
step_sqrt() | Take the square root of values |
step_BoxCox() | Used for non-negative values; makes data more normally distributed |
step_YeoJohnson() | Used for negative and positive values; makes data more normally distributed |
Categorical/Nominal Variable | |
step_dummy() | Create dummy variables |
step_num2factor() | Convert numeric data to a factor data type |
step_ordinalscore() | Convert ordinal data to numeric scores |
step_string2factor() | Convert character/string data to a factor data type |
Multiple Numeric Variables | |
step_interact() | Generate interaction variables |
step_pca() | Convert to principal components |
step_kpca() | Perform kernel PCA |
step_ica() | Perform independent component analysis |
step_pls() | Perform partial least squares feature extraction |
step_spatialsign() | Perform spatial sign processing |
step_corr() | Remove highly correlated variables |
step_zv() | Remove variables with zero variance |
step_nzv() | Remove variables with near-zero variance |
step_lincomb() | Remove variables that are linear combinations of other variables |
Imputation | |
step_filter_missing() | Remove predictor variables with missing values |
step_impute_bag() | Estimate missing values using bagged trees |
step_impute_knn() | Estimate missing values using k-nearest neighbor |
step_impute_linear() | Estimate missing values using linear models |
step_impute_mean() | Fill missing values with the mean of existing values |
step_impute_median() | Fill missing values with the median of existing values |
step_impute_mode() | Fill missing values with the mode of existing values |
A data preprocessing pipeline is generally implemented using recipes as follows:
- The modeling problem and pipeline are defined using recipe().
- The recipe is prepared using prep().
- The prepared recipe is applied to data using bake().
We have provided three example workflows below.
3.4.1 Example 1
The goal of this first workflow is simply to rescale all numeric predictor variables to a range of zero to one via min-max rescaling, which is implemented in recipes via step_range(). The first step is to define the recipe. The model or problem is defined using the following formula syntax: class ~ . This reads as “class is predicted by all other columns in the data table.” We also specify the data table (data=euroSatTrain). It is necessary to define the formula so that the role of each variable is clear. In other words, defining the model indicates that “class” is the response or dependent variable while all other columns are predictor variables. Next, we indicate that min-max rescaling should be performed using step_range() on all numeric predictor variables, selected with all_numeric_predictors(). Since all of the predictor variables are numeric, this yields the same result as all_predictors() in this case. We also specify to rescale to a range of zero to one.

Once the recipe is defined, it must be prepared using prep(). It is important to note that we are only using the training set here. Data transformations should be specified using only the training set. Using all of the data, or the validation or test samples, is considered a data leak since using these data during the preprocessing stage allows the model to learn something about the withheld data.
myRecipe <- recipe(class ~ ., data = euroSatTrain) |>
  step_range(all_numeric_predictors(), min = 0, max = 1)

myRecipePrep <- prep(myRecipe, training = euroSatTrain)
Once the recipe is prepared, it can be applied to each dataset using bake(). If new_data=NULL, the data used to define the recipe, in this case the training set, are processed.
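The bake() calls for this first example are not reproduced in the text; the following is a minimal sketch that mirrors the later examples. The name trainProcessed is referenced below, while valProcessed and testProcessed are assumed.
trainProcessed <- bake(myRecipePrep, new_data = NULL)       # process the training data used to define the recipe
valProcessed <- bake(myRecipePrep, new_data = euroSatVal)   # process the validation data (assumed object name)
testProcessed <- bake(myRecipePrep, new_data = euroSatTest) # process the test data (assumed object name)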
To verify that the data have been rescaled, we replicate the histograms created above. This confirms that all bands are now scaled between zero and one.
trainProcessed |>
  melt(id.vars = "class") |>
  ggplot(aes(x = value)) +
  geom_histogram() +
  facet_wrap(vars(variable)) +
  theme_bw(base_size = 12)
3.4.2 Example 2
Adding to the last example, we now perform min-max rescaling followed by principal component analysis, which is implemented with step_pca(). There are a total of 10 image bands, or predictor variables; as a result, the maximum number of principal components that can be generated is 10. Here, we specify to extract the first five principal components (num_comp=5). The transformations are applied to only the numeric predictor variables using all_numeric_predictors(). The required eigenvectors are generated using prep(). Again, only the training data are used to avoid a data leak. Lastly, all three tables are processed using bake().
Printing the first six rows of the table, you can see that the processed data now consist of the dependent variable (“class”) and five principal components. The second block of code displays the samples within a space defined by the first two principal component values.
myRecipe2 <- recipe(class ~ ., data = euroSatTrain) |>
  step_range(all_numeric_predictors(), min = 0, max = 1) |>
  step_pca(all_numeric_predictors(), num_comp = 5)

myRecipePrep2 <- prep(myRecipe2, training = euroSatTrain)

trainProcessed2 <- bake(myRecipePrep2, new_data = NULL)
valProcessed2 <- bake(myRecipePrep2, new_data = euroSatVal)
testProcessed2 <- bake(myRecipePrep2, new_data = euroSatTest)

trainProcessed2 |> head() |> gt()
class | PC1 | PC2 | PC3 | PC4 | PC5 |
---|---|---|---|---|---|
AnnualCrop | -1.444476 | -0.200027534 | -0.053225523 | -0.04677407 | 0.12586640 |
AnnualCrop | -1.180831 | -0.007249725 | 0.019563326 | -0.03126833 | 0.06238193 |
AnnualCrop | -1.740357 | -0.197465575 | -0.102684014 | -0.05571302 | 0.16976016 |
AnnualCrop | -1.551906 | 0.061390630 | -0.137713366 | -0.12116799 | 0.03138270 |
AnnualCrop | -1.661165 | -0.191002264 | 0.017887586 | -0.03739950 | 0.18641233 |
AnnualCrop | -1.744396 | -0.307589852 | -0.008958601 | -0.14514547 | -0.02851730 |
trainProcessed2 |>
  ggplot(aes(x = PC1, y = PC2, color = class)) +
  geom_point() +
  scale_color_manual(values = classColors) +
  theme_bw(base_size = 11)
3.4.3 Example 3
As a final example using recipes, we convert the class labels into dummy variables. This is not generally necessary for the dependent variable; we are just using it as an example of processing categorical/nominal data. all_outcomes() is used to specify that we want to apply the dummy variable transformation to the dependent variable (i.e., the outcome). As with our prior two examples, once the recipe is defined and prepared, it can be applied to each table using bake().

Printing the first six rows, we can see that dummy variables have been created and added to the end of the table. There are a total of n-1 new columns, where n is the number of classes. A row is coded to one only in the column representing the class to which it belongs and to zero in all other columns, so a sample belonging to the class without its own dummy variable (annual crop in this case) is identified by all dummy variables being zero. Only n-1 dummy variables are created because the omitted column is completely determined by the others; including all n columns would introduce perfect collinearity, since the columns would always sum to one, which is an issue for some modeling methods.
myRecipe3 <- recipe(class ~ ., data = euroSatTrain) |>
  step_dummy(all_outcomes())

myRecipePrep3 <- prep(myRecipe3, training = euroSatTrain)

trainProcessed3 <- bake(myRecipePrep3, new_data = NULL)
valProcessed3 <- bake(myRecipePrep3, new_data = euroSatVal)
testProcessed3 <- bake(myRecipePrep3, new_data = euroSatTest)

trainProcessed3 |> head() |> gt()
blue | green | red | red_edge1 | red_edge2 | red_edge3 | NIR | NIR_Narrow | swir1 | swir2 | class_Forest | class_HerbaceousVegetation | class_Highway | class_Industrial | class_Pasture | class_PermanentCrop | class_Residential | class_River | class_SeaLake |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1301.216 | 1310.900 | 1690.943 | 1882.683 | 2300.972 | 2693.981 | 2679.530 | 674.3467 | 1461.916 | 3229.571 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1042.025 | 1032.954 | 1146.716 | 1371.954 | 2059.983 | 2428.848 | 2392.777 | 758.6294 | 1189.457 | 2714.708 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1411.485 | 1490.846 | 1949.925 | 2199.880 | 2840.266 | 3354.370 | 3278.168 | 761.7170 | 1621.122 | 3900.247 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1188.904 | 1186.154 | 1303.843 | 1526.593 | 2638.546 | 3538.587 | 3379.391 | 638.5325 | 1469.149 | 3986.078 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1313.137 | 1403.036 | 1935.128 | 2199.130 | 2620.265 | 3031.684 | 3112.547 | 948.0181 | 1656.649 | 3609.165 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1494.215 | 1542.500 | 1762.494 | 1988.496 | 2771.243 | 3218.373 | 3092.701 | 730.0217 | 2437.478 | 3556.718 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3.5 Principal Component Analysis
We discussed principal component analysis (PCA) as applied to image data in the last chapter. In the remaining sections of this chapter, we further explore PCA as applied to tabulated data.
3.6 Calculate Principal Components
There are multiple implementations of PCA available in R. Here, we use the prcomp() function from the stats package, which is included with base R.

For our example, we use the data that have been rescaled from 0 to 1. We only consider the numeric predictor variables, not the categorical response variable, since PCA is not applicable to categorical data. When the scale. argument is set to TRUE, the data are rescaled to have unit variance (i.e., a standard deviation of one), which is generally recommended.
Once the PCA executes, summary() can be applied to the resulting object to obtain the proportion of variance explained by each component and the associated cumulative variance. The results suggest that the first four principal components capture ~99% of the variance in the original image bands. The first principal component captures ~66% while the second captures ~27%. In other words, ~93% of the variance in the original data is captured by only two new variables. Thus, due to the correlation between the original image bands, PCA offers a significant feature reduction.
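The call that produced the summary below is not shown in the text; the following is a minimal sketch, assuming the rescaled training data from Example 1 (trainProcessed), with the class label in the first column, and the object name pcaOut used in the remainder of the chapter.
# Run PCA on the 10 rescaled band columns, centering and scaling to unit variance
pcaOut <- prcomp(trainProcessed[, 2:11], center = TRUE, scale. = TRUE)
summary(pcaOut)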
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.5678 1.6490 0.61350 0.45025 0.26931 0.14352 0.10266
Proportion of Variance 0.6593 0.2719 0.03764 0.02027 0.00725 0.00206 0.00105
Cumulative Proportion 0.6593 0.9313 0.96892 0.98919 0.99644 0.99850 0.99956
PC8 PC9 PC10
Standard deviation 0.05716 0.02977 0.01672
Proportion of Variance 0.00033 0.00009 0.00003
Cumulative Proportion 0.99988 0.99997 1.00000
3.7 Explore PCA Results
The eigenvectors are stored in the rotation element of the list object created by the prcomp() function. These values are the coefficients by which the 10 original (scaled) image bands are multiplied to obtain each component. Negative values indicate that the original variable is negatively correlated with the component while positive values indicate that it is positively correlated with the component. Larger magnitude values indicate that the input contributes more to the component.
pcaOut$rotation |> as.data.frame() |> gt()
PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 |
---|---|---|---|---|---|---|---|---|---|
-0.2332634 | -0.4499598 | -0.09592833 | -0.55918371 | -0.543608333 | 0.3291772660 | 0.12301040 | -0.001778931 | -0.04701750 | -0.0047617102 |
-0.3022350 | -0.3656105 | -0.08623443 | -0.27708812 | 0.225995289 | -0.7519375536 | -0.19483653 | 0.174788435 | 0.07597545 | 0.0004594956 |
-0.3007377 | -0.3732250 | 0.03542181 | 0.18152349 | 0.374574593 | 0.4406563880 | -0.55950411 | -0.265823234 | 0.13360553 | -0.0036017235 |
-0.3592516 | -0.2110618 | 0.09649944 | 0.12167750 | 0.465114161 | 0.1719181056 | 0.68421544 | 0.180699463 | -0.23473280 | 0.0220022517 |
-0.3544881 | 0.2400222 | -0.17453459 | -0.03672366 | -0.026899167 | -0.1322480621 | 0.31263076 | -0.653070559 | 0.48852054 | -0.0617412527 |
-0.3307472 | 0.3037438 | -0.26486505 | -0.04236619 | -0.038152324 | -0.0529464411 | -0.16993956 | -0.226867213 | -0.63654051 | 0.4864223404 |
-0.3295254 | 0.3152591 | -0.18440194 | -0.04018152 | -0.014196488 | 0.0622893750 | -0.14702228 | 0.162590327 | -0.23712850 | -0.8050114813 |
-0.2626938 | 0.3036496 | 0.85553680 | -0.30337990 | -0.003373832 | -0.0001312174 | -0.10444142 | 0.001153149 | 0.01871505 | 0.0594584771 |
-0.3388080 | -0.2013612 | 0.20855286 | 0.68248338 | -0.541717226 | -0.1943846527 | 0.00137486 | 0.056490989 | -0.02986709 | -0.0075382556 |
-0.3275605 | 0.3130695 | -0.25047381 | 0.01409093 | -0.030482109 | 0.2013457311 | -0.08104717 | 0.598776660 | 0.46664633 | 0.3277586720 |
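To make the role of the rotation matrix concrete, the following sketch, assuming the trainProcessed and pcaOut objects used above, reconstructs the component scores by multiplying the centered and scaled band values by the eigenvector matrix and compares them with the scores stored by prcomp().
# Project the centered and scaled band values onto the eigenvectors
manualScores <- scale(as.matrix(trainProcessed[, 2:11])) %*% pcaOut$rotation

# The manually computed scores match the scores returned by prcomp()
all.equal(unname(manualScores), unname(pcaOut$x))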
The fviz_screeplot() function from the factoextra package generates a scree plot, which shows the proportion of the variance explained by each principal component. Again, the majority of the variance in the original data is captured by the first two principal components.
fviz_screeplot(pcaOut, addlabels = TRUE)+
theme_bw(base_size=11)
A biplot visualizes how the input predictor variables correlate with the first two principal components. The magnitude of an arrow in the x direction relates to the contribution of the variable to the first principal component while its magnitude in the y direction relates to its contribution to the second principal component. The direction relates to the sign of the eigenvector coefficients for the two principal components. Vectors in the right portion of the plot correspond to input variables that are positively correlated with the first principal component while those in the left portion are negatively correlated. In our example, all of the input predictor variables are negatively correlated with the first principal component; all of the eigenvector coefficients for the first principal component are negative, as is evident in the table above. Vectors in the top portion of the plot correspond to input variables that are positively correlated with the second principal component while those in the lower portion are negatively correlated. Some of the input channels, such as the NIR band, are positively correlated with the second principal component while others, such as the red band, are negatively correlated with it.
fviz_pca_var(pcaOut)+
theme_bw(base_size=11)
3.8 Concluding Remarks
We have now covered preprocessing considerations and methods for raster data generally, and for multispectral imagery and digital terrain models more specifically. This chapter focused on data preprocessing not specific to raster geospatial data. The next chapter is our final chapter on data preprocessing; in it, we discuss processing light detection and ranging (lidar) point clouds using the lidR package.
3.9 Questions
- Explain the difference between one-hot encoding and dummy variables.
- Explain how the Box-Cox transformation is applied.
- Explain how the Yeo-Johnson transformation is applied.
- Explain the relative impact of a natural log transformation on numbers based on their magnitude. In other words, does a natural log transformation result in compressing ranges of larger values or ranges of smaller values? Why might this be desirable?
- Explain the relative impact of calculating the square of numbers based on their magnitude. In other words, does a square transformation result in compressing ranges of larger values or ranges of smaller values? Why might this be desirable?
- From the recipes package, explain the difference between the step_center(), step_normalize(), step_range(), and step_scale() functions.
- What is the difference between principal component analysis (PCA) and kernel PCA?
- What is the difference between principal component analysis (PCA) and independent component analysis (ICA)?
3.10 Exercises
You have been provided two CSV files, training.csv and test.csv, in the exercise folder for the chapter. Both represent data for an object-based image classification. Multiple data layers have been summarized relative to image objects, including red, green, blue, and near infrared (NIR) image bands derived from National Agriculture Imagery Program (NAIP) orthophotography; feature heights, return intensity, and topographic slope derived from lidar; density of man-made structures; and density of roads. Your task is to perform preprocessing to prepare the tables for subsequent analysis.
- Read in both tables
- For both tables, complete the following tasks using the tidyverse:
- Extract out the “class” column and all other columns beginning with the word “Mean” (should result in 16 columns)
- Convert all character columns to factors
- Drop any rows with missing data in any column
- Some of the rows have missing values coded as a very low numeric placeholder value (< 1e-30) rather than NA. Remove rows that have a value lower than 1e-30 in any numeric column
- Remove “Mean” from all extracted column names
- Count the number of records within each class differentiated in the “Class” column in the training table
- Use ggpairs() to compare the correlation between all 15 extracted predictor variables (all columns other than “class”) in the training table; to speed this up, extract a random subset of 200 samples prior to generating the correlation plot
- Create grouped box plots using the “class” column and training set for the following predictor variables:
- “nDSM” (object-level mean normalized digital surface model; mean height above ground)
- “NDVI” (object-level mean normalized difference vegetation index)
- “Int” (object-level mean first return lidar intensity)
- “Slp” (object-level mean topographic slope)
- “Str250” (object-level mean structure density using a 250 m circular radius for kernel density calculation)
- “Roads250” (object-level mean road density using a 250 m circular radius for kernel density calculation)
- Use recipes to perform the following preprocessing steps:
- Rescale all the predictor variables to a range of 0 to 1
- Convert the 15 predictor variables into 8 principal components
- Prepare the recipe using the training data
- Apply the prepared recipe to both the training and test tables
- Using the results of your exploratory analysis, generate a short write up that discusses the following:
- Issues of class imbalance
- Issues of correlation between predictor variables
- How well specific classes are differentiated by the graphed predictor variables