19  Base R (Part II)

19.1 Topics Covered

  1. Define and use your own functions
  2. Implement while and for loops
  3. Implement control flow
  4. Vectorized options with dplyr and purrr

19.2 Introduction

Now that we have covered base R and tidyverse syntax, we will introduce some other core coding concepts and their implementations in R. Specifically, we will explore:

  1. Functions: allow you to build your own tools or snippets of code that can be reused
  2. Loops: allow for iteration over data or elements, such as rows in a data frame/tibble or items in a vector
  3. Control flow: uses conditions to determine whether code is executed and which code is executed
  4. Some alternatives using the tidyverse

Although these topics are discussed in the context of R, they are core to programming and scripting in general.

19.3 Functions

A function is a set of code that can be reused. If you find yourself using the same lines of code repeatedly, you may want to convert them into a function. This can make your code both more concise and easier to edit. For example, if you want to update the code, you can simply update the function as opposed to updating every instance of repeating lines.

In our first example below, we are defining a function to rescale a vector. In R, function() is used to define a new function. Our function is given the name scale2, and it has two parameters: data and scale. The values or inputs assigned to the parameters are called arguments. It is possible to define default arguments for parameters. In our example, the default argument for scale is 1. What the function actually does is defined inside of {}. This function does the following:

  1. Calculates the largest value in the vector using max()
  2. Calculates the smallest value in the vector using min()
  3. Subtracts the smallest value from the current value to obtain the numerator (n)
  4. Calculates the range for the denominator (d)
  5. Divides n by d to rescale the data from zero to one.
  6. Further rescales the data by multiplying by the scale argument.

This function specifically implements min-max rescaling to a range of zero to one then multiplies by the scale argument to rescale to a new range.

This is a good time to discuss the concept of scope in code. All of the variables defined inside of the function have local scope: they can only be used or called inside of the function. This is in contrast to variables that are created or declared outside of a function, which have global scope and can be called or used anywhere in the code.
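
As a minimal sketch of this distinction, the code below uses a hypothetical function, add_shift(), and hypothetical variable names that are not part of the chapter's examples. It assumes that no object named shift_amount already exists in your session.

```r
# Hypothetical example: all names are for illustration only
global_value <- 5            # created outside of any function: global scope

add_shift <- function(v){
  shift_amount <- 10         # created inside the function: local scope
  # The function can still read global_value from the global environment
  return(v + shift_amount + global_value)
}

shifted <- add_shift(c(1, 2, 3))
print(shifted)

# shift_amount only existed while the function was running
print(exists("shift_amount"))
```

The first print statement should show 16, 17, and 18, while the second should show FALSE since shift_amount is not available outside of the function.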

The return() function defines what the function returns or produces. In this example, the rescaled data are returned as a vector. Note that functions can return a variety of different types of objects including vectors, matrices, arrays, data frames, and lists.

Executing the code below will instantiate the scale2() function. Once instantiated, it can then be used. In the following blocks of code, we generate a vector of values then rescale the values to a range of 0 to 100 and to a range of 0 to 1. When rescaling to a range of 0 to 1, we do not need to provide an argument for the scale parameter since 1 is the default.

scale2 <- function(data, scale=1){
  max1 <- max(data)
  min1 <- min(data)
  n <- data-min1
  d <- max1-min1
  s <- n/d
  s2 <- s*scale
  return(s2)
}
x <- c(1, 14, 21, 16, 18, 16, 19, 20, 6, 8, 9, 11, 17)
x100 <- scale2(x, 100)
x1 <- scale2(x)
print(x)
 [1]  1 14 21 16 18 16 19 20  6  8  9 11 17
print(x100)
 [1]   0  65 100  75  85  75  90  95  25  35  40  50  80
print(x1)
 [1] 0.00 0.65 1.00 0.75 0.85 0.75 0.90 0.95 0.25 0.35 0.40 0.50 0.80
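
The scale2() function above assumes that the input has no missing values and that its values are not all identical. If all values were the same, the denominator would be zero. The sketch below shows one possible way to guard against both cases. The function name scale2_safe() and the error message are our own assumptions and are not part of the original example.

```r
# Hypothetical variant of scale2() with two safeguards
scale2_safe <- function(data, scale=1){
  rng <- range(data, na.rm=TRUE)   # smallest and largest non-missing values
  d <- rng[2] - rng[1]             # denominator for min-max rescaling
  if(d == 0){
    stop("All values are identical, so the data cannot be rescaled.")
  }
  s <- (data - rng[1]) / d
  return(s * scale)
}

print(scale2_safe(c(1, NA, 6, 11), 100))
```

Missing values remain NA in the output while the other values are rescaled to 0, 50, and 100.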

Next, we define a more complicated function that calculates root mean square error (RMSE) to assess the precision of a georeferencing process. The function is named rmse_georef() and has 4 parameters:

  • x_c: correct x coordinates
  • y_c: correct y coordinates
  • x_p: predicted x coordinates
  • y_p: predicted y coordinates

Each argument should be a vector of coordinates, all of the same length. The code inside the function calculates RMSEx, RMSEy, and RMSETotal. The output is a list object that contains vectors of residuals in the x and y directions and the three RMSE metrics. We also rename the objects within the list to make them more interpretable.

Once the function is initialized, it can be used by providing arguments for all four parameters. The objects contained in the resulting list can be accessed using $ and the object name.

rmse_georef <- function(x_c, y_c, x_p, y_p){
  x_residual <- x_c - x_p
  y_residual <- y_c - y_p
  x_residual_sq <- x_residual^2
  y_residual_sq <- y_residual^2
  rmse_x <- sqrt(sum(x_residual_sq)/length(x_residual))
  rmse_y <- sqrt(sum(y_residual_sq)/length(y_residual))
  rmse_total <- sqrt(rmse_x^2 + rmse_y^2)
  rmse_list <- list(x_residual, 
                    y_residual, 
                    rmse_x, 
                    rmse_y, 
                    rmse_total)
  names(rmse_list) <- c("X.Residuals", 
                        "Y.Residuals", 
                        "RMSE.X", 
                        "RMSE.Y", 
                        "RMSE.Total")
  return(rmse_list)
}
x_actual <- c(584026.624, 583179.7805, 589507.5837, 579463.0782, 585908.4986, 588190.2715)
y_actual <- c(4474131.442, 4479283.074, 4476648.449, 4478436.23, 4470697.021, 4480318.105)
x_predicted <-c(584041.7902, 583211.7964, 589496.2211, 579447.4653, 585909.7985, 588206.0155)
y_predicted <- c(4474159.608, 4479295.524, 4476664.073, 4478462.252, 4470719.12, 4480344.345)
example_rmse <-rmse_georef(x_c=x_actual, 
                           y_c =y_actual, 
                           x_p=x_predicted,
                           y_p=y_predicted)
print(example_rmse$RMSE.Total)
[1] 28.64713
print(example_rmse$RMSE.X)
[1] 17.68929
print(example_rmse$RMSE.Y)
[1] 22.53325
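
For reference, the calculations performed inside rmse_georef() can be written as the following equations, where n is the number of coordinate pairs and the subscripts c and p denote the correct and predicted coordinates, as in the parameter names. The notation is our own shorthand for what the code computes.

```latex
\mathrm{RMSE}_x = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{c,i} - x_{p,i}\right)^2}
\qquad
\mathrm{RMSE}_y = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{c,i} - y_{p,i}\right)^2}
\qquad
\mathrm{RMSE}_{Total} = \sqrt{\mathrm{RMSE}_x^2 + \mathrm{RMSE}_y^2}
```

Using the printed results above, sqrt(17.68929^2 + 22.53325^2) is approximately 28.647, which agrees with the reported total.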

19.4 Loops

Loops allow for iterating over elements to perform an operation multiple times. The two most commonly used types of loops in R are:

  • While Loop: continues to iterate until a condition evaluates to FALSE
  • For Loop: iterates over all elements in an iterable object; these types of objects are capable of returning their contents separately, such as every item in a vector or every row in a data frame/tibble

19.4.1 while Loops

While loops are executed as long as a condition evaluates to TRUE. Once the condition evaluates to FALSE, the loop will stop. In the example below, the while loop is initiated with while() and the code within {} is what is executed if the condition evaluates to TRUE. The condition in this case is x1 > 90; as a result, the code executes as long as the variable x1 is greater than 90. x1 initially holds a value of 100. At the end of each iteration, 1 is subtracted from x1 (100, 99, 98, 97, …). When x1 reaches a value of 90, the condition evaluates to FALSE, and the loop stops.

If the condition never evaluates to FALSE, the loop will continue indefinitely (or until it is manually stopped or the system runs out of memory or experiences some other error). This is called an infinite loop.

x1 <- 100
while(x1 > 90) {
  print(x1)
  x1 <- x1-1
}
[1] 100
[1] 99
[1] 98
[1] 97
[1] 96
[1] 95
[1] 94
[1] 93
[1] 92
[1] 91
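
To illustrate how an infinite loop can arise, the sketch below adds 1 to x1 instead of subtracting 1, so the condition x1 > 90 can never evaluate to FALSE. To keep the example safe to run, we include a hypothetical safety limit (max_iterations) and use break, which is discussed later in this chapter, to stop the loop once the limit is reached. The counter and limit are our own additions for illustration.

```r
x1 <- 100
iterations <- 0
max_iterations <- 1000   # hypothetical safety limit

while(x1 > 90){
  x1 <- x1 + 1           # x1 grows, so x1 > 90 never becomes FALSE
  iterations <- iterations + 1
  if(iterations >= max_iterations){
    break                # leave the loop once the limit is reached
  }
}
print(iterations)
```

Without the safety limit, this loop would run until it was manually stopped.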

19.4.2 for Loops

In R, we generally find that we use for loops more often than while loops. The syntax is very similar. A for loop is defined using for(), and the code within {} is executed for each iteration of the loop. In the first example, we are iterating over the vector x of country names. The variable i represents the value being used for the current iteration of the loop. The print statement is generated for each country in the vector until all countries have been processed.

x <- c("Austria", "New Zealand", "Norway", "Canada", "Cuba")
for(i in x){
  print(paste("I would like to go to ", i, ".", sep=""))
}
[1] "I would like to go to Austria."
[1] "I would like to go to New Zealand."
[1] "I would like to go to Norway."
[1] "I would like to go to Canada."
[1] "I would like to go to Cuba."

There is nothing special about the variable name i. Below, we have replicated the same code but using country as opposed to i. The important point is that the variable name defined inside of for() that represents the item for the current iteration must also be used inside of the code executed within the loop.

x <- c("Austria", "New Zealand", "Norway", "Canada", "Cuba")
for(country in x){
  print(paste("I would like to go to ", country, ".", sep=""))
}
[1] "I would like to go to Austria."
[1] "I would like to go to New Zealand."
[1] "I would like to go to Norway."
[1] "I would like to go to Canada."
[1] "I would like to go to Cuba."

For loops can be useful for executing a set of processes for a set of files. In the example below, we are listing all raster grid file names with the “.tif” extension within a directory into a vector. We then loop over this vector of file names to perform the following operations. (You cannot run this code since we did not provide the example raster data.)

  1. Read in the raster grid using the terra package
  2. Identify all cells in the grid with a value greater than 500 to generate a binary raster output
  3. Save the result to disk with the same name but in a new directory

library(terra)
raster_list <- list.files(path = "C:/Teaching/elev_data", pattern = "\\.tif$")
new_dir <- c("C:/Teaching/elev_data_out/")
for(ras in raster_list){
  r1 <- rast(paste0("C:/Teaching/elev_data/", ras))
  r2 <- r1 > 500
  writeRaster(r2, filename=paste0(new_dir, ras), filetype="GTiff", overwrite=TRUE)
}

It is important to understand that there are different means to iterate over an object. In our countries example, which is again included below, we are iterating over each element in the vector. Another option, which is demonstrated in the next code block, is to iterate over the indices for each element in the vector (1:length(x)). The result is the same, but the syntax has to be adjusted (i.e., "I would like to go to ", country, ".", sep="" vs. "I would like to go to ", x[i], ".", sep="").

x <- c("Austria", "New Zealand", "Norway", "Canada", "Cuba")
for(country in x){
  print(paste("I would like to go to ", country, ".", sep=""))
}
[1] "I would like to go to Austria."
[1] "I would like to go to New Zealand."
[1] "I would like to go to Norway."
[1] "I would like to go to Canada."
[1] "I would like to go to Cuba."
x <- c("Austria", "New Zealand", "Norway", "Canada", "Cuba")
for(i in 1:length(x)){
  print(paste("I would like to go to ", x[i], ".", sep=""))
}
[1] "I would like to go to Austria."
[1] "I would like to go to New Zealand."
[1] "I would like to go to Norway."
[1] "I would like to go to Canada."
[1] "I would like to go to Cuba."
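
One caveat when iterating over indices is that 1:length(x) does not behave as you might expect when the vector is empty, since 1:0 expands to c(1, 0). The base R function seq_along() avoids this issue. The short sketch below counts the number of iterations performed by each approach for an empty vector. The variable names are hypothetical.

```r
empty <- character(0)

# 1:length(empty) is 1:0, which expands to c(1, 0), so the loop runs twice
n_colon <- 0
for(i in 1:length(empty)){
  n_colon <- n_colon + 1
}

# seq_along(empty) has length zero, so the loop body never runs
n_seq <- 0
for(i in seq_along(empty)){
  n_seq <- n_seq + 1
}

print(c(n_colon, n_seq))
```

For non-empty vectors, 1:length(x) and seq_along(x) produce the same indices.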

As another example, below we demonstrate how to loop over rows in a data frame. Specifically, for(i in 1:nrow(course_data)) indicates that we are looping over the row indices. We then grab data for the row of interest using course_data[i, "COLUMN NAME"].

course_prefix <- c("Geog", "Geog", "Geol", "Geol", "Geog")
course_num <- c(107, 350, 101, 104, 455)
course_name <- c("Physical Geography", "GIScience", "Planet Earth", "Earth Through Time", "Remote Sensing")
enrollment <- c(210, 45, 235, 80, 35)
course_data <- data.frame(course_prefix, course_num, course_name, enrollment)

course_data |> gt::gt()
course_prefix  course_num  course_name         enrollment
Geog           107         Physical Geography  210
Geog           350         GIScience           45
Geol           101         Planet Earth        235
Geol           104         Earth Through Time  80
Geog           455         Remote Sensing      35
for(i in 1:nrow(course_data)){
  print(paste0(course_data[i, "course_prefix"],
              " ",
              as.character(course_data[i, "course_num"]),
              ": ",
              course_data[i, "course_name"]))
}
[1] "Geog 107: Physical Geography"
[1] "Geog 350: GIScience"
[1] "Geol 101: Planet Earth"
[1] "Geol 104: Earth Through Time"
[1] "Geog 455: Remote Sensing"

It is also possible to loop over column indices. In the example below, we use this method to print the data type for each column in the data frame. In the following code block, we obtain the same result by looping over the column names as opposed to their associated indices.

for(i in 1:ncol(course_data)){
  print(paste0("The data type of ", 
               "'",
               names(course_data)[i],
               "'",
               " is ", typeof(course_data[1,i]),
               "."))
}
[1] "The data type of 'course_prefix' is character."
[1] "The data type of 'course_num' is double."
[1] "The data type of 'course_name' is character."
[1] "The data type of 'enrollment' is double."
cNames <- names(course_data)
for(i in cNames){
  print(paste0("The data type of ", 
               "'",
               i,
               "'",
               " is ", typeof(course_data[1,i]),
               "."))
}
[1] "The data type of 'course_prefix' is character."
[1] "The data type of 'course_num' is double."
[1] "The data type of 'course_name' is character."
[1] "The data type of 'enrollment' is double."
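
As a possible alternative to writing a loop, base R's sapply() can apply a function to every column of a data frame in a single call. The sketch below uses a reduced, two-column version of the course data so that it is self-contained. The object name col_types is our own.

```r
# Reduced version of the course data for illustration
course_subset <- data.frame(
  course_prefix = c("Geog", "Geol"),
  course_num = c(107, 101),
  stringsAsFactors = FALSE
)

# Apply typeof() to each column; the result is a named character vector
col_types <- sapply(course_subset, typeof)
print(col_types)
```

The names of the resulting vector are the column names, and the values are the associated data types.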

19.5 Control Flow

Control flow allows you to not execute code or execute different code depending on a condition.

19.5.1 if

When code is wrapped inside of if(){}, it will only be executed if the associated condition evaluates to TRUE. In the first example, the variable a holds a value of 4, and the condition defined within if() is a <= 6. Since 4 is less than or equal to 6, the condition evaluates to TRUE, and the statement is printed. If the value associated with a were larger than 6, the code would not execute. When using if(), you must include a condition that evaluates to TRUE or FALSE.

a <- 4
if(a <= 6){
  print("Value less than or equal to 6.")
}
[1] "Value less than or equal to 6."

19.5.2 else

else allows you to include code to execute if the condition associated with if() does not evaluate to TRUE. You can think of this as defining the default code to execute or the default behavior. Below, the condition evaluates to FALSE since the value associated with a is not less than or equal to 6. The code associated with else is executed as opposed to the code associated with if(). else does not require a condition since it is the default behavior.

a <- 8
if(a <=6){
  print("Value less than or equal to 6.")
}else{
  print("Value greater than 6.")
}
[1] "Value greater than 6."

19.5.3 else if

You can test multiple conditions using if() and else if(). The first condition must use if() while all subsequent conditions must use else if(). Since 8 is between 6 and 10, the code associated with else if() is executed.

If your conditions are not mutually exclusive, the code associated with the first condition that evaluates to TRUE is executed. So, the order matters. If none of the conditions evaluate to TRUE, then the code associated with else is executed.

a <- 8
if(a <= 6){
  print("Value less than or equal to 6.")
}else if(a > 6 & a <10){
  print("Value is between 6 and 10.")
}else{
  print("Value is greater than or equal to 10.")
}
[1] "Value is between 6 and 10."
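
To illustrate why the order of conditions matters, the sketch below uses a hypothetical function, classify_value(), in which the conditions are not mutually exclusive. Because the first condition is also TRUE for any value greater than 100, the second branch can never be reached.

```r
# Hypothetical example with conditions that overlap
classify_value <- function(a){
  if(a > 0){
    return("positive")
  }else if(a > 100){
    return("greater than 100")   # never reached: any a > 100 is also > 0
  }else{
    return("zero or negative")
  }
}

print(classify_value(500))
```

This prints "positive" even though 500 is also greater than 100. Placing the more specific condition (a > 100) first would allow both branches to be reached.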

19.6 Combining For Loops and Control Flow

It is possible to combine loops and control flow to have different code execute for each iteration of the loop depending on the item used in the current iteration. As the first example demonstrates, you can generate different print statements for each iteration depending on the defined conditions.

b <- c(1, 3, 5, 7, 9, 11)
for(num in b){
  if(num <= 6){
    print("Value less than or equal to 6.")
  }else if(num > 6 & num <10){
    print("Value is between 6 and 10.")
  }else{
    print("Value is greater than or equal to 10.")
  }
}
[1] "Value less than or equal to 6."
[1] "Value less than or equal to 6."
[1] "Value less than or equal to 6."
[1] "Value is between 6 and 10."
[1] "Value is between 6 and 10."
[1] "Value is greater than or equal to 10."

Instead of printing the results, you may want to save them to an object, such as a vector or a data frame. Below, we are defining an empty vector using c(). We then insert the generated character strings into this vector with each iteration of the loop. In the second example, we write the results to columns in a new data frame.

b <- c(1, 3, 5, 7, 9, 11)
results <- c()   # empty vector used to collect the character strings
for(num in b){
  if(num <= 6){
    results <- c(results, paste(num, "is less than or equal to 6.", sep=" "))
  }else if(num > 6 & num <10){
    results <- c(results, paste(num, "is between 6 and 10.", sep=" "))
  }else{
    results <- c(results, paste(num, "is greater than or equal to 10.", sep=" "))
  }
}
print(results)
[1] "1 is less than or equal to 6."      "3 is less than or equal to 6."     
[3] "5 is less than or equal to 6."      "7 is between 6 and 10."            
[5] "9 is between 6 and 10."             "11 is greater than or equal to 10."
course_prefix <- c("Geog", "Geog", "Geol", "Geol", "Geog")
course_num <- c(107, 350, 101, 104, 455)
course_name <- c("Physical Geography", "GIScience", "Planet Earth", "Earth Through Time", "Remote Sensing")
enrollment <- c(210, 45, 235, 80, 35)
course_data <- data.frame(course_prefix, course_num, course_name, enrollment)

course_data |> gt::gt()
course_prefix  course_num  course_name         enrollment
Geog           107         Physical Geography  210
Geog           350         GIScience           45
Geol           101         Planet Earth        235
Geol           104         Earth Through Time  80
Geog           455         Remote Sensing      35
course_data2 <- data.frame(course = character(),
                           enrollment = numeric())
for(i in 1:nrow(course_data)){
  nameOut <- (paste0(course_data[i, "course_prefix"],
              " ",
              as.character(course_data[i, "course_num"]),
              ": ",
              course_data[i, "course_name"]))
  df2 <- data.frame(course=nameOut, enrollment=course_data[i, "enrollment"])
  
  course_data2 <- rbind(course_data2, df2)
}
print(course_data2)
                        course enrollment
1 Geog 107: Physical Geography        210
2          Geog 350: GIScience         45
3       Geol 101: Planet Earth        235
4 Geol 104: Earth Through Time         80
5     Geog 455: Remote Sensing         35

For loops tend to be slower than vectorized operations in R. For example, if you want to take the square root of each value in a vector, you could iterate over each element with a for loop and write the results to a new vector. Alternatively, you could simply use xSqrt <- sqrt(x). This is a vectorized version in which R performs the calculation for each element in the vector without the need for an explicit for loop. In short, for loops are not always the best solution.
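
The square root example described above can be sketched as follows. The input values and object names are hypothetical. In the loop version, we pre-allocate the output vector with numeric() rather than growing it with c() at each iteration.

```r
x <- c(1, 4, 9, 16, 25)

# Loop version: pre-allocate the output, then fill one element at a time
x_sqrt_loop <- numeric(length(x))
for(i in seq_along(x)){
  x_sqrt_loop[i] <- sqrt(x[i])
}

# Vectorized version: a single call operates on the whole vector
x_sqrt_vec <- sqrt(x)

print(identical(x_sqrt_loop, x_sqrt_vec))
```

Both approaches yield the same values, but the vectorized version is shorter and generally faster for large vectors.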

19.6.1 next and break

There are means to further control the execution of loops with associated control flow. next is used to skip the current iteration of the loop and move on to the next iteration if a condition evaluates to TRUE, while break allows us to fully stop the execution of the loop when a condition evaluates to TRUE. In our example, if the current value is odd (i.e., results in a remainder of 1 when divided by 2), that iteration is skipped. If the current value is greater than 15, then the loop is stopped. This results in only even values up to and including 14 being written to the new vector.

a <- seq(1, 21, by=1)
b <- c()
for (i in a) {
  if(i%%2 == 1){
    next
  }else if(i > 15){
    break
  }else{
      b <- c(b, i)
  }
}
print(b)
[1]  2  4  6  8 10 12 14

19.7 which

which() returns the indices of the items in a vector that meet a certain condition. In our example, only the indices associated with even values are returned. which() can also be used to query data, as is demonstrated below. Again, which() returns the indices associated with the items, not the items themselves. Printing the result of which() shows the indices, while using it in a query extracts the items at the indices that meet the criteria.

y <- seq(2, 40, 3)
print(which(y%%2 == 0))
[1]  1  3  5  7  9 11 13
print(y)
 [1]  2  5  8 11 14 17 20 23 26 29 32 35 38
print(y[which(y%%2 == 0)])
[1]  2  8 14 20 26 32 38
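
A logical vector can also be used directly for querying without which(). The two approaches give the same result unless the data contain missing values: logical indexing keeps an NA for each missing value, whereas which() drops those positions. The short sketch below uses a hypothetical vector, v, to show the difference.

```r
v <- c(2, NA, 5, 8)
is_even <- v %% 2 == 0         # TRUE NA FALSE TRUE

by_logical <- v[is_even]       # keeps an NA for the missing value
by_which <- v[which(is_even)]  # which() ignores the NA position

print(by_logical)
print(by_which)
```

Here, by_logical contains 2, NA, and 8, while by_which contains only 2 and 8.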

19.8 Final Base R Example

To tie these concepts together, we have built another function that incorporates control flow. This function allows for rescaling data using three different methods: min-max, z-score, and robust, which makes use of the median and interquartile range (IQR). The user must provide a vector to rescale along with the desired method. If an invalid method is defined, then a message is printed, as defined using message() within the default else statement. The second code block demonstrates the result if an invalid method name is provided while the last code block demonstrates the result when using the robust method.

scale2 <- function(data, method="min-max"){
  if(method=="min-max"){
    max1 <- max(data)
    min1 <- min(data)
    n <- data-min1
    d <- max1-min1
    s <- n/d
    return(s)
  }else if(method=="z-score"){
    sd1 <- sd(data)
    mn1 <- mean(data)
    s <- (data-mn1)/(sd1)
    return(s)
  }else if(method=="robust"){
    mdn1 <- median(data)
    iqr1 <- IQR(data)
    s <- (data - mdn1)/(iqr1)
    return(s)
  }else{
    message("No appropriate method provided.")
  }
}
x <- seq(1,255,3)

scale2(x, method="erfjpiiopdrfjpidr")
No appropriate method provided.
x <- seq(1,255,3)

scale2(x, method="robust")
 [1] -1.00000000 -0.97619048 -0.95238095 -0.92857143 -0.90476190 -0.88095238
 [7] -0.85714286 -0.83333333 -0.80952381 -0.78571429 -0.76190476 -0.73809524
[13] -0.71428571 -0.69047619 -0.66666667 -0.64285714 -0.61904762 -0.59523810
[19] -0.57142857 -0.54761905 -0.52380952 -0.50000000 -0.47619048 -0.45238095
[25] -0.42857143 -0.40476190 -0.38095238 -0.35714286 -0.33333333 -0.30952381
[31] -0.28571429 -0.26190476 -0.23809524 -0.21428571 -0.19047619 -0.16666667
[37] -0.14285714 -0.11904762 -0.09523810 -0.07142857 -0.04761905 -0.02380952
[43]  0.00000000  0.02380952  0.04761905  0.07142857  0.09523810  0.11904762
[49]  0.14285714  0.16666667  0.19047619  0.21428571  0.23809524  0.26190476
[55]  0.28571429  0.30952381  0.33333333  0.35714286  0.38095238  0.40476190
[61]  0.42857143  0.45238095  0.47619048  0.50000000  0.52380952  0.54761905
[67]  0.57142857  0.59523810  0.61904762  0.64285714  0.66666667  0.69047619
[73]  0.71428571  0.73809524  0.76190476  0.78571429  0.80952381  0.83333333
[79]  0.85714286  0.88095238  0.90476190  0.92857143  0.95238095  0.97619048
[85]  1.00000000
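
As an alternative design, the same three methods could be implemented with base R's match.arg() and switch(). This is only a sketch: the function name scale3() is hypothetical, and its behavior differs from scale2() in that an invalid method stops with an error instead of printing a message. Note that match.arg() also accepts unambiguous partial matches (for example, "rob" for "robust").

```r
# Hypothetical alternative to scale2() using match.arg() and switch()
scale3 <- function(data, method=c("min-max", "z-score", "robust")){
  # Validates method against the choices above; the first choice is the default
  method <- match.arg(method)
  switch(method,
    "min-max" = (data - min(data)) / (max(data) - min(data)),
    "z-score" = (data - mean(data)) / sd(data),
    "robust"  = (data - median(data)) / IQR(data)
  )
}

x <- seq(1, 255, 3)
print(range(scale3(x, method="robust")))
```

For this input, the robust result ranges from -1 to 1, which agrees with the output shown above.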

19.9 tidyverse Examples

The tidyverse, and specifically the dplyr and purrr packages, also provide functionality that can accomplish tasks similar to the methods discussed above. Here, we introduce a few methods; however, this is not meant to be a detailed discussion of the tidyverse functionality. For this section we use the us_county_data.csv file used in the last chapter.

library(tidyverse)

fldrPth <- "gslrData/chpt19/data/"

cntyD <- read_csv(str_glue("{fldrPth}us_county_data.csv")) |> 
  mutate_if(is.character, as.factor)

19.9.1 Example 1: case_when()

The case_when() function from dplyr is a vectorized version of control flow. We would like to generate a new column in the table that categorizes the counties based on their mean elevations as either “low elevation” (< 200 meters), “moderate elevation” (>= 200 & < 1400 meters), or “high elevation” (>= 1400 meters). This can be accomplished using a combination of mutate() and case_when(). Note that it is also possible to specify a default value using the .default argument, which serves the same purpose as the else statement in control flow. Just to check the results, we also count the number of counties assigned to each elevation categorization.

cntyD <- cntyD |>
  mutate(elevCat = case_when(
    dem < 200 ~ "low elevation", 
    dem >= 200 & dem < 1400 ~ "moderate elevation",
    dem >= 1400 ~ "high elevation")
    )

cntyD |> 
  group_by(elevCat) |>
  count() |>
  ungroup() |> gt::gt()
elevCat             n
high elevation      207
low elevation       1013
moderate elevation  1884
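
The .default argument mentioned above can be sketched as follows. This example uses a small, made-up data frame of elevations rather than the county data, and it assumes dplyr version 1.1.0 or later, in which .default is available. The label "unknown" is our own choice.

```r
library(dplyr)

# Hypothetical elevations in meters; NA stands in for a missing value
elev <- data.frame(dem = c(50, 200, 1399, 1400, NA))

elev <- elev |>
  mutate(elevCat = case_when(
    dem < 200 ~ "low elevation",
    dem >= 200 & dem < 1400 ~ "moderate elevation",
    dem >= 1400 ~ "high elevation",
    .default = "unknown")   # used when no condition evaluates to TRUE
  )

print(elev)
```

The row with a missing elevation does not satisfy any of the three conditions, so it is assigned the default value.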

19.9.2 Example 2: across()

The across() function from dplyr is used to apply a function across multiple columns. This is a form of functional programming since across() can accept another function as an input. In our first example, we are calculating the median by sub-region for all columns whose names contain “per_”. In the second block, we convert all columns whose names contain “per_” to proportions by dividing by 100.

cntyD |>
  group_by(SUB_REGION) |>
  summarize(across(contains("per_"), ~ median(.x, na.rm=TRUE))) |> gt::gt()
SUB_REGION per_desk_lap per_smartphone per_no_comp per_internet per_broadband per_no_internet per_for per_dev per_wet per_crop per_past_grass per_karst
E N Cen 72.27635 77.10856 11.276944 81.62938 81.19393 15.59331 19.95011 7.949186 3.324924 43.7808852 6.714078 6.1374249
E S Cen 61.65756 74.90408 16.298296 74.94303 74.49422 21.96604 51.17073 6.474498 2.016102 2.5922327 17.122029 9.3044943
Mid Atl 75.53235 76.15280 10.735288 83.16127 82.53561 14.18328 53.15554 10.903946 4.102675 4.4547073 11.939469 1.1394162
Mtn 76.38125 78.56885 9.522525 82.48311 81.67203 14.67247 14.08876 1.318203 1.098844 2.2233125 16.128277 0.4319281
N Eng 80.62692 79.37880 8.412642 86.10956 85.37345 11.14844 62.52341 10.912455 9.756564 0.4973931 5.391075 0.0000000
Pacific 78.92488 83.07681 7.673608 86.11314 85.70953 11.43221 35.71459 5.097165 1.325634 2.2844083 16.283601 0.0000000
S Atl 69.00578 76.99552 12.972620 78.00948 77.54210 19.07997 42.77621 9.009878 5.982395 2.1454124 9.733641 0.0000000
W N Cen 72.16934 76.04420 11.809872 80.29481 79.76734 16.66382 3.27932 4.641385 1.909883 48.3399893 24.056722 1.6463494
W S Cen 64.19260 79.37310 12.865344 76.60695 76.14822 20.76308 12.78507 4.969078 1.921824 5.6029459 22.462895 0.0000000
cntyD |>
  mutate(across(contains("per_"), ~ .x/100)) |>
  select(NAME, contains("per_")) |>
  head() |> gt::gt()
NAME per_desk_lap per_smartphone per_no_comp per_internet per_broadband per_no_internet per_for per_dev per_wet per_crop per_past_grass per_karst
Autauga County 0.7413609 0.8100561 0.08571826 0.8279605 0.8270792 0.1533930 0.4971597 0.06815302 0.108697204 0.0314176012 0.2079571 0.0000000000
Baldwin County 0.7773865 0.8463598 0.08239437 0.8552358 0.8506907 0.1191952 0.3313072 0.09449728 0.306357571 0.0919548790 0.1080689 0.0000000000
Barbour County 0.5245655 0.6869770 0.20220983 0.6499678 0.6463205 0.2896374 0.5658767 0.04227699 0.088715967 0.0453634268 0.1282374 0.1491452991
Bibb County 0.5617854 0.7361896 0.16779171 0.7616752 0.7612619 0.2093952 0.7028501 0.05258896 0.065578020 0.0006282062 0.1007816 0.1105620754
Blount County 0.6686631 0.7643952 0.14963452 0.8003301 0.7962273 0.1848621 0.5495027 0.08577787 0.006341722 0.0118411362 0.3043296 0.1819798459
Bullock County 0.5523476 0.6926218 0.20326626 0.6278798 0.6062992 0.3032954 0.5464416 0.02983319 0.130312271 0.0203029122 0.1764950 0.0006184292

19.9.3 Example 3: map()

The map() function from purrr is used to perform an operation for each element in a vector or list. Here, we use it to read all CSV files in a vector of CSV file paths. We use the data provided in the csvSets folder. You may have to change your working directory to execute the code. map() is used to read all the files in the vector of file paths using read_csv() from readr. This results in a list object where the data from each CSV file are stored as separate tibble objects within the list. All the records are then collapsed to a single tibble using list_rbind(). The result is all 3,104 records from the 40 CSV files stored in the directory being aggregated to a single tibble.

csvPth <- "gslrData/chpt19/data/csvSets/"
csvFiles <- list.files(csvPth, pattern = "\\.csv$", full.names=TRUE) 

csvData <- csvFiles |>
  map(read_csv) |>
  list_rbind()
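
If you do not have the example data, the same pattern can be tried with the self-contained sketch below. It writes two small, made-up CSV files to a temporary folder and then reads and combines them. For simplicity, it uses base R's read.csv() in place of read_csv(), and it assumes purrr version 1.0.0 or later, in which list_rbind() is available. All file and object names are hypothetical.

```r
library(purrr)

# Write two small, made-up CSV files to a temporary folder
tmpDir <- file.path(tempdir(), "csvDemo")
dir.create(tmpDir, showWarnings=FALSE)
write.csv(data.frame(id=1:2, value=c(10, 20)),
          file.path(tmpDir, "a.csv"), row.names=FALSE)
write.csv(data.frame(id=3:5, value=c(30, 40, 50)),
          file.path(tmpDir, "b.csv"), row.names=FALSE)

# List the files, read each one, then stack the rows into one data frame
demoFiles <- list.files(tmpDir, pattern="\\.csv$", full.names=TRUE)

demoData <- demoFiles |>
  map(read.csv) |>
  list_rbind()

print(nrow(demoData))
```

The combined data frame contains the five rows from the two files.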

These few examples just scratch the surface of using dplyr and purrr. For further exploration, we recommend the text R for Data Science.

19.10 Concluding Remarks

Now that we have covered base R syntax, the tidyverse, and core coding concepts relating to creating your own functions and implementing loops and control flow, we will move on to discuss data visualization using ggplot2, which is the focus of the next two chapters. Following that, we will discuss designing tables using gt.

19.11 Questions

  1. Explain the difference between local and global scope for a variable.
  2. Explain the difference between a for loop and a while loop as implemented in R.
  3. Describe an example of a while loop that would result in an infinite loop.
  4. Why is it not necessary to include a condition with else within control flow?
  5. Explain the difference between next and break.
  6. Explain the concept of functional programming.
  7. What is the purpose of the across() function?
  8. What is the purpose of the map() function?

19.12 Exercises

Overall Accuracy Function

Task 1

Create a function that will calculate overall accuracy for a classification when given the correct class and the predicted class. A dataset has been provided (classification_data.csv) in the exercise folder for the chapter, which contains three columns: “class”, “spec”, and “spec_lidar”. The “class” column contains the correct classification (what the sample actually was) while the “spec” and “spec_lidar” columns contain the predicted classification (what an algorithm predicted the class to be). Specifically, the “spec” column is a result obtained using just spectral image bands while the “spec_lidar” column is a result obtained using a combination of spectral bands and light detection and ranging (lidar) data.

Create a function that will generate a confusion matrix from the correct and predicted data, calculate overall accuracy from the table, then return the overall accuracy result. Note that the table() function can be used to create a contingency table or confusion matrix. The diag() function can be used to extract values in the diagonal cells that represent the correct predictions. You will also need to use sum().

Use your new function to calculate the overall accuracy for the spectral only and spectral + lidar results. Which model yielded the highest overall accuracy?

Task 2

Combine a for loop and if else statement to write all rows or samples in the classification_data.csv file that were correctly predicted using the spectral and lidar (“spec_lidar”) data to a new data frame and all incorrectly classified rows to a different data frame.

Picture Editing Function/Loop

Task 1

You have been provided with a folder of pictures (pictures) in the exercise folder for the chapter. The photos are from the southwestern United States. This assignment asks you to write a function to perform some editing tasks on these images. Hint: the imager package can be used to process images in R. Create a function with the following characteristics.

  1. Function accepts the following parameters: an input image, a Boolean variable indicating whether or not to crop the image, lower bounds representing the percent of the image to crop, upper bounds representing the percent of the image to crop, a Boolean variable indicating whether or not to resize the image, a resizing factor, and a Boolean variable indicating whether or not to convert the image to grayscale.
  2. Function is able to crop an image by a random percentage in both the x and y directions within the specified upper and lower bounds. For example, if the lower bound is set to 20% and the upper bound is set to 40%, a random percentage in this range will be selected then used to crop the image. Different random values between the lower and upper bounds should be able to be applied for the x- and y-direction crops.
  3. Function is able to resize the image by the specified factor. For example, if the factor is 1, this means that the image is not resized. If the factor is 0.5, the resolution is decreased by half. The image should maintain its original aspect ratio.
  4. Function should be able to convert the image to grayscale if the user specifies this.
  5. Function should return the image object.

Task 2

Use the function within a for loop to process all of the images in the folder. Each iteration of the loop should process one image. Save the results to a new folder on disk. Plot one original and processed image pair.