19 Base R (Part II)

19.1 Topics Covered

Define and use your own functions
Implement while and for loops
Implement control flow
Vectorized options with dplyr and purrr

19.2 Introduction

Now that we have covered base R and tidyverse syntax, we will introduce some other core coding concepts and their implementations in R. Specifically, we will explore

Functions: allow you to build your own tools or snippets of code that can be reused
Loops: allow for iteration over data or elements, such as rows in a data frame/tibble or items in a vector
Control flow: use conditions to determine if or what code is executed
Some alternatives using the tidyverse

Although these topics are discussed in the context of R, they are core to programming and scripting in general.

19.3 Functions

A function is a set of code that can be reused. If you find yourself using the same lines of code repeatedly, you may want to convert them into a function. This can make your code both more concise and easier to edit. For example, if you want to update the code, you can simply update the function as opposed to updating every instance of repeating lines.

In our first example below, we are defining a function to rescale a vector. In R, function() is used to define a new function. Our function is given the name scale2, and it has two parameters: data and scale. The values or inputs assigned to the parameters are called arguments. It is possible to define default arguments for parameters. In our example, the default argument for scale is 1. What the function actually does is defined inside of {}. This function does the following:

Calculates the largest value in the vector using max()
Calculates the smallest value in the vector using min()
Subtracts the smallest value from each value in the vector to obtain the numerator (n)
Calculates the range for the denominator (d)
Divides n by d to rescale the data from zero to one.
Further rescales the data by multiplying by the scale argument.

This function specifically implements min-max rescaling to a range of zero to one then multiplies by the scale argument to rescale to a new range.

This is a good time to discuss the concept of scope in code. All of the variables defined inside of the function have local scope: they can only be used or called inside of the function. This is in contrast to variables that are created or declared outside of a function, which have global scope and can be called or used anywhere in the code.
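As a minimal sketch of the distinction, the hypothetical function below (shift_and_scale, with the made-up variables multiplier and local_shift, none of which appear elsewhere in this chapter) reads a global variable from inside the function, while the variable created inside the function is not visible afterwards.

```r
multiplier <- 2                        # global: defined outside any function

shift_and_scale <- function(values) {
  local_shift <- 10                    # local: exists only inside this call
  result <- (values + local_shift) * multiplier  # the global can be read here
  return(result)
}

result <- shift_and_scale(c(1, 2, 3))
print(result)                          # 22 24 26
print(exists("local_shift"))           # FALSE: the local variable is gone
```

Relying on globals inside a function makes it harder to reuse, so passing values in as arguments is usually the safer design.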

The return() function defines what the function returns or produces. In this example, the rescaled data are returned as a vector. Note that functions can return a variety of different types of objects including vectors, matrices, arrays, data frames, and lists.

Executing the code below will instantiate the scale2() function. Once instantiated, it can then be used. In the following blocks of code we generate a vector of values then rescale the values to ranges of 0 to 100 and 0 to 1. When rescaling to a range of 0 to 1, we do not need to provide an argument for the scale parameter since 1 is the default.

scale2 <- function(data, scale=1){
  max1 <- max(data)
  min1 <- min(data)
  n <- data-min1
  d <- max1-min1
  s <- n/d
  s2 <- s*scale
  return(s2)
}

x <- c(1, 14, 21, 16, 18, 16, 19, 20, 6, 8, 9, 11, 17)
x100 <- scale2(x, 100)
x1 <- scale2(x)
print(x)

 [1]  1 14 21 16 18 16 19 20  6  8  9 11 17

print(x100)

 [1]   0  65 100  75  85  75  90  95  25  35  40  50  80

print(x1)

 [1] 0.00 0.65 1.00 0.75 0.85 0.75 0.90 0.95 0.25 0.35 0.40 0.50 0.80

Next, we define a more complicated function that calculates root mean square error (RMSE) to assess the precision of a georeferencing process. The function is named rmse_georef() and has 4 parameters:

x_c: correct x coordinates
y_c: correct y coordinates
x_p: predicted x coordinates
y_p: predicted y coordinates

Each argument should be a vector of coordinates, all of the same length. The code inside the function calculates RMSE_x, RMSE_y, and RMSE_Total. The output is a list object that contains vectors of residuals in the x and y directions and the three RMSE metrics. We also rename the objects within the list to make them more interpretable.
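For reference, the three metrics can be written out as follows. This is a sketch that assumes n paired coordinates, with the subscript i (our notation, not the chapter's) indexing each pair; it mirrors the calculations in the function defined below.

```latex
\[
\begin{aligned}
\mathrm{RMSE}_x &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{c,i} - x_{p,i}\right)^2} \\
\mathrm{RMSE}_y &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{c,i} - y_{p,i}\right)^2} \\
\mathrm{RMSE}_{\mathrm{Total}} &= \sqrt{\mathrm{RMSE}_x^2 + \mathrm{RMSE}_y^2}
\end{aligned}
\]
```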

Once the function is initialized, it can be used by providing arguments for all four parameters. The objects contained in the resulting list can be accessed using $ and the object name.

rmse_georef <- function(x_c, y_c, x_p, y_p){
  x_residual <- x_c - x_p
  y_residual <- y_c - y_p
  x_residual_sq <- x_residual^2
  y_residual_sq <- y_residual^2
  rmse_x <- sqrt(sum(x_residual_sq)/length(x_residual))
  rmse_y <- sqrt(sum(y_residual_sq)/length(y_residual))
  rmse_total <- sqrt(rmse_x^2 + rmse_y^2)
  rmse_list <- list(x_residual, 
                    y_residual, 
                    rmse_x, 
                    rmse_y, 
                    rmse_total)
  names(rmse_list) <- c("X.Residuals", 
                        "Y.Residuals", 
                        "RMSE.X", 
                        "RMSE.Y", 
                        "RMSE.Total")
  return(rmse_list)
}

x_actual <- c(584026.624, 583179.7805, 589507.5837, 579463.0782, 585908.4986, 588190.2715)
y_actual <- c(4474131.442, 4479283.074, 4476648.449, 4478436.23, 4470697.021, 4480318.105)
x_predicted <-c(584041.7902, 583211.7964, 589496.2211, 579447.4653, 585909.7985, 588206.0155)
y_predicted <- c(4474159.608, 4479295.524, 4476664.073, 4478462.252, 4470719.12, 4480344.345)
example_rmse <-rmse_georef(x_c=x_actual, 
                           y_c =y_actual, 
                           x_p=x_predicted,
                           y_p=y_predicted)
print(example_rmse$RMSE.Total)

[1] 28.64713

print(example_rmse$RMSE.X)

[1] 17.68929

print(example_rmse$RMSE.Y)

[1] 22.53325

19.4 Loops

Loops allow for iterating over elements to perform an operation multiple times. There are two types of loops implemented in R.

While Loop: continues to iterate until a condition evaluates to FALSE
For Loop: iterates over all elements in an iterable object; these types of objects are capable of returning their contents separately, such as every item in a vector or every row in a data frame/tibble

19.4.1 `while` Loops

While loops are executed as long as a condition evaluates to TRUE. Once the condition evaluates to FALSE, the loop will stop. In the example below, the while loop is initiated with while() and the code within {} is what is executed if the condition evaluates to TRUE. The condition in this case is x1 > 90; as a result, the code executes as long as the variable x1 is greater than 90. x1 initially holds a value of 100. At the end of each iteration, 1 is subtracted from x1 (100, 99, 98, 97, …). When x1 reaches a value of 90, the condition evaluates to FALSE, and the loop stops.

If the condition never evaluates to FALSE, the loop will continue indefinitely (or until it is manually stopped or the system runs out of memory or experiences some other error). This is called an infinite loop.

x1 <- 100
while(x1 > 90) {
  print(x1)
  x1 <- x1 - 1
}

[1] 100
[1] 99
[1] 98
[1] 97
[1] 96
[1] 95
[1] 94
[1] 93
[1] 92
[1] 91
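One way to guard against an accidental infinite loop is to cap the number of iterations. The sketch below deliberately moves x1 away from the stopping condition and relies on a hypothetical max_iterations counter plus break (covered later in this chapter) to exit. The cap of 1000 is an arbitrary choice for illustration.

```r
x1 <- 100
iterations <- 0
max_iterations <- 1000     # hypothetical safety cap

while (x1 > 90) {
  x1 <- x1 + 1             # bug on purpose: x1 never drops to 90
  iterations <- iterations + 1
  if (iterations >= max_iterations) {
    message("Stopping after ", iterations, " iterations.")
    break                  # leave the loop even though x1 > 90 is still TRUE
  }
}

print(x1)                  # 1100
```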

19.4.2 `for` Loops

In R, we generally find that we use for loops more often than while loops. The syntax is very similar. A for loop is defined using for(), and the code within {} is executed for each iteration of the loop. In the first example, we are iterating over the vector x of country names. The variable i represents the value being used for the current iteration of the loop. The print statement is generated for each country in the vector until all countries have been processed.

x <- c("Austria", "New Zealand", "Norway", "Canada", "Cuba")
for(i in x){
  print(paste("I would like to go to ", i, ".", sep=""))
}

[1] "I would like to go to Austria."
[1] "I would like to go to New Zealand."
[1] "I would like to go to Norway."
[1] "I would like to go to Canada."
[1] "I would like to go to Cuba."

There is nothing special about the variable name i. Below, we have replicated the same code but using country as opposed to i. The important point is that the variable name defined inside of for() that represents the item for the current iteration must also be used inside of the code executed within the loop.

x <- c("Austria", "New Zealand", "Norway", "Canada", "Cuba")
for(country in x){
  print(paste("I would like to go to ", country, ".", sep=""))
}

[1] "I would like to go to Austria."
[1] "I would like to go to New Zealand."
[1] "I would like to go to Norway."
[1] "I would like to go to Canada."
[1] "I would like to go to Cuba."

For loops can be useful for executing a set of processes for a set of files. In the example below, we are listing all raster grid file names with the “.tif” extension within a directory into a vector. We then loop over this vector of file names to perform the following operations. (You cannot run this code since we did not provide the example raster data.)

Read in the raster grid using the terra package
Identify all cells in the grid with a value greater than 500 to generate a binary raster output
Save the result to disk with the same name but in a new directory

library(terra)
raster_list <- list.files(path = "C:/Teaching/elev_data", pattern = "\\.tif$")
new_dir <- c("C:/Teaching/elev_data_out/")
for(ras in raster_list){
  r1 <- rast(paste0("C:/Teaching/elev_data/", ras))
  r2 <- r1 > 500
  # terra's writeRaster() takes filetype (format belongs to the older raster package)
  writeRaster(r2, filename=paste0(new_dir, ras), filetype="GTiff", overwrite=TRUE)
}

It is important to understand that there are different means to iterate over an object. In our countries example, which is again included below, we are iterating over each element in the vector. Another option, which is demonstrated in the next code block, is to iterate over the indices for each element in the vector (1:length(x)). The result is the same, but the syntax has to be adjusted (i.e., "I would like to go to ", country, ".", sep="" vs. "I would like to go to ", x[i], ".", sep="").

x <- c("Austria", "New Zealand", "Norway", "Canada", "Cuba")
for(country in x){
  print(paste("I would like to go to ", country, ".", sep=""))
}

[1] "I would like to go to Austria."
[1] "I would like to go to New Zealand."
[1] "I would like to go to Norway."
[1] "I would like to go to Canada."
[1] "I would like to go to Cuba."

x <- c("Austria", "New Zealand", "Norway", "Canada", "Cuba")
for(i in 1:length(x)){
  print(paste("I would like to go to ", x[i], ".", sep=""))
}

[1] "I would like to go to Austria."
[1] "I would like to go to New Zealand."
[1] "I would like to go to Norway."
[1] "I would like to go to Canada."
[1] "I would like to go to Cuba."
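When looping over indices, seq_along(x) is often a safer choice than 1:length(x). The sketch below, which uses a made-up empty vector, shows why: for a vector of length zero, 1:length(x) counts down from 1 to 0, so the loop body still runs twice.

```r
empty <- character(0)

# 1:length(empty) is 1:0, i.e. c(1, 0), so this loop body runs twice
count_colon <- 0
for (i in 1:length(empty)) {
  count_colon <- count_colon + 1
}

# seq_along(empty) is integer(0), so this loop body never runs
count_seq <- 0
for (i in seq_along(empty)) {
  count_seq <- count_seq + 1
}

print(count_colon)   # 2
print(count_seq)     # 0
```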

As another example, below we demonstrate how to loop over rows in a data frame. Specifically, for(i in 1:nrow(course_data)) indicates to loop over the row indices. We then grab data for the row of interest using course_data[i, "COLUMN NAME"].

course_prefix <- c("Geog", "Geog", "Geol", "Geol", "Geog")
course_num <- c(107, 350, 101, 104, 455)
course_name <- c("Physical Geography", "GIScience", "Planet Earth", "Earth Through Time", "Remote Sensing")
enrollment <- c(210, 45, 235, 80, 35)
course_data <- data.frame(course_prefix, course_num, course_name, enrollment)

course_data |> gt::gt()

course_prefix	course_num	course_name	enrollment
Geog	107	Physical Geography	210
Geog	350	GIScience	45
Geol	101	Planet Earth	235
Geol	104	Earth Through Time	80
Geog	455	Remote Sensing	35

for(i in 1:nrow(course_data)){
  print(paste0(course_data[i, "course_prefix"],
              " ",
              as.character(course_data[i, "course_num"]),
              ": ",
              course_data[i, "course_name"]))
}

[1] "Geog 107: Physical Geography"
[1] "Geog 350: GIScience"
[1] "Geol 101: Planet Earth"
[1] "Geol 104: Earth Through Time"
[1] "Geog 455: Remote Sensing"

It is also possible to loop over column indices. In the example below, we use this method to print the data type for each column in the data frame. In the following code block, we obtain the same result by looping over the column names as opposed to their associated indices.

for(i in 1:ncol(course_data)){
  print(paste0("The data type of ", 
               "'",
               names(course_data)[i],
               "'",
               " is ", typeof(course_data[1,i]),
               "."))
}

[1] "The data type of 'course_prefix' is character."
[1] "The data type of 'course_num' is double."
[1] "The data type of 'course_name' is character."
[1] "The data type of 'enrollment' is double."

cNames <- names(course_data)
for(i in cNames){
  print(paste0("The data type of ", 
               "'",
               i,
               "'",
               " is ", typeof(course_data[1,i]),
               "."))
}

[1] "The data type of 'course_prefix' is character."
[1] "The data type of 'course_num' is double."
[1] "The data type of 'course_name' is character."
[1] "The data type of 'enrollment' is double."
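For a task like this, base R's sapply() can apply a function to every column without an explicit loop. The sketch below rebuilds a smaller, two-row version of the course table so that it runs on its own, and it assumes R 4.0 or later, where data.frame() keeps strings as characters by default.

```r
course_data <- data.frame(
  course_prefix = c("Geog", "Geol"),
  course_num = c(107, 101),
  course_name = c("Physical Geography", "Planet Earth"),
  enrollment = c(210, 235)
)

# Apply typeof() to each column; the result is a named character vector
col_types <- sapply(course_data, typeof)
print(col_types)
```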

19.5 Control Flow

Control flow allows you to not execute code or execute different code depending on a condition.

19.5.1 `if`

When code is wrapped inside of if(){}, it will only be executed if the associated condition evaluates to TRUE. In the first example, the variable a holds a value of 4, and the condition defined within if() is a <= 6. Since 4 is less than or equal to 6, the condition evaluates to TRUE, and the statement is printed. If the value associated with a were larger than 6, the code would not execute. When using if(), you must include a condition that evaluates to a single TRUE or FALSE.

a <- 4
if(a <= 6){
  print("Value less than or equal to 6.")
}

[1] "Value less than or equal to 6."
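The condition passed to if() needs to be a single logical value. In R 4.2.0 and later, supplying a logical vector with more than one element is an error. A minimal sketch, using a made-up vector named values, of collapsing a vector comparison with any() or all() first:

```r
values <- c(4, 8, 2)

# values <= 6 has three elements, so collapse it to one TRUE/FALSE
any_low <- any(values <= 6)
all_low <- all(values <= 6)

if (any_low) {
  print("At least one value is less than or equal to 6.")
}

if (all_low) {
  print("All values are less than or equal to 6.")
} else {
  print("Not all values are less than or equal to 6.")
}
```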

19.5.2 `else`

else allows you to include code to execute if the condition associated with if() does not evaluate to TRUE. You can think of this as defining the default code to execute or the default behavior. Below, the condition evaluates to FALSE since the value associated with a is not less than or equal to 6. The code associated with else is executed as opposed to the code associated with if(). Unlike if(), else is not followed by parentheses and does not require a condition since it is the default behavior.

a <- 8
if(a <= 6){
  print("Value less than or equal to 6.")
}else{
  print("Value greater than 6.")
}

[1] "Value greater than 6."

19.5.3 `else if`

You can test multiple conditions using if() and else if(). The first condition must use if() while all subsequent conditions must use else if(). Since 8 is between 6 and 10, the code associated with else if() is executed.

If your conditions are not mutually exclusive, the code associated with the first condition that evaluates to TRUE is executed. So, the order matters. If none of the conditions evaluate to TRUE, then the code associated with else is executed.

a <- 8
if(a <= 6){
  print("Value less than or equal to 6.")
}else if(a > 6 & a < 10){
  print("Value is between 6 and 10.")
}else{
  print("Value is greater than or equal to 10.")
}

[1] "Value is between 6 and 10."
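To illustrate why order matters when conditions overlap, the sketch below defines two hypothetical functions (classify_first and classify_reordered, not part of the chapter) that test the same two conditions in a different order. Any value above 100 is also above 0, so whichever test comes first wins.

```r
classify_first <- function(a) {
  if (a > 0) {
    return("positive")
  } else if (a > 100) {
    return("large")      # never reached: a > 100 implies a > 0
  } else {
    return("zero or negative")
  }
}

classify_reordered <- function(a) {
  if (a > 100) {         # the more specific condition is tested first
    return("large")
  } else if (a > 0) {
    return("positive")
  } else {
    return("zero or negative")
  }
}

print(classify_first(500))      # "positive"
print(classify_reordered(500))  # "large"
```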

19.6 Combining For Loops and Control Flow

It is possible to combine loops and control flow to have different code execute for each iteration of the loop depending on the item used in the current iteration. As the first example demonstrates, you can generate different print statements for each iteration depending on the defined conditions.

b <- c(1, 3, 5, 7, 9, 11)
for(num in b){
  if(num <= 6){
    print("Value less than or equal to 6.")
  }else if(num > 6 & num <10){
    print("Value is between 6 and 10.")
  }else{
    print("Value is greater than or equal to 10.")
  }
}

[1] "Value less than or equal to 6."
[1] "Value less than or equal to 6."
[1] "Value less than or equal to 6."
[1] "Value is between 6 and 10."
[1] "Value is between 6 and 10."
[1] "Value is greater than or equal to 10."

Instead of printing the results, you may want to save them to an object, such as a vector or a data frame. Below, we are defining an empty vector using c(). We then insert the generated character strings into this vector with each iteration of the loop. In the second example, we write the results to columns in a new data frame.

b <- c(1, 3, 5, 7, 9, 11)
# Named msgs rather than c so the vector does not share a name with the c() function
msgs <- c()
for(num in b){
  if(num <= 6){
    msgs <- c(msgs, paste(num, "is less than or equal to 6.", sep=" "))
  }else if(num > 6 & num < 10){
    msgs <- c(msgs, paste(num, "is between 6 and 10.", sep=" "))
  }else{
    msgs <- c(msgs, paste(num, "is greater than or equal to 10.", sep=" "))
  }
}
print(msgs)

[1] "1 is less than or equal to 6."      "3 is less than or equal to 6."     
[3] "5 is less than or equal to 6."      "7 is between 6 and 10."            
[5] "9 is between 6 and 10."             "11 is greater than or equal to 10."

course_prefix <- c("Geog", "Geog", "Geol", "Geol", "Geog")
course_num <- c(107, 350, 101, 104, 455)
course_name <- c("Physical Geography", "GIScience", "Planet Earth", "Earth Through Time", "Remote Sensing")
enrollment <- c(210, 45, 235, 80, 35)
course_data <- data.frame(course_prefix, course_num, course_name, enrollment)

course_data |> gt::gt()

course_prefix	course_num	course_name	enrollment
Geog	107	Physical Geography	210
Geog	350	GIScience	45
Geol	101	Planet Earth	235
Geol	104	Earth Through Time	80
Geog	455	Remote Sensing	35

course_data2 <- data.frame(course = character(),
                           enrollment = numeric())
for(i in 1:nrow(course_data)){
  nameOut <- (paste0(course_data[i, "course_prefix"],
              " ",
              as.character(course_data[i, "course_num"]),
              ": ",
              course_data[i, "course_name"]))
  df2 <- data.frame(course=nameOut, enrollment=course_data[i, "enrollment"])
  
  course_data2 <- rbind(course_data2, df2)
}
print(course_data2)

                        course enrollment
1 Geog 107: Physical Geography        210
2          Geog 350: GIScience         45
3       Geol 101: Planet Earth        235
4 Geol 104: Earth Through Time         80
5     Geog 455: Remote Sensing         35

For loops tend to be slower than vectorized operations in R. For example, if you want to take the square root of each value in a vector, you could iterate over each element in the vector with a for loop and write the results to a new vector. Alternatively, you could simply do xSqrt <- sqrt(x). This is a vectorized version where R performs the calculation for each element in the vector without you needing to do so using a for loop. In short, for loops are not always the best solution.
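The comparison described above can be sketched as follows, using a short made-up vector. The loop version preallocates its output with numeric() rather than growing it on each iteration, and both approaches produce the same values.

```r
x <- c(1, 4, 9, 16, 25)

# Loop version: preallocate the output, then fill it one element at a time
x_sqrt_loop <- numeric(length(x))
for (i in seq_along(x)) {
  x_sqrt_loop[i] <- sqrt(x[i])
}

# Vectorized version: a single call handles every element
x_sqrt_vec <- sqrt(x)

print(x_sqrt_vec)                           # 1 2 3 4 5
print(identical(x_sqrt_loop, x_sqrt_vec))   # TRUE
```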

19.6.1 `next` and `break`

There are means to further control the execution of loops with associated control flow. next is used to skip the current iteration of the loop and move on to the next iteration if a condition evaluates to TRUE, while break allows us to fully stop the execution of the loop when a condition evaluates to TRUE. In our example, if the current value is odd (i.e., results in a remainder of 1 when divided by 2), that iteration is skipped. If the current value is greater than 15, then the loop is stopped. Because odd values are skipped before this check, the loop stops when it reaches 16. This results in only even values up to and including 14 being written to the new vector.

a <- seq(1, 21, by=1)
b <- c()
for (i in a) {
  if(i%%2 == 1){
    next
  }else if(i > 15){
    break
  }else{
    b <- c(b, i)
  }
}
print(b)

[1]  2  4  6  8 10 12 14

19.7 `which`

which() returns the indices of the items in a vector that meet a certain condition. In our example, only indices associated with even values are returned. which() can also be used to query data, as is demonstrated below. Again, which() returns the indices associated with the items, not the items themselves. Printing the result of which() shows the indices, while using it inside [] selects the items at the indices that meet the criteria.

y <- seq(2, 40, 3)
print(which(y%%2 == 0))

[1]  1  3  5  7  9 11 13

print(y)

 [1]  2  5  8 11 14 17 20 23 26 29 32 35 38

print(y[which(y%%2 == 0)])

[1]  2  8 14 20 26 32 38
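A logical vector can also be used directly inside [] without which(). The sketch below suggests the two approaches agree when there are no missing values but differ when NA is present: which() drops the positions where the condition is NA, while logical indexing keeps them as NA. The vector y_na is made up for illustration.

```r
y <- seq(2, 40, 3)
is_even <- y %% 2 == 0

by_index <- y[which(is_even)]   # index-based selection
by_logical <- y[is_even]        # logical selection
print(identical(by_index, by_logical))   # TRUE

y_na <- c(2, NA, 8)
with_which <- y_na[which(y_na %% 2 == 0)]
with_logical <- y_na[y_na %% 2 == 0]
print(with_which)     # 2 8
print(with_logical)   # 2 NA 8
```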

19.8 Final Base R Example

To tie these concepts together, we have built another function that incorporates control flow. This function allows for rescaling data using three different methods: min-max, z-score, and robust, which makes use of the median and interquartile range (IQR). The user must provide a vector to rescale along with the desired method. If an invalid method is defined, then a message is printed, as defined using message() within the default else statement. The second code block demonstrates the result if an invalid method name is provided while the last code block demonstrates the result when using the robust method.

scale2 <- function(data, method="min-max"){
  if(method=="min-max"){
    max1 <- max(data)
    min1 <- min(data)
    n <- data-min1
    d <- max1-min1
    s <- n/d
    return(s)
  }else if(method=="z-score"){
    sd1 <- sd(data)
    mn1 <- mean(data)
    s <- (data-mn1)/(sd1)
    return(s)
  }else if(method=="robust"){
    mdn1 <- median(data)
    iqr1 <- IQR(data)
    s <- (data - mdn1)/(iqr1)
    return(s)
  }else{
    message("No appropriate method provided.")
  }
}

x <- seq(1,255,3)

scale2(x, method="erfjpiiopdrfjpidr")

No appropriate method provided.

x <- seq(1,255,3)

scale2(x, method="robust")

 [1] -1.00000000 -0.97619048 -0.95238095 -0.92857143 -0.90476190 -0.88095238
 [7] -0.85714286 -0.83333333 -0.80952381 -0.78571429 -0.76190476 -0.73809524
[13] -0.71428571 -0.69047619 -0.66666667 -0.64285714 -0.61904762 -0.59523810
[19] -0.57142857 -0.54761905 -0.52380952 -0.50000000 -0.47619048 -0.45238095
[25] -0.42857143 -0.40476190 -0.38095238 -0.35714286 -0.33333333 -0.30952381
[31] -0.28571429 -0.26190476 -0.23809524 -0.21428571 -0.19047619 -0.16666667
[37] -0.14285714 -0.11904762 -0.09523810 -0.07142857 -0.04761905 -0.02380952
[43]  0.00000000  0.02380952  0.04761905  0.07142857  0.09523810  0.11904762
[49]  0.14285714  0.16666667  0.19047619  0.21428571  0.23809524  0.26190476
[55]  0.28571429  0.30952381  0.33333333  0.35714286  0.38095238  0.40476190
[61]  0.42857143  0.45238095  0.47619048  0.50000000  0.52380952  0.54761905
[67]  0.57142857  0.59523810  0.61904762  0.64285714  0.66666667  0.69047619
[73]  0.71428571  0.73809524  0.76190476  0.78571429  0.80952381  0.83333333
[79]  0.85714286  0.88095238  0.90476190  0.92857143  0.95238095  0.97619048
[85]  1.00000000
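Printing a message lets the calling code continue with no result. An alternative design, sketched below as a hypothetical variant named scale_checked() (not part of the chapter), lists the valid methods in the function signature and uses base R's match.arg() to raise an error when an unsupported method is supplied.

```r
# Hypothetical variant of scale2(); the name scale_checked is an assumption
scale_checked <- function(data, method = c("min-max", "z-score", "robust")) {
  method <- match.arg(method)   # errors if method is not one of the choices
  if (method == "min-max") {
    return((data - min(data)) / (max(data) - min(data)))
  } else if (method == "z-score") {
    return((data - mean(data)) / sd(data))
  } else {
    return((data - median(data)) / IQR(data))
  }
}

x <- seq(1, 255, 3)

# z-score rescaling should give a mean near 0 and a standard deviation of 1
z <- scale_checked(x, method = "z-score")
print(round(c(mean(z), sd(z)), 6))

# An invalid method now raises an error, caught here so the script continues
bad <- tryCatch(scale_checked(x, method = "unknown"),
                error = function(e) "error raised")
print(bad)   # "error raised"
```

When no method is supplied, match.arg() selects the first choice, so min-max remains the default.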

19.9 tidyverse Examples

The tidyverse, and specifically the dplyr and purrr packages, also provide functionality that can accomplish tasks similar to the methods discussed above. Here, we introduce a few methods; however, this is not meant to be a detailed discussion of the tidyverse functionality. For this section we use the us_county_data.csv file used in the last chapter.

library(tidyverse)

fldrPth <- "gslrData/chpt19/data/"

cntyD <- read_csv(str_glue("{fldrPth}us_county_data.csv")) |> 
  mutate_if(is.character, as.factor)

19.9.1 Example 1: `case_when()`

The case_when() function from dplyr is a vectorized version of control flow. We would like to generate a new column in the table that categorizes the counties based on their mean elevations as either “low elevation” (< 200 meters), “moderate elevation” (>= 200 & < 1400 meters), or “high elevation” (>= 1400 meters). This can be accomplished using a combination of mutate() and case_when(). Note that it is also possible to specify a default value using the .default argument, which serves the same purpose as the else statement in control flow. Just to check the results, we also count the number of counties assigned to each elevation categorization.

cntyD <- cntyD |>
  mutate(elevCat = case_when(
    dem < 200 ~ "low elevation", 
    dem >= 200 & dem < 1400 ~ "moderate elevation",
    dem >= 1400 ~ "high elevation")
    )

cntyD |> 
  group_by(elevCat) |>
  count() |>
  ungroup() |> gt::gt()

elevCat	n
high elevation	207
low elevation	1013
moderate elevation	1884
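The .default argument mentioned above can be sketched with a small made-up tibble that includes a missing elevation. This assumes dplyr 1.1.0 or later, where .default is available. Because conditions are checked in order, the second condition does not need to repeat the lower bound.

```r
library(dplyr)

# Hypothetical elevations in meters, including one missing value
elev <- tibble(dem = c(50, 800, 2100, NA))

elev <- elev |>
  mutate(elevCat = case_when(
    dem < 200 ~ "low elevation",
    dem < 1400 ~ "moderate elevation",   # only reached when dem >= 200
    dem >= 1400 ~ "high elevation",
    .default = "unknown"                 # used when no condition is TRUE
  ))

print(elev$elevCat)
```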

19.9.2 Example 2: `across()`

The across() function from dplyr is used to apply a function across multiple columns. This is a form of functional programming since across() can accept another function as an input. In our first example, we are calculating the median by sub-region for all columns whose name contains “per_”. In the second block, we convert all columns whose name contains “per_” to proportions by dividing by 100.

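cntyD |>
  group_by(SUB_REGION) |>
  # Extra arguments go inside a lambda; passing them through across() is deprecated in dplyr 1.1.0+
  summarize(across(contains("per_"), ~ median(.x, na.rm=TRUE))) |> gt::gt()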

SUB_REGION	per_desk_lap	per_smartphone	per_no_comp	per_internet	per_broadband	per_no_internet	per_for	per_dev	per_wet	per_crop	per_past_grass	per_karst
E N Cen	72.27635	77.10856	11.276944	81.62938	81.19393	15.59331	19.95011	7.949186	3.324924	43.7808852	6.714078	6.1374249
E S Cen	61.65756	74.90408	16.298296	74.94303	74.49422	21.96604	51.17073	6.474498	2.016102	2.5922327	17.122029	9.3044943
Mid Atl	75.53235	76.15280	10.735288	83.16127	82.53561	14.18328	53.15554	10.903946	4.102675	4.4547073	11.939469	1.1394162
Mtn	76.38125	78.56885	9.522525	82.48311	81.67203	14.67247	14.08876	1.318203	1.098844	2.2233125	16.128277	0.4319281
N Eng	80.62692	79.37880	8.412642	86.10956	85.37345	11.14844	62.52341	10.912455	9.756564	0.4973931	5.391075	0.0000000
Pacific	78.92488	83.07681	7.673608	86.11314	85.70953	11.43221	35.71459	5.097165	1.325634	2.2844083	16.283601	0.0000000
S Atl	69.00578	76.99552	12.972620	78.00948	77.54210	19.07997	42.77621	9.009878	5.982395	2.1454124	9.733641	0.0000000
W N Cen	72.16934	76.04420	11.809872	80.29481	79.76734	16.66382	3.27932	4.641385	1.909883	48.3399893	24.056722	1.6463494
W S Cen	64.19260	79.37310	12.865344	76.60695	76.14822	20.76308	12.78507	4.969078	1.921824	5.6029459	22.462895	0.0000000

cntyD |>
  mutate(across(contains("per_"), ~ .x/100)) |>
  select(NAME, contains("per_")) |>
  head() |> gt::gt()

NAME	per_desk_lap	per_smartphone	per_no_comp	per_internet	per_broadband	per_no_internet	per_for	per_dev	per_wet	per_crop	per_past_grass	per_karst
Autauga County	0.7413609	0.8100561	0.08571826	0.8279605	0.8270792	0.1533930	0.4971597	0.06815302	0.108697204	0.0314176012	0.2079571	0.0000000000
Baldwin County	0.7773865	0.8463598	0.08239437	0.8552358	0.8506907	0.1191952	0.3313072	0.09449728	0.306357571	0.0919548790	0.1080689	0.0000000000
Barbour County	0.5245655	0.6869770	0.20220983	0.6499678	0.6463205	0.2896374	0.5658767	0.04227699	0.088715967	0.0453634268	0.1282374	0.1491452991
Bibb County	0.5617854	0.7361896	0.16779171	0.7616752	0.7612619	0.2093952	0.7028501	0.05258896	0.065578020	0.0006282062	0.1007816	0.1105620754
Blount County	0.6686631	0.7643952	0.14963452	0.8003301	0.7962273	0.1848621	0.5495027	0.08577787	0.006341722	0.0118411362	0.3043296	0.1819798459
Bullock County	0.5523476	0.6926218	0.20326626	0.6278798	0.6062992	0.3032954	0.5464416	0.02983319	0.130312271	0.0203029122	0.1764950	0.0006184292

19.9.3 Example 3: `map()`

The map() function from purrr is used to perform an operation for each element in a vector or list. Here, we use it to read all CSV files in a vector of CSV file paths. We use the data provided in the csvSets folder. You may have to change your working directory to execute the code. map() is used to read all the files in the vector of file paths using read_csv() from readr. This results in a list object where the data from each CSV file are stored as separate tibble objects within the list. All the records are then collapsed to a single tibble using list_rbind(). The result is all 3,104 records from the 40 CSV files stored in the directory being aggregated to a single tibble.

csvPth <- "gslrData/chpt19/data/csvSets/"
csvFiles <- list.files(csvPth, pattern = "\\.csv$", full.names=TRUE) 

csvData <- csvFiles |>
  map(read_csv) |>
  list_rbind()
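If the example data are not available, the same pattern can be tried on throwaway files. The sketch below writes two small made-up CSV files to a temporary folder and combines them. It uses base R's read.csv() in place of read_csv() to limit dependencies, and it assumes purrr 1.0.0 or later for list_rbind().

```r
library(purrr)

# Write two small CSV files to a temporary folder (hypothetical data)
tmp_dir <- file.path(tempdir(), "csv_demo")
dir.create(tmp_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, value = c(10, 20)),
          file.path(tmp_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(30, 40)),
          file.path(tmp_dir, "b.csv"), row.names = FALSE)

csv_files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

combined <- csv_files |>
  map(read.csv) |>    # one data frame per file, returned as a list
  list_rbind()        # stack the list into a single data frame

print(nrow(combined))   # 4
```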

These few examples just scratch the surface of using dplyr and purrr. For further exploration, we recommend the text R for Data Science.

19.10 Concluding Remarks

Now that we have covered base R syntax, the tidyverse, and core coding concepts relating to creating your own functions and implementing loops and control flow, we will move on to discuss data visualization using ggplot2, which is the focus of the next two chapters. Following that, we will discuss designing tables using gt.

19.11 Questions

Explain the difference between local and global scope for a variable.
Explain the difference between a for loop and a while loop as implemented in R.
Describe an example of a while loop that would result in an infinite loop.
Why is it not necessary to include a condition with else within control flow?
Explain the difference between next and break.
Explain the concept of functional programming.
What is the purpose of the across() function?
What is the purpose of the map() function?

19.12 Exercises

Overall Accuracy Function

Task 1

Create a function that will calculate overall accuracy for a classification when given the correct class and the predicted class. A dataset has been provided (classification_data.csv) in the exercise folder for the chapter, which contains three columns: “class”, “spec”, and “spec_lidar”. The “class” column contains the correct classification (what the sample actually was) while the “spec” and “spec_lidar” columns contain the predicted classification (what an algorithm predicted the class to be). Specifically, the “spec” column is a result obtained using just spectral image bands while the “spec_lidar” column is a result obtained using a combination of spectral bands and light detection and ranging (lidar) data.

Create a function that will generate a confusion matrix from the correct and predicted data, calculate overall accuracy from the table, then return the overall accuracy result. Note that the table() function can be used to create a contingency table or confusion matrix. The diag() function can be used to extract values in the diagonal cells that represent the correct predictions. You will also need to use sum().

Use your new function to calculate the overall accuracy for the spectral only and spectral + lidar results. Which model yielded the highest overall accuracy?

Task 2

Combine a for loop and if else statement to write all rows or samples in the classification_data.csv file that were correctly predicted using the spectral and lidar (“spec_lidar”) data to a new data frame and all incorrectly classified rows to a different data frame.

Picture Editing Function/Loop

Task 1

You have been provided with a folder of pictures (pictures) in the exercise folder for the chapter. The photos are from the southwestern United States. This assignment asks you to write a function to perform some editing tasks on these images. Hint: the imager package can be used to process images in R. Create a function with the following characteristics.

Function accepts the following parameters: an input image, a Boolean variable indicating whether or not to crop the image, lower bounds representing the percent of the image to crop, upper bounds representing the percent of the image to crop, a Boolean variable indicating whether or not to resize the image, a resizing factor, and a Boolean variable indicating whether or not to convert the image to grayscale.
Function is able to crop an image by a random percentage in both the x and y directions within the specified upper and lower bounds. For example, if the lower bound is set to 20% and the upper bound is set to 40%, a random percentage in this range will be selected then used to crop the image. Different random values between the lower and upper bounds should be able to be applied for the x- and y-direction crops.
Function is able to resize the image by the specified factor. For example, if the factor is 1, this means that the image is not resized. If the factor is 0.5, the resolution is decreased by half. The image should maintain its original aspect ratio.
Function should be able to convert the image to grayscale if the user specifies this.
Function should return the image object.

Task 2

Use the function within a for loop to process all of the images in the folder. Each iteration of the loop should process one image. Save the results to a new folder on disk. Plot one original and processed image pair.
