19 Base R (Part II)
19.1 Topics Covered
- Define and use your own functions
- Implement while and for loops
- Implement control flow
- Vectorized options with dplyr and purrr
19.2 Introduction
Now that we have covered base R and tidyverse syntax, we will introduce some other core coding concepts and their implementations in R. Specifically, we will explore:
- Functions: allow you to build your own tools or snippets of code that can be reused
- Loops: allow for iteration over data or elements, such as rows in a data frame/tibble or items in a vector
- Control flow: use conditions to determine if or what code is executed
- Some alternatives using the tidyverse
Although these topics are discussed in the context of R, they are core to programming and scripting in general.
19.3 Functions
A function is a set of code that can be reused. If you find yourself using the same lines of code repeatedly, you may want to convert them into a function. This can make your code both more concise and easier to edit. For example, if you want to update the code, you can simply update the function as opposed to updating every instance of repeating lines.
In our first example below, we are defining a function to rescale a vector. In R, function() is used to define a new function. Our function is given the name scale2, and it has two parameters: data and scale. The values or inputs assigned to the parameters are called arguments. It is possible to define default arguments for parameters. In our example, the default argument for scale is 1. What the function actually does is defined inside of {}. This function does the following:
- Calculates the largest value in the vector using max()
- Calculates the smallest value in the vector using min()
- Subtracts the smallest value from the current value to obtain the numerator (n)
- Calculates the range for the denominator (d)
- Divides n by d to rescale the data from zero to one
- Further rescales the data by multiplying by the scale argument
This function specifically implements min-max rescaling to a range of zero to one, then multiplies by the scale argument to rescale to a new range.
This is a good time to discuss the concept of scope in code. All of the variables defined inside of the function have local scope: they can only be used or called inside of the function. This is in contrast to variables that are created or declared outside of a function, which have global scope and can be called or used anywhere in the code.
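As a small illustration of local versus global scope, consider the sketch below. The function add_shift() and the variable shift_amount are hypothetical names used only for this example.

```r
# Hypothetical function used only to illustrate scope
add_shift <- function(values){
  shift_amount <- 10        # local: exists only while the function runs
  values + shift_amount
}

result <- add_shift(c(1, 2, 3))   # result is created outside the function (global)
print(result)

# shift_amount was never created in the global environment
print(exists("shift_amount"))
```

The first print statement shows the shifted values, while the second reports FALSE because shift_amount is not available outside of the function.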
The return() function defines what the function returns or produces. In this example, the rescaled data are returned as a vector. Note that functions can return a variety of different types of objects including vectors, matrices, arrays, data frames, and lists.
Executing the code below will instantiate the scale2() function. Once instantiated, it can then be used. In the following blocks of code, we generate a vector of values then rescale the values to ranges of 0 to 100 and 0 to 1. When rescaling to a range of 0 to 1, we do not need to provide an argument for the scale parameter since 1 is the default.
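A minimal sketch of scale2() that is consistent with the steps described above. Only data, scale, n, and d are named in the text; the other internal names (max1, min1, and s) are assumptions.

```r
# Sketch of the min-max rescaling function described above
scale2 <- function(data, scale = 1){
  max1 <- max(data)        # largest value (assumed name)
  min1 <- min(data)        # smallest value (assumed name)
  n <- data - min1         # numerator
  d <- max1 - min1         # denominator (range)
  s <- (n / d) * scale     # rescale to 0-1, then multiply by scale
  return(s)
}
```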
x <- c(1, 14, 21, 16, 18, 16, 19, 20, 6, 8, 9, 11, 17)
x100 <- scale2(x, 100)
x1 <- scale2(x)
print(x)
[1] 1 14 21 16 18 16 19 20 6 8 9 11 17
print(x100)
[1] 0 65 100 75 85 75 90 95 25 35 40 50 80
print(x1)
[1] 0.00 0.65 1.00 0.75 0.85 0.75 0.90 0.95 0.25 0.35 0.40 0.50 0.80
Next, we define a more complicated function that calculates root mean square error (RMSE) to assess the precision of a georeferencing process. The function is named rmse_georef() and has four parameters:
- x_c: correct x coordinates
- y_c: correct y coordinates
- x_p: predicted x coordinates
- y_p: predicted y coordinates
Each argument should be a vector of coordinates, all of the same length. The code inside the function calculates RMSEx, RMSEy, and RMSETotal. The output is a list object that contains vectors of residuals in the x and y directions and the three RMSE metrics. We also rename the objects within the list to make them more interpretable.
Once the function is initialized, it can be used by providing arguments for all four parameters. The objects contained in the resulting list can be accessed using $ and the object name.
rmse_georef <- function(x_c, y_c, x_p, y_p){
  # Residuals in the x and y directions
  x_residual <- x_c - x_p
  y_residual <- y_c - y_p
  x_residual_sq <- x_residual^2
  y_residual_sq <- y_residual^2
  # RMSE in each direction and combined
  rmse_x <- sqrt(sum(x_residual_sq)/length(x_residual))
  rmse_y <- sqrt(sum(y_residual_sq)/length(y_residual))
  rmse_total <- sqrt(rmse_x^2 + rmse_y^2)
  rmse_list <- list(x_residual,
                    y_residual,
                    rmse_x,
                    rmse_y,
                    rmse_total)
  names(rmse_list) <- c("X.Residuals",
                        "Y.Residuals",
                        "RMSE.X",
                        "RMSE.Y",
                        "RMSE.Total")
  return(rmse_list)
}
x_actual <- c(584026.624, 583179.7805, 589507.5837, 579463.0782, 585908.4986, 588190.2715)
y_actual <- c(4474131.442, 4479283.074, 4476648.449, 4478436.23, 4470697.021, 4480318.105)
x_predicted <- c(584041.7902, 583211.7964, 589496.2211, 579447.4653, 585909.7985, 588206.0155)
y_predicted <- c(4474159.608, 4479295.524, 4476664.073, 4478462.252, 4470719.12, 4480344.345)
example_rmse <- rmse_georef(x_c = x_actual,
                            y_c = y_actual,
                            x_p = x_predicted,
                            y_p = y_predicted)
print(example_rmse$RMSE.Total)
[1] 28.64713
print(example_rmse$RMSE.X)
[1] 17.68929
print(example_rmse$RMSE.Y)
[1] 22.53325
19.4 Loops
Loops allow for iterating over elements to perform an operation multiple times. We focus on two types of loops in R:
- While loop: continues to iterate until a condition evaluates to FALSE
- For loop: iterates over all elements in an iterable object; these types of objects are capable of returning their contents separately, such as every item in a vector or every row in a data frame/tibble
19.4.1 while Loops
While loops are executed as long as a condition evaluates to TRUE. Once the condition evaluates to FALSE, the loop will stop. In the example below, the while loop is initiated with while(), and the code within {} is what is executed if the condition evaluates to TRUE. The condition in this case is x1 > 90; as a result, the code executes as long as the variable x1 is greater than 90. x1 initially holds a value of 100. At the end of each iteration, 1 is subtracted from x1 (100, 99, 98, 97, …). When x1 reaches a value of 90, the condition evaluates to FALSE, and the loop stops.
If the condition never evaluates to FALSE, the loop will continue indefinitely (or until it is manually stopped or the system runs out of memory or experiences some other error). This is called an infinite loop.
x1 <- 100
while(x1 > 90) {
  print(x1)
  x1 <- x1 - 1
}
[1] 100
[1] 99
[1] 98
[1] 97
[1] 96
[1] 95
[1] 94
[1] 93
[1] 92
[1] 91
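One defensive pattern for guarding against an accidental infinite loop is to cap the number of iterations. This is only a sketch and is not required for the example above; max_iter and n_iter are hypothetical names introduced for illustration.

```r
x1 <- 100
max_iter <- 1000   # hypothetical safety cap on the number of iterations
n_iter <- 0

# The loop stops when either the condition fails or the cap is reached
while(x1 > 90 && n_iter < max_iter){
  x1 <- x1 - 1
  n_iter <- n_iter + 1
}

print(x1)
print(n_iter)
```

Here the loop ends because x1 reaches 90 after 10 iterations, well before the cap is reached.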
19.4.2 for Loops
In R, we generally find that we use for loops more often than while loops. The syntax is very similar. A for loop is defined using for(), and the code within {} is executed for each iteration of the loop. In the first example, we are iterating over the vector x of country names. The variable i represents the value being used for the current iteration of the loop. The print statement is generated for each country in the vector until all countries have been processed.
x <- c("Austria", "New Zealand", "Norway", "Canada", "Cuba")
for(i in x){
print(paste("I would like to go to ", i, ".", sep=""))
}
[1] "I would like to go to Austria."
[1] "I would like to go to New Zealand."
[1] "I would like to go to Norway."
[1] "I would like to go to Canada."
[1] "I would like to go to Cuba."
There is nothing special about the variable name i. Below, we have replicated the same code but using country as opposed to i. The important point is that the variable name defined inside of for() that represents the item for the current iteration must also be used inside of the code executed within the loop.
x <- c("Austria", "New Zealand", "Norway", "Canada", "Cuba")
for(country in x){
print(paste("I would like to go to ", country, ".", sep=""))
}
[1] "I would like to go to Austria."
[1] "I would like to go to New Zealand."
[1] "I would like to go to Norway."
[1] "I would like to go to Canada."
[1] "I would like to go to Cuba."
For loops can be useful for executing a set of processes for a set of files. In the example below, we are listing all raster grid file names with the “.tif” extension within a directory into a vector. We then loop over this vector of file names to perform the following operations. (You cannot run this code since we did not provide the example raster data.)
- Read in the raster grid using the terra package
- Identify all cells in the grid with a value greater than 500 to generate a binary raster output
- Save the result to disk with the same name but in a new directory
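library(terra)
# List the file names (not full paths) of all GeoTIFFs in the input directory
raster_list <- list.files(path = "C:/Teaching/elev_data", pattern = "\\.tif$")
new_dir <- "C:/Teaching/elev_data_out/"
for(ras in raster_list){
  # Read the raster grid
  r1 <- rast(paste0("C:/Teaching/elev_data/", ras))
  # Binary raster: TRUE where the cell value is greater than 500
  r2 <- r1 > 500
  # terra's writeRaster() uses filetype (format is the older raster package argument)
  writeRaster(r2, filename = paste0(new_dir, ras), filetype = "GTiff", overwrite = TRUE)
}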
It is important to understand that there are different means to iterate over an object. In our countries example, which is again included below, we are iterating over each element in the vector. Another option, which is demonstrated in the next code block, is to iterate over the indices for each element in the vector (1:length(x)). The result is the same, but the syntax has to be adjusted (i.e., "I would like to go to ", country, ".", sep="" vs. "I would like to go to ", x[i], ".", sep="").
x <- c("Austria", "New Zealand", "Norway", "Canada", "Cuba")
for(country in x){
print(paste("I would like to go to ", country, ".", sep=""))
}
[1] "I would like to go to Austria."
[1] "I would like to go to New Zealand."
[1] "I would like to go to Norway."
[1] "I would like to go to Canada."
[1] "I would like to go to Cuba."
x <- c("Austria", "New Zealand", "Norway", "Canada", "Cuba")
for(i in 1:length(x)){
print(paste("I would like to go to ", x[i], ".", sep=""))
}
[1] "I would like to go to Austria."
[1] "I would like to go to New Zealand."
[1] "I would like to go to Norway."
[1] "I would like to go to Canada."
[1] "I would like to go to Cuba."
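One caveat with 1:length(x) is that it misbehaves when the vector is empty, since 1:0 yields the values 1 and 0. A commonly used alternative in base R is seq_along(). The sketch below compares the two; the object names are hypothetical.

```r
empty <- character(0)   # hypothetical empty vector

# 1:length(empty) is 1:0, so the loop body runs twice
count_colon <- 0
for(i in 1:length(empty)){
  count_colon <- count_colon + 1
}

# seq_along(empty) has length zero, so the loop body never runs
count_seq <- 0
for(i in seq_along(empty)){
  count_seq <- count_seq + 1
}

print(count_colon)
print(count_seq)
```

For non-empty vectors, seq_along(x) and 1:length(x) generate the same indices.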
As another example, below we demonstrate how to loop over rows in a data frame. Specifically, for(i in 1:nrow(course_data)) indicates that we loop over the row indices. We then grab data for the row of interest using course_data[i, "COLUMN NAME"].
course_prefix <- c("Geog", "Geog", "Geol", "Geol", "Geog")
course_num <- c(107, 350, 101, 104, 455)
course_name <- c("Physical Geography", "GIScience", "Planet Earth", "Earth Through Time", "Remote Sensing")
enrollment <- c(210, 45, 235, 80, 35)
course_data <- data.frame(course_prefix, course_num, course_name, enrollment)
course_data |> gt::gt()
course_prefix | course_num | course_name | enrollment |
---|---|---|---|
Geog | 107 | Physical Geography | 210 |
Geog | 350 | GIScience | 45 |
Geol | 101 | Planet Earth | 235 |
Geol | 104 | Earth Through Time | 80 |
Geog | 455 | Remote Sensing | 35 |
for(i in 1:nrow(course_data)){
  print(paste0(course_data[i, "course_prefix"],
               " ",
               as.character(course_data[i, "course_num"]),
               ": ",
               course_data[i, "course_name"]))
}
[1] "Geog 107: Physical Geography"
[1] "Geog 350: GIScience"
[1] "Geol 101: Planet Earth"
[1] "Geol 104: Earth Through Time"
[1] "Geog 455: Remote Sensing"
It is also possible to loop over column indices. In the example below, we use this method to print the data type for each column in the data frame. In the following code block, we obtain the same result by looping over the column names as opposed to their associated indices.
for(i in 1:ncol(course_data)){
  print(paste0("The data type of ",
               "'",
               names(course_data)[i],
               "'",
               " is ", typeof(course_data[1,i]),
               "."))
}
[1] "The data type of 'course_prefix' is character."
[1] "The data type of 'course_num' is double."
[1] "The data type of 'course_name' is character."
[1] "The data type of 'enrollment' is double."
cNames <- names(course_data)
for(i in cNames){
  print(paste0("The data type of ",
               "'",
               i,
               "'",
               " is ", typeof(course_data[1,i]),
               "."))
}
[1] "The data type of 'course_prefix' is character."
[1] "The data type of 'course_num' is double."
[1] "The data type of 'course_name' is character."
[1] "The data type of 'enrollment' is double."
19.5 Control Flow
Control flow allows you to not execute code or execute different code depending on a condition.
19.5.1 if
When code is wrapped inside of if(){}, it will only be executed if the associated condition evaluates to TRUE. In the first example, the variable a holds a value of 4, and the condition defined within if() is a <= 6. Since 4 is less than or equal to 6, the condition evaluates to TRUE, and the statement is printed. If the value associated with a were larger than 6, the code would not execute. When using if(), you must include a condition that evaluates to TRUE or FALSE.
a <- 4
if(a <= 6){
print("Value less than or equal to 6.")
}
[1] "Value less than or equal to 6."
19.5.2 else
else allows you to include code to execute if the condition associated with if() does not evaluate to TRUE. You can think of this as defining the default code to execute or the default behavior. Below, the condition evaluates to FALSE since the value associated with a is not less than or equal to 6. The code associated with else is executed as opposed to the code associated with if(). else does not require a condition since it is the default behavior.
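A minimal sketch of this behavior is shown here. The value assigned to a (8) and the wording of the second message are assumptions chosen for illustration.

```r
a <- 8   # assumed value; any value greater than 6 triggers the else branch

if(a <= 6){
  msg <- "Value less than or equal to 6."
}else{
  # Default behavior: runs whenever the if() condition is FALSE
  msg <- "Value greater than 6."
}
print(msg)
```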
19.5.3 else if
You can test multiple conditions using if() and else if(). The first condition must use if() while all subsequent conditions must use else if(). Since 8 is between 6 and 10, the code associated with else if() is executed.
If your conditions are not mutually exclusive, the code associated with the first condition that evaluates to TRUE is executed. So, the order matters. If none of the conditions evaluate to TRUE, then the code associated with else is executed.
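The following sketch illustrates this chain of conditions with a set to 8, as in the description above. The message text is an assumption modeled on the messages used elsewhere in this chapter.

```r
a <- 8

if(a <= 6){
  msg <- "Value less than or equal to 6."
}else if(a > 6 & a < 10){
  # First condition that evaluates to TRUE for a = 8
  msg <- "Value is between 6 and 10."
}else{
  msg <- "Value is greater than or equal to 10."
}
print(msg)
```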
19.6 Combining For Loops and Control Flow
It is possible to combine loops and control flow to have different code execute for each iteration of the loop depending on the item used in the current iteration. As the first example demonstrates, you can generate different print statements for each iteration depending on the defined conditions.
b <- c(1, 3, 5, 7, 9, 11)
for(num in b){
  if(num <= 6){
    print("Value less than or equal to 6.")
  }else if(num > 6 & num < 10){
    print("Value is between 6 and 10.")
  }else{
    print("Value is greater than or equal to 10.")
  }
}
[1] "Value less than or equal to 6."
[1] "Value less than or equal to 6."
[1] "Value less than or equal to 6."
[1] "Value is between 6 and 10."
[1] "Value is between 6 and 10."
[1] "Value is greater than or equal to 10."
Instead of printing the results, you may want to save them to an object, such as a vector or a data frame. Below, we are defining an empty vector using c(). We then insert the generated character strings into this vector with each iteration of the loop. In the second example, we write the results to columns in a new data frame.
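b <- c(1, 3, 5, 7, 9, 11)
# Empty vector to hold the results (named results to avoid masking the c() function)
results <- c()
for(num in b){
  if(num <= 6){
    results <- c(results, paste(num, "is less than or equal to 6.", sep=" "))
  }else if(num > 6 & num < 10){
    results <- c(results, paste(num, "is between 6 and 10.", sep=" "))
  }else{
    results <- c(results, paste(num, "is greater than or equal to 10.", sep=" "))
  }
}
print(results)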
[1] "1 is less than or equal to 6." "3 is less than or equal to 6."
[3] "5 is less than or equal to 6." "7 is between 6 and 10."
[5] "9 is between 6 and 10." "11 is greater than or equal to 10."
course_prefix <- c("Geog", "Geog", "Geol", "Geol", "Geog")
course_num <- c(107, 350, 101, 104, 455)
course_name <- c("Physical Geography", "GIScience", "Planet Earth", "Earth Through Time", "Remote Sensing")
enrollment <- c(210, 45, 235, 80, 35)
course_data <- data.frame(course_prefix, course_num, course_name, enrollment)
course_data |> gt::gt()
course_prefix | course_num | course_name | enrollment |
---|---|---|---|
Geog | 107 | Physical Geography | 210 |
Geog | 350 | GIScience | 45 |
Geol | 101 | Planet Earth | 235 |
Geol | 104 | Earth Through Time | 80 |
Geog | 455 | Remote Sensing | 35 |
course_data2 <- data.frame(course = character(),
                           enrollment = numeric())
for(i in 1:nrow(course_data)){
  nameOut <- paste0(course_data[i, "course_prefix"],
                    " ",
                    as.character(course_data[i, "course_num"]),
                    ": ",
                    course_data[i, "course_name"])
  df2 <- data.frame(course = nameOut, enrollment = course_data[i, "enrollment"])
  course_data2 <- rbind(course_data2, df2)
}
print(course_data2)
course enrollment
1 Geog 107: Physical Geography 210
2 Geog 350: GIScience 45
3 Geol 101: Planet Earth 235
4 Geol 104: Earth Through Time 80
5 Geog 455: Remote Sensing 35
For loops tend to be slower than vectorized operations in R. For example, if you want to take the square root of each value in a vector you could iterate over each element in the vector with a for loop and write the results to a new vector. Alternatively, you could simply do xSq = sqrt(x). This is a vectorized version where R performs the calculation for each element in the vector without you needing to do so using a for loop. In short, for loops are not always the best solution.
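The two approaches can be compared side by side, as sketched below. The vector values and object names are hypothetical.

```r
x <- c(1, 4, 9, 16, 25)

# Loop version: pre-allocate the output, then fill one element at a time
x_loop <- numeric(length(x))
for(i in seq_along(x)){
  x_loop[i] <- sqrt(x[i])
}

# Vectorized version: a single call operates on every element
x_vec <- sqrt(x)

print(all.equal(x_loop, x_vec))
```

Both versions yield the same values; the vectorized call is shorter and avoids the overhead of iterating in R code.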
19.6.1 next and break
There are means to further control the execution of loops with associated control flow. next is used to skip the current iteration of the loop and move on to the next iteration if a condition evaluates to TRUE, while break allows us to fully stop the execution of the loop when a condition evaluates to TRUE. In our example, if the current value is odd (i.e., results in a remainder of 1 when divided by 2), that iteration is skipped. If the current value is greater than 15, then the loop is stopped. This results in only even values up to and including 14 being written to the new vector.
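A minimal sketch of the behavior just described, assuming the loop runs over the integers 1 through 20. The input range and the object names are assumptions.

```r
values <- 1:20     # assumed input
evens <- c()       # vector to collect the retained values

for(v in values){
  if(v > 15){
    break          # stop the loop entirely once the value exceeds 15
  }
  if(v %% 2 == 1){
    next           # skip odd values and move to the next iteration
  }
  evens <- c(evens, v)
}
print(evens)
```

The resulting vector holds 2, 4, 6, 8, 10, 12, and 14.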
19.7 which
which() is used to return the indices of the items in a vector that meet a certain condition. In our example, only indices associated with even values are returned. which() can also be used to query data, as is also demonstrated. Again, which() returns the indices associated with the items, not the items themselves. Printing the result of which() shows the indices, while using it in a query returns only the items at the indices that meet the criteria.
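The distinction between indices and items can be sketched as follows, using a hypothetical vector of values.

```r
vals <- c(3, 8, 5, 12, 7, 10)   # hypothetical values

# Indices of the even values
even_idx <- which(vals %% 2 == 0)
print(even_idx)

# Using the indices in a query returns the items themselves
even_vals <- vals[even_idx]
print(even_vals)
```

Here, which() returns the positions 2, 4, and 6, and the query returns the values 8, 12, and 10.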
19.8 Final Base R Example
To tie these concepts together, we have built another function that incorporates control flow. This function allows for rescaling data using three different methods: min-max, z-score, and robust, which makes use of the median and interquartile range (IQR). The user must provide a vector to rescale along with the desired method. If an invalid method is defined, then a message is printed, as defined using message() within the default else statement. The second code block demonstrates the result if an invalid method name is provided while the last code block demonstrates the result when using the robust method.
scale2 <- function(data, method="min-max"){
  if(method=="min-max"){
    max1 <- max(data)
    min1 <- min(data)
    n <- data-min1
    d <- max1-min1
    s <- n/d
    return(s)
  }else if(method=="z-score"){
    sd1 <- sd(data)
    mn1 <- mean(data)
    s <- (data-mn1)/(sd1)
    return(s)
  }else if(method=="robust"){
    mdn1 <- median(data)
    iqr1 <- IQR(data)
    s <- (data - mdn1)/(iqr1)
    return(s)
  }else{
    message("No appropriate method provided.")
  }
}
x <- seq(1,255,3)
scale2(x, method="erfjpiiopdrfjpidr")
No appropriate method provided.
x <- seq(1,255,3)
scale2(x, method="robust")
[1] -1.00000000 -0.97619048 -0.95238095 -0.92857143 -0.90476190 -0.88095238
[7] -0.85714286 -0.83333333 -0.80952381 -0.78571429 -0.76190476 -0.73809524
[13] -0.71428571 -0.69047619 -0.66666667 -0.64285714 -0.61904762 -0.59523810
[19] -0.57142857 -0.54761905 -0.52380952 -0.50000000 -0.47619048 -0.45238095
[25] -0.42857143 -0.40476190 -0.38095238 -0.35714286 -0.33333333 -0.30952381
[31] -0.28571429 -0.26190476 -0.23809524 -0.21428571 -0.19047619 -0.16666667
[37] -0.14285714 -0.11904762 -0.09523810 -0.07142857 -0.04761905 -0.02380952
[43] 0.00000000 0.02380952 0.04761905 0.07142857 0.09523810 0.11904762
[49] 0.14285714 0.16666667 0.19047619 0.21428571 0.23809524 0.26190476
[55] 0.28571429 0.30952381 0.33333333 0.35714286 0.38095238 0.40476190
[61] 0.42857143 0.45238095 0.47619048 0.50000000 0.52380952 0.54761905
[67] 0.57142857 0.59523810 0.61904762 0.64285714 0.66666667 0.69047619
[73] 0.71428571 0.73809524 0.76190476 0.78571429 0.80952381 0.83333333
[79] 0.85714286 0.88095238 0.90476190 0.92857143 0.95238095 0.97619048
[85] 1.00000000
19.9 tidyverse Examples
The tidyverse, and specifically the dplyr and purrr packages, also provide functionality that can accomplish tasks similar to the methods discussed above. Here, we introduce a few methods; however, this is not meant to be a detailed discussion of the tidyverse functionality. For this section we use the us_county_data.csv file used in the last chapter.
19.9.1 Example 1: case_when()
The case_when() function from dplyr is a vectorized version of control flow. We would like to generate a new column in the table that categorizes the counties based on their mean elevations as either “low elevation” (< 200 meters), “moderate elevation” (>= 200 & < 1400 meters), or “high elevation” (>= 1400 meters). This can be accomplished using a combination of mutate() and case_when(). Note that it is also possible to specify a default value using the .default argument, which serves the same purpose as the else statement in control flow. Just to check the results, we also count the number of counties assigned to each elevation categorization.
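A minimal sketch of this pattern on a small, made-up table. The tibble elev_demo and its mean_elev column are hypothetical stand-ins for the county data, and the .default argument assumes dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical stand-in for the county table
elev_demo <- tibble(NAME = c("A", "B", "C", "D"),
                    mean_elev = c(50, 200, 1399, 1400))

elev_demo <- elev_demo |>
  mutate(elev_class = case_when(
    mean_elev < 200 ~ "low elevation",
    mean_elev >= 200 & mean_elev < 1400 ~ "moderate elevation",
    mean_elev >= 1400 ~ "high elevation",
    .default = NA_character_   # used when no condition is met
  ))

# Count the number of rows assigned to each class
elev_counts <- elev_demo |> count(elev_class)
print(elev_counts)
```

Conditions are evaluated in order for each row, and the first one that evaluates to TRUE determines the assigned value.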
19.9.2 Example 2: across()
The across() function from dplyr is used to apply a function across multiple columns. This is a form of functional programming since across() can accept another function as an input. In our first example, we are calculating the median by sub-region for all columns whose name contains “per_”. In the second block, we convert all columns whose name contains “per_” to proportions by dividing by 100.
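cntyD |>
  group_by(SUB_REGION) |>
  # Pass na.rm inside an anonymous function; supplying extra arguments through across() is deprecated as of dplyr 1.1.0
  summarize(across(contains("per_"), \(x) median(x, na.rm = TRUE))) |> gt::gt()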
SUB_REGION | per_desk_lap | per_smartphone | per_no_comp | per_internet | per_broadband | per_no_internet | per_for | per_dev | per_wet | per_crop | per_past_grass | per_karst |
---|---|---|---|---|---|---|---|---|---|---|---|---|
E N Cen | 72.27635 | 77.10856 | 11.276944 | 81.62938 | 81.19393 | 15.59331 | 19.95011 | 7.949186 | 3.324924 | 43.7808852 | 6.714078 | 6.1374249 |
E S Cen | 61.65756 | 74.90408 | 16.298296 | 74.94303 | 74.49422 | 21.96604 | 51.17073 | 6.474498 | 2.016102 | 2.5922327 | 17.122029 | 9.3044943 |
Mid Atl | 75.53235 | 76.15280 | 10.735288 | 83.16127 | 82.53561 | 14.18328 | 53.15554 | 10.903946 | 4.102675 | 4.4547073 | 11.939469 | 1.1394162 |
Mtn | 76.38125 | 78.56885 | 9.522525 | 82.48311 | 81.67203 | 14.67247 | 14.08876 | 1.318203 | 1.098844 | 2.2233125 | 16.128277 | 0.4319281 |
N Eng | 80.62692 | 79.37880 | 8.412642 | 86.10956 | 85.37345 | 11.14844 | 62.52341 | 10.912455 | 9.756564 | 0.4973931 | 5.391075 | 0.0000000 |
Pacific | 78.92488 | 83.07681 | 7.673608 | 86.11314 | 85.70953 | 11.43221 | 35.71459 | 5.097165 | 1.325634 | 2.2844083 | 16.283601 | 0.0000000 |
S Atl | 69.00578 | 76.99552 | 12.972620 | 78.00948 | 77.54210 | 19.07997 | 42.77621 | 9.009878 | 5.982395 | 2.1454124 | 9.733641 | 0.0000000 |
W N Cen | 72.16934 | 76.04420 | 11.809872 | 80.29481 | 79.76734 | 16.66382 | 3.27932 | 4.641385 | 1.909883 | 48.3399893 | 24.056722 | 1.6463494 |
W S Cen | 64.19260 | 79.37310 | 12.865344 | 76.60695 | 76.14822 | 20.76308 | 12.78507 | 4.969078 | 1.921824 | 5.6029459 | 22.462895 | 0.0000000 |
cntyD |>
mutate(across(contains("per_"), ~ .x/100)) |>
select(NAME, contains("per_")) |>
head() |> gt::gt()
NAME | per_desk_lap | per_smartphone | per_no_comp | per_internet | per_broadband | per_no_internet | per_for | per_dev | per_wet | per_crop | per_past_grass | per_karst |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Autauga County | 0.7413609 | 0.8100561 | 0.08571826 | 0.8279605 | 0.8270792 | 0.1533930 | 0.4971597 | 0.06815302 | 0.108697204 | 0.0314176012 | 0.2079571 | 0.0000000000 |
Baldwin County | 0.7773865 | 0.8463598 | 0.08239437 | 0.8552358 | 0.8506907 | 0.1191952 | 0.3313072 | 0.09449728 | 0.306357571 | 0.0919548790 | 0.1080689 | 0.0000000000 |
Barbour County | 0.5245655 | 0.6869770 | 0.20220983 | 0.6499678 | 0.6463205 | 0.2896374 | 0.5658767 | 0.04227699 | 0.088715967 | 0.0453634268 | 0.1282374 | 0.1491452991 |
Bibb County | 0.5617854 | 0.7361896 | 0.16779171 | 0.7616752 | 0.7612619 | 0.2093952 | 0.7028501 | 0.05258896 | 0.065578020 | 0.0006282062 | 0.1007816 | 0.1105620754 |
Blount County | 0.6686631 | 0.7643952 | 0.14963452 | 0.8003301 | 0.7962273 | 0.1848621 | 0.5495027 | 0.08577787 | 0.006341722 | 0.0118411362 | 0.3043296 | 0.1819798459 |
Bullock County | 0.5523476 | 0.6926218 | 0.20326626 | 0.6278798 | 0.6062992 | 0.3032954 | 0.5464416 | 0.02983319 | 0.130312271 | 0.0203029122 | 0.1764950 | 0.0006184292 |
19.9.3 Example 3: map()
The map() function from purrr is used to perform an operation for each element in a vector. Here, we use it to read all CSV files in a vector of CSV file paths. We use the data provided in the csvSet folder. You may have to change your working directory to execute the code. map() is used to read all the files in the list of file paths using read_csv() from readr. This results in a list object where the data from each CSV file are stored as separate tibble objects within the list. All the records are then collapsed to a single tibble using list_rbind(). The result is all 3,104 records from the 40 CSV files stored in the directory being aggregated to a single tibble.
library(purrr)   # map() and list_rbind()
library(readr)   # read_csv()
csvPth <- "gslrData/chpt19/data/csvSets/"
csvFiles <- list.files(csvPth, pattern = "\\.csv$", full.names=TRUE)
csvData <- csvFiles |>
  map(read_csv) |>
  list_rbind()
These few examples just scratch the surface of using dplyr and purrr. For further exploration, we recommend the text R for Data Science.
19.10 Concluding Remarks
Now that we have covered base R syntax, the tidyverse, and core coding concepts relating to creating your own functions and implementing loops and control flow, we will move on to discuss data visualization using ggplot2, which is the focus of the next two chapters. Following that, we will discuss designing tables using gt.
19.11 Questions
- Explain the difference between local and global scope for a variable.
- Explain the difference between a for loop and a while loop as implemented in R.
- Describe an example of a while loop that would result in an infinite loop.
- Why is it not necessary to include a condition with else within control flow?
- Explain the difference between next and break.
- Explain the concept of functional programming.
- What is the purpose of the across() function?
- What is the purpose of the map() function?
19.12 Exercises
Overall Accuracy Function
Task 1
Create a function that will calculate overall accuracy for a classification when given the correct class and the predicted class. A dataset has been provided (classification_data.csv) in the exercise folder for the chapter, which contains three columns: “class”, “spec”, and “spec_lidar”. The “class” column contains the correct classification (what the sample actually was) while the “spec” and “spec_lidar” columns contain the predicted classification (what an algorithm predicted the class to be). Specifically, the “spec” column is a result obtained using just spectral image bands while the “spec_lidar” column is a result obtained using a combination of spectral bands and light detection and ranging (lidar) data.
Create a function that will generate a confusion matrix from the correct and predicted data, calculate overall accuracy from the table, then return the overall accuracy result. Note that the table() function can be used to create a contingency table or confusion matrix. The diag() function can be used to extract values in the diagonal cells that represent the correct predictions. You will also need to use sum().
Use your new function to calculate the overall accuracy for the spectral only and spectral + lidar results. Which model yielded the highest overall accuracy?
Task 2
Combine a for loop and if else statement to write all rows or samples in the classification_data.csv file that were correctly predicted using the spectral and lidar (“spec_lidar”) data to a new data frame and all incorrectly classified rows to a different data frame.
Picture Editing Function/Loop
Task 1
You have been provided with a folder of pictures (pictures) in the exercise folder for the chapter. The photos are from the southwestern United States. This assignment asks you to write a function to perform some editing tasks on these images. Hint: the imager package can be used to process images in R. Create a function with the following characteristics.
- Function accepts the following parameters: an input image, a Boolean variable indicating whether or not to crop the image, lower bounds representing the percent of the image to crop, upper bounds representing the percent of the image to crop, a Boolean variable indicating whether or not to resize the image, a resizing factor, and a Boolean variable indicating whether or not to convert the image to grayscale.
- Function is able to crop an image by a random percentage in both the x and y directions within the specified upper and lower bounds. For example, if the lower bound is set to 20% and the upper bound is set to 40%, a random percentage in this range will be selected then used to crop the image. It should be possible to apply different random values between the lower and upper bounds to the x- and y-direction crops.
- Function is able to resize the image by the specified factor. For example, if the factor is 1, this means that the image is not resized. If the factor is 0.5, the resolution is decreased by half. The image should maintain its original aspect ratio.
- Function should be able to convert the image to grayscale if the user specifies this.
- Function should return the image object.
Task 2
Use the function within a for loop to process all of the images in the folder. Each iteration of the loop should process one image. Save the results to a new folder on disk. Plot one original and processed image pair.