R Language Part 2

Objectives

  1. Further explore the R language
  2. Create your own functions
  3. Use while and for loops
  4. Use if…else, next and break, and which

Overview

In this second section on the R language, we will explore more advanced scripting techniques including functions, loops, and if…else statements. Most programming and scripting languages provide these types of capabilities, and R is no exception. Such techniques can be very useful in helping you process large volumes of data efficiently or automate repetitive tasks. The link at the bottom of the page provides the example data and R Markdown file used to generate this module.

Create Your Own Function

As you’ve already seen, R provides a multitude of functions either from base R or from one of the many available packages. However, sometimes you may want to define your own function to perform an operation or analysis specific to your work or to make a task easier to implement. With this in mind, R has a built-in function() function, which allows you to define your own function. I will demonstrate this using a few examples.

In the first example, I am creating a function to rescale data. The function will accept an input vector or data frame column and rescale the data from 0 to 1. You can then define a value to multiply by to rescale the 0-1 data to a new scale. Note that the function is being stored to a variable (scale2), which can later be used to call the function. It accepts two arguments: data and scale. What the function actually does is defined within the curly brackets. Here is the process:

  1. The maximum value is calculated and stored as a variable (max1)
  2. The minimum value is calculated and stored as a variable (min1)
  3. The minimum is subtracted from each data point and stored as a vector (n)
  4. The data range is calculated and stored as a variable (d)
  5. The data are rescaled from 0-1
  6. The rescaled data are multiplied by a value to change the scale
  7. The final rescaled data are returned

Note that variables created within a function cannot be used outside of it. Such variables are local variables, or have local scope. In contrast, variables declared outside of a function can be used globally, or have global scope. The return() function defines what the function will return or produce. In the example, the rescaled data will be returned as a vector object.

scale2 <- function(data, scale){
    max1 <- max(data)
    min1 <- min(data)
    n <- data-min1
    d <- max1-min1
    s <- n/d
    s2 <- s*scale
    return(s2)
}

Once a function is defined, it can be used, which is really the whole point of creating it in the first place. In the example below I am creating a numeric vector (x). I am then using the new function to rescale the values stored in x from 0 to 100 and then from 0 to 1.

x <- c(1, 14, 21, 16, 18, 16, 19, 20, 6, 8, 9, 11, 17)
x100 <- scale2(x, 100)
x1 <- scale2(x, 1)
print(x)
 [1]  1 14 21 16 18 16 19 20  6  8  9 11 17
print(x100)
 [1]   0  65 100  75  85  75  90  95  25  35  40  50  80
print(x1)
 [1] 0.00 0.65 1.00 0.75 0.85 0.75 0.90 0.95 0.25 0.35 0.40 0.50 0.80
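
To make the idea of local scope concrete, here is a minimal sketch. The function add_offset and the variable local_offset are hypothetical names used only for illustration. The variable created inside the function is not visible once the function has finished.

```r
# Hypothetical helper used only to illustrate local scope
add_offset <- function(values) {
  local_offset <- 10            # exists only while the function runs
  return(values + local_offset)
}

shifted <- add_offset(c(1, 2, 3))
print(shifted)

# local_offset was created inside the function, so it is not defined here
print(exists("local_offset"))
```

In a fresh session the last line should report FALSE, since local_offset was never created outside of the function.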

Functions that you create are not permanent. So, if you would like to use your function in a new script, it will need to be defined again in that script.
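
One common way to reuse a function across scripts is to save its definition in a separate R file and load it with source(). The sketch below assumes a temporary file in place of a real script, and double_it is a hypothetical function name.

```r
# Write a function definition to a script file. A temporary file stands in
# for a real script such as "my_functions.R" (hypothetical name).
script_path <- tempfile(fileext = ".R")
writeLines("double_it <- function(v) { return(v * 2) }", script_path)

# source() runs the script, which defines double_it in the current session
source(script_path)
print(double_it(c(1, 2, 3)))
```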

I have provided a second example for calculating root mean square error, or RMSE, for an assessment of georeferencing results. Here, four arguments are required (the correct and predicted coordinates in the x and y directions). The function then calculates RMSE components, including residuals, squared residuals, RMSEx, RMSEy, and RMSETotal, and returns a list object holding this information. I then test the function on some example data and print the RMSE measures.

rmse_georef <- function(x_c, y_c, x_p, y_p){
  x_residual <- x_c - x_p
  y_residual <- y_c - y_p
  x_residual_sq <- x_residual^2
  y_residual_sq <- y_residual^2
  rmse_x <- sqrt(sum(x_residual_sq)/length(x_residual))
  rmse_y <- sqrt(sum(y_residual_sq)/length(y_residual))
  rmse_total <- sqrt(rmse_x^2 + rmse_y^2)
  rmse_list <- list(x_residual, y_residual, x_residual_sq, y_residual_sq, rmse_x, rmse_y, rmse_total)
  names(rmse_list) <- c("X.Residuals", "Y.Residuals", "X.Sq.Residuals", "Y.Sq.Residuals", "RMSE.X", "RMSE.Y", "RMSE.Total")
  return(rmse_list)
}

x_actual <- c(584026.624, 583179.7805, 589507.5837, 579463.0782, 585908.4986, 588190.2715)
y_actual <- c(4474131.442, 4479283.074, 4476648.449, 4478436.23, 4470697.021, 4480318.105)
x_predicted <- c(584041.7902, 583211.7964, 589496.2211, 579447.4653, 585909.7985, 588206.0155)
y_predicted <- c(4474159.608, 4479295.524, 4476664.073, 4478462.252, 4470719.12, 4480344.345)
example_rmse <- rmse_georef(x_actual, y_actual, x_predicted, y_predicted)
print(example_rmse$RMSE.Total)
[1] 28.64713
print(example_rmse$RMSE.X)
[1] 17.68929
print(example_rmse$RMSE.Y)
[1] 22.53325
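
The RMSE in a single direction is the square root of the mean of the squared residuals, so the sum()/length() step in the function above can also be written with mean(). The sketch below uses a hypothetical helper (rmse_1d) and made-up values to show that the two forms agree.

```r
# Hypothetical one-dimensional helper: square root of the mean squared residual
rmse_1d <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up values for illustration only
actual <- c(1, 2, 3)
predicted <- c(2, 2, 5)

# Same quantity computed with sum() and length(), as in rmse_georef
rmse_long <- sqrt(sum((actual - predicted)^2) / length(actual))
print(all.equal(rmse_1d(actual, predicted), rmse_long))
```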

Using While Loops

While loops are used to perform some process while a condition is TRUE. In the example, the variable x1 is initially 100. The loop prints x1 and then subtracts 1 from the current value. This process will continue until the condition is no longer TRUE. In this case it will continue until x1 reaches 90, at which point the condition will evaluate to FALSE and the loop will be exited. It is important to define a condition that will eventually evaluate to FALSE. If your condition always evaluates to TRUE, then the loop will never stop. This is known as an infinite loop.

x1 <- 100
while (x1 > 90) {
  print(x1)
  x1 <- x1 - 1
}
[1] 100
[1] 99
[1] 98
[1] 97
[1] 96
[1] 95
[1] 94
[1] 93
[1] 92
[1] 91
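
As a guard against accidental infinite loops, one option is to count iterations and exit once a cap is reached. This is only a sketch, and the cap (max_iterations) is an arbitrary value chosen for illustration.

```r
value <- 100
iterations <- 0
max_iterations <- 1000   # arbitrary safety cap, assumed for this sketch

# Halve the value until it is no longer greater than 1
while (value > 1) {
  value <- value / 2
  iterations <- iterations + 1
  if (iterations >= max_iterations) {
    break                # leave the loop even if the condition is still TRUE
  }
}
print(iterations)
print(value)
```

Here the condition becomes FALSE after seven halvings, so the cap is never reached. It only matters if the condition were written incorrectly.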

Using For Loops

I don’t use while loops that often in R. However, I tend to use for loops frequently. In contrast to while loops, for loops do not rely on a condition. Instead, a process is executed once for each element in a sequence. For example, you could process all data points in a vector or all columns or rows in a data frame. You could also perform the same process for all files in a list of files.

In the first example, I create a vector of country names. I then use a for loop to process each element. Specifically, the loop will print “I would like to go to” followed by the country name.

x <- c("Austria", "New Zealand", "Norway", "Canada", "Cuba")
for(i in x){
  print(paste("I would like to go to ", i, ".", sep=""))
}
[1] "I would like to go to Austria."
[1] "I would like to go to New Zealand."
[1] "I would like to go to Norway."
[1] "I would like to go to Canada."
[1] "I would like to go to Cuba."

Note that i is simply a variable and does not need to be called i as demonstrated in the following example.

x <- c("Austria", "New Zealand", "Norway", "Canada", "Cuba")
for(country in x){
  print(paste("I would like to go to ", country, ".", sep=""))
}
[1] "I would like to go to Austria."
[1] "I would like to go to New Zealand."
[1] "I would like to go to Norway."
[1] "I would like to go to Canada."
[1] "I would like to go to Cuba."
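
A for loop can also iterate over positions rather than values, which is useful when the position is needed or when results are written into a pre-allocated vector. A minimal sketch using seq_along() follows. The variable names are my own.

```r
countries <- c("Austria", "New Zealand", "Norway")
labels <- character(length(countries))   # pre-allocate the output vector

# seq_along() yields 1, 2, 3, ... up to the length of the vector
for (idx in seq_along(countries)) {
  labels[idx] <- paste0(idx, ". ", countries[idx])
}
print(labels)
```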

I have found for loops to be especially useful for processing multiple files. Here is an example for raster grids. Note that some of the functions used here have not been discussed yet. We will discuss them when we talk about geospatial data in R. I am just providing this example to make a point that for loops can help you process geospatial data efficiently. For example, you could process thousands of files using only a few lines of code. I have not provided these data, so you will not be able to execute this example. Here I am reading all elevation grids in a folder then finding all cells that have an elevation greater than 500 meters. I then write the results out to binary raster grids.

#You won't be able to run this since I didn't provide this folder. 
library(raster)
raster_list <- list.files(path = "C:/Teaching/elev_data", pattern = "\\.tif$")
new_dir <- "C:/Teaching/elev_data_out/"
for(ras in raster_list){
  r1 <- raster(paste0("C:/Teaching/elev_data/", ras))
  r2 <- r1 > 500
  writeRaster(r2, filename=paste0(new_dir, ras), format="GTiff", overwrite=TRUE)
}
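
Since the raster example cannot be run without the data, here is a runnable sketch of the same looping pattern using only base R. It writes three small CSV files of made-up elevation values to a temporary folder, then loops over the files and counts the values greater than 500. The folder name, file names, and values are all assumptions for illustration.

```r
# Create a temporary folder holding three small CSV files (made-up data)
in_dir <- file.path(tempdir(), "loop_demo")
dir.create(in_dir, showWarnings = FALSE)
for (k in 1:3) {
  write.csv(data.frame(elev = c(100, 300, 700) * k),
            file.path(in_dir, paste0("site_", k, ".csv")),
            row.names = FALSE)
}

# List every CSV in the folder, then process each file in turn
csv_files <- list.files(path = in_dir, pattern = "\\.csv$", full.names = TRUE)
counts <- integer(0)
for (f in csv_files) {
  d <- read.csv(f)
  counts[basename(f)] <- sum(d$elev > 500)   # number of values above 500
}
print(counts)
```

Using full.names = TRUE returns complete paths, which avoids pasting the folder name onto each file name inside the loop.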

If and If…Else

If is used to only perform some operation if the condition is TRUE. In this example, the statement is printed because the condition evaluated to TRUE, or because 4 is less than or equal to 6. If you change the value stored in the variable a to a number larger than 6, nothing will be printed.

a <- 4
if(a <= 6){
  print("Value less than or equal to 6.")
}
[1] "Value less than or equal to 6."

What if you want different operations to be performed based on whether a single condition is true? This can be accomplished using an if…else statement. If the condition evaluates to TRUE, then the operation in the if statement will be performed. If it evaluates to FALSE, then the operation in the else statement will be performed. In the example, “Value greater than 6.” is printed because the condition evaluates to FALSE, so the operation defined within else is executed.

a <- 8
if(a <= 6){
  print("Value less than or equal to 6.")
}else{
  print("Value greater than 6.")
}
[1] "Value greater than 6."

What if you want to include more than one criterion? Then you can include else if as shown below. Note that you can include multiple else if conditions.

a <- 8
if(a <= 6){
  print("Value less than or equal to 6.")
}else if(a > 6 && a < 10){
  print("Value is between 6 and 10.")
}else{
  print("Value is greater than or equal to 10.")
}
[1] "Value is between 6 and 10."
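
If the same test is needed in several places, the if…else chain can be wrapped in a function. The sketch below uses a hypothetical function name (classify_value). Because the branches are checked in order, the second test only needs to check the upper bound.

```r
# classify_value is a hypothetical name chosen for this sketch
classify_value <- function(a) {
  if (a <= 6) {
    return("Value less than or equal to 6.")
  } else if (a < 10) {
    # Reached only when a > 6, so only the upper bound is tested
    return("Value is between 6 and 10.")
  } else {
    return("Value is greater than or equal to 10.")
  }
}

print(classify_value(4))
print(classify_value(8))
print(classify_value(12))
```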

In this example, I am now providing a vector with multiple elements. By combining a for loop and if…else, I obtain a result for each data point.

b <- c(1, 3, 5, 7, 9, 11)
for(num in b){
  if(num <= 6){
    print("Value less than or equal to 6.")
  }else if(num > 6 && num < 10){
    print("Value is between 6 and 10.")
  }else{
    print("Value is greater than or equal to 10.")
  }
}
[1] "Value less than or equal to 6."
[1] "Value less than or equal to 6."
[1] "Value less than or equal to 6."
[1] "Value is between 6 and 10."
[1] "Value is between 6 and 10."
[1] "Value is greater than or equal to 10."

Lastly, this example shows how to combine the results into a single vector. In this case, each result, regardless of the evaluation of the conditions, is appended to an initially empty vector (results). I avoid naming the vector c so that it is not confused with the c() function.

b <- c(1, 3, 5, 7, 9, 11)
results <- c()
for(num in b){
  if(num <= 6){
    results <- c(results, paste(num, "is less than or equal to 6.", sep=" "))
  }else if(num > 6 && num < 10){
    results <- c(results, paste(num, "is between 6 and 10.", sep=" "))
  }else{
    results <- c(results, paste(num, "is greater than or equal to 10.", sep=" "))
  }
}
print(results)
[1] "1 is less than or equal to 6."      "3 is less than or equal to 6."     
[3] "5 is less than or equal to 6."      "7 is between 6 and 10."            
[5] "9 is between 6 and 10."             "11 is greater than or equal to 10."
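
For a task like this, a vectorized alternative to the loop is the base R ifelse() function, which evaluates a condition for every element at once. The following is a sketch of that alternative, not a replacement for the example above. The variable names are my own.

```r
b <- c(1, 3, 5, 7, 9, 11)

# Nested ifelse() calls test every element of b without an explicit loop
labels <- ifelse(b <= 6, "is less than or equal to 6.",
                 ifelse(b < 10, "is between 6 and 10.",
                        "is greater than or equal to 10."))

# paste() is also vectorized, so each number is joined to its own label
results_vec <- paste(b, labels, sep = " ")
print(results_vec)
```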

Next and Break

In a for loop it is possible to skip an iteration or stop the loop prematurely. The next statement skips the rest of the current iteration and moves on to the next one, while break exits the loop completely. This is demonstrated in the example below. First, I generate a sequence of numbers from 1 to 21. Then I set up a for loop that will append each number to an initially empty vector (b) unless it is an odd number (in this case the modulus operator, %%, will yield a remainder of 1). I do this using next inside of an if statement, which will cause the loop to skip the odd numbers. I also don’t want to append values larger than 15 to the vector, so I use break to stop the loop once a value exceeds 15.

a <- seq(1, 21, by=1)
b <- c()
for (i in a) {
  if(i%%2 == 1){
    next
  }
  if(i > 15){
    break
  }
  
  b <- c(b, i)
}

print(b)
[1]  2  4  6  8 10 12 14
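
For comparison, the same selection can be sketched without a loop by using logical indexing. This relies on the sequence being sorted in ascending order, so that stopping at the first value above 15 is equivalent to keeping only values of 15 or less. The name b_vec is my own.

```r
a <- seq(1, 21, by = 1)

# Keep values that are even (remainder of 0) and no larger than 15
b_vec <- a[a %% 2 == 0 & a <= 15]
print(b_vec)
```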

Which

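The which() function in R returns the indices of the elements in a vector, or the rows in a data frame, that meet a certain condition. You could then use these indices for selection. Note that the example below draws random values with sample(), so your results will differ from those shown.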

y <- sample(1:255, 15, replace=TRUE)
which(y > 100)
 [1]  2  4  5  7  8  9 11 12 13 14
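
To show how the indices can be used for selection, here is a small sketch with a fixed vector (made-up values) so that the result is reproducible.

```r
y <- c(12, 150, 47, 201, 99, 255)

# Positions of the elements greater than 100
idx <- which(y > 100)
print(idx)

# Use the positions to select the matching values
print(y[idx])

# which.max() returns the position of the largest value
print(which.max(y))
```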

Other Useful Packages

Below is a list of other packages, with associated descriptions, that are useful for making your code more concise, efficient, and/or powerful. We will not discuss these packages in this course. However, if you are interested, there are many resources available online.

  1. purrr: tidyverse package for working with functions and lists
  2. furrr: purrr-style mapping functions that run in parallel using the future package
  3. doParallel: supports parallel computing using multiple CPU cores
  4. glue: allows for passing variables into strings and simplified string concatenation

Using loops to process large data sets, such as looping through every row in a large data frame, can be slow and computationally inefficient. To alleviate this issue, base R provides a series of apply() functions.

  • apply(): apply a defined function to every row or column in a data frame or matrix
  • lapply(): apply function to all elements in a list or vector and return a list object
  • sapply(): apply function to all elements in a list or vector and return a vector or matrix where possible
  • tapply(): apply function to groups of elements in a vector or data frame column
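
As a brief sketch of how these functions are called, the example below applies each one to small made-up inputs. The variable names are my own.

```r
m <- matrix(1:6, nrow = 2)                     # rows are (1, 3, 5) and (2, 4, 6)

row_sums <- apply(m, 1, sum)                   # 1 = apply over rows
col_means <- apply(m, 2, mean)                 # 2 = apply over columns

sq_list <- lapply(1:3, function(v) v^2)        # returns a list
sq_vec <- sapply(1:3, function(v) v^2)         # simplifies to a vector

# Mean of the values within each group ("a" and "b")
group_means <- tapply(c(10, 20, 30, 40), c("a", "b", "a", "b"), mean)

print(row_sums)
print(col_means)
print(sq_vec)
print(group_means)
```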

Concluding Remarks

Now that you have an understanding of programming in R, we can move on to a discussion of data analysis in R. Throughout this course, we will apply the coding, data manipulation, and analysis techniques learned in these early sections. In the next section we will explore data summarization and simple statistical tests.