# The R Language Part II

## Objectives

- Further explore the R language
- Create your own functions
- Use while and for loops
- Use if…else, next and break, and which

## Overview

In this second section on the R language, we will explore more
advanced scripting techniques including **functions**,
**loops**, and **if…else** statements. Most
programming and scripting languages provide these types of capabilities,
and R is no exception. Such techniques can be very useful in helping you
process large volumes of data efficiently or automate repetitive tasks.
The link at the bottom of the page provides the example data and R
Markdown file used to generate this module.

## Create Your Own Function

As you’ve already seen, R provides a multitude of functions either
from base R or from one of the many available packages. However,
sometimes you may want to define your own function to perform an
operation or analysis specific to your work or to make a task easier to
implement. For this purpose, R provides the *function()* keyword,
which allows you to define your own function. I will
demonstrate this using a few examples.

In the first example, I am creating a function to rescale data. The
function will accept an input vector or data frame column and rescale
the data from 0 to 1. You can then define a value to multiply by to
rescale the 0-1 data to a new scale. Note that the function is being
stored to a variable (*scale2*), which can later be used to call
the function. It accepts two arguments: *data* and
*scale*. What the function actually does is defined within the
curly brackets. Here is the process:

- The maximum value is calculated and stored to a variable (*max1*)
- The minimum value is calculated and stored to a variable (*min1*)
- The minimum is subtracted from each data point and stored as a vector (*n*)
- The data range is calculated and stored as a variable (*d*)
- The data are rescaled from 0-1
- The rescaled data are multiplied by a value to change the scale
- The final rescaled data are returned

Note that variables created within a function cannot be used
outside of the function. This is known as **local scope**:
variables defined in a function are local variables and can only be
used within the function. In contrast, variables declared outside of a
function have **global scope** and can be used anywhere in the script.
The *return()* function defines what the function will return or
produce. In the example, the rescaled data will be returned as a vector
object.

```
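scale2 <- function(data, scale){
  max1 <- max(data)    # maximum of the input
  min1 <- min(data)    # minimum of the input
  n <- data-min1       # shift the data so the minimum is 0
  d <- max1-min1       # data range
  s <- n/d             # rescale to 0-1
  s2 <- s*scale        # stretch to the requested scale
  return(s2)
}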
```

Once a function is defined, it can be used, which is really the whole
point of creating it in the first place. In the example below I am
creating a numeric vector (*x*). I am then using the new function
to rescale the values stored in *x* from 0 to 100 and then from 0
to 1.

```
x <- c(1, 14, 21, 16, 18, 16, 19, 20, 6, 8, 9, 11, 17)
x100 <- scale2(x, 100)
x1 <- scale2(x, 1)
print(x)
# [1] 1 14 21 16 18 16 19 20 6 8 9 11 17
print(x100)
# [1] 0 65 100 75 85 75 90 95 25 35 40 50 80
print(x1)
# [1] 0.00 0.65 1.00 0.75 0.85 0.75 0.90 0.95 0.25 0.35 0.40 0.50 0.80
```
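As a minimal sketch of the local scope idea described earlier, the hypothetical function *shift_to_zero()* below (it is not part of the original examples) creates a variable named *local_min* inside its body. After the function is called, *exists()* reports that no variable with that name is available outside of the function. This assumes you have not already defined *local_min* yourself in the current session.

```r
# Hypothetical helper used only to illustrate local scope
shift_to_zero <- function(data){
  local_min <- min(data)        # local_min exists only while the function runs
  shifted <- data - local_min
  return(shifted)
}

result <- shift_to_zero(c(5, 10, 15))
print(result)
# [1]  0  5 10

# local_min was created inside the function, so it is not visible here
print(exists("local_min"))
# [1] FALSE
```

Only the value passed to *return()* makes it back to the calling script, which is why the result is stored to a new variable (*result*).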

Functions that you create exist only in the R session in which they are defined. So, if you would like to use your function in a new script, it will need to be defined again in that script. If you build several related functions or would like to make your functions available to others, you can include them in a new R package. However, this is outside the scope of this course.

I have provided a second example for calculating **root mean
square error**, or **RMSE**, for an assessment of
positional accuracy of x and y coordinates. Here, four arguments are
required (the correct and predicted coordinates in the x and y
directions). The function then calculates RMSE components, including
**residuals**, **squared residuals**,
$\mathrm{RMSE}_{x}$, $\mathrm{RMSE}_{y}$, and $\mathrm{RMSE}_{Total}$, and returns a list object holding this information. I then test the function on some example data and return the RMSE measures.

```
rmse_georef <- function(x_c, y_c, x_p, y_p){
  # Residuals: correct minus predicted coordinates
  x_residual <- x_c - x_p
  y_residual <- y_c - y_p
  x_residual_sq <- x_residual^2
  y_residual_sq <- y_residual^2
  # RMSE in each direction, then combined
  rmse_x <- sqrt(sum(x_residual_sq)/length(x_residual))
  rmse_y <- sqrt(sum(y_residual_sq)/length(y_residual))
  rmse_total <- sqrt(rmse_x^2 + rmse_y^2)
  rmse_list <- list(x_residual, y_residual, x_residual_sq, y_residual_sq, rmse_x, rmse_y, rmse_total)
  names(rmse_list) <- c("X.Residuals", "Y.Residuals", "X.Sq.Residuals", "Y.Sq.Residuals", "RMSE.X", "RMSE.Y", "RMSE.Total")
  return(rmse_list)
}
x_actual <- c(584026.624, 583179.7805, 589507.5837, 579463.0782, 585908.4986, 588190.2715)
y_actual <- c(4474131.442, 4479283.074, 4476648.449, 4478436.23, 4470697.021, 4480318.105)
x_predicted <- c(584041.7902, 583211.7964, 589496.2211, 579447.4653, 585909.7985, 588206.0155)
y_predicted <- c(4474159.608, 4479295.524, 4476664.073, 4478462.252, 4470719.12, 4480344.345)
example_rmse <- rmse_georef(x_actual, y_actual, x_predicted, y_predicted)
print(example_rmse$RMSE.Total)
# [1] 28.64713
print(example_rmse$RMSE.X)
# [1] 17.68929
print(example_rmse$RMSE.Y)
# [1] 22.53325
```
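For reference, the calculations performed inside *rmse_georef()* can be written as equations. The notation here is my own shorthand rather than something defined in the module: $n$ is the number of points, and $x_{c,i}$ and $x_{p,i}$ are the correct and predicted x coordinates of point $i$ (likewise for y).

$$
\mathrm{RMSE}_{x} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_{c,i} - x_{p,i}\right)^{2}}, \qquad
\mathrm{RMSE}_{y} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{c,i} - y_{p,i}\right)^{2}}
$$

$$
\mathrm{RMSE}_{Total} = \sqrt{\mathrm{RMSE}_{x}^{2} + \mathrm{RMSE}_{y}^{2}}
$$

Using the values printed for the example data, $\sqrt{17.68929^{2} + 22.53325^{2}} \approx 28.647$, which agrees with the total returned by the function.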

## Using While Loops

**While loops** are used to perform some process while a
condition is TRUE. In the example, the variable *x1* is initially
100. The loop then prints *x1* followed by subtracting 1 from the
current value. This process will continue until the condition is no
longer TRUE. In this case it will continue until *x1* reaches 90,
at which point the condition will evaluate to FALSE and the loop will be
exited. It is important to define a condition that will eventually
evaluate to FALSE. If your condition always evaluates to TRUE, then the
loop will never stop. This is known as an **infinite
loop**.

```
x1 <- 100
while (x1 > 90) {
  print(x1)
  x1 <- x1-1   # decrease x1 so the condition eventually becomes FALSE
}
# [1] 100
# [1] 99
# [1] 98
# [1] 97
# [1] 96
# [1] 95
# [1] 94
# [1] 93
# [1] 92
# [1] 91
```
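One common way to guard against an accidental infinite loop is to add an iteration cap to the condition. The sketch below is only an illustration: the variable names (*value*, *iterations*, *max_iterations*) and the cap of 1000 are my own choices, not part of the example above.

```r
value <- 100
iterations <- 0
max_iterations <- 1000   # assumed safety cap; pick a limit that suits your task

# Halve the value until it drops to 1 or below, or until the cap is reached
while (value > 1 && iterations < max_iterations) {
  value <- value / 2
  iterations <- iterations + 1
}

print(iterations)
# [1] 7
print(value)
# [1] 0.78125
```

Here the first condition stops the loop after seven halvings, so the cap is never reached. It only matters if the main condition would otherwise never become FALSE.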

## Using For Loops

I don’t use while loops that often in R. However, I tend to use
**for loops** frequently. In contrast to while loops, for
loops do not rely on a condition. Instead, a process is executed once for
each element in a sequence. For example, you could process all data points in a vector or
all columns or rows in a data frame. You could also perform the same process for all
files in a list of files.

In the first example, I create a vector of country names. I then use a for loop to process each element. Specifically, the loop will print “I would like to go to” followed by the country name.

```
x <- c("Austria", "New Zealand", "Norway", "Canada", "Cuba")
for(i in x){
  print(paste("I would like to go to ", i, ".", sep=""))
}
# [1] "I would like to go to Austria."
# [1] "I would like to go to New Zealand."
# [1] "I would like to go to Norway."
# [1] "I would like to go to Canada."
# [1] "I would like to go to Cuba."
```

Note that *i* is simply a variable and does not need to be
called *i* as demonstrated in the following example.

```
x <- c("Austria", "New Zealand", "Norway", "Canada", "Cuba")
for(country in x){
  print(paste("I would like to go to ", country, ".", sep=""))
}
# [1] "I would like to go to Austria."
# [1] "I would like to go to New Zealand."
# [1] "I would like to go to Norway."
# [1] "I would like to go to Canada."
# [1] "I would like to go to Cuba."
```
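A for loop can also step through positions rather than values, which is handy when you need to know where you are in the vector or want to fill a second vector of the same length. The sketch below uses *seq_along()* from base R; the variable names (*countries*, *messages*, *idx*) are only illustrative.

```r
countries <- c("Austria", "New Zealand", "Norway")

# Create the output vector up front, then fill it position by position
messages <- character(length(countries))
for (idx in seq_along(countries)) {
  messages[idx] <- paste0(idx, ". ", countries[idx])
}

cat(messages, sep = "\n")
# 1. Austria
# 2. New Zealand
# 3. Norway
```

Using *seq_along(countries)* instead of *1:length(countries)* is a small safeguard: if the vector is empty, the loop simply does not run.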

I have found for loops to be especially useful for processing
multiple files. Here is an example for **raster** grids, a
type of digital map data. This type of data and the methods used may not
be familiar to you. That is okay; I am just providing this example to
make a point that for loops can help you process data efficiently. For
example, you could process thousands of files using only a few lines of
code. I have not provided these data, so you will not be able to execute
this example. Here I am reading all elevation grids in a folder then
finding all cells that have an elevation greater than 500 meters. I then
write the results out to binary raster grids.

```
#You won't be able to run this since I didn't provide this folder.
library(raster)
```

```
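# List every .img grid in the input folder (file names only)
in_dir <- "D:/elev_data/"
raster_list <- list.files(path = in_dir, pattern = "\\.img$")
new_dir <- "D:/elev_data_out/"
for(ras in raster_list){
  # Read each grid from the same folder that was listed above
  r1 <- raster(paste0(in_dir, ras))
  # TRUE (1) where elevation exceeds 500 meters, FALSE (0) elsewhere
  r2 <- r1 > 500
  writeRaster(r2, filename=paste0(new_dir, ras), overwrite=TRUE)
}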
```
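Because the raster data are not provided, here is a rough stand-in that you can run. It follows the same pattern (list the files, loop over them, apply a threshold) but uses small CSV files written to a temporary folder with base R only. The folder name, file names, and elevation values are all made up for illustration.

```r
# Build a small temporary folder of example files (made-up elevation values)
in_dir <- file.path(tempdir(), "loop_demo_in")
dir.create(in_dir, showWarnings = FALSE)
for (k in 1:3) {
  demo <- data.frame(elev = c(400, 450, 550, 600) + k * 10)
  write.csv(demo, file.path(in_dir, paste0("grid_", k, ".csv")), row.names = FALSE)
}

# Same pattern as the raster example: list the files, then loop over them
csv_list <- list.files(path = in_dir, pattern = "\\.csv$")
counts <- integer(0)
for (f in csv_list) {
  d <- read.csv(file.path(in_dir, f))
  counts[f] <- sum(d$elev > 500)   # number of values above 500 in this file
}
print(counts)
```

Each of the three example files has two values above 500, so *counts* holds a 2 for every file name. The loop itself would not change if the folder held thousands of files.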

## If and If…Else

**If** is used to only perform some operation if the
condition is TRUE. In this example, the statement is printed because the
condition evaluated to TRUE, or because 4 is less than or equal to 6. If
you change the value stored in the variable *a* to a number
larger than 6, nothing will be printed.

```
a <- 4
if(a <= 6){
  print("Value less than or equal to 6.")
}
# [1] "Value less than or equal to 6."
```

What if you want different operations to be performed based on
whether a single condition is true? This can be accomplished using an
**if…else** statement. If the condition evaluates to TRUE,
then the operation in the if statement will be performed. If it
evaluates to FALSE, then the operation in the **else**
statement will be executed. In the example, “Value greater than 6.” is
printed because the condition evaluates to FALSE, so the operation
defined within else is executed.

```
a <- 8
if(a <= 6){
  print("Value less than or equal to 6.")
}else{
  print("Value greater than 6.")
}
# [1] "Value greater than 6."
```

What if you want to include more than one condition? Then you can
include **else if** as shown below. Note that you can
include multiple else if conditions.

```
a <- 8
if(a <= 6){
  print("Value less than or equal to 6.")
}else if(a > 6 && a < 10){
  print("Value is between 6 and 10.")
}else{
  print("Value is greater than or equal to 10.")
}
# [1] "Value is between 6 and 10."
```

In this example, I am now providing a vector with multiple elements. By combining a for loop and if…else, I can obtain different results for each data point in the vector.

```
b <- c(1, 3, 5, 7, 9, 11)
for(num in b){
  if(num <= 6){
    print("Value less than or equal to 6.")
  }else if(num > 6 && num < 10){
    print("Value is between 6 and 10.")
  }else{
    print("Value is greater than or equal to 10.")
  }
}
# [1] "Value less than or equal to 6."
# [1] "Value less than or equal to 6."
# [1] "Value less than or equal to 6."
# [1] "Value is between 6 and 10."
# [1] "Value is between 6 and 10."
# [1] "Value is greater than or equal to 10."
```

Lastly, this example shows how to combine the results into a single
vector. In this case, each result, regardless of the evaluation of the
conditions, is appended to an initially empty vector (*c*).

```
b <- c(1, 3, 5, 7, 9, 11)
# Naming a vector c still works because R looks up functions separately,
# although a more distinctive name would be clearer in your own scripts.
c <- c()
for(num in b){
  if(num <= 6){
    c <- c(c, paste(num, "is less than or equal to 6.", sep=" "))
  }else if(num > 6 && num < 10){
    c <- c(c, paste(num, "is between 6 and 10.", sep=" "))
  }else{
    c <- c(c, paste(num, "is greater than or equal to 10.", sep=" "))
  }
}
print(c)
# [1] "1 is less than or equal to 6." "3 is less than or equal to 6."
# [3] "5 is less than or equal to 6." "7 is between 6 and 10."
# [5] "9 is between 6 and 10." "11 is greater than or equal to 10."
```
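As an aside, the same labels can be produced without a loop by using base R's vectorized *ifelse()* function, which evaluates a condition for every element at once. This is an alternative technique offered only as a sketch, not the approach used in the example above, and the variable names (*values*, *label*, *labelled*) are my own.

```r
values <- c(1, 3, 5, 7, 9, 11)

# Nested ifelse(): the second call handles everything greater than 6
label <- ifelse(values <= 6, "is less than or equal to 6.",
                ifelse(values < 10, "is between 6 and 10.",
                       "is greater than or equal to 10."))
labelled <- paste(values, label, sep = " ")
print(labelled[4])
# [1] "7 is between 6 and 10."
```

The loop version is easier to follow when each branch performs several steps. The *ifelse()* version is more compact when each branch only chooses a value.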

## Next and Break

In a for loop it is possible to skip iterations or stop the loop prematurely. The
**next** statement skips the remainder of the current iteration and
moves on to the next one, while **break** will allow
you to exit the loop completely. This is demonstrated in the example
below. First, I generate a sequence of numbers from 1 to 21. Then I set
up a for loop that will append each number to an initially empty vector
(*b*) unless it is an odd number (in this case the
**modulus** operator, *%%*, will yield a remainder of 1 when the number
is divided by 2). I do this using
next inside of an if statement, which will cause the loop to skip the
odd numbers. I also don’t want to append values larger than 15 to the
vector, so I use break to stop the loop once a value exceeds 15.

```
a <- seq(1, 21, by=1)
b <- c()
for (i in a) {
  if(i%%2 == 1){
    next    # odd number: skip to the next iteration
  }
  if(i > 15){
    break   # past 15: leave the loop entirely
  }
  b <- c(b, i)
}
print(b)
# [1] 2 4 6 8 10 12 14
```

## Which

**Which** in R is used to return the indices of elements
in a vector or rows in a data frame that meet a certain criterion. You
can then use these indices for selection.

```
y <- sample(1:255, 15, replace=TRUE)
which(y > 100)
# Your indices will differ because sample() draws random values.
# [1] 1 3 4 5 6 8 11 12 13 14
```
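To show how the returned indices can be used for selection, here is a short sketch with a fixed vector so that the output is reproducible. The vector *y_demo* and the data frame *df_demo* are made-up examples.

```r
y_demo <- c(12, 150, 99, 201, 100, 255)

# Positions of the values greater than 100
idx <- which(y_demo > 100)
print(idx)
# [1] 2 4 6

# Use the positions to select the matching values
print(y_demo[idx])
# [1] 150 201 255

# The same idea selects rows of a data frame
df_demo <- data.frame(id = 1:6, value = y_demo)
selected <- df_demo[which(df_demo$value > 100), ]
print(selected$id)
# [1] 2 4 6
```

Note that 100 itself is not selected because the condition uses a strict greater-than comparison.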

## Other Useful Packages

Below is a list of other packages, with short descriptions, that are useful for making your code more concise, efficient, and/or powerful. We will not discuss these packages in this course. However, if you are interested there are many resources available online.

- **purrr**: tidyverse package for working with functions and lists
- **furrr**: parallel versions of the purrr mapping functions, built on the future package
- **doParallel**: supports parallel computing using multiple CPU cores
- **glue**: allows for passing variables into strings and simplified string concatenation

Using loops to process large data sets, such as looping through every
row in a large data frame, can be slow and not computationally
efficient. To alleviate this issue, base R provides a series of
*apply()* functions.

- *apply()*: apply a defined function to every row or column in a data frame or matrix
- *lapply()*: apply function to all elements in a list and return a list object
- *sapply()*: apply function to all elements in a list and return a vector or matrix
- *tapply()*: apply function to groups of elements in a vector or data frame column
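
As a quick sketch of how these four base R functions behave, the example below applies simple summaries to small, made-up inputs. The object names (*m*, *lst*, *scores*, *groups*) are only illustrative.

```r
m <- matrix(1:6, nrow = 2)          # 2 rows x 3 columns, filled by column
row_sums <- apply(m, 1, sum)        # 1 = apply over rows
col_sums <- apply(m, 2, sum)        # 2 = apply over columns
print(row_sums)
# [1]  9 12
print(col_sums)
# [1]  3  7 11

lst <- list(a = 1:3, b = 4:6)
means_list <- lapply(lst, mean)     # returns a list
means_vec <- sapply(lst, mean)      # returns a named vector
print(means_vec)
# a b 
# 2 5 

scores <- c(10, 20, 30, 40)
groups <- c("x", "y", "x", "y")
group_means <- tapply(scores, groups, mean)   # mean of scores within each group
print(group_means)
#  x  y 
# 20 30 
```

In each case the function passed as the last argument (*sum* or *mean* here) could be replaced with a function that you define yourself.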

## Concluding Remarks

Now that you have an understanding of programming in R, we can move on to a discussion of data analysis in R. Throughout this course, we will apply the coding, data manipulation, and analysis techniques learned in these early sections. In the next section we will explore data summarization and simple statistical tests.