The R Language
Objectives
- Create and work with variables
- Understand the data types available in R and their uses
- Understand the data models available in R and their uses
- Read in your own data
- Generally start learning the R language
Looks like you’ve decided to learn some R. I think you will find that R is very useful and powerful for a variety of data and statistical analysis tasks. Before you can get into the specifics of analyzing data, you need to have a decent understanding of the data types and data models available in R. This is the goal of this section. I have found that many issues or errors arise because the data are not properly formatted or defined. Understanding the available data structures will help you troubleshoot such issues.
To allow you to follow along with the examples in each module, I have provided a link at the bottom of each module page to download the required data and the R Markdown file used to generate the page. I highly encourage you to follow along and experiment with the provided examples.
Download R and RStudio
Here are links to download R and RStudio. Please watch the introductory video that explains how to install the R software and RStudio integrated development environment (IDE).
Install Packages
In RStudio, packages can be installed using Install Packages in the Tools menu. Or, you can use the install.packages() function. Some packages require other packages to function, known as dependencies. When you install a package, its dependencies should install concurrently. If this doesn’t happen for some reason, you will need to install the dependencies manually.
Since R, RStudio, and specific packages are updated on a regular basis, you must use compatible versions. I have generally found that RStudio can adequately deal with versioning issues. You can check for package updates using Check for Package Updates in the Tools menu in RStudio.
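As a minimal sketch of the install.packages() route, the snippet below installs a package only when it is not already available. The helper name install_if_missing is hypothetical (it is not part of R), and "stats" is used only because it ships with R, so nothing is actually downloaded here.

```r
# Hypothetical helper: install a package only if it is not already installed
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # dependencies = TRUE also requests the packages this one relies on
    install.packages(pkg, dependencies = TRUE)
  }
  # Report whether the package is available now
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call does not trigger a download
available <- install_if_missing("stats")
print(available)
```

To use it for real, replace "stats" with the name of the package you need.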
Getting Help
Documentation for R packages is made available on The Comprehensive R Archive Network (or CRAN). Some authors also make use of GitHub. A large part of scripting is figuring out how to effectively alter other authors’ code and search for help online. Google can be an analyst’s best friend. I recommend creating an account on Stack Overflow. I also like the R-bloggers site. If you are able to pay for training, I like Udemy and DataCamp.
Within R, you can get help using the help() function or ? as demonstrated below for the c(), or combine, function.
help(c)
?c
Useful Resources
Here is a list of books that I have found useful:
- R in Action: Data Analysis and Graphics in R by Kabacoff
- R for Data Science: Import, Tidy, Transform, Visualize, and Model Data by Wickham and Grolemund
- R Graphics Cookbook by Chang
- Deep Learning with R by Chollet and Allaire
- An Introduction to R for Spatial Analysis & Mapping by Brunsdon and Comber
- Spatial Statistics & Geostatistics by Chun and Griffith
- An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani (Download free PDF Here)
Define a Variable
In R, you will primarily work with your data by referencing them using variables. The code below shows how to define a variable. Note that <- is commonly used for variable assignment in R (although you can also use =). A number does not need to be wrapped in quotes. In fact, if you do use quotes, the number will be treated as a character or string. Any strings or characters must be placed in quotes because, without them, R cannot differentiate between variable names and other text. TRUE and FALSE are available as Boolean data and do not make use of quotes. The print() function is used to print or write something to the console. Here, we are just printing the content of each variable. Note that you can pick your own variable names; however, there are some stipulations. For example, you cannot use reserved words (those that have a special use in R), and variable names cannot start with a number. The only special characters allowed in variable names are dots/periods and underscores. I like to keep variable names short so that they are easy to work with and call. Assigning to a name that is already in use overwrites the earlier value. So, all of your variables must have unique names if you want to maintain them throughout a script.
x <- 1
y <- "GIS"
z <- TRUE
print(x)
[1] 1
print(y)
[1] "GIS"
print(z)
[1] TRUE
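The naming and quoting rules above can be checked with a short sketch. The variable names here (n1, n2, my.var, my_var) are made up for illustration.

```r
# A number in quotes is stored as text, not as a number
n1 <- 5
n2 <- "5"
print(class(n1))  # "numeric"
print(class(n2))  # "character"

# Dots and underscores are allowed in variable names
my.var <- 10
my_var <- 20

# Assigning to an existing name overwrites the previous value
my_var <- my_var + my.var
print(my_var)     # 30
```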
Data Types
A variety of data types are available in R. Here are some explanations:
- Numeric (including double and integer): stores numbers and treats the data as numbers (i.e., you can perform mathematical operations on them).
- Characters: stores text or numbers treated as text. In many programming languages this would be referred to as a string.
- Logical: Either TRUE or FALSE. Note capitalization and no quotes.
- Dates: stores calendar dates and allows you to perform date arithmetic and time series analyses.
Calling the typeof() or class() function on a variable will print the data type. Here, you can see that the variable x is initially defined as a double type. Using as.integer(), I convert it to an integer data type. It is also possible to convert numeric data to a character or a string using as.character(). Characters that represent numbers can be converted back to numeric using as.numeric(). Characters can be converted to dates using as.Date(); however, you will need to provide some additional info to define how dates are formatted. Later in this section I will introduce the factor data type. Note that there are many as.something() methods that allow you to change data types or data models. There are also is.something() methods that allow you to check data types. These functions will return TRUE if the data are of that type and FALSE if they are not.
x <- 1
typeof(x)
[1] "double"
y <- as.integer(x)
typeof(y)
[1] "integer"
z <- as.character(y)
typeof(z)
[1] "character"
w <- as.numeric(z)
typeof(w)
[1] "double"
a <- TRUE
typeof(a)
[1] "logical"
d <- c("01/20/2020")
typeof(d)
[1] "character"
d2 <- as.Date(d, "%m/%d/%Y")
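As a small sketch of the is.something() checks and of what a Date object allows once the format string has been supplied, consider the following. The values are arbitrary examples.

```r
# is.*() functions return TRUE or FALSE
x <- 1
print(is.numeric(x))      # TRUE
print(is.integer(x))      # FALSE: a plain 1 is stored as a double
print(is.character("1"))  # TRUE

# An L suffix creates an integer directly
y <- 1L
print(is.integer(y))      # TRUE

# The format string describes how the text is laid out:
# %m = month, %d = day, %Y = four-digit year
d2 <- as.Date("01/20/2020", format = "%m/%d/%Y")
print(class(d2))          # "Date"

# Dates support arithmetic in days and can be re-formatted
later <- d2 + 30
print(later)
print(format(d2, "%Y"))
```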
Data Models
Vectors
We will now move on to discuss data models (or data structures). These are the structures that R uses to store your data. First, we will discuss vectors. A vector is a 1-dimensional array. Instead of storing a single piece of information, you can provide a set of values or characters. Note that all data components must be of the same type (for example, numeric or character); you cannot mix data types in a vector. In the example I have created objects to store numeric data (x), string data (y), and TRUE/FALSE Boolean values (z). Vectors that store only a single piece of data or a constant are called scalars; however, they are treated the same as vectors.
The c(), or combine, function is used to combine pieces of data into a single vector. It is one of the most commonly used functions in R, so you’ll get used to seeing it and using it.
x <- c(1, 2, 3, 4, 5, 6, 7)
y <- c("GIS", "Spatial", "Analytics", "R", "Data Science", "Remote Sensing")
z <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
print(x)
[1] 1 2 3 4 5 6 7
print(y)
[1] "GIS"            "Spatial"        "Analytics"      "R"
[5] "Data Science"   "Remote Sensing"
print(z)
[1]  TRUE FALSE  TRUE  TRUE FALSE
To extract specific pieces of data, you can use square bracket notation. R starts indexing from 1 as opposed to 0, which is more common in other programming languages. To call a single data element, just provide its index in the square brackets. You can also call a range of contiguous values by providing the index range using a colon. When doing so, the data points at the start and end index will be included in the subset along with all data points between them. In other languages it is common to not include the value at the last provided index. If you want to call discontinuous data points, you can use the c() function and provide a vector of indices.
print(y[2])
[1] "Spatial"
print(y[3:5])
[1] "Analytics"    "R"            "Data Science"
print(y[c(1, 3, 5)])
[1] "GIS"          "Analytics"    "Data Science"
print(y[c(1, 3:6)])
[1] "GIS"            "Analytics"      "R"              "Data Science"
[5] "Remote Sensing"
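The single-type rule for vectors is worth seeing in action: if you pass mixed types to c(), R does not raise an error but converts every element to one common type. The sketch below uses made-up values to show this.

```r
# Mixing types in c() converts everything to one common type
mixed <- c(1, "GIS", TRUE)
print(mixed)
print(typeof(mixed))   # "character"

# Numbers and logicals together become numeric (TRUE -> 1, FALSE -> 0)
num_log <- c(10, TRUE, FALSE)
print(num_log)

# length() reports the number of elements, so it can index the last one
y <- c("GIS", "Spatial", "Analytics", "R", "Data Science", "Remote Sensing")
print(length(y))
print(y[length(y)])
```

If a numeric column unexpectedly behaves like text, a stray character value in the data is a common cause.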
Matrices
A matrix is a 2-dimensional array (that is, values stored in rows and columns). All the cells in a matrix must have the same data type (for example, a matrix of numeric values). A matrix is generated using the matrix() function. You can provide a set of values, the number of rows, and the number of columns. The byrow argument determines how the matrix is populated with the provided data. If set to TRUE, the values will fill across the rows sequentially. In other words, all columns in a row will be filled before moving on to the next row. FALSE, which is the default, means that columns will be filled sequentially. You can also provide row and column names using the dimnames argument.
m <- matrix(1:50, nrow=10, ncol=5)
print(m)
      [,1] [,2] [,3] [,4] [,5]
 [1,]    1   11   21   31   41
 [2,]    2   12   22   32   42
 [3,]    3   13   23   33   43
 [4,]    4   14   24   34   44
 [5,]    5   15   25   35   45
 [6,]    6   16   26   36   46
 [7,]    7   17   27   37   47
 [8,]    8   18   28   38   48
 [9,]    9   19   29   39   49
[10,]   10   20   30   40   50
data1 <- c("A1", "B1", "C1", "A2", "B2", "C2", "A3", "B3", "C3")
rnames <- c("1", "2", "3")
cnames <- c("A", "B", "C")
matrix1 <- matrix(data1, nrow=3, ncol=3, byrow=TRUE, dimnames=list(rnames, cnames))
print(matrix1)
  A    B    C
1 "A1" "B1" "C1"
2 "A2" "B2" "C2"
3 "A3" "B3" "C3"
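To make the effect of byrow concrete, here is a small sketch that fills the same six values both ways. The object names (vals, by_col, by_row) are only for illustration.

```r
vals <- 1:6

# Default (byrow = FALSE): values run down each column first
by_col <- matrix(vals, nrow = 2, ncol = 3)
print(by_col)

# byrow = TRUE: values run across each row first
by_row <- matrix(vals, nrow = 2, ncol = 3, byrow = TRUE)
print(by_row)

# dim() returns the number of rows and columns
print(dim(by_row))
```

In by_col the first row is 1 3 5, while in by_row it is 1 2 3.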
Since you are now working in two dimensions, you will need to define two indices to extract a specific row/column or cell location from the matrix. The first value represents the row while the second value represents the column. So, [2,1] would indicate the value at row 2 and column 1. A blank in either position selects all rows or all columns. So, selecting data is similar for matrices and vectors except that you will need to specify a different number of indices. Note that it is also possible to subset based on the row or column names as opposed to indices.
matrix1[1,]
   A    B    C
"A1" "B1" "C1"
matrix1[,1]
   1    2    3
"A1" "A2" "A3"
matrix1[1,2:3]
   B    C
"B1" "C1"
matrix1[1:2,1:2]
  A    B
1 "A1" "B1"
2 "A2" "B2"
matrix1[1, c(1,3)]
   A    C
"A1" "C1"
matrix1["1", c("A", "B")]
   A    B
"A1" "B1"
Arrays
What if you need to expand to more than two dimensions? Enter the array. A matrix is just a 2-dimensional array. However, you can expand into more than two dimensions. For example, a 3-dimensional array would be similar to an image where the first dimension represents rows, the second represents columns, and the third represents the color channels (red, green, and blue). You could also think of a 3-dimensional array as a cube where each cell in the cube is a smaller cube defined by its position in the 3-dimensional space (each such cell is often referred to as a voxel). A 4-dimensional array could be used to add a time component to the data, which would be difficult to visualize since we only have three spatial dimensions to work with. For example, a video could be represented as a 4-dimensional array of row, column, color channel, and frame dimensions. Similar to matrices, all values stored in an array must be of the same type (for example, a numeric array). If you work in Python, matrices and arrays in R are comparable to NumPy arrays.
data2 <- seq(from=1, to=150, by=2)
rnames <- c("R1", "R2", "R3", "R4", "R5")
cnames <- c("C1", "C2", "C3", "C4", "C5")
bnames <- c("B1", "B2", "B3")
array1 <- array(data2, c(5, 5, 3), dimnames=list(rnames, cnames, bnames))
print(array1)
, , B1
   C1 C2 C3 C4 C5
R1  1 11 21 31 41
R2  3 13 23 33 43
R3  5 15 25 35 45
R4  7 17 27 37 47
R5  9 19 29 39 49
, , B2
   C1 C2 C3 C4 C5
R1 51 61 71 81 91
R2 53 63 73 83 93
R3 55 65 75 85 95
R4 57 67 77 87 97
R5 59 69 79 89 99
, , B3
    C1  C2  C3  C4  C5
R1 101 111 121 131 141
R2 103 113 123 133 143
R3 105 115 125 135 145
R4 107 117 127 137 147
R5 109 119 129 139 149
Since you now have more dimensions, you will need to provide more indices to extract specific values or ranges of values. The first argument will specify the indices for the first dimension (rows), the second will specify the second dimension (columns), and the third would be the third dimension (for example, color channels). Also similar to a matrix, you can define dimension names and use them to subset the data.
array1[1, 1, 1]
[1] 1
array1[1:3, 1:3, 1]
   C1 C2 C3
R1  1 11 21
R2  3 13 23
R3  5 15 25
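The image analogy can be sketched with a tiny array whose third dimension is named after color channels. The array img and its values are invented for illustration and do not represent real image data.

```r
# A small 2 x 3 x 3 "image": rows, columns, color channels
img <- array(1:18, dim = c(2, 3, 3),
             dimnames = list(c("R1", "R2"),
                             c("C1", "C2", "C3"),
                             c("red", "green", "blue")))
print(dim(img))

# Leaving the row and column positions blank returns one whole channel,
# which is an ordinary 2 x 3 matrix
green <- img[, , "green"]
print(green)

# Leaving the third position blank returns all channel values for one cell
cell <- img["R1", "C2", ]
print(cell)
```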
Data Frames
Both matrices and arrays can only store data of the same type. That is, you cannot create columns with different data types. So, there is a need for yet another data model. A data frame is similar to a matrix; however, each column can hold a different type of data. A data frame is very similar to a Microsoft Excel spreadsheet or a Pandas data frame in Python. I have found data frames to be the data structure that I use most often in R. They are generally considered the workhorse of R data models.
In the provided example, I am creating a data frame to store information about courses. First, I generate vectors to store each column of data. Note that each column must have the same length or the same number of data points if you want to combine them into a data frame. Here, I am generating a mix of numeric and character vectors. Using the data.frame() function, I then combine the vectors into a data frame. Once it is printed, you can see that each column took the name of the vector variable.
course_prefix <- c("Geog", "Geog", "Geol", "Geol", "Geog")
course_num <- c(107, 350, 101, 104, 455)
course_name <- c("Physical Geography", "GIScience", "Planet Earth", "Earth Through Time", "Remote Sensing")
enrollment <- c(210, 45, 235, 80, 35)
course_data <- data.frame(course_prefix, course_num, course_name, enrollment)
print(course_data)
  course_prefix course_num        course_name enrollment
1          Geog        107 Physical Geography        210
2          Geog        350          GIScience         45
3          Geol        101       Planet Earth        235
4          Geol        104 Earth Through Time         80
5          Geog        455     Remote Sensing         35
Extracting elements from a data frame is identical to extracting elements from a matrix since they are also 2-dimensional. You must specify indices for both the rows and the columns. You can also use the column names or row names. Column names will automatically be generated when a data frame is created. You will use a $ when referencing a column using its name (for example, df$Col1).
print(course_data[,1])
[1] "Geog" "Geog" "Geol" "Geol" "Geog"
print(course_data[1,])
  course_prefix course_num        course_name enrollment
1          Geog        107 Physical Geography        210
print(course_data[1,3])
[1] "Physical Geography"
print(course_data[,"course_name"])
[1] "Physical Geography" "GIScience"          "Planet Earth"
[4] "Earth Through Time" "Remote Sensing"
print(course_data$course_name)
[1] "Physical Geography" "GIScience"          "Planet Earth"
[4] "Earth Through Time" "Remote Sensing"
print(course_data$enrollment)
[1] 210  45 235  80  35
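A short sketch can confirm that a data frame keeps a separate type per column, while converting the same data to a matrix forces a single type. The two-row data frame below is a cut-down, illustrative version of the course table.

```r
courses <- data.frame(
  course_prefix = c("Geog", "Geol"),
  course_num = c(107, 101),
  enrollment = c(210, 235)
)

# Each column reports its own class
print(sapply(courses, class))

# Converting to a matrix forces one type, so the numbers become text
as_mat <- as.matrix(courses)
print(typeof(as_mat))

# $ and [[ ]] both return a column as a vector
same <- identical(courses$course_num, courses[["course_num"]])
print(same)

# nrow() and ncol() give the table dimensions
print(c(nrow(courses), ncol(courses)))
```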
In this course, you will primarily make use of vectors and data frames to work with your data, so you will get very used to working with these data structures.
Lists
I think of lists as containers that can store other data sets. Lists can be used to store multiple vectors, matrices, arrays, data frames, and even other lists. To call an element in a list, use the $ sign. You can then use the same selection methods for data models already discussed. I find that I don’t tend to create many lists. However, it is common for analyses to generate list objects that you will then need to work with or extract data or output from. This will be our primary use of lists in this course.
vec1 <- sample(1:1000, 50, replace=TRUE)
vec2 <- rnorm(200, mean=250, sd = 100)
data1 <- c("A1", "B1", "C1", "A2", "B2", "C2", "A3", "B3", "C3")
rnames <- c("1", "2", "3")
cnames <- c("A", "B", "C")
matrix1 <- matrix(data1, nrow=3, ncol=3, byrow=TRUE, dimnames=list(rnames, cnames))
data2 <- seq(from=1, to=150, by=2)
rnames <- c("R1", "R2", "R3", "R4", "R5")
cnames <- c("C1", "C2", "C3", "C4", "C5")
bnames <- c("B1", "B2", "B3")
array1 <- array(data2, c(5, 5, 3), dimnames=list(rnames, cnames, bnames))
list1 <- list(Vector_1 = vec1, Vector_2 = vec2, Matrix1 = matrix1, Array_1 = array1)
print(list1$Vector_1)
 [1] 671 205 323 925 789 645 732 455 118 850 341 944 650 701 481 322 872 510 465
[20] 145 645 293 800 116 985 872 832 940 511 813 350  26 664 501 920 808 515 386
[39] 783 326  59 871 140 945 532 189 851 396 937 330
print(list1$Array_1[1, 1, 1])
[1] 1
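As an illustration of an analysis returning a list, the sketch below fits a simple linear model with lm() on simulated data and pulls pieces out of the result. The data are random numbers generated only for this example.

```r
set.seed(42)  # makes the simulated data reproducible
x <- 1:20
y <- 2 * x + rnorm(20)

# lm() returns a fitted model object that is stored as a list
fit <- lm(y ~ x)
print(is.list(fit))
print(names(fit)[1:3])

# Elements are extracted with $, just like list1 above
print(fit$coefficients)

# [[ ]] is an alternative that also accepts a name or a position
same <- identical(fit$coefficients, fit[["coefficients"]])
print(same)
```

Calling names() on an unfamiliar result object is a quick way to see what it contains.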
Special Classes
R allows for other data models. For example, object-oriented classes can be used, including S3, S4, and Reference Classes (sometimes informally called R5). Classes define the type of object and its characteristics. Methods are functions associated with an object. If you have coded in Python, you are likely familiar with these object-oriented programming concepts. Later in this course, you will see some example use cases of specific R classes.
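To give a feel for how a class and a method fit together, here is a minimal S3 sketch. The class name "course", the constructor new_course(), and its fields are all hypothetical and exist only for this illustration.

```r
# Constructor (hypothetical): a list with a class attribute attached
new_course <- function(prefix, num) {
  structure(list(prefix = prefix, num = num), class = "course")
}

# Method: print() dispatches to print.course() for objects of class "course"
print.course <- function(x, ...) {
  cat("Course:", x$prefix, x$num, "\n")
  invisible(x)
}

gis <- new_course("Geog", 350)
print(class(gis))
print(gis)

# inherits() checks whether an object belongs to a class
print(inherits(gis, "course"))
```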
Factors
What if you would like to create character data in which only certain values or levels are allowed? This is the use of the factor data type; it is similar to the character data type but with defined levels or values.
In the example, I am generating a vector containing 1,500 records of different academic years. I am then defining the vector as a factor using the factor() function. To check that the data were successfully defined as a factor, I then use is.factor() (again, there are a lot of is.() and as.() functions). This returns TRUE, so we know that the data are now stored as a factor. Using the levels() function, we can obtain a list of the available levels, in this case the academic years.
One component of factors that is a bit confusing is that each unique category will be assigned a placeholder integer value. So, the data will actually be stored as integers, and each integer will be associated with a specific category.
ac_year <- rep(c("Freshman","Sophmore","Junior","Senior", "Graduate"), 1500*c(0.35,0.20,0.15,0.20, 0.10))
ac_year2 <- factor(ac_year)
is.factor(ac_year2)
[1] TRUE
levels(ac_year2)
[1] "Freshman" "Graduate" "Junior"   "Senior"   "Sophmore"
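The integer codes behind a factor can be inspected directly, as in this small sketch with made-up values. It also shows a common pitfall when factor labels look like numbers.

```r
yr <- factor(c("Junior", "Freshman", "Senior", "Freshman"))

# Levels are sorted alphabetically by default
print(levels(yr))

# as.integer() exposes the code stored for each record
codes <- as.integer(yr)
print(codes)

# table() counts the records in each level
print(table(yr))

# Pitfall: for number-like labels, as.integer() returns the codes,
# so convert to character first to recover the values
f <- factor(c("10", "20", "10"))
print(as.integer(f))
vals <- as.numeric(as.character(f))
print(vals)
```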
It is also possible to specify an order for the factor levels to produce an ordered factor. When I printed the levels above, they printed in alphabetical order. However, it would make more sense to specify the order based on academic progression. Whenever you create a factor, you can specify an order using order=TRUE (R matches this to the ordered argument of factor()) and then list the levels in the desired order. Checking the levels, you can see that they are now in the desired order.
ac_year3 <- factor(ac_year, order=TRUE, levels=c("Freshman","Sophmore","Junior","Senior", "Graduate"))
levels(ac_year3)
[1] "Freshman" "Sophmore" "Junior"   "Senior"   "Graduate"
If you subset your data, you may need to remove levels that are no longer present in the data set. This can be accomplished using the droplevels() function. By default, this function will remove any levels not used in the data subset. Here, I have subsetted out only “Senior” and “Graduate.” However, after printing the levels, you can see that all levels are still defined. Using droplevels() fixes this issue. The result can be checked using the levels() function.
ac_year4 <- ac_year3[ac_year == "Senior" | ac_year == "Graduate"]
levels(ac_year4)
[1] "Freshman" "Sophmore" "Junior"   "Senior"   "Graduate"
ac_year5 <- droplevels(ac_year4)
levels(ac_year5)
[1] "Senior"   "Graduate"
I tend to use factors a lot, especially when I am working with nominal, ordinal, or categorical data.
Read in Data
In all the examples provided in this section so far, I have generated data to experiment with. However, this would be impractical or impossible for a large data set. More commonly, you will read data into R as opposed to create it from scratch. Here, I will show you how to read in tables.
Tables can be read in using the read.table() or read.csv() functions. read.csv() is specifically used to read comma-separated values files (.csv). To read in data you will need to either set a working directory where the data are housed using setwd() or provide the entire file path. I would recommend setting a working directory. Note that R uses the forward slash in folder paths as opposed to the backslash used by the Windows operating system. So, you will have to switch these around in your code if you copy and paste from Windows File Explorer. Alternatively, you can double up the backslashes (i.e., use an escape character).
In the example, I am reading in a file called matts_movies.csv from my working directory. I am specifying that the separator is commas, which is used by default in CSV files as the name implies, and that there is a header, so the first row will be treated as column names as opposed to data.
#Will need to set your own directory!
setwd("D:/mydata/r_language_p1")
#This also would work: setwd("D:\\ossa\\r_language_p1")
movies <- read.csv("matts_movies.csv", sep=",", header=TRUE, stringsAsFactors=TRUE)
Once data are read in, it is generally a good idea to explore or inspect them to make sure there are no issues and that they read in as anticipated. The head() function will print the first six records in the table while the tail() function will print the last six. You can specify an additional n argument if you want a different number than the default six records. The str() function will provide information about the structure of the data, including the data type for each column. If the data type is incorrectly defined, you can use the appropriate as.() function to make conversions. Note that these data were read in as a data frame without directly stating this since there are multiple columns of different data types. When reading in data tables that contain character or string data, it is important to consider whether you want the data to be represented as a character or a factor. In versions of R prior to 4.0, the default was to convert all string data to factors. However, the default in 4.0 or higher is to maintain them as characters. The read.csv() function has an optional stringsAsFactors argument that can be used to change this behavior. Alternatively, you can use factor() or as.factor() to convert specific columns. In the example above, I used the stringsAsFactors argument to read in all characters as factors.
tail(movies)
                              Movie.Name           Director Release.Year
1847 The Nutty Proffessor II: The Klumps        Peter Segal         2000
1848                        Dreamcatcher    Lawrence Kasdan         2003
1849                              Jumper         Doug Liman         2008
1850                       Baby Geniuses          Bob Clark         1999
1851                         The Postman      Kevin Costner         1997
1852                  The Last Airbender M. Night Shyamalan         2010
     My.Rating  Genre Own
1847      1.76 Comedy  No
1848      1.65 Horror  No
1849      1.22 Action  No
1850      1.01 Family  No
1851      0.88  Drama  No
1852      0.67 Action  No
str(movies)
'data.frame': 1852 obs. of 6 variables:
$ Movie.Name : Factor w/ 1852 levels " Mortified Nation",..: 98 1622 594 428 320 122 1160 907 992 1186 ...
$ Director : Factor w/ 801 levels "Aaron Katz","Aaron Schneider",..: 103 237 284 632 30 794 787 130 378 166 ...
$ Release.Year: int 2000 1994 1993 2001 2006 1977 1998 2000 2007 1995 ...
$ My.Rating : num 9.99 9.98 9.96 9.95 9.94 9.93 9.92 9.91 9.9 9.88 ...
$ Genre : Factor w/ 18 levels "Action","Classic",..: 6 6 4 13 13 4 11 16 16 16 ...
$ Own : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
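If you do not have the course data at hand, the effect of stringsAsFactors can still be tried with a self-contained sketch. The three-row table below is placeholder data written to a temporary file; it is not the matts_movies.csv file.

```r
# Write a tiny CSV to a temporary file so the example needs no external data
tmp <- tempfile(fileext = ".csv")
writeLines(c("Name,Year,Genre",
             "Movie A,2000,Comedy",
             "Movie B,2003,Horror",
             "Movie C,2008,Comedy"), tmp)

# Read the same file twice, keeping strings as characters or as factors
as_chr <- read.csv(tmp, header = TRUE, stringsAsFactors = FALSE)
as_fct <- read.csv(tmp, header = TRUE, stringsAsFactors = TRUE)
print(class(as_chr$Genre))
print(class(as_fct$Genre))

# head() with n limits the number of records shown
print(head(as_chr, n = 2))
str(as_fct)

# A single column can be converted after reading
as_chr$Genre <- as.factor(as_chr$Genre)
print(levels(as_chr$Genre))

unlink(tmp)  # remove the temporary file
```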
The names() function can be used to print the column names of a table or store them in a vector. You can also change the names by assigning a vector of new names. If you would like to change only a subset of names, you can provide an index or indices in square brackets.
names(movies)
[1] "Movie.Name"   "Director"     "Release.Year" "My.Rating"    "Genre"
[6] "Own"
names(movies) <- c("Name", "Director", "Release Year", "Rating", "Genre", "Own")
names(movies)
[1] "Name"         "Director"     "Release Year" "Rating"       "Genre"
[6] "Own"
names(movies)[4] <- "Category"
names(movies)
[1] "Name"         "Director"     "Release Year" "Category"     "Genre"
[6] "Own"
Microsoft Excel spreadsheets can be read in using the read.xlsx() function from the xlsx package. You need to load the xlsx package before you can use this function. This can be accomplished using the library() or require() functions. Generally, to load packages it is preferred to use library() unless you are calling from inside a function. In that case, it is best to use require(), which returns FALSE with a warning rather than stopping with an error if the package is not available. Remember that you must install packages before they can be used.
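The difference between library() and require() when a package is missing can be seen in the sketch below. The package name "notARealPackage123" is made up on purpose so that loading fails; "stats" is used for the success case because it ships with R.

```r
# A made-up package name, used only to trigger the failure case
pkg <- "notARealPackage123"

# require() returns FALSE (with a warning) when the package is missing
ok <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
print(ok)

# library() stops with an error instead, caught here with tryCatch()
msg <- tryCatch({
  library(pkg, character.only = TRUE)
  "loaded"
}, error = function(e) "library() raised an error")
print(msg)

# For an installed package, require() returns TRUE
has_stats <- require("stats", character.only = TRUE)
print(has_stats)
```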
Note that there are additional functions and packages available to call in other data types including XML, SPSS, SAS, Stata, NetCDF, HDF5, and database files. The data.table and readr packages are useful when working with large data sets and tables.
Write Files Out
There are also functions available to write results out to permanent files on disk. For example, write.csv() or write.table() can be used to save results as CSV or text files.
The foreign package provides the write.dbf() function for saving to .dbf format. The xlsx package provides write.xlsx() for saving results to Excel spreadsheet format.
If a folder path is not specified, the result will be written to the working directory. If you do not want to save the results to the working directory, you must specify the entire desired file path.
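A minimal sketch of writing a table out and reading it back is shown here. The data frame results is invented for the example, and tempdir() stands in for whatever output folder you would actually use.

```r
results <- data.frame(id = 1:3, score = c(9.5, 7.2, 8.8))

# file.path() joins folder and file names with forward slashes
out_file <- file.path(tempdir(), "results.csv")

# row.names = FALSE leaves out the extra column of row numbers
write.csv(results, out_file, row.names = FALSE)
print(file.exists(out_file))

# Read the file back in to confirm what was saved
check <- read.csv(out_file)
print(check)

# A bare file name such as "results.csv" would be written here instead
print(getwd())
```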
In later sections, we will explore means to export raster and vector graphics to save generated figures.
Attach and Detach
The attach() and detach() functions are commonly used in R to alleviate the need to specifically state the data frame in your code. In the example below, I am creating a new data frame. Once I use attach(), I no longer need to use the data frame name in the code. The called column is assumed to be from the attached data frame. To end attach(), use the detach() function. You should always call detach() once you are done using attach().
My personal preference is to avoid using these functions. However, feel free to make use of them if you find them to be valuable.
course_prefix <- c("Geog", "Geog", "Geol", "Geol", "Geog")
course_num <- c(107, 350, 101, 104, 455)
course_name <- c("Physical Geography", "GIScience", "Planet Earth", "Earth Through Time", "Remote Sensing")
enrollment <- c(210, 45, 235, 80, 35)
course_data <- data.frame(course_prefix, course_num, course_name, enrollment)
attach(course_data)
upper_level <- course_num > 200
print(upper_level)
[1] FALSE  TRUE FALSE FALSE  TRUE
detach(course_data)
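If you would rather not use attach(), the same result can be obtained by naming the data frame explicitly or by using with(), which looks up columns in the data frame for a single expression only. This is a sketch of one possible alternative, using a reduced copy of the course table.

```r
course_data <- data.frame(course_num = c(107, 350, 101, 104, 455),
                          enrollment = c(210, 45, 235, 80, 35))

# Option 1: reference the column explicitly with $
upper_a <- course_data$course_num > 200

# Option 2: with() evaluates the expression using the data frame's columns
upper_b <- with(course_data, course_num > 200)

print(upper_a)
print(identical(upper_a, upper_b))
```

Neither option leaves anything attached, so there is nothing to detach afterwards.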
Some Useful Functions
Here is a list of functions that I have found to be very useful for general purposes.
- ncol(): return number of columns in a data frame or matrix
- nrow(): return number of rows in a data frame or matrix
- length(): return number of data points in a data frame column or vector
- rbind(): merge rows from multiple data objects with the same number of columns
- cbind(): merge columns from multiple data objects with the same number of rows
- merge(): merge two data frames based on common row or column names
- getwd(): returns the working directory path as a string
- table(): creates a contingency table of counts of each combination of factor levels
- seq(): generates a sequence of values with a specified increment or length
- rep(): replicates data elements a defined number of times
- rnorm(): creates a specified number of random values based on a normal distribution
- sample(): selects a specified number of random samples from data with or without replacement
For a list of common R functions take a look at this reference card.
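Several of the functions listed above can be tried together in a short sketch. All object names and values here are made up for illustration.

```r
a <- data.frame(id = 1:3, x = c(10, 20, 30))
b <- data.frame(id = 4:5, x = c(40, 50))

# rbind() stacks rows from objects with the same columns
stacked <- rbind(a, b)
print(c(nrow(stacked), ncol(stacked)))

# merge() joins two data frames on a shared column;
# only ids found in both tables are kept by default
lookup <- data.frame(id = c(1, 2, 4), name = c("A", "B", "D"))
joined <- merge(stacked, lookup, by = "id")
print(joined)

# seq() by increment or by length, and rep() to repeat values
s1 <- seq(from = 0, to = 1, by = 0.25)
s2 <- seq(from = 0, to = 1, length.out = 3)
r1 <- rep(c("a", "b"), times = 2)
print(s1); print(s2); print(r1)

# table() counts occurrences of each value
print(table(c("x", "y", "x")))

# rnorm() and sample() draw random values; set.seed() makes them repeatable
set.seed(1)
print(rnorm(5))
print(sample(1:10, 3))
```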
Quitting R
The q() function can be used to end your R session. By default, an interactive session will ask whether you want to save your workspace before quitting. You can also use the save methods available in the File menu in RStudio.
Concluding Remarks
That’s it! It might seem to you that you haven’t learned much R yet. However, data types and structures are a large component of working in this environment. So, this is an accomplishment. You will get practice working with many of the techniques discussed here through the remainder of the course.
Comments
Comments are meant to make your code more interpretable. They are meant for humans as opposed to computers. Commented lines will not be executed. I highly recommend commenting your code, as you may forget how or why you did something or someone else may want to use or manipulate your code. You can also comment out lines that you don’t want to execute temporarily, perhaps during the debugging process.
Different programming languages define comments differently. R uses #. Any line beginning with # will not be executed. The code block below provides an example of commenting.
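As a small sketch of commenting in practice, the block below uses made-up enrollment values. Note that a comment can take up a whole line or follow code on the same line.

```r
# Compute the mean enrollment across courses
enrollment <- c(210, 45, 235, 80, 35)
mean_enroll <- mean(enrollment)  # everything after # on a line is ignored

# The next line is commented out, so it will not run
# print("This is skipped")

print(mean_enroll)
```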