The R Language

Objectives

  1. Create and work with variables
  2. Understand the data types available in R and their uses
  3. Understand the data models available in R and their uses
  4. Read in your own data
  5. Generally start learning the R language

Looks like you’ve decided to learn some R. I think you will find that R is very useful and powerful for a variety of data and statistical analysis tasks. Before we can get into the specifics of analyzing spatial data, you need to have a decent understanding of data types and data models in R. This is the goal of this section. I have found that many issues or errors can arise because the data are not properly formatted or defined. Understanding the data structures will help you troubleshoot such issues.

Download R and RStudio

Here are the links to download R and RStudio. Please watch the introductory video that explains how to install the R software and RStudio integrated development environment (IDE).

Install Packages

In RStudio, packages can be installed using Install Packages in the Tools menu. Or, you can use the install.packages() function. Some packages require other packages to function; these are known as dependencies. When you install a package, its dependencies should be installed concurrently. If this doesn’t happen for some reason, you will need to install the dependencies manually.

Since R, RStudio, and specific packages are updated on a regular basis, you must use compatible versions. I have generally found that RStudio can adequately deal with versioning issues. You can check for package updates using Check for Package Updates in the Tools menu.

Getting Help

Documentation for R packages is made available on The Comprehensive R Archive Network (CRAN). Some authors also make use of GitHub. A large part of scripting is figuring out how to effectively adapt other authors’ code and search for help online. Google can be an analyst’s best friend. I recommend creating an account on Stack Exchange and Stack Overflow. I also like the R-bloggers site. If you are able to pay for training, I like Udemy and DataCamp.

Within R, you can get help using the help() function or ? as demonstrated below for the c() or combine function.
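As a minimal sketch of the two help calls mentioned above (the vector v at the end is only an illustrative assumption to show c() in use), something like the following should work in a standard R installation:

```r
# Open the documentation page for c(), the combine function
help(c)

# The question mark is shorthand for the same request
?c

# c() itself combines individual values into a single vector
v <- c(1, 2, 3)
print(v)
```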

Useful Resources

Here is a list of books that I have found useful:

  • R in Action: Data Analysis and Graphics in R by Kabacoff
  • R for Data Science: Import, Tidy, Transform, Visualize, and Model Data by Wickham and Grolemund
  • R Graphics Cookbook by Chang
  • Deep Learning with R by Chollet and Allaire
  • An Introduction to R for Spatial Analysis & Mapping by Brunsdon and Comber
  • Spatial Statistics & Geostatistics by Chun and Griffith
  • An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani (Download free PDF Here)

Here is a list of websites that I have found useful:

Define a Variable

Similar to other object-based programming and scripting languages, you will primarily work with your data in R through variables. The code below shows how to define a variable. Note that <- is commonly used for variable assignment in R (although you can also use =). Numbers do not need to be wrapped in quotes. In fact, if you do use quotes the numbers will be treated as characters or strings. Any strings or characters must be placed in quotes. This is because R cannot differentiate between variable names and other text without this distinction. TRUE and FALSE are available as logical values and do not make use of quotes. The print() function is used to print or write something to the console. Here, we are just printing the content of each variable. Note that you can pick your own variable names; however, there are some stipulations. For example, you cannot use reserved words (those that have a special use in R), and variable names cannot start with numbers. The only special characters that can be used in variable names are dots/periods and underscores. I like to keep variable names short so that they are easy to work with and call. Note that a variable is overwritten if its name is assigned more than once. So, all of your variables must have unique names if you want to maintain them throughout the script.
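The points above can be sketched as follows. The variable names and values here are illustrative assumptions, not the original course example:

```r
# Assign a number with the <- operator (no quotes needed)
a <- 5

# Assign a character string (quotes are required)
b <- "GIS"

# A number wrapped in quotes is stored as a character, not a number
c1 <- "5"

# Logical values are written in capitals and without quotes
d <- TRUE

# Print the content of each variable to the console
print(a)
print(b)
print(c1)
print(d)

# Assigning to an existing name overwrites the earlier value
a <- 10
print(a)
```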

Data Types

A variety of data types are available in R. Here are some explanations:

  • Numeric (Including double and integer): stores numbers and treats the data as numbers (i.e., you can perform mathematical operations on them).
  • Characters: stores text or numbers treated as text. In many programming languages this would be referred to as a string.
  • Logical: Either TRUE or FALSE. Note capitalization and no quotes.
  • Dates: stores calendar dates and allows you to perform analyses on a time series.

Calling the typeof() or class() function on a variable will print the data type. Here, you can see that the variable x is initially defined as a double type. Using as.integer(), we can convert it to an integer data type. It is also possible to convert numeric data to a character or a string using as.character(). Characters that represent numbers can be converted back to numeric using as.numeric(). Characters can be converted to dates using as.Date(); however, you will need to provide some additional info on how dates are formatted. Later in this section I will introduce the factor data type. Note that there are many as.something() methods that allow you to change data types or data models. There are also is.something() methods that allow you to check data types. These functions will return TRUE if the data are of that type and FALSE if they are not.
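A short sketch of these checks and conversions is given below. The value 4.5 and the date string are assumptions chosen for illustration:

```r
# A number with a decimal is stored as a double by default
x <- 4.5
typeof(x)   # "double"
class(x)    # "numeric"

# Convert to an integer; the fractional part is dropped
x_int <- as.integer(x)
typeof(x_int)   # "integer"

# Convert numeric to character and back again
x_chr <- as.character(x)
x_num <- as.numeric(x_chr)

# Convert a character to a date; the format string describes the layout
d <- as.Date("2020-01-15", format = "%Y-%m-%d")
class(d)    # "Date"

# is.something() functions return TRUE or FALSE
is.numeric(x)     # TRUE
is.character(x)   # FALSE
```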

Vectors

We will now move on to discuss data models. These are the data structures that R uses to store your data. First, we will discuss vectors. A vector is a 1-dimensional array. Instead of storing a single piece of information, you can provide a set of values or characters. Note that all data components must be of the same type (for example, numeric or character); you cannot mix data types in a vector. If you try, R will convert the values to a common type without warning (for example, numbers combined with text all become characters). In the example I have created objects to store numeric data (x), string data (y), and logicals (z). Vectors that store only a single piece of data or a constant are called scalars; however, they are treated the same as vectors.

The c(), or combine, function is used to combine pieces of data into a single vector. It is one of the most commonly used functions in R, so you’ll get used to seeing it and using it.

To extract specific pieces of data, you can use square bracket notation. R starts indexing from 1 as opposed to 0, which is more common in other programming languages. To call a single data element, just call the index in the square brackets. You can also call a range of contiguous values by calling the index range using a colon. When doing so, the data points at the start and end index will be included in the subset along with all data points between them. If you want to call discontinuous data points, you can use the c() function and provide a list of indices.
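Creating and indexing vectors might look like the sketch below. The names x, y, and z follow the text, while the values are assumptions for illustration:

```r
# Build vectors with c(): numeric, character, and logical
x <- c(10, 20, 30, 40, 50)
y <- c("a", "b", "c", "d", "e")
z <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

# A single element; indexing starts at 1
x[2]            # 20

# A contiguous range; both end points are included
x[2:4]          # 20 30 40

# Discontinuous elements, using c() to list the indices
x[c(1, 3, 5)]   # 10 30 50
```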

Matrices

A matrix allows you to create a 2-dimensional array (that is, values stored in rows and columns). In GIS and remote sensing, this is similar to a single-band raster grid where each cell is defined by a row and column combination. Note that all the cells in a matrix must have the same data type (for example, a matrix of numeric values). A matrix is generated using the matrix() function. You can provide a set of values, the number of rows, and the number of columns. The byrow argument is used to determine how to populate the matrix with the provided data. If set to TRUE, the numbers will fill across the rows sequentially; that is, all columns in a row will be filled before moving on to the next row. FALSE means that columns will be filled sequentially. You can also provide column and row names using the dimnames argument.

Since you are now working in 2-dimensions, you will need to define two indices to extract a specific row/column or cell location from the matrix. The first value represents the row while the second value represents the column. So, [2,1] would indicate the value at row 2 and column 1. A blank in either position would mean to select all rows or all columns. So, selecting data is similar for matrices and vectors except that you will need to specify a different number of indices. Note that it is also possible to subset based on the row or column names as opposed to indices.
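The following sketch illustrates building and subsetting a matrix. The values and the row and column names are assumptions made for the example:

```r
# Fill a 2 x 3 matrix across the rows, with row and column names
m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))
print(m)   # row r1 holds 1 2 3; row r2 holds 4 5 6

# The same values filled down the columns instead
m_col <- matrix(1:6, nrow = 2, ncol = 3, byrow = FALSE)

# Row 2, column 1
m[2, 1]          # 4

# A blank index selects everything along that dimension
m[1, ]           # all columns of row 1
m[, 3]           # all rows of column 3

# Subset by name instead of position
m["r2", "c3"]    # 6
```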

Arrays

What if you need to expand to more than two dimensions? Enter the array. A matrix is just a 2-dimensional array. However, you can expand into more than two dimensions. For example, a 3-dimensional array would be similar to a multi-band image where the first dimension represents rows, the second represents columns, and the third represents the image bands. You could also think of a 3-dimensional array as a cube where each cell in the cube is a smaller cube defined by its position in the 3-dimensional space (each such cell is often referred to as a voxel). A 4-dimensional array could be used to add a time component to the data, which would be difficult to visualize since we only have three spatial dimensions to work with. Similar to matrices, all values stored in an array must be of the same type (for example, a numeric array). If you work in Python, matrices and arrays in R are comparable to NumPy arrays.

Since you now have more dimensions, you will need to provide more indices to extract specific values or ranges of values. The first argument will specify the indices for the first dimension (rows), the second will specify the second dimension (columns), and the third would be the third dimension (for example, image bands). Also similar to a matrix, you can define dimension names and use them to subset the data.
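As a rough sketch, a small 3-dimensional array could be built and indexed as shown here. The dimensions, values, and names are assumptions that loosely mimic a 2 x 3 image with four bands:

```r
# 2 rows x 3 columns x 4 "bands", filled with the values 1 to 24
a <- array(1:24, dim = c(2, 3, 4),
           dimnames = list(c("r1", "r2"),
                           c("c1", "c2", "c3"),
                           c("b1", "b2", "b3", "b4")))
dim(a)   # 2 3 4

# One value: row 1, column 2, band 3
a[1, 2, 3]   # 15

# All rows and columns of the first band (a 2 x 3 matrix)
band1 <- a[, , 1]

# Subset using the dimension names
a["r2", "c3", "b4"]   # 24
```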

Data Frames

Both matrices and arrays can only store data of the same type. That is, you cannot create columns with different data types. So, there is a need for yet another data model. A data frame is similar to a matrix; however, each column can hold a different type of data. A data frame is very similar to a Microsoft Excel spreadsheet or a pandas data frame in Python. I have found data frames to be the data structure that I use most often in R. They are generally considered the workhorse of R data models.

In the provided example, I am creating a data frame to store information about courses. First, I generate vectors to store each column of data. Note that each column must have the same length or the same number of data points if you want to combine them into a data frame. Here, I am generating a mix of numeric and character vectors. Using the data.frame() function, I then combine the vectors into a data frame. Once it is printed, you can see that each column took the name of the vector variable.

Extracting elements from a data frame is identical to extracting elements from a matrix since they are also 2-dimensional. You must specify indices for both the rows and the columns. You can also use the column names or row names. Column names will automatically be generated when a data frame is created.
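A minimal sketch of a course data frame is shown below. The column names and values are hypothetical stand-ins, since the original example data are not reproduced here:

```r
# Vectors of equal length, one per column (hypothetical course data)
course_num  <- c(101, 201, 301)
course_name <- c("Intro", "Methods", "Advanced")
credits     <- c(3, 4, 3)

# Combine into a data frame; columns take the vector names
courses <- data.frame(course_num, course_name, credits)
print(courses)

# Row 2, column 3
courses[2, 3]              # 4

# A whole column by name, two equivalent ways
courses[, "course_name"]
courses$credits

# A whole row
courses[1, ]
```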

In this course, we will primarily make use of vectors and data frames to work with our data, so you will get very used to working with these data structures.

Factors

What if you would like to create character data in which only certain values or levels are allowed? This is the use of the factor data type; it is similar to the character data type but with defined levels or values.

In the example, I am generating a random list of 1,500 records of different academic years. I am then defining it as a factor using the factor() function. To check to make sure the data were successfully defined as a factor, I then use is.factor() (again, there are a lot of is.() and as.() functions). This returns TRUE, so we know that the data are now stored as factors. Using the levels() function, we can obtain a list of the available levels, in this case the academic years.

One component of factors that is a bit confusing is that each unique category will be assigned a placeholder integer value. So, the data will actually be stored as integers, and each integer will be associated with a specific category.
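The factor workflow described above can be sketched like this. The level names and the random seed are assumptions for illustration:

```r
# Draw 1,500 random academic years (seed fixed so the draw is repeatable)
set.seed(42)
years <- c("Freshman", "Sophomore", "Junior", "Senior", "Graduate")
standing <- sample(years, 1500, replace = TRUE)

# Convert the character vector to a factor
standing_f <- factor(standing)

# Confirm the conversion and list the levels (alphabetical by default)
is.factor(standing_f)   # TRUE
levels(standing_f)

# Under the hood each record is stored as an integer code
typeof(standing_f)      # "integer"
head(as.integer(standing_f))
```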

It is also possible to specify an order for the factor levels to produce an ordered factor. When we printed the levels above, they printed in alphabetical order. However, it would make more sense to specify the order based on the academic progression. Whenever you create a factor, you can specify an order using ordered = TRUE and then list the levels in the desired order with the levels argument. Checking the levels, you can see that they are now in the desired order.
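An ordered factor could be defined along these lines. The short input vector is an assumption used to keep the example small:

```r
# A few records of academic standing (illustrative values)
standing <- c("Junior", "Freshman", "Graduate", "Senior", "Sophomore", "Freshman")

# List the levels in academic progression and mark the factor as ordered
standing_o <- factor(standing,
                     levels = c("Freshman", "Sophomore", "Junior", "Senior", "Graduate"),
                     ordered = TRUE)

# Levels now follow the order given rather than the alphabet
levels(standing_o)
is.ordered(standing_o)   # TRUE

# Ordered factors support comparisons: Junior ranks above Freshman
standing_o[1] > standing_o[2]   # TRUE
```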

If you subset your data, you may need to remove levels that are no longer present in the data set. This can be accomplished using the droplevels() function. By default, this function will remove any levels not used in the data subset. Here, I have extracted only “Senior” and “Graduate.” However, after printing the levels, you can see that all levels are still defined. Using droplevels() can fix this issue. The result can be checked using the levels() function.
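Here is a small sketch of that behavior, using a short illustrative factor rather than the original data:

```r
# A factor with five levels (illustrative values)
standing_f <- factor(c("Freshman", "Senior", "Graduate", "Junior", "Senior", "Sophomore"))

# Keep only the Senior and Graduate records
sub <- standing_f[standing_f %in% c("Senior", "Graduate")]

# All five levels are still defined after subsetting
levels_before <- levels(sub)
print(levels_before)

# Remove the levels that no longer occur in the subset
sub <- droplevels(sub)
levels(sub)   # "Graduate" "Senior"
```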

I tend to use factors a lot, especially when I am working with nominal, ordinal, or categorical data.

Read in Data

In all the examples provided in this section, I have generated data to experiment with. However, this would be impractical or impossible for a large data set. More commonly, you will read data into R as opposed to create it from scratch. In later sections, we will explore reading in vector and raster spatial data. Here, I will show you how to read in tables.

Tables can be read in using the read.table() or read.csv() functions. read.csv() is specifically used to read comma-separated values files (.csv). To read in data you will need to either set a working directory where the data are housed using setwd() or provide the entire file path. I would recommend setting a working directory. Note that R uses the forward slash in folder paths as opposed to the backslash, as is used by the Windows operating system. So, you will have to switch these around in your code if you copy and paste from Windows File Explorer.

In the example, I am reading in a file called matts_movies.csv from my working directory. I am specifying that the separator is a comma, which is used by default in CSV files as the name implies, and that there is a header, so the first row will be treated as column names as opposed to data.

Once data are read in, it is generally a good idea to explore or inspect them to make sure there are no issues and that they read in as anticipated. The head() function will print the first six records in the table while the tail() function will print the last six. You can specify an additional n argument if you want a different number than the default six records. The str() function will provide information about the structure of the data, including the data type for each column. If the data type is incorrectly defined, you can use the appropriate as.() function to make conversions. Note that these data were read in as a data frame without directly stating this, since read.csv() and read.table() always return a data frame.

The names() function can be used to print the column names of a table or store them in a vector. You can also change the names by assigning a vector of new names. If you would like to change only a subset of names, you can provide an index or indices in square brackets.
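The reading and inspection steps can be sketched as below. Because matts_movies.csv is not available here, this example first writes a small hypothetical CSV to a temporary folder; the file name, columns, and values are all assumptions:

```r
# Write a tiny stand-in CSV to a temporary location (hypothetical data)
csv_path <- file.path(tempdir(), "movies_example.csv")
writeLines(c("title,year,rating",
             "Movie A,1999,8.1",
             "Movie B,2005,6.4",
             "Movie C,2012,7.3"),
           csv_path)

# Read the table: comma separator, first row holds the column names
movies <- read.csv(csv_path, sep = ",", header = TRUE)

# Inspect the first and last records and the column types
head(movies, n = 2)
tail(movies, n = 1)
str(movies)

# Print the column names, then rename only the second one
names(movies)
names(movies)[2] <- "release_year"
names(movies)
```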

Microsoft Excel spreadsheets can be read in using the read.xlsx() function from the xlsx package. You need to load the xlsx package before you can use this function. This can be accomplished using the library() or require() functions. Generally, to load packages it is preferred to use library() unless you are calling from inside a function. In that case, it is best to use require(). Remember that you must install packages before they can be used.

Note that there are additional functions and packages available to call in other data types including XML, SPSS, SAS, Stata, NetCDF, HDF5, and database files. We will discuss reading in vector and raster spatial data in a later module.

Write Files Out

There are also functions available to write results out to permanent files on disk. For example, write.csv() or write.table() can be used to save results as text or CSV files.

The foreign package provides the write.dbf() function for saving to .dbf format. The xlsx package provides write.xlsx() for saving results to Excel spreadsheet format.

If a folder path is not specified, the result will be written to the working directory. If you do not want to save the results to the working directory, you must specify the entire desired file path.
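Writing a table with a full file path might look like the sketch below. The data frame and the output file name are assumptions, and a temporary folder is used so that nothing is written to your working directory:

```r
# A small table of results to save (illustrative values)
results <- data.frame(id = 1:3, score = c(0.5, 0.7, 0.9))

# Build a full path so the file does not land in the working directory
out_path <- file.path(tempdir(), "results_example.csv")

# Save as CSV without the automatic row-number column
write.csv(results, out_path, row.names = FALSE)

# Confirm the file exists and read it back to check the content
file.exists(out_path)
check <- read.csv(out_path)
print(check)
```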

In later sections, we will explore means to export raster and vector graphics to save generated figures. We will also investigate means to write results as vector and raster geospatial data.

Attach and Detach

The attach() and detach() functions are commonly used in R to alleviate the need to specifically state the data frame in your code. In the example below, I am creating a new data frame. Once you use attach(), you no longer need to use the data frame name in the code. The called column is assumed to be from the attached data frame. To end attach(), use the detach() function. You should always call detach() once you are done using attach().
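A brief sketch of attach() and detach() follows. The data frame name plots and its columns are hypothetical:

```r
# A small data frame (hypothetical measurements)
plots <- data.frame(height = c(150, 160, 170),
                    weight = c(50, 60, 70))

# Without attach(), the data frame must be named explicitly
mean(plots$height)

# After attach(), columns can be called by name alone
attach(plots)
mean_h <- mean(height)
print(mean_h)

# Always detach when finished
detach(plots)
```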

My personal preference is to avoid using these methods. However, feel free to make use of them if you find them to be valuable.

Some Useful Functions

Here is a list of functions that I have found to be very useful for general purposes.

  • ncol(): return number of columns in a data frame or matrix
  • nrow(): return number of rows in a data frame or matrix
  • length(): return number of data points in a data frame column or vector
  • rbind(): merge rows from multiple data objects with the same number of columns
  • cbind(): merge columns from multiple data objects with the same number of rows
  • merge(): merge two data frames based on common row or column names
  • getwd(): returns the working directory path
  • table(): creates a contingency table of counts of each combination of factor levels
  • seq(): generates a sequence of values with a specified increment or length
  • rep(): replicates data elements a defined number of times
  • rnorm(): creates a specified number of random values based on a normal distribution
  • sample(): selects a specified number of random samples from data with or without replacement

For a list of common R functions take a look at this reference card.
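Several of the functions listed above are exercised in the sketch below. All object names and values are assumptions for illustration:

```r
# Two small data frames with the same columns
df1 <- data.frame(id = 1:3, value = c(2.5, 3.5, 4.5))
df2 <- data.frame(id = 4:5, value = c(5.5, 6.5))

# Dimensions and length
nrow(df1)            # 3
ncol(df1)            # 2
length(df1$value)    # 3

# Stack rows, then add a column
stacked <- rbind(df1, df2)
wide <- cbind(df1, flag = c(TRUE, FALSE, TRUE))

# Join two data frames on a shared column
lookup <- data.frame(id = 1:3, label = c("a", "b", "c"))
joined <- merge(df1, lookup, by = "id")

# Sequences and repetition
s <- seq(0, 10, by = 2)      # 0 2 4 6 8 10
r <- rep("x", times = 3)     # "x" "x" "x"

# Random values and random sampling (seed fixed for repeatability)
set.seed(1)
n <- rnorm(5)
picked <- sample(1:10, 3, replace = FALSE)

# Counts per category, and the current working directory
counts <- table(c("a", "b", "a"))
print(counts)
getwd()
```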

Comments

Comments are meant to make your code more interpretable. They are meant for humans as opposed to computers. Commented lines will not be executed. I highly recommend commenting your code, as you may forget how or why you did something or someone else may want to use or manipulate your code. You can also comment out lines that you don’t want to execute temporarily, perhaps during the debugging process.

Different programming languages define comments differently. R uses #. Any line beginning with # will not be executed. The code block below provides an example of commenting.
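Commenting could look like the following sketch. The variable names are illustrative assumptions:

```r
# This whole line is a comment and is not executed
radius <- 2             # text after a # on a code line is also ignored

# Compute the area of a circle with the radius defined above
area <- pi * radius^2

# The next line is commented out, so it does not run
# print(radius)

print(area)
```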

Quitting R

The q() function can be used to end your R session and save your work. You can also use the save methods available in the File menu in RStudio.

That’s it! It might seem to you that you haven’t learned much R yet. However, data types and structures are a large component of working in this environment. So, this is an accomplishment. You will get practice working with many of the techniques discussed here through the remainder of the course.
