# ggplot2 Part I

## ggplot2 Part I

## Objectives

- Learn to apply
**aesthetic mappings**in**ggplot2** - Produce a variety of
**univariate**graphs including**histograms**,**kernel density plots**,**box plots**, and**violin plots** - Compare groups in graph space
- Create
**multivariate**graphs including**scatter plots**and**line graphs** - Explore tecniques to combat
**overplotting**

## Overview

Multiple packages are available to produce graphs in R. Graphing functionality is even made available through base R. We will make use of some base R graphics in this course, but we will not focus on this topic. The **grids** package provides a low-level graphics system that other packages have used to expand the graphing capabilities of R. These packages include **lattice** and **ggplot2**. In this course, we will explore **ggplot2**, which is based on the *Grammar of Graphics* by Leland Wilkinson. This philosophy suggests breaking graphs up into semantic components such as scales and layers. In fact, the package name means “Grammar of Graphics Plots 2”. Practically, this translates to mapping constants, such as specific colors, sizes, or symbols, or variables, such as your data values, to graphical parameters, such as position along the *x*-axis or *y*-axis, size or shape of point symbols, width of lines, and color of point symbols or lines. This is the concept of **aesthetic mappings**.

**ggplot2** is my preferred method for making graphs. I have found it to be very intuitive, powerful, and adaptable to a wide variety of needs. Once a graph is produced, you can easily export it as a vector graphic for editing outside of R. In this first section on **ggplot2** we will focus on the basics of the package and how to define **aesthetic mappings**. We will also explore a variety of different types of graphs. In the second section, we will explore methods to improve and refine graphs.

Before working through these examples, you will need to install the **ggplot2** package. Also, you will need to load in some example data. First, we will work with the *high_plains_data.csv* file. The elevation (“elev”), temperature (“temp”), and precipitation (“precip2”) data were extacted from **raster** grids provided by the PRISM Climate Group at Oregon State University. The elevation data are provided in feet, the temperature data represent 30-year annual normal mean temperature in Celsius, and the precipitation data represent 30-year annual normal precipitation in inches. The original raster grids have a resolution of 4 by 4 kilometers and can be obtained here. I also summarized percent forest by county (“per_for”") from the 2011 National Land Cover Database (NLCD). NLCD data can be obtained from the Multi-Resolution Land Characteristics Consortium (MRLC).

```
library(ggplot2)
setwd("ENTER YOUR FILE PATH HERE")
hp_data <- read.csv("high_plains_data.csv", sep=",", header = TRUE)
```

Here is a link to a cheat sheet for **ggplot2**. This sheet can also be accessed in RStudio under Help –> Cheatsheets.

# Univariate Graphs

**Univariate graphs** involve visualizing a single variable and include **histograms**, **kernel density plots**, and **box plots**.

## Histograms

A **histrogram** plots the count or number of data points in each defined data range or bin. The *x*-axis represents the data values and the *y*-axis represents the count. My goal with the first graph is to inspect the distribution of percent forest cover by county. Since this is the first example of a **ggplot2** graph, I will explain the syntax in detail. All **ggplot2** graphs start with the *ggplot()* function. Within this function I specify the **data frame** being referenced and the **aesthetic mappings**. Here, I only need to define the *x* value. Next, I must define the geometry desired. *geom_histogram()* is used to create **histrograms**. The *binwidth* argument specifies the range of values to aggregate into each bin for counting. I used 5, so each bin will contain a range of 5%. Note that mulitple lines of code are stringed together hear using *+*. This is a key component of how **ggplot2** works. Also, note that field names do not need to be placed in quotes.

This graph is the same as the one created above. However, I have changed the *binwidth* to 2. So, each bin is now narrower or aggregates a smaller range of values, and more detail in the distribution is visualized. There is not necessarily a correct *binwidth*, as this is case specific and partially depends on whether you are interested in visualizing a general trend or more local information.

At this point, you might be thinking that these graphs don’t look great. Remember that we are focusing on exploring the basics of **ggplot2** and **aesthetic mappings** in this section. You will learn to make your graphs pretty in the next module.

Here is another example of a **histrogram** representing the distribution of mean annual temperature by county. Take some time to make sure you understand the **aesthetic mappings**.

## Kernel Density Plots

Another means to represent the distribution of a single variable is a **kernel density plot**, in which a kernel density function is used to represent a generalized or smoothed version of the distribution of a variable. The syntax is very similar to that for the **histograms** created above. The *x* variable is assigned to temperature. I have added a *..density..* argument in the *aes()* function to indicate that I want the *y*-axis to show density. I am now using *geom_density()* as opposed to *geom_histogram()*. Lastly, I have provided a *fill* argument to make the density area green.

In the next graph I have changed *..density..* to *..count..* to alter the *y*-axis mapping. I also changed the *fill* to red, the *color* to gray, and the *size* to 3. *color* for a **kernel density plot** will define the color of the outline while *size* will define the thickness of the outline.

In this example, I have set *fill* to *NA* so that there is no fill color applied.

Similar to the *binwidth* argument for a **histogram**, the *adjust* argument can be used to indicate how generalized the kernel density function will be. Smaller values will result in a more detailed distribution that shows more local patterns, as demonstrated in the next graph. In the following graph a larger value is used, which results in a very generalized pattern. Similar to **histrogram** binning, there isn’t necessarily a correct value, as this depends on the purpose of the graph and the patterns being explored.

## Box Plots

Data distribution can also be summarized using **box plots**. A **box plot** shows the **minimum**, **1st quartile**, **median**, **3rd quartile**, **maximum**, and **interquartile range** (**IQR**), as shown in the image below. Note that the **mean** is not generally included.

I am using *geom_boxplot()* to show the distribution of mean elevation by county. Note that any small values that are more than 1.5 **IQR** from the **1st quartile** and any high values that are more than 1.5 **IQR** from the **3rd quartile** are plotted as points. If you would like to not show these points, then you can set *outlier.shape* equal to *NA*, as demonstrated in the second example. However, this may be a misrepresentation of your data as it implies that these more extreme values don’t exist. If you do choose to not plot the **outliers**, I would recommend at least noting that this was done so that viewers are aware.

## Comparing Categories

It is common to compare the distribution of a continuous variable for different categories in the same graph space. In this example, I first subset out the counties for three states using the *filter()* function from **dplyr**. I then generate a **kernel density plot** in which temperature is mapped to the *x*-axis, density to the *y*-axis, and the state name to the *fill* color. The result is three separate density curves. In order to better visualize the distribtions in overlapping areas, I have also applied a 50% transparency using the *alpha* argument. Note that **ggplot2** has automatically generated a legend to explain the different colors. We will talk more about legends and legend design in the next section.

```
library(dplyr)
hp_UT_ND_KS <- hp_data %>% dplyr::filter(STATE_NAME == "North Dakota" | STATE_NAME == "Utah" | STATE_NAME == "Kansas")
ggplot(hp_UT_ND_KS, aes(x=temp, ..density.., fill=STATE_NAME))+
geom_density(alpha=0.5)
```

The next example provies the same comparison but for precipitation.

Similar comparison could be made using **box plots**. In this example, I am comparing the mean elevation by county for all states in the data set. Note that I did not provide a mapping to the *x*-axis in the prior **box plot** examples. Now, I am mapping a categorical variable to this axis.

You can map the same variable to more than one graphical parameter. In this example, I am mapping the state name to the *x*-axis and to the *fill* color. Note that a legend is automatically generated.

This is a similar example, but for precipitation as opposed to elevation.

Similar to **kernal density plots**, the *fill* argument relates to the fill color while the *color* argument applies to the outline color. This is demonstrated in the code block below by changing the *fill* argument to *color*.

It is possible to define a desired order of categorical variables. In this example, I have reordered the states based on the median mean county elevation. This was accomplished using *fct_reorder()* from the *forcats* package.

```
ggplot(hp_data, aes(x=fct_reorder(STATE_NAME, elev, .fun= median, .desc=TRUE), y=elev, fill=STATE_NAME))+
geom_boxplot()
```

An alternative to a **box plot** is a **violin plot**, which is basically a mirrored kernel density surface turned on its side.

Since **box plots** and **violin plots** can share the same axes assignments, they can be plotted in the same space. Here, we are specifying two *geoms*. The last *geom* provided will plot above prior *geoms*.

In order to make the **box plots** fit inside of the **violin plots**, here I have decreased the *width* to 0.1. I have also made the **box plots** gray so that they stand out against the violin plots.

```
ggplot(hp_data, aes(x=STATE_NAME, y= precip2, fill=STATE_NAME))+
geom_violin()+
geom_boxplot(width=0.1, fill="gray")
```

This example is similar to the prior. However, I have made the **violin plots** gray and the **box plot** colors different for each state. Note that mappings set in the *geom* arguments will supersede those in *ggplot()*.

```
ggplot(hp_data, aes(x=STATE_NAME, y= precip2, fill=STATE_NAME))+
geom_violin(fill="gray")+
geom_boxplot(width=0.1)
```

Lastly, here is an example with just three states.

## Additional Univariate Graphs

There are some alternatives for comparing a continuous variable between categories graphically. In this example *geom_point()* is being used to map the mean elevation for six randomly chosen counties.

```
hp_sub <- hp_data %>% sample_n(6, replace=FALSE)
ggplot(hp_sub)+
geom_point(aes(x=NAME_1, y=elev), size=2)
```

**Bar graphs** can be created using *geom_bar()*. *stat=“identity”* indicates that the height of the bar should represent the magnitude of the variable mapped to the *y*-axis.

# Multivariate Graphs

**Multivariate** graphs involve visualizing the relationship between two or more variables and they include **scatter plots** and **line graphs**. We will now explore these graph types.

## Scatter Plots

**Scatter plots** in **ggplot2** are created using *geom_point()*, and continuous variables should be mapped to both the *x*-axis and *y*-axis. In this first example, I am mapping elevation to the *x*-axis and temperature to the *y*-axis. You can see evidence of a inverse or indirect relationship between temperature and elevation: higher elevation tends to result in lower temperature.

You can change the point symbol used by specifying a *shape* argument. You can also change the *size* of the point symbol. Some point symbols will have both a *fill* and *color* aesthetic while others will only have a single color that can be defined. Quick-R provides a summary of available point symbols. In this example, I have picked a new symbol, increased the size, and defined outline and fill colors.

```
hp_data_ord <- hp_data[hp_data$per_for,]
ggplot(hp_data, aes(x=elev, y=temp))+
geom_point(shape=23, color="blue", fill="red", size=2)
```

You can also add trend lines to **scatter plots** using *geom_smooth()*. Here I have added a **linear** trend (*method=lm*).

```
hp_data_ord <- hp_data[hp_data$per_for,]
ggplot(hp_data, aes(x=elev, y=temp))+
geom_point()+
geom_smooth(method=lm)
```

You can change the line symbol using the *lty* argument and the width using *lwd*. A list of line styles are also summarized on Quick-R.

```
hp_data_ord <- hp_data[hp_data$per_for,]
ggplot(hp_data, aes(x=elev, y=temp))+
geom_point()+
geom_smooth(method=lm, lty=2, lwd=3)
```

You may have noticed that a gray outline is placed around the trend. This is a **95% confidence interval** and is added by default. To not show this, simply add *se=FALSE*.

```
hp_data_ord <- hp_data[hp_data$per_for,]
ggplot(hp_data, aes(x=elev, y=temp))+
geom_point()+
geom_smooth(method=lm, se=FALSE)
```

Instead of plotting a linear trend, you can also use a **loess** function (*method=loess*). This is basically a local-window or moving average and would be more appropriate when a **linear** relationship does not exist between the two variables being graphed.

```
hp_data_ord <- hp_data[hp_data$per_for,]
ggplot(hp_data, aes(x=elev, y=temp))+
geom_point()+
geom_smooth(method=loess)
```

If you would like to show more than two variables on a **scatter plot**, then you can use additional **aesthetic mappings** and graphical parameters. In this example, I have mapped percent forest cover (a continuous variable) to the point *size* and the state to the point *color* (a categorical variable). I would argue that this is not necessarily effective; it is simply an example of how you can apply additional **aesthetic mappings**.

You can also label data points by adding *geom_text()* and providing a text variable or text constant using *label*.

## Line Graphs

*Line graphs* connect adjacent data points with a line as opposed to representing each data value as a separate point. They are often used to represent trends, such as trends over time in a **time series graph**. Since the *x*-axis and *y*-axis can be the same for **scatter plots** and **line graphs**, they can be combined to one plot using multiple *geoms*.

We will now read in a new dataset: **maple_leaf.csv**. This is spectral reflectance data for a maple leaf from the United States Geologic Survey (USGS) Spectral Library. The original data can be found here.

**Line graphs** are created using *geom_line()*. In this example, I am plotting wavelength (in micrometers) to the *x*-axis and and spectral reflectance (as a proportion) to the *y*-axis. This generates a **spectral reflectance curve** where peaks represent reflectance and troughs represent absorption. This should be familiar if you have studied remote sensing.

As another example, we will now read in data from the AVIRIS hyperspectral sensor showing spectral reflectance at four different locations or pixels. Using the *str()* function, you can see that the center wavelength of each band has been provided (in nm) and that data for each location is in separate columns (A through D). Spectral reflecance values have been multiplied by 100 to store them using the integer as opposed to double data type.

```
setwd("ENTER YOUR FILE PATH HERE")
aviris <- read.csv("aviris_spectral_data.csv", header=TRUE, sep=",")
str(aviris)
'data.frame': 207 obs. of 5 variables:
$ Center: num 366 376 385 395 405 ...
$ A : int 2299 2404 2576 2679 2719 2932 3048 3070 3104 3103 ...
$ B : int 2983 2795 2793 2958 3016 3289 3444 3491 3540 3570 ...
$ C : int 1381 1787 1896 1957 2005 2174 2267 2253 2237 2209 ...
$ D : int 1865 1954 2051 2150 2200 2341 2454 2483 2463 2457 ...
```

In this first example, we have plotted spectral reflectance at Site A. Note that the wavelenths were assigned to the *x*-axis and the spectral reflectance to the *y*-axis. Within the *aes()* function we have divided the values in A by 100 to return spectral reflecance as a percent.

In this example, I have plotted all four locations as separate lines using different colors. You might be wondering why I simply didn’t assign the sample location as a categorical variable to the *color* aesthetic. This is because each location is stored as a separate column as opposed to storing the sample as a categorical variable in a column. So, these data are in a different shape then the examples above and must to be treated differently. Note again that you can add multiple *geoms* to the same plot.

# Overplotting

We will now discuss means to deal with **overplotting**. **Overplotting** occurs when you have a large number of data points that cannot be easily distinguished in the graph space due to crowding and overlap.

## Hexagonal Binning

As an example, the graph below contains 15,000 points, which were generatd using the *rnorm()* function. There is too much data in the plot space to see a clear pattern.

```
d1 <- rnorm(15000, mean=0, sd=1)
d2 <- rnorm(15000, mean=0, sd=1)
d3 <- data.frame(d1, d2)
ggplot(d3, aes(x=d1, y=d2))+
geom_point()
```

One method to deal with this issue is to use **hexagonal binning** where the two-dimensional graph space is divided into hexagons of equal size and the number of data points in each hexagon are counted. Note that you can change the *size* argument to obtain larger or smaller hexagons.

## Rectangular Binning

As opposed to using hexagonal bins, you can obtain rectangular or square bins with *geom_bin2d()*. Here, each bin measures 1 by 1 units.

## 2D Kernel Density

Another option is using a **two-dimensional kernel-density surface**. This is the same thing as a **raster**-based kernel density surface that is used to estimate the density of point features over a map space. I am using *stat_density2d()* to generate the kernel density estimate. I am then plotting contours using *geom_density2d()*.

## Jitter

You can also add random noise to data points to provide more separation. this can be accomplished using *geom_jitter()*. In the first graph, the points are being plotted at the the exact specifed coordinates. In the second graph, random noise has been added in the *x* and *y* directions.

# A Note on Graphical Primitives

All graph elements in **ggplot2** are created using graphic primitives to plot line segments, paths, polygons, etc. This is demonstrated in this last graph where I have defined coordinates to represent a polygon, a rectangle, and a line segment to plot in the graph space. These primitives can be used to generate more complex graph features.

# A Note on Coordiate Systems

We have been making use of **Cartesian coordinate systems** throughout this section. However, other systems are available. Below is an example of a graph created using polar coordinates. This is a **Coxcomb Plot** where the angle represented a category and the size represents a quantity. You can also make maps using map coordinates and **ggmap**.

```
library(ggplot2)
x1 <- c("A", "B", "C", "D", "E")
y1 <- c(14, 28, 31, 17, 14)
data1 <- data.frame(x1, y1)
pcord <- ggplot(data1, aes(x=x1, y=y1))+
geom_bar(stat="identity")
pcord + coord_polar()
```

I hope this introduction to **ggplot2** has provided an adequate understanding of the package’s terminology, sytax, and **aesthetic mappings**. The graphs generated here would need some work before they could be considered final products for a presentation, report, write-up, or publication. In the next section, we will learn a variety of technqiues to make **ggplot2** graphs pretty including adding text, editing axis scales and labels, changing fonts and colors, using themes, customizing legends, adding geometric features, and combining multiple graphs to a single layout. We will also explore how to export graphs as raster and vector graphics.