Data Summarization and Statistical Tests

Summary Statistics and Statistical Tests

Objectives

  1. Obtain summary statistics using base R, psych, and dplyr
  2. Obtain summary statistics to compare groups
  3. Explore variable correlation
  4. Statistically compare the means of two groups using the t-Test and Mann-Whitney U Test
  5. Compare multiple groups using ANOVA and the Kruskal-Wallis Test

Summary Statistics

Summary Statistics using dplyr and psych

There are many packages available in R that provide functions to summarize data. Here, I will primarily demonstrate dplyr, which we explored in the data manipulation module, and psych. I tend to use dplyr for data manipulation and summarization tasks. However, psych is also very useful, as it allows you to obtain a variety of summary statistics quickly.

First, you will need to read in the required packages. If you haven't installed them, you will need to do so. Also, you will need to load in some example data. We will work with the high_plains_data.csv file. The elevation ("elev"), temperature ("temp"), and precipitation ("precip2") data were extracted from raster grids provided by the PRISM Climate Group at Oregon State University. The elevation data are provided in feet, the temperature data represent 30-year annual normal mean temperature in Celsius, and the precipitation data represent 30-year annual normal precipitation in inches. The original raster grids have a resolution of 4 by 4 kilometers and can be obtained here. I also summarized percent forest ("per_for") from the 2011 National Land Cover Database (NLCD). NLCD data can be obtained from the Multi-Resolution Land Characteristics Consortium (MRLC).
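
For example, the setup might look like this (a minimal sketch; the CSV file is assumed to be in your working directory):

# install.packages(c("dplyr", "psych")) if not already installed
library(dplyr)
library(psych)

# Read in the example data; stringsAsFactors ensures state names are treated as factors
hp_data <- read.csv("high_plains_data.csv", stringsAsFactors = TRUE)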

We do not need all of the columns to work through the examples. So, I am using dplyr to select only the required data into a new data frame. Once the required columns have been extracted, I use the summary() function from base R to obtain summary information for each column. The minimum, 1st quartile, median, mean, 3rd quartile, and maximum values will be returned for continuous variables. For factors, the factor levels and associated counts will be returned.
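
A sketch of this step is shown below, using the column names described above and assuming the state column is named STATE_NAME, as in the model output later in this module:

# Select only the columns needed for the examples
hp_data2 <- hp_data %>% select(STATE_NAME, elev, temp, precip2, per_for)

# Quartiles, mean, and extremes for numeric columns; level counts for factors
summary(hp_data2)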

Calling levels() on a factor column will print the list of factor levels. This data set contains county-level data across eight states.
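
For example, assuming the state column from above:

# Print the levels of the state name factor (the eight state names)
levels(hp_data2$STATE_NAME)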

Generally, I find that it is a good idea to use summary() to inspect data prior to performing any analyses.

The base R head() and tail() functions can be used to print the first six or last six rows of data, respectively. If you would like to print a different number of rows, you can use the optional n argument. Similar to summary(), these functions are useful for inspecting data prior to an analysis.
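
For example:

# First six rows (the default)
head(hp_data2)

# Last ten rows, using the optional n argument
tail(hp_data2, n = 10)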

The describe() function from the psych package can be used to obtain a variety of summary statistics for all or individual columns of data. By default it will return the mean, standard deviation, median, trimmed mean, median absolute deviation, minimum, maximum, range, skewness, kurtosis, and standard error for continuous data.
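
For example:

# Summary statistics for all columns
describe(hp_data2)

# Summary statistics for a single column
describe(hp_data2$elev)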

To obtain individual statistics, you can use the summarize() function from dplyr. In the examples below, we are obtaining the mean, standard deviation, and median for the elevation data. The result can be saved to a variable for later use.
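
For example, saving the result to a variable:

# Mean, standard deviation, and median of elevation
elev_stats <- hp_data2 %>%
  summarize(mean_elev = mean(elev),
            sd_elev = sd(elev),
            med_elev = median(elev))
elev_stats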

The base R hist() function can be used to generate a histogram to visualize the distribution of a single continuous variable. In a later module, you will learn to make more refined histograms using the ggplot2 package; however, this base R function can be useful for simple data exploration.
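
For example:

# Histogram of county elevation
hist(hp_data2$elev, xlab = "Elevation (ft)", main = "Distribution of County Elevation")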

Summary Statistics by Group

As opposed to obtaining global summary statistics, you may be interested in summarizing data by group for comparison. The describeBy() function from the psych package expands upon describe() to allow for grouping. In the example below, I have obtained summary statistics by state. By default the result will be returned as a list object; however, you can set the mat argument to TRUE to obtain the output as a data frame.
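
For example:

# Summary statistics for all columns by state; mat = TRUE returns a data frame
describeBy(hp_data2, group = hp_data2$STATE_NAME, mat = TRUE)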

Alternatively, you can use the dplyr group_by() function to obtain summary statistics by group. In the example below, I am piping the data frame into group_by() followed by summarize() to obtain the mean county percent forest by state. If you would like to calculate a summary statistic for more than one column at once, you can use the summarize_all() function as opposed to summarize(), as demonstrated in the second example below.
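
A sketch of both approaches is provided below, using the columns selected earlier:

# Mean county percent forest by state
hp_data2 %>%
  group_by(STATE_NAME) %>%
  summarize(mean_for = mean(per_for))

# Mean of all remaining columns by state
hp_data2 %>%
  group_by(STATE_NAME) %>%
  summarize_all(mean)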

Statistical Methods

We will now explore some common statistical tests. Since this is not a statistics course, I will not demonstrate more complex tests; however, if you need to conduct more complex tests, I think you will find that the syntax will be similar to the examples provided here.

Correlation

The base R cor() function allows for the calculation of correlation between pairs of continuous variables. The following methods are available:

  • Pearson: parametric and assesses the linear correlation between two variables (similar to R-squared)
  • Spearman: nonparametric measure based on ranks
  • Kendall: nonparametric measure based on ranks, calculated from the proportions of concordant and discordant pairs

In the example below, I have computed the correlation between population density, percent forest cover, temperature, elevation, and precipitation using all three methods. Large positive values indicate direct or positive correlation between the two variables, while negative values indicate indirect or inverse correlation between the two variables. You may have noticed that the diagonal values are 1 for all methods. This is because a variable is perfectly correlated with itself.
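
A sketch of this comparison is provided below; the population density column name (pop_den) is a hypothetical stand-in, since that column was not named above.

# Select the continuous variables to compare (pop_den is a hypothetical column name)
cor_vars <- hp_data %>% select(pop_den, per_for, temp, elev, precip2)

# Correlation matrices using each method
cor(cor_vars, method = "pearson")
cor(cor_vars, method = "spearman")
cor(cor_vars, method = "kendall")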

t-Test

A t-test is used to statistically assess whether the means of two groups are different. The null hypothesis is that the means are not different, and the alternative hypothesis is that they are different. In the example, I first use dplyr to subset out only counties in Utah or North Dakota, since we can only compare two groups with a t-test. I then use the base R t.test() function to perform the test. Specifically, I am attempting to assess whether the mean annual county temperatures are different for the two states. I provide a formula and a data set.
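
A sketch of the test is provided below; droplevels() removes the unused state levels so that the grouping factor has exactly the two levels t.test() requires.

# Subset to counties in Utah or North Dakota
ut_nd <- hp_data2 %>%
  filter(STATE_NAME %in% c("Utah", "North Dakota")) %>%
  droplevels()

# Two-sample t-test of mean annual temperature by state
t.test(temp ~ STATE_NAME, data = ut_nd)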

Once the test executes, the results will be printed to the console. A p-value of ~4e-8 is obtained, which is much lower than 0.05. This indicates that the null hypothesis should be rejected. So, the mean annual temperatures by county in these two states are suggested to be statistically different.

Mann-Whitney U Test

A t-test is a parametric test. This means that it has some distributional assumptions. Specifically, the t-test assumes that the two groups are independent of each other, samples within groups are independent, the dependent variable is normally distributed, and the two groups have similar variance. It is likely that at least some of these assumptions are violated here. For example, due to spatial autocorrelation, it is unlikely that the samples are independent.

The Mann-Whitney U Test offers a nonparametric alternative to the t-test with relaxed distribution assumptions. However, it is still assumed that the samples are independent. The example below provides the result for the comparison of Utah and North Dakota. A statistically significant p-value is obtained, again suggesting that mean annual temperature by county is different between the two states.
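
In R, this test is performed with the base wilcox.test() function. A sketch using the subset created above:

# Mann-Whitney U (Wilcoxon rank sum) test of temperature by state
wilcox.test(temp ~ STATE_NAME, data = ut_nd)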

ANOVA

What if you would like to compare more than two groups? This can be accomplished using analysis of variance or ANOVA. There are many different types of analysis of variance that vary based on study design, such as One-Way ANOVA, Two-Way ANOVA, MANOVA, factorial ANOVA, ANCOVA, and MANCOVA. Here, I will simply demonstrate one-way ANOVA where we are comparing the mean of a single variable between groups. However, if you need to implement one of the more complex methods, you can use similar syntax.

Similar to the example above, I am assessing whether the mean annual temperature by county is different between states. However, now I am assessing all states in the data set as opposed to only two. The null hypothesis is that no pair of means is different, while the alternative hypothesis is that at least one pair of means is different. The test results in a statistically significant p-value, suggesting that at least one pair of states has statistically different mean annual temperature by county. However, the test does not tell us which pair or pairs are different.

Following a statistically significant ANOVA result, it is common to use a pair-wise test to compare each pair combination. There are different methods available. Here, I am using Tukey's Honest Significant Difference method, made available by the TukeyHSD() function, to perform the comparisons. If the p-value is lower than 0.05 and the confidence interval does not include zero, the states are suggested to be statistically significantly different at the 95% confidence level.

# Fit a one-way ANOVA of temperature by state
fit <- aov(temp ~ STATE_NAME, data=hp_data2)
summary(fit)
             Df Sum Sq Mean Sq F value Pr(>F)    
STATE_NAME    7   3356   479.5   176.9 <2e-16 ***
Residuals   481   1303     2.7                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Pair-wise comparisons of state means using Tukey's HSD
TukeyHSD(fit)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = temp ~ STATE_NAME, data = hp_data2)

$STATE_NAME
                                 diff          lwr        upr     p adj
Kansas-Colorado            5.79408322  4.999344528  6.5888219 0.0000000
Montana-Colorado          -1.04806783 -1.965074015 -0.1310616 0.0127131
Nebraska-Colorado          2.97853221  2.164607922  3.7924565 0.0000000
North Dakota-Colorado     -1.73668701 -2.667431873 -0.8059421 0.0000006
South Dakota-Colorado      0.87163656 -0.007538505  1.7508116 0.0538864
Utah-Colorado              1.42581618  0.304009224  2.5476231 0.0031044
Wyoming-Colorado          -0.96872817 -2.187076194  0.2496198 0.2336977
Montana-Kansas            -6.84215105 -7.671409522 -6.0128926 0.0000000
Nebraska-Kansas           -2.81555101 -3.529162961 -2.1019391 0.0000000
North Dakota-Kansas       -7.53077022 -8.375196232 -6.6863442 0.0000000
South Dakota-Kansas       -4.92244666 -5.709668399 -4.1352249 0.0000000
Utah-Kansas               -4.36826704 -5.419561536 -3.3169725 0.0000000
Wyoming-Kansas            -6.76281139 -7.916562307 -5.6090605 0.0000000
Nebraska-Montana           4.02660004  3.178936919  4.8742632 0.0000000
North Dakota-Montana      -0.68861917 -1.649007653  0.2717693 0.3639826
South Dakota-Montana       1.91970439  1.009205179  2.8302036 0.0000000
Utah-Montana               2.47388401  1.327362809  3.6204052 0.0000000
Wyoming-Montana            0.07933966 -1.161801720  1.3204810 0.9999995
North Dakota-Nebraska     -4.71521921 -5.577726274 -3.8527121 0.0000000
South Dakota-Nebraska     -2.10689565 -2.913481797 -1.3003095 0.0000000
Utah-Nebraska             -1.55271602 -2.618588096 -0.4868440 0.0003041
Wyoming-Nebraska          -3.94726038 -5.114309789 -2.7802110 0.0000000
South Dakota-North Dakota  2.60832357  1.683988946  3.5326582 0.0000000
Utah-North Dakota          3.16250319  2.004964184  4.3200422 0.0000000
Wyoming-North Dakota       0.76795883 -0.483367502  2.0192852 0.5734358
Utah-South Dakota          0.55417962 -0.562314593  1.6706738 0.8013743
Wyoming-South Dakota      -1.84036473 -3.053822760 -0.6269067 0.0001347
Wyoming-Utah              -2.39454436 -3.793823612 -0.9952651 0.0000078

Since ANOVA is a parametric method, it has assumptions, similar to the t-test. Specifically, ANOVA assumes that samples are independent, the data and model residuals are normally distributed, and that variance is consistent between groups.

A Q-Q plot can be used to visually assess data normality by comparing the quantiles of the data to theoretical quantiles, or the quantiles the data would have if they were normally distributed. Divergence from the line suggests that the data are not normally distributed.

In the examples below I have created Q-Q plots using two different functions, qqnorm() and qqPlot(). The first plot represents the distribution of the temperature data while the second plot represents the distribution of the model residuals. Since the data points diverge from the line, this suggests that the data are not normally distributed and that an assumption of ANOVA is violated.
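
A sketch of both plots is provided below; qqPlot() is assumed to come from the car package.

# Q-Q plot of the raw temperature data using base R
qqnorm(hp_data2$temp)
qqline(hp_data2$temp)

# Q-Q plot of the ANOVA model residuals (qqPlot() is from the car package)
library(car)
qqPlot(residuals(fit))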


The Bartlett Test of Homogeneity of Variance is used to assess whether variance is the same between groups. A statistically significant result, as obtained here, suggests that this assumption is also violated.
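
For example, using base R:

# Bartlett test of homogeneity of variance across states
bartlett.test(temp ~ STATE_NAME, data = hp_data2)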

ANOVA is also sensitive to outliers. The Bonferroni Outlier Test is used to determine whether outliers are present. The results in this case do suggest the presence of outliers.
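
A sketch, assuming the outlierTest() function from the car package:

# Bonferroni outlier test on the fitted ANOVA model (outlierTest() is from car)
outlierTest(fit)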

Kruskal-Wallis Test

Since many of the assumptions of ANOVA are violated based on our analysis above, it would be good to assess differences between group means using a nonparametric alternative. The Kruskal-Wallis Rank Sum Test provides this alternative. Similar to ANOVA, a statistically significant result, as obtained here, suggests that at least one pair of groups is different. It does not tell you which pair. So, I use the pairw.kw() function from the asbio package to perform a nonparametric pair-wise comparison. A p-value lower than 0.05 and a confidence interval that does not include zero suggests a difference between the two groups.

# Kruskal-Wallis rank sum test of temperature by state
kruskal.test(temp ~ STATE_NAME, data=hp_data2)

    Kruskal-Wallis rank sum test

data:  temp by STATE_NAME
Kruskal-Wallis chi-squared = 367.65, df = 7, p-value < 2.2e-16
# Nonparametric pair-wise comparisons using pairw.kw() from the asbio package
pairw.kw(hp_data2$temp, hp_data2$STATE_NAME, conf=0.95)

95% Confidence intervals for Kruskal-Wallis comparisons 

                                                Diff      Lower      Upper
Avg.rankColorado-Avg.rankKansas           -253.02917 -323.02863 -183.02971
Avg.rankColorado-Avg.rankMontana            67.86607  -12.90254  148.63468
Avg.rankKansas-Avg.rankMontana             320.89524  247.85532  393.93515
Avg.rankColorado-Avg.rankNebraska          -131.1539  -202.8432   -59.4646
Avg.rankKansas-Avg.rankNebraska            121.87527   59.02134   184.7292
Avg.rankMontana-Avg.rankNebraska          -199.01997 -273.68094   -124.359
Avg.rankColorado-Avg.rankNorth Dakota      102.26769     20.289  184.24638
Avg.rankKansas-Avg.rankNorth Dakota        355.29686  280.92101   429.6727
Avg.rankMontana-Avg.rankNorth Dakota        34.40162  -50.18804  118.99128
Avg.rankNebraska-Avg.rankNorth Dakota      233.42159  157.45318  309.38999
Avg.rankColorado-Avg.rankSouth Dakota      -19.88068  -97.31718   57.55582
Avg.rankKansas-Avg.rankSouth Dakota        233.14848  163.81111  302.48586
Avg.rankMontana-Avg.rankSouth Dakota       -87.74675 -167.94224   -7.55127
Avg.rankNebraska-Avg.rankSouth Dakota      111.27322   40.23025  182.31618
Avg.rankNorth Dakota-Avg.rankSouth Dakota -122.14837 -203.56246  -40.73428
Avg.rankColorado-Avg.rankUtah              -52.97629 -151.78346   45.83088
Avg.rankKansas-Avg.rankUtah                200.05287  107.45634  292.64941
Avg.rankMontana-Avg.rankUtah              -120.84236 -221.82633   -19.8584
Avg.rankNebraska-Avg.rankUtah                78.1776   -15.7029  172.05811
Avg.rankNorth Dakota-Avg.rankUtah         -155.24398 -257.19838  -53.28958
Avg.rankSouth Dakota-Avg.rankUtah          -33.09561 -131.43484   65.24362
Avg.rankColorado-Avg.rankWyoming            57.00272  -50.30765  164.31309
Avg.rankKansas-Avg.rankWyoming             310.03188  208.41113  411.65263
Avg.rankMontana-Avg.rankWyoming            -10.86335 -120.18133   98.45462
Avg.rankNebraska-Avg.rankWyoming           188.15662   85.36455  290.94868
Avg.rankNorth Dakota-Avg.rankWyoming       -45.26497 -155.48003   64.95008
Avg.rankSouth Dakota-Avg.rankWyoming         76.8834  -29.99627  183.76307
Avg.rankUtah-Avg.rankWyoming               109.97901  -13.26753  233.22555
                                           Decision Adj. P-value
Avg.rankColorado-Avg.rankKansas           Reject H0            0
Avg.rankColorado-Avg.rankMontana             FTR H0     0.242811
Avg.rankKansas-Avg.rankMontana            Reject H0            0
Avg.rankColorado-Avg.rankNebraska         Reject H0            0
Avg.rankKansas-Avg.rankNebraska           Reject H0            0
Avg.rankMontana-Avg.rankNebraska          Reject H0            0
Avg.rankColorado-Avg.rankNorth Dakota     Reject H0     0.002729
Avg.rankKansas-Avg.rankNorth Dakota       Reject H0            0
Avg.rankMontana-Avg.rankNorth Dakota         FTR H0            1
Avg.rankNebraska-Avg.rankNorth Dakota     Reject H0            0
Avg.rankColorado-Avg.rankSouth Dakota        FTR H0            1
Avg.rankKansas-Avg.rankSouth Dakota       Reject H0            0
Avg.rankMontana-Avg.rankSouth Dakota      Reject H0     0.017672
Avg.rankNebraska-Avg.rankSouth Dakota     Reject H0      2.8e-05
Avg.rankNorth Dakota-Avg.rankSouth Dakota Reject H0      7.8e-05
Avg.rankColorado-Avg.rankUtah                FTR H0            1
Avg.rankKansas-Avg.rankUtah               Reject H0            0
Avg.rankMontana-Avg.rankUtah              Reject H0     0.005193
Avg.rankNebraska-Avg.rankUtah                FTR H0     0.260082
Avg.rankNorth Dakota-Avg.rankUtah         Reject H0      5.5e-05
Avg.rankSouth Dakota-Avg.rankUtah            FTR H0            1
Avg.rankColorado-Avg.rankWyoming             FTR H0            1
Avg.rankKansas-Avg.rankWyoming            Reject H0            0
Avg.rankMontana-Avg.rankWyoming              FTR H0            1
Avg.rankNebraska-Avg.rankWyoming          Reject H0            0
Avg.rankNorth Dakota-Avg.rankWyoming         FTR H0            1
Avg.rankSouth Dakota-Avg.rankWyoming         FTR H0     0.689842
Avg.rankUtah-Avg.rankWyoming                 FTR H0     0.148743

Again, my goal here was to provide a demonstration of data summarization and statistical tests that can be performed in R. As this is not a statistics course, I did not explain the methods in detail or provide examples of more complex methods. If you are interested in these topics, I would suggest taking a statistics and/or spatial statistics course. If you need to perform more complex statistical analyses in R, you will find that they can be performed using similar syntax to the examples provided here.

We will discuss more data summarization in later modules, including data visualization and graphing with ggplot2 and spatial data visualization and summarization with a variety of mapping and spatial analysis packages. With regard to statistical tests, we will explore linear regression and machine learning at the end of the course.
