Regression Techniques

Objectives

  1. Create linear regression models to predict a continuous variable with another continuous variable.
  2. Use multiple linear regression to predict a continuous variable with more than one predictor variable.
  3. Assess regression results with root mean square error (RMSE) and R-squared.
  4. Perform regression diagnostics to assess whether assumptions are met.
  5. Assess global spatial autocorrelation using Moran’s I.
  6. Generate predictions using geographically weighted regression (GWR).

Overview

Before we investigate machine learning methods, this module will provide some examples of linear regression techniques to predict a continuous variable. Similar to machine learning, linear regression equations are generated by fitting to training samples. Once the model is generated, it can then be used to predict new samples. If more than one independent variable is used to predict the dependent variable, this is known as multiple regression. Since regression techniques are parametric and have assumptions, you will also need to perform diagnostics to make sure that assumptions are not violated. If data occur over a geographic extent, patterns and relationships may change over space due to spatial heterogeneity. Since the patterns are not consistent, it would not be appropriate to use a single model or equation to predict across the entire spatial extent. Instead, it would be better if the coefficients and model could change over space. Geographically weighted regression (GWR) allows the model to vary across space. We will explore this technique at the end of the module.

Regression techniques are heavily used for data analysis; thus, this is a broad topic, which we will not cover in detail. For example, it is possible to use regression to predict categories or probabilities using logistic regression. It is possible to incorporate categorical predictor variables as dummy variables, model non-linear relationships with higher order terms and polynomials, and incorporate interactions. Regression techniques are also used for statistical inference; for example, ANOVA is based upon regression techniques. If you have an interest in learning more about regression, there are a variety of resources on this topic. Also, the syntax demonstrated here can be adapted to more complex techniques.

Quick-R provides examples of regression techniques here and regression diagnostics here.

In this example, we will attempt to predict the percentage of a county population that has earned at least a bachelor’s degree using other continuous variables. These data were obtained from the United States Census and represent estimates as opposed to raw survey responses. Here is a description of the provided variables.

  • per_no_hus: percent of households with no father present
  • per_with_child_u18: percent of households with at least one child younger than 18
  • avg_fam_size: average size of families
  • per_fem_div_15: percent of females over 15 years old that are divorced
  • per_gp_gc: percent of grandparents that are caretakers for a grandchild
  • per_25_bach: percent of population over 25 years old with at least a bachelor’s degree
  • per_native_born: percent of population born in United States
  • per_eng_only: percent of population that only speaks English
  • per_broadband: percent of households with broadband access

Preparation

Before beginning the analysis, you will need to read in the required packages and data. In order to join the table to the county boundaries, I create a new field from the “GEOID” field that will be unique to each county and the same as the “Unique” code provided in the Census data table.
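
A minimal sketch of this setup, assuming hypothetical file names (the object names and file names are my own; only the "GEOID" and "Unique" fields come from the text above):

library(dplyr)
library(sf)
library(tmap)
library(ggplot2)

# Read the county boundaries and the Census table (file names are assumptions)
counties <- st_read("county_boundaries.shp")
census <- read.csv("census_data.csv")

# Create a numeric join field from the GEOID code so it matches the "Unique" column
counties$Unique <- as.numeric(as.character(counties$GEOID))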

I have provided the Census data as a CSV file with no associated spatial information; however, I would like to display the data in map space, so I will join it to the provided county boundaries. This is accomplished using techniques that we have already discussed. Using dplyr and the left_join() function, I join the data using the common “Unique” field derived from the county FIPS code.
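
For example, assuming the objects created in the sketch above, the join could look like this:

# Join the Census table to the county boundaries on the shared "Unique" field
counties_joined <- counties %>%
  left_join(census, by = "Unique")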

Once the data have been joined, I use tmap to visualize the percent of the population with at least a bachelor’s degree.
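
A sketch of one possible map, assuming the joined object from above:

# Choropleth map of the percent of the population with at least a bachelor's degree
tm_shape(counties_joined) +
  tm_polygons(col = "per_25_bach", style = "quantile", palette = "Greens",
              title = "% with Bachelor's Degree") +
  tm_layout(legend.outside = TRUE)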

Linear Regression

Before I attempt to predict the percent of the county population with at least a bachelor’s degree with all the other provided variables, I will make a prediction using only the percent with broadband access. To visualize the relationship, I first generate a scatter plot using ggplot2. Although not perfect, it does appear that there is a relationship between these two variables: higher percent access to broadband seems to be correlated with a higher rate of bachelor’s degree attainment. This could be completely coincidental; correlation is not causation.
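
A sketch of the scatter plot, assuming the joined data object from the sketches above:

# Scatter plot of broadband access versus bachelor's degree attainment
ggplot(counties_joined, aes(x = per_broadband, y = per_25_bach)) +
  geom_point(alpha = 0.3) +
  labs(x = "% of Households with Broadband", y = "% with Bachelor's Degree")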

Fitting a linear model in R is accomplished with the base R lm() function. Here, I am providing a formula. The left side of the formula contains the dependent variable, or the variable being predicted, while the right side references one or more predictor variables. You also need to provide a data argument.
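
A minimal sketch, assuming the joined data from above (the model name m1 is my own):

# Fit a simple linear regression: bachelor's degree attainment ~ broadband access
m1 <- lm(per_25_bach ~ per_broadband, data = counties_joined)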

Calling summary() on the linear model will print a summary of the results. The fitted model is as follows:

per_25_bach = -14.6 + 0.41*per_broadband

Since the coefficient for the percent broadband variable is positive, this suggests that there is a positive or direct relationship between the two variables, as previously suggested by the graph generated above. The residuals vary from -15.1% to 22.5% with a median of -0.4%. The F-statistic is statistically significant, which suggests that this model is performing better than simply returning the mean value of the data set as the prediction. The R-squared value is 0.50, suggesting that roughly 50% of the variability in percent bachelor’s degree attainment is explained by percent broadband access. In short, there is some correlation between the variables but not enough to make a strong prediction.

Fitted models and model summaries are stored as list objects. So, you can find specific components of the model or results within the list. In the code block below, I am extracting the R-squared value from the summary list object.
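
For example, assuming the model object m1 from the sketch above:

# Store the model summary, then extract the R-squared component from the list
m1_summary <- summary(m1)
m1_summary$r.squared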

The Metrics package can be used to calculate a variety of summary statistics. In the example below, I am using it to obtain the root mean square error (RMSE) of the model. To do so, I must provide the correct values from the original data table and the predicted or fitted values from the model list object. RMSE is reported in the units of the dependent variable, in this case percent.
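
A sketch of this calculation, assuming the objects from above and that no rows were dropped due to missing values:

library(Metrics)

# RMSE: compare the observed values to the model's fitted values
rmse(counties_joined$per_25_bach, m1$fitted.values)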

You can also calculate RMSE manually as demonstrated below. This will yield the same answer as the rmse() function.
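
The manual calculation might look like this, using the same objects as above:

# Manual RMSE: square root of the mean squared residual
sqrt(mean((counties_joined$per_25_bach - m1$fitted.values)^2))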

Multiple Linear Regression

We will now try to predict the percent of the county population with at least a bachelor’s degree using all other variables in the Census data set. First, I generate a scatter plot matrix using the car package to visualize the distribution of the variables and the correlation between pairs. Note also that the percent of grandparents who are caretakers for at least one grandchild variable was read in as a factor, so I convert it to a numeric variable before including it in the analysis.
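
A sketch of that plot, assuming the joined data from above (the factor conversion is included here):

library(car)

# Convert the grandparent caretaker variable from factor to numeric
counties_joined$per_gp_gc <- as.numeric(as.character(counties_joined$per_gp_gc))

# Scatter plot matrix of the dependent variable and candidate predictors
scatterplotMatrix(~ per_25_bach + per_no_hus + per_with_child_u18 + avg_fam_size +
                    per_fem_div_15 + per_gp_gc + per_native_born + per_eng_only +
                    per_broadband, data = st_drop_geometry(counties_joined))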

As an alternative to a scatter plot matrix, you could also just create separate scatter plots to compare the dependent variable with each independent variable separately. Take some time to analyze the patterns and correlations. What variables do you think will be most predictive?

Generating a multiple linear regression model is the same as producing a linear regression model using only one independent variable except that multiple variables will now be provided on the right side of the formula. Once the model is generated, the summary() function can be used to explore the results. When more than one predictor variable is used, adjusted R-squared should be used for assessment as opposed to R-squared. The proportion of variance explained increased from 0.50 to 0.56, suggesting that this multiple regression model is performing better than the model created above using only percent broadband access. The F-statistic is also statistically significant, suggesting that the model is better than simply predicting using only the mean value of the dependent variable. When using multiple regression, you should interpret coefficients with caution as they do not take into account variable correlation and can, thus, be misleading.
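
A minimal sketch, assuming the objects from above (the model name m2 is my own):

# Multiple linear regression using all candidate predictors
m2 <- lm(per_25_bach ~ per_no_hus + per_with_child_u18 + avg_fam_size +
           per_fem_div_15 + per_gp_gc + per_native_born + per_eng_only +
           per_broadband, data = counties_joined)
summary(m2)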

The RMSE decreased from 3.95% to 3.69%, again suggesting improvement in the model performance.

Regression Diagnostics

Once a linear regression model is produced, your work is not done. Since linear regression is a parametric method, you must test to make sure assumptions are not violated. If they are violated, you will need to correct the issue; for example, you may remove variables, transform variables, or use a different modeling method more appropriate for your data.

Here are the assumptions of linear regression:

  1. Residuals are normally distributed
  2. No patterns in the residuals relative to values of the dependent variable (homoscedasticity)
  3. Linear relationship exists between the dependent variable and each independent variable
  4. No or little multicollinearity between independent variables
  5. Samples are independent

Normality of the residuals can be assessed using a normal Q-Q plot. This plot can be interpreted as follows:

  • On line = normal
  • Concave = skewed to the right
  • Convex = skewed to the left
  • High/low off line = kurtosis

The code block below generates a Q-Q plot using the qqPlot() function from the car package. I have also included a simple histogram created with base R. The function also returns the row indices of the most extreme residuals. Generally, these plots suggest that there may be some kurtosis issues since high and low values diverge from the line.
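
A sketch of those plots, assuming the multiple regression model m2 from above:

# Normal Q-Q plot of the model residuals; the indices of the most extreme points are returned
qqPlot(m2)

# Simple base R histogram of the residuals
hist(resid(m2), breaks = 40, main = "Model Residuals", xlab = "Residual (%)")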

[1] 363 614

Optimally, the variability in the residuals should show no pattern when plotted against the values of the dependent variable. Or, the model should not do a better or worse job at predicting high, low, or specific ranges of values of the dependent variable. This is the concept of homoscedasticity. This can be assessed using spread-level plots. In this plot, the residuals are plotted against the fitted values, and no pattern should be observed. The Score Test for Non-Constant Error Variance provides a statistical test of homoscedasticity and can be calculated in R using the ncvTest() function from the car package. In this case I obtain a statistically significant result, which suggests issues with homogeneity of variance. A suggested power by which to transform the data is also provided.
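
A sketch of these checks, assuming the model m2 from above:

# Spread-level plot of residuals versus fitted values; prints a suggested power transformation
spreadLevelPlot(m2)

# Score test for non-constant error variance
ncvTest(m2)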


Suggested power transformation:  0.8898529 

Component+Residual (Partial Residual) Plots are used to visually assess whether a linear relationship exists between the dependent variable and each independent variable. The trend should be approximated with a line. Curves and other non-linear patterns suggest that the relationship between the variables may not be linear. As an example, the relationship with percent broadband may be better modeled using a function other than a linear one. To make the relationship more linear, a data transformation could be applied.
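
Assuming the model m2 from above, these plots can be generated with the car package:

# Component + residual (partial residual) plots for each predictor
crPlots(m2)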

Multicollinearity between predictor variables can be assessed using the variance inflation factor (VIF). The square root of this value should not be greater than 2. This measure suggests issues with the percent native born and percent English speaking only variables. It would make sense that these variables are correlated, so it might be best to remove one of them from the model. The vif() function is provided by the car package.
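
For example, again assuming the model m2 from above:

# Variance inflation factors; flag predictors where the square root of the VIF exceeds 2
vif(m2)
sqrt(vif(m2)) > 2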

The gvlma package provides the gvlma() function to test all the assumptions at once. Here are brief explanations of the components of this test.

  • Global Stat: combination of all tests, if any component is not satisfied this test will not be satisfied
  • Skewness: assess skewness of the residuals relative to a normal distribution
  • Kurtosis: assess kurtosis of residuals relative to a normal distribution
  • Link Function: assess whether a linear relationship exists between the dependent variable and the independent variables.
  • Heteroscedasticity: assess for homoscedasticity.
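
A minimal sketch of running this global test on the model fit above:

library(gvlma)

# Run all of the assumption checks at once and print the decisions
gvlma(m2)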

According to this test, the only assumption that is met is homoscedasticity, which was actually found to be an issue in the test performed above. So, this regression analysis is flawed and would need to be augmented to obtain usable results. We would need to experiment with variable transformations and consider dropping some variables. If these augmentations are not successful in satisfying the assumptions, then we may need to use a method other than linear regression.

Regression techniques are also generally susceptible to the impact of outliers. The outlierTest() function from the car package provides an assessment of the presence of outliers. Using this test, eight records are flagged as outliers. It might be worth experimenting with removing them from the data set to see if this improves the model and/or helps to satisfy the assumptions.
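
For example, assuming the model m2 from above:

# Bonferroni-adjusted test for the presence of outliers
outlierTest(m2)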

Leverage samples are those that have an odd or different combination of independent variable values and, thus, can have a large or disproportionate impact on the model. High leverage points can be assessed using Cook’s Distance, as shown graphically below. Samples 499, 1042, and 1857 are identified as leverage samples that are having a large impact on the model.
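
A sketch of the Cook’s Distance plot, assuming the model m2 from above; by default, base R labels the most influential observations:

# Cook's Distance for each observation
plot(m2, which = 4)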

Geographically Weighted Regression

Since the data used in this example occur over map space, they may not be independent, or spatial heterogeneity may exist. So, it may not be appropriate to use a single equation to represent the relationships.

To determine if this is an issue, I will first plot the residuals over map space by adding the residuals to the data set then mapping them with tmap.
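
A sketch of this map, assuming the model m2 and joined data from above and that no rows were dropped due to missing values:

# Attach the model residuals to the spatial data
counties_joined$residuals <- resid(m2)

# Map the residuals with a diverging palette centered on zero
tm_shape(counties_joined) +
  tm_polygons(col = "residuals", palette = "RdBu", midpoint = 0,
              title = "Model Residuals (%)") +
  tm_layout(legend.outside = TRUE)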

Based on a visual inspection of the map, it does appear that there may be some spatial patterns in the residuals. However, this is difficult to say with certainty. Luckily, a statistical test is available for assessing global spatial autocorrelation: Moran’s I. The null hypothesis for this test is that there is no spatial autocorrelation, or that the spatial pattern of the values is random. The alternative hypothesis is that the pattern is not random, so the values are either clustered or dispersed.

Moran’s I is based on comparing a feature to its neighbors, so we will need to obtain a matrix of weights that define spatial relationships. There are multiple means to accomplish this in R. First, I convert the sf object to an sp object, since the function of interest will not accept sf objects. Next, I build a neighbor list using the poly2nb() function from the spdep package. Using this method, only features that share boundaries will be considered neighbors. Note that there are other means to define neighbors; for example, you could use inverse distance weighting to build the spatial weights matrix. Once neighbors are defined, I use the nb2listw() function from the spdep package to build a spatial weights matrix. I can then calculate Moran’s I using the moran.test() function from spdep.
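
A sketch of these steps, assuming the objects from above (the object names counties_sp, nb, and lw are my own):

library(sp)
library(spdep)

# Convert to an sp object, define neighbors by shared boundaries, and build a weights matrix
counties_sp <- as(counties_joined, "Spatial")
nb <- poly2nb(counties_sp)
lw <- nb2listw(nb, zero.policy = TRUE)

# Global Moran's I test on the model residuals
moran.test(counties_sp$residuals, lw, zero.policy = TRUE)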

The reported Moran’s I index is positive and statistically significant. This suggests that the residuals are spatially clustered. So, a geographically weighted regression may be more appropriate than traditional regression methods.

I will use the gwr() function from the spgwr package to perform geographically weighted regression. Before running the model, I will need to determine an optimal bandwidth, which defines the optimal search distance for incorporating spatial weights into the model. This can be accomplished with the gwr.sel() function.
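
A sketch of the bandwidth selection, assuming the sp object from above (the name bw is my own):

library(spgwr)

# Select an optimal fixed bandwidth for the geographically weighted regression
bw <- gwr.sel(per_25_bach ~ per_no_hus + per_with_child_u18 + avg_fam_size +
                per_fem_div_15 + per_gp_gc + per_native_born + per_eng_only +
                per_broadband, data = counties_sp)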

Once the optimal bandwidth is determined, I can build the model. Note that this syntax is very similar to the regression techniques implemented above.
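
A sketch of the model fit, assuming the bandwidth selected above:

# Fit the GWR model; hatmatrix = TRUE allows fit statistics to be calculated
m3 <- gwr(per_25_bach ~ per_no_hus + per_with_child_u18 + avg_fam_size +
            per_fem_div_15 + per_gp_gc + per_native_born + per_eng_only +
            per_broadband, data = counties_sp, bandwidth = bw, hatmatrix = TRUE)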

Once the model is built, I convert the result to an sf object. I then print the first five records to visualize the results. Note that different coefficients are now provided for each feature as opposed to a single, global equation. So, the model is able to vary over space to incorporate spatial heterogeneity.

m3_sf <- st_as_sf(m3$SDF)
head(m3_sf)
Simple feature collection with 6 features and 13 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -123.7283 ymin: 33.99541 xmax: -96.45284 ymax: 46.38562
epsg (SRID):    4269
proj4string:    +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
      sum.w X.Intercept.   per_no_hus per_with_child_u18 avg_fam_size
1 130.19234    -34.78181  0.038060340        -0.01801299    3.2429415
2  31.30195     38.18783 -0.295027294        -0.34048247    0.6579797
3  40.43323     13.81937 -0.233988324        -0.20326283   -2.7515267
4 132.65015    -31.05976  0.017663536        -0.01379966    2.6152415
5 117.09808    -31.08970 -0.011434203        -0.02195058    2.8794456
6 117.27645    -34.40608 -0.004853622        -0.09685361    4.4452054
  per_fem_div_15 as.numeric.per_gp_gc. per_native_born per_eng_only
1   -0.224201763         -0.0009659514    -0.117924202   0.22813623
2   -0.626900386         -0.0025895092    -0.741398748   0.21842901
3   -0.001002165          0.0013499760     0.053298953  -0.03889115
4   -0.211048926         -0.0018496629    -0.161416177   0.24162466
5   -0.193267089         -0.0022627102    -0.008425324   0.14107862
6   -0.128569038          0.0001864582    -0.015420210   0.12001726
  per_broadband     gwr.e     pred   localR2
1     0.4631832  3.368081 12.33192 0.6107726
2     0.5853535 -1.457919 10.55792 0.8675335
3     0.2216439 -6.294063 11.99406 0.6085916
4     0.4787902  3.454713 20.84529 0.6193532
5     0.4038966 -2.125579 14.22558 0.5458257
6     0.4362002  3.024006 18.67599 0.5988537
                        geometry
1 MULTIPOLYGON (((-97.01952 4...
2 MULTIPOLYGON (((-123.4364 4...
3 MULTIPOLYGON (((-104.5674 3...
4 MULTIPOLYGON (((-96.91075 4...
5 MULTIPOLYGON (((-98.27367 4...
6 MULTIPOLYGON (((-97.12928 4...

Finally, I print the result and then obtain an RMSE for the prediction. The RMSE using multiple regression obtained above was 3.69% while that for GWR is 2.7%. This suggests improvement in the prediction.
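
For example, the RMSE can be calculated from the per-feature residuals (the gwr.e column) stored in the converted object:

# Print the GWR result, then compute the RMSE from the stored residuals
m3
sqrt(mean(m3_sf$gwr.e^2))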

The example used here was not optimal for linear regression as many of the model assumptions were not met. GWR improved the results; however, it has similar assumptions that must also be satisfied. So, this problem might be better tackled using a different method. We will revisit it again in the context of machine learning.
