# Regression Techniques

## Objectives

- Create **linear regression** models to predict a continuous variable with another continuous variable.
- Use **multiple linear regression** to predict a continuous variable with more than one predictor variable.
- Assess regression results with **root mean square error** (**RMSE**) and **R-squared**.
- Perform regression diagnostics to assess whether assumptions are met.
- Assess global **spatial autocorrelation** using **Moran's I**.
- Generate predictions using **geographically weighted regression** (**GWR**).

## Overview

Before we investigate **machine learning** methods, this module will provide some examples of **linear regression** techniques to predict a continuous variable. Similar to **machine learning**, linear regression equations are generated by fitting to training samples. Once the model is generated, it can then be used to predict new samples. If more than one **independent variable** is used to predict the **dependent variable**, this is known as **multiple regression**. Since regression techniques are **parametric** and have assumptions, you will also need to perform diagnostics to make sure that assumptions are not violated. If data occur over a geographic extent, patterns and relationships may change over space due to **spatial heterogeneity**. Since the patterns are not consistent, it would not be appropriate to use a single model or equation to predict across the entire spatial extent. Instead, it would be better if the coefficients and model could change over space. **Geographically weighted regression** (**GWR**) allows the model to vary across space. We will explore this technique at the end of the module.

Regression techniques are heavily used for data analysis; thus, this is a broad topic, which we will not cover in detail. For example, it is possible to use regression to predict categories or probabilities using **logistic regression**. It is possible to incorporate categorical predictor variables as **dummy variables**, model non-linear relationships with higher order terms and polynomials, and incorporate interactions. Regression techniques are also used for statistical inference; for example, **ANOVA** is based upon regression techniques. If you have an interest in learning more about regression, there are a variety of resources on this topic. Also, the syntax demonstrated here can be adapted to more complex techniques.

Quick-R provides examples of both regression techniques and regression diagnostics.

In this example, we will attempt to predict the percentage of a county population that has earned at least a bachelor’s degree using other continuous variables. I obtained these data from the United States Census as estimated data as opposed to survey results. Here is a description of the provided variables.

- *per_no_hus*: percent of households with no father present
- *per_with_child_u18*: percent of households with at least one child younger than 18
- *avg_fam_size*: average size of families
- *per_fem_div_15*: percent of females over 15 years old that are divorced
- *per_gp_gc*: percent of grandparents that are caretakers for a grandchild
- *per_25_bach*: percent of population over 25 years old with at least a bachelor's degree
- *per_native_born*: percent of population born in the United States
- *per_eng_only*: percent of population that only speaks English
- *per_broadband*: percent of households with broadband access

## Preparation

Before beginning the analysis, you will need to read in the required packages and data. In order to join the table to the county boundaries, I create a new field from the “GEOID” field that will be unique to each county and the same as the “Unique” code provided in the Census data table.

```
library(ggplot2)
library(GGally)
library(tmap)
library(dplyr)
library(sf)
library(Metrics)
library(car)
library(gvlma)
library(spdep)
library(spgwr)
```

```
setwd("ENTER YOUR FILE PATH HERE")
counties <- st_read("us_counties_contig.shp")
counties$Unique <- paste0("x", as.factor(counties$GEOID))
census <- read.csv("census_data.csv")
```

I have provided the Census data as a **CSV** file with no associated spatial information; however, I would like to display the data in map space, so I will join it to the provided county boundaries. This is accomplished using techniques that we have already discussed. Using **dplyr** and the *left_join()* function, I join the data using the common “Unique” field derived from the county FIPS code.

Once the data have been joined, I use **tmap** to visualize the percent of the population with at least a bachelor’s degree.
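The join and a quick map can be sketched as follows. This is a sketch under the assumptions stated above: the `counties` and `census` objects from the preparation step share the common "Unique" field, and the joined table contains the `per_25_bach` column.

```r
# Join the census table to the county boundaries on the common "Unique" key
# (assumes the objects created in the preparation step above)
data <- counties %>%
  left_join(census, by = "Unique")

# Map percent of the population with at least a bachelor's degree (tmap)
tm_shape(data) +
  tm_polygons(col = "per_25_bach", title = "% with Bachelor's Degree")
```

Because `counties` is an **sf** object, the result of *left_join()* remains an **sf** object, so it can be passed directly to **tmap**.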

## Linear Regression

Before I attempt to predict the percent of the county population with at least a bachelor's degree with all the other provided variables, I will make a prediction using only the percent with broadband access. To visualize the relationship, I first generate a **scatter plot** using **ggplot2**. Although not perfect, there does appear to be a relationship between these two variables: higher broadband access seems to be correlated with a higher rate of bachelor's degree attainment. Of course, this could be completely coincidental; correlation is not causation.

```
ggplot(data, aes(x=per_broadband, y=per_25_bach))+
geom_point()+
geom_smooth(method="lm", col="red")+
ggtitle("Broadband Access vs. Percent of Population with at Least a Bachelor's Degree")+
labs(x="% with Broadband", y="% with At Least Bachelor's Degree")
```

Fitting a linear model in R is accomplished with the base R *lm()* function. Here, I am providing a formula. The left side of the formula contains the **dependent variable**, or the variable being predicted, while the right side references one or more predictor variables. You also need to provide a *data* argument.

Calling *summary()* on the linear model will print a summary of the results. The fitted model is as follows:

per_25_bach = -14.6 + 0.41*per_broadband

Since the coefficient for the percent broadband variable is positive, this suggests that there is a positive or direct relationship between the two variables, as previously suggested by the graph generated above. The **residuals** range from -15.1% to 22.5% with a median of -0.4%. The **F-statistic** is statistically significant, which suggests that this model performs better than simply returning the mean value of the data set as the prediction. The **R-squared** value is 0.50, suggesting that roughly 50% of the variability in percent bachelor's degree attainment is explained by percent broadband access. In short, there is some correlation between the variables but not enough to make a strong prediction.

```
m1 <- lm(per_25_bach ~ per_broadband, data=data)
summary(m1)
Call:
lm(formula = per_25_bach ~ per_broadband, data = data)
Residuals:
Min 1Q Median 3Q Max
-15.084 -2.760 -0.435 2.280 22.492
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.613645 0.515371 -28.36 <2e-16 ***
per_broadband 0.406604 0.007306 55.65 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.947 on 3106 degrees of freedom
Multiple R-squared: 0.4993, Adjusted R-squared: 0.4991
F-statistic: 3097 on 1 and 3106 DF, p-value: < 2.2e-16
```

Fitted models and model summaries are stored as **list** objects. So, you can find specific components of the model or results within the **list**. In the code block below, I am extracting the **R-squared** value from the summary **list** object.
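For instance, the **R-squared** value lives at `summary(m1)$r.squared`. Here is a minimal illustration using hypothetical synthetic data (the variable names `df`, `x`, and `y` are invented for this example; for the county model, the equivalent call would be `summary(m1)$r.squared`):

```r
# Synthetic data: y is almost perfectly determined by x
set.seed(42)
df <- data.frame(x = 1:100)
df$y <- 2 * df$x + rnorm(100)

# Fit the model, then index into the summary list object
m <- lm(y ~ x, data = df)
r2 <- summary(m)$r.squared
r2  # very close to 1 for this nearly deterministic relationship
```

The same indexing works for other components, such as `summary(m)$coefficients`.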

The **Metrics** package can be used to calculate a variety of summary statistics. In the example below, I am using it to obtain the **root mean square error** (**RMSE**) of the model. To do so, I must provide the correct values from the original data table and the predicted or fitted values from the model **list** object. **RMSE** is reported in the units of the **dependent variable**, in this case percent.

You can also calculate **RMSE** manually as demonstrated below. This will yield the same answer as the *rmse()* function.
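The two calculations can be compared on a small hypothetical vector pair (the `actual` and `predicted` values below are invented for illustration; for the county model you would compare `data$per_25_bach` against `fitted(m1)`):

```r
library(Metrics)

# Invented example values for illustration only
actual    <- c(10, 12, 15, 20)
predicted <- c(11, 11, 16, 18)

# RMSE via the Metrics package
rmse_pkg <- rmse(actual, predicted)

# RMSE computed manually: square root of the mean squared error
rmse_manual <- sqrt(mean((actual - predicted)^2))
```

Both approaches return the same value, since *rmse()* is simply a convenience wrapper around the same formula.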

## Multiple Linear Regression

We will now try to predict the percent of the county population with at least a bachelor's degree using all other variables in the Census data set. First, I generate a **scatter plot matrix** using the **car** package to visualize the distribution of the variables and the correlation between pairs. Note also that the percent of grandparents who are caretakers for at least one grandchild variable was read in as a **factor**, so I convert it to a numeric variable before generating the plot.

```
data$per_gp_gc <- as.numeric(data$per_gp_gc)
scatterplotMatrix(~per_25_bach + per_no_hus + per_with_child_u18 + avg_fam_size + per_fem_div_15 + per_gp_gc + per_native_born + per_eng_only + per_broadband, data=data, main="Variable Comparison")
```

As an alternative to a **scatter plot matrix**, you could also just create separate **scatter plots** to compare the **dependent variable** with each **independent variable** separately. Take some time to analyze the patterns and correlations. What variables do you think will be most predictive?

```
ggplot(data, aes(x=per_with_child_u18, y=per_25_bach))+
geom_point()+
geom_smooth(method="lm", col="red")
```

```
ggplot(data, aes(x=per_fem_div_15, y=per_25_bach))+
geom_point()+
geom_smooth(method="lm", col="red")
```

```
ggplot(data, aes(x=per_native_born, y=per_25_bach))+
geom_point()+
geom_smooth(method="lm", col="red")
```

```
ggplot(data, aes(x=per_broadband, y=per_25_bach))+
geom_point()+
geom_smooth(method="lm", col="red")
```

Generating a **multiple linear regression** model is the same as producing a **linear regression** model using only one **independent variable**, except that **multiple variables** will now be provided on the right side of the formula. Once the model is generated, the *summary()* function can be used to explore the results. When more than one predictor variable is used, **adjusted R-squared** should be used for assessment as opposed to **R-squared**. The proportion of variance explained increased from 0.50 to 0.56, suggesting that this **multiple regression** model is performing better than the model created above using only percent broadband access. The **F-statistic** is also statistically significant, suggesting that the model is better than simply predicting with the mean value of the **dependent variable**. When using **multiple regression**, you should interpret coefficients with caution, as they do not take into account variable correlation and can, thus, be misleading.

```
m2 <- lm(per_25_bach ~ per_no_hus + per_with_child_u18 + avg_fam_size + per_fem_div_15 + per_gp_gc + per_native_born + per_eng_only + per_broadband, data = data)
summary(m2)
Call:
lm(formula = per_25_bach ~ per_no_hus + per_with_child_u18 +
avg_fam_size + per_fem_div_15 + per_gp_gc + per_native_born +
per_eng_only + per_broadband, data = data)
Residuals:
Min 1Q Median 3Q Max
-19.5162 -2.4374 -0.3252 2.0338 21.7646
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.6231146 2.1663691 2.596 0.009486 **
per_no_hus 0.0203535 0.0208764 0.975 0.329658
per_with_child_u18 -0.1837126 0.0157027 -11.699 < 2e-16 ***
avg_fam_size 1.4942562 0.3118564 4.791 1.73e-06 ***
per_fem_div_15 -0.2667058 0.0264546 -10.082 < 2e-16 ***
per_gp_gc -0.0029902 0.0004742 -6.305 3.28e-10 ***
per_native_born -0.1804944 0.0223711 -8.068 1.01e-15 ***
per_eng_only 0.0427295 0.0116993 3.652 0.000264 ***
per_broadband 0.3748628 0.0096013 39.043 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.691 on 3099 degrees of freedom
Multiple R-squared: 0.563, Adjusted R-squared: 0.5618
F-statistic: 499 on 8 and 3099 DF, p-value: < 2.2e-16
```

The **RMSE** decreased from 3.95% to 3.69%, again suggesting improvement in the model performance.

## Regression Diagnostics

Once a **linear regression** model is produced, your work is not done. Since **linear regression** is a **parametric** method, you must test to make sure assumptions are not violated. If they are violated, you will need to correct the issue; for example, you may remove variables, transform variables, or use a different modeling method more appropriate for your data.

Here are the assumptions of linear regression:

- **Residuals** are **normally distributed**
- No patterns in the **residuals** relative to values of the **dependent variable** (**homoscedasticity**)
- A linear relationship exists between the **dependent variable** and each **independent variable**
- No or little **multicollinearity** between **independent variables**
- Samples are independent

**Normality** of the **residuals** can be assessed using a normal **Q-Q plot**. This plot can be interpreted as follows:

- On line = normal
- Concave = **skewed** to the right
- Convex = **skewed** to the left
- High/low off line = **kurtosis**

The code block below generates a **Q-Q plot** using the *qqPlot()* function from the **car** package. I have also included a simple **histogram** created with base R. Generally, these plots suggest that there may be some **kurtosis** issues, since high and low values diverge from the line.
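A sketch of those diagnostic plots, assuming the `m2` model fit above (the `main` titles are my own additions):

```r
# Normal Q-Q plot of the studentized residuals (car package);
# qqPlot() also prints the row indices of the most extreme residuals
qqPlot(m2, main = "Normal Q-Q Plot of Residuals")

# Simple base R histogram of the raw residuals
hist(residuals(m2), breaks = 40, main = "Histogram of Residuals")
```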

`[1] 363 614`

Optimally, the variability in the **residuals** should show no pattern when plotted against the values of the **dependent variable**. In other words, the model should not do a better or worse job at predicting high, low, or specific ranges of values of the **dependent variable**. This is the concept of **homoscedasticity**. It can be assessed using **spread-level plots**, in which the **residuals** are plotted against the fitted values; no pattern should be observed. The **Score Test for Non-Constant Error Variance** provides a statistical test of **homoscedasticity** and can be calculated in R using the *ncvTest()* function from the **car** package. In this case, I obtain a statistically significant result, which suggests issues with **homogeneity of variance**. The test also provides a suggested power by which to transform the data.

```
ncvTest(m2)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 126.5095, Df = 1, p = < 2.22e-16
spreadLevelPlot(m2)
```

```
Suggested power transformation: 0.8898529
```

**Component+Residual (Partial Residual) Plots** are used to visually assess whether a linear relationship exists between the **dependent variable** and each **independent variable**. The trend should be approximated with a line. Curves and other non-linear patterns suggest that the relationship between the variables may not be linear. As an example, the relationship with percent broadband may be better modeled using a function other than a linear one. To make the relationship more linear, a data transformation could be applied.
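These plots can be generated with the *crPlots()* function from the **car** package, assuming the `m2` model fit above:

```r
# Component + residual (partial residual) plots: one panel per predictor;
# a curved trend line suggests a non-linear relationship
crPlots(m2)
```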

**Multicollinearity** between predictor variables can be assessed using the **variance inflation factor** (**VIF**). The square root of this value should not be greater than 2. This measure suggests issues with the percent native born and percent English speaking only variables. It would make sense that these variables are correlated, so it might be best to remove one of them from the model. The *vif()* function is provided by the **car** package.

```
vif(m2)
per_no_hus per_with_child_u18 avg_fam_size
1.838522 1.773230 1.921022
per_fem_div_15 per_gp_gc per_native_born
1.128985 1.271686 4.175245
per_eng_only per_broadband
4.209660 1.974071
sqrt(vif(m2)) > 2 # problem?
per_no_hus per_with_child_u18 avg_fam_size
FALSE FALSE FALSE
per_fem_div_15 per_gp_gc per_native_born
FALSE FALSE TRUE
per_eng_only per_broadband
TRUE FALSE
```

The **gvlma** package provides the *gvlma()* function to test all the assumptions at once. Here are brief explanations of the components of this test.

- *Global Stat*: combination of all tests; if any component is not satisfied, this test will not be satisfied
- *Skewness*: assesses **skewness** of the **residuals** relative to a **normal distribution**
- *Kurtosis*: assesses **kurtosis** of the **residuals** relative to a **normal distribution**
- *Link Function*: assesses whether a linear relationship exists between the **dependent variable** and all **independent variables**
- *Heteroscedasticity*: assesses for **homoscedasticity**

According to this test, the only assumption that is met is **homoscedasticity**, which was actually found to be an issue in the test performed above. So, this regression analysis is flawed and would need to be augmented to obtain usable results. We would need to experiment with variable transformations and consider dropping some variables. If these augmentations are not successful in satisfying the assumptions, then we may need to use a method other than **linear regression**.

```
gvmodel <- gvlma(m2)
summary(gvmodel)
Call:
lm(formula = per_25_bach ~ per_no_hus + per_with_child_u18 +
avg_fam_size + per_fem_div_15 + per_gp_gc + per_native_born +
per_eng_only + per_broadband, data = data)
Residuals:
Min 1Q Median 3Q Max
-19.5162 -2.4374 -0.3252 2.0338 21.7646
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.6231146 2.1663691 2.596 0.009486 **
per_no_hus 0.0203535 0.0208764 0.975 0.329658
per_with_child_u18 -0.1837126 0.0157027 -11.699 < 2e-16 ***
avg_fam_size 1.4942562 0.3118564 4.791 1.73e-06 ***
per_fem_div_15 -0.2667058 0.0264546 -10.082 < 2e-16 ***
per_gp_gc -0.0029902 0.0004742 -6.305 3.28e-10 ***
per_native_born -0.1804944 0.0223711 -8.068 1.01e-15 ***
per_eng_only 0.0427295 0.0116993 3.652 0.000264 ***
per_broadband 0.3748628 0.0096013 39.043 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.691 on 3099 degrees of freedom
Multiple R-squared: 0.563, Adjusted R-squared: 0.5618
F-statistic: 499 on 8 and 3099 DF, p-value: < 2.2e-16
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = m2)
Value p-value Decision
Global Stat 1404.3408 0.0000 Assumptions NOT satisfied!
Skewness 257.2111 0.0000 Assumptions NOT satisfied!
Kurtosis 619.3569 0.0000 Assumptions NOT satisfied!
Link Function 526.9427 0.0000 Assumptions NOT satisfied!
Heteroscedasticity 0.8302 0.3622 Assumptions acceptable.
```

Regression techniques are also generally susceptible to the impact of **outliers**. The *outlierTest()* function from the **car** package provides an assessment of the presence of **outliers**. Using this test, eight records are flagged as **outliers**. It might be worth experimenting with removing them from the data set to see if this improves the model and/or helps to satisfy the assumptions.

```
outlierTest(m2)
rstudent unadjusted p-value Bonferroni p
363 5.944600 3.0800e-09 9.5728e-06
614 5.810894 6.8449e-09 2.1274e-05
2753 5.500083 4.1048e-08 1.2758e-04
1857 -5.350584 9.4053e-08 2.9232e-04
2203 4.975268 6.8711e-07 2.1355e-03
1624 4.495318 7.2013e-06 2.2382e-02
2703 4.419873 1.0215e-05 3.1748e-02
1042 4.367878 1.2957e-05 4.0271e-02
```

**Leverage** samples are those that have an odd or different combination of **independent variable** values and, thus, can have a large or disproportionate impact on the model. High **leverage** points can be assessed using **Cook's Distance**, as shown graphically below. Samples 499, 1042, and 1857 are identified as **leverage** samples that are having a large impact on the model.
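One way to sketch this plot, assuming the `m2` model fit above (the 4/(n - k - 1) cutoff is a common rule of thumb, not part of the original analysis):

```r
# Cook's distance for each observation; the base R diagnostic plot
# labels the most influential samples by row index
cutoff <- 4 / (nrow(data) - length(coef(m2)) - 1)
plot(m2, which = 4)
abline(h = cutoff, lty = 2)  # dashed reference line at the cutoff
```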

## Geographically Weighted Regression

Since the data used in this example occur over map space, they may not be independent, or **spatial heterogeneity** may exist. So, it may not be appropriate to use a single equation to represent the relationships.

To determine if this is an issue, I will first plot the **residuals** over map space by adding the **residuals** to the data set then mapping them with **tmap**.
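A sketch of that step, assuming the joined `data` object and the `m2` model from above (tmap v3 syntax; the palette choice is my own):

```r
# Attach the residuals of the multiple regression model to the sf object
data$resid <- residuals(m2)

# Map the residuals with a diverging palette centered on zero
tm_shape(data) +
  tm_polygons(col = "resid", midpoint = 0, palette = "-RdBu",
              title = "Residuals")
```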

Based on a visual inspection of the map, it does appear that there may be some spatial patterns in the **residuals**. However, this is difficult to say with certainty. Luckily, a statistical test is available for assessing **global spatial autocorrelation**: **Moran’s I**. The **null hypothesis** for this test is that there is not **spatial autocorrelation**, or that the spatial pattern of the values is random. The **alternative hypothesis** is that the pattern is not random, so the values are either **clustered** or **dispersed**.

**Moran’s I** is based on comparing a feature to its neighbors, so we will need to obtain a **matrix** of weights that define spatial relationships. There are multiple means to accomplish this in R. First, I convert the **sf** object to an **sp** object, since the function of interest will not accept **sf** objects. Next, I build a neighbor list using the *poly2nb()* function from the **spdep** package. Using this method, only features that share boundaries will be considered neighbors. Note that there are other means to define neighbors; for example, you could use **inverse distance weighting** to build the **spatial weights matrix**. Once neighbors are defined, I use the *nb2listw()* function from the **spdep** package to build a **spatial weights matrix**. I can then calculate Moran’s I using the *moran.test()* function from **spdep**.

The reported **Moran’s I index** is positive and statistically significant. This suggests that the **residuals** are spatially **clustered**. So, a **geographically weighted regression** may be more appropriate than traditional regression methods.

```
data_sp <- as(data, Class="Spatial")
continuity.nb <- poly2nb(data_sp, queen = FALSE)
plot(continuity.nb, coordinates(data_sp))
```

```
continuity.listw <- nb2listw(continuity.nb)
summary(continuity.listw)
Characteristics of weights list object:
Neighbour list object:
Number of regions: 3108
Number of nonzero links: 17754
Percentage nonzero weights: 0.1837952
Average number of links: 5.712355
Link number distribution:
1 2 3 4 5 6 7 8 9 10 11 13
13 26 68 341 858 1067 518 166 39 10 1 1
13 least connected regions:
294 309 367 391 680 1074 1140 1152 1498 1778 2131 2136 2351 with 1 link
1 most connected region:
1958 with 13 links
Weights style: W
Weights constants summary:
n nn S0 S1 S2
W 3108 9659664 3108 1126.713 12653.96
moran.test(data_sp$resid, continuity.listw)
Moran I test under randomisation
data: data_sp$resid
weights: continuity.listw
Moran I statistic standard deviate = 28.362, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.3055944640 -0.0003218539 0.0001163444
```

I will use the *gwr()* function from the **spgwr** package to perform **geographically weighted regression**. Before running the model, I will need to determine an optimal **bandwidth**, which defines the optimal search distance for incorporating **spatial weights** into the model. This can be accomplished with the *gwr.sel()* function.

```
gwr.sel(per_25_bach ~ per_no_hus + per_with_child_u18 + avg_fam_size + per_fem_div_15 + as.numeric(per_gp_gc) + per_native_born + per_eng_only + per_broadband, data = data_sp)
Bandwidth: 2116.32 CV score: 41786.33
Bandwidth: 3420.859 CV score: 42262.43
Bandwidth: 1310.071 CV score: 40751.89
Bandwidth: 811.7811 CV score: 38443.56
Bandwidth: 503.8212 CV score: 35306.44
Bandwidth: 313.4915 CV score: 32872
Bandwidth: 195.8613 CV score: 31878.83
Bandwidth: 123.1619 CV score: 33167.8
Bandwidth: 223.9776 CV score: 31990.14
Bandwidth: 170.4074 CV score: 31913.29
Bandwidth: 189.96 CV score: 31871.96
Bandwidth: 188.3912 CV score: 31871.33
Bandwidth: 187.2035 CV score: 31871.21
Bandwidth: 187.3643 CV score: 31871.21
Bandwidth: 187.3765 CV score: 31871.21
Bandwidth: 187.3754 CV score: 31871.21
Bandwidth: 187.3754 CV score: 31871.21
Bandwidth: 187.3754 CV score: 31871.21
Bandwidth: 187.3754 CV score: 31871.21
[1] 187.3754
```

Once the optimal **bandwidth** is determined, I can build the model. Note that this syntax is very similar to the regression techniques implemented above.
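A sketch of that fit, assuming the formula and `data_sp` object used with *gwr.sel()* above, with the selected bandwidth stored in a variable (`bw` and `m3` are my own names; `m3` matches the object printed in the results below):

```r
# Select the optimal bandwidth by cross-validation (187.3754 in this run),
# then fit the geographically weighted regression with that bandwidth
bw <- gwr.sel(per_25_bach ~ per_no_hus + per_with_child_u18 + avg_fam_size +
                per_fem_div_15 + as.numeric(per_gp_gc) + per_native_born +
                per_eng_only + per_broadband, data = data_sp)
m3 <- gwr(per_25_bach ~ per_no_hus + per_with_child_u18 + avg_fam_size +
            per_fem_div_15 + as.numeric(per_gp_gc) + per_native_born +
            per_eng_only + per_broadband, data = data_sp, bandwidth = bw)
```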

Once the model is built, I convert the result to an **sf** object. I then print the first five records to visualize the results. Note that different coefficients are now provided for each feature as opposed to a single, global equation. So, the model is able to vary over space to incorporate **spatial heterogeneity**.

```
m3_sf <- st_as_sf(m3$SDF)
head(m3_sf)
Simple feature collection with 6 features and 13 fields
geometry type: MULTIPOLYGON
dimension: XY
bbox: xmin: -123.7283 ymin: 33.99541 xmax: -96.45284 ymax: 46.38562
epsg (SRID): 4269
proj4string: +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
sum.w X.Intercept. per_no_hus per_with_child_u18 avg_fam_size
1 130.19234 -34.78181 0.038060340 -0.01801299 3.2429415
2 31.30195 38.18783 -0.295027294 -0.34048247 0.6579797
3 40.43323 13.81937 -0.233988324 -0.20326283 -2.7515267
4 132.65015 -31.05976 0.017663536 -0.01379966 2.6152415
5 117.09808 -31.08970 -0.011434203 -0.02195058 2.8794456
6 117.27645 -34.40608 -0.004853622 -0.09685361 4.4452054
per_fem_div_15 as.numeric.per_gp_gc. per_native_born per_eng_only
1 -0.224201763 -0.0009659514 -0.117924202 0.22813623
2 -0.626900386 -0.0025895092 -0.741398748 0.21842901
3 -0.001002165 0.0013499760 0.053298953 -0.03889115
4 -0.211048926 -0.0018496629 -0.161416177 0.24162466
5 -0.193267089 -0.0022627102 -0.008425324 0.14107862
6 -0.128569038 0.0001864582 -0.015420210 0.12001726
per_broadband gwr.e pred localR2
1 0.4631832 3.368081 12.33192 0.6107726
2 0.5853535 -1.457919 10.55792 0.8675335
3 0.2216439 -6.294063 11.99406 0.6085916
4 0.4787902 3.454713 20.84529 0.6193532
5 0.4038966 -2.125579 14.22558 0.5458257
6 0.4362002 3.024006 18.67599 0.5988537
geometry
1 MULTIPOLYGON (((-97.01952 4...
2 MULTIPOLYGON (((-123.4364 4...
3 MULTIPOLYGON (((-104.5674 3...
4 MULTIPOLYGON (((-96.91075 4...
5 MULTIPOLYGON (((-98.27367 4...
6 MULTIPOLYGON (((-97.12928 4...
```

Finally, I print the result and then obtain an **RMSE** for the prediction. The **RMSE** using **multiple regression** obtained above was 3.69%, while that for **GWR** is 2.7%. This suggests improvement in the prediction.
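That comparison can be sketched as follows, assuming the `pred` column of the `m3_sf` object shown above holds the GWR fitted values:

```r
# Compare observed values to the GWR predictions (Metrics package);
# reported in the units of the dependent variable (percent)
rmse(data$per_25_bach, m3_sf$pred)
```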

The example used here was not optimal for **linear regression** as many of the model assumptions were not met. **GWR** improved the results; however, it also has similar assumptions that cannot be violated. So, this problem might be better tackled using a different method. We will revisit it again in the context of **machine learning**.