Statistical Tests and Linear Regression

This module provides an introduction to performing statistical tests of a hypothesis and implementing linear regression to make a prediction of a continuous variable. For implementing statistical tests, we will make use of the stats module of scipy and the statsmodels library. Here, I provide examples of performing a Two-Sample T-Test and a One-Way ANOVA and assessing correlation with the Spearman and Pearson methods. Other statistical tests use similar syntax. To undertake linear regression, we will use scikit-learn, which is explored in more detail in the machine learning module. Before attempting these examples, make sure your Python environment has the numpy, pandas, scipy, statsmodels, and scikit-learn libraries installed.

After working through this module you will be able to:

  1. Subsample rows from a dataset.
  2. Prepare data as input to statistical tests.
  3. Perform tests and interpret the output.
  4. Generate, explore, and assess a linear regression model.
import numpy as np
import pandas as pd
from scipy import stats as st
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [10, 8]

T-Test

In the first part of this module, I will use the "matts_movies.csv" data. I will investigate whether my brother has ranked different genres of movies differently. To begin, I first read in the data table as a Pandas Data Frame. We will primarily make use of the "My Rating" and "Genre" columns.

movies_df = pd.read_csv("D:/mydata/matts_movies.csv", sep=",", header=0, encoding="ISO-8859-1")
print(movies_df.head(10))
                 Movie Name             Director  Release Year  My Rating  \
0             Almost Famous        Cameron Crowe          2000       9.99   
1  The Shawshank Redemption       Frank Darabont          1994       9.98   
2             Groundhog Day         Harold Ramis          1993       9.96   
3              Donnie Darko        Richard Kelly          2001       9.95   
4           Children of Men       Alfonso Cuaron          2006       9.94   
5                Annie Hall          Woody Allen          1977       9.93   
6                  Rushmore         Wes Anderson          1998       9.92   
7                   Memento    Christopher Nolan          2000       9.91   
8    No Country for Old Men  Joel and Ethan Coen          2007       9.90   
9                     Seven        David Fincher          1995       9.88

         Genre  Own  
0        Drama  Yes  
1        Drama  Yes  
2       Comedy  Yes  
3       Sci-Fi  Yes  
4       Sci-Fi  Yes  
5       Comedy  Yes  
6  Independent  Yes  
7     Thriller  Yes  
8     Thriller  Yes  
9     Thriller  Yes

In order to mimic a sample from this larger population, I make use of the .sample() method to extract 30 samples for both the Drama and Action genres. Note that I am also setting a random seed in numpy so that I obtain the same random sample if I re-execute the code. I first subset out all records from each genre into separate data frames. I then subset out the samples.

The functions that we will use to perform the statistical tests expect data to be provided as numpy arrays. So, in the next code cell I convert the data frames to numpy arrays. I next use the describe() function from the stats module to obtain summary statistics for each group. The sampled Action movies have a mean rating of 6.64 while the Dramas have a mean rating of 7.66. Can this be considered statistically different? To assess this, we will use a Two-Sample T-Test. Also, note that the variance is different in each group. So, we will specifically use a Welch's T-Test, which does not assume equal variance between the groups.

mD = movies_df.query('Genre=="Drama"')
mA = movies_df.query('Genre=="Action"')
np.random.seed(42)
mA = mA.sample(30)
mD = mD.sample(30)
mAa = mA[mA.columns[3]].to_numpy()
mDa = mD[mD.columns[3]].to_numpy()
print(mDa)
print(mAa)
print(st.describe(mAa))
print(st.describe(mDa))
[7.1  7.95 8.45 7.86 6.59 7.94 6.87 7.03 7.44 6.93 9.47 6.84 8.52 7.95
 6.43 6.55 9.35 7.67 7.73 8.23 7.23 7.1  7.48 6.68 6.24 9.75 7.34 8.43
 9.39 7.21]
[6.02 6.76 8.89 7.21 5.35 3.77 7.33 7.55 8.72 4.82 4.46 8.9  5.03 7.48
 6.73 6.73 5.86 9.18 7.23 3.18 6.76 7.01 6.03 7.84 7.06 7.34 5.84 4.51
 8.13 7.33]
DescribeResult(nobs=30, minmax=(3.18, 9.18), mean=6.635000000000001, variance=2.3339982758620685, skewness=-0.409647590960622, kurtosis=-0.4053829710587875)
DescribeResult(nobs=30, minmax=(6.24, 9.75), mean=7.658333333333334, variance=0.9088626436781612, skewness=0.6692051986038903, kurtosis=-0.40339521648835763)

Before performing the test, I use matplotlib to create a grouped box plot to compare the central tendency and distribution of the two groups. Note that the interquartile ranges (IQR) overlap some and that there is a good bit of variability within the groups. So, it is not clear if there is actually a difference between the group means.

fig, ax = plt.subplots(1, 1)
ax.boxplot([mDa, mAa])
ax.set_xticklabels(["Drama", "Action"]) 
ax.set_ylabel("mean") 
plt.show()

png

I am now ready to perform the Two-Sample T-Test. This function requires the two numpy arrays storing the values for each group. I also set the equal_var argument to False so that a Welch's T-Test will be performed and equal variance between groups is not assumed.

I obtain a T-Value of 3.11 and a p-value of 0.0031. Using an alpha of 0.05, we would reject the null hypothesis that the means between the two groups are not different because the obtained p-value is smaller than 0.05. We would favor the alternative hypothesis that the means between the two groups are different.

st.ttest_ind(mDa, mAa, equal_var=False)
Ttest_indResult(statistic=3.1125302908342998, pvalue=0.003104205237585028)

One-Way ANOVA

In this section, I demonstrate the use of a One-Way ANOVA to assess for differences in the means of multiple groups. First, I extract out 30 samples from the Comedy genera and generate a numpy array of the ratings. Next, I create a box plot to compare the three groups: Action, Drama, and Comedy. Again, it is not clear if the means are different between pairs of the three groups.

mC = movies_df.query('Genre=="Comedy"')
np.random.seed(42)
mC = mC.sample(30)
mCa = mC[mC.columns[3]].to_numpy()
fig, ax = plt.subplots(1, 1)
ax.boxplot([mDa, mAa, mCa])
ax.set_xticklabels(["Drama", "Action", "Comedy"]) 
ax.set_ylabel("mean") 
plt.show()

png

To perform the ANOVA test, the numpy arrays of the ratings for each of the three groups are provided to the f_oneway() function. For my samples, I obtain an F-Value of 7.93 and a p-value of 0.000685. This suggest rejection of the null hypothesis in favor of the alternative hypothesis that at least two of the groups have different means. However, ANOVA does not tell us which of the groups are estimated to have different means. So, I next perform a Tukey-HSD test as a pairwise comparison. This is accomplished using the pairwise_tukeyhsd() function from the statsmodels library. This function expects the data to be provided differently than the T-Test and ANOVA already performed. Specifically, it requires all of the values to be provided as a single array and all of the group names to be provided as a separate array. To accomplish this, I copy one array and append all of the other rating values into it. I then create arrays containing 30 replicates of the group names then append them to a single array.

The Tukey-HSD test suggests that the Drama and Comedies and Drama and Action genres have different mean ratings while the Action and Comedies do not. Note that a p-value of less than 0.05 and a confidence interval that does not include zero suggests to reject the null hypothesis.

anovaF, anovaP = st.f_oneway(mAa, mDa, mCa)
print(anovaF)
print(anovaP)
7.932685401667139
0.0006845063246528356
forTHSD = np.copy(mAa)
forTHSD = np.append(forTHSD, mDa)
forTHSD = np.append(forTHSD, mCa)
grpsA = np.repeat(["Action"], 30)
grpsD = np.repeat(["Drama"], 30)
grpsC = np.repeat(["Comedy"], 30)
grps = np.copy(grpsA)
grps = np.append(grps, grpsD)
grps = np.append(grps, grpsC)
print(forTHSD)
print(grps)
[6.02 6.76 8.89 7.21 5.35 3.77 7.33 7.55 8.72 4.82 4.46 8.9  5.03 7.48
 6.73 6.73 5.86 9.18 7.23 3.18 6.76 7.01 6.03 7.84 7.06 7.34 5.84 4.51
 8.13 7.33 7.1  7.95 8.45 7.86 6.59 7.94 6.87 7.03 7.44 6.93 9.47 6.84
 8.52 7.95 6.43 6.55 9.35 7.67 7.73 8.23 7.23 7.1  7.48 6.68 6.24 9.75
 7.34 8.43 9.39 7.21 8.55 6.78 6.15 6.87 5.73 5.34 4.87 5.96 4.66 6.77
 7.47 6.09 5.48 3.83 4.01 6.74 6.03 7.87 7.35 5.23 8.21 7.57 6.38 5.8
 6.28 5.83 6.54 8.23 9.25 7.  ]
['Action' 'Action' 'Action' 'Action' 'Action' 'Action' 'Action' 'Action'
 'Action' 'Action' 'Action' 'Action' 'Action' 'Action' 'Action' 'Action'
 'Action' 'Action' 'Action' 'Action' 'Action' 'Action' 'Action' 'Action'
 'Action' 'Action' 'Action' 'Action' 'Action' 'Action' 'Drama' 'Drama'
 'Drama' 'Drama' 'Drama' 'Drama' 'Drama' 'Drama' 'Drama' 'Drama' 'Drama'
 'Drama' 'Drama' 'Drama' 'Drama' 'Drama' 'Drama' 'Drama' 'Drama' 'Drama'
 'Drama' 'Drama' 'Drama' 'Drama' 'Drama' 'Drama' 'Drama' 'Drama' 'Drama'
 'Drama' 'Comedy' 'Comedy' 'Comedy' 'Comedy' 'Comedy' 'Comedy' 'Comedy'
 'Comedy' 'Comedy' 'Comedy' 'Comedy' 'Comedy' 'Comedy' 'Comedy' 'Comedy'
 'Comedy' 'Comedy' 'Comedy' 'Comedy' 'Comedy' 'Comedy' 'Comedy' 'Comedy'
 'Comedy' 'Comedy' 'Comedy' 'Comedy' 'Comedy' 'Comedy' 'Comedy']
tukRes = pairwise_tukeyhsd(forTHSD, grps)
print(tukRes)
Multiple Comparison of Means - Tukey HSD, FWER=0.05
===================================================
group1 group2 meandiff p-adj   lower  upper  reject
---------------------------------------------------
Action Comedy   -0.206  0.788 -0.9943 0.5823  False
Action  Drama   1.0233 0.0074   0.235 1.8116   True
Comedy  Drama   1.2293  0.001   0.441 2.0176   True
---------------------------------------------------

Correlations

In this section, I assess the correlation between the movie rating and release year. This assessment could be used to determine whether there is a pattern in my brother's ratings over time. For example, does my brother tend to rate newer or older movies higher? I will specifically explore correlation for all of the dramas in the dataset. So, I extract out all of the dramas then write the ratings and release year to separate numpy arrays.

The Pearson Correlation Coefficient can be obtained using the pearsonr() function from stats. This metric assesses for a linear correlation between two continuous variables. Here, I obtain a correlation of -0.028, which has an associated p-value of 0.685. This suggests that there is not a strong linear relationship between the rating and release year. Or, my brother's ratings for dramas do not seem to vary based on release year.

The Spearman Correlation Coefficient can be obtained using the spearmanr() function, which assesses for a monotonic relationship using ranks. Here, I obtain a correlation of 0.006 with an associated p-value of 0.914. This again suggests that there is not a strong monotonic relationship between the movie rating and release year.

mD = movies_df.query('Genre=="Drama"')
mDr = mD[mD.columns[3]].to_numpy()
mDy = mD[mD.columns[2]].to_numpy()
print(st.pearsonr(mDr, mDy))

(-0.022701968234130208, 0.6853270459162113)
print(st.spearmanr(mDr, mDy))
SpearmanrResult(correlation=0.006073936628682969, pvalue=0.9136784076967588)

Linear Regression

In this last section, I will demonstrate implementing linear regression using scikit-learn. First, I will estimate average mean county temperature using the mean county elevation. I will then undertake multiple linear regression by predicting the mean county temperature using elevation, percent forest cover, and mean county total annual precipitation.

First, I read in the "high_plains_data.csv" file. These data represent environmental and climate data for counties in the high plains states. The climate data were derived from PRISM, and the land cover data were derived from the National Land Cover Database (NLCD).


hp = pd.read_csv("D:/mydata/high_plains_data.csv", sep=",", header=0, encoding="ISO-8859-1")
hp.columns =[column.replace(" ", "_") for column in hp.columns] 
print(hp.head(10))
   GEOID10    ID             NAME_1 STATE_NAME ST_ABBREV         AREA  \
0     8053  8053    Hinsdale County   Colorado        CO  1123.212157   
1     8061  8061       Kiowa County   Colorado        CO  1785.889183   
2     8063  8063  Kit Carson County   Colorado        CO  2161.740894   
3     8071  8071  Las Animas County   Colorado        CO  4775.291392   
4     8073  8073     Lincoln County   Colorado        CO  2586.326731   
5     8075  8075       Logan County   Colorado        CO  1845.045782   
6     8079  8079     Mineral County   Colorado        CO   877.691373   
7     8085  8085    Montrose County   Colorado        CO  2242.693513   
8     8087  8087      Morgan County   Colorado        CO  1293.760087   
9     8089  8089       Otero County   Colorado        CO  1269.475894

   TOTPOP10  POPDENS10  MALES10  FEMALES10  ...  TOTHH00  FAMHH00  TOTHU00  \
0       843        0.8      443        400  ...      359      247     1304   
1      1398        0.8      687        711  ...      665      452      817   
2      8270        3.8     4638       3632  ...     2990     2081     3430   
3     15507        3.2     7948       7559  ...     6173     4095     7629   
4      5467        2.1     3163       2304  ...     2058     1389     2406   
5     22709       12.4    12924       9785  ...     7551     5064     8424   
6       712        0.8      362        350  ...      377      251     1119   
7     41276       18.4    20310      20966  ...    13043     9311    14202   
8     28159       22.0    13898      14261  ...     9539     6969    10410   
9     18831       14.9     9221       9610  ...     7920     5473     8813

   OWNER00  RENTER00    per_for         elev       temp      precip   precip2  
0      233       126  34.962704  3334.739362   1.410638  800.246066  8.002461  
1      474       191   0.002120  1269.637288  11.159492  384.201797  3.842018  
2     2151       839   0.021467  1348.005495  10.179698  435.414037  4.354140  
3     4360      1813  11.687193  1834.354115  10.439925  424.757731  4.247577  
4     1421       637   0.043238  1564.500000   9.573578  376.228555  3.762286  
5     5277      2274   0.124509  1278.251592   9.749204  425.278696  4.252787  
6      279        98  38.846741  3212.506757   1.830405  856.293446  8.562934  
7     9773      3270  33.571025  2122.834667   8.327253  447.202160  4.472022  
8     6525      3014   0.112152  1372.481651   9.731422  369.794908  3.697949  
9     5471      2449   0.024750  1355.050926  11.780972  338.015001  3.380150

[10 rows x 157 columns]

I then use matplotlib to create scatter plots to visualize the relationship between temperature and the three predictor variables. Generally, there seems to be a stronger correlation between temperature and elevation in comparison to temperature and the other two variables.

sp1 = plt.scatter(hp.elev, hp.temp)
plt.show(sp1)

png

sp1 = plt.scatter(hp.per_for, hp.temp)
plt.show(sp1)

png

sp1 = plt.scatter(hp.precip, hp.temp)
plt.show(sp1)

png

I subset out the temperature data and predictor variables into separate data frames. I then split the data into separate training and testing sets using the train_test_split() function from scikit-learn. This function requires that both the predictor variables and dependent variable be provided. You must also specify the proportion of the data to maintain for testing. A random state can be specified to allow for reproducibility.


yvar = hp[["temp"]]
xvars= hp[["elev", "per_for", "precip"]]
X_train, X_test, y_train, y_test = train_test_split(xvars, 
yvar, test_size=0.33, random_state=42)
X_trainElev = X_train[["elev"]]
X_testElev = X_test[["elev"]]
print(yvar.head())
print(xvars.head())
        temp
0   1.410638
1  11.159492
2  10.179698
3  10.439925
4   9.573578
          elev    per_for      precip
0  3334.739362  34.962704  800.246066
1  1269.637288   0.002120  384.201797
2  1348.005495   0.021467  435.414037
3  1834.354115  11.687193  424.757731
4  1564.500000   0.043238  376.228555

I next initiate a model and fit using the training data and only the elevation predictor variable. Once the model is fit, I can obtain the intercept and coefficient. So, the resulting formula is:

Temperature = 10.8 - 0.00219*Elevation.

I then predict to the withheld test data and use the results to calculate the root mean square error or RMSE. The model resultd in an RMSE of 2.66, which suggests that the predicted temperature is off by 2.66 degrees Celsius on average for the withheld counties.

lmMod = LinearRegression()
lmMod.fit(X_trainElev, y_train)
LinearRegression()
print(f"intercept: {lmMod.intercept_}")

print(f"coefficients: {lmMod.coef_}")
intercept: [10.80000493]
coefficients: [[-0.00219406]]
preds = lmMod.predict(X_testElev)
mean_squared_error(y_test, preds, squared=False)
2.658858018975682

I now fit a new model using all three predictor variables. This results in a fitted y-intercept and three coefficients, one for each predictor variable. The RMSE actually increases to 2.89, suggesting that including of the precipitation and percent forest cover variables did not improve the model performance in comparison to just using the mean county elevation.

lmMod = LinearRegression()
lmMod.fit(X_train, y_train)
LinearRegression()
print(f"intercept: {lmMod.intercept_}")

print(f"coefficients: {lmMod.coef_}")
intercept: [2.58132095]
coefficients: [[ 0.00113803 -0.20148305  0.0105595 ]]
preds = lmMod.predict(X_test)
mean_squared_error(y_test, preds, squared=False)
2.887478419012105

Concluding Remarks

This short module provided a brief introduction to performing statistical tests and linear regression in Python. If you need to perform different tests, I think you will find that the syntax is very similar. The general workflow is to (1) read in the required data, (2) prepare the data so that it can be provided to the test of interest, (3) implement the test, and (4) interpret the results. It is important to note that most tests have assumptions that should be evaluated to make sure the results are not misleading. We will further explore predictive modeling using scikit-learn in the machine learning module.