
Linear Regression

Notes

Accuracy

  • Residual sum of squares (RSS) = \(\sum_{i=1}^n (y_i - \hat{y}_i)^2\).
  • Residual standard error (RSE) = \(\sqrt{\frac{1}{n-2}\text{RSS}}\).
  • \(R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}\).
  • Total sum of squares (TSS) = \(\sum_{i=1}^n (y_i - \bar{y})^2\). (A quick numerical sketch of these quantities follows this list.)
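
A minimal sketch of these quantities in Python, assuming y and y_hat are NumPy arrays of observed and fitted responses from a simple linear regression:

import numpy as np

def fit_metrics(y, y_hat):
    """RSS, RSE, TSS and R^2 for a simple linear regression (one predictor)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
    rse = np.sqrt(rss / (n - 2))          # n - 2 degrees of freedom for one predictor
    tss = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    r2 = 1 - rss / tss
    return rss, rse, tss, r2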

Outliers & High Leverage points

  • Outlier is an observation for which the predicted response is far from the recorded response.
  • High leverage point is an observation for which the recorded predictors are unusual.
  • High leverage points usually have more influence on the model than outliers.
  • The studentized (standardized) residuals are typically expected to lie between -3 and 3. Values outside this range indicate possible outliers.
  • The average leverage is \((p + 1) / n\). If the leverage of an observation is significantly higher than the average leverage, then it is a high leverage point. (Look into “Cook’s distance”.) A sketch of how to compute both quantities with statsmodels follows this list.
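
A minimal sketch of these checks with statsmodels, assuming model is an already fitted OLS results object (the 3x-average-leverage threshold is my own rule-of-thumb choice):

import numpy as np

# Assumes `model` is a fitted statsmodels OLS results object.
influence = model.get_influence()
studentized = influence.resid_studentized_internal  # internally studentized residuals
leverage = influence.hat_matrix_diag                # leverage values h_i
avg_leverage = (model.df_model + 1) / model.nobs    # (p + 1) / n

possible_outliers = np.where(np.abs(studentized) > 3)[0]
high_leverage = np.where(leverage > 3 * avg_leverage)[0]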

Exercises

Question 1

Since there are three p-values in Table 3.4, there are three null hypotheses:

  • H_0^(1): sales does not depend on TV
  • H_0^(2): sales does not depend on radio
  • H_0^(3): sales does not depend on newspaper

The p-values corresponding to the first two null hypotheses are very small, less than 10^-4, whereas the p-value for the last one is close to 1. This means that the probability of seeing the observed sales numbers if H_0^(1) or H_0^(2) is true is almost zero, but the probability of seeing the observed sales numbers if H_0^(3) is true is almost 1. In other words sales depends on TV and radio but not on newspaper.

Question 2

A KNN classifier assigns an observation to the class that holds the majority among its K nearest neighbors. In KNN regression the response is estimated as the average of the training responses of the K nearest neighbors.
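
A minimal sketch of the two methods side by side, using scikit-learn (my choice of library here; the data below is synthetic):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y_cont = 2 * X[:, 0] + rng.normal(size=50)   # continuous response
y_class = (y_cont > 0).astype(int)           # binary response derived from it

# KNN regression: predict the average response of the 5 nearest neighbours.
knn_reg = KNeighborsRegressor(n_neighbors=5).fit(X, y_cont)
# KNN classification: predict the majority class among the 5 nearest neighbours.
knn_clf = KNeighborsClassifier(n_neighbors=5).fit(X, y_class)

x_new = np.array([[0.3]])
print(knn_reg.predict(x_new), knn_clf.predict(x_new))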

Question 3

a) Assuming X_4 = X_1 X_2 and X_5 = X_1 X_3, the fitted model is Y = 50 + 20 X_1 + 0.07 X_2 + 35 X_3 + 0.01 X_1 X_2 - 10 X_1 X_3. Since X_3 (Gender) is 1 for females and 0 for males, this fit essentially gives two models depending on gender: Y = 85 + 10 X_1 + 0.07 X_2 + 0.01 X_1 X_2 for females, and Y = 50 + 20 X_1 + 0.07 X_2 + 0.01 X_1 X_2 for males. For fixed X_1 (GPA) and X_2 (IQ), the starting salary for males will be higher on average than that of females when 50 + 20 X_1 > 85 + 10 X_1, i.e. when X_1 > 35 / 10 = 3.5. So provided the GPA is high enough, males earn more on average than females, which is answer (iii). b) If IQ is 110 and GPA is 4.0, a female is predicted to have a starting salary of $137,100 (see the check below). c) False. The scale of IQ is much larger than that of GPA or Gender, so the coefficients of any IQ-based predictor are expected to be small; the size of a coefficient by itself says nothing about the significance of the interaction, which should instead be judged from its p-value or confidence interval. (This is possibly why it is often recommended to scale the predictors, so they are all on an equal footing and it is easier to see which ones matter.)
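
A quick arithmetic check of the prediction in part b), plugging GPA = 4.0, IQ = 110, and Gender = 1 into the fitted model:

gpa, iq, gender = 4.0, 110, 1   # female graduate
salary = (50 + 20 * gpa + 0.07 * iq + 35 * gender
          + 0.01 * gpa * iq - 10 * gpa * gender)
print(salary)   # 137.1, i.e. $137,100 since the response is in thousands of dollars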

Question 4

a) Even though the true relationship is linear, we need to take into account the irreducible error \(ϵ\). The cubic model is more flexible than the linear model and will therefore fit some of this noise in the training data, so I expect the cubic model to have a lower training RSS than the linear model. b) The linear model will have a lower test RSS than the cubic model since it matches the true relationship; the cubic model will have overfitted to the irreducible error in the training data and will therefore perform worse on the test data (see the small simulation below). c) Irrespective of the true relationship I would expect the cubic model to have a lower training RSS than the linear model, because it is more flexible and thus fits the training data better. d) Since we do not know the true relationship between X and Y, it is difficult to say which of the two models will have a lower test RSS without more information.
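
A small simulation illustrating parts a) and b) — a sketch under an assumed linear truth \(Y = 1 + 2X + ϵ\) and my own choice of seed:

import numpy as np

rng = np.random.default_rng(0)

def rss(y, y_hat):
    return np.sum((y - y_hat) ** 2)

# True relationship is linear: Y = 1 + 2 X + eps.
x_train, x_test = rng.normal(size=100), rng.normal(size=100)
y_train = 1 + 2 * x_train + rng.normal(size=100)
y_test = 1 + 2 * x_test + rng.normal(size=100)

linear = np.polyfit(x_train, y_train, deg=1)
cubic = np.polyfit(x_train, y_train, deg=3)

print("train:", rss(y_train, np.polyval(linear, x_train)),
      rss(y_train, np.polyval(cubic, x_train)))   # cubic is always <= linear
print("test: ", rss(y_test, np.polyval(linear, x_test)),
      rss(y_test, np.polyval(cubic, x_test)))     # typically linear <= cubic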

Question 5

We just have to substitute equation 3.38 into the expression for the \(i\)th fitted value and rearrange: \begin{align} \hat{y}_i = x_i \hat{β} = x_i \frac{\sum_{i'=1}^n x_{i'} y_{i'}}{\sum_{k=1}^n x_k^2} = \sum_{i'=1}^n \frac{x_i x_{i'}}{\sum_{k=1}^n x_k^2} y_{i'}, \end{align} so that \begin{align} a_{i'} = \frac{x_i x_{i'}}{\sum_{k=1}^n x_k^2}. \end{align}

Question 6

The equation for the least squares line is \(y = \hat{β}_0 + \hat{β}_1 x\). From equation 3.4 we have \(\hat{β}_0 = \bar{y} - \hat{β}_1 \bar{x}\). Then putting \(x = \bar{x}\) in the least squares line equation we see that \(y = \bar{y}\). Thus the least squares line always passes through \((\bar{x}, \bar{y})\).
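
A quick numerical check of this fact on synthetic data (the data and seed are my own choices):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1 + 3 * x + rng.normal(size=50)

fit = sm.OLS(y, sm.add_constant(x)).fit()
b0, b1 = fit.params
# The fitted line evaluated at x-bar recovers y-bar.
print(np.isclose(b0 + b1 * x.mean(), y.mean()))  # True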

Question 7

Assumption: \(\bar{x} = \bar{y} = 0\). From equation 3.4 we have \(\hat{y}_i = \hat{β}_1 x_i\), where \begin{align} \hat{β}_1 = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}. \end{align} And from equation 3.17 we have \begin{align} R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n y_i^2}. \end{align} Expanding the square in the numerator of the above fraction and substituting \(\hat{y}_i\) with \(\hat{β}_1 x_i\) we get \begin{align} R^2 = \frac{\hat{β}_1 \left(2 \sum_{i=1}^n x_i y_i - \hat{β}_1 \sum_{i=1}^n x_i^2\right)}{\sum_{i=1}^n y_i^2}. \end{align} Now if we substitute the above form of \(\hat{β}_1\) into this equation we find that \(R^2 = \mathrm{Cor}(X, Y)^2\), where \(\mathrm{Cor}(X, Y)\) is given by equation 3.18.
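
A quick numerical check of this identity on synthetic zero-mean data (the data and seed are my own choices):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 1.5 * x + rng.normal(size=200)
x, y = x - x.mean(), y - y.mean()   # enforce zero sample means

beta1 = np.sum(x * y) / np.sum(x ** 2)                  # slope of the no-intercept fit
r2 = 1 - np.sum((y - beta1 * x) ** 2) / np.sum(y ** 2)  # R^2 as in equation 3.17
print(np.isclose(r2, np.corrcoef(x, y)[0, 1] ** 2))     # True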

Question 8

Simple linear regression on Auto data set

import pandas as pd
import numpy as np
from tabulate import tabulate

auto = pd.read_csv("data/Auto.csv")
print(tabulate(auto.head(), auto.columns, tablefmt="orgtbl"))

Recall from the last chapter that horsepower has some missing values and needs to be converted to a numeric type before we can use it for linear regression.

auto.drop(auto[auto.horsepower == "?"].index, inplace=True)
auto["horsepower"] = pd.to_numeric(auto["horsepower"])

For simple linear regression we can use the ordinary least squares model from statsmodels or the linear regression model from scikit-learn. In this question we are asked to print the summary of the fitted model. scikit-learn has no method for generating a summary, but statsmodels does.

import statsmodels.formula.api as smf

model = smf.ols(formula="mpg~horsepower", data=auto).fit()
print(model.summary())

a. The F-statistic is much larger than 1, and the associated p-value (Prob (F-statistic) in the table) is essentially zero. This indicates that there is a relationship between mpg and horsepower. b. The \(R^2\) value of 0.606 indicates that this relationship explains around 61% of the variance in mpg. c. The coefficient for horsepower is negative, so the relationship between mpg and horsepower is negative. d.

pred = model.get_prediction(exog=dict(horsepower=98))
pred_summary = pred.summary_frame()
print(tabulate(pred_summary, pred_summary.columns, tablefmt="orgtbl"))

The predicted mpg for horsepower = 98 is 24.4671. The 95% confidence interval is \([23.9731, 24.9611]\), and the 95% prediction interval is \([14.8094, 34.1248]\). As mentioned in the text, the prediction interval contains the confidence interval.

Least squares plot

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("ticks")

X = auto["horsepower"]
Y = auto["mpg"]
Ypred = model.predict(X)

plt.close("all")
fig, ax = plt.subplots()
ax.plot(X, Y, 'o', label="Data")
ax.plot(X, Ypred, 'r', label="Least Squares Regression")
ax.legend(loc="best")
sns.despine()
fig.savefig("img/3_auto_ls.png", dpi=90)

Diagnostic plots

The R command plot in this case gives four diagnostic plots:

  • Residuals vs Fitted
  • Normal Q-Q
  • Scale-Location
  • Residuals vs Leverage

The Residuals vs Fitted plot shows any non-linear pattern in the residuals, and by extension in the data. The Normal Q-Q plot shows if the residuals are normally distributed. The Scale-Location plot shows if there is heteroscedasticity. The Residuals vs Leverage plot shows if there are high leverage points in the data.

We will produce these plots using statsmodels and seaborn.

Residuals vs Fitted plot

fitted_vals = model.fittedvalues

plt.close("all")
fig, ax = plt.subplots()
residplot = sns.residplot(x=fitted_vals, y="mpg", data=auto,
                          lowess=True,
                          line_kws={"color": "red"},
                          ax=ax)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
sns.despine()
fig.savefig("img/3_res_vs_fit.png", dpi=90)

This plot clearly shows that there is non-linearity in the data.

Normal Q-Q plot

import statsmodels.api as sm

residuals = model.resid

plt.close("all")
fig, ax = plt.subplots()
qqplot = sm.qqplot(residuals, line='45', ax=ax, fit=True)
ax.set_ylabel("Standardized Residuals")
sns.despine()
fig.savefig("img/3_auto_qq.png", dpi=90)

Though there are some points that are far from the \(45°\) fitted line, most of the points lie close to the line, indicating that the residuals are mostly normally distributed.

Scale-Location plot

# normalized residuals and their square roots
norm_residuals = model.get_influence().resid_studentized_internal
norm_residuals_abs_sqrt = np.sqrt(np.abs(norm_residuals))

plt.close("all")
fig, ax = plt.subplots()
slplot = sns.regplot(x=fitted_vals, y=norm_residuals_abs_sqrt,
                     lowess=True,
                     line_kws={"color" : "red"},
                     ax=ax)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Sqrt of |Standardized Residuals|")
sns.despine()
fig.savefig("img/3_auto_scale_loc.png", dpi=90)

This plot is similar to the first diagnostic plot, except that the quantity on the y-axis is now non-negative. It shows that homoscedasticity does not hold, i.e. the variance of the residuals is not constant.

Residuals vs Leverage plot

plt.close("all")
fig, ax = plt.subplots()
rlplot = sm.graphics.influence_plot(model, criterion="Cooks", ax=ax)
sns.despine()
ax.set_xlabel("Leverage")
ax.set_ylabel("Standardized Residuals")
ax.set_title(" ")
fig.savefig("img/3_auto_res_vs_lev.png", dpi=90)

We see that none of the points have a very high leverage.

Question 9

Scatter plot matrix of Auto data set

plt.close("all")
spm = sns.pairplot(auto, plot_kws = {'s': 10})
spm.fig.set_size_inches(12, 12)
spm.savefig("img/3_auto_scatter.png", dpi=90)

Correlation matrix

I find heat maps better than tables for visualizing correlation matrices. Since the correlation matrix is symmetric we can ignore either the lower or the upper triangle. We can also ignore the diagonal since it is always 1.

corr_mat = auto[auto.columns[:-1]].corr()

plt.close("all")
fig, ax = plt.subplots()

# Custom diverging color map.
cmap = sns.diverging_palette(220, 10, sep=80, n=7)

# Mask for upper triangle.
mask = np.triu(np.ones_like(corr_mat, dtype=bool))

with sns.axes_style("white"):
    sns.heatmap(corr_mat, mask=mask, cmap=cmap, annot=True, robust=True, ax=ax)

fig.savefig("img/3_auto_corr_heat.png")

We see that mpg has considerable negative correlations with cylinders, displacement, horsepower, and weight. This matches what we saw in the scatter plot matrix above. Similarly cylinders, displacement, horsepower and weight are all correlated with each other.

Multiple linear regression with Auto data set

We could do this with the statsmodels.formula API but that involves more typing, so we will use the statsmodels API.

import statsmodels.api as sm

Y = auto["mpg"]
X = auto[auto.columns[1:-1]]
X = sm.add_constant(X) # For the intercept.
ml_model = sm.OLS(Y, X).fit()
print(ml_model.summary())

a. The large F-statistic indicates that we can reject the null hypothesis, which states that the response mpg does not depend on any of the predictors. The probability of seeing this data if the null hypothesis were true is essentially zero (2.04E-139). b. Looking at the p-values of the individual predictors we see that weight, year, and origin have the most statistically significant relationship with mpg. We can also argue that displacement has a somewhat significant relationship with mpg. On the other hand cylinders, horsepower, and acceleration do not have a statistically significant relationship. This is not necessarily surprising: given the correlation between mpg, displacement, cylinders and horsepower, one can argue that the information in cylinders and horsepower is largely redundant. c. The coefficient for year suggests that, holding the other predictors fixed, mpg increases by 0.7508 per year, i.e. cars become more fuel-efficient every year.

Diagnostic plots

We make the same diagnostic plots as the previous exercise.

Residuals vs Fitted plot

fitted_vals = ml_model.fittedvalues

plt.close("all")
fig, ax = plt.subplots()
residplot = sns.residplot(x=fitted_vals, y="mpg", data=auto,
                          lowess=True,
                          line_kws={"color": "red"},
                          ax=ax)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
sns.despine()
fig.savefig("img/3_ml_res_vs_fit.png", dpi=90)

This plot clearly shows that there is non-linearity in the data.

Normal Q-Q plot

import statsmodels.api as sm

residuals = ml_model.resid

plt.close("all")
fig, ax = plt.subplots()
qqplot = sm.qqplot(residuals, line='45', ax=ax, fit=True)
ax.set_ylabel("Standardized Residuals")
sns.despine()
fig.savefig("img/3_ml_auto_qq.png", dpi=90)

Though there are some points that are far from the \(45°\) fitted line, most of the points lie close to the line, indicating that the residuals are mostly normally distributed.

Scale-Location plot

# normalized residuals and their square roots
norm_residuals = ml_model.get_influence().resid_studentized_internal
norm_residuals_abs_sqrt = np.sqrt(np.abs(norm_residuals))

plt.close("all")
fig, ax = plt.subplots()
slplot = sns.regplot(x=fitted_vals, y=norm_residuals_abs_sqrt,
                     lowess=True,
                     line_kws={"color" : "red"},
                     ax=ax)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Sqrt of |Standardized Residuals|")
sns.despine()
fig.savefig("img/3_ml_auto_scale_loc.png", dpi=90)

The variance in the standardized residuals is smaller than in the simple regression plot, but there is still quite a bit of spread, which means homoscedasticity does not hold.

Residuals vs Leverage plot

plt.close("all")
fig, ax = plt.subplots()
rlplot = sm.graphics.influence_plot(ml_model, criterion="Cooks", ax=ax)
sns.despine()
ax.set_xlabel("Leverage")
ax.set_ylabel("Standardized Residuals")
ax.set_title(" ")
fig.savefig("img/3_ml_auto_res_vs_lev.png", dpi=90)

Point 13 has a high leverage but not a very high residual.

Interaction effects

We go back to the statsmodels.formula API. Additionally I will drop cylinders, horsepower, and acceleration from the model. I will try the following interaction terms:

  • year : origin : this will add a new predictor which is a product of year and origin, but will not include year and origin separately,
  • year * origin : this will add the product of year and origin, but also include year and origin separately,
  • year * weight : same as the above, except for weight in place of origin.
ml_model_1 = smf.ols(formula="mpg ~ displacement + weight + year : origin",
                     data=auto).fit()
ml_model_2 = smf.ols(formula="mpg ~ displacement + weight + year * origin",
                     data=auto).fit()
ml_model_3 = smf.ols(formula="mpg ~ displacement + weight * year + origin",
                     data=auto).fit()

Summary of first interaction model

print(ml_model_1.summary())

The large F-statistic lets us reject the null hypothesis. The individual p-values show that displacement is not really significant, but the product of year and origin is. Additionally the \(R^2\) value tells us that this model explains around 71% of the variance in mpg.

Summary of second interaction model

print(ml_model_2.summary())

The \(R^2\) value has increased; it is now almost the same as the \(R^2\) value for the model with all the quantitative predictors but no interaction. This model explains around 82% of the variance in mpg. Additionally we see that year is more significant than origin or the product of year and origin. In addition to displacement, the intercept term also appears to be insignificant.

Summary of third interaction model

print(ml_model_3.summary())

The \(R^2\) value increased a bit, and it appears that, other than displacement, all the predictors are significant.

Interaction model with two interactions

We will try an additional model with two interactions: displacement * weight and weight * year.

ml_model_4 = smf.ols("mpg ~ displacement * weight + weight * year + origin",
                     data=auto).fit()
print(ml_model_4.summary())

This is interesting. We see that the \(R^2\) value has increased further, but now displacement has become significant whereas weight and origin have become relatively insignificant. The interaction terms are still significant. My understanding of this model is that while weight does not directly affect mpg, it increases displacement, and that affects the mpg.

Models with variable transformations

ml_model_trans = smf.ols(formula="mpg ~ np.log(weight) + np.power(weight, 2) + year",
                         data=auto).fit()
print(ml_model_trans.summary())

The F-statistic and the p-values indicate that these transformations are statistically significant.

Question 10

Multiple regression with Carseats data set

So far I had been loading the data sets from local .csv files, but I recently found out that statsmodels makes them automatically available using the Rdatasets project. So going forward I will be using that whenever possible.

import statsmodels.api as sm

carseats = sm.datasets.get_rdataset("Carseats", package="ISLR")
print(carseats.__doc__)

Multiple linear regression to predict Sales using Price, Urban, and US.

import statsmodels.formula.api as smf

model = smf.ols(formula="Sales ~ Price + Urban + US",
                data=carseats.data).fit()

print(model.summary())

The F-statistic is larger than 1, though much smaller than the F-statistics in the last problem. Its p-value (Prob (F-statistic)) is still essentially zero, so we can reject the null hypothesis that none of the predictors is related to Sales.

From the individual p-values we can conclude that Urban is not a statistically significant predictor for Sales.

Interpretation of coefficient of predictors

Since Urban is not a statistically significant predictor we do not need to worry about its coefficient. The coefficient for US indicates that if the store is in the US then sales increase by about 1200 units on average. On the other hand, the coefficient for Price says that an increase in price results in a decrease in sales.

Linear model equation

The equation for this model is \begin{align} Y = 13.04 - 0.02 X_1 + 1.20 X_2 - 0.05 X_3, \end{align} where \(Y\), \(X_1\), \(X_2\), and \(X_3\) stand for Sales, Urban, US, and Price, respectively. \(X_1 = 1\) if the store is in an urban location, and \(0\) otherwise. Similarly \(X_2 = 1\) if the store is in the US, and \(0\) if it is not.
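
As a quick illustration of how to read this equation, here is a sketch that predicts sales for a hypothetical store (the store attributes below are made up) using the rounded coefficients above:

def predict_sales(urban, us, price):
    """Prediction from the rounded coefficients quoted above."""
    return 13.04 - 0.02 * urban + 1.20 * us - 0.05 * price

# A hypothetical non-urban US store charging $120:
print(predict_sales(urban=0, us=1, price=120))  # ~8.24, i.e. about 8,240 units, since Sales is in thousands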

Null hypothesis rejection

We can reject the null hypothesis for US and Price, since the associated p-values are effectively 0.

Smaller multiple linear model for Carseats sales

small_model = smf.ols(formula="Sales ~ US + Price", data=carseats.data).fit()
print(small_model.summary())

Based on the \(R^2\) values, both models fit the data about equally well.

Confidence intervals of fitted parameters

print(small_model.conf_int())

The confidence intervals are also printed in the summary, but this is probably more convenient.

Outliers and leverages

To see if there are any leverage points we need to first calculate the average leverage, \((p + 1) / n\), for the data.

npredictors = 2
nobservations = len(carseats.data)
avg_leverage = (npredictors + 1) / nobservations

print(f"Average leverage: {avg_leverage}")

The Residuals vs Leverage plot is the easiest way to check for outliers and high leverage observations.

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("ticks")

plt.close("all")
fig, ax = plt.subplots()
rlplot = sm.graphics.influence_plot(small_model, criterion="Cooks", ax=ax)
sns.despine()
ax.set_xlabel("Leverage")
ax.set_ylabel("Standardized Residuals")
ax.set_title(" ")
fig.savefig("img/3.10.h_res_vs_lev.png", dpi=90)

All the residuals are between -3 and 3, so there are no outliers. However there are a lot of points whose leverage greatly exceeds the average leverage. Thus there are high leverage observations.

Question 11

Simple linear regression with synthetic data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("ticks")

rng = np.random.default_rng(seed=42)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

data = pd.DataFrame({"X" : x, "Y" : y})

plt.close("all")
fig, ax = plt.subplots()
sp = sns.scatterplot(x="X", y="Y", data=data, ax=ax)
sns.despine()
fig.savefig("img/3.11.a_data.png", dpi=90)

Now we do a simple linear regression with this synthetic data. This model will not have an intercept: \(Y = βX\).

import statsmodels.formula.api as smf

model = smf.ols("Y ~ X - 1", data=data).fit()
print(model.summary())

The coefficient estimate is \(\hat{β} = 2.1196\) with a standard error of \(0.126\). The t-statistic is 16.833 and the associated p-value is 0. This means we can reject the null hypothesis.

Inverse simple linear relation with synthetic data

We are going to use the same data, but now with X as the response and Y as the predictor.

model2 = smf.ols("X ~ Y - 1", data=data).fit()
print(model2.summary())

The coefficient estimate is \(\hat{β} = 0.3496\) with a standard error of \(0.021\). The t-statistic is \(16.833\) and the associated p-value is 0. This means we can reject the null hypothesis.

Relation between the two regressions

Given the underlying true model we might have expected the coefficients of the two regressions to be multiplicative inverses of each other, but they are not. The reason is that the two regressions minimize different residual sums of squares: \begin{align} \mathrm{RSS}^{(1)} &= \sum_{i=1}^n (y_i - \hat{β}^{(1)} x_i)^2, \\ \mathrm{RSS}^{(2)} &= \sum_{i=1}^n (x_i - \hat{β}^{(2)} y_i)^2. \end{align} \(\mathrm{RSS}^{(1)}\) is minimized when \(\hat{β}^{(1)} = \sum_i x_i y_i / \sum_i x_i^2\), and \(\mathrm{RSS}^{(2)}\) is minimized when \(\hat{β}^{(2)} = \sum_i x_i y_i / \sum_i y_i^2\). If \(\hat{β}^{(1)} = 1 / \hat{β}^{(2)}\) then we would need \begin{align} \left(\sum_{i=1}^n x_i y_i\right)^2 = \sum_{i=1}^n x_i^2 \sum_{i=1}^n y_i^2. \end{align} Since here \(X\) and \(Y\) are random variables with zero mean we can interpret this equation as \begin{align} \mathrm{Cov}(X, Y)^2 = \mathrm{Var}(X) \mathrm{Var}(Y), \end{align} i.e. \(|\mathrm{Cor}(X, Y)| = 1\). This holds only if the true relation is \(y_i = β x_i + γ\) for some constants \(β ≠ 0\) and \(γ\), with no noise (see DeGroot and Schervish, Theorem 4.6.3, or ProofWiki for a proof of this statement). But the true relation in this case was \(y_i = β x_i + ϵ\), where \(ϵ\) is a Gaussian random variable with zero mean. Thus the equality does not hold, and hence \(\hat{β}^{(1)} ≠ 1 / \hat{β}^{(2)}\). For a more detailed discussion see Stats StackExchange.
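
A quick numerical check of these closed-form expressions (a sketch reusing the x, y arrays and the fitted model and model2 objects from above):

beta1_hat = np.sum(x * y) / np.sum(x ** 2)   # minimizer of RSS^(1), i.e. Y ~ X - 1
beta2_hat = np.sum(x * y) / np.sum(y ** 2)   # minimizer of RSS^(2), i.e. X ~ Y - 1

print(np.isclose(beta1_hat, model.params["X"]))   # True
print(np.isclose(beta2_hat, model2.params["Y"]))  # True
print(np.isclose(beta1_hat, 1 / beta2_hat))       # False: the slopes are not inverses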

t-statistic for first model

The t-statistic for a simple linear fit without intercept is \(\hat{β} / \mathrm{SE}(\hat{β})\), where \(\hat{β} = \sum_i x_i y_i / \sum_i x_i^2\), and the standard error is \begin{align} \mathrm{SE}(\hat{β}) = \sqrt{\frac{\sum_i (y_i - x_i \hat{β})^2}{(n-1) \sum_i x_i^2}}. \end{align} Substituting the expression for \(\hat{β}\) into the expressions for the standard error and the t-statistic gives the expected form \begin{align} t = \frac{\sqrt{n-1} \sum_i x_i y_i}{\sqrt{\sum_i x_i^2 \sum_i y_i^2 - \left(\sum_i x_i y_i\right)^2}}. \end{align} The trick is to realize that the summation indices are dummy variables, i.e. \(\sum_{i=1}^n x_i^2 = \sum_{j=1}^n x_j^2\). Numerically we can confirm this as follows:

n = len(data)
x, y = data["X"], data["Y"]
t = (np.sqrt(n - 1) * np.sum(x * y)
     / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2) - np.sum(x * y) ** 2))
print(f"t-statistic: {t:.3f}")

t-statistic for second model

The expression for the t-statistic is symmetric in \(X\) and \(Y\), so irrespective of whether we are regressing \(Y\) onto \(X\) or \(X\) onto \(Y\), we will have the same t-statistic.

t-statistic for models with intercept

model3 = smf.ols(formula="Y ~ X", data=data).fit()
model4 = smf.ols(formula="X ~ Y", data=data).fit()
print(model3.summary())

print(model4.summary())

We can see that the t-statistic for the predictor is the same in both models.

Question 12

Equal regression coefficients

As this is a regression without an intercept we can use the expressions derived in the previous question. The coefficients will be the same when \(\sum_i x_i^2 = \sum_i y_i^2\); in particular, this holds when \(X = Y\).

Different coefficient estimates - numerical

Essentially reusing question 11.

rng = np.random.default_rng(seed=42)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)

data = pd.DataFrame({"X" : x, "Y" : y})

plt.close("all")
fig, ax = plt.subplots()
sp = sns.scatterplot(x="X", y="Y", data=data, ax=ax)
sns.despine()
fig.savefig("img/3.12.b_data.png", dpi=90)
model1 = smf.ols("Y ~ X - 1", data=data).fit()
model2 = smf.ols("X ~ Y - 1", data=data).fit()
print(model1.summary())

print(model2.summary())

Same coefficient estimates - numerical

We need \(∑_i x_i^2 = ∑_i y_i^2\). Setting \(X = Y\) works, but \(Y = \mathrm{Permutation}(X)\) would work too, and is more general.

rng = np.random.default_rng(seed=42)
x = rng.normal(size=100)
y = rng.permutation(x)

data = pd.DataFrame({"X" : x, "Y" : y})

plt.close("all")
fig, ax = plt.subplots()
sp = sns.scatterplot(x="X", y="Y", data=data, ax=ax)
sns.despine()
fig.savefig("img/3.12.b_data.png", dpi=90)
model3 = smf.ols("Y ~ X - 1", data=data).fit()
model4 = smf.ols("X ~ Y - 1", data=data).fit()
print(model3.summary())

print(model4.summary())

Question 13

Feature vector

rng = np.random.default_rng(seed=42)
x = rng.normal(loc=0, scale=1, size=100)

Error vector

eps = rng.normal(loc=0, scale=0.25, size=100)

Response vector

y = -1 + 0.5 * x + eps

The length of y is 100, and \(β_0 = -1\), and \(β_1 = 0.5\).

Scatter plot

df = pd.DataFrame({"x" : x, "y" : y})

plt.close("all")
sp = sns.jointplot(x="x", y="y", data=df) # jointplot also gives the distributions of x and y in addition to the scatter plot
sns.despine()
sp.savefig("img/3.13.d_scatter.png", dpi=90)

We can see a clear linear trend between x and y.

Least squares fit

model = smf.ols("y ~ x", data=df).fit()
print(model.summary())

The estimates for \(β_0\) and \(β_1\) are almost equal to the true values. The true values fall within the 95% confidence interval of the estimated values.

ypred = model.predict(df["x"])

plt.close("all")
fig, ax = plt.subplots()
ax.plot(x, y, 'o', label="Data")
ax.plot(x, ypred, 'r', label="Least Squares Regression")
ax.legend(loc="best")
sns.despine()
fig.savefig("img/3.13.f_ols.png", dpi=90)

Polynomial regression

poly_model = smf.ols(formula="y ~ x + I(x**2)", data=df).fit()
print(poly_model.summary())

The \(R^2\) values of both models are pretty much the same, and the p-value of the quadratic term is not close to zero. The quadratic term does not improve the model fit.

Least squares fit with less noise

The new data is as follows. The spread of the noise is now 0.1 instead of 0.25.

eps = rng.normal(loc=0, scale=0.1, size=100)
y = -1 + 0.5 * x + eps

df2 = pd.DataFrame({"x" : x, "y" : y})

plt.close("all")
sp = sns.jointplot(x="x", y="y", data=df2) # jointplot also gives the distributions of x and y in addition to the scatter plot
sns.despine()
sp.savefig("img/3.13.h_scatter.png", dpi=90)

Now the least squares fit to this data.

less_noisy_model = smf.ols("y ~ x", data=df2).fit()
print(less_noisy_model.summary())

The \(R^2\) value for this data set is much larger than the original data set. The model is able to explain 93% of the less noisy data, whereas it could only explain around 70% of the original data set.

ypred = less_noisy_model.predict(df2["x"])

plt.close("all")
fig, ax = plt.subplots()
ax.plot(x, y, 'o', label="Data")
ax.plot(x, ypred, 'r', label="Least Squares Regression")
ax.legend(loc="best")
sns.despine()
fig.savefig("img/3.13.h_ols.png", dpi=90)

Least squares fit with more noise

The new data is as follows. The spread of the noise is now 0.5 instead of 0.25.

eps = rng.normal(loc=0, scale=0.5, size=100)
y = -1 + 0.5 * x + eps

df3 = pd.DataFrame({"x" : x, "y" : y})

plt.close("all")
sp = sns.jointplot(x="x", y="y", data=df3)
sns.despine()
sp.savefig("img/3.13.i_scatter.png", dpi=90)

Now the least squares fit to this data.

more_noisy_model = smf.ols("y ~ x", data=df3).fit()
print(more_noisy_model.summary())

The \(R^2\) value for this data set is much smaller than it was for the original data set. The model is able to explain only 43% of the more noisy data, whereas it could explain around 70% of the original data.

ypred = more_noisy_model.predict(df3["x"])

plt.close("all")
fig, ax = plt.subplots()
ax.plot(x, y, 'o', label="Data")
ax.plot(x, ypred, 'r', label="Least Squares Regression")
ax.legend(loc="best")
sns.despine()
fig.savefig("img/3.13.i_ols.png", dpi=90)

From the graph it appears that there are possible outliers, which is not surprising given the spread of the error.

Confidence intervals of the three models

print("Confidence interval based on original data set:\n")
print(f"{model.conf_int()}\n")
print("Confidence interval based on less noisy data set:\n")
print(f"{less_noisy_model.conf_int()}\n")
print("Confidence interval based on more noisy data set:\n")
print(f"{more_noisy_model.conf_int()}\n")

The confidence intervals for the less noisy data set are the tightest and the confidence intervals for the more noisy data set are the loosest.

Question 14

Multiple linear model with collinearity

from numpy.random import MT19937

rng = np.random.default_rng(MT19937(seed=5))
x1 = rng.uniform(size=100)
x2 = 0.5 * x1 + rng.normal(size=100) / 10
y = 2 + 2 * x1 + 0.3 * x2 + rng.normal(size=100)

df_coll = pd.DataFrame({"x1" : x1, "x2" : x2, "y" : y})

The form of the linear model is \begin{align} Y = 2 + 2 X_1 + 0.3 X_2 + ϵ. \end{align}

Correlation scatter plot

corr = df_coll.corr()
print(corr)

The correlation between x1 and x2 is 0.768.

plt.close("all")
fig, ax = plt.subplots()
sp = sns.scatterplot(x="x1", y="x2", data=df_coll, ax=ax)
sns.despine()
fig.savefig("img/3.14.a_scatter.png", dpi=90)

Least squares fit with x1 and x2

coll_model1 = smf.ols("y ~ x1 + x2", data=df_coll).fit()
print(coll_model1.summary())

The estimated values of the coefficients are 1.869, 2.175, and 0.445, which are close to the true values, albeit with large standard errors, particularly for \(\hat{β}_2\). Based on the p-values we can reject the null hypothesis for \(β_1\), but we cannot reject the null hypothesis for \(β_2\).

Least squares fit with x1 only

coll_model2 = smf.ols("y ~ x1", data=df_coll).fit()
print(coll_model2.summary())

The coefficient value has increased, and the \(R^2\) value has decreased marginally. We can still reject the null hypothesis based on the p-value.

Least squares fit with x2 only

coll_model3 = smf.ols("y ~ x2", data=df_coll).fit()
print(coll_model3.summary())

The coefficient value is much larger now, but the \(R^2\) value has decreased. We can now reject the null hypothesis based on the p-value.

Contradiction of models

The three models do not contradict each other. Due to the high correlation between x1 and x2, we can predict x2 from x1. Thus in the original model x2 has very little explanatory power and so we cannot reject the null hypothesis for \(β_2\).

For the second and third models the increase in the coefficients can be explained as follows. Since x2 ≈ 0.5 x1, substituting this into the generating model gives y ≈ 2 + 2 x1 + 0.3 (0.5 x1) = 2 + 2.15 x1, so in the second model the effective true coefficient of x1 is 2.15. Similarly x1 ≈ 2 x2, so y ≈ 2 + 2 (2 x2) + 0.3 x2 = 2 + 4.3 x2, and in the third model the effective true coefficient of x2 is 4.3. The 95% confidence interval of the second model includes this new true value of the coefficient. Even though the 95% confidence interval of the third model does not include the new true value, it comes close. This is probably due to the difference between the random number generators used by R and numpy.

Additional data

new_row = pd.DataFrame({"x1" : [0.1], "x2" : [0.8], "y" : [6]})
df_coll = pd.concat([df_coll, new_row], ignore_index=True)

coll_model1 = smf.ols("y ~ x1 + x2", data=df_coll).fit()
coll_model2 = smf.ols("y ~ x1", data=df_coll).fit()
coll_model3 = smf.ols("y ~ x2", data=df_coll).fit()
print(coll_model1.summary())

print(coll_model2.summary())

print(coll_model3.summary())

The \(R^2\) values for models 1 and 2 decreased. This observation decreased the predictive power of the models. The average leverage for the data set is

p = len(df_coll.columns) - 1 # Number of predictors (x1 and x2).
n = len(df_coll)
lev = (p + 1) / n
print(f"{lev:.3f}")
plt.close("all")
fig, ax = plt.subplots()
rlplot = sm.graphics.influence_plot(coll_model1, criterion="Cooks", ax=ax)
sns.despine()
ax.set_xlabel("Leverage")
ax.set_ylabel("Standardized Residuals")
ax.set_title(" ")
fig.savefig("img/3.14.g_coll1_res_vs_lev.png", dpi=90)

For the first model it is both an outlier and a high leverage point.

plt.close("all")
fig, ax = plt.subplots()
rlplot = sm.graphics.influence_plot(coll_model2, criterion="Cooks", ax=ax)
sns.despine()
ax.set_xlabel("Leverage")
ax.set_ylabel("Standardized Residuals")
ax.set_title(" ")
fig.savefig("img/3.14.g_coll2_res_vs_lev.png", dpi=90)

For the second model it is just an outlier.

plt.close("all")
fig, ax = plt.subplots()
rlplot = sm.graphics.influence_plot(coll_model3, criterion="Cooks", ax=ax)
sns.despine()
ax.set_xlabel("Leverage")
ax.set_ylabel("Standardized Residuals")
ax.set_title(" ")
fig.savefig("img/3.14.g_coll3_res_vs_lev.png", dpi=90)

For the third model it is just a high leverage point.

Question 15

Predict per capita crime rate with the Boston data set

We load the Boston data set from statsmodels.

import statsmodels.api as sm

boston = sm.datasets.get_rdataset("Boston", "MASS")
print(boston.__doc__)

Earlier we fitted medv to the other predictors; now we will fit crim to the other predictors.

import statsmodels.formula.api as smf

df = boston.data
predictors = [c for c in df.columns if c != "crim"]
simple_models = {p : smf.ols(formula=f"crim ~ {p}", data=df).fit() for p in predictors}
print(f"predictor coefficient p-value")
for p, model in simple_models.items():
    print(f"{p:^9} {model.params[p]:>9,.4f} {model.pvalues[p]:>9,.4f}")

Except for chas everything else appears to be statistically significant.

import matplotlib.pyplot as plt
import seaborn as sns
from math import ceil

sns.set_style("ticks")
ncols = 4
nrows = ceil(len(predictors) / ncols)

plt.close("all")
fig, axs = plt.subplots(nrows=nrows, ncols=ncols, constrained_layout=True, figsize=(12, 10))
for i in range(nrows):
    for j in range(ncols):
        if i * ncols + j < len(predictors):
            sns.regplot(x=df[predictors[i * ncols + j]], y=df["crim"], ax=axs[i, j], line_kws={"color" : "r"})
            sns.despine()
fig.savefig("img/3.15.a_reg_mat.png", dpi=120)

Multiple regression with Boston data set

Y = df['crim']
X = df[predictors]
X = sm.add_constant(X)

ml_model = sm.OLS(Y, X).fit()
print(ml_model.summary())

Based on the p-values we can reject the null hypothesis for dis, rad, and medv. If we are willing to use a less stringent significance level we can also reject the null hypothesis for zn, nox, black, and lstat.

Comparison plot

import pandas as pd

ml_coefs = ml_model.params
sl_coefs = pd.Series({p : simple_models[p].params.loc[p] for p in predictors})
coef_df = pd.concat([sl_coefs, ml_coefs], axis=1)
coef_df.reset_index(inplace=True)
coef_df.columns = ["Predictors", "Simple OLS Coefficient", "Multiple OLS Coefficient"]
coef_df.dropna(inplace=True)
print(coef_df)

plt.close("all")
fig, ax = plt.subplots()
sns.scatterplot(x="Simple OLS Coefficient", y="Multiple OLS Coefficient", data=coef_df, ax=ax)
sns.despine()
fig.savefig("img/3.15.c_comp_plot.png")

The coefficients for nox are very different in the two models.

Evidence of non-linear associations

We will use scikit-learn to generate the non-linear features.

from sklearn.preprocessing import PolynomialFeatures
pd.options.display.float_format = "{:,.3f}".format

Y = df['crim']
poly_features = PolynomialFeatures(degree=3)
poly_predictors = {p : poly_features.fit_transform(df[p].to_numpy()[:, None]) for p in predictors}
poly_models = {p : sm.OLS(Y, poly_predictors[p]).fit() for p in predictors}
for p in predictors:
    print(f"p-values for {p}:")
    print(f"{poly_models[p].pvalues}\n")

From the p-values we see that there is evidence of polynomial association between the response and the predictors indus, nox, age, dis, ptratio, and medv.