- Residual sum of squares (RSS) = \(\sum_{i=1}^n (y_i - \hat{y}_i)^2\).
- Residual standard error (RSE) = \(\sqrt{\frac{1}{n-2}\text{RSS}}\).
- \(R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}\).
- Total sum of squares (TSS) = \(\sum_{i=1}^n (y_i - \bar{y})^2\).
- Outlier is an observation for which the predicted response is far from the recorded response.
- High leverage point is an observation for which the recorded predictors are unusual.
- High leverage points usually have more influence on the model than outliers.
- The studentized (standardized) residuals are typically expected to lie between -3 and 3. If some lie outside this range, that indicates there might be outliers in the data.
- The average leverage is \((p + 1) / n\). If the leverage of some observation is significantly higher than the average leverage then that observation is a high leverage point. (Look into “Cook’s distance”.)
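As a quick check of the formulas above, here is a small sketch on synthetic data (the data and variable names are made up for illustration) that compares them against statsmodels:
import numpy as np
import statsmodels.api as sm
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 1 + 2 * x + rng.normal(size=50)
fit = sm.OLS(y, sm.add_constant(x)).fit()
n, p = len(y), 1
rss = np.sum((y - fit.fittedvalues) ** 2)
tss = np.sum((y - y.mean()) ** 2)
rse = np.sqrt(rss / (n - 2))
print(np.isclose(1 - rss / tss, fit.rsquared))  # R^2 formula matches statsmodels
print(np.isclose(rse ** 2, fit.mse_resid))      # RSE^2 is the residual mean square
print(f"Average leverage: {(p + 1) / n}")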
Since there are three p-values in Table 3.4, there are three null hypotheses:
- \(H_0^{(1)}\): sales does not depend on TV,
- \(H_0^{(2)}\): sales does not depend on radio,
- \(H_0^{(3)}\): sales does not depend on newspaper.
The p-values corresponding to the first two null hypotheses are very small, less than \(10^{-4}\), whereas the p-value for the last one is close to 1. This means that the probability of seeing the observed sales numbers if \(H_0^{(1)}\) or \(H_0^{(2)}\) were true is almost zero, but the probability of seeing the observed sales numbers if \(H_0^{(3)}\) is true is almost 1. In other words, sales depends on TV and radio but not on newspaper.
A KNN classifier assigns an observation to the class that holds the majority among its nearest neighbors. In KNN regression the response is estimated as the average of the training responses of the nearest neighbors.
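To make the distinction concrete, here is a minimal sketch of KNN regression in one dimension (my own illustration on synthetic data, not part of the exercise):
import numpy as np

def knn_regress(x_train, y_train, x0, k=3):
    # Average the responses of the k training points closest to x0.
    nearest = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[nearest].mean()

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, size=30)
y_train = 2 * x_train + rng.normal(size=30)
print(knn_regress(x_train, y_train, x0=5.0, k=3))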
a) Assuming X_4 = X_1 X_2 and X_5 = X_1 X_3, the fitted linear model is Y = 50 + 20 X_1 + 0.07 X_2 + 35 X_3 + 0.01 X_1 X_2 - 10 X_1 X_3. Since X_3 (Gender) is 1 for females and 0 for males, this fit essentially gives two models based on gender: Y = 85 + 10 X_1 + 0.07 X_2 + 0.01 X_1 X_2 for females, and Y = 50 + 20 X_1 + 0.07 X_2 + 0.01 X_1 X_2 for males. If X_1 (GPA) and X_2 (IQ) are fixed for the two genders, then the starting salary for males will be higher on average than that of females if X_1 > 35 / 10 = 3.5; i.e. given that the GPA is high enough, on average males will have a higher starting salary than females (answer iii).

b) If IQ is 110 and GPA is 4.0 then a female is predicted to have a starting salary of $137,100 (see the quick check below).

c) False. The scale for IQ is much larger than that of GPA or Gender, so the coefficients related to any IQ-based predictor are expected to be small; the size of the coefficient by itself says nothing about the evidence for the interaction effect, which should instead be judged from its p-value or t-statistic. (Possibly this is why it is often recommended to scale the predictors, so they are all on an equal footing and it is easier to figure out which ones are important.)
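A quick arithmetic check of part b), using the female-specific model above (salary is in thousands of dollars):
gpa, iq = 4.0, 110
salary = 85 + 10 * gpa + 0.07 * iq + 0.01 * gpa * iq
print(salary)  # 137.1, i.e. $137,100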
a) Even though the true relationship is linear, we need to take into account the irreducible error \(ϵ\). The cubic model is more flexible than the linear model and will therefore fit the noise in the training data more closely. I expect the cubic model to have a lower training RSS than the linear model.

b) The linear model will have a lower test RSS than the cubic model since it matches the true relationship. The cubic model will have overfitted to the noise in the training data and will therefore perform worse on the test data.

c) Irrespective of the true relationship, I would expect the cubic model to have a lower training RSS than the linear model because it is more flexible and thus fits the training data better.

d) Since we do not know the true relationship between X and Y, it is difficult to say which of the two models will have a lower test RSS without more information.
We just have to substitute equation 3.38 into the equation for the \(i\)th fitted value, \(\hat{y}_i = x_i \hat{β}\), and rearrange:
\begin{align}
a_{i'} = \frac{x_i x_{i'}}{\sum_{k=1}^n x_k^2}.
\end{align}
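We can verify this representation numerically on synthetic data (my own quick check): the fitted values from the no-intercept fit should equal the linear combination \(\sum_{i'} a_{i'} y_{i'}\).
import numpy as np
rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 3 * x + rng.normal(size=20)
beta_hat = np.sum(x * y) / np.sum(x ** 2)   # equation 3.38
A = np.outer(x, x) / np.sum(x ** 2)         # A[i, i'] = a_{i'} for the ith fitted value
print(np.allclose(beta_hat * x, A @ y))     # True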
The equation for the least squares line is \(y = \hat{β}_0 + \hat{β}_1 x\). From equation 3.4 we have \(\hat{β}_0 = \bar{y} - \hat{β}_1 \bar{x}\). Then putting \(x = \bar{x}\) in the least squares line equation we see that \(y = \bar{y}\). Thus the least squares line always passes through \((\bar{x}, \bar{y})\).
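A quick numerical confirmation on synthetic data (my own sketch; the frame df_check and its columns are made up): predicting at \(x = \bar{x}\) returns \(\bar{y}\).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
rng = np.random.default_rng(0)
df_check = pd.DataFrame({"x": rng.normal(size=50)})
df_check["y"] = 1 + 2 * df_check["x"] + rng.normal(size=50)
fit = smf.ols("y ~ x", data=df_check).fit()
pred_at_xbar = fit.predict(pd.DataFrame({"x": [df_check["x"].mean()]}))
print(np.allclose(pred_at_xbar, df_check["y"].mean()))  # True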
Assumption: \(\bar{x} = \bar{y} = 0\). From equation 3.4 we have \(\hat{y}_i = \hat{β}_1 x_i\), where \begin{align} \hat{β}_1 = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}. \end{align} And from equation 3.17 we have \begin{align} R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n y_i^2}. \end{align} Expanding the square in the numerator of the above fraction and substituting \(\hat{y}_i\) with \(\hat{β}_1 x_i\) we get \begin{align} R^2 = \frac{\hat{β}_1\left(2 \sum_{i=1}^n x_i y_i - \hat{β}_1 \sum_{i=1}^n x_i^2\right)}{\sum_{i=1}^n y_i^2}. \end{align} Now if we simply substitute the above form of \(\hat{β}_1\) into this equation we will see that \(R^2 = \mathrm{Cor}(X, Y)^2\), where \(\mathrm{Cor}(X, Y)\) is given by equation 3.18.
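As a quick numerical check (my own sketch on synthetic data, mean-centred to match the assumption and fitted without an intercept), the two quantities agree:
import numpy as np
import statsmodels.api as sm
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)
x, y = x - x.mean(), y - y.mean()        # enforce the zero-mean assumption
fit = sm.OLS(y, x).fit()                 # regression through the origin
r2 = 1 - np.sum((y - fit.fittedvalues) ** 2) / np.sum(y ** 2)
print(np.isclose(r2, np.corrcoef(x, y)[0, 1] ** 2))  # True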
import pandas as pd
import numpy as np
from tabulate import tabulate
auto = pd.read_csv("data/Auto.csv")
print(tabulate(auto.head(), auto.columns, tablefmt="orgtbl"))
Recall from the last chapter that horsepower has some missing values and needs to be converted to a numeric form before we can use it for linear regression.
auto.drop(auto[auto.horsepower == "?"].index, inplace=True)
auto["horsepower"] = pd.to_numeric(auto["horsepower"])
For simple linear regression we can use the ordinary least squares model from statsmodels or the linear regression model from scikit-learn. In this question we are asked to print the summary of the fitted model. scikit-learn has no method for generating a summary, but statsmodels does.
import statsmodels.formula.api as smf
model = smf.ols(formula="mpg~horsepower", data=auto).fit()
print(model.summary())
a. The F-statistic is much larger than 1, and the p-value (P>|t| in the table) is zero. This indicates that there is a relationship between mpg and horsepower.
b. The \(R^2\) value of 0.606 indicates that this relationship explains around 61% of the variance in mpg.
c. The coefficient corresponding to horsepower is negative. This indicates that the relationship between mpg and horsepower is negative.
d.
pred = model.get_prediction(exog=dict(horsepower=98))
pred_summary = pred.summary_frame()
print(tabulate(pred_summary, pred_summary.columns, tablefmt="orgtbl"))
The predicted mpg for horsepower = 98 is 24.4671. The 95% confidence interval is \([23.9731, 24.9611]\), and the 95% prediction interval is \([14.8094, 34.1248]\). As mentioned in the text, the prediction interval contains the confidence interval.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("ticks")
X = auto["horsepower"]
Y = auto["mpg"]
Ypred = model.predict(X)
plt.close("all")
fig, ax = plt.subplots()
ax.plot(X, Y, 'o', label="Data")
ax.plot(X, Ypred, 'r', label="Least Squares Regression")
ax.legend(loc="best")
sns.despine()
fig.savefig("img/3_auto_ls.png", dpi=90)
The R command plot in this case gives four diagnostic plots:
- Residuals vs Fitted
- Normal Q-Q
- Scale-Location
- Residuals vs Leverage
The Residuals vs Fitted plot shows any non-linear pattern in the residuals, and by extension in the data. The Normal Q-Q plot shows if the residuals are normally distributed. The Scale-Location plot shows if there is heteroscedasticity. The Residuals vs Leverage plot shows if there are high leverage points in the data.
We will produce these plots using statsmodels
and seaborn
.
fitted_vals = model.fittedvalues
plt.close("all")
fig, ax = plt.subplots()
residplot = sns.residplot(x=fitted_vals, y="mpg", data=auto,
lowess=True,
line_kws={"color": "red"},
ax=ax)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
sns.despine()
fig.savefig("img/3_res_vs_fit.png", dpi=90)
This plot clearly shows that there is non-linearity in the data.
import statsmodels.api as sm
residuals = model.resid
plt.close("all")
fig, ax = plt.subplots()
qqplot = sm.qqplot(residuals, line='45', ax=ax, fit=True)
ax.set_ylabel("Standardized Residuals")
sns.despine()
fig.savefig("img/3_auto_qq.png", dpi=90)
Though there are some points that are far from the \(45°\) line, most of the points lie close to it, indicating that the residuals are mostly normally distributed.
# normalized residuals and their square roots
norm_residuals = model.get_influence().resid_studentized_internal
norm_residuals_abs_sqrt = np.sqrt(np.abs(norm_residuals))
plt.close("all")
fig, ax = plt.subplots()
slplot = sns.regplot(x=fitted_vals, y=norm_residuals_abs_sqrt,
lowess=True,
line_kws={"color" : "red"},
ax=ax)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Sqrt of |Standardized Residuals|")
sns.despine()
fig.savefig("img/3_auto_scale_loc.png", dpi=90)
This plot is similar to the first diagnostic plot, except that the quantity on the y-axis is now non-negative. It shows that homoscedasticity does not hold, i.e. the variance is not constant.
plt.close("all")
fig, ax = plt.subplots()
rlplot = sm.graphics.influence_plot(model, criterion="Cooks", ax=ax)
sns.despine()
ax.set_xlabel("Leverage")
ax.set_ylabel("Standardized Residuals")
ax.set_title(" ")
fig.savefig("img/3_auto_res_vs_lev.png", dpi=90)
We see that none of the points have a very high leverage.
plt.close("all")
spm = sns.pairplot(auto, plot_kws = {'s': 10})
spm.fig.set_size_inches(12, 12)
spm.savefig("img/3_auto_scatter.png", dpi=90)
I find heat maps to be better for visualizing correlation matrices than tables. Since the correlation matrix is symmetric we can ignore either of the lower or the upper triangles. We can also ignore the diagonal since it is always going to be 1.
corr_mat = auto[auto.columns[:-1]].corr()
plt.close("all")
fig, ax = plt.subplots()
# Custom diverging color map.
cmap = sns.diverging_palette(220, 10, sep=80, n=7)
# Mask for upper triangle.
mask = np.triu(np.ones_like(corr_mat, dtype=bool))
with sns.axes_style("white"):
sns.heatmap(corr_mat, mask=mask, cmap=cmap, annot=True, robust=True, ax=ax)
fig.savefig("img/3_auto_corr_heat.png")
We see that mpg has considerable negative correlations with cylinders, displacement, horsepower, and weight. This matches what we saw in the scatter plot matrix above. Similarly cylinders, displacement, horsepower, and weight are all correlated with each other.
We could do this with the statsmodels.formula API but that involves more typing, so we will use the statsmodels API.
import statsmodels.api as sm
Y = auto["mpg"]
X = auto[auto.columns[1:-1]]
X = sm.add_constant(X) # For the intercept.
ml_model = sm.OLS(Y, X).fit()
print(ml_model.summary())
a. The large F-statistic indicates that we can reject the null hypothesis, which says that the response mpg does not depend on the predictors. The probability that this data could be generated if the null hypothesis were true is essentially zero (2.04E-139).
b. Looking at the p-values of the individual predictors we see that weight, year, and origin have the most statistically significant relationship with mpg. We can also argue that displacement has a somewhat significant relationship with mpg. On the other hand cylinders, horsepower, and acceleration do not have a statistically significant relationship. This is not necessarily surprising. Given the correlations between mpg, displacement, cylinders, and horsepower, I think one can argue that the information in cylinders and horsepower is redundant.
c. The coefficient for the year variable suggests that mpg increases by 0.7508 every year, i.e. the cars become more fuel-efficient over time.
We make the same diagnostic plots as the previous exercise.
fitted_vals = ml_model.fittedvalues
plt.close("all")
fig, ax = plt.subplots()
residplot = sns.residplot(x=fitted_vals, y="mpg", data=auto,
lowess=True,
line_kws={"color": "red"},
ax=ax)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
sns.despine()
fig.savefig("img/3_ml_res_vs_fit.png", dpi=90)
This plot clearly shows that there is non-linearity in the data.
import statsmodels.api as sm
residuals = ml_model.resid
plt.close("all")
fig, ax = plt.subplots()
qqplot = sm.qqplot(residuals, line='45', ax=ax, fit=True)
ax.set_ylabel("Standardized Residuals")
sns.despine()
fig.savefig("img/3_ml_auto_qq.png", dpi=90)
Though there are some points that are far from the \(45°\) line, most of the points lie close to it, indicating that the residuals are mostly normally distributed.
# normalized residuals and their square roots
norm_residuals = ml_model.get_influence().resid_studentized_internal
norm_residuals_abs_sqrt = np.sqrt(np.abs(norm_residuals))
plt.close("all")
fig, ax = plt.subplots()
slplot = sns.regplot(x=fitted_vals, y=norm_residuals_abs_sqrt,
lowess=True,
line_kws={"color" : "red"},
ax=ax)
ax.set_xlabel("Fitted values")
ax.set_ylabel("Sqrt of |Standardized Residuals|")
sns.despine()
fig.savefig("img/3_ml_auto_scale_loc.png", dpi=90)
The variance of the standardized residuals is smaller than in the simple regression plot, but there is still quite a bit of spread, which means homoscedasticity does not hold.
plt.close("all")
fig, ax = plt.subplots()
rlplot = sm.graphics.influence_plot(ml_model, criterion="Cooks", ax=ax)
sns.despine()
ax.set_xlabel("Leverage")
ax.set_ylabel("Standardized Residuals")
ax.set_title(" ")
fig.savefig("img/3_ml_auto_res_vs_lev.png", dpi=90)
Point 13 has a high leverage but not a very high residual.
We go back to the statsmodels.formula API. Additionally I will drop cylinders, horsepower, and acceleration from the model. I will try the following interaction terms:
- year : origin, which adds a new predictor that is the product of year and origin, but does not include year and origin separately,
- year * origin, which adds the product of year and origin and also includes year and origin separately,
- weight * year, which is the same as the above, except with weight in place of origin.
ml_model_1 = smf.ols(formula="mpg ~ displacement + weight + year : origin",
data=auto).fit()
ml_model_2 = smf.ols(formula="mpg ~ displacement + weight + year * origin",
data=auto).fit()
ml_model_3 = smf.ols(formula="mpg ~ displacement + weight * year + origin",
data=auto).fit()
print(ml_model_1.summary())
The large F-statistic lets us reject the null hypothesis. The individual p-values show that displacement is not really significant, but the product of year and origin is. Additionally the \(R^2\) value tells us that this model explains around 71% of the variance in mpg.
print(ml_model_2.summary())
The \(R^2\) value has increased; it is now almost the same as the \(R^2\) value for the model with all the quantitative predictors but no interactions. This model explains around 82% of the variance in mpg. Additionally we see that year is more significant than origin or the product of year and origin. Also, in addition to displacement, the intercept term appears to be insignificant too.
print(ml_model_3.summary())
The \(R^2\) increased a bit, and it appears that other than displacement all the other predictors are significant.
We will try an additional model with two interactions: displacement * weight and weight * year.
ml_model_4 = smf.ols("mpg ~ displacement * weight + weight * year + origin",
data=auto).fit()
print(ml_model_4.summary())
This is interesting. We see that the \(R^2\) value has increased further, but now displacement has become significant whereas weight and origin have become relatively insignificant. The interaction terms are still significant. My understanding of this model is that while weight does not directly affect mpg, it increases displacement, and that in turn affects mpg.
ml_model_trans = smf.ols(formula="mpg ~ np.log(weight) + np.power(weight, 2) + year",
data=auto).fit()
print(ml_model_trans.summary())
The F-statistic and the p-values indicate that these transformations are statistically significant.
So far I had been loading the data sets from local .csv files, but I recently found out that statsmodels makes them automatically available via the Rdatasets project. So going forward I will use that whenever possible.
import statsmodels.api as sm
carseats = sm.datasets.get_rdataset("Carseats", package="ISLR")
print(carseats.__doc__)
Multiple linear regression to predict Sales using Price, Urban, and US.
import statsmodels.formula.api as smf
model = smf.ols(formula="Sales ~ Price + Urban + US",
data=carseats.data).fit()
print(model.summary())
The F-statistic is larger than 1, though much smaller than the F-statistics in the last problem. This suggests that at least one of the predictors is related to Sales.
From the individual p-values we can conclude that Urban is not a statistically significant predictor for Sales.

Since Urban is not a statistically significant predictor we do not need to worry about its coefficient. The coefficient for US indicates that if the store is in the US then sales will increase by about 1200 units. On the other hand the coefficient for Price says that an increase in price will result in a decrease in sales.
The equation for this model is
\begin{align}
Y = 13.04 - 0.02 X_1 + 1.20 X_2 - 0.05 X_3,
\end{align}
where \(Y\), \(X_1\), \(X_2\), and \(X_3\) stand for Sales, Urban, US, and Price, respectively. \(X_1 = 1\) if the store is in an urban location, and \(0\) otherwise. Similarly \(X_2 = 1\) if the store is in the US, and \(0\) if it is not.
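To make the dummy-variable encoding concrete, here is a small worked example of my own using the rounded coefficients above (sales are in thousands of units; the price value is made up):
def predict_sales(urban, us, price):
    # Y = 13.04 - 0.02 * Urban + 1.20 * US - 0.05 * Price
    return 13.04 - 0.02 * urban + 1.20 * us - 0.05 * price

print(predict_sales(urban=1, us=1, price=100))  # US store in an urban location
print(predict_sales(urban=0, us=0, price=100))  # non-US store outside an urban location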
We can reject the null hypothesis for US and Price, since the associated p-values are effectively 0.
small_model = smf.ols(formula="Sales ~ US + Price", data=carseats.data).fit()
print(small_model.summary())
Based on the \(R^2\) values both the models fit the data similarly.
print(small_model.conf_int())
The confidence intervals are also printed in the summary, but this is probably more convenient.
To see if there are any leverage points we need to first calculate the average leverage, \((p + 1) / n\), for the data.
npredictors = 2
nobservations = len(carseats.data)
avg_leverage = (npredictors + 1) / nobservations
print(f"Average leverage: {avg_leverage}")
The Residuals vs Leverage plot is the easiest way to check for outliers and high leverage observations.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("ticks")
plt.close("all")
fig, ax = plt.subplots()
rlplot = sm.graphics.influence_plot(small_model, criterion="Cooks", ax=ax)
sns.despine()
ax.set_xlabel("Leverage")
ax.set_ylabel("Standardized Residuals")
ax.set_title(" ")
fig.savefig("img/3.10.h_res_vs_lev.png", dpi=90)
All the residuals are between -3 and 3, so there are no outliers. However there are a lot of points whose leverage greatly exceeds the average leverage. Thus there are high leverage observations.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("ticks")
rng = np.random.default_rng(seed=42)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)
data = pd.DataFrame({"X" : x, "Y" : y})
plt.close("all")
fig, ax = plt.subplots()
sp = sns.scatterplot(x="X", y="Y", data=data, ax=ax)
sns.despine()
fig.savefig("img/3.11.a_data.png", dpi=90)
Now we do a simple linear regression with this synthetic data. This model will not have an intercept: \(Y = βX\).
import statsmodels.formula.api as smf
model = smf.ols("Y ~ X - 1", data=data).fit()
print(model.summary())
The coefficient estimate is \(\hat{β} = 2.1196\) with a standard error of \(0.126\). The t-statistic is 16.833 and the associated p-value is 0. This means we can reject the null hypothesis.
We are going to use the same data, but now with X as the response and Y as the predictor.
model2 = smf.ols("X ~ Y - 1", data=data).fit()
print(model2.summary())
The coefficient estimate is \(\hat{β} = 0.3496\) with a standard error of \(0.021\). The t-statistic is \(16.833\) and the associated p-value is 0. This means we can reject the null hypothesis.
Given the underlying true model we might naively have expected the coefficients of the two models to be multiplicative inverses of each other. But they are not. The reason is that the two models minimize different residual sums of squares:
\begin{align}
\mathrm{RSS}^{(1)} &= \sum_{i=1}^n (y_i - \hat{β}^{(1)} x_i)^2, \\
\mathrm{RSS}^{(2)} &= \sum_{i=1}^n (x_i - \hat{β}^{(2)} y_i)^2.
\end{align}
\(\mathrm{RSS}^{(1)}\) is minimized when \(\hat{β}^{(1)} = \sum_i x_i y_i / \sum_i x_i^2\), and \(\mathrm{RSS}^{(2)}\) is minimized when \(\hat{β}^{(2)} = \sum_i x_i y_i / \sum_i y_i^2\). If \(\hat{β}^{(1)} = 1 / \hat{β}^{(2)}\) then we would have
\begin{align}
\left(\sum_{i=1}^n x_i y_i\right)^2 = \sum_{i=1}^n x_i^2 \sum_{i=1}^n y_i^2.
\end{align}
Since here \(X\) and \(Y\) are random variables with zero mean we can interpret the above equation as
\begin{align}
\mathrm{Cov}(X, Y)^2 = \mathrm{Var}(X)\,\mathrm{Var}(Y),
\end{align}
i.e. \(|\mathrm{Cor}(X, Y)| = 1\). This is true only if the true relation is \(y_i = β x_i + γ\) for some nonzero constant \(β\) and constant \(γ\), i.e. an exact linear relationship (see DeGroot and Schervish, Theorem 4.6.3, or ProofWiki for a proof of this statement). But the true relation in this case was \(y_i = β x_i + ϵ\), where \(ϵ\) is a Gaussian random variable with zero mean. Thus the above equality does not hold, and hence \(\hat{β}^{(1)} ≠ 1 / \hat{β}^{(2)}\). For a more detailed discussion on this check Stats StackExchange.
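As a quick check of my own, we can compare the closed-form expressions above against the fitted models (reusing x, y, model, and model2 from earlier in this exercise). By the Cauchy-Schwarz inequality the product of the two estimates is strictly less than 1 unless y is an exact multiple of x, so the coefficients cannot be inverses of each other.
beta1_hat = np.sum(x * y) / np.sum(x ** 2)        # minimizes RSS^(1)
beta2_hat = np.sum(x * y) / np.sum(y ** 2)        # minimizes RSS^(2)
print(np.isclose(beta1_hat, model.params["X"]))   # matches the Y ~ X - 1 fit
print(np.isclose(beta2_hat, model2.params["Y"]))  # matches the X ~ Y - 1 fit
print(beta1_hat * beta2_hat)                      # < 1, so beta1_hat != 1 / beta2_hat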
The t-statistic for a simple linear fit without intercept is \(\hat{β} / \mathrm{SE}(\hat{β})\) where \(\hat{β} = \sum_i x_i y_i / \sum_i x_i^2\), and the standard error is \begin{align} \mathrm{SE}(\hat{β}) = \sqrt{\frac{\sum_i (y_i - x_i \hat{β})^2}{(n-1) \sum_i x_i^2}}. \end{align} Substituting the expression for \(\hat{β}\) into the expressions for the standard error and the t-statistic gives us the expected expression for the t-statistic. The trick is to realize that the summation indices are dummy variables, i.e. \(\sum_{i=1}^n x_i^2 = \sum_{j=1}^n x_j^2\). Numerically we can confirm this as follows:
n = len(data)
x, y = data["X"], data["Y"]
t = (np.sqrt(n - 1) * np.sum(x * y)
/ np.sqrt(np.sum(x ** 2) * np.sum(y ** 2) - np.sum(x * y) ** 2))
print(f"t-statistic: {t:.3f}")
The expression for the t-statistic is symmetric in \(X\) and \(Y\), so irrespective of whether we are regressing \(Y\) onto \(X\) or \(X\) onto \(Y\), we will have the same t-statistic.
model3 = smf.ols(formula="Y ~ X", data=data).fit()
model4 = smf.ols(formula="X ~ Y", data=data).fit()
print(model3.summary())
print(model4.summary())
We can see that the t-statistic for the predictor is the same for both models.
As this is a regression without intercept we can use the expressions derived in the previous question. The coefficients will be the same when \(\sum_i x_i^2 = \sum_i y_i^2\). This holds in particular when \(X = Y\).
Essentially reusing question 11.
rng = np.random.default_rng(seed=42)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)
data = pd.DataFrame({"X" : x, "Y" : y})
plt.close("all")
fig, ax = plt.subplots()
sp = sns.scatterplot(x="X", y="Y", data=data, ax=ax)
sns.despine()
fig.savefig("img/3.12.b_data.png", dpi=90)
model1 = smf.ols("Y ~ X - 1", data=data).fit()
model2 = smf.ols("X ~ Y - 1", data=data).fit()
print(model1.summary())
print(model2.summary())
We need \(\sum_i x_i^2 = \sum_i y_i^2\). Setting \(X = Y\) works, but \(Y = \mathrm{Permutation}(X)\) would work too, and is more general.
rng = np.random.default_rng(seed=42)
x = rng.normal(size=100)
y = rng.permutation(x)
data = pd.DataFrame({"X" : x, "Y" : y})
plt.close("all")
fig, ax = plt.subplots()
sp = sns.scatterplot(x="X", y="Y", data=data, ax=ax)
sns.despine()
fig.savefig("img/3.12.b_data.png", dpi=90)
model3 = smf.ols("Y ~ X - 1", data=data).fit()
model4 = smf.ols("X ~ Y - 1", data=data).fit()
print(model3.summary())
print(model4.summary())
rng = np.random.default_rng(seed=42)
x = rng.normal(loc=0, scale=1, size=100)
eps = rng.normal(loc=0, scale=0.25, size=100)
y = -1 + 0.5 * x + eps
The length of y is 100, and \(β_0 = -1\) and \(β_1 = 0.5\).
df = pd.DataFrame({"x" : x, "y" : y})
plt.close("all")
sp = sns.jointplot(x="x", y="y", data=df) # jointplot also gives the distributions of x and y in addition to the scatter plot
sns.despine()
sp.savefig("img/3.13.d_scatter.png", dpi=90)
We can see a clear linear trend between x and y.
model = smf.ols("y ~ x", data=df).fit()
print(model.summary())
The estimates for \(β_0\) and \(β_1\) are almost equal to the true values. The true values fall within the 95% confidence interval of the estimated values.
ypred = model.predict(df["x"])
plt.close("all")
fig, ax = plt.subplots()
ax.plot(x, y, 'o', label="Data")
ax.plot(x, ypred, 'r', label="Least Squares Regression")
ax.legend(loc="best")
sns.despine()
fig.savefig("img/3.13.f_ols.png", dpi=90)
poly_model = smf.ols(formula="y ~ x + I(x**2)", data=df).fit()
print(poly_model.summary())
The \(R^2\) values of both models are pretty much the same. Additionally, the p-value for the quadratic term is not close to zero, so that term is not statistically significant. The quadratic term does not improve the model fit.
The new data is as follows. The spread of the noise is now 0.1 instead of 0.25.
eps = rng.normal(loc=0, scale=0.1, size=100)
y = -1 + 0.5 * x + eps
df2 = pd.DataFrame({"x" : x, "y" : y})
plt.close("all")
sp = sns.jointplot(x="x", y="y", data=df2) # jointplot also gives the distributions of x and y in addition to the scatter plot
sns.despine()
sp.savefig("img/3.13.h_scatter.png", dpi=90)
Now the least squares fit to this data.
less_noisy_model = smf.ols("y ~ x", data=df2).fit()
print(less_noisy_model.summary())
The \(R^2\) value for this data set is much larger than for the original data set. The model is able to explain 93% of the variance in the less noisy data, whereas it could only explain around 70% of the variance in the original data set.
ypred = less_noisy_model.predict(df2["x"])
plt.close("all")
fig, ax = plt.subplots()
ax.plot(x, y, 'o', label="Data")
ax.plot(x, ypred, 'r', label="Least Squares Regression")
ax.legend(loc="best")
sns.despine()
fig.savefig("img/3.13.h_ols.png", dpi=90)
The new data is as follows. The spread of the noise is now 0.5 instead of 0.25.
eps = rng.normal(loc=0, scale=0.5, size=100)
y = -1 + 0.5 * x + eps
df3 = pd.DataFrame({"x" : x, "y" : y})
plt.close("all")
sp = sns.jointplot(x="x", y="y", data=df3)
sns.despine()
sp.savefig("img/3.13.i_scatter.png", dpi=90)
Now the least squares fit to this data.
more_noisy_model = smf.ols("y ~ x", data=df3).fit()
print(more_noisy_model.summary())
The \(R^2\) value for this data set is much smaller than it was for the original data set. The model is able to explain only 43% of the variance in the noisier data, whereas it could explain around 70% of the variance in the original data.
ypred = more_noisy_model.predict(df3["x"])
plt.close("all")
fig, ax = plt.subplots()
ax.plot(x, y, 'o', label="Data")
ax.plot(x, ypred, 'r', label="Least Squares Regression")
ax.legend(loc="best")
sns.despine()
fig.savefig("img/3.13.i_ols.png", dpi=90)
From the graph it appears that there are possible outliers, which is not surprising given the spread of the error.
print("Confidence interval based on original data set:\n")
print(f"{model.conf_int()}\n")
print("Confidence interval based on less noisy data set:\n")
print(f"{less_noisy_model.conf_int()}\n")
print("Confidence interval based on more noisy data set:\n")
print(f"{more_noisy_model.conf_int()}\n")
The confidence intervals for the less noisy data set are the tightest and the confidence intervals for the more noisy data set are the loosest.
from numpy.random import MT19937
rng = np.random.default_rng(MT19937(seed=5))
x1 = rng.uniform(size=100)
x2 = 0.5 * x1 + rng.normal(size=100) / 10
y = 2 + 2 * x1 + 0.3 * x2 + rng.normal(size=100)
df_coll = pd.DataFrame({"x1" : x1, "x2" : x2, "y" : y})
The form of the linear model is \begin{align} Y = 2 + 2 X_1 + 0.3 X_2 + ϵ. \end{align}
corr = df_coll.corr()
print(corr)
The correlation between x1 and x2 is 0.768.
plt.close("all")
fig, ax = plt.subplots()
sp = sns.scatterplot(x="x1", y="x2", data=df_coll, ax=ax)
sns.despine()
fig.savefig("img/3.14.a_scatter.png", dpi=90)
coll_model1 = smf.ols("y ~ x1 + x2", data=df_coll).fit()
print(coll_model1.summary())
The estimated values for the coefficients are 1.869, 2.175, and 0.445 which are close to the true values, albeit with large standard errors, particularly for \(\hat{β}_2\). Based on the p-values we can reject the null hypothesis for \(β_1\), but we cannot reject the null-hypothesis for \(β_2\).
coll_model2 = smf.ols("y ~ x1", data=df_coll).fit()
print(coll_model2.summary())
The coefficient value has increased, and the \(R^2\) value has decreased marginally. We can still reject the null hypothesis based on the p-value.
coll_model3 = smf.ols("y ~ x2", data=df_coll).fit()
print(coll_model3.summary())
The coefficient value is much larger now, but the \(R^2\) value has decreased. We can now reject the null hypothesis based on the p-value.
The three models do not contradict each other. Due to the high correlation between x1 and x2, we can predict x2 from x1. Thus in the original model x2 has very little additional explanatory power, and so we cannot reject the null hypothesis for \(β_2\).

For the second and third models the explanation for the increase in the coefficients is as follows. In the second model we are implicitly expressing x2 in terms of x1 (since x2 is roughly 0.5 x1), and so the effective coefficient of x1 in the expression for y increases to 2 + 0.3 × 0.5 = 2.15. In the third model we are expressing x1 in terms of x2 (since x1 is roughly 2 x2), and so the effective coefficient of x2 in the expression for y increases to 2 × 2 + 0.3 = 4.3. The 95% confidence interval of the second model includes this new effective value of the coefficient. Even though the 95% confidence interval of the third model does not include the new effective value of the coefficient, it comes close. This is probably due to the difference between the random number generators used by R and numpy.
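As a quick check of my own on the claim that x2 is predictable from x1, we can regress x2 on x1 and look at the \(R^2\) (aux_model is just a throwaway name):
aux_model = smf.ols("x2 ~ x1", data=df_coll).fit()
print(f"R^2 of x2 ~ x1: {aux_model.rsquared:.3f}")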
df_coll = pd.concat([df_coll, pd.DataFrame([{"x1" : 0.1, "x2" : 0.8, "y" : 6}])], ignore_index=True)
coll_model1 = smf.ols("y ~ x1 + x2", data=df_coll).fit()
coll_model2 = smf.ols("y ~ x1", data=df_coll).fit()
coll_model3 = smf.ols("y ~ x2", data=df_coll).fit()
print(coll_model1.summary())
print(coll_model2.summary())
print(coll_model3.summary())
The \(R^2\) values for models 1 and 2 decreased. This observation decreased the predictive power of the models. The average leverage for the data set is
p = len(df_coll.columns) - 1  # number of predictors, excluding the response y
n = len(df_coll)
lev = (p + 1) / n
print(f"{lev:.3f}")
plt.close("all")
fig, ax = plt.subplots()
rlplot = sm.graphics.influence_plot(coll_model1, criterion="Cooks", ax=ax)
sns.despine()
ax.set_xlabel("Leverage")
ax.set_ylabel("Standardized Residuals")
ax.set_title(" ")
fig.savefig("img/3.14.g_coll1_res_vs_lev.png", dpi=90)
For the first model it is both an outlier and a high leverage point.
plt.close("all")
fig, ax = plt.subplots()
rlplot = sm.graphics.influence_plot(coll_model2, criterion="Cooks", ax=ax)
sns.despine()
ax.set_xlabel("Leverage")
ax.set_ylabel("Standardized Residuals")
ax.set_title(" ")
fig.savefig("img/3.14.g_coll2_res_vs_lev.png", dpi=90)
For the second model it is just an outlier.
plt.close("all")
fig, ax = plt.subplots()
rlplot = sm.graphics.influence_plot(coll_model3, criterion="Cooks", ax=ax)
sns.despine()
ax.set_xlabel("Leverage")
ax.set_ylabel("Standardized Residuals")
ax.set_title(" ")
fig.savefig("img/3.14.g_coll3_res_vs_lev.png", dpi=90)
For the third model it is just a high leverage point.
We load the Boston data set from statsmodels.
import statsmodels.api as sm
boston = sm.datasets.get_rdataset("Boston", "MASS")
print(boston.__doc__)
Earlier we fitted medv on the other predictors; now we will fit crim on the other predictors.
import statsmodels.formula.api as smf
df = boston.data
predictors = [c for c in df.columns if c != "crim"]
simple_models = {p : smf.ols(formula=f"crim ~ {p}", data=df).fit() for p in predictors}
print(f"predictor coefficient p-value")
for p, model in simple_models.items():
print(f"{p:^9} {model.params[p]:>9,.4f} {model.pvalues[p]:>9,.4f}")
Except for chas, everything else appears to be statistically significant.
import matplotlib.pyplot as plt
import seaborn as sns
from math import ceil
sns.set_style("ticks")
ncols = 4
nrows = ceil(len(predictors) / ncols)
plt.close("all")
fig, axs = plt.subplots(nrows=nrows, ncols=ncols, constrained_layout=True, figsize=(12, 10))
for i in range(nrows):
for j in range(ncols):
if i * ncols + j < len(predictors):
sns.regplot(x=df[predictors[i * ncols + j]], y=df["crim"], ax=axs[i, j], line_kws={"color" : "r"})
sns.despine()
fig.savefig("img/3.15.a_reg_mat.png", dpi=120)
Y = df['crim']
X = df[predictors]
X = sm.add_constant(X)
ml_model = sm.OLS(Y, X).fit()
print(ml_model.summary())
Based on the p-values we can reject the null hypothesis for dis, rad, and medv. If we are willing to be less strict we can also reject the null hypothesis for zn, nox, black, and lstat.
import pandas as pd
ml_coefs = ml_model.params
sl_coefs = pd.Series({p : simple_models[p].params.loc[p] for p in predictors})
coef_df = pd.concat([sl_coefs, ml_coefs], axis=1)
coef_df.reset_index(inplace=True)
coef_df.columns = ["Predictors", "Simple OLS Coefficient", "Multiple OLS Coefficient"]
coef_df.dropna(inplace=True)
print(coef_df)
plt.close("all")
fig, ax = plt.subplots()
sns.scatterplot(x="Simple OLS Coefficient", y="Multiple OLS Coefficient", data=coef_df, ax=ax)
sns.despine()
fig.savefig("img/3.15.c_comp_plot.png")
The coefficients for nox are very different in the two models.
We will use scikit-learn to generate the non-linear features.
from sklearn.preprocessing import PolynomialFeatures
pd.options.display.float_format = "{:,.3f}".format
Y = df['crim']
poly_features = PolynomialFeatures(degree=3)
poly_predictors = {p : poly_features.fit_transform(df[p].values[:, None]) for p in predictors}
poly_models = {p : sm.OLS(Y, poly_predictors[p]).fit() for p in predictors}
for p in predictors:
print(f"p-values for {p}:")
print(f"{poly_models[p].pvalues}\n")
From the p-values we see that there is evidence of polynomial association between the response and the predictors indus, nox, age, dis, ptratio, and medv.