---
title: ' Project `r params$proj_number`: `r params$proj_title`'
subtitle: 'DS 5494 - Statistical Data Mining II'
author:
- William Ofosu Agyapong^[[email protected]]
- University of Texas at El Paso (UTEP)
date: "`r format(Sys.Date(), '%B %d, %Y')`"
output:
bookdown::pdf_document2:
fig_caption: true
latex_engine: xelatex
number_sections: true
toc: true
toc_depth: 4
header-includes:
- \usepackage{amsmath}
- \usepackage{amssymb}
- \usepackage{amsfonts}
- \usepackage{amsthm}
- \usepackage{fancyhdr}
- \pagestyle{fancy}
- \fancyhf{}
- \rhead{William O. Agyapong}
- \lhead{Project `r params$proj_number` -- `r params$proj_title`}
- \cfoot{\thepage}
- \usepackage{algorithm}
- \usepackage[noend]{algpseudocode}
geometry: margin = 0.8in
fontsize: 10pt
params:
proj_number: V
proj_title: Parametric/Nonparametric Nonlinear Regression
---
```{r setup, include=FALSE}
# Set global options for output rendering
knitr::opts_chunk$set(eval = T, echo = T, warning = F, message = F, fig.pos = "H", out.extra = "", fig.align = "center")
#----------------- Load required packages
library(dplyr)
library(ggplot2)
library(knitr)
library(patchwork) # interface for laying out plots from ggplot2
options(kableExtra.auto_format = T) # keep kableExtra's automatic global formatting enabled
library(kableExtra)
#----------------- set the working directory to this file's location (interactive use only;
#----------------- knitr already renders relative to the Rmd's own directory)
if (interactive()) setwd(dirname(rstudioapi::getSourceEditorContext()$path))
#----------------- Set default rounding to 4 decimal places
options(digits = 4)
#----------------- Set default ggplot theme
theme_set(theme_bw())
```
<!-- QUESTION ONE: WHAT -->
<!-- \noindent\rule{17.5cm}{0.8pt} -->
\newpage
# Introduction
We consider the data set `jaws.txt`, which concerns the association between jaw bone
length ($y$ = bone) and age ($x$ = age) in deer. We will try out several parametric and nonparametric
nonlinear regression models in this low-dimensional ($p = 1$) setting.
<!-- Question 2 -->
## Bringing in the data
```{r}
jaws <- read.table("jaws.txt", header = T)
# head(jaws)
# tail(jaws)
# dim(jaws)
jaws %>%
slice_sample(n=6) %>%
kable(booktabs=T, linesep="",
caption = "10 random samples from the jaws data set") %>%
kable_styling(latex_options = c("HOLD_position")) %>%
kable_classic()
```
## Relationship between **Bone** and **Age**
```{r bone-vs-age, fig.cap="Relationship between jaw bone length and age in deer"}
ggplot(jaws, aes(age, bone)) +
geom_point() +
geom_smooth(method = "lm", se=F, size=0.7, aes(color="Linear")) + # linear fit
geom_smooth(method = "loess", se=F, size=0.7, aes(color="Loess")) + # nonlinear loess fit
scale_color_manual(name="Fit", values = c("blue", "red")) +
labs(x="Age", y="Jaw bone length") +
theme_bw()
```
Figure \@ref(fig:bone-vs-age) suggests that the association between bone length and age is nonlinear. Therefore, a nonlinear fit may be more appropriate for modeling the relationship between the two variables.
<!-- Question 2 -->
# Data Partitioning
## Part (a): Randomly splitting data into training and test sets
First, we randomly partition the data $\mathcal{D}$ into a training set $\mathcal{D}_1$ and a test set $\mathcal{D}_2$ with a sample-size ratio of approximately 2:1. The output below shows that 36 and 18 observations were allocated to the resulting training and test sets, respectively.
```{r}
set.seed(9940)
ratio <- 2/3
train_ind <- sample(1:NROW(jaws), size = NROW(jaws)*ratio)
train_set <- jaws[train_ind, ]
test_set <- jaws[-train_ind,]
dim(train_set); dim(test_set)
```
## Part (b): Preventing extrapolation at prediction time
Next, we make sure that the observations with the minimum and maximum ages end up in the training set.
```{r}
min_max_age1 <- jaws %>% summarise(min_age=min(age), max_age=max(age))
min_max_age2 <- train_set %>% summarise(min_age=min(age), max_age=max(age))
min_max_age3 <- test_set %>% summarise(min_age=min(age), max_age=max(age))
df <- rbind(min_max_age1, min_max_age2, min_max_age3)
rownames(df) <- c("Original data", "Train set", "Test set")
kable(df, booktabs=T, col.names = c("Minimum age", "Maximum age")) %>%
kable_styling(latex_options = c("HOLD_position")) %>%
kable_classic()
```
With our choice of seed (used for reproducibility), the observations carrying the minimum and maximum ages in the original `jaws` data set ended up in the training set, so no further action is needed to avoid extrapolation when predicting on the test data. Had that not been the case, the extreme-age rows could be moved into the training set, as sketched below.
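The following is a minimal sketch (not needed for this seed) of how the rows holding the extreme ages could be forced into the training indices; the object names `idx_extreme` and `train_ind_adj` are illustrative only.
```{r}
# identify the rows with the minimum and maximum age
idx_extreme <- c(which.min(jaws$age), which.max(jaws$age))
# make sure they are part of the training indices
train_ind_adj <- union(train_ind, idx_extreme)
# train_set <- jaws[train_ind_adj, ]; test_set <- jaws[-train_ind_adj, ]
```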
<!-- Question 3 -->
# Parametric nonlinear models
## Part (a): An asymptotic exponential model
We fit an asymptotic exponential model of the form
$$ y = \beta_1 - \beta_2e^{-\beta_3x} + \epsilon$$
We use the `nls()` function with the default Gauss-Newton algorithm and starting values $\beta_1=150$, $\beta_2=100$, and $\beta_3=0.6$.
```{r }
full_expo_fit <- nls(bone ~ beta1 - beta2*exp(-beta3*age), trace = F, start =
list(beta1=150, beta2=100, beta3=.6), data = train_set)
```
```{r}
summary(full_expo_fit)
```
From the output, the model coefficient estimates are $\hat{\beta_1}= `r coef(full_expo_fit)[1]`$, $\hat{\beta_2} = `r coef(full_expo_fit)[2]`$, and $\hat{\beta_3} = `r coef(full_expo_fit)[3]`$. The residual standard error is 13.06 on 33 degrees of freedom, which corresponds to a mean squared error (MSE) of $13.06^2 = 170.5636$.
Judging from the p-values, all the coefficients are statistically significant at a reasonable significance level, say 5%.
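For reference (a small check added here, not part of the original write-up), the MSE quoted above can be computed directly from the fitted `nls` object:
```{r}
# residual mean square (MSE): RSS divided by the residual degrees of freedom
rss <- sum(residuals(full_expo_fit)^2)
rss / (length(residuals(full_expo_fit)) - length(coef(full_expo_fit)))
```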
## Part (b): Hypothesis testing
We are interested in testing the hypothesis: $H_0:\beta_1 = \beta_2$. Let $\beta_1 = \beta_2 = \beta$. Then the reduced model under $H_0$ is of the form
$$ y = \beta\left(1 - e^{-\beta_3x}\right) + \epsilon$$
```{r}
reduced_mod <- nls(bone ~ beta*(1-exp(-beta3*age)), trace = F, start =
list(beta=100, beta3=.6), data = train_set)
summary(reduced_mod)
```
From the above results, the parameter estimates are $\hat{\beta} = `r coef(reduced_mod)[1]`$ and $\hat{\beta_3} = `r coef(reduced_mod)[2]`$. Once again, judging by the p-values, all the coefficients are statistically significant at the
$\alpha = 0.05$ level.
## Comparing the two models
```{r}
anova(full_expo_fit, reduced_mod)
```
Since the p-value of 0.69 is well above the $\alpha = 0.05$ significance level, we fail to reject $H_0$ and conclude that there is no statistically significant difference between $\beta_1$ and $\beta_2$; in other words, the reduced model does not fit significantly worse than the full model. We therefore prefer the reduced model because of its simpler form with fewer parameters.
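For completeness, the `anova()` comparison above is an extra-sum-of-squares $F$ test of the nested models,
$$ F = \frac{\left(\mathrm{RSS}_{\text{reduced}} - \mathrm{RSS}_{\text{full}}\right)/\left(df_{\text{reduced}} - df_{\text{full}}\right)}{\mathrm{RSS}_{\text{full}}/df_{\text{full}}}, $$
where RSS denotes the residual sum of squares and $df$ the residual degrees of freedom of each fit.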
For another level of model comparison, we use the `AIC()` and `BIC()` functions in R with their default settings; Table \@ref(tab:aic-bic) shows the results for each model. Both the AIC and BIC values favor the reduced model over the full model. This agrees with the hypothesis test above, so we conclude that the reduced model is indeed preferable.
```{r aic-bic}
result_df <- data.frame(AIC=c(AIC(full_expo_fit), AIC(reduced_mod)),
BIC = c(BIC(full_expo_fit), BIC(reduced_mod)))
rownames(result_df) <- c("Full exponential model", "Reduced exponential model")
kable(result_df, booktabs=T,linesep="",
caption ="Model comparison based on AIC and BIC")%>%
kable_styling(latex_options =c("HOLD_position"))%>%
kable_classic()
```
## Part (c): Scatter plot with fitted curve from best nls model
### Plotting functions
Two functions, `myscatterplot()` and `obs_pred_plot()`, were created to generate, respectively, scatter plots with fitted curves from individual models and plots of the observed versus predicted response values with the reference line $y=x$. This avoids unnecessary repetition of code for essentially the same routines.
```{r}
myscatterplot <- function(model=NULL, fitted=NULL, df=train_set, fitted_col='blue',
title = '') {
if(is.null(fitted)) fitted <- predict(model, newdata = df)
if(title == '') title <- "Fitted curve together with the scatterplot of the data"
# augment the data set with the fitted values from our selected model
df$fitted_values <- fitted
# recreate scatter plot with the fitted values
ggplot(df, aes(age, bone)) +
geom_point() +
geom_line(aes(age, fitted_values), color=fitted_col) +
labs(x="Age", y="Jaw bone length", title = title) +
theme_bw()
}
obs_pred_plot <- function (pred, obs=NULL, col = "red", title='') {
# obs: observed values
# pred: predicted values
if(is.null(obs)) obs <- test_set$bone
if(title == '') title <- "Observed versus predicted response values with reference line y=x"
# PREDICTED VS. OBSERVED
ggplot(mapping= aes(obs, pred)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color=col) +
labs(y="Predicted jaw bone length", x="Observed jaw bone length",
title = title)
}
```
Next, we create the required scatter plot with our custom function.
This can be done either by passing the trained model object or by supplying the fitted values from the trained model directly; here we use the former.
```{r nlsfit, fig.cap="Relationship between jaw bone length and age in deer with the reduced nls fitted curve"}
# create required scatter plot with our custom functions
myscatterplot(reduced_mod)
```
Figure \@ref(fig:nlsfit) shows that the reduced exponential model appears to fit the data well, as it captures the nonlinear relationship we observed earlier.
## Part (d): Apply selected model to the test data
Next, we apply the model selected in part (b) to the test data and use our custom `obs_pred_plot()` function to plot the predicted response values against the observed response values, as shown in Figure \@ref(fig:nls-obs-pred).
```{r nls-obs-pred, fig.width=5, fig.height=3, fig.cap="Observed versus predicted response"}
# observed vs. predicted plot
nls_predicted <- predict(reduced_mod, test_set)
obs_pred_plot(pred = nls_predicted)
```
The red diagonal line is the reference line $y = x$. Most of the data points deviate somewhat from this line, indicating that the predictions are not perfect, although they still appear reasonable.
```{r}
# prediction mean square error
(reduced_nls_mse <- mean((test_set$bone - nls_predicted)^2))
```
From the above output, the prediction mean square error (MSE) is **`r reduced_nls_mse`**.
<!-- Question 4 -->
# Local regression methods
## Part (a): KNN Regression
Here, we implement a KNN regression model using the `knn.reg()` function in the `FNN` package. A 5-fold cross-validation was employed to select the optimal number of nearest neighbors, $K$, from the integer values 2 through 15. The value yielding the minimum cross-validated mean squared error (MSE) was taken as optimal.
```{r}
# V-FOLD CV FOR SELECTING K
library(FNN)
set.seed(126)
V <- 5; n <- NROW(train_set)
K <- 2:15; SSE <- rep(0, length(K))
# assign each training observation to one of the V folds (same folds used for all K)
id.fold <- sample(rep(1:V, length.out = n))
for (i in seq_along(K)){
  for (v in 1:V){
    cv_train <- train_set[id.fold != v, ]
    cv_test <- train_set[id.fold == v, ]
    yhat <- knn.reg(train=cv_train, test=cv_test, y=cv_train$bone,
                    k=K[i])$pred
    SSE[i] <- SSE[i] + sum((cv_test$bone - yhat)^2)
  }
}
cv_mse <- data.frame(K, MSE=SSE/n)
# Get the optimal k
opt_k <- K[which.min(cv_mse$MSE)]
# Plot MSE versus k
ggplot(cv_mse) +
geom_line(aes(x=K, y=MSE)) +
geom_vline(xintercept=opt_k, color="red", linetype="dashed") +
scale_x_continuous(breaks = K) +
labs(x= "K (number of nearest neighbors)", y="MSE via 5-fold cv",
title = "Cross-validated MSE vsersus number of k-nearest neighbors") +
theme_classic() +
theme(plot.title = element_text( face = "bold"))
```
The red dashed line indicates that, under 5-fold cross-validation, the optimal number of nearest neighbors is $K=`r opt_k`$, and this is our choice of $K$ for training the KNN regression model. Such a small $K$ yields a relatively flexible (complex) fit.
```{r}
# fit knn with the optimal k
knn_fitted <- knn.reg(train_set,y=train_set$bone, test = train_set, k=opt_k,
algorithm = "kd_tree")$pred
# make predictions on test data
knn_predicted <- knn.reg(train_set,y=train_set$bone, test = test_set, k=opt_k,
algorithm = "kd_tree")$pred
# scatter plot of data with fitted
myscatterplot(fitted = knn_fitted)
```
The above figure shows a wiggly pattern, indicating a potential overfitting problem for the KNN model with $K=`r opt_k`$.
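To illustrate this point (an addition, not part of the original analysis), refitting with a larger neighborhood size, say $K = 10$, trades the wiggly curve for a smoother but more biased fit:
```{r}
# illustrative refit with a larger neighborhood size (K = 10)
knn_fitted_10 <- knn.reg(train_set, y = train_set$bone, test = train_set,
                         k = 10, algorithm = "kd_tree")$pred
myscatterplot(fitted = knn_fitted_10,
              title = "KNN fit with K = 10 (illustration only)")
```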
```{r}
# plot of observed vs. predicted
obs_pred_plot(pred = knn_predicted)
```
The figure shows that the KNN regression model made reasonable predictions on the test data.
```{r}
# prediction mean square error
(knn_mse <- mean((test_set$bone - knn_predicted)^2))
```
<!-- > Choice of kernel function: The optimal choice, under some standard assumptions, is the Epanechnikov kernel. This kernel has the advantages of some smoothness, compactness, and rapid computation. For this reason the **Epanechnikov kernel** would be use throughout where a kernel function is required by our choice of implementation procedures. It is a known fact that in kernel estimation, the choice of bandwidth is very critical to the performance of the estimator and therefore extremely important than the choice of kernel. -->
## Part (b): Kernel regression
We obtain a nonlinear fit via kernel regression smoothing with a **local plug-in bandwidth**, using the `lokern` R package, which selects the bandwidths automatically. The plug-in method estimates the bandwidths that are asymptotically optimal in terms of mean squared error. Using the default settings of `lokerns()`, a kernel regression was fitted to the training data.
<!-- *The type of kernel used in this function for the kernel regression estimation is not obvious.* -->
```{r}
# Kernel regression smoothing with adaptive local plug-in bandwidth selection.
library(lokern)
kern_reg <- lokerns(train_set$age, train_set$bone)
# distribution of bandwidths
summary(kern_reg$bandwidth)
# boxplot(kern_reg$bandwidth)
```
The output above summarizes the adaptive local plug-in bandwidths used in the kernel regression estimation: they range from about 3.45 to 8.20, with a median of 5.47.
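To see how the local plug-in bandwidth adapts across the range of age, it can be plotted against the evaluation grid (an added illustration; it assumes the fitted `lokerns` object stores the evaluation grid in `x.out` alongside the `bandwidth` vector summarized above):
```{r}
# local plug-in bandwidth as a function of age (illustration)
ggplot(data.frame(age = kern_reg$x.out, bw = kern_reg$bandwidth),
       aes(age, bw)) +
  geom_line() +
  labs(x = "Age", y = "Local plug-in bandwidth")
```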
```{r}
# scatter plot with the fitted y values
myscatterplot(fitted = predict(kern_reg, newdata= train_set)$y)
```
From the above figure, the kernel regression model captures the nonlinear relationship between jaw bone length and age. However, it appears to have learned the training data too closely to generalize well to unseen data.
```{r}
kern_reg_predicted <- predict(kern_reg, newdata=test_set)$y
# plot of observed vs. predicted
obs_pred_plot(kern_reg_predicted)
```
The points are scattered widely, with most deviating from the reference line, which shows that the predictions made by the kernel regression model are poor.
```{r}
(kern_reg_mse <- mean((test_set$bone - kern_reg_predicted)^2))
```
We have here a very large prediction MSE of `r kern_reg_mse`, confirming our previous observation that the predictions do not seem reasonable.
## Part (c): Local (cubic) polynomial regression
Using the `locpol()` function with a **Gaussian** kernel, we fit a local (cubic) polynomial regression to the training data. For the bandwidth, the `bw.nrd()` function from the `stats` package was applied to the age predictor, yielding a value of **7.429**.
```{r}
library(locpol)
# fit a local cubic polynomial:
h <- bw.nrd(train_set$age) # rule-of-thumb (normal reference) bandwidth for age
local_poly_fit <- locpol(bone ~ age, data = train_set, deg = 3, bw=h, kernel = gaussK)
```
```{r}
# observed vs. predicted on test set
lopredicted <- locpol(bone ~ age, data = train_set, xeval = test_set$age,
deg = 3, bw=h, kernel = gaussK)
obs_pred_plot(lopredicted$lpFit$bone)
```
The above figure shows that the local polynomial regression model did not predict most of the test observations accurately, but the predictions do not appear too unreasonable.
```{r}
(lomse <- mean((test_set$bone - lopredicted$lpFit$bone)^2))
```
The prediction MSE is obtained as `r lomse`.
# Regression/smoothing splines
## Part (a): Regression spline (natural cubic spline)
We used the `ns()` function to generate the basis matrix for a natural cubic spline with 3 degrees of freedom, which places the internal knots at suitably chosen quantiles of the age variable.
No intercept was included in the basis functions.
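As a quick check (added here), the internal knots that `ns()` selects at quantiles of age can be inspected directly; with `df = 3` and no intercept, two internal knots are chosen:
```{r}
library(splines)
# internal knots placed by ns() at quantiles of the age variable
attr(ns(train_set$age, df = 3), "knots")
```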
```{r}
library(splines)
cubic_spline_fit <- lm(bone ~ ns(age, df=3), data = train_set)
myscatterplot(cubic_spline_fit)
```
The above figure suggests a good fit, as the fitted curve seems to capture the nonlinear relationship reasonably well.
```{r}
c_spline_predicted <- predict(cubic_spline_fit, newdata = test_set)
obs_pred_plot(c_spline_predicted)
```
Given the above plot, the regression spline model was not able to predict most of the observations in the test set accurately, since a good number of the data points lie far from the reference line. However, the predictions do not appear extremely unreasonable.
```{r}
(c_spline_mse <- mean((test_set$bone - c_spline_predicted)^2))
```
For this model, the prediction MSE is `r c_spline_mse`.
## Part (b): Smoothing Splines
By setting the `cv` argument of the `smooth.spline()` function to **FALSE**, the generalized cross-validation (GCV) procedure was used to determine the optimal *smoothing parameter*, $spar = 0.7175$, with equivalent degrees of freedom $df = 5.865$.
```{r}
ss_fit <- smooth.spline(train_set$age, train_set$bone, cv=F)
# checking the optimal tuning parameter
ss_fit
# scatter plot of data with fitted curve
myscatterplot(fitted = predict(ss_fit, train_set$age)$y)
```
It is clear from the above figure that the smoothing spline fits the data well.
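As a small addition (not part of the original analysis), the GCV-selected tuning parameters quoted above can be extracted directly from the fitted object:
```{r}
# smoothing parameter, equivalent degrees of freedom, and penalty chosen by GCV
c(spar = ss_fit$spar, df = ss_fit$df, lambda = ss_fit$lambda)
```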
```{r}
ss_predicted <- predict(ss_fit, test_set$age)
obs_pred_plot(ss_predicted$y)
```
Although a good number of the points deviate from the reference line, they are not far enough away to suggest unreasonable predictions. This result is comparable to the one in part (a).
```{r}
(ss_mse <- mean((test_set$bone - ss_predicted$y)^2))
```
The prediction MSE is `r ss_mse`.
# Prediction MSE measures
```{r}
data.frame(Model = c("Exponential", "KNN regression", "Kernel regression",
                     "Local cubic polynomial regression", "Regression splines",
                     "Smoothing splines"),
           `Prediction MSE` = c(reduced_nls_mse, knn_mse, kern_reg_mse, lomse,
                                c_spline_mse, ss_mse),
           check.names = FALSE) %>%
kable(booktabs=T, linesep="") %>%
kable_styling(latex_options = c("HOLD_position")) %>%
kable_classic()
```
From the above table of results, the KNN regression model gives the best predictive performance for this data set, since its prediction MSE is the lowest.