# Machine learning {#four}
This week you will learn about machine learning for classification using _R_. Objectives are:
- To provide an overview of the underlying concepts of machine learning for classification.
- To provide an introduction to some popular classification algorithms.
- To explore these classification algorithms using various packages in _R_.
- To apply various classification algorithms to the statewide COVID-19 dataset.
## Day 22 (Monday) Zoom check-in
Here is an overview of what we'll cover in today's Zoom session:
- Overview of the COVID-19 Machine Learning Dataset (10 minutes)
- Overview of Machine Learning for Classification (20 minutes)
- Introduction to the Random Forest algorithm (5 minutes)
- A Random Forest Example in _R_ using COVID-19 Data (25 minutes)
### A COVID-19 Dataset for Machine Learning
We'd like to predict the severity of COVID-19 in a given state using statewide _feature_ data like population, urban density, number of hospital beds, date of stay-at-home order, etc. We've already seen that we can get the information about cases and deaths from the New York Times GitHub page. However, gathering corresponding statewide feature data requires quite a bit of hunting through various public websites. Consequently, we're going to skip over the painstaking process of marshalling the feature data and just provide you with a dataset that is already nice and prepped for machine learning.
You'll work with two `.csv` files - a *data* file that contains a variety of statewide data, and a *metadata* file that describes the various columns of the data file. This combination of data and metadata files is a common way of sharing datasets.
To give you an idea of what was involved in assembling the data and metadata file, a summary of the data collection and processing steps is given below:
- First, a snapshot of the New York Times (NYT) COVID-19 data from April 27th was downloaded from the [nytimes github repo][] and stored on local disk.
```{r}
url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
file <- "us-counties_04_27_2020.csv"
destination <- file.path("workdir", file)
download.file(url, destination)
```
- The NYT COVID-19 data was processed using _R_:
- The cases and deaths in `us-counties_04_27_2020.csv` were aggregated into statewide values.
- The death rate was calculated and categorized according to an 8-point severity index.
- Finally, the statewide data (augmented with death rate and severity index) was exported as a `.csv` file.
```{r}
library(readr)
library(dplyr)
## get data from file
covid_data_file <- file.path("./workdir", "us-counties_04_27_2020.csv")
## read the data as a tibble
us_data <- read_csv(covid_data_file)
## aggregate by county and state
county_state <-
us_data %>%
group_by(county, state) %>%
summarize(cases = max(cases), deaths = max(deaths))
## aggregate by state
state <-
county_state %>%
group_by(state) %>%
summarize(cases = sum(cases), deaths = sum(deaths))
## calculate death rate
state <-
state %>%
mutate(death_rate = 100.00 * deaths / cases)
# Assign the following severity index using cut:
## PSI Death.Rate
## 1 < 0.1%
## 2 0.1% - 0.5%
## 3 0.5% - 1.0%
## 4 1.0% - 2.0%
## 5 2.0% - 4.0%
## 6 4.0% - 6.0%
## 7 6.0% - 8.0%
## 8 >8.0%
state <-
state %>%
mutate(
severity_index = cut(
death_rate,
breaks = c(0.0, 0.1, 0.5, 1.0, 2.0, 4.0, 6.0, 8.0, 100.0),
labels = 1:8
)
)
## write out as csv
my_out_file <- file.path("./workdir", "covid_data.csv")
write_csv(state, my_out_file)
```
- The resulting statewide COVID-19 *label* data (i.e. what we would like to predict) was augmented with 32 statewide *features*, including population, percent urban, number of hospital beds, etc. Feature data was collected from a variety of sources, including the Centers for Disease Control and Prevention, the American Heart Association, the U.S. Census Bureau, etc. In some cases the feature data was available for direct download (e.g. as a `.csv` file) and in other cases the feature data was manually harvested (e.g. cut-and-paste from websites).
- The augmented (i.e. features + labels) `.csv` file was split into two `.csv` files that you will need to download:
- [statewide_covid_19_data_04_27_2020.csv][]: This file contains the final COVID-19 machine learning dataset, but features and labels are coded so that feature columns are named `X01`, `X02`, `X03`, etc. and label columns are named `Y01`, `Y02`, `Y03`, etc.
- [statewide_covid_19_metadata_04_27_2020.csv][]: This file maps the column names in the data file to more meaningful names and descriptions (including units) of the associated variables. For example `X01` is `Pct_Sun` and has a description of `Percent sunny days`. This is known as *metadata* - data that describes other data.
Click on the links above to download the data and metadata files that you'll need for the machine learning examples presented throughout the week.
[nytimes github repo]: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
[statewide_covid_19_data_04_27_2020.csv]: assets/statewide_covid_19_data_04_27_2020.csv
[statewide_covid_19_metadata_04_27_2020.csv]: assets/statewide_covid_19_metadata_04_27_2020.csv
### A Machine Learning Primer
Machine Learning (ML) may be defined as using computers to make inferences about data.
- A mapping of inputs to outputs, $Y = f(X, \beta)$
_ML for Classification_ refers to algorithms that map inputs to a discrete set of outputs (i.e. classes or categories).
- For example, predicting health risk (mild, moderate, severe) based on patient data (height [m], weight [kg], age [years], smoker [yes/no], etc.)
- Or predicting [pandemic severity index][] (PSI) of COVID-19 in a state based on statewide population data.
```{r echo=FALSE}
cats <- 1:5
frate <- c("< 0.1% ", "0.1% - 0.5% ", "0.5% - 1.0% ", "1.0% - 2.0% ", "> 2.0% ")
example <- c("seasonal flu", "Asian flu", "n/a", "n/a", "Spanish flu")
df <- data.frame(PSI = cats, Death.Rate = frate, Example = example)
print(df, row.names = FALSE, right = FALSE)
```
- Predictions are typically expressed as a vector of probabilities.
- e.g. Pr(cancerous) vs. Pr(benign)
- e.g. Pr(PSI=1), Pr(PSI=2), ..., Pr(PSI=5)
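As a concrete illustration, a five-class PSI classifier might return a probability vector like the following (hand-written numbers, not real model output):
```{r}
# a hypothetical probability vector for a single prediction
p <- c("PSI=1" = 0.05, "PSI=2" = 0.10, "PSI=3" = 0.55,
       "PSI=4" = 0.20, "PSI=5" = 0.10)
sum(p)              # probabilities sum to one
names(which.max(p)) # the predicted class is the most probable one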
#### A "Black Box" View of Machine Learning {-}
The diagram below illustrates the machine learning concept in terms of a "black box" model that converts inputs into predictions.
```{r echo = FALSE, fig.align="center", out.width="50%"}
knitr::include_graphics('images/ml_black_box_generic.png')
```
#### A "Black Box" Example {-}
The diagram below illustrates a more concrete machine learning model that converts height, weight, and age into a prediction of whether or not the individual is male or female.
```{r echo = FALSE, fig.align="center", out.width="50%"}
knitr::include_graphics('images/ml_black_box_example.png')
```
#### Some Important Machine Learning Terms {-}
The above example highlights some important machine learning terminology:
- *Features* (*X*): Features are the inputs to the model. They are also known as descriptive attributes or explanatory variables. In terms of _R_, a single set of features corresponds to a tuple or row of data and a complete set of feature data maps nicely to a data frame or tibble.
- *Parameters* (*$\beta$*): Parameters are internal variables of the selected machine learning algorithm. They are also known as coefficients or training weights. These internal parameters need to be adjusted to minimize the deviation between predictions and observations. The machine learning algorithms and packages that we'll be using in _R_ take care of this minimization process for us.
- *Labels* (*Y*): Labels are the outputs of the algorithm, corresponding to the categories you are attempting to predict.
- *Training Data*: Training data is a data set containing paired observations of inputs (i.e. features) and outputs (i.e. labels). Training data is also known as measurement data, observation data or calibration data.
- *Training*: Training is the process of adjusting internal algorithm parameters (*$\beta$*) to obtain the best possible match between training data and corresponding model predictions. Training is also known as model calibration or parameter estimation.
An example set of training data that could be used in the gender prediction example is given below. The first three columns contain feature data and the last column contains label data. We could use this data to train a model to predict a person's gender based on their height, weight, and age.
```{r echo=FALSE}
h <- c(1.69, 1.74, 1.92, 1.80, 1.59, 1.85, 1.75, 1.96, 1.85, 1.78, 1.74, 1.81, 1.73)
w <- c(62, 76, 82, 100, 47, 75, 63, 83, 39, 58, 70, 57, 78)
a <- c(30, 27, 25, 41, 24, 26, 33, 33, 32, 28, 30, 26, 32)
g <- c("male", "female", "male", "male", "female", "male", "female",
       "male", "male", "female", "female", "female", "male")
df <- data.frame(height_m = h, weight_kg = w, age_y = a, gender = g)
print(df, row.names = FALSE, right = FALSE)
```
- *Test Data*: Like training data, test data is a data set containing paired observations of inputs (i.e. features) and outputs (i.e. labels). However, test data is deliberately _not included_ in the training process. Performance measures computed using test data help to quantify the expected performance of the model if/when it is applied to unlabeled data.
Upon completion of training, it is important to evaluate the quality of the model and its skill or ability at making correct predictions. Some terms related to this evaluation process are defined below:
- *Classification Accuracy*: Classification accuracy is the ratio of correct predictions to the total predictions. Classification accuracy can be computed using the training data set or using the test data set.
- *Confusion Matrix*: The confusion matrix is a more detailed summary (relative to classification accuracy) of the performance of a classification algorithm. The diagonals of the confusion matrix count how often the algorithm yields the correct classification for each class. The off-diagonal entries count how often the algorithm confuses one class with another.
The figure below illustrates the classification accuracy and confusion matrix for an example that attempts to classify images of fruits.
```{r echo = FALSE, fig.align="center", out.width="50%"}
knitr::include_graphics('images/ml_ca_fruits_example.png')
```
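As a minimal sketch of how these two measures can be computed in _R_ (using small, hand-written `actual` and `predicted` vectors in place of real model output):
```{r}
# hypothetical true and predicted labels for ten test observations
actual    <- factor(c("apple", "apple", "pear", "pear", "plum",
                      "plum", "apple", "pear", "plum", "apple"))
predicted <- factor(c("apple", "pear", "pear", "pear", "plum",
                      "apple", "apple", "pear", "plum", "apple"))
# confusion matrix: rows are actual classes, columns are predicted classes
confusion <- table(actual, predicted)
confusion
# classification accuracy: correct predictions / total predictions
sum(diag(confusion)) / length(actual)
```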
#### The Machine Learning Process {-}
Now that we've defined some of the important machine learning terminology, let's take a 30,000-foot view of the overall machine learning process. This is a general description of the process that you can follow each time you build and use a machine learning model. The process is illustrated in the figure below (with credit to [Jinesh Maloo][]):
```{r echo = FALSE, fig.align="center", out.width="75%"}
knitr::include_graphics('images/ml_process_maloo.png')
```
- Step 1: Prepare labeled data for training and validation.
- Step 2: Select a machine learning algorithm (i.e. model).
- Step 3: Train the model.
- Step 4: Evaluate model performance.
- If model is useful:
- Step 5: Apply to unlabeled data.
- Else (needs improvement):
- Collect more data (go back to Step 1).
- Revise model (go back to Step 2).
[pandemic severity index]: https://en.wikipedia.org/wiki/Pandemic_severity_index
[Jinesh Maloo]: https://blog.usejournal.com/machine-learning-for-beginners-from-zero-level-8be5b89bf77c
### The Random Forest Algorithm
The random forest algorithm is a popular choice for machine learning.
- Over 20 _R_ packages implement some form of the algorithm.
- We'll be using the `randomForest` package.
- The algorithm is like `bagging` (bootstrap aggregating) regression trees, but the regression trees are de-correlated (a minimal sketch of the algorithm in action follows below).
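Before we apply the algorithm to COVID-19 data, here is a minimal sketch of `randomForest()` on the built-in `iris` dataset (purely illustrative; `iris` is not part of the course dataset):
```{r}
library(randomForest)  # install.packages("randomForest") first if needed
set.seed(42)                                  # make the random trees reproducible
fit <- randomForest(Species ~ ., data = iris) # classification (Species is a factor)
fit                                           # prints OOB error and confusion matrix
```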
The figure below illustrates one possible `tree` in a `random forest` for a gender prediction model. In computer science terminology, each split in the figure is a `branch` of a graph `tree`. In simple terms, the split points are randomly generated and the resulting `trees` combine to form a `random forest`.
```{r echo = FALSE, fig.align="center", out.width="50%"}
knitr::include_graphics('images/ml_rf_tree_gender.png')
```
The figure below illustrates a set of 6 `trees` that make up a `random forest` for predicting housing prices. The image is courtesy [Bradley Boehmke at the University of Cincinnati][]. Examine the figure closely and notice that:
- The split variables can differ across trees (not all variables are included in all trees).
- The split variables can differ within trees (not all paths consider the same set of variables).
- The order of splits can differ.
- The split values can differ.
```{r echo = FALSE, fig.align="center", out.width="100%"}
knitr::include_graphics('images/ml_rf_tree_housing.png')
```
The job of the random forest algorithm is to determine the optimal set of trees for your data set, including the splitting configuration of each tree (i.e. order, values, etc.).
Now you have a basic understanding of machine learning and the random forest algorithm. You're probably excited to get going with applying the algorithm! But first, we need to have a data set to work with. In the next section you'll learn about a data set that can be used for predicting the severity of COVID-19 in a state.
[Bradley Boehmke at the University of Cincinnati]: https://uc-r.github.io/random_forests
### A Random Forest Example Using COVID-19 Data
Let's apply the random forest algorithm to the COVID-19 dataset. We'll build out the required _R_ code in sections. To get started, open a new _R_ script in _RStudio_ and name it `covid_19_rf.R`. Enter the code below, but omit lines that begin with a double-hash (`##`) because these are the expected output:
```{r}
#
# Part 1 - load the data
#
library(randomForest)
library(readr)
library(dplyr)
data_file <- file.path("assets", "statewide_covid_19_data_04_27_2020.csv")
df <- read_csv(data_file)
# coerce severity to a factor (so RF algorithm uses classification)
df <-
df %>%
mutate(Y04 = as.factor(Y04))
metadata_file <- file.path("assets", "statewide_covid_19_metadata_04_27_2020.csv")
mdf <- read_csv(metadata_file)
mdf
```
Try to run the code. You may get an error about missing the `randomForest` package. You can install it from the RStudio console (see below) or using the installer in the RStudio `packages` pane.
```{r eval=FALSE}
install.packages("randomForest")
```
Now we've loaded the data and metadata file. Let's pick a subset of 5 of the features and use them to try and predict the pandemic severity index (i.e. `Y04`). Add the `Part 2` code below to your RScript but omit lines that begin with a double-hash (`##`) :
```{r}
#
# Part 2 - select features and label
#
# describe all possible features and labels
print(mdf, n = nrow(mdf))
# select some features and the severity index label
my_x <- c("X01","X10","X12","X13","X23")
my_y <- "Y04"
my_xy <- c(my_x, my_y)
# get descriptions of the selected features and label
mdf %>% filter(Code %in% my_xy)
# subset the dataframe
rf_df <- df %>% select(all_of(my_xy))
```
Now we’ll add code to create and train a basic Random Forest model. Add the `Part 3` code below to your RScript but omit lines that begin with a double-hash (`##`):
```{r}
#
# Part 3 - create and train the model
#
# split into train (75%) and test (25%) datasets
#
# note that `row_number()` generates a vector of row numbers, and
# > c(1, 2, 3, 4, 5, 6) %% 4 is
# [1] 1 2 3 0 1 2
train <- rf_df %>% filter((row_number() %% 4) %in% 1:3)
test <- rf_df %>% filter(row_number() %% 4 == 0)
# create and train the RF model
model <- randomForest(Y04 ~ X01 + X10 + X12 + X13 + X23,
data = train)
# show results, includes confusion matrix for training data
model
# measure of parameter importance
importance(model)
```
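A side note on the train/test split above: the modulo-based split is deterministic, which makes runs reproducible but ties the test set to row order. A common alternative (a sketch, not part of the course script, assuming `rf_df` from `Part 2`) is a random split:
```{r eval=FALSE}
set.seed(123)                                   # make the random split reproducible
idx   <- sample(nrow(rf_df), 0.75 * nrow(rf_df))
train <- rf_df[idx, ]                           # 75% for training
test  <- rf_df[-idx, ]                          # 25% held out for testing
```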
At this point the model is trained and the next step is to evaluate its usefulness at making predictions. Let's see how the model does at predicting the labels of the test dataset. Remember that the test data was _not_ used during the training exercise. As such, the confusion matrix and classification accuracy associated with the test dataset provide a useful check of the skill of the model. Add the `Part 4` code below to your RScript but omit lines that begin with a double-hash (`##`):
```{r}
#
# Part 4 - evaluate the model using test data
#
# extract test predictions
preds <- predict(model, test)
# confusion matrix
actual <- factor(test$Y04)
predicted <- factor(preds)
common_levels <- sort(unique(c(levels(actual), levels(predicted))))
actual <- factor(actual, levels = common_levels)
predicted <- factor(predicted, levels = common_levels)
confusion <- table(actual, predicted)
print(confusion)
# classification accuracy
accuracy <- sum(diag(confusion)) / nrow(test)
cat("Classification Accuracy = ", accuracy, "\n")
```
Save your script and run it. Take a look at the results for the test dataset - the classification accuracy is well under 50%. Furthermore, there are systematic failures in the confusion matrix. It's not a very good model. The most likely culprit is that the set of features is inadequate for making the desired prediction. We should re-run the model using different or additional features.
## Day 23 - Support Vector Machines
For today's independent work you will learn about the Support Vector Machine (SVM) algorithm and apply it to the COVID-19 data that you worked with on Monday.
The SVM algorithm seeks to determine an optimal hyperplane that separates labeled observations.
- Hyperplanes can be linear or non-linear.
- Support vectors are data points lying closest to the optimal hyperplane.
- The `e1071` package in R provides an implementation of SVM.
The figure below illustrates a linear SVM. Line $H_3$ provides the optimal separation between the white and black data points. Points with perpendiculars to $H_3$ are the support vectors for the dataset. The SVM algorithm classifies unlabeled data points by examining their location with respect to the optimal hyperplane. Data points _above_ line $H_3$ would be classified as "black" and data points _below_ the line would be classified as "white".
```{r echo = FALSE, fig.align="center", out.width="50%"}
knitr::include_graphics('images/ml_svm_example.png')
```
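As with the random forest, a minimal sketch of `svm()` on the built-in `iris` dataset looks like this (purely illustrative):
```{r}
library(e1071)
fit <- svm(Species ~ ., data = iris) # classification because Species is a factor
summary(fit)                         # kernel, cost, and support vector counts
```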
### An SVM Example using COVID-19
We've already laid a large part of the groundwork for machine learning with the random forest example. With a few modifications, the `covid_19_rf.R` script can be adapted to use an SVM algorithm instead of Random Forest.
- Open your `covid_19_rf.R` script in _RStudio_
- Click `File --> Save As ...` and name the file `covid_19_svm.R`
- In `Part 1` of `covid_19_svm.R`, replace `library(randomForest)` with `library(e1071)`. This will load the `svm` algorithm instead of the `randomForest` algorithm.
- In `Part 3` of the code, replace `randomForest()` with `svm()` and add the following argument to the `svm()` function: `probability = TRUE`. The `svm()` code should look something like this:
```{r eval=FALSE}
model <- svm(Y04 ~ X01 + X10 + X12 + X13 + X23,
data = train,
probability = TRUE)
```
- The `e1071` package does not provide useful implementations of `print(model)` or `importance(model)` for `svm` models. Comment out, or delete, those lines of `Part 3` and replace with `summary(model)`. When you are finished, the final section of the `Part 3` code should look something like this:
```{r eval = FALSE}
# show results, includes confusion matrix for training data
# print(model)
#
# measure of parameter importance
# importance(model)
#
# summarize the trained SVM
summary(model)
```
That's it! The rest of the code (i.e. `Part 4`) can be re-used. Save your `covid_19_svm.R` script and run it. How does the SVM algorithm perform in comparison with the Random Forest algorithm?
## Day 24 - the $k$-Nearest Neighbors Algorithm
For today's independent work you will learn about the $k$-Nearest Neighbors (KNN) algorithm and apply it to the COVID-19 data.
For a given unlabeled data point ($X^*$) the KNN algorithm identifies the nearest $k$ labeled data points.
- Euclidean distance is typical
- _Normalization of data is necessary to prevent biased distances._
- The label of $X^*$ is predicted to be the most frequently occurring label among the $k$ nearest neighbors.
- The `class` package in R provides an implementation of KNN.
The figure below illustrates the KNN approach and is courtesy of [Antti Ajanki][]. For $k=3$, the neighborhood contains 2 triangles and 1 square so we’d predict $X^*$ is a triangle. For $k=5$, the neighborhood contains 3 squares and 2 triangles so we’d predict $X^*$ is a square.
```{r echo = FALSE, fig.align="center", out.width="50%"}
knitr::include_graphics('images/ml_knn_example.png')
```
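A minimal sketch of `knn()` on the built-in `iris` dataset (purely illustrative; the `iris` features all share the same units, so normalization matters less there than for our mixed-unit COVID-19 features):
```{r}
library(class)
set.seed(1)
idx  <- sample(nrow(iris), 100)        # 100 training rows, 50 test rows
pred <- knn(train = iris[idx, 1:4],    # feature columns only
            test  = iris[-idx, 1:4],
            cl    = iris$Species[idx], # training labels
            k     = 3)
mean(pred == iris$Species[-idx])       # classification accuracy on held-out rows
```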
With a few modifications, the `covid_19_svm.R` script can be adapted to use a KNN algorithm instead of SVM.
- Open your `covid_19_svm.R` script in _RStudio_
- Click `File --> Save As ...` and name the file `covid_19_knn.R`
- In `Part 1` of `covid_19_knn.R`, replace `library(e1071)` with `library(class)`. This will load the `knn` algorithm instead of the `svm` algorithm.
- The KNN algorithm requires the feature data to be normalized. In `Part 1` of `covid_19_knn.R`, add the following line immediately after the `df <- df %>% mutate(Y04 = as.factor(Y04))` line, as shown below. If necessary, use `install.packages('BBmisc')` to install the `BBmisc` package, which provides the `normalize()` function.
```{r eval=FALSE}
df <- df %>% mutate(Y04 = as.factor(Y04))
df <- BBmisc::normalize(df, method = "range") # add this line
```
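As a side note, if you'd rather not add the `BBmisc` dependency, the same range scaling can be sketched in base _R_ (an illustrative alternative, not the course solution):
```{r eval=FALSE}
# rescale every numeric column to [0, 1], like normalize(method = "range")
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], rescale01)
```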
- In `Part 3` of the code, replace `svm()` with `knn()` and adjust the call to `knn()` so that it looks like this:
```{r eval=FALSE}
model <- knn(select(train, all_of(my_x)),
select(test, all_of(my_x)),
train$Y04,
k = 7,
prob = TRUE)
```
- The `class` package does not provide useful implementations of `print(model)`, `importance(model)`, or `summary(model)` for `knn` models. Comment out, or delete, those lines of `Part 3`. When you are finished, the final section of the `Part 3` code should look something like this:
```{r eval = FALSE}
# show results, includes confusion matrix for training data
# print(model)
#
# measure of parameter importance
# importance(model)
#
# summarize the trained SVM
# summary(model)
```
- In `Part 4` of the script, replace:
```{r eval=FALSE}
preds <- predict(model, test)
```
with:
```{r eval=FALSE}
preds <- as.data.frame(model)[,1]
```
That's it! The rest of the code can be re-used. Save your `covid_19_knn.R` script and run it. How does the KNN algorithm perform in comparison with the Random Forest and SVM algorithms?
[Antti Ajanki]: https://commons.wikimedia.org/wiki/File:KnnClassification.svg
## Day 25 - Exploring the KNN Algorithm
For today's independent work you will explore the algorithm settings of the KNN algorithm.
Open your `covid_19_knn.R` script and locate the line that creates the KNN model (`model <- knn(...)`). Adjust the number of neighbors so that `k = 5` and re-run the script. Record the classification accuracy. Try again, but use `k = 3`.
Re-run each experiment several times (e.g. collect 10 trials of `k=5` and 10 trials of `k=3`, and so on). For a given value of `k`, do you get the same result each time?
Continue with these numerical experiments until you've filled out the table below:
```{r eval = FALSE}
## k_value average_classification_accuracy
## 1 ???%
## 3 ???%
## 5 ???%
## 7 ???%
## 9 ???%
## 11 ???%
## 13 ???%
## 15 ???%
## 17 ???%
```
What do the results suggest about the most appropriate setting for `k`? Can you think of a better way to perform the numerical experiments and collect the results?
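One possibility is to automate the sweep with a loop. Here is a sketch (assuming `train`, `test`, and `my_x` from your `covid_19_knn.R` script are already in scope; `knn()` breaks ties at random, so the replicates will differ):
```{r eval=FALSE}
ks <- seq(1, 17, by = 2)
avg_acc <- vapply(ks, function(k) {
  # average classification accuracy over 10 trials for this k
  mean(replicate(10, {
    preds <- knn(select(train, all_of(my_x)),
                 select(test, all_of(my_x)),
                 train$Y04, k = k)
    mean(preds == test$Y04)
  }))
}, numeric(1))
data.frame(k_value = ks, average_classification_accuracy = avg_acc)
```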
## Day 26 (Friday) Zoom check-in
Today we'll check how you're doing with using machine learning in _R_. Then we'll get you prepared for weekend activities, where you'll continue to explore modeling using the COVID-19 dataset.
### Review and troubleshoot (25 minutes)
- Has everyone had a chance to try out at least one of the machine learning algorithms?
- Has anyone tried additional or alternative combinations of features?
- What is the best classification accuracy that you have been able to obtain?
- What parameters appear to be the most important?
### This weekend (25 minutes)
#### ROC/AUC - Another Measure of Machine Learning Performance {-}
We've already seen how classification accuracy and the confusion matrix give an indication of the performance of a trained machine learning algorithm. It's also good practice to examine the "ROC curve" and related "AUC" metrics.
- *ROC curve*: As illustrated in the figure below, the ROC (Receiver Operating Characteristic) curve plots the true positive (TP) vs. false positive (FP) rate at various probability thresholds. In the figure, the dashed blue line represents a hypothetical ROC curve for some machine learning model and the solid red line is the curve for a "non-informative" model (i.e. a model that makes a uniform random guess). As such, we'd like the blue curve to be as far above the red curve as possible.
```{r echo = FALSE, fig.align="center", , out.width="75%"}
knitr::include_graphics('images/ml_roc_curve_example.png')
```
- *AUC*: AUC stands for "area under curve" and is the area under the ROC curve. In the previous figure, the AUC would be the area under the dashed blue curve. Values of AUC quantify the degree to which an ROC curve lies above (or below) the "non-informative" curve. Some interesting AUC values:
- AUC = 0.0: the model is always wrong (with respect to TP vs. FP)
- AUC = 0.5: the model is no better than guessing (i.e. the model matches the red "non-informative" curve in the figure)
- AUC = 1.0: the model is always right (with respect to TP vs. FP)
For a problem with multiple classes (as opposed to a binary True/False, Male/Female, or Yes/No problem) we can compute the ROC curve and AUC measures using a "one vs. all" approach:
- First, extract predicted probabilities from the trained model (the scores).
- Next, extract actual classification for each category.
- Finally, leverage three functions from the `ROCR` package:
- `prediction()`: pairs the scores with the actual labels
- `performance()`: generates TPR, FPR, and AUC measures through two separate calls
- `plot()`: generates a TPR vs. FPR plot
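To see these three functions in their simplest binary form first, here is a minimal sketch with made-up scores and labels (not COVID-19 data):
```{r}
library(ROCR)
# hypothetical scores and true binary labels for ten observations
score  <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1)
actual <- c(1, 1, 0, 1, 1, 0, 0, 1, 0, 0)
pred <- prediction(score, actual)
perf <- performance(pred, "tpr", "fpr")  # true positive vs. false positive rates
plot(perf, main = "ROC Curve (toy example)")
as.numeric(performance(pred, measure = "auc")@y.values)  # area under the curve
```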
The ROC/AUC calculation is fairly involved so we'll create a helper function for it and then incorporate the helper function into our machine learning scripts.
```{r}
#
# Part 5 - ROC/AUC
#
# roc_one_vs_all()
# A helper function to compute one vs. all ROC/AUC for a given level (i)
roc_one_vs_all <- function(i) {
if(is.na(sum(probs))){
return(NA)
}
actual <- as.numeric(y_test[[1]] == i)
score <- probs[,i]
pred <- ROCR::prediction(score, actual)
perf <- ROCR::performance(pred, "tpr", "fpr")
ROCR::plot(perf,
main="ROC Curve",
col=cols[as.numeric(i)],
add = i != lvls[[1]])
# calculate the AUC and print
auc_i <- ROCR::performance(pred, measure="auc")
as.numeric(auc_i@y.values)
}
# prepare data for computing ROC/AUC
x_test <- select(test, all_of(my_x))
y_test <- select(test, all_of(my_y))
# how we obtain probs depends on the algorithm
if (inherits(model, "randomForest"))
{
probs <- predict(model, x_test, type='prob')
} else if (inherits(model, "svm"))
{
probs <- predict(model, x_test, probability = TRUE)
probs <- attributes(probs)
probs <- as.data.frame(probs$probabilities)
probs <- probs[,order(colnames(probs))]
} else # the KNN alg requires approximation of probabilities
{
# highest predicted probabilities
pnrst <- attributes(model)$prob
# corresponding one-based categories
y_one <- as.numeric(min(levels(df$Y04))) - 1
vpreds <- as.integer(as.character(preds)) - y_one
# map probabilities into a matrix of zeroes
probs <- matrix(0.00,
nrow = length(pnrst),
ncol = length(levels(df$Y04)))
probs[cbind(seq_along(vpreds), vpreds)] <- pnrst
# coerce to data frame
probs <- data.frame(probs)
colnames(probs) <- levels(df$Y04)
}
lvls <- unique(as.character(y_test[[1]]))
cols <- c("red","orange","yellow","green",
"blue","purple","violet","black")
# vapply helper function across levels
auc <- vapply(lvls, roc_one_vs_all, numeric(1))
# tack on legend and non-informative line
legend("bottomright",
legend=paste("PSI =",lvls),
col=cols[as.numeric(lvls)],
lty=1,
cex=1.0)
lines(x=c(0,1),
y=c(0,1),
lty=2)
# display AUC metrics
print(tibble(lvls, auc))
```
Assuming you've been following along with the daily activities, you can add this `Part 5` code to any of your algorithm scripts (e.g. `covid_19_rf.R`, `covid_19_knn.R`, `covid_19_svm.R`, etc.). The code uses the `inherits()` function to adapt the output of each individual algorithm into a form that is suitable for the ROC/AUC calculation.
## Day 27
Today you'll explore different combinations of features in your COVID-19 model.
- Select another 5 features and adjust `Part 2` and `Part 3` of your `covid_19_rf.R` script. In `Part 2` you'll need to adjust the `my_x` variable and in `Part 3` you'll need to adjust the model formula (e.g. `Y04 ~ X01 + ....`). Re-run the script and record the error rate, confusion matrix, and measures of parameter importance.
- Repeat the above step after adding an additional 5 features.
- How does the performance of the model change as more features are included?
- Does any particular parameter stand out in terms of importance?
- What does the most important parameter correspond to in terms of the metadata?
## Day 28
Today you'll explore making some tweaks to the random forest model to see if you can improve its performance on the COVID-19 dataset. The algorithm parameters that you'll be adjusting are described below:
- `ntree` : The number of trees to grow. Default is 500.
- `mtry` : The number of variables randomly sampled as candidates at each split. Default is `sqrt(p)` where `p` is the number of features included in the model.
Let's perform some numerical experiments to explore how these algorithm parameters affect model performance:
- Set up a random forest model that has at least 16 features (see instructions from yesterday's activity).
- In `Part 3` of your `covid_19_rf.R` script, add the following arguments to the `randomForest()` function:
- `ntree = 1000`
- `mtry = 8`
- Re-run the script and record the error rate, confusion matrix, and measures of parameter importance.
- Repeat the above process using:
- `ntree = 2000`
- `mtry = 2`
- Does adjusting the parameters affect the model performance? If so, what observations can you make?
- Can you think of a "better" way to evaluate the influence of algorithm parameters? One such approach is sketched below.
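One such approach is a small grid search over the two parameters. Here is a sketch using the built-in `iris` data (substitute your own data and formula; the out-of-bag error is used, so no separate test set is needed):
```{r}
library(randomForest)
set.seed(7)
grid <- expand.grid(ntree = c(500, 1000, 2000), mtry = c(1, 2, 4))
grid$oob_error <- apply(grid, 1, function(g) {
  fit <- randomForest(Species ~ ., data = iris,
                      ntree = g[["ntree"]], mtry = g[["mtry"]])
  fit$err.rate[g[["ntree"]], "OOB"]  # OOB error rate after the last tree
})
grid
```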
Congratulations - you made it through a week of machine learning boot camp! You can download completed scripts (i.e. `Part 1` through `Part 5`) for each algorithm using the links below:
- [Complete Random Forest Example](assets/covid_19_rf.R)
- [Complete Support Vector Machine Example](assets/covid_19_svm.R)
- [Complete $k$-Nearest Neighbors Example](assets/covid_19_knn.R)