# Machine learning {#four}
This week you will learn about machine learning for classification using _R_. Objectives are:
- To provide an overview of the underlying concepts of machine learning for classification.
- To provide an introduction to some popular classification algorithms.
- To explore these classification algorithms using various packages in _R_.
- To apply various classification algorithms to the statewide COVID-19 dataset.
## Day 22 (Monday) Zoom check-in
Here is an overview of what we'll cover in today's Zoom session:
- Overview of the COVID-19 Machine Learning Dataset (10 minutes)
- Overview of Machine Learning for Classification (20 minutes)
- Introduction to the Random Forest algorithm (5 minutes)
- A Random Forest Example in _R_ using COVID-19 Data (25 minutes)
### A COVID-19 Dataset for Machine Learning
We'd like to predict the severity of COVID-19 in a given state using statewide _feature_ data like population, urban density, number of hospital beds, date of stay-at-home order, etc. We've already seen that we can get the information about cases and deaths from the New York Times GitHub page. However, gathering corresponding statewide feature data requires quite a bit of hunting through various public websites. Consequently, we're going to skip over the painstaking process of marshalling the feature data and just provide you with a dataset that is already nice and prepped for machine learning.
You'll work with two `.csv` files - a *data* file that contains a variety of statewide data, and a *metadata* file that describes the various columns of the data file. This combination of data and metadata files is a common way of sharing datasets.
To give you an idea of what was involved in assembling the data and metadata file, a summary of the data collection and processing steps is given below:
- First, a snapshot of the New York Times (NYT) COVID-19 data from April 27th was downloaded from the [nytimes github repo][] and stored on local disk.
```{r}
url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
file <- "us-counties_04_27_2020.csv"
destination <- file.path("workdir", file)
download.file(url, destination)
```
- The NYT COVID-19 data was processed using _R_:
- The cases and deaths in `us-counties_04_27_2020.csv` were aggregated into statewide values.
- The death rate was calculated and categorized according to an 8-point severity index.
- Finally, the statewide data (augmented with death rate and severity index) was exported as a `.csv` file.
```{r}
library(readr)
library(dplyr)
## get data from file
covid_data_file <- file.path("./workdir", "us-counties_04_27_2020.csv")
## read the data as a tibble
us_data <- read_csv(covid_data_file)
## aggregate by county and state
county_state <-
us_data %>%
group_by(county, state) %>%
summarize(cases = max(cases), deaths = max(deaths))
## aggregate by state
state <-
county_state %>%
group_by(state) %>%
summarize(cases = sum(cases), deaths = sum(deaths))
## calculate death rate
state <-
state %>%
mutate(death_rate = 100.00 * deaths / cases)
# Assign the following severity index using cut:
## PSI Death.Rate
## 1 < 0.1%
## 2 0.1% - 0.5%
## 3 0.5% - 1.0%
## 4 1.0% - 2.0%
## 5 2.0% - 4.0%
## 6 4.0% - 6.0%
## 7 6.0% - 8.0%
## 8 >8.0%
state <-
state %>%
mutate(
severity_index = cut(
death_rate,
breaks = c(0.0, 0.1, 0.5, 1.0, 2.0, 4.0, 6.0, 8.0, 100.0),
labels = 1:8
)
)
## write out as csv
my_out_file <- file.path("./workdir", "covid_data.csv")
write_csv(state, my_out_file)
```
- The resulting statewide COVID-19 *label* data (i.e. what we would like to predict) was augmented with 32 statewide *features*, including population, percent urban, number of hospital beds, etc. Feature data was collected from a variety of sources, including the Centers for Disease Control and Prevention, the American Heart Association, the U.S. Census Bureau, etc. In some cases the feature data was available for direct download (e.g. as a `.csv` file) and in other cases the feature data was manually harvested (e.g. cut-and-paste from websites).
- The augmented (i.e. features + labels) `.csv` file was split into two `.csv` files that you will need to download:
- [statewide_covid_19_data_04_27_2020.csv][]: This file contains the final COVID-19 machine learning dataset, but features and labels are coded so that feature columns are named `X01`, `X02`, `X03`, etc. and label columns are named `Y01`, `Y02`, `Y03`, etc.
- [statewide_covid_19_metadata_04_27_2020.csv][]: This file maps the column names in the data file to more meaningful names and descriptions (including units) of the associated variables. For example `X01` is `Pct_Sun` and has a description of `Percent sunny days`. This is known as *metadata* - data that describes other data.
Click on the links above to download the data and metadata files that you'll need for the machine learning examples presented throughout the week.
[nytimes github repo]: https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv
[statewide_covid_19_data_04_27_2020.csv]: assets/statewide_covid_19_data_04_27_2020.csv
[statewide_covid_19_metadata_04_27_2020.csv]: assets/statewide_covid_19_metadata_04_27_2020.csv
### A Machine Learning Primer
Machine Learning (ML) may be defined as using computers to make inferences about data.
- A mapping of inputs to outputs, $Y = f(X, \beta)$
_ML for Classification_ refers to algorithms that map inputs to a discrete set of outputs (i.e. classes or categories).
- For example, predicting health risk (mild, moderate, severe) based on patient data (height [m], weight [kg], age [years], smoker [yes/no], etc.)
- Or predicting [pandemic severity index][] (PSI) of COVID-19 in a state based on statewide population data.
```{r echo=FALSE}
cats <- 1:5
frate <- c("< 0.1% ", "0.1% - 0.5% ", "0.5% - 1.0% ", "1.0% - 2.0% ", "> 2.0% ")
example <- c("seasonal flu", "Asian flu", "n/a", "n/a", "Spanish flu")
df <- data.frame(PSI = cats, Death.Rate = frate, Example = example)
print(df, row.names = FALSE, right = FALSE)
```
- Predictions are typically expressed as a vector of probabilities.
- e.g. Pr(cancerous) vs. Pr(benign)
- e.g. Pr(PSI=1), Pr(PSI=2), ..., Pr(PSI=5)
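As a concrete illustration, a five-class PSI classifier might return a probability vector like the following (hand-written numbers, not real model output):
```{r}
# a hypothetical probability vector for a single prediction
p <- c("PSI=1" = 0.05, "PSI=2" = 0.10, "PSI=3" = 0.55,
       "PSI=4" = 0.20, "PSI=5" = 0.10)
sum(p)              # probabilities sum to one
names(which.max(p)) # the predicted class is the most probable one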
#### A "Black Box" View of Machine Learning {-}
The diagram below illustrates the machine learning concept in terms of a "black box" model that converts inputs into predictions.
```{r echo = FALSE, fig.align="center", out.width="50%"}
knitr::include_graphics('images/ml_black_box_generic.png')
```
#### A "Black Box" Example {-}
The diagram below illustrates a more concrete machine learning model that converts height, weight, and age into a prediction of whether or not the individual is male or female.
```{r echo = FALSE, fig.align="center", out.width="50%"}
knitr::include_graphics('images/ml_black_box_example.png')
```
#### Some Important Machine Learning Terms {-}
The above example highlights some important machine learning terminology:
- *Features* (*X*): Features are the inputs to the model. They are also known as descriptive attributes or explanatory variables. In terms of _R_, a single set of features corresponds to a tuple or row of data and a complete set of feature data maps nicely to a data frame or tibble.
- *Parameters* (*$\beta$*): Parameters are internal variables of the selected machine learning algorithm. They are also known as coefficients or training weights. These internal parameters need to be adjusted to minimize the deviation between predictions and observations. The machine learning algorithms and packages that we'll be using in _R_ take care of this minimization process for us.
- *Labels* (*Y*): Labels are the outputs of the algorithm, corresponding to the categories you are attempting to predict.
- *Training Data*: Training data is a data set containing paired observations of inputs (i.e. features) and outputs (i.e. labels). Training data is also known as measurement data, observation data or calibration data.
- *Training*: Training is the process of adjusting internal algorithm parameters (*$\beta$*) to obtain the best possible match between training data and corresponding model predictions. Training is also known as model calibration or parameter estimation.
An example set of training data that could be used in the gender prediction example is given below. The first three columns contain feature data and the last column contains label data. We could use this data to train a model to predict a person's gender based on their height, weight, and age.
```{r echo=FALSE}
h <- c(1.69, 1.74, 1.92, 1.80, 1.59, 1.85, 1.75, 1.96, 1.85, 1.78, 1.74, 1.81, 1.73)
w <- c(62, 76, 82, 100, 47, 75, 63, 83, 39, 58, 70, 57, 78)
a <- c(30, 27, 25, 41, 24, 26, 33, 33, 32, 28, 30, 26, 32)
g <- c("male", "female", "male", "male", "female", "male", "female",
       "male", "male", "female", "female", "female", "male")
df <- data.frame(height_m = h, weight_kg = w, age_y = a, gender = g)
print(df, row.names = FALSE, right = FALSE)
```
- *Test Data*: Like training data, test data is a data set containing paired observations of inputs (i.e. features) and outputs (i.e. labels). However, test data is deliberately _not included_ in the training process. Performance measures computed using test data help to quantify the expected performance of the model if/when it is applied to unlabeled data.
Upon completion of training, it is important to evaluate the quality of the model and its skill or ability at making correct predictions. Some terms related to this evaluation process are defined below:
- *Classification Accuracy*: Classification accuracy is the ratio of correct predictions to the total predictions. Classification accuracy can be computed using the training data set or using the test data set.
- *Confusion Matrix*: The confusion matrix is a more detailed summary (relative to classification accuracy) of the performance of a classification algorithm. The diagonals of the confusion matrix count how often the algorithm yields the correct classification for each class. The off-diagonal entries count how often the algorithm confuses one class with another.
The figure below illustrates the classification accuracy and confusion matrix for an example that attempts to classify images of fruits.
```{r echo = FALSE, fig.align="center", out.width="50%"}
knitr::include_graphics('images/ml_ca_fruits_example.png')
```
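As a minimal sketch of how these two measures can be computed in _R_ (using small, hand-written `actual` and `predicted` vectors in place of real model output):
```{r}
# hypothetical true and predicted labels for ten test observations
actual    <- factor(c("apple", "apple", "pear", "pear", "plum",
                      "plum", "apple", "pear", "plum", "apple"))
predicted <- factor(c("apple", "pear", "pear", "pear", "plum",
                      "apple", "apple", "pear", "plum", "apple"))
# confusion matrix: rows are actual classes, columns are predicted classes
confusion <- table(actual, predicted)
confusion
# classification accuracy: correct predictions / total predictions
sum(diag(confusion)) / length(actual)
```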
#### The Machine Learning Process {-}
Now that we've defined some of the important machine learning terminology, let's take a 30,000-foot view of the overall machine learning process. This is a general description of the process that you can follow each time you build and use a machine learning model. The process is illustrated in the figure below (with credit to [Jinesh Maloo][]):
```{r echo = FALSE, fig.align="center", out.width="75%"}
knitr::include_graphics('images/ml_process_maloo.png')
```
- Step 1: Prepare labeled data for training and validation.
- Step 2: Select a machine learning algorithm (i.e. model).
- Step 3: Train the model.
- Step 4: Evaluate model performance.
- If model is useful:
- Step 5: Apply to unlabeled data.
- Else (needs improvement):
- Collect more data (go back to Step 1).
- Revise model (go back to Step 2).
[pandemic severity index]: https://en.wikipedia.org/wiki/Pandemic_severity_index
[Jinesh Maloo]: https://blog.usejournal.com/machine-learning-for-beginners-from-zero-level-8be5b89bf77c
### The Random Forest Algorithm
The random forest algorithm is a popular choice for machine learning.
- Over 20 _R_ packages implement some form of the algorithm.
- We'll be using the `randomForest` package.
- The algorithm is like `bagging` (bootstrap aggregating) regression trees, but the regression trees are de-correlated (a minimal sketch of the algorithm in action follows below).
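Before we apply the algorithm to COVID-19 data, here is a minimal sketch of `randomForest()` on the built-in `iris` dataset (purely illustrative; `iris` is not part of the course dataset):
```{r}
library(randomForest)  # install.packages("randomForest") first if needed
set.seed(42)                                  # make the random trees reproducible
fit <- randomForest(Species ~ ., data = iris) # classification (Species is a factor)
fit                                           # prints OOB error and confusion matrix
```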
The figure below illustrates one possible `tree` in a `random forest` for a gender prediction model. In computer science terminology, each split in the figure is a `branch` of a graph `tree`. In simple terms, the split points are randomly generated and the resulting `trees` combine to form a `random forest`.
```{r echo = FALSE, fig.align="center", out.width="50%"}
knitr::include_graphics('images/ml_rf_tree_gender.png')
```
The figure below illustrates a set of 6 `trees` that make up a `random forest` for predicting housing prices. The image is courtesy [Bradley Boehmke at the University of Cincinnati][]. Examine the figure closely and notice that:
- The split variables can differ across trees (not all variables are included in all trees).
- The split variables can differ within trees (not all paths consider the same set of variables).
- The order of splits can differ.
- The split values can differ.
```{r echo = FALSE, fig.align="center", out.width="100%"}
knitr::include_graphics('images/ml_rf_tree_housing.png')
```
The job of the random forest algorithm is to determine the optimal set of trees for your data set, including the splitting configuration of each tree (i.e. order, values, etc.).
Now you have a basic understanding of machine learning and the random forest algorithm. You're probably excited to get going with applying the algorithm! But first, we need to have a data set to work with. In the next section you'll learn about a data set that can be used for predicting the severity of COVID-19 in a state.
[Bradley Boehmke at the University of Cincinnati]: https://uc-r.github.io/random_forests
### A Random Forest Example Using COVID-19 Data
Let's apply the random forest algorithm to the COVID-19 dataset. We'll build out the required _R_ code in sections. To get started, open a new _R_ script in _RStudio_ and name it `covid_19_rf.R`. Enter the code below, but omit lines that begin with a double-hash (`##`) because these are the expected output:
```{r}
#
# Part 1 - load the data
#
library(randomForest)
library(readr)
library(dplyr)
data_file <- file.path("assets", "statewide_covid_19_data_04_27_2020.csv")
df <- read_csv(data_file)
# coerce severity to a factor (so RF algorithm uses classification)
df <-
df %>%
mutate(Y04 = as.factor(Y04))
metadata_file <- file.path("assets", "statewide_covid_19_metadata_04_27_2020.csv")
mdf <- read_csv(metadata_file)
mdf
```
Try to run the code. You may get an error about missing the `randomForest` package. You can install it from the RStudio console (see below) or using the installer in the RStudio `packages` pane.
```{r eval=FALSE}
install.packages("randomForest")
```
Now we've loaded the data and metadata file. Let's pick a subset of 5 of the features and use them to try and predict the pandemic severity index (i.e. `Y04`). Add the `Part 2` code below to your RScript but omit lines that begin with a double-hash (`##`) :
```{r}
#
# Part 2 - select features and label
#
# describe all possible features and labels
print(mdf, n = nrow(mdf))
# select some features and the severity index label
my_x <- c("X01","X10","X12","X13","X23")
my_y <- "Y04"
my_xy <- c(my_x, my_y)
# get descriptions of the selected features and label
mdf %>% filter(Code %in% my_xy)
# subset the dataframe
rf_df <- df %>% select(all_of(my_xy))
```
Now we’ll add code to create and train a basic Random Forest model. Add the `Part 3` code below to your RScript but omit lines that begin with a double-hash (`##`):
```{r}
#
# Part 3 - create and train the model
#
# split into train (75%) and test (25%) datasets
#
# note that `row_number()` generates a vector of row numbers, and
# > c(1, 2, 3, 4, 5, 6) %% 4 is
# [1] 1 2 3 0 1 2
train <- rf_df %>% filter((row_number() %% 4) %in% 1:3)
test <- rf_df %>% filter(row_number() %% 4 == 0)
# create and train the RF model
model <- randomForest(Y04 ~ X01 + X10 + X12 + X13 + X23,
data = train)
# show results, includes confusion matrix for training data
model
# measure of parameter importance
importance(model)
```
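A side note on the train/test split above: the modulo-based split is deterministic, which makes runs reproducible but ties the test set to row order. A common alternative (a sketch, not part of the course script, assuming `rf_df` from `Part 2`) is a random split:
```{r eval=FALSE}
set.seed(123)                                   # make the random split reproducible
idx   <- sample(nrow(rf_df), 0.75 * nrow(rf_df))
train <- rf_df[idx, ]                           # 75% for training
test  <- rf_df[-idx, ]                          # 25% held out for testing
```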
At this point the model is trained and the next step is to evaluate its usefulness at making predictions. Let's see how the model does at predicting the labels of the test dataset. Remember that the test data was _not_ used during the training exercise. As such, the confusion matrix and classification accuracy associated with the test dataset provide a useful check of the skill of the model. Add the `Part 4` code below to your RScript but omit lines that begin with a double-hash (`##`):
```{r}
#
# Part 4 - evaluate the model using test data
#
# extract test predictions
preds <- predict(model, test)
# confusion matrix
actual <- factor(test$Y04)
predicted <- factor(preds)
common_levels <- sort(unique(c(levels(actual), levels(predicted))))
actual <- factor(actual, levels = common_levels)
predicted <- factor(predicted, levels = common_levels)
confusion <- table(actual, predicted)
print(confusion)
# classification accuracy
accuracy <- sum(diag(confusion)) / nrow(test)
cat("Classification Accuracy = ", accuracy, "\n")
```
Save your script and run it. Take a look at the results for the test dataset - the classification accuracy is well under 50%. Furthermore, there are systematic failures in the confusion matrix. It's not a very good model. The most likely culprit is that the set of features is inadequate for making the desired prediction. We should re-run the model using different or additional features.
## Day 23 - Support Vector Machines
For today's independent work you will learn about the Support Vector Machine (SVM) algorithm and apply it to the COVID-19 data that you worked with on Monday.
The SVM algorithm seeks to determine an optimal hyperplane that separates labeled observations.
- Hyperplanes can be linear or non-linear.
- Support vectors are data points lying closest to the optimal hyperplane.
- The `e1071` package in R provides an implementation of SVM.
The figure below illustrates a linear SVM. Line $H_3$ provides the optimal separation between the white and black data points. Points with perpendiculars to $H_3$ are the support vectors for the dataset. The SVM algorithm classifies unlabeled data points by examining their location with respect to the optimal hyperplane. Data points _above_ line $H_3$ would be classified as "black" and data points _below_ the line would be classified as "white".
```{r echo = FALSE, fig.align="center", out.width="50%"}
knitr::include_graphics('images/ml_svm_example.png')
```
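As with the random forest, a minimal sketch of `svm()` on the built-in `iris` dataset looks like this (purely illustrative):
```{r}
library(e1071)
fit <- svm(Species ~ ., data = iris) # classification because Species is a factor
summary(fit)                         # kernel, cost, and support vector counts
```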
### An SVM Example using COVID-19
We've already laid a large part of the groundwork for machine learning with the random forest example. With a few modifications, the `covid_19_rf.R` script can be adapted to use an SVM algorithm instead of Random Forest.
- Open your `covid_19_rf.R` script in _RStudio_
- Click `File --> Save As ...` and name the file `covid_19_svm.R`
- In `Part 1` of `covid_19_svm.R`, replace `library(randomForest)` with `library(e1071)`. This will load the `svm` algorithm instead of the `randomForest` algorithm.
- In `Part 3` of the code, replace `randomForest()` with `svm()` and add the following argument to the `svm()` function: `probability = TRUE`. The `svm()` code should look something like this:
```{r eval=FALSE}
model <- svm(Y04 ~ X01 + X10 + X12 + X13 + X23,
data = train,
probability = TRUE)
```
- The `e1071` package does not provide useful implementations of `print(model)` or `importance(model)` for `svm` models. Comment out, or delete, those lines of `Part 3` and replace with `summary(model)`. When you are finished, the final section of the `Part 3` code should look something like this:
```{r eval = FALSE}
# show results, includes confusion matrix for training data
# print(model)
#
# measure of parameter importance
# importance(model)
#
# summarize the trained SVM
summary(model)
```
That's it! The rest of the code (i.e. `Part 4`) can be re-used. Save your `covid_19_svm.R` script and run it. How does the SVM algorithm perform in comparison with the Random Forest algorithm?
## Day 24 - the $k$-Nearest Neighbors Algorithm
For today's independent work you will learn about the $k$-Nearest Neighbors (KNN) algorithm and apply it to the COVID-19 data.
For a given unlabeled data point ($X^*$) the KNN algorithm identifies the nearest $k$ labeled data points.
- Euclidean distance is typical
- _Normalization of data is necessary to prevent biased distances._
- The label of $X^*$ is predicted to be the most frequently occurring label among the $k$ nearest neighbors.
- The `class` package in R provides an implementation of KNN.
The figure below illustrates the KNN approach and is courtesy of [Antti Ajanki][]. For $k=3$, the neighborhood contains 2 triangles and 1 square so we’d predict $X^*$ is a triangle. For $k=5$, the neighborhood contains 3 squares and 2 triangles so we’d predict $X^*$ is a square.
```{r echo = FALSE, fig.align="center", out.width="50%"}
knitr::include_graphics('images/ml_knn_example.png')
```
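A minimal sketch of `knn()` on the built-in `iris` dataset (purely illustrative; the `iris` features all share the same units, so normalization matters less there than for our mixed-unit COVID-19 features):
```{r}
library(class)
set.seed(1)
idx  <- sample(nrow(iris), 100)        # 100 training rows, 50 test rows
pred <- knn(train = iris[idx, 1:4],    # feature columns only
            test  = iris[-idx, 1:4],
            cl    = iris$Species[idx], # training labels
            k     = 3)
mean(pred == iris$Species[-idx])       # classification accuracy on held-out rows
```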
With a few modifications, the `covid_19_svm.R` script can be adapted to use a KNN algorithm instead of SVM.
- Open your `covid_19_svm.R` script in _RStudio_
- Click `File --> Save As ...` and name the file `covid_19_knn.R`
- In `Part 1` of `covid_19_knn.R`, replace `library(e1071)` with `library(class)`. This will load the `knn` algorithm instead of the `svm` algorithm.
- The KNN algorithm requires the feature data to be normalized. In `Part 1` of `covid_19_knn.R`, add the following line immediately after the `df <- df %>% mutate(Y04 = as.factor(Y04))` line, as shown below. If necessary, use `install.packages('BBmisc')` to install the `BBmisc` package, which provides the `normalize()` function.
```{r eval=FALSE}
df <- df %>% mutate(Y04 = as.factor(Y04))
df <- BBmisc::normalize(df, method = "range") # add this line
```
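As a side note, if you'd rather not add the `BBmisc` dependency, the same range scaling can be sketched in base _R_ (an illustrative alternative, not the course solution):
```{r eval=FALSE}
# rescale every numeric column to [0, 1], like normalize(method = "range")
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], rescale01)
```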
- In `Part 3` of the code, replace `svm()` with `knn()` and adjust the call to `knn()` so that it looks like this:
```{r eval=FALSE}
model <- knn(select(train, all_of(my_x)),
select(test, all_of(my_x)),
train$Y04,
k = 7,
prob = TRUE)
```
- The `class` package does not provide useful implementations of `print(model)`, `importance(model)`, or `summary(model)` for `knn` models. Comment out, or delete, those lines of `Part 3`. When you are finished, the final section of the `Part 3` code should look something like this:
```{r eval = FALSE}
# show results, includes confusion matrix for training data
# print(model)
#
# measure of parameter importance
# importance(model)
#
# summarize the trained SVM
# summary(model)
```
- In `Part 4` of the script, replace:
```{r eval=FALSE}
preds <- predict(model, test)
```
with:
```{r eval=FALSE}
preds <- as.data.frame(model)[,1]
```
That's it! The rest of the code can be re-used. Save your `covid_19_knn.R` script and run it. How does the KNN algorithm perform in comparison with the Random Forest and SVM algorithms?
[Antti Ajanki]: https://commons.wikimedia.org/wiki/File:KnnClassification.svg
## Day 25 - Exploring the KNN Algorithm
For today's independent work you will explore the algorithm settings of the KNN algorithm.
Open your `covid_19_knn.R` script and locate the line that creates the KNN model (`model <- knn(...)`). Adjust the number of neighbors so that `k = 5` and re-run the script. Record the classification accuracy. Try again, but use `k = 3`.
Re-run each experiment several times (e.g. collect 10 trials of `k=5` and 10 trials of `k=3`, and so on). For a given value of `k`, do you get the same result each time?
Continue with these numerical experiments until you've filled out the table below:
```{r eval = FALSE}
## k_value average_classification_accuracy
## 1 ???%
## 3 ???%
## 5 ???%
## 7 ???%
## 9 ???%
## 11 ???%
## 13 ???%
## 15 ???%
## 17 ???%
```
What do the results suggest about the most appropriate setting for `k`? Can you think of a better way to perform the numerical experiments and collect the results?
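One possibility is to automate the sweep with a loop. Here is a sketch (assuming `train`, `test`, and `my_x` from your `covid_19_knn.R` script are already in scope; `knn()` breaks ties at random, so the replicates will differ):
```{r eval=FALSE}
ks <- seq(1, 17, by = 2)
avg_acc <- vapply(ks, function(k) {
  # average classification accuracy over 10 trials for this k
  mean(replicate(10, {
    preds <- knn(select(train, all_of(my_x)),
                 select(test, all_of(my_x)),
                 train$Y04, k = k)
    mean(preds == test$Y04)
  }))
}, numeric(1))
data.frame(k_value = ks, average_classification_accuracy = avg_acc)
```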
## Day 26 (Friday) Zoom check-in
Today we'll check how you're doing with using machine learning in _R_. Then we'll get you prepared for weekend activities, where you'll continue to explore modeling using the COVID-19 dataset.
### Review and troubleshoot (25 minutes)
- Has everyone had a chance to try out at least one of the machine learning algorithms?
- Has anyone tried additional or alternative combinations of features?
- What is the best classification accuracy that you have been able to obtain?
- What parameters appear to be the most important?
### This weekend (25 minutes)
#### ROC/AUC - Another Measure of Machine Learning Performance {-}
We've already seen how classification accuracy and the confusion matrix give an indication of the performance of a trained machine learning algorithm. It's also good practice to examine the "ROC curve" and related "AUC" metrics.
- *ROC curve*: As illustrated in the figure below, the ROC (Receiver Operating Characteristic) curve plots the true positive (TP) vs. false positive (FP) rate at various probability thresholds. In the figure, the dashed blue line represents a hypothetical ROC curve for some machine learning model and the solid red line is the curve for a "non-informative" model (i.e. a model that makes a uniform random guess). As such, we'd like the blue curve to be as far above the red curve as possible.
```{r echo = FALSE, fig.align="center", , out.width="75%"}
knitr::include_graphics('images/ml_roc_curve_example.png')
```
- *AUC*: AUC stands for "area under curve" and is the area under the ROC curve. In the previous figure, the AUC would be the area under the dashed blue curve. Values of AUC quantify the degree to which an ROC curve lies above (or below) the "non-informative" curve. Some interesting AUC values:
- AUC = 0.0: the model is always wrong (with respect to TP vs. FP)
- AUC = 0.5: the model is no better than guessing (i.e. the model matches the red "non-informative" curve in the figure)
- AUC = 1.0: the model is always right (with respect to TP vs. FP)
For a problem with multiple classes (as opposed to a binary True/False, Male/Female, or Yes/No problem) we can compute the ROC curve and AUC measures using a "one vs. all" approach:
- First, extract predicted probabilities from the trained model (the scores).
- Next, extract actual classification for each category.
- Finally, leverage three functions from the `ROCR` package:
- `prediction()`: pairs the scores with the actual labels
- `performance()`: generates TPR, FPR, and AUC measures through two separate calls
- `plot()`: generates a TPR vs. FPR plot
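To see these three functions in their simplest binary form first, here is a minimal sketch with made-up scores and labels (not COVID-19 data):
```{r}
library(ROCR)
# hypothetical scores and true binary labels for ten observations
score  <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1)
actual <- c(1, 1, 0, 1, 1, 0, 0, 1, 0, 0)
pred <- prediction(score, actual)
perf <- performance(pred, "tpr", "fpr")  # true positive vs. false positive rates
plot(perf, main = "ROC Curve (toy example)")
as.numeric(performance(pred, measure = "auc")@y.values)  # area under the curve
```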
The ROC/AUC calculation is fairly involved so we'll create a helper function for it and then incorporate the helper function into our machine learning scripts.
```{r}
#
# Part 5 - ROC/AUC
#
# roc_one_vs_all()
# A helper function to compute one vs. all ROC/AUC for a given level (i)
roc_one_vs_all <- function(i) {
if(is.na(sum(probs))){
return(NA)
}
actual <- as.numeric(y_test[[1]] == i)
score <- probs[,i]
pred <- ROCR::prediction(score, actual)
perf <- ROCR::performance(pred, "tpr", "fpr")
ROCR::plot(perf,
main="ROC Curve",
col=cols[as.numeric(i)],
add = i != lvls[[1]])
# calculate the AUC and print
auc_i <- ROCR::performance(pred, measure="auc")
as.numeric(auc_i@y.values)
}
# prepare data for computing ROC/AUC
x_test <- select(test, all_of(my_x))
y_test <- select(test, all_of(my_y))
# how we obtain probs depends on the algorithm
if (inherits(model, "randomForest"))
{
probs <- predict(model, x_test, type='prob')
} else if (inherits(model, "svm"))
{
probs <- predict(model, x_test, probability = TRUE)
probs <- attributes(probs)
probs <- as.data.frame(probs$probabilities)
probs <- probs[,order(colnames(probs))]
} else # the KNN alg requires approximation of probabilities
{
# highest predicted probabilities
pnrst <- attributes(model)$prob
# corresponding one-based categories
y_one <- as.numeric(min(levels(df$Y04))) - 1
vpreds <- as.integer(as.character(preds)) - y_one
# map probabilities into a matrix of zeroes
probs <- matrix(0.00,
nrow = length(pnrst),
ncol = length(levels(df$Y04)))
probs[cbind(seq_along(vpreds), vpreds)] <- pnrst
# coerce to data frame
probs <- data.frame(probs)
colnames(probs) <- levels(df$Y04)
}
lvls <- unique(as.character(y_test[[1]]))
cols <- c("red","orange","yellow","green",
"blue","purple","violet","black")
# vapply helper function across levels
auc <- vapply(lvls, roc_one_vs_all, numeric(1))
# tack on legend and non-informative line
legend("bottomright",
legend=paste("PSI =",lvls),
col=cols[as.numeric(lvls)],
lty=1,
cex=1.0)
lines(x=c(0,1),
y=c(0,1),
lty=2)
# display AUC metrics
print(tibble(lvls, auc))
```
Assuming you've been following along with the daily activities, you can add this `Part 5` code to any of your algorithm scripts (e.g. `covid_19_rf.R`, `covid_19_knn.R`, `covid_19_svm.R`, etc.). The code uses the `inherits()` function to adapt the output of each individual algorithm into a form that is suitable for the ROC/AUC calculation.
## Day 27
Today you'll explore different combinations of features in your COVID-19 model.
- Select another 5 features and adjust `Part 2` and `Part 3` of your `covid_19_rf.R` script. In `Part 2` you'll need to adjust the `my_x` variable and in `Part 3` you'll need to adjust the model formula (e.g. `Y04 ~ X01 + ....`). Re-run the script and record the error rate, confusion matrix, and measures of parameter importance.
- Repeat the above step after adding an additional 5 features.
- How does the performance of the model change as more features are included?
- Does any particular parameter stand out in terms of importance?
- What does the most important parameter correspond to in terms of the metadata?
## Day 28
Today you'll explore making some tweaks to the random forest model to see if you can improve its performance on the COVID-19 dataset. The algorithm parameters that you'll be adjusting are described below:
- `ntree` : The number of trees to grow. Default is 500.
- `mtry` : The number of variables randomly sampled as candidates at each split. Default is `sqrt(p)` where `p` is the number of features included in the model.
Let's perform some numerical experiments to explore how these algorithm parameters affect model performance:
- Set up a random forest model that has at least 16 features (see instructions from yesterday's activity).
- In `Part 3` of your `covid_19_rf.R` script, add the following arguments to the `randomForest()` function:
- `ntree = 1000`
- `mtry = 8`
- Re-run the script and record the error rate, confusion matrix, and measures of parameter importance.
- Repeat the above process using:
- `ntree = 2000`
- `mtry = 2`
- Does adjusting the parameters affect the model performance? If so, what observations can you make?
- Can you think of a "better" way to evaluate the influence of algorithm parameters? One such approach is sketched below.
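One such approach is a small grid search over the two parameters. Here is a sketch using the built-in `iris` data (substitute your own data and formula; the out-of-bag error is used, so no separate test set is needed):
```{r}
library(randomForest)
set.seed(7)
grid <- expand.grid(ntree = c(500, 1000, 2000), mtry = c(1, 2, 4))
grid$oob_error <- apply(grid, 1, function(g) {
  fit <- randomForest(Species ~ ., data = iris,
                      ntree = g[["ntree"]], mtry = g[["mtry"]])
  fit$err.rate[g[["ntree"]], "OOB"]  # OOB error rate after the last tree
})
grid
```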
Congratulations - you made it through a week of machine learning boot camp! You can download completed scripts (i.e. `Part 1` through `Part 5`) for each algorithm using the links below:
- [Complete Random Forest Example](assets/covid_19_rf.R)
- [Complete Support Vector Machine Example](assets/covid_19_svm.R)
- [Complete $k$-Nearest Neighbors Example](assets/covid_19_knn.R)