---
title: "Titanic in R!"
author: "Doug Barrows"
date: "1/16/2019"
output:
  #powerpoint_presentation
  #slidy_presentation
  revealjs::revealjs_presentation
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Caret (Classification And REgression Training)
  * wrapper for [many different ML algorithms](https://rdrr.io/cran/caret/man/models.html)
  * allows you to easily try different parameters specific to each algorithm
  * integrates resampling methods to select best set of parameters
  * provides many defaults so you can use out of box relatively easily, but almost anything can be manually changed
  * also allows for imputation of missing values
    * keeps passengers with missing values in the training data (more rows, so hopefully better accuracy) and lets us predict on the real test set, where rows with NAs can't be dropped
  
```{r, message=FALSE}
# install/load caret package 
#install.packages("caret")
library(caret)
```
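
## Caret - which parameters can be tuned?

To make "parameters specific to each algorithm" concrete, caret's `modelLookup()` lists the tuning parameters behind a method code. This is only a quick sketch; the three codes below are the ones passed to `train()` later in this deck.

```{r}
library(caret)

# each row of the result is one tuning parameter that train() can search over
modelLookup("knn")   # k: number of neighbors
modelLookup("rf")    # mtry: number of features tried at each split
modelLookup("nnet")  # size (hidden units) and decay (weight decay)
```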

## Resampling - bootstrap vs cross validation
##### caret makes it easy to estimate model performance by resampling, without having to explicitly set aside a test set

![](./bootrap_concept.png)

## Resampling - bootstrap vs cross validation
![](./cross_validation.jpeg)
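
## Resampling - bootstrap vs cross validation

The two pictures can be mimicked with row indices in base R. This is a toy sketch with 20 made-up passengers, not what caret runs internally; all variable names are hypothetical.

```{r}
n    <- 20                       # pretend we have 20 passengers
rows <- seq_len(n)

# bootstrap: draw n rows WITH replacement; rows never drawn are "out-of-bag"
boot_train <- sample(rows, size = n, replace = TRUE)
boot_test  <- setdiff(rows, boot_train)

# 10-fold cross validation: every row gets exactly one fold label
fold_id  <- sample(rep(1:10, length.out = n))
cv_test  <- rows[fold_id == 1]   # fold 1 is held out ...
cv_train <- rows[fold_id != 1]   # ... and the other 9 folds are used for training

length(unique(boot_train))       # usually fewer than n, because some rows repeat
table(fold_id)                   # 2 rows in each of the 10 folds
```

With the bootstrap the size of the held-out set varies from draw to draw; with cross validation every row is held out exactly once per pass.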

## Load the training data
```{r}
train <- read.table("./train.csv", sep = ",", header = TRUE)

str(train)

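# female = 0, male = 1; factor() is needed in R >= 4.0, where strings are no longer read in as factors
train$Sex <- as.integer(factor(train$Sex, levels = c("female", "male"))) - 1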

```

## How to transform data? What about age?
```{r, fig1, fig.height = 3, fig.width = 5, fig.align = "center", warning=FALSE}
ggplot(train, aes(Age)) + 
  geom_density(alpha=0.5, aes(fill=factor(Survived))) + 
  geom_vline(xintercept = 10) +
  theme(legend.position = "bottom")
train$is_child <- as.integer(train$Age < 10) # make a column specifying child or not

```

## How to transform data? What about age?
```{r, fig2, fig.height = 4, fig.width = 6, fig.align = "center", warning=FALSE}
# look specifically whether a child is more likely to survive
ggplot(train, aes(x = is_child, fill=factor(Survived))) + 
  geom_bar(position = "dodge")
```


## Transform Fare?
```{r, fig3, fig.height = 3, fig.width = 7, fig.align = "center"}
par(mfrow=c(1,2))
hist(train$Fare)
hist(log(train$Fare + 1)) # same +1 shift as logfare below; log(0) would be -Inf
train$logfare <- log(train$Fare + 1)
```
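
## Transform Fare?

Why add 1 before taking the log? Presumably to cope with fares of 0. A small sketch with made-up fares (`fare_demo` is hypothetical, not taken from the data):

```{r}
fare_demo <- c(0, 7.25, 71.28, 512.33)  # made-up fares, including a free ticket

log(fare_demo)        # the first value is -Inf, which a model cannot use
log(fare_demo + 1)    # finite everywhere; a fare of 0 maps to 0
log1p(fare_demo)      # the same transformation via base R's log1p()
```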

## Simplify family size indicators 
```{r, fig4, fig.height = 3, fig.width = 6, fig.align = "center", warning=FALSE}
# SibSp: number of siblings and spouses aboard
# Parch: number of parents and children aboard
train$fam_size <- train$SibSp + train$Parch

# density plot for family size
ggplot(train, aes(fam_size)) + 
  geom_density(alpha=0.5, aes(fill=factor(Survived))) +
  theme(legend.position = "bottom")

```

## Simplify family size indicators
```{r, fig5, fig.height = 3, fig.width = 5, fig.align = "center", warning=FALSE}
# look at actual survival numbers for different family sizes
train$fam_size_groups <- ordered(ifelse(train$fam_size == 0, "single", 
                                        ifelse(train$fam_size > 3, "more_than_3", "medium")), 
                                 levels = c("single", "medium", "more_than_3"))
ggplot(train, aes(x = fam_size_groups, fill=factor(Survived))) + geom_bar(position = "dodge") 
# make columns for both being single and having a big family
train$single <- as.integer(train$fam_size == 0) 
train$fam_too_big <- as.integer(train$fam_size > 3) 
```
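
## Simplify family size indicators

The nested `ifelse()` can also be written with base R's `cut()`, which may be easier to read once there are more than two thresholds. A sketch on made-up family sizes (`fam_demo` is hypothetical), checking that both versions give the same groups:

```{r}
fam_demo <- c(0, 1, 3, 4, 10)  # made-up family sizes

nested <- ordered(ifelse(fam_demo == 0, "single",
                         ifelse(fam_demo > 3, "more_than_3", "medium")),
                  levels = c("single", "medium", "more_than_3"))

# the intervals are (-Inf, 0], (0, 3] and (3, Inf]
binned <- cut(fam_demo,
              breaks = c(-Inf, 0, 3, Inf),
              labels = c("single", "medium", "more_than_3"),
              ordered_result = TRUE)

identical(as.character(nested), as.character(binned))
```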

## How do variables correlate with survival?
```{r}
sub <- c("Survived","Pclass","Sex","Age", "is_child","SibSp", "Parch", "fam_size","single","fam_too_big", "Fare","logfare")
train_sub <- train[colnames(train) %in% sub]
# na.omit() first drops every passenger with a missing value in any of these columns
data.frame(SurvivedCorr = cor(na.omit(train_sub))[, "Survived"])
```
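
## How do variables correlate with survival?

`na.omit()` removes a passenger from every correlation as soon as one column is missing. An alternative is pairwise deletion via the `use` argument of `cor()`. A sketch on made-up data (`cor_demo` is hypothetical) showing that the two can disagree:

```{r}
cor_demo <- data.frame(
  Survived = c(1, 0, 1, 0, 1, 0),
  Sex      = c(0, 1, 0, 1, 1, 0),
  Age      = c(29, NA, 4, 40, NA, 35)
)

# listwise deletion: only the 4 complete rows are used for every pair
listwise <- cor(na.omit(cor_demo))[, "Survived"]

# pairwise deletion: Survived vs Sex uses all 6 rows, Survived vs Age uses 4
pairwise <- cor(cor_demo, use = "pairwise.complete.obs")[, "Survived"]

rbind(listwise, pairwise)
```

Neither choice is wrong; the point is only that the numbers describe different subsets of passengers.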


## Subset features 
```{r}
features3_names <- c("Pclass", "Sex", "logfare", "is_child", "single", "fam_too_big")
features3 <- train[, colnames(train) %in% features3_names]
str(features3)

```

## Back to caret - set up the training parameters 
```{r trainControl}

fitControl <- trainControl( 
    method="repeatedcv", # repeated k-fold cross-validation (the default is bootstrapping)
    number=10, # split the data into 10 folds; train on 9 folds (90%) and test on the held-out fold (10%), rotating through all 10
    repeats=5, # repeat the whole 10-fold CV 5 times with different splits
    verboseIter=TRUE,
    savePredictions = TRUE)

```
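
## Back to caret - what does repeatedcv resample?

To see what `number = 10` and `repeats = 5` add up to, caret's `createMultiFolds()` can generate an equivalent set of training indices. This is a sketch on a made-up outcome (`y_demo` is hypothetical); it illustrates the resampling scheme and is not a claim about the exact folds `train()` will draw.

```{r}
library(caret)

y_demo <- factor(rep(c(0, 1), times = 50))   # made-up outcome for 100 passengers

# training-row indices for 10-fold CV repeated 5 times
folds_demo <- createMultiFolds(y_demo, k = 10, times = 5)

length(folds_demo)            # 10 folds x 5 repeats = 50 resamples
head(names(folds_demo), 3)
lengths(folds_demo)[1:3]      # each training set holds about 90% of the rows
```

Every candidate parameter value is fitted once per resample, so a `tuneLength` of 12 for a one-parameter model means 12 x 50 = 600 model fits before the final one.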

## k-nearest neighbor with cross validation
#### fill in NAs with knnImpute
```{r knn_CV2,cache=TRUE, results='hide',dependson='feature_transform'}
knn_survived_caret_CV3 <- train(features3, # features
                            as.factor(train$Survived), # outcome
                            method = "knn",
                            trControl=fitControl, # use the parameters set previously
                            tuneLength = 12, # caret will try 12 candidate values of the tuning parameter (k, the number of neighbors)
                            preProcess = "knnImpute") # fill in missing values from the nearest neighbors (also centers and scales the features)
```
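
## What does knnImpute do?

The idea behind k-nearest-neighbor imputation, hand-rolled in base R on made-up values. This is an illustration only and not caret's implementation: caret works on all centered and scaled predictors, uses 5 neighbors by default, and leaves the result on the standardized scale. `impute_demo` and `k_demo` are hypothetical.

```{r}
impute_demo <- data.frame(Age     = c(22, 38, NA, 35, 54),
                          logfare = c(2.1, 4.3, 2.2, 4.0, 3.9))  # made-up values

k_demo      <- 2                          # neighbors to average (hypothetical choice)
missing_row <- which(is.na(impute_demo$Age))
donors      <- which(!is.na(impute_demo$Age))

# distance is measured on the feature that IS observed, after standardizing it
z <- as.numeric(scale(impute_demo$logfare))
dist_to_donors <- abs(z[donors] - z[missing_row])

# average Age over the k_demo passengers with the most similar logfare
nearest <- donors[order(dist_to_donors)][seq_len(k_demo)]
impute_demo$Age[missing_row] <- mean(impute_demo$Age[nearest])
impute_demo
```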


## Look at KNN model
```{r, dependson='knn_CV2'}
knn_survived_caret_CV3
```
## Can easily visualize which parameter value is best by passing the model to ggplot
```{r, fig6, fig.height = 4, fig.width = 7, fig.align = "center"}
ggplot(knn_survived_caret_CV3)
```

## Logistic Regression 
#### Logistic regression - a "generalized linear model" that uses the logit (log odds) link so a linear model can solve a classification problem

![](./logit.png)

  * for every one unit increase in X the log odds of survival change by beta, holding all other features constant
    * e.g. beta = 0.67: to get an odds ratio from log odds, e^0.67 ≈ 1.95, i.e. roughly 2
    * this means that with beta = 0.67 the odds (not the probability) of survival are about twice as high for X = 1 as for X = 0
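
## Logistic Regression

The beta = 0.67 example worked through in R. The baseline survival probability of 0.20 is a hypothetical number chosen only to show that doubling the odds is not the same as doubling the probability.

```{r}
beta <- 0.67                 # example coefficient from the previous slide

odds_ratio <- exp(beta)      # multiplicative change in the odds per unit of X
odds_ratio

# odds are not probabilities: convert back with p = odds / (1 + odds)
p_x0    <- 0.20              # hypothetical survival probability when X = 0
odds_x0 <- p_x0 / (1 - p_x0)
odds_x1 <- odds_x0 * odds_ratio
p_x1    <- odds_x1 / (1 + odds_x1)
c(p_x0 = p_x0, p_x1 = p_x1)
```

Base R's `plogis(qlogis(p_x0) + beta)` gives the same `p_x1` in one step.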

## Logistic Regression
#### fill NAs with knnImpute
```{r LR_CV2_knn, cache=TRUE, results='hide', dependson='feature_transform'}
LR_survived_caret_CV3 <- train(features3, 
                            as.factor(train$Survived), 
                            method = "glm", 
                            family = "binomial",
                            trControl=fitControl,
                            preProcess = "knnImpute")
```

## Look at the LR model
```{r, dependson='LR_CV2_knn'}
LR_survived_caret_CV3$results
```

## Look at the LR model
```{r, dependson='LR_CV2_knn'}
summary(LR_survived_caret_CV3)
```

```{r,echo=FALSE, results='hide', cache=TRUE}
features_single_test <- features3[, !(colnames(features3) %in% c("Sex", "Pclass", "logfare"))]
LR_survived_caret_single_test <- train(features_single_test, 
                            as.factor(train$Survived), 
                            method = "glm", 
                            family = "binomial",
                            trControl=fitControl,
                            preProcess = "knnImpute")
```

## The `single` feature without sex or indicators of wealth
```{r}
summary(LR_survived_caret_single_test)
```

## Random Forests!

![](./randomforest.png)


## Random Forests
#### knnImpute to fill NAs
```{r rf_CV2_knnimpute, cache=TRUE, results='hide', dependson='feature_transform'}
set.seed(1)
rf_survived_caret_CV3_knnimpute <- train(features3, 
                            as.factor(train$Survived), 
                            method = "rf", 
                            ntree = 500, # number of trees in the forest
                            trControl=fitControl,
                            tuneGrid = expand.grid(mtry = c(2, 4, 6)),
                            preProcess = "knnImpute")
```
## Look at the RF model
```{r, dependson='rf_CV2_knnimpute'}
rf_survived_caret_CV3_knnimpute
```

## Can also get importance measures for each feature

```{r}
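varImp(rf_survived_caret_CV3_knnimpute, scale = FALSE)
# train() above was not given importance = TRUE, so randomForest should only have
# the mean decrease in node impurity (Gini) available for each feature
# with importance = TRUE it would also record the permutation measure: the mean decrease
# in accuracy on out-of-bag passengers after that feature is shuffled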
```


## Neural Network - cause why not?

```{r nn_CV2_knn, cache=TRUE, results='hide', dependson='feature_transform'}
set.seed(1)
nn_survived_caret_CV3_knnimpute <- train(features3, 
                            as.factor(train$Survived), 
                            method = "nnet", 
                            trControl=fitControl,
                            tuneLength = 12,
                            preProcess = "knnImpute")

```
## Look at the NN model
```{r, dependson='nn_CV2_knn'}
nn_survived_caret_CV3_knnimpute

```

## Test data!
#### Transform features in same way as training data
```{r}
test <- read.delim("./test.csv", sep = ",", header = TRUE)

# same encoding as the training data: female = 0, male = 1
test$Sex <- as.integer(factor(test$Sex, levels = c("female", "male"))) - 1
test$is_child <- as.integer(test$Age < 10)
test$logfare <- log(test$Fare + 1)
test$fam_size <- test$SibSp + test$Parch
test$single <- as.integer(test$fam_size == 0)
test$fam_too_big <- as.integer(test$fam_size > 3) 

```

## Subset test data as we did for training data
```{r}
test_names <- c("Pclass", "Sex", "logfare", "is_child", "single", "fam_too_big")
features3_test <- test[, colnames(test) %in% test_names]
str(features3_test)
str(features3)

```
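
## Subset test data as we did for training data

Selecting with `%in%` keeps the columns in the data frame's own order, so it gives matching train and test features only as long as both files share the same column order. Indexing by the name vector is a slightly stricter alternative. A sketch on a made-up one-row data frame (`cols_demo` and `wanted` are hypothetical):

```{r}
cols_demo <- data.frame(Sex = 1, Pclass = 3, logfare = 2.1)   # made-up one-row data
wanted    <- c("Pclass", "Sex", "logfare")

# %in% keeps the data frame's own column order and silently skips missing names
colnames(cols_demo[, colnames(cols_demo) %in% wanted])

# indexing by name follows the order of `wanted` and errors if a column is missing
colnames(cols_demo[, wanted])
```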

## Output data frames to be used for scoring
```{r}

# predict knn cross validation knnimpute
predictions_test_submit_knnCV3 <- data.frame(PassengerId = test$PassengerId, Survived = predict(knn_survived_caret_CV3, features3_test))
head(predictions_test_submit_knnCV3)
#write.table(predictions_test_submit_knnCV3, "./test_knn_CV_fam.csv", sep = ",", row.names = FALSE)

# predict logistic regression cross validation knnimpute
predictions_test_submit_LR_CV3 <- data.frame(PassengerId = test$PassengerId, Survived = predict(LR_survived_caret_CV3, features3_test))
#write.table(predictions_test_submit_LR_CV3, "./test_LR_CV_features3_fam_knnimpute.csv", sep = ",", row.names = FALSE)

# predict rf cross validation knnimpute
predictions_test_submit_rf_CV3_knnimpute <- data.frame(PassengerId = test$PassengerId, Survived = predict(rf_survived_caret_CV3_knnimpute, features3_test))
#write.table(predictions_test_submit_rf_CV3_knnimpute, "./test_rf_CV3_fam_knnimpute.csv", sep = ",", row.names = FALSE)

# predict nn cross validation knnimpute
predictions_test_submit_nn_CV3_knnimpute <- data.frame(PassengerId = test$PassengerId, Survived = predict(nn_survived_caret_CV3_knnimpute, features3_test))
#write.table(predictions_test_submit_nn_CV3_knnimpute, "./test_nn_CV3_fam_knnimpute.csv", sep = ",", row.names = FALSE)

```

## How did we do?
```{r}
algorithms <- c("KNN", "Logistic Regression", "Random Forest", "Neural Network")
accuracy <- c(max(knn_survived_caret_CV3$results$Accuracy),  
                max(LR_survived_caret_CV3$results$Accuracy),
                max(rf_survived_caret_CV3_knnimpute$results$Accuracy),
                max(nn_survived_caret_CV3_knnimpute$results$Accuracy))

kaggle_transformed <- c(0.7655, 0.7703, 0.8038, 0.7703)

data.frame(row.names = algorithms, 
           accuracy, 
           kaggle_transformed)

```

## Is this better than untransformed features?
#### redid the above modeling with these features...
```{r}
features_NT_names <- c("Pclass", "Sex", "Age", "SibSp", "Parch", "Fare")
features_NT <- train[, colnames(train) %in% features_NT_names]
str(features_NT)
```

## Is this better than untransformed features?
```{r}
kaggle_nontransformed <- c(0.7703, 0.7519, 0.7560, 0.7512)
data.frame(row.names = algorithms, 
           kaggle_transformed,
           kaggle_nontransformed)
```
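
## Is this better than untransformed features?

The comparison is easier to read as a difference per model. A small sketch that re-enters the Kaggle scores quoted above (variable names are hypothetical):

```{r}
algorithms_demo <- c("KNN", "Logistic Regression", "Random Forest", "Neural Network")
transformed     <- c(0.7655, 0.7703, 0.8038, 0.7703)  # Kaggle scores, transformed features
nontransformed  <- c(0.7703, 0.7519, 0.7560, 0.7512)  # Kaggle scores, raw features

# positive values mean the transformed features scored higher
gain <- setNames(transformed - nontransformed, algorithms_demo)
round(gain, 4)
names(which.max(gain))   # the model that benefited most
```

On these numbers the transformed features helped three of the four models, with KNN scoring slightly lower.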

## The End - thanks!