In this exercise we'll use boosting on the Titanic dataset from Kaggle. The dataset, along with demos and discussions of it, can be found here. The raw Titanic data can be accessed via `tools.get_titanic`. To use boosting, we will use the `GradientBoostingClassifier` from `sklearn`.
We will be using the `pandas` library to organize and clean up the data. The data we're given has quite a few defects (which is quite common when dealing with real data), and we need to clean it up before we make our predictions. A cheat-sheet for `pandas` can be found here.
About the Titanic dataset: this dataset contains information about passengers aboard the Titanic, including whether or not each passenger survived the shipwreck. It is often used to train models that predict, from the other passenger information, which passengers survived and which didn't.
For each passenger, at least some of the following information is given:
Variable | Definition | Key |
---|---|---|
survival | Survival | 0 = No, 1 = Yes |
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
sex | Sex | - |
age | Age in years | - |
sibsp | # of siblings / spouses aboard the Titanic | - |
parch | # of parents / children aboard the Titanic | - |
ticket | Ticket number | - |
fare | Passenger fare | - |
cabin | Cabin number | - |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
We have to clean up the data. We have provided you with the function `tools.get_titanic` that cleans up the data and returns `(X_train, t_train), (X_test, t_test), submission_X`:

- `X_train`: the features for the training set
- `t_train`: the targets for the training set
- `X_test`: the features for the test set
- `t_test`: the targets for the test set
- `submission_X`: features that you will use to make predictions about the survival of passengers, without any knowledge of the corresponding `submission_y`
Example usage:

```python
>>> (tr_X, tr_y), (tst_X, tst_y), submission_X = get_titanic()
>>> # get the first row of the training features
>>> tr_X[:1]
     Pclass  SibSp  Parch    Fare  Sex_male  Cabin_mapped_1  Cabin_mapped_2  ...  Cabin_mapped_4  Cabin_mapped_5  Cabin_mapped_6  Cabin_mapped_7  Cabin_mapped_8  Embarked_Q  Embarked_S
794       3      0      0  7.8958         1               0               0  ...               0               0               0               0               0           0           1

[1 rows x 15 columns]
```
Notice that all these sets are not `numpy` arrays but `pandas` data frames.
Take a look at `tools.get_titanic`. In the middle of the function we drop a few columns, some because they contain NaNs (not-a-number values):

```python
X_full.drop(
    ['PassengerId', 'Cabin', 'Age', 'Name', 'Ticket'],
    inplace=True, axis=1)
```
You can take a look at the data frame before performing the drop with e.g. `print(X_full[:10])`; in the first 10 rows you should see these age values:

```
22.0
38.0
26.0
35.0
35.0
NaN
54.0
2.0
27.0
14.0
```
But maybe we can replace the `NaN` values with some other information? Take a look at `get_titanic` for clues. Make a new function in your `template.py` called `get_better_titanic` that does this; one possible approach is sketched below.
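One common fix is to impute the missing ages rather than drop the whole column. A minimal sketch of the idea (the helper name `impute_age` is ours, `X_full` refers to the frame from the snippet above, and median imputation is just one reasonable choice):

```python
import pandas as pd

def impute_age(X_full: pd.DataFrame) -> pd.DataFrame:
    """Fill missing Age values with the median age so the column
    can be kept instead of being dropped."""
    X_full = X_full.copy()
    X_full['Age'] = X_full['Age'].fillna(X_full['Age'].median())
    return X_full
```

With something like this in place, `get_better_titanic` can drop one fewer column than `get_titanic` does.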
This question should be answered in `1_2.txt`. Write a short summary about your change to `get_titanic`.
We will now try training a random forest classifier on the Titanic data. Create a function `rfc_train_test(X_train, t_train, X_test, t_test)` that trains a Random Forest Classifier on `(X_train, t_train)` and returns the following metrics on `(X_test, t_test)`:

- accuracy
- precision
- recall
Example inputs and outputs (note: you might not get exactly the same result):

```python
>>> rfc_train_test(tr_X, tr_y, tst_X, tst_y)
(0.8097014925373134, 0.7708333333333334, 0.7184466019417476)
```
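A minimal sketch of what such a function can look like (the hyperparameter values here are illustrative defaults, not a recommendation):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

def rfc_train_test(X_train, t_train, X_test, t_test):
    # Illustrative parameter choices; experiment to find better ones.
    clf = RandomForestClassifier(n_estimators=100, max_depth=None)
    clf.fit(X_train, t_train)
    preds = clf.predict(X_test)
    return (accuracy_score(t_test, preds),
            precision_score(t_test, preds),
            recall_score(t_test, preds))
```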
This question should be answered in `2_2.txt`. Upload your choice of parameters and the accuracy, precision and recall on the test set. How or why did you choose those parameters?
Create a function `gb_train_test(X_train, t_train, X_test, t_test)` that trains a `GradientBoostingClassifier` on `(X_train, t_train)` and returns the following metrics on `(X_test, t_test)`:

- accuracy
- precision
- recall
Example inputs and outputs (note: you might not get exactly the same result):

```python
>>> gb_train_test(tr_X, tr_y, tst_X, tst_y)
(0.8208955223880597, 0.8313253012048193, 0.6699029126213593)
```
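This follows the same pattern as the `rfc_train_test` sketch above, swapping in the boosting model; a sketch under the same assumptions:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

def gb_train_test(X_train, t_train, X_test, t_test):
    clf = GradientBoostingClassifier()  # default parameters for now
    clf.fit(X_train, t_train)
    preds = clf.predict(X_test)
    return (accuracy_score(t_test, preds),
            precision_score(t_test, preds),
            recall_score(t_test, preds))
```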
This question should be answered in `2_4.txt`. Upload the gradient boosting classifier's accuracy, precision and recall on the test set. How does it compare to the random forest classifier?
This question should be answered in `2_5.txt`. We will now try to perform a randomized parameter search to find better parameters for the gradient boosting classifier. We will be using `RandomizedSearchCV` from `sklearn.model_selection` to help us do this.
Fill in the blanks in the `param_search` function:

- `n_estimators`: choose multiple integer values between 0 and 100
- `max_depth`: choose multiple integer values no more than 50
- `learning_rate`: choose multiple float values between 0 and 1
By calling the function with the training features and training targets, it will perform the parameter search and return the best values of `n_estimators`, `max_depth` and `learning_rate` it can find.
Example usage (note: you might not get exactly the same result):

```python
>>> param_search(tr_X, tr_y)
(10, 2, 0.5)
```
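A sketch of how the blanks might be filled in (the specific candidate value lists and the `n_iter`/`cv` settings are our assumptions; any choice within the stated ranges works):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

def param_search(X, t):
    # Candidate values chosen within the ranges given above.
    gb_param_grid = {
        'n_estimators': [5, 10, 25, 50, 100],
        'max_depth': [2, 3, 5, 10, 50],
        'learning_rate': [0.01, 0.05, 0.1, 0.5, 1.0]}
    search = RandomizedSearchCV(
        estimator=GradientBoostingClassifier(),
        param_distributions=gb_param_grid,
        n_iter=50,  # number of random parameter combinations to try
        cv=4)       # 4-fold cross-validation
    search.fit(X, t)
    best = search.best_params_
    return (best['n_estimators'], best['max_depth'], best['learning_rate'])
```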
Run this function and report the best parameter values it finds.
Create a function `gb_optimized_train_test` that does the same as your `gb_train_test` but uses your optimized parameters instead.
```python
>>> gb_optimized_train_test(tr_X, tr_y, tst_X, tst_y)
(0.8171641791044776, 0.7934782608695652, 0.7087378640776699)
```
To get full marks, each score has to be higher than the corresponding one from your vanilla GB classifier.
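A sketch, with the example values from the `param_search` output above plugged in (substitute the values your own search returns):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

def gb_optimized_train_test(X_train, t_train, X_test, t_test):
    # (10, 2, 0.5) is only the example output shown above.
    clf = GradientBoostingClassifier(
        n_estimators=10, max_depth=2, learning_rate=0.5)
    clf.fit(X_train, t_train)
    preds = clf.predict(X_test)
    return (accuracy_score(t_test, preds),
            precision_score(t_test, preds),
            recall_score(t_test, preds))
```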
Now that you have your optimized classifier, we will submit its predictions to Kaggle. Fill in the `_create_submission` function. Run the function and a `.csv` submission file will appear under `./data`.
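A sketch of the submission step, under two assumptions we make for illustration: the classifier passed in is already fitted, and the `PassengerId` column (which `get_titanic` drops) is recovered from Kaggle's raw `test.csv` under `./data` (path assumed):

```python
import pandas as pd

def _create_submission(clf, submission_X):
    # Predict survival for the unlabeled passengers.
    predictions = clf.predict(submission_X)
    # Recover the PassengerId column from the raw Kaggle file (assumed path),
    # since get_titanic drops it from submission_X.
    passenger_ids = pd.read_csv('./data/test.csv')['PassengerId']
    out = pd.DataFrame({'PassengerId': passenger_ids,
                        'Survived': predictions})
    out.to_csv('./data/submission.csv', index=False)
```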
Log in to Kaggle, join the Titanic competition and submit your `.csv` file. After the upload, Kaggle will calculate your score. It should look something like:

```
7132  YourName  novice tier  0.77033  1  3m
```
Read this carefully before you submit your solution. You should edit `template.py` to include your own code. This is an individual project, so while you can of course help each other out, your code should be your own. You are not allowed to import any non-built-in packages that are not already imported.
Files to turn in:

- `template.py`: this is your code
- `1_2.txt`
- `2_2.txt`
- `2_4.txt`
- `2_5.txt`
Make sure the file names are exact. Submissions that do not pass the first two tests in Gradescope will not be graded.
There are many other types of models that we didn't try on this dataset and that might work better. Maybe there are better parameter optimizations that could be done. Maybe the `get_titanic` function could be further improved? You decide what you do, but the goal of the independent section is to improve your Kaggle score the most. Upload any code that you generated as part of the independent assignment along with a short summary of your experimentation, solution and results.