- Prerequisites
- Data folder
- Exploratory Data Analysis (EDA folder)
- NB tuning folder
- RF tuning folder
- Model Comparison folder
- fitcnb_ce folder (adapted Matlab implementation of fitcnb using cross entropy)
- utils folder
- Authors
- License
- Matlab R2018a
- Statistics and Machine Learning Toolbox
- Neural Network Toolbox
The following toolboxes are used if present, but the scripts run without them:
- Parallel Computing Toolbox (used to speed up the Random Forest computations on the available hardware)
- Deep Learning Toolbox (used to plot confusion charts)
Loads the data file heart.csv, containing the heart data, from the current directory and splits it into training and test sets, returning the labels and features for each set along with a cvpartition object. The script fixes a random seed so that the cross-validation partition and the training/test split are deterministic and the results are repeatable.
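A minimal sketch of this workflow, assuming the class label is the last column of heart.csv and an 80/20 hold-out split (the variable names and split ratio here are illustrative, not taken from the script):

```matlab
% Sketch: load heart.csv, fix the random seed and create reproducible splits.
rng(1);                                           % fixed seed for repeatability
data     = readtable('heart.csv');                % heart data in the current directory
labels   = data{:, end};                          % assumes the label is the last column
features = data(:, 1:end-1);

% Hold out a test set, keep the rest for training (ratio assumed).
holdout = cvpartition(labels, 'HoldOut', 0.2);    % stratified 80/20 split
XTrain  = features(training(holdout), :);
yTrain  = labels(training(holdout));
XTest   = features(test(holdout), :);
yTest   = labels(test(holdout));

% k-fold partition on the training labels for model selection.
cvp = cvpartition(yTrain, 'KFold', 10);
```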
Generates boxplots for the continuous predictor features and bar charts showing frequency counts for the categorical features.
Performs basic exploratory analysis on the data and generates a heat map of correlations between features.
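The correlation heat map could be produced roughly as follows (reusing the illustrative XTrain table from the previous sketch and keeping only its numeric columns):

```matlab
% Sketch: heat map of pairwise correlations between the numeric predictors.
numTbl = XTrain(:, vartype('numeric'));           % numeric columns only
R      = corr(numTbl{:, :}, 'Rows', 'pairwise');  % Pearson correlation matrix
heatmap(numTbl.Properties.VariableNames, ...
        numTbl.Properties.VariableNames, R);
title('Feature correlations');
```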
This is the top-level script that runs the experiments on the Naive Bayes model. The script runs Bayesian optimisation and a grid search that test normal and kernel distributions and optimise the kernel width over all features. A manual grid search is also run in which every combination of distributions is tried for the continuous features. The data is then standardised and the same process is re-run. The Naive_Bayes_Optimisation.m and Naive_Bayes_man_gs.m functions are called to run the optimisations. The script then compares models trained using cross entropy as the loss function with models trained using MCR (misclassification rate). Finally, the script runs and evaluates the best model selected from these experiments.
Function that runs Bayesian optimisation and a grid search given features, labels and a cross-validation partition, and prints the runtime of these optimisations to the terminal.
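A hedged sketch of what such a call could look like (the repository's actual function signature may differ; XTrain, yTrain and cvp reuse the illustrative names above):

```matlab
% Sketch: Bayesian optimisation of the NB distribution names and kernel width.
tic;
mdlBayes = fitcnb(XTrain, yTrain, ...
    'OptimizeHyperparameters', {'DistributionNames', 'Width'}, ...
    'HyperparameterOptimizationOptions', struct( ...
        'Optimizer', 'bayesopt', 'CVPartition', cvp, 'ShowPlots', false));
fprintf('Bayesian optimisation took %.1f s\n', toc);

% Sketch: the same search space explored with a grid search instead.
tic;
mdlGrid = fitcnb(XTrain, yTrain, ...
    'OptimizeHyperparameters', {'DistributionNames', 'Width'}, ...
    'HyperparameterOptimizationOptions', struct( ...
        'Optimizer', 'gridsearch', 'CVPartition', cvp, 'ShowPlots', false));
fprintf('Grid search took %.1f s\n', toc);
```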
Runs a manual grid search over all possible combinations of distributions for the set of features. Categorical features are fixed to the multivariate multinomial distribution (mvmn), while kernel (kernel) and Gaussian (normal) distributions are tried for the continuous features.
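A minimal sketch of such a grid search (the continuous-feature indices below are placeholders, not the dataset's actual column positions):

```matlab
% Sketch: manual grid search over per-feature distributions.
contIdx = [1 4 5 8 10];                              % continuous feature positions (assumed)
nFeat   = width(XTrain);
nCont   = numel(contIdx);
choices = {'normal', 'kernel'};

bestLoss = inf;
for c = 0:2^nCont - 1                                % every normal/kernel combination
    dists = repmat({'mvmn'}, 1, nFeat);              % categorical features stay mvmn
    bits  = bitget(c, 1:nCont);
    dists(contIdx) = choices(bits + 1);
    cvMdl = fitcnb(XTrain, yTrain, 'DistributionNames', dists, ...
                   'CategoricalPredictors', setdiff(1:nFeat, contIdx), ...
                   'CVPartition', cvp);
    L = kfoldLoss(cvMdl);                            % cross-validated misclassification rate
    if L < bestLoss
        bestLoss  = L;
        bestDists = dists;
    end
end
```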
Performs k-fold cross-validation using Random Forest analysis to calculate predictor importance for all included variables. Predictor importance is calculated on each fold, the values are then averaged for each predictor, and the averages are displayed in a bar chart.
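A sketch of how per-fold importance could be averaged with TreeBagger (the number of trees is a placeholder):

```matlab
% Sketch: average out-of-bag predictor importance across the k folds.
k   = cvp.NumTestSets;
imp = zeros(k, width(XTrain));
for i = 1:k
    rf = TreeBagger(200, XTrain(training(cvp, i), :), yTrain(training(cvp, i)), ...
                    'Method', 'classification', 'OOBPredictorImportance', 'on');
    imp(i, :) = rf.OOBPermutedPredictorDeltaError;   % per-fold importance estimates
end
bar(mean(imp, 1));                                   % average importance per predictor
xticklabels(XTrain.Properties.VariableNames); xtickangle(45);
ylabel('Mean OOB permuted delta error');
```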
Sets up 10-fold cross-validation and performs Random Forest analysis using Matlab's TreeBagger implementation. Searches for the optimal TreeBagger hyperparameters MinLeafSize (minimum number of observations per tree leaf), NumPredictorsToSample (number of variables selected at random for each decision split) and numTrees (number of trees in the random forest ensemble) using Bayesian optimisation.
Performs 20 cycles of Bayesian optimisation, each sampling 30 different points of the hyperparameter search space.
The error is assessed with either an MCR loss function (lossFcn_RF_MCR.m, found in the RF tuning folder) or a cross-entropy (CE) loss function (lossFcn_RF_CE.m, found in the RF tuning folder).
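A hedged sketch of this optimisation loop with a cross-entropy objective (the variable ranges and the rfCrossEntropy helper are illustrative stand-ins for the repository's lossFcn_RF_CE.m, not its actual code):

```matlab
% Sketch: Bayesian optimisation of TreeBagger hyperparameters with a CE objective.
vars = [optimizableVariable('MinLeafSize',           [1 50],            'Type', 'integer'), ...
        optimizableVariable('NumPredictorsToSample', [1 width(XTrain)], 'Type', 'integer'), ...
        optimizableVariable('numTrees',              [50 500],          'Type', 'integer')];

objFcn  = @(p) rfCrossEntropy(p, XTrain, yTrain, cvp);   % stand-in for lossFcn_RF_CE.m
results = bayesopt(objFcn, vars, 'MaxObjectiveEvaluations', 30);

function ce = rfCrossEntropy(p, X, y, cvp)
% Mean cross entropy of the out-of-fold class posteriors (illustrative loss).
ce = 0;
for i = 1:cvp.NumTestSets
    rf = TreeBagger(p.numTrees, X(training(cvp, i), :), y(training(cvp, i)), ...
                    'Method', 'classification', 'MinLeafSize', p.MinLeafSize, ...
                    'NumPredictorsToSample', p.NumPredictorsToSample);
    [~, post]    = predict(rf, X(test(cvp, i), :));      % per-class posterior estimates
    yFold        = y(test(cvp, i));
    [~, trueCol] = ismember(string(yFold), string(rf.ClassNames));
    idx          = sub2ind(size(post), (1:numel(yFold))', trueCol);
    ce = ce + mean(-log(max(post(idx), eps))) / cvp.NumTestSets;
end
end
```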
Calculates the mean and standard deviation of several performance metrics (recall, precision, F1, specificity, accuracy, AUC) over each 20-cycle run.
Returns a bar chart of the mean performance metrics and a ROC curve showing the best-performing models generated using MCR and CE.
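If those per-run metrics are collected into a matrix, the summary chart could be produced roughly as follows (the metrics variable and its layout are assumptions):

```matlab
% Sketch: bar chart of mean metrics with standard-deviation error bars.
metricNames = {'Recall', 'Precision', 'F1', 'Specificity', 'Accuracy', 'AUC'};
mu = mean(metrics, 1);                    % metrics: one row per cycle, one column per metric
sd = std(metrics, 0, 1);
bar(mu); hold on;
errorbar(1:numel(mu), mu, sd, 'k.', 'LineWidth', 1);
xticklabels(metricNames); ylabel('Score'); ylim([0 1]);
hold off;
```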
Trains the optimised NB and RF models on the complete training dataset using the optimised hyperparameters obtained from Run_NB_Analysis.m for NB and feature_selection.m for RF (the optimal parameter values are hard-coded for convenience). The trained models are used to make predictions on the test set. Performance metrics (recall, precision, F1, specificity, accuracy, AUC) are generated for each model, a bar chart comparing the performance of the NB and RF models on the test set is returned, and a ROC curve comparing the NB and RF models is also generated.
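A minimal sketch of this final step (the hyperparameter values below are placeholders, not the tuned values hard-coded in the script; bestDists reuses the earlier illustrative name):

```matlab
% Sketch: retrain the tuned models on the full training set and predict the test set.
nbFinal = fitcnb(XTrain, yTrain, 'DistributionNames', bestDists);
rfFinal = TreeBagger(300, XTrain, yTrain, 'Method', 'classification', ...
                     'MinLeafSize', 5, 'NumPredictorsToSample', 4);   % placeholder values

[nbPred, nbScore] = predict(nbFinal, XTest);
[rfPred, rfScore] = predict(rfFinal, XTest);
rfPred = str2double(rfPred);              % TreeBagger returns labels as a cell array of char

cmNB = confusionmat(yTest, nbPred);       % confusion matrices for the metric calculations
cmRF = confusionmat(yTest, rfPred);
```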
Adapts Matlab's fitcnb 'bayesopt' implementation to use cross entropy instead of misclassification rate as the loss function for exploring the hyperparameter search space. The current version is compatible with the R2018a and R2018b Matlab releases. The code that changes the loss function is located in createObjFcn_ce.m. All paths to the other scripts called within the fitcnb implementation have been preserved, so all other fitcnb functionality is retained.
Function script which determines the specification of the machine running the analysis and sets up a parallel pool environment (see the sketch after this list):
- Finds the presence/absence of a GPU.
- Finds the number of cores available.
- Finds the number of CPUs available.
- Implements a parallel pool environment using the available resources.
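A minimal sketch of this kind of hardware probe (the exact checks in the repository's script may differ):

```matlab
% Sketch: probe the machine and start a parallel pool sized to the available resources.
nCores = maxNumCompThreads;                     % number of computational threads/cores
try
    hasGPU = gpuDeviceCount > 0;                % needs Parallel Computing Toolbox
catch
    hasGPU = false;
end
fprintf('Cores: %d, GPU present: %d\n', nCores, hasGPU);

if license('test', 'Distrib_Computing_Toolbox') && isempty(gcp('nocreate'))
    parpool(nCores);                            % open a pool matching the core count
end
```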
Function script which takes as input a classification model, the confusion matrix for that model, a matrix of input features and the corresponding array of labels. Returns the following performance metrics (a sketch of the calculations follows the list):
- recall
- precision
- F1
- specificity
- accuracy
- AUC

It also returns a ROC curve for the model.
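A sketch of how these metrics could be derived for a binary problem, assuming a fitcnb-style model mdl, its confusion matrix cm (negative class first, as returned by confusionmat) and test data XTest/yTest:

```matlab
% Sketch: binary-class metrics from a confusion matrix, plus ROC/AUC via perfcurve.
tn = cm(1, 1); fp = cm(1, 2); fn = cm(2, 1); tp = cm(2, 2);

recall      = tp / (tp + fn);
precision   = tp / (tp + fp);
f1          = 2 * precision * recall / (precision + recall);
specificity = tn / (tn + fp);
accuracy    = (tp + tn) / sum(cm(:));

[~, scores]        = predict(mdl, XTest);                 % class posterior probabilities
[fpr, tpr, ~, auc] = perfcurve(yTest, scores(:, 2), mdl.ClassNames(2));
plot(fpr, tpr); xlabel('False positive rate'); ylabel('True positive rate');
title(sprintf('ROC (AUC = %.2f)', auc));
```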
Kevin Ryan, Peter Grimshaw
This project is licensed under the MIT License - see the LICENSE.md file for details