The files are organised based on:
i) the process followed to implement the models,
ii) the machine learning algorithms being applied.
Contains the functions used to clean the data. Specifically:
-handling of missing values;
-creation of dummy variables;
-feature augmentation with interaction terms and polynomials;
-normalisation of data.
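The cleaning steps listed above could look roughly like the following NumPy sketch. It is not the code from this file: the function names, the -999.0 missing-value sentinel and the choice of median imputation are all assumptions made for illustration.

```python
import numpy as np

def replace_missing(x, missing_value=-999.0):
    """Replace a sentinel value by the median of the observed entries of its column.
    The sentinel (-999.0) and the median strategy are assumptions."""
    x = x.astype(float)  # astype returns a copy, so the input is left untouched
    for j in range(x.shape[1]):
        col = x[:, j]  # view into x, so the assignment below modifies x
        mask = col == missing_value
        if mask.any() and not mask.all():
            col[mask] = np.median(col[~mask])
    return x

def one_hot(column):
    """Create one dummy (0/1) column per distinct value of a categorical column."""
    values = np.unique(column)
    return (column[:, None] == values[None, :]).astype(float)

def build_poly(x, degree):
    """Stack the powers 1..degree of every feature side by side."""
    return np.hstack([x ** d for d in range(1, degree + 1)])

def add_interactions(x):
    """Append the pairwise products x_i * x_j (i < j) to the features."""
    i, j = np.triu_indices(x.shape[1], k=1)
    return np.hstack([x, x[:, i] * x[:, j]])

def standardize(x):
    """Centre each column and scale it to unit standard deviation."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (x - mean) / std, mean, std
```

Returning the mean and standard deviation from standardize makes it possible to apply the same transformation to the test set.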
Contains the generic functions used throughout the implementation of algorithms. Specifically:
-auxiliary functions for regression implementations:
-single steps of regression algorithms;
-extraction of a sample of the dataset;
-batch creation.
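As an illustration of the helpers listed above, here is a minimal sketch of sample extraction, batch creation and a single gradient step for least squares. The names sample_rows, iterate_batches and least_squares_gradient_step are hypothetical and need not match the functions in this file (in particular, this is not the batch_iter helper from the course).

```python
import numpy as np

def sample_rows(y, tx, n_samples, seed=0):
    """Extract a random sample of rows, without replacement."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=n_samples, replace=False)
    return y[idx], tx[idx]

def iterate_batches(y, tx, batch_size, seed=0):
    """Yield shuffled mini-batches that together cover the dataset once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        yield y[idx], tx[idx]

def least_squares_gradient_step(y, tx, w, gamma):
    """One gradient descent step on the loss 1/(2n) * ||y - tx w||^2."""
    err = y - tx @ w
    grad = -tx.T @ err / len(y)
    return w - gamma * grad
```

A stochastic gradient descent loop would then call least_squares_gradient_step once per batch yielded by iterate_batches.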
Implements a subset selection method based on the Akaike information criterion (AIC). The method is implemented for both ridge and logistic regression: it builds a sequence of models with an increasing number of variables, greedily adding one new variable at each step, and then selects the best of these models by AIC. Contains the functions:
compare_aic_gradient_descent(y,tx,gamma,max_iter,threshold)
compare_aic_ridge(y,tx,lambda_)
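The greedy procedure described above can be sketched as follows for the ridge case. This is an illustration under stated assumptions, not the body of compare_aic_ridge: the function names are hypothetical, the AIC is taken in its Gaussian form n*log(RSS/n) + 2k, the ridge penalty is scaled by 2n, and the variable added at each step is assumed to be the one giving the lowest AIC.

```python
import numpy as np

def ridge_fit(y, tx, lambda_):
    """Closed-form ridge solution (penalty scaled by 2n, an assumed convention)."""
    n, d = tx.shape
    return np.linalg.solve(tx.T @ tx + 2 * n * lambda_ * np.eye(d), tx.T @ y)

def gaussian_aic(y, tx, w):
    """AIC for a Gaussian model, up to an additive constant."""
    n = len(y)
    rss = np.sum((y - tx @ w) ** 2)
    return n * np.log(rss / n) + 2 * tx.shape[1]

def forward_select_aic_ridge(y, tx, lambda_):
    """Greedily add one variable per step, then keep the subset with the lowest AIC."""
    remaining = list(range(tx.shape[1]))
    selected = []
    best_aic, best_subset = np.inf, []
    while remaining:
        # Score every candidate variable added to the current subset.
        scores = []
        for j in remaining:
            cols = selected + [j]
            w = ridge_fit(y, tx[:, cols], lambda_)
            scores.append((gaussian_aic(y, tx[:, cols], w), j))
        aic, j = min(scores)
        selected.append(j)
        remaining.remove(j)
        # Remember the best model seen along the greedy path.
        if aic < best_aic:
            best_aic, best_subset = aic, list(selected)
    return best_subset, best_aic
```

The logistic variant would follow the same loop, replacing the ridge fit by gradient descent and the Gaussian AIC by one based on the logistic log-likelihood.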
Contains the implementations of the main machine learning algorithms we selected. The functions defined in this file are:
-least_squares_GD
-least_squares_SGD
-least_squares
-ridge_regression
-ridge_regression_SGD
-lasso_regression_GD
-logistic_regression
-reg_logistic_regression
-logistic_regression_newton_method_demo
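To give an idea of what two of these functions compute, here is a minimal sketch of closed-form ridge regression and of logistic regression trained by gradient descent. The signatures, the 2n scaling of the ridge penalty and the 0/1 label encoding are assumptions; the functions in this file may differ.

```python
import numpy as np

def sigmoid(t):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-t))

def ridge_regression_sketch(y, tx, lambda_):
    """Solve (X^T X + 2 n lambda I) w = X^T y and return w with the MSE loss."""
    n, d = tx.shape
    w = np.linalg.solve(tx.T @ tx + 2 * n * lambda_ * np.eye(d), tx.T @ y)
    loss = np.mean((y - tx @ w) ** 2) / 2
    return w, loss

def logistic_regression_sketch(y, tx, initial_w, max_iters, gamma):
    """Gradient descent on the mean negative log-likelihood, labels in {0, 1}."""
    w = initial_w.copy()
    for _ in range(max_iters):
        grad = tx.T @ (sigmoid(tx @ w) - y) / len(y)
        w = w - gamma * grad
    p = sigmoid(tx @ w)
    eps = 1e-12  # avoids log(0) when predictions saturate
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return w, loss
```

With lambda_ set to 0 the ridge sketch reduces to ordinary least squares, provided that tx has full column rank.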
Cross-validation is used to choose the hyperparameters and polynomial degrees of the different regression models. The files which implement cross-validation are:
-"cross_validaion_logistic.py"
-"cross_validation_lasso.py"
-"cross_validation_ridge.py"
-"cross_validation_ridge_super.py"
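As an illustration of how a hyperparameter can be chosen this way, the sketch below runs k-fold cross-validation over a list of ridge penalties. The function names, the fold construction and the use of mean squared error as the validation score are assumptions and need not match the files above.

```python
import numpy as np

def build_k_indices(n, k_fold, seed=1):
    """Split a random permutation of range(n) into k_fold disjoint folds."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    fold_size = n // k_fold  # leftover rows are dropped when n % k_fold != 0
    return [order[i * fold_size:(i + 1) * fold_size] for i in range(k_fold)]

def cross_validate_ridge(y, tx, lambdas, k_fold=4, seed=1):
    """Return the lambda with the lowest mean validation MSE, and all mean errors."""
    folds = build_k_indices(len(y), k_fold, seed)
    mean_errors = []
    for lambda_ in lambdas:
        errors = []
        for k in range(k_fold):
            test_idx = folds[k]
            train_idx = np.concatenate([folds[i] for i in range(k_fold) if i != k])
            x_tr, y_tr = tx[train_idx], y[train_idx]
            n, d = x_tr.shape
            # Closed-form ridge fit on the training folds.
            w = np.linalg.solve(x_tr.T @ x_tr + 2 * n * lambda_ * np.eye(d), x_tr.T @ y_tr)
            # Mean squared error on the held-out fold.
            errors.append(np.mean((y[test_idx] - tx[test_idx] @ w) ** 2))
        mean_errors.append(np.mean(errors))
    best = int(np.argmin(mean_errors))
    return lambdas[best], mean_errors
```

Selecting a polynomial degree works the same way: the outer loop runs over degrees, and the features are expanded before each fit.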
The following scripts each run a different machine learning algorithm end to end, from data loading to creation of the final CSV file.
-"test_lasso.py"
-"test_logistic_penalized.py"
-"test_logistic_penalized-cross.py"
-"test_logistic_newton.py"
-"test_logistic_gd.py"
-"test_AIC_logistic.py"
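The last stage of such a script, turning a weight vector into a prediction file, might look like the sketch below. The -1/1 label encoding, the "Id" and "Prediction" column names and both function names are assumptions, not taken from the scripts above.

```python
import csv
import numpy as np

def predict_labels(w, tx):
    """Map linear scores to labels -1 / 1 (the label encoding is an assumption)."""
    scores = tx @ w
    return np.where(scores >= 0, 1, -1)

def write_submission(ids, labels, out_file):
    """Write one (id, prediction) row per sample to an open text file."""
    writer = csv.writer(out_file)
    writer.writerow(["Id", "Prediction"])  # assumed header
    for sample_id, label in zip(ids, labels):
        writer.writerow([int(sample_id), int(label)])

# Example usage, with a hypothetical output path:
# with open("submission.csv", "w", newline="") as f:
#     write_submission(ids, predict_labels(w, tx_test), f)
```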
A few functions, such as batch_iter, were taken from the helper code provided in the lab sessions of the course.