Smartphone accelerometer data for human activity classification

Completing this project earned me a verified certificate from HarvardX. The goal was to predict physical activity from tri-axial smartphone accelerometer data collected through the Beiwe research platform.

Provided datasets

In the time series files, the relevant columns are x, y and z, which hold linear acceleration along three orthogonal axes, and timestamp, which tracks time. The train accelerations dataset has 3744 rows versus 1250 in the test dataset.

In the train label file (375 rows), each time point is assigned a number from 1 to 4 (1 = Standing, 2 = Walking, 3 = Stairs down, 4 = Stairs up). The test label file (175 rows) has empty cells to be filled with the predicted classes, which were evaluated externally by the edX platform.

Main steps of my supervised ML approach

Exploring and cleaning data. Two critical aspects emerged. First, the acceleration data (10 Hz, one data point every 0.1 second) and the activity labels (1 Hz, one label every second) have different sampling rates, which required a homogenization procedure* so that each label has its own descriptor. Here the acceleration data was filtered by the timestamp of each given label, reducing the dataframe from 3744 to 375 rows. Second, a strong class imbalance was evident and had to be addressed.
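The timestamp filtering described above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's code: the column names `timestamp`, `x`, `y`, `z` and `label` follow the dataset description, while the millisecond units and all values are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the two files: 10 Hz accelerations, 1 Hz labels.
# Timestamps are assumed to be in milliseconds; all values are made up.
rng = np.random.default_rng(0)
acc = pd.DataFrame({
    "timestamp": np.arange(0, 5000, 100),     # one row every 0.1 s
    "x": rng.normal(size=50),
    "y": rng.normal(size=50),
    "z": rng.normal(size=50),
})
labels = pd.DataFrame({
    "timestamp": np.arange(900, 5000, 1000),  # one row every 1 s
    "label": [1, 1, 2, 2, 4],
})

# Keep only the acceleration rows whose timestamp matches a label timestamp,
# so that each label ends up with exactly one (x, y, z) descriptor.
aligned = labels.merge(acc, on="timestamp", how="inner")
```

If the two sets of timestamps did not coincide exactly, `pandas.merge_asof` with `direction="nearest"` would be one alternative to an exact merge.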
Modeling. The three classification models covered in the course were used: multinomial logistic regression, random forest classifier and nearest neighbors classifier. Since accuracy alone can be misleading on imbalanced datasets (the accuracy paradox), model selection was based on the average of the accuracy and F1 scores. Cohen's kappa, a value between -1 and 1, was then used to confirm the best model: it measures the agreement between predicted and true labels corrected for the agreement expected by chance, which makes it less sensitive to class imbalance than raw accuracy.
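To illustrate the accuracy paradox mentioned above, the sketch below computes Cohen's kappa by hand, as (p_o - p_e) / (1 - p_e), for a classifier that always predicts the majority class. The helper `cohen_kappa` and the toy class proportions are assumptions made for this example and do not come from the project data.

```python
import numpy as np

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.union1d(y_true, y_pred)
    # Observed agreement, which is the plain accuracy.
    p_o = np.mean(y_true == y_pred)
    # Agreement expected by chance, from the marginal class frequencies.
    p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in classes)
    return (p_o - p_e) / (1.0 - p_e)

# Imbalanced toy labels: 80% of the samples belong to class 2.
y_true = np.array([2] * 80 + [1] * 10 + [3] * 5 + [4] * 5)
always_majority = np.full_like(y_true, 2)

accuracy = np.mean(y_true == always_majority)   # looks good
kappa = cohen_kappa(y_true, always_majority)    # no better than chance
```

Here the accuracy is high only because of the class proportions, while kappa shows that the predictions carry no information. scikit-learn exposes the same metric as `sklearn.metrics.cohen_kappa_score`.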
SMOTE for balancing data. Given the already small size of the train set, the minority classes were oversampled with synthetic data. SMOTE selects examples that are close in feature space, draws a line between them and creates a new sample at a point along that line. The best parameters found earlier (with a cross-validated randomized search) were reused to fit the model.

                 Accuracy   F1-score   Kappa score
Imbalanced data  0.722667   0.675894   0.458995
SMOTE data       0.850939   0.848939   0.801252

The best model's performance increased substantially after balancing the data. The same model was then used to predict labels for the test set.
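The interpolation idea behind SMOTE, as described above, can be sketched in a few lines of NumPy. This is a simplified illustration and not the implementation used in the notebook: the function `smote_like_samples`, its parameters and the toy points are all hypothetical.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between neighbours."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)             # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X), size=n_new)  # randomly chosen starting points
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                # position along the connecting line
    return X[base] + gap * (X[picked] - X[base])

# Toy minority class with (x, y, z) descriptors; values are made up.
minority = np.array([
    [0.0, 0.0, 1.0],
    [0.2, 0.1, 0.9],
    [0.1, 0.3, 1.1],
    [0.3, 0.2, 1.0],
])
synthetic = smote_like_samples(minority, n_new=6)
```

Each synthetic row lies on the segment between a minority sample and one of its nearest neighbours. In practice a library such as imbalanced-learn offers this as `SMOTE().fit_resample(X, y)`, which also handles several classes at once.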

Improvement suggestions

Homogenization could be improved by replicating each label 10 times along the time axis, so that every acceleration sample is kept and the classifier is trained on more data.
The number of descriptors could be further increased by extracting statistical features of the three axis components (e.g., median, SD, kurtosis, first Fourier transform element).
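One way the statistical-feature suggestion might look is sketched below: each group of 10 consecutive samples (one second at 10 Hz) is summarized by per-axis statistics. The data is synthetic, the window assignment by row position is an assumption, and only the median and standard deviation are shown.

```python
import numpy as np
import pandas as pd

# Synthetic 10 Hz accelerations covering 5 seconds; values are made up.
rng = np.random.default_rng(0)
acc = pd.DataFrame(rng.normal(size=(50, 3)), columns=["x", "y", "z"])

# Assign each group of 10 consecutive samples to one 1-second window.
window = np.arange(len(acc)) // 10

# One row per window, with the median and standard deviation of each axis.
features = acc.groupby(window).agg(["median", "std"])
features.columns = [f"{axis}_{stat}" for axis, stat in features.columns]
```

The result has one row per label-sized window and six descriptors instead of three. Further statistics such as kurtosis or a Fourier coefficient could be added to the aggregation in the same way.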


gufett0/classification-timeseries-accelerometer

Folders and files

Latest commit

History

Repository files navigation

Smartphone accelerometer data for human activity classification

Provided datasets

Main steps of my supervised ML approach

Improvement suggestions

See Also*

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages