This repository contains my data mining course homework. The assignments seemed useful and interesting, so I decided to explain what is going on so you can practice too.
There is a file named covid.csv that contains information about people suffering from COVID-19 in South Korea.
- I read this file using the pandas library
- This dataset is small and contains only 176 records. These are the columns:
| id | sex | birth_year | country | region | infection_reason | infected_by | confirmed_date | state |
|---|---|---|---|---|---|---|---|---|
| nominal | nominal | interval | nominal | nominal | nominal | ratio | interval | nominal |
| 0 nan | 0 nan | 10 nan | 0 nan | 10 nan | 81 nan | 134 nan | 0 nan | 0 nan |
- We are asked to find the max, mean and std of the birth_year column.
The max is 2009.
Let me talk about finding the mean. This column has null values and there are different strategies for handling them; I calculated the mean in two ways:
First: we can pretend the null values don't exist and compute the mean over the rest; the mean() function of a pandas DataFrame does exactly that. The mean is 1973.3855.
Second: I can convert the pandas DataFrame to a numpy array and then calculate the mean, but the mean() function of a numpy array cannot ignore the null values (it just returns nan), so I have to substitute them with a number, for example zero. The mean then is 1861.2613.
Everything I said about the mean holds for the std as well. I calculated the std using only the pandas DataFrame std function:
std: 17.0328
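Just to make the two ways concrete, here is a minimal sketch (assuming the file and column names from the table above):

import numpy as np
import pandas as pd
df = pd.read_csv('covid.csv')
print(df['birth_year'].max())                            # max
print(df['birth_year'].mean(), df['birth_year'].std())   # way 1: pandas skips NaN
arr = df['birth_year'].to_numpy()
print(np.nan_to_num(arr, nan=0.0).mean())                # way 2: NaN replaced with 0 first, which drags the mean down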
4. Yes, null values exist in the dataset.
The question asks to remove the null values with a proper method, but what is the proper method? If the dataset were huge and the null values few, I would drop the records that contain nulls, but our case is the complete opposite, so I should substitute the null values with a value. Pay attention: sometimes a column with null values is one I don't want to select during feature selection, or I may even use a method that can handle null values, but in this question we assume that we don't want to put aside any column and that our method cannot handle nulls. In my opinion the best way to substitute null values of numerical columns is to put the median in their place. Note: I prefer the median over the mean because it is more resistant to outliers. For nominal columns I substitute the null values with the most frequent value.
Trouble alert: there is a column in our dataset that holds datetime values. It is logical to use the median strategy for this column too, but the SimpleImputer class that I use treats this column as strings and cannot find the median. I thought of two solutions: I can find the median with df['confirmed_date'].astype('datetime64[ns]').quantile(.5) and then use SimpleImputer with the constant strategy, or I can convert the column to timestamps before passing it to SimpleImputer. The second approach is easier.
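A rough sketch of the second (easier) approach; which columns count as numerical vs. nominal here is my assumption based on the table above:

import pandas as pd
from sklearn.impute import SimpleImputer
df = pd.read_csv('covid.csv')
# convert the date column to numeric timestamps so a median exists (the table shows it has no nulls)
df['confirmed_date'] = pd.to_datetime(df['confirmed_date']).astype('int64')
num_cols = ['birth_year', 'infected_by', 'confirmed_date']            # assumed numerical columns
cat_cols = ['sex', 'country', 'region', 'infection_reason', 'state']  # assumed nominal columns
df[num_cols] = SimpleImputer(strategy='median').fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy='most_frequent').fit_transform(df[cat_cols])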
- Visualize data
First, I want to plot the histograms of some columns. I think birth_year and infected_by are the most appropriate; plotting the histograms of the other columns doesn't give us any information (for example the id column 😂).
Second, I'd like to make a scatter plot, so I need to choose two columns; let's look at the correlation between birth_year and confirmed_date.
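A small plotting sketch for both ideas, assuming df is the imputed DataFrame from above:

import matplotlib.pyplot as plt
# histograms of the two informative columns
df[['birth_year', 'infected_by']].hist(bins=20)
plt.show()
# scatter plot to eyeball the relation between the two chosen columns
plt.scatter(df['birth_year'], df['confirmed_date'])
plt.xlabel('birth_year'); plt.ylabel('confirmed_date (timestamp)')
plt.show()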
Here we'd like to detect and remove outliers. Even if you haven't studied a single book, you might think of sorting or visualizing the data and spotting and removing outliers by eye; for datasets like the one we have here this approach really works, but most of the time the dataset is huge and you may prefer more automatic ways like:
- Inter Quartile Range (IQR): look at the code below
import numpy as np
Q1 = np.quantile(data, 0.25)  # first quartile
Q3 = np.quantile(data, 0.75)  # third quartile
IQR = Q3 - Q1
- Z-Score
For a (roughly normal) distribution, about 95% of values fall within 2 standard deviations of the mean and about 99.7% within 3. Based on this, any value whose absolute z-score is above 3 is considered an outlier.
The z-score is calculated by subtracting the mean and dividing by the std: z = (x - mean) / std.
I treat outliers like null values and substitute them with the median.
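A sketch of both detectors on a single numeric column, using the usual 1.5×IQR fences (an assumption, since the fences aren't spelled out above) and replacing whatever gets flagged with the median:

import numpy as np
data = df['birth_year']
Q1, Q3 = np.quantile(data, 0.25), np.quantile(data, 0.75)
IQR = Q3 - Q1
iqr_outliers = (data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)   # IQR rule
z = (data - data.mean()) / data.std()
z_outliers = z.abs() > 3                                           # z-score rule
df['birth_year'] = data.mask(iqr_outliers | z_outliers, data.median())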
While reading the dataset with pandas I found out there are ';' separators instead of ',', so I wrote a bash script to solve this problem.
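The bash script isn't shown here, but note that pandas can also handle the separator directly; a tiny sketch (the file name is an assumption):

import pandas as pd
df = pd.read_csv('student-mat.csv', sep=';')   # read_csv accepts a custom separator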
- I extract G3 as Y
- I split the dataset into train and test sets and fit a linear regression model (as easy as a piece of cake). Note: we have nominal columns in our dataset, which linear regression obviously cannot handle, so we should transform these nominal values to numerical values before fitting the model.
Let me explain my full reasoning. Our nominal columns are school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardian, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic.
The categorical attributes generally fall into three groups (this is my own grouping; see the sketch after this list):
- binary: they can take only two possible values, so we can consider one of them as zero and the other as one. In our dataset these are binary: school, sex, address, famsize, Pstatus, schoolsup, famsup, paid, activities, nursery, higher, internet, romantic
- ordinal: they can take multiple categorical values but there is an order among them, for example Like, Like Somewhat, Neutral, Dislike Somewhat, Dislike. Clearly 'Like' is much closer to 'Like Somewhat' than to 'Dislike', so the difference between the numbers I assign to 'Like' and 'Like Somewhat' should be smaller than the difference between 'Like' and 'Dislike' (you get the point!). None of our attributes is ordinal.
- no order: they can take multiple categorical values but there is no order among them. In this case it's not reasonable to assign plain numbers, because values that get closer numbers would end up with a smaller distance from each other, which is not correct. Mjob, Fjob, reason and guardian belong to this group. Solution: we can use one-hot encoding; this strategy converts each category value into a new column and assigns a 1 or 0 (True/False) value to it.
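Here is the sketch promised above, assuming the DataFrame is named df; the binary columns are mapped to 0/1 by hand and the no-order columns go through pandas get_dummies, which does the same job as OneHotEncoder here:

import pandas as pd
binary_cols = ['school', 'sex', 'address', 'famsize', 'Pstatus', 'schoolsup', 'famsup',
               'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic']
nominal_cols = ['Mjob', 'Fjob', 'reason', 'guardian']
for col in binary_cols:
    first, second = sorted(df[col].unique())      # the two possible values
    df[col] = df[col].map({first: 0, second: 1})  # one becomes 0, the other 1
df = pd.get_dummies(df, columns=nominal_cols)     # one new 0/1 column per category value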
- Predict test data
- Find accuracy
I use mean squared error for this purpose; the MSE was 5.7495.
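A minimal end-to-end sketch of these last steps, assuming the encoded df from the sketch above with G3 as the target:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
X, y = df.drop(columns=['G3']), df['G3']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print(mean_squared_error(y_test, model.predict(X_test)))   # the accuracy measure used here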
Working with the Titanic dataset.
I may change my opinion in the future, but with the knowledge I have right now I think the best way is to replace the null values of the Embarked column with the mode of the column (it has only 2 null values, so the mode is a good guess). It's hard to say which column matters most for our prediction at the moment, but I guess age affects the prediction a lot, so I try not to remove that column and fill its null values with the median. The null values of the Cabin column are so many that I don't think the column is worth keeping.
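A sketch of that plan, assuming the DataFrame is named df and uses the usual Kaggle Titanic column names:

df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])  # only 2 nulls, the mode is a good guess
df['Age'] = df['Age'].fillna(df['Age'].median())                  # keep the column, fill with the median
df = df.drop(columns=['Cabin'])                                   # too many nulls to be worth keeping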
We'd like to use a decision tree to classify passengers, and the decision tree classifier from the sklearn library cannot work with categorical values, so we have to transform the categorical columns into numerical ones. I divide the categorical columns into two groups: the ones we can arrange in an order and the ones we cannot. I use an ordinal encoder for the first group and a one-hot encoder for the second. Now let's take a look at the categorical columns:
Name: at first I thought of dropping this column. How can names affect our prediction?! Nobody stays alive because of his/her name. But it's a little trickier; I found two points hidden in the names. First, we can identify families using names, and since families travel together it is plausible that they all survived or none of them did. Second, titles like Miss. and Mrs. appear in this column, which gives us a way to estimate someone's age, so we can fill the null values in the age column more sensibly. We Asians may not be familiar with Western naming conventions, so I've looked it up and written down some points for you.
Look at this example:
Baclini, Mrs. Solomon (Latifa Qurban)
Mrs. indicates that she is married.
Solomon is the name of her husband. This is an old-ish custom where wives can be referred to by their husbands' names. For instance, if Jane Smith was married to John Smith, she could be referred to as Mrs. John Smith.
Latifa is her first name.
Qurban is her "maiden" name, the last name she had before getting married.
Baclini is her married last name (the last name of her husband).
Let's take another example:
Baclini, Miss. Marie Catherine
Miss indicates that she is unmarried.
Here Marie is her first name, Catherine is her middle name and Baclini is her last name.
Mr. (for men of all ages)
Master. (for male children)
We have other titles like Dr., Sir., Col. and so on; it's hard to treat each of them separately, so I call all of them 'professional' and I guess their age should be relatively high.
To sum it up, I think the first name and middle name have nothing to do with the passenger's survival, but the last name can help us find families, and the Mr./Mrs./etc. titles help us fill null ages.
Note: finding the last name of a married woman is tricky; she actually has two last names, and both are important because she may be travelling with her husband or with her parents. I may change my code in the future, but to keep it simple I only consider the married last name.
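A hedged sketch of pulling the title and the (married) last name out of the Name column; the regular expression is my own guess at the Kaggle name format, and filling ages per title group is a refinement of the plain median fill mentioned earlier:

df['LastName'] = df['Name'].str.split(',').str[0].str.strip()
df['Title'] = df['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
common = ['Mr', 'Mrs', 'Miss', 'Master']
df['Title'] = df['Title'].where(df['Title'].isin(common), 'professional')  # Dr., Sir., Col., ...
# fill null ages with the median age of the matching title group
df['Age'] = df.groupby('Title')['Age'].transform(lambda s: s.fillna(s.median()))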
Sex: there is no order between the sexes, but this column takes only two values, so I prefer the ordinal encoder over the one-hot encoder, which would increase the number of columns.
Ticket: tickets have 1) an optional string prefix and 2) a number, except for the special case Ticket='LINE'. The ticket prefix tells you the issuing ticket office and/or embarkation point. Ticket numbers can be compared for equality, which tells you who was sharing a cabin or travelling together, or compared for closeness. Tickets equal to 'LINE' were assigned to a group of American Line employees travelling for free.
Embarked: this column can take three values, 'C', 'Q' and 'S', which I think have no order, so I use the one-hot encoder.
I strongly recommend splitting your dataset into train and test as soon as possible and not looking at the test data at all. If you do that, you may end up with null values in the test data; to fill them I used statistics from the training data only. For example, if age is null in the test data, I filled it with the average age from the train data.
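A sketch of that discipline: split first, learn the fill values on the training part only, and reuse them on the test part (column names assumed as before):

from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, test_df = train_df.copy(), test_df.copy()
age_fill = train_df['Age'].median()            # learned on the training data only
embarked_fill = train_df['Embarked'].mode()[0]
for part in (train_df, test_df):               # applied to both splits
    part['Age'] = part['Age'].fillna(age_fill)
    part['Embarked'] = part['Embarked'].fillna(embarked_fill)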
Working on the heart-disease-uci dataset.
- age
- sex: (1 = male, 0 = female)
- cp: chest pain type
- trestbps: resting blood pressure (in mm Hg on admission to the hospital)
- chol: serum cholesterol in mg/dl
- fbs: fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
- restecg: resting electrocardiographic results
- thalach: maximum heart rate achieved
- exang: exercise-induced angina (1 = yes, 0 = no)
- oldpeak: ST depression induced by exercise relative to rest
- slope: the slope of the peak exercise ST segment
- ca: number of major vessels (0-3) colored by fluoroscopy
- thal: 3 = normal, 6 = fixed defect, 7 = reversible defect
- target: 0 or 1
Normalizing the data is important when we want to calculate distances between data records. For example, normalizing doesn't matter for decision tree models because no distances are computed, but it matters a lot for models like regression and KNN.
Normalizing here means standardizing features by removing the mean and scaling to unit variance.
To normalize the dataset I used StandardScaler() to fit and transform the training data, then I used the same StandardScaler object to only transform the test data.
Accuracy jumped from 65% to 86% just by normalizing the data.
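A sketch of that scaling step, assuming X_train/X_test splits of the heart-disease features already exist:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit (mean/variance) on the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics on the test data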
At first I thought this dataset didn't contain any NaN values, BUT it has some wrong values which we should replace with NaN.
Let's see the number of unique values in each column. Look! There are two columns that seem strange: after investigating, we know that 'ca' ranges from 0 to 3, so it should have only 4 unique values, but it has five 😳, so there must be a wrong value that needs to be cleaned.
X_train['ca'].unique()
The above code shows that this column has the invalid value 4, so I substituted it with NaN.
The same thing happens for the 'thal' column: the header says it can only take values from 1 to 3, but it contains 4 unique values, so just like the 'ca' column, every value outside 1 to 3 should be changed to null.
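A sketch of that cleanup on the same X_train:

import numpy as np
# 'ca' may only be 0-3 and 'thal' only 1-3; anything else becomes NaN
X_train['ca'] = X_train['ca'].where(X_train['ca'].isin([0, 1, 2, 3]), np.nan)
X_train['thal'] = X_train['thal'].where(X_train['thal'].isin([1, 2, 3]), np.nan)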
Basically, thal refers to a thallium stress test: a radioactive element is injected into the patient's bloodstream, and the blood flow is observed while the patient exercises and rests.
- 0 maps to null in the original dataset
- 1 maps to 6 in the original dataset. This means that a fixed defect was found.
- 2 maps to 3 in the original dataset. This means that the blood flow was normal.
- 3 maps to 7 in the original dataset. This means that a reversible defect was found.
X_train = X_train.drop_duplicates()  # drop_duplicates returns a new DataFrame, so assign it back
Let's use box plots
X_train.plot(kind='box', subplots=True, layout=(2, 7), sharex=False,
             sharey=False, figsize=(20, 10), color='deeppink')
plt.show()
I can either drop the outliers or assign a new value. I chose the first option.
Accuracy of the KNN model was 83%; accuracy of Gaussian naive Bayes was 90%.
There are two kinds of naive Bayes models, explained below (a sketch of both classifiers follows this list):
- one used for discrete data (e.g. multinomial naive Bayes)
- one used for continuous data (Gaussian naive Bayes, which is what I used here)
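A sketch of fitting both classifiers on the scaled data; the variable names and k=5 are assumptions:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_scaled, y_train)
gnb = GaussianNB().fit(X_train_scaled, y_train)
print('knn:', accuracy_score(y_test, knn.predict(X_test_scaled)))
print('gaussian nb:', accuracy_score(y_test, gnb.predict(X_test_scaled)))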
I am going to work with the Titanic dataset again. I did the preprocessing in HW2 Q6, so I can simply plug in my new model. I want to use a random forest model; I tried these configurations (a sketch follows the list below):
- max_depth=5, criterion=gini
- max_depth=5, criterion=entropy
- max_depth=10, criterion=gini
- max_depth=10, criterion=entropy
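A sketch of trying those four configurations; the split variable names are assumptions:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
for max_depth in (5, 10):
    for criterion in ('gini', 'entropy'):
        clf = RandomForestClassifier(max_depth=max_depth, criterion=criterion, random_state=42)
        clf.fit(X_train, y_train)
        print(max_depth, criterion, accuracy_score(y_test, clf.predict(X_test)))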
Comparing the random forest model with the decision tree: in HW2 Q6 my decision tree had max_depth=5 and criterion=gini and its accuracy was 77%; the random forest classifier with the same parameters also gave 77% accuracy.
Comparing learning speed:
I recorded the time before and after fitting (with time.time()) to measure the learning time.
decision tree learning time was 0.0022056102752685547
random forest learning time was 0.10478830337524414
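A sketch of how that timing is measured:

import time
start = time.time()
clf.fit(X_train, y_train)   # clf is either the decision tree or the random forest
print('learning time:', time.time() - start)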
I'm going to work with the Titanic dataset again. In this part I want to use an SVM for classification.
As you can see, the accuracy of the linear SVM model is lower than that of the decision tree and random forest models, which suggests that the data is not linearly separable.
2. SVM with a non-linear kernel function (a sketch of all three kernels follows this list):
2-1. poly
2-2. rbf
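Here is the sketch of all three kernels mentioned above, assuming the same preprocessed Titanic splits:

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
for kernel in ('linear', 'poly', 'rbf'):
    svm = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, accuracy_score(y_test, svm.predict(X_test)))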
I want to implement the k-means algorithm.
- Illustrating the datasets
Look at the pictures above: it's obvious that k-means can perform well on dataset1, but it's not a good choice for dataset2; for dataset2 we should use algorithms like DBSCAN.
- Implementing the k-means algorithm
WOW!!! If you want to implement k-means, note that using the plus-plus (k-means++) algorithm to initialize the centroids is extremely important and DOES improve the results.
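My implementation isn't shown here, but as an illustration of the idea, here is a minimal numpy sketch of k-means++ initialization (each new centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far):

import numpy as np
def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]   # first centroid: a uniformly random point
    for _ in range(k - 1):
        # squared distance of every point to its nearest already-chosen centroid
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(-1).min(axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)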
3-4. Calculating the clustering error. This is the result of one run:
cluster error for cluster blue is 0.3248827689004398
cluster error for cluster green is 0.3160088195773474
cluster error for cluster orange is 0.3415297385741179
cluster error for cluster purple is 0.31782962165200457
clustering error is 0.32506273717597745
According to the above picture, the best number of clusters is 10.
The shapes of the clusters are not globular, so k-means cannot perform well.
This is the original picture
and this is the image after color reduction using k-means with k=64
Something important I want to mention: it took me about 1033.5 seconds to do the compression with 64 clusters, which is really long for a compression task, and I also think the plus-plus part of the k-means algorithm accounts for a large share of that time.