
Note: The present repo is set up for demonstration purposes only. If you would like to execute this project, please reach out to me for instructions. If you would like to have a look at the original datasets, you will have to download the files manually, as they are too large to upload to a GitHub repo.
Important: If GitHub is unable to load a notebook, please use Jupyter NBViewer by copying and pasting the GitHub link of the corresponding notebook. That will render the notebook properly.

News Classifier

-- Project Status: [Completed]

Project Intro/Objective

The purpose of this project is to build a news classifier based on semi-structured textual data. The categories are:

  1. Political
  2. Sport
  3. World
  4. Entertainment
  5. Travel
  6. Financial
  7. Technology

We would like to test different classification algorithms and see which one scores highest, but before we do so, our main focus will be on cleaning the data and preparing it for training and testing.
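As a rough illustration of the kind of cleaning involved, here is a minimal sketch using NLTK; the column names and sample texts are hypothetical, not taken from the actual notebooks:

```python
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_text(text: str) -> str:
    """Lowercase, strip non-letters, drop stop words, lemmatize."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = [LEMMATIZER.lemmatize(tok) for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

# Hypothetical raw 'text' column; the real notebooks may use different names.
df = pd.DataFrame({"text": ["The markets rallied on Monday!", "A thrilling 3-1 derby win."]})
df["clean_text"] = df["text"].apply(clean_text)
print(df["clean_text"].tolist())
```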

Methods Used

  • Machine Learning
  • Data Visualization
  • Hyper-parameter tuning
  • NLP
  • EDA

Libraries & Technologies

  • Python 3
  • Dask
  • Pandas, Plotly, NLTK, seaborn, sklearn, numpy, WordCloud, collections and more
  • Jupyter Notebooks (Anaconda)

Data source

The data was collected by Webhose.io, a company that turns multiple vertical web-content domains into machine-readable data feeds anyone can access at scale. Webhose also offers free datasets of news and blog feeds.
The datasets we used can be found here

Project Highlights & Challenges

1. Unbalanced dataset distribution

[Figure: dataset distribution before and after processing]

One of the main challenges I came across was the unbalanced distribution of the dataset. Even after processing, the distribution did not shift greatly.

2. Balancing the distribution of the dataset using a custom undersampling method

[Figure: dataset pre/post processing]

As you can see in the image above, undersampling uses the minority class as an upper limit when extracting samples from the majority classes. This method is frequently used in machine learning; however, it also carries a risk of information loss, since we remove random samples from the other classes.
To ensure that we collect only high-quality records and minimize the risk of information loss, we selected the random samples with the help of the quantiles used in outlier detection.

[Figure: undersampling theory]

[Figure: our dataset's boxplot]

Using Plotly's boxplot I was able to visualize the outliers via the quartiles. Next, I identified the number of records in the minority class (say X records) and took X random records from every other class, but only from the records that fell between Q1 and Q3. In that way, I make sure to extract only high-quality records for every downsampled class and reduce the risk of information loss as much as possible.
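To make the idea concrete, here is a minimal sketch of such quantile-bounded undersampling with pandas, assuming hypothetical category and word_count columns (the real notebooks may use different names and bounds):

```python
import pandas as pd

def quantile_undersample(df: pd.DataFrame, label_col: str, length_col: str,
                         seed: int = 42) -> pd.DataFrame:
    """Downsample every class toward the minority-class size X, drawing only
    from records whose length lies between Q1 and Q3 of that class."""
    target_size = df[label_col].value_counts().min()  # minority class size X
    parts = []
    for label, group in df.groupby(label_col):
        q1, q3 = group[length_col].quantile([0.25, 0.75])
        inliers = group[(group[length_col] >= q1) & (group[length_col] <= q3)]
        # Guard: a class's interquartile range may hold fewer than X rows
        # (the minority class itself always does).
        n = min(target_size, len(inliers))
        parts.append(inliers.sample(n=n, random_state=seed))
    return pd.concat(parts).reset_index(drop=True)
```

The min() guard matters because the minority class has fewer than X records inside its own interquartile range, so sampling exactly X from it would fail.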

Final Dataset

[Figure: final dataset distribution]

More info in the Data Cleaning pt2 notebook and in the EDA.py file.



3. Algorithms Comparison - GridSearch Results

[Figure: GridSearch results per algorithm]

Accuracy is a valid choice for evaluating our classifiers here, since the dataset is well balanced.
As we see from the image above, the best algorithms were by far the support-vector approaches, with the SVC implementation coming 1st and SGD (stochastic gradient descent) following in 2nd place by a very small margin.
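As an illustration of how such a comparison could be wired up in scikit-learn, here is a hedged sketch; the TF-IDF features, parameter grids, and tiny stand-in corpus are assumptions, not the exact settings from the notebooks:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Tiny stand-in corpus; the real project uses the full news dataset.
X_train = [
    "stocks fell sharply after the earnings report",
    "the central bank raised interest rates again",
    "the home team won the championship final",
    "the star striker scored twice in the derby",
]
y_train = ["financial", "financial", "sport", "sport"]

candidates = {
    "SVC": (SVC(kernel="linear"), {"clf__C": [0.1, 1, 10]}),
    "SGD": (SGDClassifier(loss="hinge"), {"clf__alpha": [1e-4, 1e-3]}),
}

for name, (clf, grid) in candidates.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    search = GridSearchCV(pipe, grid, scoring="accuracy", cv=2)
    search.fit(X_train, y_train)
    print(f"{name}: best CV accuracy={search.best_score_:.3f}, params={search.best_params_}")
```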

4. Pandas crashed because it was unable to allocate 76GB of RAM for training the model on the full dataset! Now what?

DASK"

For the GridSearch, I used a small sample of the whole dataset to save time and computing capacity. However, I knew this might become a problem once I had to fit the model on the whole dataset. There was no way for me to fit the final model in 76GB of memory, so I had to search for alternatives.

Dask offers advanced parallelization for analytics, efficiently using disk capacity instead of memory, and it can leverage multi-core CPUs for distributed computing.
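A minimal sketch of the idea with dask.dataframe; the data and the partitioning here are illustrative assumptions, not the project's actual pipeline:

```python
import dask.dataframe as dd
import pandas as pd

# In the real project the data would be read lazily from disk, e.g.
# ddf = dd.read_csv("news/*.csv")  # hypothetical path
# For a self-contained demo, build a small frame and partition it:
pdf = pd.DataFrame({
    "text": ["rates rise", "cup final tonight", "election results", "new phone launch"],
    "category": ["financial", "sport", "political", "technology"],
})
ddf = dd.from_pandas(pdf, npartitions=2)

# Operations build a lazy task graph; each partition is processed
# independently, so memory use stays bounded by partition size.
print(ddf["category"].value_counts().compute())
```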

That helped me not only to fit the whole dataset but also to see our accuracy improve by 8.2% (!). Eager to know more? Check out the Optimise and Conclusions notebook.

Future work

As a next step, I would like to expose the model with FastAPI and scrape news feeds on a daily basis. The final idea would be to create a self-learning model that is fed new data every day. Such an algorithm could be used to classify future news articles from any source that have no assigned category. Another nice addition would be to display to the end user a buzzword cloud of the daily hot articles, so that the user can see what the most popular articles of the day are talking about.
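As a rough illustration of the FastAPI idea, here is a minimal sketch; the model filename, endpoint path, and payload schema are hypothetical:

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("news_classifier.joblib")  # hypothetical saved sklearn pipeline

class Article(BaseModel):
    text: str

@app.post("/classify")
def classify(article: Article) -> dict:
    """Return the predicted category for a raw article text."""
    category = model.predict([article.text])[0]
    return {"category": category}

# Run with: uvicorn main:app --reload
```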

Contact

  • If you have any questions or remarks about the project, feel free to reach out to me. I am always open to feedback.
  • My contact info can be found on my personal website: www.philpappas.com
