A machine learning model (based on Natural Language Processing) for classifying news articles into two categories: true information or fake news.
The dataset comes from Kaggle: Fake News Detection.
We will be building a solution based on NLP (Natural Language Processing).
Our task will be to create a model that can classify text into one of two groups:
- True information (label 1)
- Fake information (fake news, label 0)
- Best Model: Support Vector Machine (SVM)
- Best Parameters:
- C = 10
- gamma = 'scale'
- kernel = 'rbf'
- Key Performance Metrics:
- Precision: 0.98
- Recall: 0.97
- F1-Score: 0.98
- Accuracy: 98%
- Weighted Average Precision: 0.98
The Support Vector Machine model achieved the highest precision (0.98) and was chosen as the best-performing model.
- `pandas`: Data manipulation, cleaning, and saving datasets.
- `ydata-profiling`: Generating exploratory data analysis profiling reports.
- `nltk`: Stopword removal and text tokenization.
- `spacy`: Lemmatization and advanced text processing.
- `re`: Regular expressions for text cleaning.
- `gensim`: For training the Word2Vec model and word vector analysis.
- `sklearn`: For training models, hyperparameter tuning, and classification reports.
- `numpy`: For numerical data manipulation and vector operations.
- Notebook 1 - "1_data_preprocessing"
- Notebook 2 - "2_vectors_report"
- Notebook 3 - "3_model_training_and_tuning"
- Steps:
- Loaded the dataset from a CSV file.
- Generated a profiling report for exploratory data analysis to understand the dataset distribution.
- Technologies Used:
- `pandas`: For data manipulation and analysis.
- `ydata-profiling`: For generating a profiling report of the dataset.
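A minimal sketch of this step (the CSV path and report file name are assumptions, not necessarily those used in the notebook):

```python
import pandas as pd
from ydata_profiling import ProfileReport

# Load the raw Kaggle dataset (file path is an assumption)
df = pd.read_csv("data/fake_news.csv")

# Generate an exploratory profiling report (distributions, missing values,
# duplicates) and save it as a standalone HTML file
profile = ProfileReport(df, title="Fake News Dataset Profiling")
profile.to_file("fake_news_profile.html")
```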
- Steps:
- Filtered articles related to "politics" from the dataset for specific analysis.
- Checked for missing values and duplicates in the dataset.
- Technologies Used:
- `pandas`: For data filtering and inspecting missing or duplicate entries.
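Continuing the sketch above, the filtering and sanity checks could look like this; matching "politics" against the article URL is an assumption, since the notebook may filter on a different column or keyword:

```python
# Keep only politics-related articles (filtering on the URL column is an assumption)
politics_df = df[df["URLs"].str.contains("politics", case=False, na=False)]

# Basic data-quality checks
print(politics_df.isna().sum())        # missing values per column
print(politics_df.duplicated().sum())  # number of fully duplicated rows
```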
- Steps:
- Dropped rows with missing values to ensure data consistency.
- Combined the 'Headline' and 'Body' columns into a single feature for better context.
- Removed unnecessary columns like 'URLs'.
- Technologies Used:
- `pandas`: For data cleaning and feature engineering.
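Continuing the sketch, the cleaning and feature-engineering step could look like this (column names follow the descriptions above):

```python
# Drop rows with missing values so every article has both a headline and a body
politics_df = politics_df.dropna()

# Merge headline and body into a single text feature for richer context
politics_df["Headline+Body"] = politics_df["Headline"] + " " + politics_df["Body"]

# The URL column is no longer needed for modelling
politics_df = politics_df.drop(columns=["URLs"])
```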
- Steps:
- Normalized text by converting it to lowercase and removing numbers and special characters.
- Removed stopwords to reduce noise in the data.
- Tokenized the text and performed lemmatization to standardize words.
- Technologies Used:
- `nltk`: For stopword removal and text tokenization.
- `spacy`: For lemmatization and advanced text processing.
- `re`: For regular expressions to clean unwanted characters.
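A sketch of the normalization pipeline, assuming English nltk stopwords and the small English spaCy model (en_core_web_sm); the exact cleaning rules in the notebook may differ:

```python
import re
import nltk
import spacy
from nltk.corpus import stopwords

nltk.download("stopwords")
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
stop_words = set(stopwords.words("english"))

def clean_text(text: str) -> str:
    # Lowercase and strip numbers and special characters
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    # Lemmatize with spaCy and drop stopwords
    doc = nlp(text)
    lemmas = [tok.lemma_ for tok in doc if tok.lemma_ not in stop_words and not tok.is_space]
    return " ".join(lemmas)

politics_df["Headline+Body"] = politics_df["Headline+Body"].apply(clean_text)
```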
- Steps:
- Loaded the preprocessed dataset from a JSON file.
- Technologies Used:
- `pandas`: For loading the dataset into a DataFrame.
- Steps:
- Tokenized the 'Headline+Body' column by splitting the text into individual words.
- Technologies Used:
- `pandas`: For applying tokenization on the dataset.
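A sketch covering both steps above, loading the preprocessed file and tokenizing the combined column (the JSON file name is an assumption):

```python
import pandas as pd

# Load the preprocessed articles saved by notebook 1 (file name is an assumption)
df = pd.read_json("preprocessed_articles.json")

# Simple whitespace tokenization of the cleaned text
df["tokens"] = df["Headline+Body"].apply(lambda text: text.split())
```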
- Steps:
- Trained a Word2Vec model on the tokenized text with specified hyperparameters.
- Retrieved word vectors for analysis and explored word similarities (e.g., "politics", "trump").
- Technologies Used:
- `gensim`: For training the Word2Vec model and exploring word vectors.
- `numpy`: For manipulating and calculating word embeddings.
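A sketch of the Word2Vec step; the hyperparameter values below are illustrative, not the notebook's exact settings:

```python
from gensim.models import Word2Vec

# Train Word2Vec on the tokenized articles (hyperparameters are illustrative)
w2v_model = Word2Vec(
    sentences=df["tokens"].tolist(),
    vector_size=100,
    window=5,
    min_count=2,
    workers=4,
)

# Explore the learned embedding space (assumes these words made it into the vocabulary)
print(w2v_model.wv["politics"][:5])
print(w2v_model.wv.most_similar("trump", topn=5))
```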
- Steps:
- Calculated the average word vector for each article by averaging the word embeddings of the tokens in the article.
- Added the calculated average vectors as new columns in the dataset.
- Technologies Used:
- `numpy`: For handling vector operations and calculating the average vectors for each article.
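A sketch of the document-vector step: each article is represented by the mean of its tokens' Word2Vec embeddings. For brevity the result is kept in a single column here, whereas the notebook spreads the averaged values across several columns:

```python
import numpy as np

def average_vector(tokens, model):
    # Collect embeddings for tokens present in the Word2Vec vocabulary
    vectors = [model.wv[tok] for tok in tokens if tok in model.wv]
    if not vectors:
        # Fall back to a zero vector for articles with no known tokens
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

df["article_vector"] = df["tokens"].apply(lambda toks: average_vector(toks, w2v_model))
```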
- Steps:
- Split the dataset into training and testing sets using an 80/20 ratio.
- Technologies Used:
- `sklearn`: For splitting the dataset into train and test sets.
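A sketch of the 80/20 split, assuming the averaged vectors are stacked into a feature matrix and the target column is named Label (both are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.vstack(df["article_vector"].values)  # one row of averaged embeddings per article
y = df["Label"].values                      # column name is an assumption

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42    # random_state is an assumption
)
```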
- Steps:
- Trained multiple classification models: Random Forest, Logistic Regression, and Support Vector Machine (SVM).
- Generated classification reports to evaluate the performance of each model.
- Technologies Used:
- `sklearn`: For implementing and training classification models, as well as generating classification reports.
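Continuing the sketch, the three baseline classifiers can then be trained and compared with classification_report (default hyperparameters here are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"--- {name} ---")
    print(classification_report(y_test, model.predict(X_test)))
```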
- Steps:
- Saved the dataset with added article vectors to a new JSON file for future use.
- Technologies Used:
- `pandas`: For saving the final dataset to a JSON file.
- Steps:
- Loaded the dataset containing article vectors from a JSON file.
- Technologies Used:
- `pandas`: For loading the dataset into a DataFrame.
- Steps:
- Prepared the feature matrix `X` (article vectors) and the target variable `y` (labels).
- Split the dataset into training and testing sets using an 80/20 ratio.
- Used GridSearchCV for hyperparameter optimization on three models:
- Random Forest Classifier
- Logistic Regression
- Support Vector Machine (SVM)
- Technologies Used:
- `sklearn`: For model training, hyperparameter tuning (GridSearchCV), and data splitting.
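A sketch of the GridSearchCV step. The parameter grids are illustrative guesses, except for the SVM values (C, gamma, kernel), which mirror the best parameters reported above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Candidate models with illustrative hyperparameter grids
search_spaces = {
    "Random Forest": (RandomForestClassifier(random_state=42),
                      {"n_estimators": [100, 300], "max_depth": [None, 20]}),
    "Logistic Regression": (LogisticRegression(max_iter=1000),
                            {"C": [0.1, 1, 10]}),
    "SVM": (SVC(),
            {"C": [1, 10], "gamma": ["scale"], "kernel": ["rbf", "linear"]}),
}

best_estimators = {}
for name, (estimator, grid) in search_spaces.items():
    search = GridSearchCV(estimator, grid, cv=5, n_jobs=-1)
    search.fit(X_train, y_train)
    best_estimators[name] = search.best_estimator_
    print(name, "best params:", search.best_params_)
```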
- Steps:
- Evaluated each model using the classification report to obtain precision, recall, and F1-score.
- Printed the classification report for each model, showing the best-performing hyperparameters.
- Technologies Used:
- `sklearn`: For generating classification reports and evaluating model performance.
- Steps:
- Compared the models based on their precision scores from the classification report.
- Selected the best model based on the highest precision value.
- Technologies Used:
- `sklearn`: For precision comparison and selecting the best-performing model.
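A sketch of the comparison and selection step, reading the weighted-average precision out of each model's classification report (using output_dict=True is an implementation choice, not necessarily the notebook's):

```python
from sklearn.metrics import classification_report

precisions = {}
for name, model in best_estimators.items():
    report = classification_report(y_test, model.predict(X_test), output_dict=True)
    precisions[name] = report["weighted avg"]["precision"]

best_name = max(precisions, key=precisions.get)
print(f"Best model: {best_name} (weighted avg precision = {precisions[best_name]:.2f})")
```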
- Steps:
- Printed the best model and its corresponding precision value for final analysis.
- Technologies Used:
- `sklearn`: For result analysis and final model selection.