This project seeks to build a model that accurately classifies tweets about Bitcoin as having either positive or negative sentiment. Unlabeled tweets classified by this model could ultimately be used to analyze trends in Bitcoin sentiment over time and to assess how well Twitter sentiment predicts future price movements of the cryptocurrency.
This project uses Twitter data sourced from Kaggle. The dataset consists of 1 million Tweets referencing Bitcoin, posted between February and August 2021. Sentiment is pre-labeled.
- The target variable in this analysis is sentiment: positive or negative.
- Sentiment is fairly balanced, with around 53 percent of tweets labeled as negative and 47 percent positive.
- As the data is fairly balanced and we weigh false positives and false negatives equally, we focus on F1 score and accuracy.
- Just 3 percent of validation data was categorized as negative when it was actually positive
- Just 1.8 percent of validation data was categorized as positive when it was actually negative
- A Logistic Regression model was the best-performing classifier, with Count Vectorization used to process the annotated tweets
- The final model can classify unlabeled Tweets as positive or negative with ~97 percent accuracy and a 97 percent F1 score
- Words important to the model included 'best', 'awesome', 'successful', 'insane', 'worst', 'worthless'
- Positive tweets had more hashtags on average, while negative tweets more frequently contained a price
- Pull more recent Tweets on Bitcoin via Twitter API and run final model on real-time data
- Use model-labeled Tweets to conduct Time Series Analysis, with the aim of understanding the predictive power of Tweet sentiment on the price of BTC
- Notebook
- Presentation
- Reach me at [email protected]
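
The modeling approach summarized above (Count Vectorization feeding a Logistic Regression classifier, scored with accuracy and F1) can be sketched roughly as follows. This is a dependency-free illustration and not the project's actual code: the toy tweets, the function names, and the learning-rate and epoch settings are all assumptions made for the example. The notebook most likely relies on a library instead (for instance scikit-learn's `CountVectorizer` and `LogisticRegression`), though this summary does not say which.

```python
import math
from collections import Counter

# Hypothetical toy tweets (1 = positive, 0 = negative); NOT the Kaggle data.
train_texts = [
    "bitcoin is the best", "awesome rally for btc",
    "best breakout for bitcoin", "btc is awesome",
    "bitcoin is the worst", "worthless crash for btc",
    "worst dump for bitcoin", "btc is worthless",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

def tokenize(text):
    return text.lower().split()

def build_vocabulary(texts):
    # Map every distinct token seen in training to a column index.
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: i for i, tok in enumerate(tokens)}

def count_vectorize(text, vocab):
    # Bag-of-words counts; tokens outside the vocabulary are ignored.
    vec = [0.0] * len(vocab)
    for tok, n in Counter(tokenize(text)).items():
        if tok in vocab:
            vec[vocab[tok]] = float(n)
    return vec

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=300):
    # Plain full-batch gradient descent on the log loss, no regularization.
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for xi, yi in zip(X, y):
            z = sum(w * x for w, x in zip(weights, xi)) + bias
            error = sigmoid(z) - yi
            for j, x in enumerate(xi):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(X)
    return weights, bias

def predict(text, vocab, weights, bias):
    x = count_vectorize(text, vocab)
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return int(sigmoid(z) >= 0.5)

def accuracy_and_f1(y_true, y_pred):
    # F1 is computed for the positive class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return correct / len(y_true), f1

vocab = build_vocabulary(train_texts)
X_train = [count_vectorize(t, vocab) for t in train_texts]
weights, bias = train_logistic_regression(X_train, train_labels)

# Hypothetical held-out tweets standing in for the validation split.
val_texts = ["the best rally for bitcoin", "bitcoin is worthless",
             "awesome breakout", "worst crash for btc"]
val_labels = [1, 0, 1, 0]
val_preds = [predict(t, vocab, weights, bias) for t in val_texts]
accuracy, f1 = accuracy_and_f1(val_labels, val_preds)
print("accuracy:", accuracy, "F1:", f1)

# Words ranked by learned weight, from most negative to most positive.
ranked = sorted(vocab, key=lambda tok: weights[vocab[tok]])
print("most negative:", ranked[:2], "most positive:", ranked[-2:])
```

Ranking the vocabulary by learned weight, as in the last two lines, is one simple way to surface the kind of influential words listed above. Once a model like this is trained, running it on newly collected Tweets only requires calling `predict` with the same vocabulary and weights.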