This project implements a movie recommendation system using collaborative filtering techniques, specifically matrix factorization. The system is designed to predict user preferences for movies based on historical ratings and to generate personalized movie recommendations.
By leveraging latent factor models, the system captures underlying patterns in user behavior and movie characteristics. The project uses the MovieLens dataset, which provides a rich source of user ratings and movie metadata.
- Project Overview
- Project Structure
- Installation
- Data Preparation
- Methodology
- Model Training
- Evaluation
- Usage
- Results
- Visualization
- References
- Acknowledgments
The goal of this project is to build a scalable and efficient movie recommendation system using collaborative filtering. The key features include:
- Data Preprocessing: Handling large datasets (e.g., 32 million ratings) efficiently.
- Model Training: Implementing matrix factorization with bias terms and regularization.
- Performance Optimization: Utilizing techniques like Numba for just-in-time compilation to speed up computations.
- Evaluation: Tracking training and validation metrics, including loss and RMSE.
- Recommendation Generation: Providing personalized movie recommendations based on user input.
- Visualization: Visualizing movie embeddings to understand the latent feature space.
The project is organized into the following main components:
- Scripts:
  - `main.py`: The main script that runs the training loop for the recommendation system.
  - `Data_Preprocessing.py`: Preprocesses the dataset and prepares training and test data.
  - `Recommendation.py`: Generates movie recommendations for a user based on their favorite movies.
  - `Visualization.py`: Visualizes movie embeddings using dimensionality reduction techniques.
- Modules:
  - `Factors_Update_Functions.py`: Contains functions for updating user and movie biases and latent factors.
  - `Helper_Functions.py`: Utility functions for plotting, logging, saving models, and loading data.
- Data Directories:
  - `ml-32m/`: Contains the original MovieLens dataset files (`ratings.csv`, `movies.csv`).
  - `TRAIN_TEST_DATA/`: Stores preprocessed training and test data.
- Experiments Directory:
  - `Experiments/`: Contains experiment logs, models, results, and plots.
- Miscellaneous:
  - `README.md`: This detailed readme file.
  - `requirements.txt`: Lists required Python packages.
Ensure you have the following installed:
- Python 3.7 or higher
- NumPy
- Pandas
- Numba
- Matplotlib
- scikit-learn
- Polars
- pickle (part of the Python standard library, so no separate installation is needed)
- Clone the Repository

      git clone https://github.com/your_username/your_repository.git
      cd your_repository

- Set Up a Virtual Environment (Recommended)

      python3 -m venv venv
      source venv/bin/activate  # On Windows use `venv\Scripts\activate`

- Install Required Packages

  Install the required packages using `pip`:

      pip install -r requirements.txt

  If `requirements.txt` is not available, install packages individually:

      pip install numpy pandas numba matplotlib scikit-learn polars
The project uses the MovieLens 32M Dataset, which contains about 32 million user ratings for movies along with movie metadata.
Required Files:
- `ratings.csv`: Contains user ratings (`userId`, `movieId`, `rating`, `timestamp`).
- `movies.csv`: Contains movie metadata (`movieId`, `title`, `genres`).
- Download the Dataset

  Download `ratings.csv` and `movies.csv` from the MovieLens website.

- Place the Files

  Place the downloaded files in the `ml-32m/` directory within the project folder.

- Run Data Preprocessing

  Execute the data preprocessing script:

      python Data_Preprocessing.py

  This script performs the following tasks:
  - Reads and processes the ratings and movies data.
  - Maps user and movie IDs to contiguous indices starting from 0 for efficient storage and computation.
  - Splits the data into training and test sets (e.g., 80% training, 20% test).
  - Groups ratings by users and movies.
  - Saves the preprocessed data and mappings to the `TRAIN_TEST_DATA/` directory.

Note: The preprocessing step may take some time due to the size of the dataset.
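
The index mapping, split, and grouping steps can be sketched on a toy example. This is a minimal illustration, not the code in `Data_Preprocessing.py`: it uses plain NumPy instead of Polars or Pandas, the rating rows are made up, and the variable names (`user_idx`, `train_by_user`, and so on) are hypothetical.

```python
import numpy as np

# Toy stand-in for rows of ratings.csv: (userId, movieId, rating). Values are made up.
user_ids = np.array([10, 10, 42, 42, 42, 7, 7, 99])
movie_ids = np.array([1, 318, 1, 50, 318, 50, 2571, 1])
ratings = np.array([4.0, 5.0, 3.5, 4.5, 5.0, 2.0, 4.0, 3.0])

# Map raw IDs to contiguous indices 0..n-1.
# `return_inverse` yields, for every row, the index of its ID in the sorted unique array.
unique_users, user_idx = np.unique(user_ids, return_inverse=True)
unique_movies, movie_idx = np.unique(movie_ids, return_inverse=True)

# Shuffle row positions and hold out 20% of the ratings as the test set.
rng = np.random.default_rng(seed=0)
order = rng.permutation(len(ratings))
n_test = int(0.2 * len(ratings))
test_rows, train_rows = order[:n_test], order[n_test:]

# Group training ratings by user: one list of (movie_index, rating) pairs per user.
train_by_user = [[] for _ in unique_users]
for row in train_rows:
    train_by_user[user_idx[row]].append((int(movie_idx[row]), float(ratings[row])))
```

Grouping by movie works the same way with the roles of `user_idx` and `movie_idx` swapped. Storing ratings per user and per movie lets each alternating update read only the ratings it needs.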
The recommendation system uses collaborative filtering, which makes predictions about a user's interests by collecting preferences from many users. Specifically, it employs matrix factorization to discover latent features underlying the interactions between users and items (movies).
- User Factors (`users_factors`): Represent the latent preferences of users.
- Movie Factors (`movies_factors`): Represent the latent characteristics of movies.
- Genre Factors (`genre_factors`): Capture genre-specific characteristics.

- User Bias (`user_bias`): Captures the tendency of a user to rate items higher or lower than average.
- Item Bias (`item_bias`): Captures the inherent popularity or quality of an item.
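
To make the roles of these components concrete, the following sketch computes one predicted rating as the dot product of a user vector and a movie vector plus the two biases. The numbers are invented for illustration, `predict_rating` is a hypothetical helper, and genre factors are left out because their exact role is not specified here.

```python
import numpy as np

def predict_rating(users_factors, movies_factors, user_bias, item_bias, u, i):
    """Predicted rating of user u for movie i: dot product plus both biases."""
    return float(users_factors[u] @ movies_factors[i] + user_bias[u] + item_bias[i])

# Made-up parameters for 2 users and 3 movies with 2 latent factors each.
users_factors = np.array([[1.0, 0.5], [0.0, 2.0]])
movies_factors = np.array([[2.0, 1.0], [1.0, -1.0], [0.5, 0.5]])
user_bias = np.array([0.5, -0.25])
item_bias = np.array([1.0, 0.0, 0.25])

# User 0, movie 0: (1.0*2.0 + 0.5*1.0) + 0.5 + 1.0 = 4.0
pred = predict_rating(users_factors, movies_factors, user_bias, item_bias, u=0, i=0)
```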
The model minimizes the regularized squared error between predicted and actual ratings:
$$ \min_{U, V, b_u, b_i} \sum_{(u, i) \in \text{Ratings}} \left( r_{ui} - (\mathbf{u}_u^\top \mathbf{v}_i + b_u + b_i) \right)^2 + \lambda \left( \|\mathbf{u}_u\|^2 + \|\mathbf{v}_i\|^2 \right) + \gamma \left( b_u^2 + b_i^2 \right) $$
Where:
- $r_{ui}$: Actual rating of user $u$ for item $i$.
- $\mathbf{u}_u$: Latent factor vector for user $u$.
- $\mathbf{v}_i$: Latent factor vector for item $i$.
- $b_u$, $b_i$: User and item biases.
- $\lambda$, $\gamma$: Regularization parameters.
- Alternating Least Squares (ALS): The optimization alternates between fixing user factors and updating item factors, and vice versa.
- Regularization: Prevents overfitting by penalizing large latent factors and biases.
- Numba JIT Compilation: The `@jit(nopython=True)` decorator is used to compile functions to machine code at runtime, significantly speeding up computations.
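
As an illustration of the alternating scheme, here is a minimal sketch of one ALS step for a single user with all movie parameters held fixed. It assumes the per-user objective is the squared error over that user's ratings plus `lam * ||u||^2 + gamma * b_u^2`; the actual formulas in `Factors_Update_Functions.py` may differ. The names `update_user` and `user_objective` are hypothetical, and the sketch uses plain NumPy without the Numba decorator.

```python
import numpy as np

def user_objective(b_u, u_vec, r, V_rated, b_rated, lam, gamma):
    """Regularized squared error for one user (assumed form of the objective)."""
    resid = r - (V_rated @ u_vec + b_u + b_rated)
    return float(resid @ resid + lam * (u_vec @ u_vec) + gamma * b_u ** 2)

def update_user(u_vec, r, V_rated, b_rated, lam, gamma):
    """One ALS step for a single user; movie factors and biases stay fixed."""
    n_rated, k = V_rated.shape
    # Bias step: closed-form minimizer with the current latent vector fixed.
    b_u = np.sum(r - V_rated @ u_vec - b_rated) / (n_rated + gamma)
    # Factor step: ridge-regression normal equations with the new bias fixed.
    A = V_rated.T @ V_rated + lam * np.eye(k)
    rhs = V_rated.T @ (r - b_u - b_rated)
    return b_u, np.linalg.solve(A, rhs)

# Synthetic data for one user who rated 5 movies, with 3 latent factors.
rng = np.random.default_rng(0)
V_rated = rng.normal(size=(5, 3))   # factors of the rated movies
b_rated = rng.normal(size=5)        # biases of the rated movies
r = rng.uniform(0.5, 5.0, size=5)   # the user's ratings
u_old, b_old = np.zeros(3), 0.0
b_new, u_new = update_user(u_old, r, V_rated, b_rated, lam=1.0, gamma=0.01)
```

Each step solves its subproblem exactly, so the per-user objective cannot increase from `(b_old, u_old)` to `(b_new, u_new)`. The movie update is symmetric, with the roles of users and movies swapped.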
Execute the training script:

    python main.py

Within `main.py`, you can adjust the following hyperparameters:

- Experiment Settings:

      experiment_name = "your_experiment_name"

- Model Hyperparameters:

      K_factors = 30    # Number of latent factors
      n_Epochs = 100    # Number of training epochs
      lambda_reg = 1    # Regularization parameter for latent factors
      gamma = 0.01      # Regularization parameter for biases
      taw = 10          # Learning rate or regularization term

- Setup Logging and Experiment Folder
  - Initializes logging to track progress.
  - Creates an experiment folder to save models and logs.
- Data Loading
  - Loads preprocessed training and test data.
  - Loads index mappings for users and movies.
- Model Initialization
  - Initializes user and movie latent factors with random values.
  - Initializes user and item biases with random values.
- Training Loop

  For each epoch:
  - User Updates: Updates user biases and latent factors using `Update_user_biases` and `Update_user_factors`.
  - Movie Updates: Updates item biases and latent factors using `Update_movie_biases` and `Update_movie_factors`.
  - Metric Calculation: Calculates training and validation loss and RMSE using `calc_metrics`.
  - Logging and Saving: Logs the metrics and training times, and saves the model parameters.
- Post-Training
  - Plots and saves the loss and RMSE curves using `plot_likelihood` and `plot_rmse`.
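
The overall flow of initialization, alternating updates, and metric tracking can be sketched end to end on synthetic data. This is an illustrative toy, not the contents of `main.py`: the data is randomly generated, the hyperparameter values are arbitrary, `update_side` and `train_rmse` are hypothetical helpers, only training RMSE is tracked, and logging and model saving are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, k = 30, 20, 3
lam, gamma, n_epochs = 0.1, 0.01, 10  # illustrative values, not the project's settings

# Synthetic observed (user, movie, rating) triples drawn from a hidden low-rank model.
mask = rng.random((n_users, n_movies)) < 0.6
users, movies = np.nonzero(mask)
ratings = (rng.normal(size=(n_users, k)) @ rng.normal(size=(k, n_movies)))[users, movies]

# Model initialization: small random latent factors and biases.
U = 0.1 * rng.normal(size=(n_users, k))
V = 0.1 * rng.normal(size=(n_movies, k))
b_u = 0.1 * rng.normal(size=n_users)
b_i = 0.1 * rng.normal(size=n_movies)

def update_side(X, b_x, Y, b_y, x_idx, y_idx):
    """Update biases and factors of one side (X) in place, with the other side (Y) fixed."""
    for x in range(X.shape[0]):
        rows = np.nonzero(x_idx == x)[0]
        if rows.size == 0:
            continue  # no observed ratings for this user or movie
        Y_x = Y[y_idx[rows]]
        r = ratings[rows] - b_y[y_idx[rows]]  # remove the fixed side's bias
        b_x[x] = np.sum(r - Y_x @ X[x]) / (rows.size + gamma)
        X[x] = np.linalg.solve(Y_x.T @ Y_x + lam * np.eye(k), Y_x.T @ (r - b_x[x]))

def train_rmse():
    pred = np.sum(U[users] * V[movies], axis=1) + b_u[users] + b_i[movies]
    return float(np.sqrt(np.mean((ratings - pred) ** 2)))

history = [train_rmse()]  # RMSE before any update
for epoch in range(n_epochs):
    update_side(U, b_u, V, b_i, users, movies)  # user updates, movies fixed
    update_side(V, b_i, U, b_u, movies, users)  # movie updates, users fixed
    history.append(train_rmse())
```

Keeping a per-epoch `history` list is what makes the post-training curves possible. A validation curve would be computed the same way on held-out triples.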
- Update_user_biases
- Update_user_factors
- Update_movie_biases
- Update_movie_factors
- calc_metrics
These functions perform the core computations for updating model parameters and calculating metrics.
- plot_likelihood
- plot_rmse
- setup_logging
- setup_experiment_folder
- save_model
- load_model
- Load_training_data
- Load_test_data
- Load_idx_maps
These utility functions assist with logging, plotting, data loading, and model saving/loading.
- Root Mean Square Error (RMSE):

  $$ \text{RMSE} = \sqrt{ \frac{1}{N} \sum_{(u, i)} \left( r_{ui} - \hat{r}_{ui} \right)^2 } $$

  Where:
  - $N$: Total number of ratings.
  - $r_{ui}$: Actual rating.
  - $\hat{r}_{ui}$: Predicted rating.

- The `calc_metrics` function computes RMSE and total loss for both training and validation datasets after each epoch.
- Metrics are logged to both the console and a log file in the experiment folder.
- Training and validation RMSE and loss are tracked over epochs.
- Loss and RMSE curves are plotted and saved to visualize training progress.
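
The RMSE formula above translates directly into a few lines of NumPy. This is a standalone sketch with made-up ratings. The `rmse` helper is hypothetical and is not the project's `calc_metrics` function.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error between actual and predicted ratings."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

actual = [4.0, 3.0, 5.0, 1.0]
predicted = [3.5, 3.0, 4.0, 2.0]
# Squared errors: 0.25, 0, 1, 1 -> mean 0.5625 -> square root 0.75
value = rmse(actual, predicted)
```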
Use the `Recommendation.py` script to generate movie recommendations based on a user's favorite movie.

- Load Trained Model and Data

  The script loads the trained model parameters and mappings.

- Specify Input Movie

  Modify the script to specify the movie the user likes:

      movie_name, movie_id, genre = get_movie_details(movies, "Toy Story (1995)")

- Create a User Profile

  A "fake user" is created based on the favorite movie using `create_fake_user`.

- Generate Recommendations
  - Computes recommendation scores by projecting the user's latent factors onto all movie factors.
  - Adjusts scores with item biases.
- Display Recommendations

  Outputs a list of recommended movies and their genres, excluding the input movie.

python Recommendation.py
The user liked the movie: Toy Story (1995), with id: 1, The genre is Animation|Children|Comedy
Recommendations are:
title genre
0 Toy Story 2 (1999) Adventure|Animation|Children|Comedy
1 Finding Nemo (2003) Adventure|Animation|Children
2 Monsters, Inc. (2001) Adventure|Animation|Children
...
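
The scoring steps described above can be sketched as follows. How `create_fake_user` actually builds the profile is not shown in this README, so the sketch assumes one plausible approach: fitting a user vector to a single 5-star rating of the liked movie by regularized least squares. The helpers `fit_fake_user` and `recommend`, the `bias_weight` parameter, and the random model parameters are all hypothetical.

```python
import numpy as np

def fit_fake_user(liked_vec, liked_bias, rating=5.0, lam=1.0):
    """Assumed profile fit: ridge solution for one observed rating of the liked movie."""
    k = liked_vec.shape[0]
    A = np.outer(liked_vec, liked_vec) + lam * np.eye(k)
    return np.linalg.solve(A, liked_vec * (rating - liked_bias))

def recommend(user_vec, movies_factors, item_bias, exclude, top_n=3, bias_weight=0.05):
    """Score every movie, drop the input movie, and return the top indices."""
    scores = movies_factors @ user_vec + bias_weight * item_bias
    scores[exclude] = -np.inf  # never recommend the movie the user already named
    return np.argsort(scores)[::-1][:top_n]

# Random stand-ins for trained parameters: 6 movies, 4 latent factors.
rng = np.random.default_rng(0)
movies_factors = rng.normal(size=(6, 4))
item_bias = rng.normal(size=6)

liked = 0  # index of the movie the user likes
user_vec = fit_fake_user(movies_factors[liked], item_bias[liked])
top = recommend(user_vec, movies_factors, item_bias, exclude=liked)
```

The returned indices would then be mapped back to titles and genres through the saved index mappings. Down-weighting the item bias is one way to keep globally popular movies from dominating every list, but whether the script does this is an assumption.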
Use the `Visualization.py` script to visualize movie embeddings.

- Load Model and Mappings

  Loads movie factors and index-to-title mappings.

- Dimensionality Reduction

  Applies PCA to reduce movie latent factors to 2D for visualization.

- Plot Embeddings

  Plots the reduced embeddings and labels select movies.

      python Visualization.py

- A scatter plot displaying movies in the latent feature space, potentially revealing clusters of similar movies.
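
The dimensionality reduction step can be illustrated with a small PCA implemented through NumPy's singular value decomposition. This is a sketch on random stand-in embeddings. `Visualization.py` may well use scikit-learn's `PCA` instead, and `pca_2d` is a hypothetical helper.

```python
import numpy as np

def pca_2d(embeddings):
    """Project row vectors onto their first two principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Random stand-in for trained movie factors: 50 movies, 30 latent factors.
rng = np.random.default_rng(0)
movies_factors = rng.normal(size=(50, 30))
coords = pca_2d(movies_factors)  # shape (50, 2), one 2D point per movie
```

The two columns of `coords` can be passed to Matplotlib, for example `plt.scatter(coords[:, 0], coords[:, 1])`, with titles from the index-to-title mapping used as point labels.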
- Loss and RMSE Curves: Plots show how loss and RMSE decrease over epochs, indicating model learning.
- The system provides relevant movie recommendations based on user preferences.
- Demonstrates the model's ability to capture latent similarities.
- Visualizations may reveal clusters of movies with similar genres or themes.
- Helps in interpreting the learned latent features.
All plots generated during training and evaluation are saved in the respective experiment folder under `Experiments/`.
- Loss Curves: `Experiments/<experiment_name>/_Negative Log Likelihood Curves_experiment_plot.pdf`
- RMSE Curves: `Experiments/<experiment_name>/_RMSE Curves_experiment_plot.pdf`
- Embeddings Visualization: Saved as images when running `Visualization.py`.
- MovieLens Dataset: MovieLens 32M Dataset
- Matrix Factorization Techniques: Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix Factorization Techniques for Recommender Systems.
- Numba Documentation: Numba - JIT Compiler for Python
- Polars Documentation: Polars - Fast DataFrames in Rust and Python
- GroupLens Research: For providing the MovieLens dataset.