This is an implementation exercise of the Girvan-Newman algorithm. The goal is to find the optimal community clustering of a Yelp user network using the Yelp dataset.
The general pipeline is as followed:
- First, we have an input file (inputs/sample_data.csv) which contains a list of Yelp user and business pairs. The pairs represent each business review given by their customers.
- We construct a graph of user network. Each pair of user is connected if and only if the number of times both users rated the same businesses exceeds the threshold value from the input.
- We use this graph as the initial graph of the network and calculate betweenness value of the graph and save the output in a file.
- Calculate modularity of the clustering.
- Remove edge(s) with highest betweenness, re-cluster, and calculate modularity once again.
- Repeat this until there are no more edges left to remove and keep the community clustering with the highest modularity value.
- Output the optimal clustering to a text file.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
- Python >= 3.6
- Spark >= 2.3.3
Download the project by cloning the repository.
git clone https://github.com/hsuanhauliu/girvan-newman-community-detection.git
Follow the command below to run the program.
python3 main.py [threshold_value] [input_file] [betweenness_output_file] [community_output_file]
Example:
python3 main.py 7 inputs/sample_data.csv betweenness.txt community.txt