- One simulated dataset and one real-world dataset will be used for this assignment.
- Build and test the program with the small simulated CSV file provided.
- Find the frequent combinations of businesses and of users, i.e. the itemsets whose count meets a support threshold.
- Create two kinds of baskets: one per user, containing the IDs of the businesses that user reviewed, and one per business, containing the IDs of the users who commented on that business.
- Generate a subset of the Ta Feng dataset, structured like the simulated data.
- Implement the SON Algorithm on top of the Spark Framework.
- Find all frequent itemsets, of every size, in any given input file within the required time.
- Case number: Integer specifying the case (1 for Case 1, 2 for Case 2).
- Support: Integer defining the minimum count to qualify as a frequent itemset.
- Input file path: Full path to the input file, including directory, file name, and extension.
- Output file path: Full path to the output file, including directory, file name, and extension.
- Grade: 100%
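
The basket construction and the two passes of SON listed above can be sketched in plain Python, without Spark, to make the logic easier to follow. This is a minimal illustration under stated assumptions, not the submitted solution: the function names (`build_baskets`, `apriori`, `son`) are hypothetical, the mapping of case 1 to user baskets and case 2 to business baskets is assumed from the order of the bullets, and Apriori is assumed as the in-chunk algorithm. In a Spark version the chunks would typically be RDD partitions processed with `mapPartitions`.

```python
from collections import defaultdict
from itertools import combinations


def build_baskets(rows, case):
    """Group (user_id, business_id) rows into baskets.

    Assumption: case 1 -> one basket per user (items are business ids),
    case 2 -> one basket per business (items are user ids).
    """
    baskets = defaultdict(set)
    for user_id, business_id in rows:
        if case == 1:
            baskets[user_id].add(business_id)
        else:
            baskets[business_id].add(user_id)
    return [frozenset(items) for items in baskets.values()]


def apriori(baskets, threshold):
    """Return every itemset (sorted tuple) whose count in `baskets` is >= threshold."""
    counts = defaultdict(int)
    for basket in baskets:
        for item in basket:
            counts[(item,)] += 1
    current = {c for c, n in counts.items() if n >= threshold}
    frequent = set(current)
    size = 2
    while current:
        # Candidates: size-k combinations of surviving items whose
        # (k-1)-subsets are all frequent.
        items = sorted({i for c in current for i in c})
        candidates = [c for c in combinations(items, size)
                      if all(s in current for s in combinations(c, size - 1))]
        counts = defaultdict(int)
        for basket in baskets:
            for c in candidates:
                if basket.issuperset(c):
                    counts[c] += 1
        current = {c for c, n in counts.items() if n >= threshold}
        frequent |= current
        size += 1
    return frequent


def son(baskets, support, num_chunks=2):
    """Two-pass SON over in-memory chunks (stand-ins for Spark partitions)."""
    chunks = [baskets[i::num_chunks] for i in range(num_chunks)]
    chunks = [c for c in chunks if c]
    # Pass 1: local candidates, with the support scaled to the chunk size
    # (integer ceiling of support * len(chunk) / len(baskets)).
    candidates = set()
    for chunk in chunks:
        scaled = (support * len(chunk) + len(baskets) - 1) // len(baskets)
        candidates |= apriori(chunk, scaled)
    # Pass 2: count every candidate over all baskets, keep the truly frequent ones.
    counts = defaultdict(int)
    for chunk in chunks:
        for basket in chunk:
            for c in candidates:
                if basket.issuperset(c):
                    counts[c] += 1
    result = [c for c, n in counts.items() if n >= support]
    return sorted(result, key=lambda c: (len(c), c))


# Made-up sample rows, for illustration only.
rows = [("u1", "b1"), ("u1", "b2"), ("u1", "b3"), ("u2", "b1"),
        ("u2", "b2"), ("u3", "b1"), ("u3", "b3"), ("u4", "b2")]
print(son(build_baskets(rows, 1), 2))
# [('b1',), ('b2',), ('b3',), ('b1', 'b2'), ('b1', 'b3')]
```

Rounding the scaled threshold up is safe here because counts are integers: an itemset that reaches the global support must reach the scaled threshold in at least one chunk, so pass 1 produces no false negatives and pass 2 removes the false positives.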