A large list of user_ids in a txt file (not included because the file is too big for Github). Blackbox.py will be a simulation of a data stream.
I will implement the Bloom Filtering algorithm to estimate whether the user_id in the data stream has shown before. I will find proper hash functions and the number of hash functions in the Bloom Filtering algorithm.
- Global filter bit array length: 69997
- Hash functions: Should be independent, uniformly distributed, and keep the same once created
- User_id conversion: Convert user_id to an integer and then apply hash functions
- Data stream size: 100
- Number of tests: More than 30
- False positive rate (FPR): Should not be larger than 0.5 more than once
- Run time: Within 100s for 30 data streams
- Save results in a CSV file with the header “Time,FPR”
- Encapsulate hash functions into a function called myhashs
In this task, I will implement the Flajolet-Martin algorithm to estimate the number of unique users within a window in the data stream.
- Stream size: 300
- Number of tests: More than 30
- Final result requirement: 0.2 <= (sum of all estimations / sum of all ground truths) <= 5
- Run time: Within 100s for 30 data streams
- Save results in a CSV file with the header “Time,Ground Truth,Estimation”
- Encapsulate hash functions into a function called myhashs
The goal of this task is to implement the fixed size sampling method (Reservoir Sampling Algorithm).
- Stream size: 100
- Number of tests: More than 30
- Random seed: 553 in the main function
- Run time: Within 100s for 30 data streams
- Print the current stage of your reservoir into a CSV file every time you receive 100 users
- Grade: 100%