This is a project to investigate the accuracy cost of small sample sizes when sampling from a categorical distribution.
Currently, the project only implements one very simple case: sampling from a distribution of evenly weighted categories, using the Jaccard index to evaluate the similarity of the sample distribution to the known population distribution.
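For reference, one common way to apply the Jaccard index to two categorical distributions is to take the sum of the element-wise minima of their proportion vectors over the sum of the element-wise maxima. The sketch below illustrates that idea; the function name and exact formula are illustrative and may not match what the script does internally.

```r
# Illustrative weighted Jaccard similarity between two count/probability
# vectors; the exact formula in R/sample_size_cost.R may differ.
jaccard_similarity <- function(observed, expected) {
  observed <- observed / sum(observed)   # compare proportions, not raw counts
  expected <- expected / sum(expected)
  sum(pmin(observed, expected)) / sum(pmax(observed, expected))
}

# Example: 30 draws from 3 evenly weighted categories vs. the true proportions
sample.counts <- table(factor(sample(1:3, 30, replace = TRUE), levels = 1:3))
jaccard_similarity(as.numeric(sample.counts), rep(1/3, 3))
```

A score of 1 means the sample proportions match the population exactly; smaller samples tend to produce lower and more variable scores.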
To run the project, just run R/sample_size_cost.R, either from within an R REPL/IDE or from the command line.
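For example, from a shell at the project root:

```sh
Rscript R/sample_size_cost.R
```

or from within an R session started in the project root:

```r
source("R/sample_size_cost.R")
```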
There are two variables of interest that the user might want to set. They currently need to be set within the code (see the illustrative sketch after the list below). They are:
bucket.counts
: the set of different distributions that will be sampled from

sample.sizes
: the set of different sample sizes to use when sampling from each distribution.
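For example, near the top of R/sample_size_cost.R they might be set to something like the following. The values are purely illustrative, and the assumption that each entry of bucket.counts is a number of evenly weighted buckets reflects the single case currently implemented.

```r
# Illustrative values only -- edit these inside R/sample_size_cost.R.
# Assumption: each element of bucket.counts is the number of evenly
# weighted categories defining one distribution to sample from.
bucket.counts <- c(2, 5, 10, 20)

# Sample sizes to try against each of those distributions.
sample.sizes <- c(10, 50, 100, 500, 1000)
```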
The script will create a folder for each distribution within the plots/ directory. There, for each sample size, it will store a histogram of similarity scores generated by sampling from the distribution 1000 times and comparing each sample to the population distribution. The same folder also contains a plot called 'errorbars.jpg', which shows the mean and standard deviation of the similarity scores at each sample size for that distribution.
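As a rough sketch of what each histogram summarises, the following (illustrative only; the script's actual loop and variable names will differ) generates 1000 similarity scores for a single distribution and sample size, assuming the weighted-Jaccard formula sketched earlier:

```r
# Sketch: similarity scores for one distribution / sample-size pair.
# Assumes 10 evenly weighted categories and samples of size 50.
n.buckets <- 10
n.sample  <- 50
population <- rep(1 / n.buckets, n.buckets)

scores <- replicate(1000, {
  counts <- tabulate(sample(seq_len(n.buckets), n.sample, replace = TRUE),
                     nbins = n.buckets)
  observed <- counts / sum(counts)
  sum(pmin(observed, population)) / sum(pmax(observed, population))
})

hist(scores, main = sprintf("Sample size %d", n.sample))
c(mean = mean(scores), sd = sd(scores))   # the numbers behind errorbars.jpg
```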
Also in the plots/ directory, the script will create a "cross-section" plot. This plot shows the mean and standard deviation of similarity scores for each distribution size, at a fixed sample size.