This is a project to investigate the accuracy cost of small sample sizes when sampling from a categorical distribution.
Currently, the project only implements one very simple case: sampling from a distribution of evenly weighted categories, using the Jaccard index to evaluate the similarity of the sample distribution to the known population distribution.
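For reference, one common way to apply the Jaccard index to two categorical distributions is to take the sum of the element-wise minima of their proportion vectors over the sum of the element-wise maxima. The sketch below illustrates that idea; the function name and exact formula are illustrative and may not match what the script does internally.

```r
# Illustrative weighted Jaccard similarity between two count/probability
# vectors; the exact formula in R/sample_size_cost.R may differ.
jaccard_similarity <- function(observed, expected) {
  observed <- observed / sum(observed)   # compare proportions, not raw counts
  expected <- expected / sum(expected)
  sum(pmin(observed, expected)) / sum(pmax(observed, expected))
}

# Example: 30 draws from 3 evenly weighted categories vs. the true proportions
sample.counts <- table(factor(sample(1:3, 30, replace = TRUE), levels = 1:3))
jaccard_similarity(as.numeric(sample.counts), rep(1/3, 3))
```

A score of 1 means the sample proportions match the population exactly; smaller samples tend to produce lower and more variable scores.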
To run the project, just run R/sample_size_cost.R, either from within an R REPL/IDE or from the command line.
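For example, from a shell at the project root:

```sh
Rscript R/sample_size_cost.R
```

or from within an R session started in the project root:

```r
source("R/sample_size_cost.R")
```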
There are two variables of interest that the user might want to set. They currently need to be set within the code (see the illustrative sketch after the list below). They are:
bucket.counts
: the set of different distributions that will be sampled from

sample.sizes
: the set of different sample sizes to use when sampling from each distribution.
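For example, near the top of R/sample_size_cost.R they might be set to something like the following. The values are purely illustrative, and the assumption that each entry of bucket.counts is a number of evenly weighted buckets reflects the single case currently implemented.

```r
# Illustrative values only -- edit these inside R/sample_size_cost.R.
# Assumption: each element of bucket.counts is the number of evenly
# weighted categories defining one distribution to sample from.
bucket.counts <- c(2, 5, 10, 20)

# Sample sizes to try against each of those distributions.
sample.sizes <- c(10, 50, 100, 500, 1000)
```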
The script will create a folder for each distribution within the plots/ directory. There, for each sample size, it will store a histogram of similarity scores generated by sampling from the distribution 1000 times and comparing each sample to the population distribution. The same folder also contains a plot called 'errorbars.jpg', which shows the mean and standard deviation of the similarity scores at each sample size for that distribution.
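As a rough sketch of what each histogram summarises, the following (illustrative only; the script's actual loop and variable names will differ) generates 1000 similarity scores for a single distribution and sample size, assuming the weighted-Jaccard formula sketched earlier:

```r
# Sketch: similarity scores for one distribution / sample-size pair.
# Assumes 10 evenly weighted categories and samples of size 50.
n.buckets <- 10
n.sample  <- 50
population <- rep(1 / n.buckets, n.buckets)

scores <- replicate(1000, {
  counts <- tabulate(sample(seq_len(n.buckets), n.sample, replace = TRUE),
                     nbins = n.buckets)
  observed <- counts / sum(counts)
  sum(pmin(observed, population)) / sum(pmax(observed, population))
})

hist(scores, main = sprintf("Sample size %d", n.sample))
c(mean = mean(scores), sd = sd(scores))   # the numbers behind errorbars.jpg
```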
Also in the plots/ directory, the script will create a "cross-section" plot. This plot shows the mean and standard deviation of similarity scores for each distribution size, at a fixed sample size.