A toolkit for measuring the efficacy of various methods for calculating a confidence interval (CI). It currently covers CI methods for the following statistics:
- proportion
- the difference between two proportions
This library was mainly inspired by the article "Five Confidence Intervals for Proportions That You Should Know About" by Dr. Dennis Robert.
- python >=3.8
- python libs:
  - numpy
  - scipy
  - matplotlib
  - tqdm
https://pypi.org/project/CI-methods-analyser/
Applied statistics and data science: compare multiple CI methods to select the most appropriate for specific scenarios (by its accuracy at a specific range of true population properties, by computational performance, etc.)
Education on statistics and CI: demonstrates how different CI methods perform under various conditions, and helps build an understanding of CIs by comparing how accurate the methods actually are
Wald Interval is defined as so:
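In its standard textbook form (assumed here to match what the library's `wald_interval` computes), for $x$ successes observed in $n$ trials at confidence level $1-\alpha$:

```latex
\hat{p} = \frac{x}{n}, \qquad
\mathrm{CI}_{\text{Wald}} = \left[\, \hat{p} - z_{1-\alpha/2}\sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}},\;\; \hat{p} + z_{1-\alpha/2}\sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}} \,\right]
```

Here $z_{1-\alpha/2}$ is the two-tailed standard normal quantile, about 1.96 for a 95% CI.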
How well does it approximate the confidence interval?
Let's assess the quality of the 95% CIs this method produces by testing it on a range of true proportions. We'll take 100 true proportions with a step of 0.01: [0.001, 0.011, 0.021, ..., 0.991].
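As a rough illustration of that grid, the sketch below builds the same 100 values with numpy. It is not the library's own expansion logic; the names `begin`, `step`, and `count` are hypothetical and chosen only for this example.

```python
import numpy as np

# Hypothetical names for this sketch; not part of CI_methods_analyser.
begin, step, count = 0.001, 0.01, 100

# 100 evenly spaced true proportions: 0.001, 0.011, ..., 0.991
proportions = begin + step * np.arange(count)

print(proportions[:3], proportions[-1])
```

Building the grid as `begin + step * k` avoids the accumulated rounding error that repeatedly adding `step` in a loop could introduce.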
from CI_methods_analyser import CImethodForProportion_efficacyToolkit as toolkit, methods_for_CI_for_proportion
toolkit(
method=methods_for_CI_for_proportion.wald_interval, method_name="Wald Interval"
).calculate_coverage_and_show_plot(
sample_size=100, proportions=('0.001', '0.999', '0.01'), confidence=0.95,
plt_figure_title="Wald Interval coverage"
)
input('press Enter to exit')
This outputs the image:
The plot shows that the method performs poorly overall, and particularly poorly for extreme proportions. While for some true proportions the calculated CI has a true confidence (coverage) of around 95%, most of the time the coverage is noticeably lower. For true proportions <0.05 and >0.95, the coverage of the generated CI is generally lower than 90%, as indicated by the steep descent at the left-most and right-most parts of the plot.
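One reason for the collapse at the extremes can be checked by hand. The sketch below uses a textbook Wald formula written from scratch (`wald_interval_sketch` is a hypothetical helper, not the library's implementation). With zero observed successes the interval degenerates to [0, 0], which can never contain a positive true proportion. For n = 100 and a true proportion of 0.01, about 37% of samples have zero successes, so coverage there cannot exceed roughly 63%.

```python
from math import sqrt
from scipy.stats import binom, norm

def wald_interval_sketch(x, n, conflevel=0.95):
    """Textbook Wald interval; a stand-in written for this example only."""
    z = norm.ppf(1 - (1 - conflevel) / 2)
    p_hat = x / n
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# With 0 successes the standard error is 0, so the interval has zero width.
low, high = wald_interval_sketch(0, 100)

# Probability of seeing 0 successes in 100 trials when the true proportion is 0.01.
prob_zero_successes = binom.pmf(0, 100, 0.01)

print(low, high)
print(prob_zero_successes)
```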
You really might want to use a different method. Check out this wonderful medium.com article by Dr. Dennis Robert:
The method calculate_coverage_and_show_plot that we just used is a shortcut. The code below does the same calculations and yields the same result. It relies on the public properties and methods, giving more control over parts of the calculation:
from CI_methods_analyser import CImethodForProportion_efficacyToolkit as toolkit, methods_for_CI_for_proportion
# take an already implemented method for calculating CI for proportions
wald_interval = methods_for_CI_for_proportion.wald_interval
# initialize the toolkit
wald_interval_test_toolkit = toolkit(
method=wald_interval, method_name="Wald Interval")
# calculate the real coverage that the method produces
# for each case of a true population proportion (taken from the list `proportions`)
wald_interval_test_toolkit.calculate_coverage_analytically(
sample_size=100, proportions=('0.001', '0.999', '0.01'), confidence=0.95)
# now you can access the calculated coverage and a few statistics:
# wald_interval_test_toolkit.coverage # 1-d array of 0-100, the same shape as passed `proportions`
# NOTE: `proportions`, when passed as a tuple of 3 float strings, expands to a list of evenly spaced float values where the #0 value is begin, #1 is end, #2 is step.
# wald_interval_test_toolkit.average_coverage # np.longdouble 0-100, avg of `coverage`
# wald_interval_test_toolkit.average_deviation # np.longdouble 0-100, avg abs diff w/ `confidence`
# plots the calculated coverage in a matplotlib.pyplot figure
wald_interval_test_toolkit.plot_coverage(
plt_figure_title="Wald Interval coverage")
# you can access the figure here:
# wald_interval_test_toolkit.figure
# shows the figure (non-blocking)
wald_interval_test_toolkit.show_plot()
# because show_plot() is non-blocking,
# you have to pause the execution in order for the figure to be rendered completely
input('press Enter to exit')
I expose some style/color settings used by matplotlib.
My preference goes to the night light-friendly styling:
from CI_methods_analyser import CImethodForProportion_efficacyToolkit as toolkit, methods_for_CI_for_proportion
toolkit(
method=methods_for_CI_for_proportion.wald_interval, method_name="Wald Interval"
).calculate_coverage_and_show_plot(
sample_size=100, proportions=('0.001', '0.999', '0.01'), confidence=0.95,
plt_figure_title="Wald Interval coverage",
theme='dark_background', plot_color="green", line_color="orange"
)
input('press Enter to exit')
You can implement your own methods and test them:
from CI_methods_analyser import CImethodForProportion_efficacyToolkit as toolkit
from CI_methods_analyser.math_functions import normal_z_score_two_tailed
from functools import lru_cache
# not a particularly good method for calculating CI for proportion
@lru_cache(100_000)
def im_telling_ya_test(x: int, n: int, conflevel: float = 0.95):
z = normal_z_score_two_tailed(conflevel)
p = float(x)/n
return (
p - 0.02*z,
p + 0.02*z
)
toolkit(
method=im_telling_ya_test, method_name='"I\'m telling ya" test'
).calculate_coverage_and_show_plot(
sample_size=100, proportions=('0.001', '0.999', '0.01'), confidence=0.95,
plt_figure_title='"I\'m telling ya" coverage',
theme='dark_background', plot_color="green", line_color="orange"
)
input('press Enter to exit')
This is the kind of method one would not trust. It performs very unreliably for the majority of true proportions, as indicated by the extremely large discrepancy between the requested confidence level of 95% and the true confidence of the CIs this method provides. This means the output CIs are generally narrower than they should be, so there is less confidence that the true value lies within a CI. One could say this method overestimates its ability to produce a confident range.
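A quick back-of-the-envelope check shows why the intervals are too narrow. The sketch assumes that `normal_z_score_two_tailed(0.95)` returns the usual two-tailed z of about 1.96, and uses `scipy.stats.norm.ppf` in its place. It compares the method's fixed half-width with the spread of the sample proportion at n = 100 and a true proportion of 0.5.

```python
from math import sqrt
from scipy.stats import norm

n = 100
z = norm.ppf(0.975)  # two-tailed z for 95%, about 1.96

# Half-width used by the custom method above: constant, regardless of n or p.
fixed_half_width = 0.02 * z

# Standard deviation of the sample proportion when the true proportion is 0.5.
sd_of_sample_proportion = sqrt(0.5 * 0.5 / n)

# Half-width that a normal approximation would call for at that proportion.
normal_half_width = z * sd_of_sample_proportion

print(fixed_half_width, sd_of_sample_proportion, normal_half_width)
```

The fixed half-width (about 0.039) is smaller than a single standard deviation of the sample proportion (0.05), while a normal approximation asks for about 0.098.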
Let's try another custom method: "God is my witness" score
from CI_methods_analyser import CImethodForProportion_efficacyToolkit as toolkit
from CI_methods_analyser.math_functions import normal_z_score_two_tailed
from functools import lru_cache
# you could say, this method is "too good"
@lru_cache(100_000)
def God_is_my_witness_score(x: int, n: int, conflevel: float = 0.95):
z = normal_z_score_two_tailed(conflevel)
p = float(x)/n
return (
(0 + p)/2 - 0.005*z,
(1 + p)/2 + 0.005*z
)
toolkit(
method=God_is_my_witness_score, method_name='"God is my witness" score'
).calculate_coverage_and_show_plot(
sample_size=100, proportions=('0.001', '0.999', '0.01'), confidence=0.95,
plt_figure_title='"God is my witness" score coverage', theme='dark_background'
)
input('press Enter to exit')
This method clearly overdid the estimates. While one asks for a 95% CI, the output range is far wider than necessary, allowing for a very wide range of possibilities. In stats lingo, one would say that this method is way too conservative.
Let's use the implemented Pooled Z test:
from CI_methods_analyser import CImethodForDiffBetwTwoProportions_efficacyToolkit as toolkit_d, methods_for_CI_for_diff_betw_two_proportions as methods
toolkit_d(
method=methods.Z_test_pooled, method_name='Z test pooled'
).calculate_coverage_and_show_plot(
sample_size1=100, sample_size2=100, proportions=('0.001', '0.999', '0.01'), confidence=0.95,
plt_figure_title='Z test pooled', theme='dark_background',
)
input('press Enter to exit')
As you can see, this test is generally perfect for close proportions (along the y = x line) [WHITE], unless the proportions have extreme values, where the confidence of the output CIs is lower than expected [PURPLE].
Also, this test is extremely conservative for high and extreme differences between the two proportions, i.e. for proportions whose values are far apart [GREEN].
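The same pattern can be reproduced without the library. The sketch below assumes the pooled interval has the form (p1 - p2) ± z·sqrt(p̄(1 - p̄)(1/n1 + 1/n2)) with p̄ = (x1 + x2)/(n1 + n2); whether `Z_test_pooled` matches this exactly is an assumption, and both function names are hypothetical. Coverage is computed exactly by summing the joint binomial probability over every (x1, x2) pair whose interval contains the true difference.

```python
import numpy as np
from scipy.stats import binom, norm

def pooled_z_interval_sketch(x1, n1, x2, n2, conflevel=0.95):
    """Assumed form of a pooled-variance z interval for p1 - p2."""
    z = norm.ppf(1 - (1 - conflevel) / 2)
    pooled = (x1 + x2) / (n1 + n2)
    half_width = z * np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    diff = x1 / n1 - x2 / n2
    return diff - half_width, diff + half_width

def exact_coverage_two_proportions(p1, p2, n1, n2, conflevel=0.95):
    # All possible outcomes as a (n1+1) x (n2+1) grid via broadcasting.
    x1 = np.arange(n1 + 1)[:, None]
    x2 = np.arange(n2 + 1)[None, :]
    low, high = pooled_z_interval_sketch(x1, n1, x2, n2, conflevel)
    covered = (low <= p1 - p2) & (p1 - p2 <= high)
    # The two samples are independent, so the joint pmf is a product.
    joint_pmf = binom.pmf(x1, n1, p1) * binom.pmf(x2, n2, p2)
    return float(np.sum(joint_pmf[covered]))

close = exact_coverage_two_proportions(0.5, 0.5, 100, 100)
far_apart = exact_coverage_two_proportions(0.1, 0.9, 100, 100)
print(close, far_apart)
```

Under these assumptions, coverage for equal proportions stays near the nominal 95%, while for proportions far apart it approaches 100%, because the pooled variance overstates the true variance of the difference.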
You may want to change the color palette (although I wouldn't):
from CI_methods_analyser import CImethodForDiffBetwTwoProportions_efficacyToolkit as toolkit_d, methods_for_CI_for_diff_betw_two_proportions as methods
toolkit_d(
method=methods.Z_test_pooled, method_name='Z test pooled'
).calculate_coverage_and_show_plot(
sample_size1=100, sample_size2=100, proportions=('0.001', '0.999', '0.01'), confidence=0.95,
plt_figure_title='Z test pooled', theme='dark_background',
colors=("gray", "purple", "white", "orange", "#d62728")
)
input('press Enter to exit')
Two ways can be used to calculate the efficacy of CI methods for a given confidence and a true population proportion:
- approximately, with random simulation (as implemented in R by Dr. Dennis Robert, see link above). Here: calculate_coverage_randomly
- precisely, with the analytical solution. Here: calculate_coverage_analytically
By default, always prefer the analytical solution.
Sampling the same binomial distribution n times (called "random experiments" or "simulations"), as is typically done, is inefficient, because the binomial distribution is already fully determined by the given true population proportion and sample size.
By relying on the binomial distribution from scipy, the analytical solution provides 100% accuracy for any method (defined as a python function), any confidence level, any true population proportion(s), any sample and population size(s).
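The difference between the two approaches can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the function names are hypothetical, the Wald formula is the textbook one, and the methods here accept numpy arrays rather than scalar `x` values.

```python
import numpy as np
from scipy.stats import binom, norm

def wald_interval_sketch(x, n, conflevel=0.95):
    """Textbook Wald interval, vectorized over x; a stand-in for this example."""
    z = norm.ppf(1 - (1 - conflevel) / 2)
    p_hat = x / n
    half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def coverage_analytical(method, p, n, conflevel=0.95):
    # Enumerate every possible outcome x = 0..n once and weight by its pmf.
    x = np.arange(n + 1)
    low, high = method(x, n, conflevel)
    covered = (low <= p) & (p <= high)
    return float(np.sum(binom.pmf(x, n, p)[covered]))

def coverage_simulated(method, p, n, conflevel=0.95, trials=10_000, seed=0):
    # Draw random outcomes and count how often the interval contains p.
    rng = np.random.default_rng(seed)
    x = rng.binomial(n, p, size=trials)
    low, high = method(x, n, conflevel)
    return float(np.mean((low <= p) & (p <= high)))

exact = coverage_analytical(wald_interval_sketch, 0.3, 100)
approx = coverage_simulated(wald_interval_sketch, 0.3, 100)
print(exact, approx)
```

The analytical version evaluates the method only n + 1 times and returns the same value on every run, while the simulated value carries sampling noise that shrinks only as `trials` grows.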
Mathematical proof of the analytical solution:
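A sketch of the standard argument (not necessarily the author's own proof): the number of successes $X$ follows a Binomial$(n, p)$ distribution, and a CI method maps each outcome $x$ to a fixed interval $[L(x), U(x)]$. The coverage at a true proportion $p$ is therefore a finite sum over the possible outcomes:

```latex
\begin{aligned}
C(p) &= P\bigl(L(X) \le p \le U(X)\bigr) \\
     &= \sum_{x=0}^{n} P(X = x)\,\mathbf{1}\bigl[L(x) \le p \le U(x)\bigr] \\
     &= \sum_{x=0}^{n} \binom{n}{x}\, p^{x} (1-p)^{n-x}\,\mathbf{1}\bigl[L(x) \le p \le U(x)\bigr]
\end{aligned}
```

Since the sum has only $n + 1$ terms, it can be evaluated directly, with no random sampling involved.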
Both the "simulation" and "analytical" methods are implemented for CIs for both statistics: the proportion, and the difference between two proportions. For the precise analytical solution, an optimization was made. Theoretically it is lossy, but in practice the error is always negligible (as shown by test_z_precision_difference.py) and is less significant than the 64-bit floating-point precision error between the closest float representation and the true Real value. The optimization is controlled by the parameter z_precision, which is estimated automatically by default.
1. Equivalence and Noninferiority Testing (as I understand it, these are fancy terms for 2-sided and 1-sided p tests for the difference between two proportions)
- https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/PASS/Confidence_Intervals_for_the_Difference_Between_Two_Proportions.pdf
- https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/PASS/Non-Inferiority_Tests_for_the_Difference_Between_Two_Proportions.pdf
- https://www.ncss.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Two_Proportions-Non-Inferiority,_Superiority,_Equivalence,_and_Two-Sided_Tests_vs_a_Margin.pdf
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3019319/
- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2701110/
- https://pubmed.ncbi.nlm.nih.gov/9595617/
- http://thescipub.com/pdf/10.3844/amjbsp.2010.23.31
2. Biostatistics course (Dr. Nicolas Padilla Raygoza, et al.)
- https://docs.google.com/presentation/d/1t1DowyVDDRFYGHDlJgmYMRN4JCrvFl3q/edit#slide=id.p1
- https://www.google.com/search?q=Dr.+Sc.+Nicolas+Padilla+Raygoza+Biostatistics+course+Part+10&oq=Dr.+Sc.+Nicolas+Padilla+Raygoza+Biostatistics+course+Part+10&aqs=chrome..69i57.3448j0j7&sourceid=chrome&ie=UTF-8
- https://slideplayer.com/slide/9837395/
3. Using z-test instead of a binomial test: