DATA EXPLODER FOR PRINCIPAL COMPONENT ANALYSIS
v1.0 (09/2021)
Updates, discussions, etc. can be found in the Data Exploder project: github.com/AleSacco
For further inquiries write to: Alessio Sacco ([email protected])
This script consists of:
- This main "DataExploder.py" file, to be run;
- (OPTIONAL) a configuration file: "config.py".
The script looks for a "config.py" file in the directory from which it is run. If the file is not found, the script uses default parameters and writes them to a newly created config file.
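As an illustration of the fallback to defaults, the sketch below merges user settings over the default values listed under "Configuration file variables". It does not reproduce the actual layout of "config.py", which is not described here: the dictionary form and the names DEFAULTS and resolve_config are assumptions made only for this example.

```python
# Hypothetical sketch: the real config.py layout may differ.
# Default values are taken from the "Configuration file variables" list.
DEFAULTS = {
    "k": 1.96,
    "Measurements file name": "data.csv",
    "Uncertainties file name": "uncertainties.csv",
    "Destination file name": "Exploded data.csv",
    "Number of samples": 1E3,
    "Number of label columns": 3,
}


def resolve_config(user_settings=None):
    """Return the defaults overridden by any user-provided settings."""
    settings = dict(DEFAULTS)
    settings.update(user_settings or {})
    # "Number of samples" may be given as a float and is rounded to an integer.
    settings["Number of samples"] = int(round(settings["Number of samples"]))
    return settings


# No user settings: every parameter takes its default value.
config = resolve_config()
# Only the overridden parameters change.
custom = resolve_config({"k": 2.0, "Number of samples": 2.5e3})
```

Keeping the defaults in one place means a missing or partial configuration still yields a complete set of parameters.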
This script takes as input 2 .csv files describing data as tables:
- a file containing the best estimate for each data point (default file name: data.csv);
- a file containing an uncertainty value for each data point (default file name: uncertainties.csv).
The complete information for any data point is found at a specific table coordinate, which is the same in both files: the estimate file contains the best estimate for that data point, while the uncertainty file contains the corresponding uncertainty value. The script also accepts data measured as below the limit of detection (LOD). In that case, the entry in the best estimates table must contain the LOD of the measurement (NOT the value 0), while the corresponding entry in the uncertainties table can be blank or contain any non-numerical string, to indicate that the first value is a LOD. The number 0 can also be used for this purpose, but this is not recommended.
Both tables must have the exact same structure in terms of row/column positions, number of label columns, etc. THE FIRST ROW must be identical in both tables: it contains the unique name of each column (variable names or types of label) and is not treated as data. In the configuration file, "Number of label columns" is an integer indicating the number of label columns, i.e. the number of leftmost COLUMNS THAT WILL BE IGNORED in the Monte Carlo data generation: these entries in each row will be replicated verbatim for the corresponding generated samples. These can include the sample names and/or categorical variables, intended for later analysis.
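The handling of label columns can be sketched as follows. This is a minimal illustration and not the script's actual code: the table contents and the helper names split_labels and replicate_labels are hypothetical.

```python
import csv
import io


def split_labels(rows, n_label_columns):
    """Split each row into (labels, values) at the label-column boundary."""
    return [(row[:n_label_columns], row[n_label_columns:]) for row in rows]


def replicate_labels(labels, n_samples):
    """Copy the label entries verbatim, once per generated sample."""
    return [list(labels) for _ in range(n_samples)]


# Hypothetical best-estimates table with 3 label columns and 2 variables.
text = (
    "Sample,Site,Batch,Fe,Cu\n"
    "S1,North,A,12.1,0.8\n"
    "S2,South,B,9.7,1.1\n"
)
reader = csv.reader(io.StringIO(text))
header = next(reader)  # first row: column names, not treated as data
pairs = split_labels(list(reader), n_label_columns=3)

labels, values = pairs[0]
copies = replicate_labels(labels, n_samples=4)
```

Here only the entries in "values" would be passed to the Monte Carlo generation, while each generated sample carries an unchanged copy of "labels".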
In this version of the Data Exploder, each single datum consists of two inputs: a best estimate and an uncertainty. The best estimate is either a number or, for absent data, any non-numeric string (such as "N/A" or "NA") or an empty entry. If non-numeric strings are found in the best estimates table, the corresponding variables will be IGNORED FOR ALL DATA. If a numeric, non-zero uncertainty input is present in the corresponding file, the script interprets it as the half-width of the confidence interval on the measurement with a Gaussian probability density function (pdf), i.e. the expanded uncertainty. If an uncertainty input is a string, NaN (not a number), or zero, this is interpreted as an indication that the datum is to be read as BELOW THE LIMIT OF DETECTION: a uniform pdf is used for the data point instead, ranging from zero to the value indicated in the best estimates table.
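The rule for reading an uncertainty entry can be sketched as a small check. The function name is_below_lod is hypothetical, and the script itself may implement the test differently.

```python
import math


def is_below_lod(uncertainty):
    """Return True if the uncertainty entry marks the datum as below the LOD.

    Blank entries, non-numeric strings, NaN and zero all count as LOD markers;
    any other numeric value is read as an expanded uncertainty.
    """
    try:
        value = float(uncertainty)
    except (TypeError, ValueError):
        return True  # blank or non-numeric string
    return math.isnan(value) or value == 0.0


# Entries as they might be read from the uncertainties table.
entries = ["0.5", "", "<LOD", "nan", "0", 0.3]
flags = [is_below_lod(entry) for entry in entries]
```

Only the first and last entries above are treated as Gaussian uncertainties; the others select the uniform pdf between zero and the LOD.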
Using the appropriate pdf, the script then "explodes" each datum (generates Monte Carlo samples) accordingly, using for the Gaussian pdfs a coverage factor, usually named "k", which changes according to the choice of confidence level. By default, k=1.96 in this script, corresponding to a confidence level of 95% for a Gaussian pdf, but this can be changed in the config file.
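A minimal sketch of the "explosion" of a single datum is given below, using only the Python standard library. It assumes that the Gaussian standard deviation is obtained as the expanded uncertainty divided by k, which is the usual meaning of a coverage factor. The function explode_datum and its parameters are hypothetical and do not reproduce the script's code.

```python
import random


def explode_datum(estimate, expanded_uncertainty, k=1.96, n_samples=1000, rng=None):
    """Generate Monte Carlo samples for one datum.

    expanded_uncertainty=None marks a datum below the LOD, in which case
    'estimate' is the LOD and samples are drawn uniformly between 0 and the LOD.
    """
    rng = rng or random.Random()
    if expanded_uncertainty is None:
        return [rng.uniform(0.0, estimate) for _ in range(n_samples)]
    # Assumption: standard deviation = expanded uncertainty / coverage factor.
    sigma = expanded_uncertainty / k
    return [rng.gauss(estimate, sigma) for _ in range(n_samples)]


rng = random.Random(0)  # fixed seed so the example is reproducible
gaussian_samples = explode_datum(10.0, 1.96, rng=rng)  # sigma = 1.0
lod_samples = explode_datum(5.0, None, rng=rng)        # uniform on [0, 5]
```

With k=1.96, an expanded uncertainty of 1.96 corresponds to a standard deviation of 1, so about 95% of the Gaussian samples are expected to fall within the stated interval around the best estimate.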
Configuration file variables:
"k" (decimal/float, default 1.96): coverage factor, used for computing Gaussian width parameters from uncertainties;
"Measurements file name" (string, default "data.csv"): name of the file containing the best estimates data;
"Uncertainties file name" (string, default "uncertainties.csv"): name of the file containing the uncertainties data;
"Destination file name" (string, default "Exploded data.csv"): name of the file generated by the script, containing the Monte Carlo samples data;
"Number of samples" (integer or decimal/float, default 1E3): number of Monte Carlo samples that will be generated for each data point (rounded to integer if necessary);
"Number of label columns" (integer, default 3): number of leftmost columns to be ignored, containing labels and categorical data, unique or not.