Skip to content
Joel Natividad edited this page Sep 24, 2024 · 10 revisions

stats Command Output Explanation

The qsv stats command computes summary statistics and infers data types for each column in a CSV file. Here's a detailed explanation of the output:

Basic Information

  • field: The name of the column (or its index if --no-headers is used).
  • type: Inferred data type (NULL, Integer, String, Float, Date, DateTime, or Boolean).
    • data type inferences are GUARANTEED as stats scans the entire file.
    • Date and DateTime are only inferred when --infer-dates is enabled. It can infer 19 date formats.
    • Booleans are only inferred when -infer-boolean is enabled. The heuristic for inferring booleans is as follows: When a column's cardinality is 2, and the 2 values' first characters are 0/1, t/f & y/n case-insensitive, the data type is inferred as boolean.
  • is_ascii: Whether the column contains only ASCII characters (true/false).

Numeric Statistics

  • sum: The total sum of all numeric values in the column. Will return *OVERFLOW*/*UNDERFLOW* when the sum is greater than/lesser than i64::MAX/i64::MIN respectively.
  • min: The minimum value in the column.
  • max: The maximum value in the column.
  • range: The difference between the maximum and minimum values.
  • sort_order: The sorting order of the column (ASCENDING, DESCENDING, or UNSORTED).

 

NOTE: lengths are byte not char lengths as some UTF-8 characters take more than one byte. They're only computed for String, Float and Integer types.

  • min_length: The length of the shortest value in the column.
  • max_length: The length of the longest value in the column.
  • sum_length: The total sum of the lengths of the column. Returns *OVERFLOW* when the sum is greater than u64::MAX
  • avg_length: The average length of the column. Returns *OVERFLOW* when sum_length overflows.

 

  • mean: The average value of the column.
  • sem: Standard Error of the Mean, a measure of the precision of the sample mean.
  • stddev: Standard deviation, a measure of variability in the data.
  • variance: The average of the squared differences from the mean.
  • cv: Coefficient of Variation, the ratio of the standard deviation to the mean.
  • nullcount: The number of null or empty values in the column.
  • max_precision: The maximum number of decimal places in numeric values.
  • sparsity: The proportion of null or empty values in the column.

Advanced Statistics (requires --everything or specific flags)

  • median: The middle value when the data is sorted (requires --median or --everything) When --quartiles is specified, this is not returned as its the same as "q2_median".
  • mad: Median Absolute Deviation, a robust measure of variability (requires --mad or --everything).  
     
    (requires --quartiles or --everything).
  • lower_outer_fence: Q1 - 3 * IQR, used to identify extreme outliers.
  • lower_inner_fence: Q1 - 1.5 * IQR, used to identify mild outliers.
  • q1: First quartile (25th percentile).
  • q2_median: Second quartile (50th percentile, same as median).
  • q3: Third quartile (75th percentile).
  • iqr: Interquartile Range, the difference between Q3 and Q1.
  • upper_inner_fence: Q3 + 1.5 * IQR, used to identify mild outliers.
  • upper_outer_fence: Q3 + 3 * IQR, used to identify extreme outliers.
  • skewness: A measure of the asymmetry of the probability distribution.

 

  • cardinality: The number of unique values in the column (requires --cardinality or --everything).
  • mode: The most frequent value(s) in the column (requires --mode or --everything).
  • mode_count: The number of modes.
  • mode_occurrences: The number of times the mode(s) appear.
  • antimode: The least frequent non-zero/non-null value(s) in the column.
  • antimode_count: The number of antimodes.
  • antimode_occurrences: The number of times the antimode(s) appear.

Date and Time Statistics

When --infer-dates is enabled, several statistics are computed as follows:

  • Date range, standard deviation, variance, MAD, and IQR are returned in days.
  • DateTime results are in RFC3339 datetime format (YYYY-MM-DDTHH:MM:SS±HH:MM) - e.g. 2022-01-01T00:16:00+00:00
  • Date results are in RFC3339 date format (YYYY-MM-DD) in the UTC timezone.

Notes

  • The default "streaming" statistics (sum, min/max/range, sort order, min/max/sum/avg length, mean, sem, stddev, variance, cv, nullcount, max_precision, sparsity) works with constant memory and can be computed efficiently on arbitrarily large CSV files.
  • Advanced statistics require loading the entire file into memory and must be explicitly enabled.
  • The command supports various caching options to improve performance on subsequent runs.
  • The stats command is central to qsv and underpins other "smart" commands like frequency, schema, validate, and tojsonl that uses the statistical info to work smarter and faster.

For more detailed information on specific options and usage, refer to the qsv stats --help output.