-
Notifications
You must be signed in to change notification settings - Fork 74
Supplemental
Joel Natividad edited this page Sep 24, 2024
·
10 revisions
The qsv stats
command computes summary statistics and infers data types for each column in a CSV file. Here's a detailed explanation of the output:
- field: The name of the column (or its index if --no-headers is used).
-
type: Inferred data type (NULL, Integer, String, Float, Date, DateTime, or Boolean).
- data type inferences are GUARANTEED as
stats
scans the entire file. - Date and DateTime are only inferred when
--infer-dates
is enabled. It can infer 19 date formats. - Booleans are only inferred when
-infer-boolean
is enabled. The heuristic for inferring booleans is as follows: When a column's cardinality is 2, and the 2 values' first characters are 0/1, t/f & y/n case-insensitive, the data type is inferred as boolean.
- data type inferences are GUARANTEED as
- is_ascii: Whether the column contains only ASCII characters (true/false).
-
sum: The total sum of all numeric values in the column. Will return
*OVERFLOW*
/*UNDERFLOW*
when the sum is greater than/lesser than i64::MAX/i64::MIN respectively. - min: The minimum value in the column.
- max: The maximum value in the column.
- range: The difference between the maximum and minimum values.
- sort_order: The sorting order of the column (ASCENDING, DESCENDING, or UNSORTED).
NOTE: lengths are byte not char lengths as some UTF-8 characters take more than one byte. They're only computed for String, Float and Integer types.
- min_length: The length of the shortest value in the column.
- max_length: The length of the longest value in the column.
-
sum_length: The total sum of the lengths of the column. Returns
*OVERFLOW*
when the sum is greater than u64::MAX -
avg_length: The average length of the column. Returns
*OVERFLOW*
whensum_length
overflows.
- mean: The average value of the column.
- sem: Standard Error of the Mean, a measure of the precision of the sample mean.
- stddev: Standard deviation, a measure of variability in the data.
- variance: The average of the squared differences from the mean.
- cv: Coefficient of Variation, the ratio of the standard deviation to the mean.
- nullcount: The number of null or empty values in the column.
- max_precision: The maximum number of decimal places in numeric values.
- sparsity: The proportion of null or empty values in the column.
-
median: The middle value when the data is sorted (requires
--median
or--everything
) When--quartiles
is specified, this is not returned as its the same as "q2_median". -
mad: Median Absolute Deviation, a robust measure of variability (requires
--mad
or--everything
).
(requires--quartiles
or--everything
). - lower_outer_fence: Q1 - 3 * IQR, used to identify extreme outliers.
- lower_inner_fence: Q1 - 1.5 * IQR, used to identify mild outliers.
- q1: First quartile (25th percentile).
- q2_median: Second quartile (50th percentile, same as median).
- q3: Third quartile (75th percentile).
- iqr: Interquartile Range, the difference between Q3 and Q1.
- upper_inner_fence: Q3 + 1.5 * IQR, used to identify mild outliers.
- upper_outer_fence: Q3 + 3 * IQR, used to identify extreme outliers.
- skewness: A measure of the asymmetry of the probability distribution.
-
cardinality: The number of unique values in the column (requires
--cardinality
or--everything
). -
mode: The most frequent value(s) in the column (requires
--mode
or--everything
). - mode_count: The number of modes.
- mode_occurrences: The number of times the mode(s) appear.
- antimode: The least frequent non-zero/non-null value(s) in the column.
- antimode_count: The number of antimodes.
- antimode_occurrences: The number of times the antimode(s) appear.
When --infer-dates
is enabled, additional date-specific statistics are computed:
- Date range, standard deviation, variance, MAD, and IQR are returned in days.
- DateTime results are in RFC3339 datetime format (YYYY-MM-DDTHH:MM:SS±HH:MM) - e.g. 2022-01-01T00:16:00+00:00
- Date results are in RFC3339 date format (YYYY-MM-DD) in the UTC timezone.
- The default "streaming" statistics (sum, min/max/range, sort order, min/max/sum/avg length, mean, sem, stddev, variance, cv, nullcount, max_precision, sparsity) works with constant memory and can be computed efficiently on arbitrarily large CSV files.
- Advanced statistics require loading the entire file into memory and must be explicitly enabled.
- The command supports various caching options to improve performance on subsequent runs.
- The stats command is central to qsv and underpins other "smart" commands like
frequency
,schema
,validate
, andtojsonl
that uses the statistical info to work smarter and faster.
For more detailed information on specific options and usage, refer to the qsv stats --help
output.