This repo provides methods of measuring how two empirical distributions differ.
- Download the latest version of D-SHIFT from the Releases tab
- Create configuration files. See the example.
- Run with
java -cp <application_jar> stats.EvaluateDistribution <path_to_config_a> <path_to_config_b> ...
- Scala 2.11
- Spark 2.4.4
- circe (JSON library for Scala)
Each method reports its corresponding statistic
- Two-sample Kolmogorov-Smirnov test => max KS distance
- Kullback–Leibler divergence => how much information is lost when the origin distribution is approximated by the current distribution
- Other distribution comparison methods
- Test of significance
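As a rough illustration of the max KS distance mentioned above, the sketch below compares the empirical CDFs of two small in-memory samples and returns the largest gap. It is plain Scala without Spark, and the names `KsSketch`, `ecdf`, and `ksDistance` are hypothetical; this is not necessarily how D-SHIFT computes the statistic internally.

```scala
// Illustrative sketch only: object and method names are assumptions, not D-SHIFT's API.
object KsSketch {
  /** Empirical CDF of `sample` at `x`: the fraction of values <= x. */
  def ecdf(sample: Seq[Double], x: Double): Double =
    sample.count(_ <= x).toDouble / sample.size

  /** Largest absolute gap between the two empirical CDFs,
    * checked at every distinct observed value (quadratic, fine for small samples). */
  def ksDistance(origin: Seq[Double], current: Seq[Double]): Double = {
    require(origin.nonEmpty && current.nonEmpty, "both samples must be non-empty")
    (origin ++ current).distinct
      .map(x => math.abs(ecdf(origin, x) - ecdf(current, x)))
      .max
  }
}
```

For example, `KsSketch.ksDistance(Seq(1.0, 2.0, 3.0, 4.0), Seq(3.0, 4.0, 5.0, 6.0))` returns `0.5`, while two identical samples give `0.0`.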
Take a look at the following config.
{
  "eval_method": "kstest",
  "compared_col": {
    "origin_sample_column": "",
    "current_sample_column": ""
  },
  "source": {
    "format": "",
    "path_to_origin_sample": "",
    "path_to_current_sample": ""
  },
  "options": {
    "method": "",
    "numOfBin": null,
    "rounding": null
  }
}
- eval_method: the method used to evaluate how two empirical distributions differ
- Currently supported methods:
  - Two-sample KS test
  - KL divergence
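To make the KL divergence option more concrete, here is a minimal sketch of the discrete formula D_KL(P || Q) = sum of p_i * ln(p_i / q_i), with P as the origin distribution and Q as the current one. It assumes both distributions are already expressed as probabilities over the same bins. The name `KlSketch` and the handling of zero-probability bins are assumptions made for illustration; D-SHIFT may treat those cases differently.

```scala
// Illustrative sketch only: names and zero-handling are assumptions, not D-SHIFT's API.
object KlSketch {
  /** D_KL(P || Q) in nats for two discrete distributions over the same bins. */
  def klDivergence(p: Seq[Double], q: Seq[Double]): Double = {
    require(p.size == q.size, "distributions must share the same bins")
    p.zip(q).map {
      case (pi, _) if pi == 0.0 => 0.0                     // a bin with no origin mass contributes 0
      case (_, qi) if qi == 0.0 => Double.PositiveInfinity // origin has mass where current has none
      case (pi, qi)             => pi * math.log(pi / qi)
    }.sum
  }
}
```

The measure is not symmetric: `klDivergence(p, q)` and `klDivergence(q, p)` generally differ, so it matters which sample is treated as the origin.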
- compared_col: the numeric column whose distribution will be compared
- The two samples may use different column names for the same data (e.g. Sex in the first sample and Gender in the second), so this field has two sub-fields:
  - origin_sample_column: column from the sample with the expected distribution (e.g. stored data that serves as the base distribution)
  - current_sample_column: column from the sample with the actual distribution (e.g. new data whose distribution may or may not match that of the stored data)
- source: the properties of the data sources
- The sub-fields include the following:
  - format: file format of the samples; csv and parquet are currently supported
  - path_to_origin_sample: path to the origin sample set
  - path_to_current_sample: path to the current sample set
- options: several treatments applied during the internal computation
  - method: whether to discretize the data (binned) or not (normal)
  - numOfBin: number of bins when the selected method is binned
  - rounding: number of digits after the decimal point
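The sketch below shows one plausible reading of the binned and rounding treatments: values are placed into `numOfBin` equal-width bins over a shared range and turned into relative frequencies, and a value is rounded to a fixed number of decimal digits. The equal-width scheme, the explicit `lo`/`hi` range, the name `BinningSketch`, and the choice of what gets rounded are all assumptions; the text above does not specify how D-SHIFT bins or rounds internally.

```scala
// Illustrative sketch only: binning scheme, names, and parameters are assumptions.
object BinningSketch {
  /** Round `x` to `digits` digits after the decimal point (half-up). */
  def round(x: Double, digits: Int): Double =
    BigDecimal(x).setScale(digits, BigDecimal.RoundingMode.HALF_UP).toDouble

  /** Relative frequency of each of `numOfBin` equal-width bins over [lo, hi]. */
  def binnedFrequencies(sample: Seq[Double], numOfBin: Int,
                        lo: Double, hi: Double): Vector[Double] = {
    require(numOfBin > 0 && hi > lo && sample.nonEmpty, "invalid binning arguments")
    val width  = (hi - lo) / numOfBin
    val counts = Array.fill(numOfBin)(0)
    sample.foreach { x =>
      // Clamp the index so x == hi (or any out-of-range value) lands in an edge bin.
      val idx = math.max(0, math.min(((x - lo) / width).toInt, numOfBin - 1))
      counts(idx) += 1
    }
    counts.map(_.toDouble / sample.size).toVector
  }
}
```

Using the same `lo`, `hi`, and `numOfBin` for both samples keeps the bins aligned, which is what makes the two frequency vectors comparable bin by bin.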
- PRs are welcome!
- You may add other distribution comparison methods
- You also may add a feature for test of significance
- Bug fixes and feature requests are welcome too