This is a framework for automating high-permutation, large-scale ML analyses. While it was developed for research into predicting post-surgical outcomes for patients with DCM, it can easily be extended to analyze any tabular dataset.
- Clone this repository to wherever you need it:

  ```bash
  git clone https://github.com/SomeoneInParticular/dcm-classic-ml.git
  ```

- Create a new Conda/Mamba environment with the dependencies needed:

  ```bash
  conda env create -f classic_ml_reloaded.yml
  # or, if you use Mamba
  mamba env create -f classic_ml_reloaded.yml
  ```

- Activate said environment:

  ```bash
  conda activate classic_ml_reloaded
  # or, if you use Mamba
  mamba activate classic_ml_reloaded
  ```

- Done!
This only sets up the tool to be run; you will still need to create the configuration files for the analyses you want to run (see `testing/` for an example).
Four files are needed to run an analysis:

- A tabular dataset, containing the metrics you want to run the analysis on.
  - It should contain at least one independent and one target metric; unsupervised analyses are currently not supported.
- A data configuration file; this defines where a dataset is and what pre-processing methods should be applied to its contents. An example, alongside the dataset it manages, can be found in `testing/iris_data/`.
- A model configuration file; this defines which ML model to test, which hyper-parameters to tune, and how to tune them. A few examples are available in `testing/model_configs/`.
- A study configuration file; this defines which metrics to evaluate throughout the runtime of the analysis, and where to save the results (currently only an SQLite DB output format is supported). An example is provided in `testing/testing_study_config.json`.
Once all three configuration files have been created, and you have installed all dependencies (detailed in `classic_ml_reloaded.yml`), simply run the following command (replacing the values within the curly brackets with the corresponding file names):

```bash
python run_ml_analysis.py -d {data_config} -m {model_config} -s {study_config}
```
The overall structure of the analysis can be broken down into the following broad steps:
- Configuration Loading: All configuration files are loaded and checked for validity.
- Dataset Loading: The tabular dataset designated in the data configuration file is loaded.
  - If a target column is specified, it is split off the dataset at this point to isolate it from pre-processing (see below).
- Study Initialization: An Optuna study is initialized, set up to run `n_trials` trials as specified in the study config file.
  - All steps past this point occur per-trial, sampling from the corresponding `Trial` instance to determine the hyperparameters to use.
  - Configuration files denote a parameter as being "trial tunable" by placing a dictionary in the place of a constant; an example of this can be seen in the `penalty` parameter for the `testing/model_configs/log_reg.json` file.
  - Details on how hyper-parameters are sampled via Optuna Trials can be found [here](https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/002_configurations.html).
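  For illustration, a minimal sketch of how such a "trial tunable" dictionary could be resolved into a concrete value via Optuna's suggestion API is shown below. The helper name and the dictionary keys (`type`, `choices`, `low`, `high`) are hypothetical stand-ins, not this repository's actual schema (see `testing/model_configs/log_reg.json` for the real format); only the `trial.suggest_*` calls are standard Optuna API.

  ```python
  import optuna

  # Hypothetical sketch: this helper and the keys it reads are illustrative only,
  # not code or schema from this repository. The trial.suggest_* calls are
  # standard Optuna API (see the link above).
  def resolve_param(trial: optuna.Trial, name: str, spec):
      if not isinstance(spec, dict):
          return spec  # constants are used as-is
      if spec["type"] == "categorical":
          return trial.suggest_categorical(name, spec["choices"])
      if spec["type"] == "float":
          return trial.suggest_float(name, spec["low"], spec["high"])
      if spec["type"] == "int":
          return trial.suggest_int(name, spec["low"], spec["high"])
      raise ValueError(f"Unknown parameter specification: {spec}")
  ```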
- Universal Pre-Processing: Any data processing hooks for which `"run_per_replicate": true` are run on the dataset in its entirety.
  - If a data processing hook does not specify a `run_per_replicate` value, it defaults to `true`.
- In-Out Splits: The dataset is split via a stratified k-fold split into in- and out-groups, `n_replicates` times.
  - As the parameter name implies, each of these splits will make up an analytical "replicate".
  - Any post-split hooks for which `"run_per_replicate": true` will also run here, fitting to the in-dataset and transforming both the in- and out-dataset if possible.
  - If a data processing hook does not specify a `run_per_replicate` value, it defaults to `true`.
  - NOTE: Despite this occurring per-trial, the RNG state being fixed prior to study start ensures that the in-out datasets are the same for all trials, so long as universal pre-processing did not delete any samples during its runtime.
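  To make this concrete, here is a rough sketch of the in-out splitting and the fit-on-in / transform-both behaviour of post-split hooks, using scikit-learn components as stand-ins (the function name, the seed value, and the use of `StandardScaler` as the "hook" are all illustrative, not this repository's actual code):

  ```python
  import pandas as pd
  from sklearn.model_selection import StratifiedKFold
  from sklearn.preprocessing import StandardScaler

  # Illustrative stand-in only. Because the random state is fixed, every trial
  # sees identical in/out splits, provided earlier pre-processing did not
  # change the number of samples.
  def make_in_out_splits(x: pd.DataFrame, y: pd.Series, n_replicates: int, seed: int = 42):
      splitter = StratifiedKFold(n_splits=n_replicates, shuffle=True, random_state=seed)
      for in_idx, out_idx in splitter.split(x, y):
          x_in, x_out = x.iloc[in_idx], x.iloc[out_idx]
          # A post-split hook (here, a stand-in scaler) is fit on the in-dataset
          # only, then applied to both the in- and out-datasets.
          scaler = StandardScaler().fit(x_in)
          yield scaler.transform(x_in), scaler.transform(x_out), y.iloc[in_idx], y.iloc[out_idx]
  ```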
- Replicate Pre-Processing: For each in-dataset, any data processing hooks for which `"run_per_cross": true` are run on the in-dataset.
  - If a data processing hook does not specify a `run_per_cross` value, it defaults to `false`.
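  The two defaults can be summarised with a small hypothetical helper; neither the function nor its name comes from this repository, only the key names and defaults described above:

  ```python
  # Illustrative helper, not repository code: "run_per_replicate" defaults to
  # True when unspecified, while "run_per_cross" defaults to False.
  def hook_flags(hook_config: dict) -> tuple[bool, bool]:
      run_per_replicate = hook_config.get("run_per_replicate", True)
      run_per_cross = hook_config.get("run_per_cross", False)
      return run_per_replicate, run_per_cross
  ```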
- Train-Test Splits: The in-dataset (i.e. the validation dataset) is split via a stratified k-fold split into `n_crosses` splits, as defined in the study configuration file.
  - As the parameter name implies, each of these splits will make up an analytical "cross".
  - Any post-split hooks for which `"run_per_cross": true` will also run here, fitting to the train dataset and transforming both the train and test set if possible.
  - If a data processing hook does not specify a `run_per_cross` value, it defaults to `false`.
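  Putting the two levels of splitting together, a single trial effectively runs a nested loop along these lines (a self-contained sketch using scikit-learn stand-ins for the model and metric; none of the names are this repository's actual API):

  ```python
  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import balanced_accuracy_score
  from sklearn.model_selection import StratifiedKFold

  # Illustrative outline of the replicate/cross nesting described above; the
  # model, metric, and seed are arbitrary stand-ins, not repository code.
  def run_trial(x: np.ndarray, y: np.ndarray, n_replicates: int, n_crosses: int, seed: int = 42):
      scores = []
      outer = StratifiedKFold(n_splits=n_replicates, shuffle=True, random_state=seed)
      for in_idx, _out_idx in outer.split(x, y):               # one "replicate" per split
          x_in, y_in = x[in_idx], y[in_idx]
          inner = StratifiedKFold(n_splits=n_crosses, shuffle=True, random_state=seed)
          for train_idx, test_idx in inner.split(x_in, y_in):  # one "cross" per split
              model = LogisticRegression(max_iter=1000).fit(x_in[train_idx], y_in[train_idx])
              scores.append(balanced_accuracy_score(y_in[test_idx], model.predict(x_in[test_idx])))
      return scores
  ```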
- Cross-Validated Performance Reporting: Any metrics that the user requested be tracked are calculated. These metrics are defined in the study config under one of the following hooks:
  - `train`: Evaluate the metric on a model trained on the training set, taking the metric from the model itself or from the model's output when applied to the test set.
    - As this is run once per cross, each metric specified at this hook will result in `n_crosses` values being output (each denoted as `{metric_name} [{cross_idx}]`).
  - `validate`: Evaluate the metric on a model trained on the in-dataset, taking the metric from the model itself or from the model's output when applied to the in-dataset.
  - `test`: Evaluate the metric on a model trained on the in-dataset, taking the metric from the model itself or from the model's output when applied to the out-dataset.
  - `objective`: Evaluated identically to the `train` hook, but reported as an average, both to you and to the study instance (allowing the study to guide the hyperparameter sampling in future trials).
    - Currently, only one `objective` metric can be defined due to this averaging.
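  As a concrete illustration of the naming and averaging described above (the metric name, its values, and the 0-based cross index are all made up; only the `{metric_name} [{cross_idx}]` pattern and the averaging of the single `objective` metric come from the description above):

  ```python
  # Made-up per-cross values for a hypothetical metric, with n_crosses = 3
  per_cross_auc = [0.81, 0.79, 0.84]

  # Each cross's value is reported under "{metric_name} [{cross_idx}]"
  reported = {f"auc [{cross_idx}]": value for cross_idx, value in enumerate(per_cross_auc)}

  # The objective metric is additionally averaged; this single value is what
  # is handed back to the Optuna study to guide later trials
  objective_value = sum(per_cross_auc) / len(per_cross_auc)

  print(reported)         # {'auc [0]': 0.81, 'auc [1]': 0.79, 'auc [2]': 0.84}
  print(objective_value)  # ~0.813
  ```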