version 03/16/2023 (M/d/y)
The Clinical Biomarker Detection pipeline presented in this repository applies pre-processing and machine learning-based approaches to identify strong biomarkers for a given disease in clinical cohort data. The pipeline is currently designed to apply Support Vector Machine classification to predict a binary target feature (e.g. disease) in combination with other configurable techniques and processing steps that can drastically improve the predictive power.
The repository is composed of the main pipeline script CBD_pipeline_SVM_HPC.py and the configuration file CBDP_config.py, which must be configured with respect to the clinical data and the research topic.
Furthermore, the source folder contains all necessary functions used by the pipeline, including modified files for the Python packages imblearn, eli5, and mlxtend.
The Python packages necessary for this analysis can be found and installed into your working environment via the requirements.txt file using `pip install -r requirements.txt`. Python version used: 3.8.6.
/!\ /!\ /!\ After installing the required packages, the files in the imblearn, eli5, and mlxtend package folders need to be replaced with the modified files from the source folder in order to enable parallelized computing and additional automations.
- In eli5: The file permutation_importance.py in the ./env/lib/.../eli5/ folder must be replaced by this permutation_importance.py file.
- In mlxtend: The files feature_importance.py and __init__.py in the ./env/lib/.../mlxtend/evaluate/ folder must be replaced by the two files feature_importance.py and __init__.py.
- In imblearn: The file base.py in the ./env/lib/.../imblearn/over_sampling/_smote/ folder must be replaced by this base.py file.
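These replacements can also be scripted. Below is a minimal sketch, assuming the modified files are kept in a local source/ folder under the names listed above (the source paths are assumptions; adjust them to your checkout):

```python
# Minimal sketch: overwrite the installed package files with the modified
# versions from this repository. Source paths are assumptions; adjust as needed.
import shutil
from pathlib import Path

import eli5
import imblearn
import mlxtend

replacements = {
    "source/permutation_importance.py": Path(eli5.__file__).parent / "permutation_importance.py",
    "source/feature_importance.py": Path(mlxtend.__file__).parent / "evaluate" / "feature_importance.py",
    "source/__init__.py": Path(mlxtend.__file__).parent / "evaluate" / "__init__.py",
    "source/base.py": Path(imblearn.__file__).parent / "over_sampling" / "_smote" / "base.py",
}
for src, dst in replacements.items():
    shutil.copyfile(src, dst)  # overwrite the installed file with the modified one
    print(f"Replaced {dst}")
```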
- Step (#9): Refers to the configurable processing steps
- Technique (#16): Point where a technique must be selected
- Specification (#43): Possible configurable technique specifications
- Starting point (#1): Pipeline entry point
- Pipe funnel point (#1): Pipeline funnel exit
- Ending point (#1): End of pipeline and results output
Abbreviations:
- T/F: True/False
- rus: random under-sampling
- smote: synthetic minority over-sampling technique
- chi_sq: chi squared
- PCA: principal component analysis
- LDA: linear discriminant analysis
Note:
- Only one possibility of pipeline-order is shown in detail in the figure above, namely samples->features. In case of features->samples, the pipeline steps IR and FT are swapped, meaning that FT is performed before IR. In case IR and FT are both disabled in the configuration file, these steps will be skipped, except for the standard scaling of continuous features during FT, which is the minimum transformation one should apply before passing data to a Support Vector Machine classifier. These two alternative pipeline-order cases are represented by the black dashed lines.
- Regarding the feature transformation techniques, it is possible to select a combination of continuous and categorical techniques (e.g. PCA + SelectKBest) as well as a single transformation, e.g. PCA for the continuous features. In that case, any categorical features will be passed through the pipeline without transformation. A later update of the pipeline may include the choice between passing these features through or dropping them.
- Input:
Entry of the pipeline. The pipeline is designed to take as input a preferably cleaned and imputed version of the data, which is already split into training and test sets.
- Data splitting (DS):
Step of splitting the input data sets based on a given binary feature, e.g. gender, a specific disease, ... If no splitting feature is given or the step is disabled, it will be ignored.
- Subgroups (SG):
Step of data subgroup selection based on prefixes previously added to the feature names, e.g. BF- for body fluids, PM- for physical measurements, ... If no subgroups are selected or the step is disabled, all input features will be passed to the next step. This step is specifically designed to match and select given subgroups, but individual feature names can be added to the selection as well (see the sketch below).
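As an illustration, selecting subgroups by prefix can be as simple as the following sketch (the pandas DataFrame `train` and the prefix values are assumptions):

```python
# Hypothetical illustration: keep only the body fluid (BF-) and physical
# measurement (PM-) subgroups; str.startswith accepts a tuple of prefixes.
subgroups_to_keep = ('BF-', 'PM-')
selected = [col for col in train.columns if col.startswith(subgroups_to_keep)]
train_subset = train[selected]
```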
- Remove engineered input (REI):
Step to remove features or groups of features that are already represented in an engineered feature of the data set, to avoid redundant information and correlated features in the data set. Such features can include entire subgroups that were engineered into other features, but individual feature names can be added to this list as well. If no features are selected or the step is disabled, this step will be ignored.
- Constancy check (CC):
Step to remove features that are constant or near-constant. For near-constancy, three different rules apply depending on the feature data type. For continuous features, the variance-to-mean ratio is calculated, and the feature is defined as near-constant if this ratio does not meet the minimum threshold given in the configuration file. For binary categorical features, the feature is defined as near-constant if the sum of the positive class does not exceed the number of stratified splits used during cross-validation, as given in the configuration file. Non-binary categorical features are defined as near-constant if the sum of all non-zero classes does not exceed the number of stratified splits (a sketch follows below).
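The three near-constancy rules can be summarized in a sketch like the following (the threshold and split values mirror the configuration defaults; the function name and inputs are illustrative):

```python
import numpy as np

def is_near_constant(values, kind, thresh=0.001, splits=10):
    """Illustrative near-constancy rules; `values` is a 1-D numpy array of one
    feature, `kind` is 'continuous', 'binary', or 'categorical'."""
    if kind == 'continuous':
        return np.var(values) / np.mean(values) < thresh  # variance-to-mean ratio
    if kind == 'binary':
        return values.sum() <= splits  # positive class count vs. number of CV splits
    return np.count_nonzero(values) <= splits  # non-zero class count vs. CV splits
```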
- Remove highly correlated features (RHCF):
Step to remove the highly correlated features in the data set for the following associations of possible feature data types: continuous-continuous, categorical-categorical, and continuous-categorical. If disabled, the step will be ignored. The following techniques are applied (a sketch of the continuous-continuous case follows this list):
- continuous-continuous: Spearman's rank-order correlation using a decimal or percentile threshold
- categorical-categorical: Corrected Cramer's V Correlation using decimal or percentile threshold
- continuous-categorical: point biserial correlation using a decimal or percentile threshold (correlated features will be removed from the longer list)
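For the continuous-continuous case, a minimal sketch with a percentile threshold could look like this (`X_continuous`, a samples-by-features array, and the 95th percentile are assumptions taken from the example configuration):

```python
import numpy as np
from scipy.stats import spearmanr

corr, _ = spearmanr(X_continuous)             # pairwise Spearman correlation matrix
upper = np.triu(np.abs(corr), k=1)            # keep each feature pair only once
cutoff = np.percentile(upper[upper > 0], 95)  # percentile threshold from the config
# column indices of the second feature of every pair above the cutoff, to be dropped
to_drop = sorted({j for i, j in zip(*np.where(upper > cutoff))})
```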
- Imbalance resampling (IR):
Step of the classification pipeline to resample imbalanced data. The order of steps in the classification pipeline can be defined in the configuration file, e.g. resampling before feature transformation or vice versa. If disabled, the step will be ignored. If enabled, the following techniques can be selected (see the usage sketch after this list):
- Random under-sampling (RUS): Randomly select majority class samples to equal the number of minority class samples.
- Synthetic minority over-sampling technique (SMOTE): Creation of synthetic minority class samples using the k-nearest neighbors algorithm to equal the number of majority class samples. In more detail, the variant SMOTENC is used, which is specifically designed to over-sample continuous and categorical features together (the original SMOTE does not distinguish between the two types). To ensure proper functioning of SMOTENC in all pipeline-order cases, it is recommended to update the base SMOTENC function as explained above in the 'Requirements' section.
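A usage sketch of the two resamplers, assuming training arrays `X_train` and `y_train` and a list `cat_idx` of categorical column positions:

```python
from imblearn.over_sampling import SMOTENC
from imblearn.under_sampling import RandomUnderSampler

resampling_tech = 'rus'  # or 'smote', as in the configuration file
if resampling_tech == 'rus':
    sampler = RandomUnderSampler(random_state=42)
else:
    # SMOTENC over-samples continuous and categorical features together
    sampler = SMOTENC(categorical_features=cat_idx, k_neighbors=5, random_state=42)
X_res, y_res = sampler.fit_resample(X_train, y_train)
```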
- Feature transformation (FT):
Step of the classification pipeline to transform the features. The order of steps in the classification pipeline can be defined in the configuration file, e.g. resampling before feature transformation or vice versa. If disabled, the step will be ignored. If enabled, the following techniques can be defined (see the sketch after this list):
- Scaler technique: Select between the standard scaler (distribution centered around 0, standard deviation of 1, mean removed), the robust scaler (median removed and data scaled according to the quantile range), or the min-max scaler (scaling each feature to a specific range like [0, 1]). The scaler technique will only be applied to the continuous features, with the standard scaler being the default if none is selected. The default scaling technique will also be applied alone in case this step is disabled.
- Feature technique: Select between linear PCA, LDA, and non-linear kernel PCA for continuous features. For categorical features, currently only the SelectKBest method using chi-squared or Cramer's V correlation is available. If the step is disabled, the features will not be transformed. Please note that in case of non-linear kernel PCA, the classifier kernel will be forced to be linear in order to avoid applying non-linear kernel transformations twice (if linear PCA or LDA is selected, non-linear classifier kernels are allowed).
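One possible FT configuration could be sketched as follows (the column lists, component count, and k value are assumptions; the built-in `chi2` score function stands in for the Cramer's V option):

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

transformer = ColumnTransformer([
    # standard scaler + linear PCA on the continuous features
    ('continuous', Pipeline([('scale', StandardScaler()),
                             ('pca', PCA(n_components=10))]), cont_cols),
    # SelectKBest with chi squared on the categorical features
    ('categorical', SelectKBest(score_func=chi2, k=10), cat_cols),
], remainder='drop')  # or 'passthrough', cf. drop_or_pass_non_treated_features
```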
- Feature importance (FI):
Step to identify the most important features selected by the classification model, if this step is enabled. In case of linear classification, feature importance by permutation is not necessary and the information can be retrieved directly from the trained estimator using the built-in `SVC.coef_` attribute. In case of non-linear classification, feature importance is measured using the feature permutation algorithm. Here, it is possible to choose between three different methods that show consistent results; it is also possible to select all of them and validate the consistency of the results yourself (see the comparison sketch after this list):
- sklearn's permutation_importance function of the sklearn.inspection group.
- eli5's get_score_importance function of the eli5.permutation_importance group (mod files required).
- mlxtend's feature_importance_permutation function of the mlxtend.evaluate group (mod files required).
- 'all' to run the permuted feature importance with all three methods and to plot comparisons (mod files required).
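A comparison sketch of the three backends on a fitted classifier `clf` and numpy test arrays `X_test`, `y_test` (the repeat counts and metric are illustrative):

```python
from eli5.permutation_importance import get_score_importances
from mlxtend.evaluate import feature_importance_permutation
from sklearn.inspection import permutation_importance

# sklearn: returns an object with mean/std importances per feature
sk_result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=42)

# eli5: expects a score function and returns per-iteration score decreases
base_score, score_decreases = get_score_importances(
    lambda X, y: clf.score(X, y), X_test, y_test, n_iter=10, random_state=42)

# mlxtend: expects the predict method and a metric name
mlx_mean, mlx_all = feature_importance_permutation(
    X=X_test, y=y_test, predict_method=clf.predict,
    metric='accuracy', num_rounds=10, seed=42)
```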
- Box and bar plotting (BBP):
Step to visualize the most important features in a ranked order between the negative and positive classes. If disabled, the step will be ignored. If enabled, the most important categorical and continuous features can be plotted separately or combined, as defined in the configuration file.
- Output:
Classification model of the selected output target, model evaluation summaries and plots, e.g. confusion matrices, ROC-AUC curves, performance metrics, summary plots for the various enabled steps such as heatmaps of the highly correlated features, Venn diagrams of removed features if data is split, comparison plots of feature importance methods if all are enabled, a list of features ranked by their importance, ...
Depending on the configured setup and user preferences, the pipeline can be deployed either on a local machine or on HPC clusters. Please note that this choice has a large effect on the required computational time for the analysis; therefore, the configuration settings should be selected appropriately and with care. The input data must exist as training and test data, preferably cleaned and imputed (no empty values). The feature names in the data set should be preceded by a prefix that refers to the subgroup of clinical data, e.g. body fluids (BF-), physical measurements (PM-), survey (SV-), individual medications (IM-), individual devices (ID-), ...
The configuration file CBDP_config.py presents 84 configurable variables and parameters that define the enabled steps, techniques, and specifications that should be highly specific to the clinical data of interest. The table below summarises the configurable variables, and more precise descriptions are available in the configuration file.
Variable | Example | Description | Type |
---|---|---|---|
seed | 42 | Random seed | int |
fix_font | 18 | Fix font size for general plots | int |
imp_font | 8 | Specific font size for feature importance plots | int |
plot_style | 'fivethirtyeight' | Matplotlib plot style | str |
fig_max_open_warning | 0 | Number of open figures after which matplotlib shows a warning | int
pandas_col_display_option | 5 | Number of columns displayed in pandas dataframe | int |
figure_dpi | 300 | Dot per inches resolution of the result figures | int |
figure_format | 'png' | Desired format for the generated figures | str |
debug | False | Debug statement to decide if figures should be displayed for debugging, select False to avoid 'RuntimeError: main thread is not in main loop' errors when the complete pipeline runs on local machine with threading | bool |
Variable | Example | Description | Type |
---|---|---|---|
curr_dir | os.getcwd() | Pathway to current directory | str, directory |
folder_prefix | 'results/' | Folder name for results can be a folder in folder or prefix | str |
train_path | curr_dir + '/data/train_imputed.csv' | Path to imputed training set | str, file |
test_path | curr_dir + '/data/test_imputed.csv' | Path to imputed test set | str, file
output_feature | 'PM-Frailty_Index' | Target output feature | str, binary feature |
positive_class | 'frail' | Name to give the positive class of the output feature | str |
negative_class | 'non-frail' | Name to give the negative class of the output feature | str |
output_related | ['PM-Frailty_Score', 'PM-Frailty_gait', 'SV-Frailty_exhaustion', 'SV-Frailty_physicalactivity', 'PM-Frailty_gripstrength', 'PM-Gripstrength_max', 'PM-Frailty_weightloss'] | Output-related features | str, list |
sample_tagging_feature | output_related + ['BF-VitDDef'] | Feature used to define samples to tag specifically | str or list of str |
tag_threshold | (('>=', '3'), ('==', '1'), ('==', '1'), ('==', '1'), ('==', '1'), ('==', '1'), ('<=', 'np.nanpercentile(x, 20)'), ('==', '1')) | Threshold to define samples to tag, can be multiples, first position must be math operator | tuple of str or tuples |
Variable | Example | Description | Type |
---|---|---|---|
kernels | ['linear', 'poly', 'rbf', 'sigmoid'] | Kernels to use for the Support Vector Machine classifier | str, list |
non_linear_kernels | ['poly', 'rbf', 'sigmoid'] | Repeat of the above kernels that are non-linear | str, list
cache_size | 200 | Cache size of SVM classifier, 200 (HPC) - 2000 (local) | int |
decision_func_shape | 'ovr' | Decision function shape of classifier, one vs rest 'ovr' or one vs one 'ovo' | str |
clf_verbose | False | Classifier verbose | bool |
grid_verbose | 1 | Grid search verbose | int |
hard_iter_cap | 150000 | Hard iteration cap (stopping criterion) | int
splits | 10 | Stratified k fold splits | int |
scorer | 'F2' | Scorer used during the experimental steps, F.5, F1, F2, F5, roc_auc, matthews_corrcoef, dor (diagnostic odds ratio), balanced_accuracy or accuracy | str |
shuffle_all | 1000 | Number of shuffles for all samples; 1000 proven sufficient for a set of 1200 samples so that each sample receives at least half of the other values (see proof) | int
shuffle_male | 500 | Number of shuffles for the male subgroup; 500 proven sufficient for a set of 600 samples (see proof) | int
shuffle_female | 500 | Number of shuffles for the female subgroup; 500 proven sufficient for a set of 600 samples (see proof) | int
linear_shuffle | True | Feature importance by shuffling in case of PCA+linear SVM if True; else the .coef_ attribute of the linear SVM is used and the sorted averaged loadings of features across all selected PCA components determine the most important features | bool
Variable | Example | Description | Type |
---|---|---|---|
parallel_method | 'ipyparallel' | Parallel backend agent, 'ipyparallel' (HPC); 'threading', 'multiprocess', 'loky' (local) | str
n_jobs | -1 | Number of jobs for distributed tasks, will be adjusted if ipyparallel is enabled | int |
thresh_near_constant | 0.001 | Threshold for continuous feature near-constancy, based on the variance-to-mean ratio | float
enable_data_split | True | True if data should be split based on the binary split feature below | bool |
split_feature | 'PM-sex' | Feature based on which data is split | str |
enable_subgroups | False | True if data shall be limited to subgroups, else full feature input | bool |
subgroups_to_keep | 'all' | Prefix of subgroups to keep for the analysis | tuple of str, str or 'all' |
enable_engineered_input_removal | True | Change to enable or disable removal of engineered input features | bool |
engineered_input_prefix | ('IM-', 'ID-') | Prefix of features used in engineering | str, tuple of str, or empty |
enable_rhcf | True | Change to True to enable & False to disable removing highly correlated features | bool |
thresh_cramer | (0.6, 'decimal') | Corrected Cramer's V correlation threshold, choose decimal or percentile | tuple of int or float and str |
thresh_spearman | (95, 'percentile') | Spearman's Rank correlation threshold, choose decimal or percentile | tuple of int or float and str |
thresh_pbs | (0.6, 'decimal') | Point bi-serial correlation threshold, choose decimal or percentile | tuple of int or float and str |
enable_resampling | True | Change to True to enable & False to disable resampling | bool |
resampling_tech | 'rus' | 'rus' (random under-sampling), 'smote' (synthetic minority over-sampling technique) or empty | str |
enable_ft | True | Change to True to enable & False to disable feature transformation | bool |
scaler_tech | 'standard' | Change scaler function to 'standard', 'minmax' or 'robust' scaler | str |
pca_tech | 'normal_pca' | Select pca technique to choose between 'normal_pca' and 'kernel_pca' | str |
da_tech | 'lda' | Select discriminant analysis tech for continuous features, 'lda' (LDA, later QDA) | str |
kbest_tech | 'cramer' | Select score function for kbest technique, 'chi2', 'cramer', or callable score func | str or callable |
pipeline_order | 'samples->features' | Order of the steps either 'samples->features' or 'features->samples' | str |
drop_or_pass_non_treated_features | 'drop' | Either 'drop' or 'passthrough' untransformed features | str |
enable_feature_importance | True | Change to True to enable & False to disable feature importance | bool |
top_features | 40 | Top integer positions of most important features for deeper analysis, default 40 | int |
feature_importance_method | 'all' | Change to 'eli5', 'mlxtend', 'sklearn', or 'all' to enable methods, default 'all' | str |
enable_box_bar_plots | True | True to enable box and bar plots of most important features or False to disable, default True | bool |
box_bar_figures | 'combined' | Whether the box and bar plots should be separated or combined figure, 'separated' or 'combined' | str |
Variable | Example | Description | Type |
---|---|---|---|
regularization_lpsr | [x for x in np.logspace(-2, 6, 9)] | Regularization parameter, default 1 | int |
shrinking_lpsr | [True, False] | Shrinking heuristic, default True | bool |
tolerance_lpsr | [x for x in np.logspace(-4, -2, 3)] | Stopping criterion tolerance, default 0.001 | float |
gamma_psr | ['scale', 'auto', 0.00001, 0.0001, 0.001, 0.01, 0.1, 1.0, 10] | Single training influence, default 'scale' | str or float
degree_p | [2, 3, 4, 5] | Polynomial degree, default 3 | int |
coef0_ps | [0.0, 0.01, 0.1, 0.5] | Independent term in kernel function, default 0.0 | float |
k_neighbors_smote_lpsr | [2, 3, 5] | K nearest neighbor for smote resampling, default 5 | int or a kneighborsmixin func |
k_best_lpsr | [1, 2, 5, 10, 15] | Number of k best features to select by chi squared, default 10 | int |
pca_lpsr | [2, 5, 10, 15, 20] | Number of PCA components, default None | int |
kernel_pca_kernel_lpsr | ['poly', 'rbf', 'sigmoid'] | kernels for kernelPCA, default 'linear' | str |
kernel_pca_lpsr | [2, 5, 10, 15, 20] | Number of components, default None | int |
kernel_pca_tol_lpsr | [0.0, 0.001, 0.01] | Tolerance, default 0 | float |
kernel_pca_gamma_lpsr | [None, 0.1, 1.0, 10.0] | Gamma parameter, default None | float |
kernel_pca_degree_lpsr | [2, 3, 4, 5] | Polynomial degree, default 3 | int |
kernel_pca_coef0_lpsr | [0.1, 0.5, 1.0] | Coef0 parameter, default 1 | float |
lda_shrinkage_lpsr | [None] | This should be left to None if no covariance estimator is used, default None | float |
lda_priors_lpsr | [None] | Class prior probabilities; proportions are inferred from training data if left as default, default None | np.array
lda_components_lpsr | [1] | LDA components; if None, will be set to min(n_classes-1, n_features), default None | int
lda_tol_lpsr | [0.0001, 0.001, 0.01] | Tolerance for a singular value x to be considered significant, default 0.0001 | float
Variable | Example | Description | Type |
---|---|---|---|
total_params_and_splits | {'regularization_lpsr': regularization_lpsr, 'shrinking_lpsr': shrinking_lpsr, 'tolerance_lpsr': tolerance_lpsr, 'gamma_psr': gamma_psr, 'coef0_ps': coef0_ps, 'degree_p': degree_p, 'pca_lpsr': pca_lpsr, 'k_best_lpsr': k_best_lpsr, 'k_neighbors_smote_lpsr': k_neighbors_smote_lpsr, 'splits': splits} | Dictionary of parameters for SVC and normal PCA | dict |
pca_kernel_dict | {'kpca_components_lpsr': kernel_pca_lpsr, 'kpca_kernel_lpsr': kernel_pca_kernel_lpsr, 'kpca_gamma_lpsr': kernel_pca_gamma_lpsr, 'kpca_tol_lpsr': kernel_pca_tol_lpsr, 'kpca_degree_lpsr': kernel_pca_degree_lpsr, 'kpca_coef0_lpsr': kernel_pca_coef0_lpsr} | Dictionary of parameters specific to the kernel PCA technique | dict |
lda_dict | {'lda_shrinkage_lpsr': lda_shrinkage_lpsr, 'lda_priors_lpsr': lda_priors_lpsr, 'lda_components_lpsr': lda_components_lpsr, 'lda_tol_lpsr': lda_tol_lpsr} | Dictionary of parameters specific to the DA technique (currently only LDA) | dict |
additional_params | False | Change to True if additional non pre-supported parameters are added | bool |
additional_kernel_params | {} | Add additional kernel parameter to introduce here if not supported already | dict |
additional_technique_params | {} | Add additional technique parameter to introduce here if not supported already | dict |
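If a parameter that is not yet pre-supported should be searched, the three variables above could, for example, be set as follows (the `break_ties` SVC argument and the grid format are only an illustration, not the pipeline's confirmed schema):

```python
# Illustrative only: introduce a non pre-supported SVC parameter to the grid
additional_params = True
additional_kernel_params = {'break_ties': [True, False]}  # extra SVC argument
additional_technique_params = {}                          # nothing extra here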
For running the pipeline on a local machine, it is recommended to reduce the grid search parameter intervals accordingly to guarantee reasonable computational time. For a better overview of the progress of the classification step, `clf_verbose` can be set to `True` and `grid_verbose` to `2`. Also, if enough computational memory is available, the SVM parameter `cache_size` can be increased up to `2000` (MB).
Currently, the `parallel_method` supported for local machine analysis is limited to `threading` and `multiprocess`, as the method `ipyparallel` is reserved for the analysis on the HPC clusters. The number of available workers `n_jobs` can be set to `-1` to use all CPUs, or to a specific number of CPUs available on the local machine.
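Summarised as configuration values, a local run might therefore use the following illustrative excerpt of CBDP_config.py:

```python
# Suggested local-machine settings, per the guidance above
parallel_method = 'threading'  # or 'multiprocess'
n_jobs = -1                    # use all CPUs of the local machine
clf_verbose = True
grid_verbose = 2
cache_size = 2000              # MB, if enough memory is available
```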
For running the pipeline on HPC clusters, it is first necessary to set up and activate an appropriate environment including Python version 3.8.6 and the required Python packages listed in requirements.txt, using `pip install -r requirements.txt`.
In the provided regular HPC launcher script and the long HPC launcher script, the following information may need to be adjusted to your settings:
- `--mail-user` for job status notifications (fill in a valid e-mail address and uncomment if needed)
- `language_to_load` to specify the Python language module to load
- `environment_to_load` to specify the path to the environment source
- Please also verify the resource allocation before submitting a job and adjust if necessary.
When the environment is set up, the packages are installed, and the files in the eli5, mlxtend, and imblearn package directories are replaced, you can start adjusting the configuration file to the needs of your data and experimental setting. Please note that on HPC, parallelization via the `parallel_method` `ipyparallel` is highly recommended. The number of jobs will be set automatically to the number of available workers based on the launcher script. `clf_verbose` can be set to `False` and `grid_verbose` to `1` to avoid massive printouts, and the SVM parameter `cache_size` can be adjusted to `200` (MB).
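The corresponding illustrative excerpt of CBDP_config.py for an HPC run would be:

```python
# Suggested HPC settings, per the guidance above
parallel_method = 'ipyparallel'  # n_jobs is set automatically by the launcher
clf_verbose = False
grid_verbose = 1
cache_size = 200                 # MB
```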
On the HPC node, the files should be accessible and stored in the same layout as in this repository. Verify the path to the cleaned and imputed training and test data sets, as well as the path where the results should be stored, so that they match the variables in the configuration file.
If everything is set and ready, run the pipeline with the configured experimental settings on HPC clusters using the below command:
sbatch HPC_SVM_launcher.sh CBD_pipeline_SVM_HPC.py
The results will be stored in the configured `folder_prefix` folder and bear the combined and sorted abbreviations of the enabled steps, followed by a numeric suffix to avoid duplicated folders, e.g. `<possible_prefix>-DS-REI-RHCF-ST-chi2KBEST-PCA-FT-RUS-FI-BBP-SVM-both-lin-and-non-lin-HPC_00` for a pipeline deployed with data splitting (DS), removal of engineered input (REI), removal of highly correlated features (RHCF), standard scaler (ST), chi2 SelectKBest (chi2KBEST), normal PCA (PCA), feature transformation (FT), imbalance resampling with random under-sampling (RUS), calculated feature importance (FI), box and bar plots (BBP), using support vector machine classification (SVM) with linear and non-linear kernels (both-lin-and-non-lin), run on the high performance computing clusters (HPC) for the first time (00). If run a second time, a second folder would be created with the same name ending in _01. Note that the pipeline-order is reflected in the name: features first (FT), then resampling (RUS).
Other possible abbreviations are: MI minmax scaler, RO robust scaler, SMOTE synthetic minority over-sampling technique, kPCA kernel PCA (which will be preceded by the actual kernel if analysed one by one, e.g. polykPCA, to save computational time).
The results will consist of confusion matrices, roc_auc curves, precision-recall curves, summarising heatmap and venn diagram plots for RHCF, summarising plots for feature importance and shuffling effects, comparison scatter plots for the different feature importance methods, and the code execution output file generated either in the terminal (local machine) or in a readable .out file (HPC).
- Continue editing this README file (03/14/2022)
- Add boxplot of most important features in original data (03/18/2022)
- Make the pipeline generate a .out file of the code execution when running locally, similar to the HPC .out (03/23/2022)
- In case of a linear SVM kernel combined with linear PCA, enable extraction of feature importance by shuffling and by the .coef_ attribute of the linear classifier (05/30/2022)
- Near-constancy check added for continuous and binary features (05/30/2022)
- Enable launching the analysis with any combination of the supported techniques, including single or double feature transformation (05/30/2022)
- Generate a detailed list of highly correlated features during that pre-processing step (05/30/2022)
- Added corrected Cramer's V correlation as a possible score function for SelectKBest (05/30/2022)
- Updated the pipeline flow-chart and README descriptions (07/14/2022)
- Added roc_auc as GridSearchCV scorer (07/20/2022)
- Add Diagnostic Odds Ratio, balanced_accuracy, and Matthews correlation coefficient as cross-validation scorers (10/13/2022)
- Add cross-validation roc-auc and precision-recall curve plots (10/13/2022)
- Add PCA or LDA plot with variances of data after RHCF if linear PCA or LDA is used as transformer (01/26/2023)
- Enable tagging a specific subgroup of samples based on available features and visualize the performance within that group vs the others (01/26/2023)
- For overall results in the group, draw confusion matrix with boxplots of target-related features in TP/TN/FP/FN groups (01/29/2023)
- Correlation plots of features that remain after RHCF with the output feature (Cramer & chi2, PBS) (01/26/2023)
- Grouped version of these correlation plots if data_split is enabled (02/12/2023)
- Extend the pipeline to allow tree-based classification (e.g. RF, XGBoost)
- Make the pipeline compatible with additional processing techniques, e.g. dimensionality reduction, feature selection, ...
- Add QDA as non-linear supervised continuous transformer
- Allow to choose if ONLY the data subgroups (e.g. male & female) should be processed
- Add decision-making possibilities for tagging features to be removed during the RHCF step e.g., remove the one least correlating with output_feature
- Allow the use of other correlation metrics during RHCF
- Look for categorical selector or transformer other than SelectKBest
- Allow the possibility to 'drop' or 'passthrough' features that are not eligible for the transformer steps (e.g. if the categorical transformer is disabled, should the input categorical features be dropped or passed through?) (10/13/2022)
- Add delight to the experience when all tasks are complete 🎉