- The required packages are installed as a new conda environment including both R and Python dependencies with the following command:
conda env create --name envbase -f requirements_conda.yml
- The missing R packages can be found in the `requirements_r.rda` file and can be installed using the following commands:
load("requirements_r.rda")
for (count in 1:length(installed_packages)) {
install.packages(installed_packages[count])
}
⚠️ For `reticulate`, if asked whether to use a default Python virtual environment, answer `no` so that the default conda environment is taken into consideration.
- The `permimp` package can be installed with the following command:
install.packages('permimp', repos=NULL, type='source')
- The `sandbox` package can be installed from inside the `code_dcrt` folder with:
  * `python setup.py build_ext --inplace`
  * `pip install -e .`
- For the first 3 experiments, `compute_simulations` is used along with `plot_simulations_all`:
  - Set `N_SIMULATIONS` to 1:100 to perform the 100 runs.
  - Set `N_CPU` according to the reserved resources (parallel) or 1 (serial).
  - For the first experiment:
    - Set `DEBUG` to FALSE.
    - Uncomment both `permfit` and `cpi`.
    - `n_samples` is set to 300 and `n_features` is set to 100.
    - Uncomment all the `rho` values.
    - Set `prob_sim_data` to `regression_perm`.
    - In `stat_knockoff`, uncomment (`lasso_cv`).
    - The output csv file `simulation_results_blocks_100_Mi_dnn_dnn_py_300:100` is found in `results/results_csv`.
  - For the second experiment:
    - Set `DEBUG` to FALSE.
    - Uncomment both `permfit` and `cpi`.
    - Set `n_samples` to ``n_samples = `if`(!DEBUG, seq(100, 1000, by = 100), 10L)`` (comment line 84 and uncomment line 85).
    - Set `n_features` to 50.
    - In `prob_sim_data`, comment `regression_perm` and uncomment all the rest.
    - In `stat_knockoff`, uncomment (`lasso_cv`).
    - The output csv file `simulation_results_blocks_100_dnn_dnn_py_perm_100--1000` is found in `results/results_csv`.
  - For the third experiment:
    - Uncomment all methods.
    - Set `n_samples` to 1000 and `n_features` to 50.
    - In `prob_sim_data`, comment `regression_perm` and uncomment all the rest.
    - In `stat_knockoff`, uncomment (`lasso_cv`, `bart` and `deep`).
    - The output csv file `simulation_results_blocks_100_allMethods_pred_final` is found in `results/results_csv`.
  - For the fourth experiment, we move to the `ukbb` folder:
    - The data are public data from the UK Biobank; an agreement must be signed before using them (all personal data have already been removed).
    - In the `process` scripts, change method to `permfit_dnn` or `cpi_dnn` to process the data and explore the importance of the variables using one of the methods.
    - The corresponding results per method are found in the `Results_variables` folder.
  - For section D:
    - Set `DEBUG` to FALSE.
    - Uncomment both `cpi` and `loco_dnn` (the last uncommented item should not be followed by a comma).
    - Set `n_samples` to 1000, `n_features` to 50 and `rho` to 0.8.
    - In `prob_sim_data`, uncomment `regression`.
    - The output csv file `simulation_results_blocks_100_CPI_LOCO_DNN` is found in `results/results_csv`.
. -
For the section M:
-
We use
compute_simulations_py
. -
Large scale simulation:
- The script can be launched with the following command:
python -u compute_simulations_py.py --n 10000 --p 50 --nsig 20 --nblocks 10 --intra 0.8 --conditional 1 --f 1 --s 100 --njobs 1
--n
stands for the number of samples--p
stands for the number of variables--nsig
stands for the number of significant variables randomly chosen--nblocks
stands for the number of blocks/groups in the data structure--intra
stands for the intra correlation inside the groups--conditional
stands for the use of CPI (1
) or PI (0
)--f
stands for the first point of the range (Default1
)--s
stands for the step-size i.e. range size (Default100
)--njobs
stands for the serial/parallel implementation underJoblib
(Default1
)- The csv output file
simulation_results_blocks_100_n_10000_p_50_cpi_permfit
is found inresults/results_csv
.
- The script can be launched with the following command:
    - UK Biobank semi-simulation:
      - The `filename` should be changed to the corresponding UKBB data (not publicly available).
      - The script can be launched with the following commands:
        - `python -u compute_simulations_py.py --nsig 115 --conditional 1 --f 1 --s 100 --njobs 1`
        - `python -u compute_simulations_py.py --nsig 115 --conditional 0 --f 1 --s 100 --njobs 1`
      - The flags have the following meaning:
        - `--nsig` stands for the number of significant variables randomly chosen
        - `--conditional` stands for the use of CPI (`1`) or PI (`0`)
        - `--f` stands for the first point of the range (Default `1`)
        - `--s` stands for the step-size i.e. range size (Default `100`)
        - `--njobs` stands for the serial/parallel implementation under `Joblib` (Default `1`)
      - The csv output file `simulation_results_blocks_100_UKBB_single` is found in `results/results_csv`.
  - For section N:
    - The Cam-CAN data is not publicly available, thus we provide the script `process_age_prediction_CamCAN` in order to compute the degree of significance for each frequency band.
    - The output csv file `Result_single_FREQ_all_imp_outer_10_inner` is found in `camcan`.
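
The flags documented for `compute_simulations_py.py` in section M can be pictured with a small `argparse` sketch. This is only an illustration, not the script's actual parser: the flag names and the defaults for `--f`, `--s` and `--njobs` come from the description above, while the argument types, the help strings and the function name `build_parser` are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags documented for section M."""
    parser = argparse.ArgumentParser(description="Sketch of the documented flags")
    parser.add_argument("--n", type=int, help="number of samples")
    parser.add_argument("--p", type=int, help="number of variables")
    parser.add_argument("--nsig", type=int, help="number of significant variables")
    parser.add_argument("--nblocks", type=int, help="number of blocks/groups")
    parser.add_argument("--intra", type=float, help="intra-group correlation")
    parser.add_argument("--conditional", type=int, choices=[0, 1], help="1 = CPI, 0 = PI")
    # Defaults below are the ones stated in the README text.
    parser.add_argument("--f", type=int, default=1, help="first point of the range")
    parser.add_argument("--s", type=int, default=100, help="step size (range size)")
    parser.add_argument("--njobs", type=int, default=1, help="number of Joblib jobs")
    return parser


# Parse the arguments of the large scale simulation, leaving --f, --s, --njobs at their defaults.
args = build_parser().parse_args(
    "--n 10000 --p 50 --nsig 20 --nblocks 10 --intra 0.8 --conditional 1".split()
)
method = "CPI" if args.conditional == 1 else "PI"
print(method, args.n, args.p, args.f, args.s, args.njobs)
```

With the large scale simulation arguments, the sketch prints `CPI 10000 50 1 100 1`, showing that omitting `--f`, `--s` and `--njobs` would fall back to the documented defaults under these assumptions.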
- We move to `plot_simulations_all`:
  - For the first experiment with `simulation_results_blocks_100_Mi_dnn_dnn_py_300:100` as input:
    - Change `source` (at line 2) to `source("utils/plot_methods_all_Mi.R")`.
    - Set `nb_relevant` to 20 and `N_CPU` to the number of dedicated resources.
    - Set `run_plot_auc`, `run_plot_type1error`, `run_plot_power` and `run_time` one by one to TRUE.
    - Set `run_plot_combine` and `run_all_methods` to FALSE.
    - Uncomment (`Permfit-DNN` and `CPI-DNN`).
    - The output csv files `AUC_blocks_100_Mi_dnn_dnn_py_300:100`, `power_blocks_100_Mi_dnn_dnn_py_300:100`, `type1error_blocks_100_Mi_dnn_dnn_py_300:100` and `time_bars_blocks_100_Mi_dnn_dnn_py_300:100` are found in `results/results_csv`.
  - For the second experiment with `simulation_results_blocks_100_dnn_dnn_py_perm_100--1000` as input:
    - Change `source` (at line 2) to `source("utils/plot_methods_all_increasing_combine.R")`.
    - Set `nb_relevant` to 20 and `N_CPU` to the number of dedicated resources.
    - Set `run_plot_combine` to TRUE.
    - Set `run_all_methods` to FALSE.
    - Set `run_plot_auc`, `run_plot_type1error`, `run_plot_power` and `run_time` to FALSE.
    - Uncomment (`Permfit-DNN` and `CPI-DNN`).
    - The output csv files `AUC_blocks_100_dnn_dnn_py_perm_100--1000` and `type1error_blocks_100_dnn_dnn_py_perm_100--1000` are found in `results/results_csv`.
  - For the third experiment with `simulation_results_blocks_100_allMethods_pred_final` as input:
    - Change `source` (at line 2) to `source("utils/plot_methods_all.R")`.
    - Set `nb_relevant` to 20 and `N_CPU` to the number of dedicated resources.
    - Set `run_plot_auc`, `run_plot_type1error` and `run_plot_power` one by one to TRUE.
    - Set `run_plot_combine` and `run_time` to FALSE.
    - Set `run_all_methods` and `with_pval` to TRUE.
    - Uncomment (`Marg`, `d0CRT`, `Permfit-DNN`, `CPI-DNN`, `CPI-RF`, `lazyvi`, `cpi_knockoff`, `loco` and `Strobl`).
    - The output csv files `AUC_blocks_100_allMethods_pred_imp_final_withPval`, `power_blocks_100_allMethods_pred_imp_final` and `type1error_blocks_100_allMethods_pred_imp_final` are found in `results/results_csv`.
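
The instruction to set `run_plot_auc`, `run_plot_type1error`, `run_plot_power` and `run_time` "one by one" to TRUE can be read as one run of the plotting script per flag, with only that flag switched on. That reading is an assumption, since the text does not spell it out. The sketch below only enumerates the resulting configurations for the first experiment in Python; the helper name `one_by_one` is hypothetical and nothing here calls the R script.

```python
# Flags named in the first-experiment instructions for plot_simulations_all.
FLAGS = ["run_plot_auc", "run_plot_type1error", "run_plot_power", "run_time"]


def one_by_one(flags):
    """Yield one configuration per flag, with only that flag set to True."""
    for active in flags:
        yield {flag: (flag == active) for flag in flags}


configs = list(one_by_one(FLAGS))
for config in configs:
    # Exactly one flag is enabled in each configuration.
    enabled = [name for name, is_on in config.items() if is_on]
    print("run with", enabled[0], "= TRUE, all other run flags = FALSE")
```

Under this reading, the first experiment needs four runs of the plotting script and the third experiment needs three, since `run_time` stays FALSE there.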
- For the supplementary experiments:
  - For section D with `simulation_results_blocks_100_CPI_LOCO_DNN` as input:
    - Change `source` (at line 2) to `source("utils/plot_methods_all.R")`.
    - Set `nb_relevant` to 20 and `N_CPU` to the number of dedicated resources.
    - Set `run_plot_auc`, `run_plot_type1error`, `run_plot_power` and `run_time` one by one to TRUE.
    - Set `run_plot_combine` and `run_all_methods` to FALSE.
    - Uncomment (`LOCO-DNN` and `CPI-DNN`).
    - The output csv files `AUC_blocks_100_CPI_LOCO_DNN`, `power_blocks_100_CPI_LOCO_DNN`, `type1error_blocks_100_CPI_LOCO_DNN` and `time_bars_blocks_100_CPI_LOCO_DNN` are found in `results/results_csv`.
  - For section I with `simulation_results_blocks_100_allMethods_pred_final` as input:
    - Change `source` (at line 2) to `source("utils/plot_methods_all.R")`.
    - Set `nb_relevant` to 20 and `N_CPU` to the number of dedicated resources.
    - Set `run_plot_auc` to TRUE.
    - Set `run_plot_type1error`, `run_plot_power`, `run_time`, `run_plot_combine` and `with_pval` to FALSE.
    - Set `run_all_methods` to TRUE.
    - Uncomment (`Knockoff_bart`, `Knockoff_lasso`, `Shap`, `SAGE`, `MDI`, `BART`, `Knockoff_deep`, `Knockoff_path` and `Knockoff_lasso`).
    - The output csv file `AUC_blocks_100_allMethods_pred_imp_final_withoutPval` is found in `results/results_csv`.
  - For section K with `simulation_results_blocks_100_allMethods_pred_final` as input:
    - Change `source` (at line 2) to `source("utils/plot_methods_all.R")`.
    - Set `run_time` to TRUE and the rest to FALSE.
    - Uncomment all the methods.
    - The output csv file `time_bars_blocks_100_allMethods_pred_imp_final` is found in `results/results_csv`.
  - For section M:
    - Change `source` (at line 2) to `source("utils/plot_methods_all.R")`.
    - Set `run_plot_auc`, `run_plot_type1error`, `run_plot_power` and `run_time` one by one to TRUE.
    - Set `run_plot_combine`, `run_all_methods` and `with_pval` to FALSE.
    - Uncomment (`Permfit-DNN`, `CPI-DNN`).
    - Large scale simulation with `simulation_results_blocks_100_n_10000_p_50_cpi_permfit` as input:
      - Set `nb_relevant` to 20 and `N_CPU` to the number of dedicated resources.
      - The output csv files are found in `results/results_csv` under `[AUC-type1error-power-time_bars]_blocks_100_groups_CPI_n_10000_p_50_cpi_permfit`.
    - UK Biobank semi-simulation:
      - Set `nb_relevant` to 115 and `N_CPU` to the number of dedicated resources.
      - The output csv files are found in `results/results_csv` under `[AUC-type1error-power-time_bars]_blocks_100_UKBB_single`.
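
The `[AUC-type1error-power-time_bars]_...` shorthand used for the section M outputs stands for four separate files, one per metric prefix. The following sketch expands the shorthand and reports which files are present. The `.csv` extension and the helper name `expected_outputs` are assumptions, since the text gives only the base names; the folder `results/results_csv` is taken from the instructions above.

```python
from pathlib import Path

# Metric prefixes listed inside the square brackets of the shorthand.
METRICS = ["AUC", "type1error", "power", "time_bars"]


def expected_outputs(suffix, results_dir="results/results_csv"):
    """Expand `[AUC-type1error-power-time_bars]_<suffix>` into four paths."""
    # Assumption: the files carry a ".csv" extension.
    return [Path(results_dir) / f"{metric}_{suffix}.csv" for metric in METRICS]


ukbb_files = expected_outputs("blocks_100_UKBB_single")
for path in ukbb_files:
    status = "found" if path.exists() else "missing"
    print(f"{path}: {status}")
```

The same helper can be called with `blocks_100_groups_CPI_n_10000_p_50_cpi_permfit` to list the large scale simulation outputs.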
- We move to `visualization` with 5 notebooks, `plot_figure_simulations`, `plot_figure_simulations_2`, `plot_figure_simulations_3`, `plot_ukbb_results` and `plot_freqRes`:
  - `plot_figure_simulations` for the plots in the main text.
  - `plot_figure_simulations_2` and `plot_figure_simulations_3` for the plots in the supplement.
  - `plot_ukbb_results` for the plot of the fourth experiment.
  - `plot_freqRes` for the corresponding Cam-CAN plot.