How is "train_data_processed_w_static.csv" obtained for out-of-domain task on MIMIC? #23

Open
jjgarciac opened this issue Feb 24, 2022 · 6 comments

Comments

@jjgarciac

jjgarciac commented Feb 24, 2022

When executing python3 src/experiments/out_of_domain.py --models PPCA, I encounter: FileNotFoundError: (...)/in-hospital-mortality/train_data_processed_w_static.csv

I performed the 6 pre-processing steps listed here to set up the MIMIC mortality benchmark. The resulting directory does not include the file causing the error; its layout is as follows:
-in-hospital-mortality/
--train/
--test/

Note: I am using the MIMIC-III-demo dataset; I was able to run python -um mimic3models.in_hospital_mortality.logistic.main --l2 --C 0.001 --output_dir mimic3models/in_hospital_mortality/logistic from mimic3-benchmarks.

@Kaleidophon
Collaborator

Hey! Sorry for the late response. I haven't touched this project in a long while, but I suspect that the preprocessing pipeline from the MIMIC benchmark produces train/test splits that are named differently than in this repo. The names of the splits are defined in this module, and you could try changing them in your local clone to match your output files.

It occurs to me, based on this script, that at least the test split output is simply named testset.csv. What is your training set file called after running the pipeline?

@jjgarciac
Author

jjgarciac commented Mar 5, 2022

No worries and thank you for the response.

After running the pipeline, the \train and \test folders contain multiple files. Inside \train, there are 46 files named <#>_episode<#>_timeseries.csv (e.g. 10013_episode1_timeseries.csv) and a single listfile.csv. Each <#>_episode<#>_timeseries.csv contains the feature values, and listfile.csv maps each timeseries file to a label (e.g. 10013_episode1_timeseries.csv, 0). The files in \test follow the same format.

What format should the files referenced in that module be in?

I also noticed that the train_X array inside this function corresponds to the feature set described in the paper (plus a couple of extra variables).
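
For reference, here is a minimal sketch of how this layout can be read, assuming one CSV per episode plus a listfile.csv per split; the listfile column names ("stay", "y_true") are an assumption about the mimic3-benchmarks header, not something taken from this repo:

```python
import os
import pandas as pd

SPLIT_DIR = "data/in-hospital-mortality/train"  # same layout for .../test

# listfile.csv maps each per-episode timeseries file to its mortality label.
listfile = pd.read_csv(os.path.join(SPLIT_DIR, "listfile.csv"))

episodes = []
for _, row in listfile.iterrows():
    ts = pd.read_csv(os.path.join(SPLIT_DIR, row["stay"]))  # e.g. 10013_episode1_timeseries.csv
    ts["y"] = row["y_true"]                                  # 0/1 in-hospital mortality label
    episodes.append(ts)

train_long = pd.concat(episodes, ignore_index=True)
print(train_long.shape)
```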

@Kaleidophon
Collaborator

Mhh, is that the output even after running the python -m mimic3benchmark.scripts.create_in_hospital_mortality data/root/ data/in-hospital-mortality/ command? I will check the format of the dataset again by the end of the week. Until then, I will also mention @karinazad and especially @LMeijerink here in case they have some advice on this issue.

@jjgarciac
Author

Yes, though this is expected, as mentioned here:
"After the above commands are done, there will be a directory data/{task} for each created benchmark task. These directories have two sub-directories: train and test. Each of them contains bunch of ICU stays and one file with name listfile.csv, which lists all samples in that particular set. Each row of listfile.csv has the following form: icu_stay, period_length, label(s). A row specifies a sample for which the input is the collection of ICU event of icu_stay that occurred in the first period_length hours of the stay and the target is/are label(s). In in-hospital mortality prediction task period_length is always 48 hours, so it is not listed in corresponding listfiles."

@Kaleidophon
Collaborator

Heyo! I was looking at some code again. Since this project happened some time ago and I hadn't worked on that specific aspect of it, I am not sure I can comprehensively help you solve this problem, unfortunately. Since I no longer work at Pacmed, I also don't have access to the data, so I can't provide any more detail on the format of the dataset. It should correspond to the names in this pickle file, though. What I understand is the following:

  • After producing the timeseries, it seems to me that the function you pointed out here can be used to read the time series. You might be able to look at this script here for eICU to get a better idea of how to process the resulting data into the format used in the paper (see the sketch at the end of this comment).
  • Afterwards, adapt the paths in this module to point to the file you created.

I am very sorry that reproducing this part is such a hassle! I am also mentioning @Giovannicina in case he can provide some more info on this problem. Best of luck, and if you do figure out the procedure, please let me know here so that we can improve the documentation of the repo in this regard.
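
In case it helps anyone who lands on this issue, below is a rough sketch of the kind of flattening step described in the bullet points above: collapsing each 48-hour episode into a single row and writing one CSV per split. The listfile column names ("stay", "y_true"), the aggregation (a per-feature mean), and the output column names are guesses rather than the actual preprocessing used in the paper; the eICU script mentioned above should be the authoritative reference for which statistics and static variables to include.

```python
import os
import pandas as pd

SPLIT_DIR = "data/in-hospital-mortality/train"
listfile = pd.read_csv(os.path.join(SPLIT_DIR, "listfile.csv"))  # columns "stay"/"y_true" assumed

rows = []
for _, entry in listfile.iterrows():
    ts = pd.read_csv(os.path.join(SPLIT_DIR, entry["stay"]))
    # Collapse the 48h timeseries into one row per ICU stay; taking the mean of the
    # numeric columns is only a placeholder for the real feature engineering.
    feats = ts.select_dtypes("number").mean().add_suffix("_mean")
    feats["y"] = entry["y_true"]
    rows.append(feats)

flat = pd.DataFrame(rows)
flat.to_csv("data/in-hospital-mortality/train_data_processed_w_static.csv", index=False)
```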

@yowald

yowald commented Jun 9, 2022

Hi there. @jjgarciac, did you find a solution to this issue?
I was also trying this out and ran into the same error. Any help from the authors would also be greatly appreciated of course.
