Dataset aggregation #1

plbenveniste · 2024-02-08T21:48:23Z

This issue describes the aggregation of the available datasets.
The datasets of interest for this project are:

Labeled datasets:
- canproco : PSIR and STIR contrast
- sct-testing-large : T1, T2 and T2*
- basel-mp2rage : MP2RAGE
- Bavaria : T2w
- msseg_challenge_2021 : FLAIR, but the top of the image must first be cropped. Only a small part of the spinal cord is included, and I think that no lesions are segmented in the spinal cord.
- ms-nyu : T2w

Unlabeled datasets:
- nih-ms-mp2rage : MP2RAGE
- umass-ms-* : T2w sag, STIR_T2w sag, T1w sag, Gad T1w sag, FMPIR_T2w sag, T2w ax, PD ax, Gad T1w ax
- karolinska : still in DICOM format : T1 and T2 (issue related : 76)

jcohenadad · 2024-02-28T03:45:07Z

I just remembered that we also have a lot of data from UMass (git-annex data: umass-ms-*, 3 datasets).

plbenveniste · 2024-06-12T15:30:39Z

Referencing these issues regarding the problem caused by the dataset_correction.py script: issue 301 and issue 305.

The script dataset_correction.py should no longer be used, or should at least be corrected.

plbenveniste · 2024-06-26T15:49:18Z

I updated the new code to aggregate the following datasets, which are labelled:

basel-ms-mp2rage
bavaria-quebec-spine-ms-unstitched
canproco
nih-ms-mp2rage
sct-testing-large

The command run on kronos was:

python ms-lesion-agnostic/monai/1_create_msd_data.py -pd ~/net/ms-lesion-agnostic/data/ -po ~/net/ms-lesion-agnostic/msd_data/ --lesion-only --canproco-exclude canproco/exclude.yml

The output is the following:

Total number of derivatives in the root directory: 4407
Number of images in train set: 1636
Number of images in validation set: 569
Number of images in test set: 544
Total number of images in the dataset: 2749

The total number of images in the dataset (2749) is different from the total number of derivatives (4407) because we decided to keep only those which have lesions.

The output is the following file: dataset_2024-06-26_seed42_lesionOnly.json
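
As a rough illustration of the two steps described above (keeping only images with lesions, then producing a reproducible split), here is a minimal sketch. It is not the code of 1_create_msd_data.py: the entry fields (`subject`, `image`, `lesion_voxels`), the 60/20/20 fractions, and the subject-level grouping are assumptions made for the example; only the seed value 42 comes from the output file name.

```python
import random


def lesion_only(entries):
    """Keep entries whose label mask contains at least one lesion voxel."""
    return [e for e in entries if e["lesion_voxels"] > 0]


def split_subjects(entries, seed=42, train_frac=0.6, val_frac=0.2):
    """Assign whole subjects to train/validation/test with a fixed seed.

    Splitting by subject (an assumption here) avoids having images of the
    same subject in both the training and the test set.
    """
    subjects = sorted({e["subject"] for e in entries})
    rng = random.Random(seed)  # local RNG so the split is reproducible
    rng.shuffle(subjects)
    n_train = round(train_frac * len(subjects))
    n_val = round(val_frac * len(subjects))
    group = {}
    for i, subject in enumerate(subjects):
        if i < n_train:
            group[subject] = "train"
        elif i < n_train + n_val:
            group[subject] = "validation"
        else:
            group[subject] = "test"
    split = {"train": [], "validation": [], "test": []}
    for e in entries:
        split[group[e["subject"]]].append(e)
    return split


# Toy data: 10 hypothetical subjects with 2 images each; sub-000 and sub-005
# have empty masks and are dropped by the lesion-only filter.
entries = [
    {
        "subject": f"sub-{i:03d}",
        "image": f"sub-{i:03d}_{contrast}.nii.gz",
        "lesion_voxels": 0 if i % 5 == 0 else 12,
    }
    for i in range(10)
    for contrast in ("T2w", "STIR")
]
kept = lesion_only(entries)
split = split_subjects(kept)
for name, items in split.items():
    print(name, len(items))
```

Using a local `random.Random(seed)` rather than the global generator keeps the split independent of any other random calls in the same process.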

jcohenadad · 2024-06-26T16:23:23Z

"we decided to keep only those which have lesions."

For now, but in the future it may be desirable to develop a model that also has good specificity (i.e., a high true negative rate).

plbenveniste · 2024-06-26T20:15:21Z

There was an issue in the code when gathering segmentations from nih-ms-mp2rage.
The code was run again:

python ms-lesion-agnostic/monai/1_create_msd_data.py -pd ~/net/ms-lesion-agnostic/data/ -po ~/net/ms-lesion-agnostic/msd_data/ --lesion-only --canproco-exclude canproco/exclude.yml

This is the output of the code:

Total number of derivatives in the root directory: 4407
Number of images in train set: 1712
Number of images in validation set: 590
Number of images in test set: 569
Total number of images in the dataset: 2871
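
The reported counts from the two runs can be cross-checked with a few lines of arithmetic. The sketch below only reuses the numbers printed above; the helper name `summarize` is hypothetical.

```python
def summarize(total_derivatives, split_counts):
    """Return the total image count, per-split fractions, and kept fraction."""
    total = sum(split_counts.values())
    fractions = {name: n / total for name, n in split_counts.items()}
    return total, fractions, total / total_derivatives


first_run = {"train": 1636, "validation": 569, "test": 544}
second_run = {"train": 1712, "validation": 590, "test": 569}

total_1, _, _ = summarize(4407, first_run)
total_2, fractions_2, kept_2 = summarize(4407, second_run)

# The split sizes must add up to the reported totals.
assert total_1 == 2749
assert total_2 == 2871

# Images gained after fixing the nih-ms-mp2rage segmentation gathering.
recovered = total_2 - total_1

for name, fraction in fractions_2.items():
    print(f"{name}: {fraction:.1%}")
print(f"kept after lesion-only filtering: {kept_2:.1%}")
print(f"images recovered by the fix: {recovered}")
```

The second run contains 122 more images than the first, and its split is close to 60/20/20.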

plbenveniste · 2024-10-15T15:05:35Z

For the purpose of writing an abstract for ACTRIMS, I am referencing some information about the data that we use.
Sites used for training, validation and testing:

University of Basel (basel-mp2rage)
Technical University of Munich (bavaria-quebec-spine-ms-unstitched)
St. Michael’s Hospital at the University of Toronto (canproco)
Centre hospitalier de l’Université de Montréal (canproco)
Djavad Mowafaghian Centre for Brain Health at the University of British Columbia (canproco)
Calgary Multiple Sclerosis Clinic (canproco)
Northern Alberta MS Clinic at the University of Alberta (canproco)
NIH (nih-ms-mp2rage)
Aix-Marseille Université (sct-testing-large - amuVirginie)
Università Vita-Salute San Raffaele (sct-testing-large - milanFilippi)
UCL (sct-testing-large - uclCiccarelli)
Karolinska Institutet (sct-testing-large - karoTobiasMS)
Brigham and Women's Hospital (sct-testing-large - bwh)
Massachusetts General Hospital (sct-testing-large - mghCaterina)
CHU de Rennes (sct-testing-large - rennesMS)
University Hospital of Montpellier (sct-testing-large - montpellierLesion)
Vanderbilt University (sct-testing-large - vanderbiltSeth)
NYU (sct-testing-large - nyuShepherd)
Université Claude Bernard Lyon 1 (sct-testing-large - lyonOfsep)
UCSF (sct-testing-large - ucsfTalbott)

Sites used for external validation:

University of Basel (ms-basel-2018 and ms-basel-2020)
University of Massachusetts (umass-ms-ge-hdxt1.5, umass-ms-ge-pioneer3, umass-ms-siemens-espree1.5 and umass-ms-ge-excite1.5)

plbenveniste · 2024-10-15T16:56:04Z

I ran the script to analyze the dataset:

python dataset_analysis/msd_data_analysis.py --msd-data-path ~/net/ms-lesion-agnostic/msd_data/dataset_2024-07-24_seed42_lesionOnly.json --output-folder ~/net/ms-lesion-agnostic/dataset_analysis/analysis_output --dataset-path ~/net/ms-lesion-agnostic/data/

EDIT: I fixed the re-orientation problem so that all resolutions are computed in RPI orientation.

Here is the output:

Number of images:  2871
Number of images for training:  1712
Number of images for validation:  590
Number of images for testing:  569
Number of images per contrast:  {'UNIT1': 265, 'T2w': 1773, 'STIR': 72, 'PSIR': 286, 'T2star': 474, 'T1w': 1}
Number of images per orientation:  {'iso': 272, 'ax': 1693, 'sag': 906}
Average resolution:  [1.25201238 0.54277958 2.93960629]
Std resolution:  [1.17157561 0.23521095 1.95652873]
Median resolution:  [0.57291669 0.5625     3.29999995]
-------------------------------------
Number of images in ms-basel-2018:  46
Contrast in ms-basel-2018:  {'T2w', 'T1w'}
Number of images per contrast in ms-basel-2018:  {'T2w': 24, 'T1w': 22}
Number of images in ms-basel-2020:  31
Contrast in ms-basel-2020:  {'PD'}
Number of images per contrast in ms-basel-2020:  {'PD': 31}
-------------------------------------
Number of images in umass:  3516
Contrast in umass:  {'T2w', 'PD', 'T1w'}
Number of images per contrast in umass:  {'T2w': 1806, 'PD': 537, 'T1w': 1173}

plbenveniste · 2024-10-18T21:43:05Z

I added the ms-nmo-beijing dataset, from which we only get some T1w images. I also added the computation of the resolution and the orientation for every dataset.
In the first run, I was surprised to find some coronal images, so I looked at a few of them:

umass-ms-ge-pioneer3/sub-ms1115/ses-01/anat/sub-ms1115_ses-01_acq-ax_ce-gad_T1w.nii.gz: useless image (huge artifact).
umass-ms-ge-pioneer3/sub-ms1098/ses-01/anat/sub-ms1098_ses-01_acq-ax_ce-gad_T1w.nii.gz: useless image (some artifact as well)
umass-ms-ge-pioneer3/sub-ms1234/ses-03/anat/sub-ms1234_ses-03_acq-ax_ce-gad_T1w.nii.gz: useless image
I excluded those when doing the aggregation.
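
One simple way to apply such an exclusion during aggregation is to filter candidate paths against the list of rejected images. This is only a sketch: the function name, the `data/` root, and the fourth path are hypothetical, and the actual aggregation code may handle exclusions differently.

```python
# Relative paths of the three images rejected above.
EXCLUDED_SUFFIXES = (
    "umass-ms-ge-pioneer3/sub-ms1115/ses-01/anat/sub-ms1115_ses-01_acq-ax_ce-gad_T1w.nii.gz",
    "umass-ms-ge-pioneer3/sub-ms1098/ses-01/anat/sub-ms1098_ses-01_acq-ax_ce-gad_T1w.nii.gz",
    "umass-ms-ge-pioneer3/sub-ms1234/ses-03/anat/sub-ms1234_ses-03_acq-ax_ce-gad_T1w.nii.gz",
)


def filter_excluded(paths):
    """Drop every path that ends with one of the excluded relative paths."""
    # str.endswith accepts a tuple, so the data root prefix does not matter.
    return [p for p in paths if not p.endswith(EXCLUDED_SUFFIXES)]


# Hypothetical data root and one hypothetical image that is kept.
candidates = ["data/" + suffix for suffix in EXCLUDED_SUFFIXES]
candidates.append("data/umass-ms-ge-pioneer3/sub-ms0001/ses-01/anat/sub-ms0001_ses-01_acq-sag_T2w.nii.gz")

kept_paths = filter_excluded(candidates)
print(len(candidates), "candidates,", len(kept_paths), "kept")
```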

Here is the output after the exclusion:

Number of images:  2871
Number of images for training:  1712
Number of images for validation:  590
Number of images for testing:  569
Number of images per contrast:  {'UNIT1': 265, 'T2w': 1773, 'STIR': 72, 'PSIR': 286, 'T2star': 474, 'T1w': 1}
Number of images per orientation:  {'iso': 272, 'ax': 1693, 'sag': 906}
Average resolution:  [1.25201238 0.54277958 2.93960629]
Std resolution:  [1.17157561 0.23521095 1.95652873]
Median resolution:  [0.57291669 0.5625     3.29999995]
Minimum pixel dimension:  0.1874999850988388
Maximum pixel dimension:  9.541563034057617
-------------------------------------
Number of images in ms-basel-2018:  46
Contrast in ms-basel-2018:  {'T1w', 'T2w'}
Number of images per contrast in ms-basel-2018:  {'T1w': 22, 'T2w': 24}
Number of images in ms-basel-2020:  31
Contrast in ms-basel-2020:  {'PD'}
Number of images per contrast in ms-basel-2020:  {'PD': 31}
Average resolution:  [2.43636375 0.61377165 0.61377165]
Std resolution:  [0.90967523 0.25884103 0.25884103]
Median resolution:  [2.99999976 0.57291669 0.57291669]
Minimum pixel dimension:  0.3385416567325592
Maximum pixel dimension:  3.300001859664917
Number of images per orientation in basel:  {'sag': 55, 'iso': 22}
-------------------------------------
Number of images in umass:  3512
Contrast in umass:  {'T1w', 'T2w', 'PD'}
Number of images per contrast in umass:  {'T1w': 1169, 'T2w': 1806, 'PD': 537}
Average resolution:  [2.14845656 0.43490765 1.70391261]
Std resolution:  [1.4642299  0.10913737 1.53917671]
Median resolution:  [3.29995835 0.42969999 0.42970002]
Minimum pixel dimension:  0.3124999701976776
Maximum pixel dimension:  11.24999713897705
Number of images per orientation in umass:  {'sag': 2088, 'ax': 1424}
-------------------------------------
Number of images in beijing:  346
Contrast in beijing:  {'T1w'}
Number of images per contrast in beijing:  {'T1w': 346}
Average resolution:  [1.33011619 0.924434   2.19872953]
Std resolution:  [1.05835971 0.13317857 2.83172577]
Median resolution:  [1.00000072 1.         1.        ]
Minimum pixel dimension:  0.390625
Maximum pixel dimension:  13.799997329711914
Number of images per orientation in beijing:  {'sag': 113, 'iso': 174, 'ax': 59}
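
The orientation labels above (sag, ax, iso) can be derived from the voxel sizes once they are expressed in RPI axis order. The sketch below shows one possible rule and is not the rule used by the analysis script: the function names and the 1.2 isotropy ratio are assumptions. It treats an image as isotropic when its largest and smallest voxel sizes are within that ratio, and otherwise names the plane after the axis with the thickest voxels.

```python
# Anatomical axis of each orientation letter: 0 = R-L, 1 = A-P, 2 = I-S.
AXIS_OF = {"R": 0, "L": 0, "A": 1, "P": 1, "I": 2, "S": 2}


def zooms_in_rpi(orientation, zooms):
    """Reorder voxel sizes from a 3-letter orientation code to RPI axis order.

    Flipping an axis (e.g. L instead of R) does not change its voxel size,
    so only the axis permutation matters.
    """
    reordered = [None, None, None]
    for letter, zoom in zip(orientation.upper(), zooms):
        reordered[AXIS_OF[letter]] = zoom
    if None in reordered:
        raise ValueError(f"invalid orientation code: {orientation!r}")
    return tuple(reordered)


def classify_orientation(zooms_rpi, iso_ratio=1.2):
    """Classify an image as 'iso', 'sag', 'cor' or 'ax' from RPI voxel sizes."""
    smallest, largest = min(zooms_rpi), max(zooms_rpi)
    if largest / smallest <= iso_ratio:
        return "iso"
    # Thick voxels along R-L mean sagittal slices, along A-P coronal slices,
    # and along I-S axial slices.
    return ("sag", "cor", "ax")[zooms_rpi.index(largest)]


# Median resolutions reported above, already in RPI order.
print(classify_orientation((0.57291669, 0.5625, 3.29999995)))  # train/val/test set
print(classify_orientation((2.99999976, 0.57291669, 0.57291669)))  # basel
# An image stored in AIL orientation must be reordered first.
print(classify_orientation(zooms_in_rpi("AIL", (0.5, 0.6, 3.0))))
```

With this rule, the median resolution of the training data is classified as axial and that of the basel datasets as sagittal, which matches the dominant orientation reported for each.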

plbenveniste · 2024-10-22T13:51:45Z

I have adapted the code to take only 20 images from umass (5 per site) and 20 images from beijing. I have also computed the orientation more accurately than before.
Output:

Number of images:  2871
Number of images for training:  1712
Number of images for validation:  590
Number of images for testing:  569
Number of images per contrast:  {'UNIT1': 265, 'T2w': 1773, 'STIR': 72, 'PSIR': 286, 'T2star': 474, 'T1w': 1}
PSIR are 2D sagital images: count PSIR images: 286
STIR are 2D sagital images: count STIR images: 72
UNIT1 are 3D images: count UNIT1 images: 265
T1w are 3D images: count T1w images: 1
For T2w, we have only 2D images:  1234  axial images and  539  sagital images
For T2star, we have only 2D images:  459  axial images and  15  sagital images
Total number of sagital images:  912
Total number of axial images:  1693
Total number of 3D images:  266
Number of subjects:  1541
Average resolution:  [1.25201238 0.54277958 2.93960629]
Std resolution:  [1.17157561 0.23521095 1.95652873]
Median resolution:  [0.57291669 0.5625     3.29999995]
Minimum pixel dimension:  0.1874999850988388
Maximum pixel dimension:  9.541563034057617
-------------------------------------
Number of images in ms-basel-2018:  46
Contrast in ms-basel-2018:  {'T1w', 'T2w'}
Number of images per contrast in ms-basel-2018:  {'T1w': 22, 'T2w': 24}
Number of images in ms-basel-2020:  31
Contrast in ms-basel-2020:  {'PD'}
Number of images per contrast in ms-basel-2020:  {'PD': 31}
Total number of images: 77
2D sagital images: 55
3D images: 22
Number of subjects in ms-basel-2018:  23
Number of subjects in ms-basel-2020:  16
Average resolution:  [2.43636375 0.61377165 0.61377165]
Std resolution:  [0.90967523 0.25884103 0.25884103]
Median resolution:  [2.99999976 0.57291669 0.57291669]
Minimum pixel dimension:  0.3385416567325592
Maximum pixel dimension:  3.300001859664917
-------------------------------------
Number of images in umass:  20
Contrast in umass:  {'T1w', 'PD', 'T2w'}
Number of images per contrast in umass:  {'T1w': 9, 'PD': 2, 'T2w': 9}
Number of subjects in umass:  20
For umass, we have  13  axial images,  7  sagital images and  0  3D images
Average resolution:  [1.55297141 0.51953438 2.68904921]
Std resolution:  [1.3874418  0.17851401 1.65879995]
Median resolution:  [0.78125    0.42969374 3.62499213]
Minimum pixel dimension:  0.35159996151924133
Maximum pixel dimension:  5.000114440917969
-------------------------------------
Number of images in beijing:  20
Contrast in beijing:  {'T1w'}
Number of images per contrast in beijing:  {'T1w': 20}
For beijing, we have  2  axial images,  2  sagital images and  16  3D images
Number of subjects in beijing:  11
Average resolution:  [1.24874925 0.93906249 1.62031279]
Std resolution:  [0.69777444 0.12440287 1.96204985]
Median resolution:  [1.00000021 1.         1.        ]
Minimum pixel dimension:  0.625
Maximum pixel dimension:  7.500004768371582
-------------------------------------
-------------------------------------
Total number of images:  2988
Total number of subjects:  1611
Total number of sagital images:  976
Total number of axial images:  1708
Total number of 3D images:  304
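
The grand totals can be checked against the per-group numbers printed above. This sketch only re-adds the reported values; the dictionary layout is chosen for the example.

```python
# Per-group numbers copied from the output above.
groups = {
    "train/val/test": {"images": 2871, "subjects": 1541, "sag": 912, "ax": 1693, "3d": 266},
    "basel": {"images": 77, "subjects": 23 + 16, "sag": 55, "ax": 0, "3d": 22},
    "umass": {"images": 20, "subjects": 20, "sag": 7, "ax": 13, "3d": 0},
    "beijing": {"images": 20, "subjects": 11, "sag": 2, "ax": 2, "3d": 16},
}

totals = {
    key: sum(group[key] for group in groups.values())
    for key in ("images", "subjects", "sag", "ax", "3d")
}

# Reported grand totals.
assert totals == {"images": 2988, "subjects": 1611, "sag": 976, "ax": 1708, "3d": 304}

# Every image is counted in exactly one orientation category.
assert totals["sag"] + totals["ax"] + totals["3d"] == totals["images"]
print(totals)
```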

plbenveniste · 2024-11-08T16:36:14Z

I have aggregated all the annotated data from the following datasets: basel-mp2rage, bavaria-quebec-spine-ms-unstitched, canproco, ms-basel-2018, ms-basel-2020, ms-karolinska-2020, nih-ms-mp2rage, ms-nyu and sct-testing-large.

This accounts for 4824 MRI scans which come from 2019 subjects.

plbenveniste · 2024-11-28T20:43:00Z

After a discussion about how difficult it is to see lesions in PDw images, we decided not to include them in the study.
See #40 (comment) for more details.

plbenveniste self-assigned this Feb 8, 2024

plbenveniste mentioned this issue Feb 16, 2024

Correcting header segmentation files neuropoly/data-management#301

Closed

plbenveniste added the data label Feb 22, 2024

plbenveniste mentioned this issue Feb 22, 2024

Orientation problem labels-disc M12 subjects ivadomed/canproco#71

Closed