Dataset aggregation #1

Open
plbenveniste opened this issue Feb 8, 2024 · 13 comments

@plbenveniste
Collaborator

plbenveniste commented Feb 8, 2024

This issue describes the aggregation of the available datasets.
The datasets of interest for this project are:

  • Labeled datasets:

    • canproco : PSIR and STIR contrast
    • sct-testing-large : T1, T2 and T2*
    • basel-mp2rage : MP2RAGE
    • Bavaria : T2w
    • msseg_challenge_2021 : FLAIR, but we first need to crop the top of the image; only very little of the spinal cord is included, and I think that no lesions are segmented in the spinal cord
    • ms-nyu: T2w
  • Unlabeled datasets:

    • nih-ms-mp2rage : MP2RAGE
    • umass-ms-* : T2w sag, STIR_T2w sag, T1w sag, Gad T1w sag, FMPIR_T2w sag, T2w ax, PD ax, Gad T1w ax
    • karolinska : T1 and T2, still in DICOM format (related issue: 76)

plbenveniste self-assigned this on Feb 8, 2024
@jcohenadad
Member

jcohenadad commented Feb 28, 2024

I just remembered that we also have a lot of data from UMass (git-annex data: umass-ms-* (3 datasets)).

@plbenveniste
Collaborator Author

Referencing these issues regarding the problem caused by the dataset_correction.py script: issue 301 and issue 305.

The dataset_correction.py script should no longer be used, or should at least be corrected.

@plbenveniste
Collaborator Author

plbenveniste commented Jun 26, 2024

I updated the code to aggregate the following labeled datasets:

  • basel-ms-mp2rage
  • bavaria-quebec-spine-ms-unstitched
  • canproco
  • nih-ms-mp2rage
  • sct-testing-large

The command run on kronos was:

python ms-lesion-agnostic/monai/1_create_msd_data.py -pd ~/net/ms-lesion-agnostic/data/ -po ~/net/ms-lesion-agnostic/msd_data/ --lesion-only --canproco-exclude canproco/exclude.yml

The output is the following:

Total number of derivatives in the root directory: 4407
Number of images in train set: 1636
Number of images in validation set: 569
Number of images in test set: 544
Total number of images in the dataset: 2749

The total number of images in the dataset (2749) is different from the total number of derivatives (4407) because we decided to keep only those which have lesions.

The resulting file is dataset_2024-06-26_seed42_lesionOnly.json.
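
The counts above correspond to roughly a 60/20/20 split (1636/569/544 of 2749). As a minimal sketch of what a lesion-only filter followed by a seeded split could look like: the function name `split_lesion_only`, the `has_lesion` field and the exact ratios are assumptions made for illustration, not the actual logic of 1_create_msd_data.py.

```python
import random

def split_lesion_only(records, seed=42, ratios=(0.6, 0.2, 0.2)):
    """Keep records flagged as containing lesions, then split them reproducibly."""
    kept = [r for r in records if r["has_lesion"]]
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    rng.shuffle(kept)
    n_train = int(ratios[0] * len(kept))
    n_val = int(ratios[1] * len(kept))
    return {
        "train": kept[:n_train],
        "validation": kept[n_train:n_train + n_val],
        "test": kept[n_train + n_val:],  # remainder goes to the test set
    }

# Toy example: 20 hypothetical derivatives, half of which contain a lesion.
records = [{"image": f"img_{i}.nii.gz", "has_lesion": i % 2 == 0} for i in range(20)]
splits = split_lesion_only(records)
print({name: len(items) for name, items in splits.items()})
```

Using a dedicated `random.Random(seed)` instance means the same seed always yields the same split, which is presumably what the `seed42` part of the file name records.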

@jcohenadad
Member

> we decided to keep only those which have lesions.

For now, yes, but in the future it may be desirable to develop a model that also has good specificity (i.e., a high true negative rate).
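
For reference, specificity is TN / (TN + FP). The sketch below computes it at the image level, assuming each image is reduced to a boolean "contains at least one lesion"; the function name and the toy labels are illustrative only.

```python
def specificity(y_true, y_pred):
    """Image-level true negative rate: TN / (TN + FP)."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    if tn + fp == 0:
        # No lesion-free images at all: the metric cannot be computed.
        raise ValueError("No lesion-free images: specificity is undefined.")
    return tn / (tn + fp)

# True = the image contains (or is predicted to contain) at least one lesion.
y_true = [False, False, False, False, True, True]
y_pred = [False, True, False, False, True, False]
print(specificity(y_true, y_pred))  # 3 TN, 1 FP -> 0.75
```

With a lesion-only dataset there are no true negatives at the image level, so this quantity cannot be measured until lesion-free images are included.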

@plbenveniste
Collaborator Author

There was an issue in the code when gathering segmentations from nih-ms-mp2rage.
After fixing it, the code was run again:

python ms-lesion-agnostic/monai/1_create_msd_data.py -pd ~/net/ms-lesion-agnostic/data/ -po ~/net/ms-lesion-agnostic/msd_data/ --lesion-only --canproco-exclude canproco/exclude.yml

This is the output of the code:

Total number of derivatives in the root directory: 4407
Number of images in train set: 1712
Number of images in validation set: 590
Number of images in test set: 569
Total number of images in the dataset: 2871
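
A quick arithmetic check of the two runs, using only the numbers reported in this thread, shows how many images the fix added to each split:

```python
# Split sizes reported before and after the nih-ms-mp2rage fix.
before = {"train": 1636, "validation": 569, "test": 544}
after = {"train": 1712, "validation": 590, "test": 569}

# Each run's splits should sum to the reported dataset total.
assert sum(before.values()) == 2749
assert sum(after.values()) == 2871

added = {name: after[name] - before[name] for name in before}
print(added, sum(added.values()))
```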

@plbenveniste
Collaborator Author

plbenveniste commented Oct 15, 2024

For the purpose of writing an abstract for ACTRIMS, I am recording some information about the data that we use.
Sites used for training, validation and testing:

  • University of Basel (basel-mp2rage)
  • Technical University of Munich, TUM (bavaria-quebec-spine-ms-unstitched)
  • St. Michael’s Hospital at the University of Toronto (canproco)
  • Centre hospitalier de l’Université de Montréal (canproco)
  • Djavad Mowafaghian Centre for Brain Health at the University of British Columbia (canproco)
  • Calgary Multiple Sclerosis Clinic (canproco)
  • Northern Alberta MS Clinic at the University of Alberta (canproco)
  • NIH (nih-ms-mp2rage)
  • Aix-Marseille Université (sct-testing-large - amuVirginie)
  • Università Vita-Salute San Raffaele (sct-testing-large - milanFilippi)
  • UCL (sct-testing-large - uclCiccarelli)
  • Karolinska Institutet (sct-testing-large - karoTobiasMS)
  • Brigham and Women's Hospital (sct-testing-large - bwh)
  • Massachusetts General Hospital (sct-testing-large - mghCaterina)
  • CHU de Rennes (sct-testing-large - rennesMS)
  • University Hospital of Montpellier (sct-testing-large - montpellierLesion)
  • Vanderbilt University (sct-testing-large - vanderbiltSeth)
  • NYU (sct-testing-large - nyuShepherd)
  • Université Claude Bernard Lyon 1 (sct-testing-large - lyonOfsep)
  • UCSF (sct-testing-large - ucsfTalbott)

Sites used for external validation:

  • University of Basel (ms-basel-2018 and ms-basel-2020)
  • University of Massachusetts (umass-ms-ge-hdxt1.5, umass-ms-ge-pioneer3, umass-ms-siemens-espree1.5 and umass-ms-ge-excite1.5)

@plbenveniste
Collaborator Author

plbenveniste commented Oct 15, 2024

I ran the script to analyze the dataset:

python dataset_analysis/msd_data_analysis.py --msd-data-path ~/net/ms-lesion-agnostic/msd_data/dataset_2024-07-24_seed42_lesionOnly.json --output-folder ~/net/ms-lesion-agnostic/dataset_analysis/analysis_output --dataset-path ~/net/ms-lesion-agnostic/data/

EDIT: I fixed the re-orientation problem so that all resolutions are now computed in RPI orientation.

Here is the output:

Number of images:  2871
Number of images for training:  1712
Number of images for validation:  590
Number of images for testing:  569
Number of images per contrast:  {'UNIT1': 265, 'T2w': 1773, 'STIR': 72, 'PSIR': 286, 'T2star': 474, 'T1w': 1}
Number of images per orientation:  {'iso': 272, 'ax': 1693, 'sag': 906}
Average resolution:  [1.25201238 0.54277958 2.93960629]
Std resolution:  [1.17157561 0.23521095 1.95652873]
Median resolution:  [0.57291669 0.5625     3.29999995]
-------------------------------------
Number of images in ms-basel-2018:  46
Contrast in ms-basel-2018:  {'T2w', 'T1w'}
Number of images per contrast in ms-basel-2018:  {'T2w': 24, 'T1w': 22}
Number of images in ms-basel-2020:  31
Contrast in ms-basel-2020:  {'PD'}
Number of images per contrast in ms-basel-2020:  {'PD': 31}
-------------------------------------
Number of images in umass:  3516
Contrast in umass:  {'T2w', 'PD', 'T1w'}
Number of images per contrast in umass:  {'T2w': 1806, 'PD': 537, 'T1w': 1173}
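
The per-axis resolution statistics in this log could be produced along the following lines. This is a sketch only: it assumes the voxel sizes have already been read from the image headers after reorientation to RPI, it uses the population standard deviation (which of the two conventions the real script uses is not stated), and the function name and toy values are hypothetical.

```python
import statistics

def resolution_stats(pixdims):
    """Per-axis mean, population std and median of (x, y, z) voxel sizes in mm."""
    axes = list(zip(*pixdims))  # transpose: one tuple of values per axis
    return {
        "mean": [statistics.fmean(a) for a in axes],
        "std": [statistics.pstdev(a) for a in axes],
        "median": [statistics.median(a) for a in axes],
    }

# Hypothetical voxel sizes, assumed to be expressed in RPI orientation.
pixdims = [(0.5, 0.5, 3.0), (0.5, 0.5, 5.0), (3.0, 0.5, 0.5), (1.0, 1.0, 1.0)]
stats = resolution_stats(pixdims)
print(stats["median"])  # [0.75, 0.5, 2.0]
```

Reorienting every image to the same orientation first matters here: otherwise the "first axis" of one image could be the slice direction and that of another an in-plane direction, and the per-axis averages would mix the two.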

@plbenveniste
Collaborator Author

plbenveniste commented Oct 18, 2024

I added the ms-nmo-beijing dataset, from which we only get some T1w images. I also added the computation of the resolution and the orientation for every dataset.
In the first run I was surprised to find some coronal images, so I looked at some of them:

  • umass-ms-ge-pioneer3/sub-ms1115/ses-01/anat/sub-ms1115_ses-01_acq-ax_ce-gad_T1w.nii.gz: useless image (huge artifact)
  • umass-ms-ge-pioneer3/sub-ms1098/ses-01/anat/sub-ms1098_ses-01_acq-ax_ce-gad_T1w.nii.gz: useless image (some artifact as well)
  • umass-ms-ge-pioneer3/sub-ms1234/ses-03/anat/sub-ms1234_ses-03_acq-ax_ce-gad_T1w.nii.gz: useless image

I excluded those when doing the aggregation.
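
One simple way to apply such an exclusion is to filter candidate paths against a fixed set. The sketch below uses the three paths listed above; the helper name `drop_excluded`, the `/data/` root and the fourth (kept) file are hypothetical, and the real aggregation script may handle exclusions differently.

```python
# Dataset-relative paths of the three images reported as unusable.
EXCLUDED = {
    "umass-ms-ge-pioneer3/sub-ms1115/ses-01/anat/sub-ms1115_ses-01_acq-ax_ce-gad_T1w.nii.gz",
    "umass-ms-ge-pioneer3/sub-ms1098/ses-01/anat/sub-ms1098_ses-01_acq-ax_ce-gad_T1w.nii.gz",
    "umass-ms-ge-pioneer3/sub-ms1234/ses-03/anat/sub-ms1234_ses-03_acq-ax_ce-gad_T1w.nii.gz",
}

def drop_excluded(paths, excluded=EXCLUDED):
    """Return the paths that do not end with any excluded relative path."""
    return [p for p in paths if not any(p.endswith(e) for e in excluded)]

root = "/data/"  # hypothetical data root
candidates = [root + e for e in sorted(EXCLUDED)] + [
    # Hypothetical file that should be kept.
    root + "umass-ms-ge-pioneer3/sub-ms0001/ses-01/anat/sub-ms0001_ses-01_acq-ax_T2w.nii.gz"
]
kept = drop_excluded(candidates)
print(len(candidates), "->", len(kept))  # 4 -> 1
```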

Here is the output after excluding them:

Number of images:  2871
Number of images for training:  1712
Number of images for validation:  590
Number of images for testing:  569
Number of images per contrast:  {'UNIT1': 265, 'T2w': 1773, 'STIR': 72, 'PSIR': 286, 'T2star': 474, 'T1w': 1}
Number of images per orientation:  {'iso': 272, 'ax': 1693, 'sag': 906}
Average resolution:  [1.25201238 0.54277958 2.93960629]
Std resolution:  [1.17157561 0.23521095 1.95652873]
Median resolution:  [0.57291669 0.5625     3.29999995]
Minimum pixel dimension:  0.1874999850988388
Maximum pixel dimension:  9.541563034057617
-------------------------------------
Number of images in ms-basel-2018:  46
Contrast in ms-basel-2018:  {'T1w', 'T2w'}
Number of images per contrast in ms-basel-2018:  {'T1w': 22, 'T2w': 24}
Number of images in ms-basel-2020:  31
Contrast in ms-basel-2020:  {'PD'}
Number of images per contrast in ms-basel-2020:  {'PD': 31}
Average resolution:  [2.43636375 0.61377165 0.61377165]
Std resolution:  [0.90967523 0.25884103 0.25884103]
Median resolution:  [2.99999976 0.57291669 0.57291669]
Minimum pixel dimension:  0.3385416567325592
Maximum pixel dimension:  3.300001859664917
Number of images per orientation in basel:  {'sag': 55, 'iso': 22}
-------------------------------------
Number of images in umass:  3512
Contrast in umass:  {'T1w', 'T2w', 'PD'}
Number of images per contrast in umass:  {'T1w': 1169, 'T2w': 1806, 'PD': 537}
Average resolution:  [2.14845656 0.43490765 1.70391261]
Std resolution:  [1.4642299  0.10913737 1.53917671]
Median resolution:  [3.29995835 0.42969999 0.42970002]
Minimum pixel dimension:  0.3124999701976776
Maximum pixel dimension:  11.24999713897705
Number of images per orientation in umass:  {'sag': 2088, 'ax': 1424}
-------------------------------------
Number of images in beijing:  346
Contrast in beijing:  {'T1w'}
Number of images per contrast in beijing:  {'T1w': 346}
Average resolution:  [1.33011619 0.924434   2.19872953]
Std resolution:  [1.05835971 0.13317857 2.83172577]
Median resolution:  [1.00000072 1.         1.        ]
Minimum pixel dimension:  0.390625
Maximum pixel dimension:  13.799997329711914
Number of images per orientation in beijing:  {'sag': 113, 'iso': 174, 'ax': 59}
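
The orientation labels in this log ('iso', 'ax', 'sag') can be derived from voxel sizes once images are in RPI orientation, where axis 0 is right-left, axis 1 is posterior-anterior and axis 2 is inferior-superior. The heuristic below is an assumption about how such a label might be computed, not the script's actual rule; the 10% isotropy tolerance is arbitrary.

```python
def classify_orientation(pixdim, tol=0.1):
    """Label an image from its RPI voxel sizes (mm): 'iso', 'sag', 'cor' or 'ax'."""
    smallest, largest = min(pixdim), max(pixdim)
    if largest - smallest <= tol * smallest:
        return "iso"  # near-isotropic voxels, treated as a 3D acquisition
    # The axis with the thickest voxels is taken as the slice direction.
    thickest_axis = list(pixdim).index(largest)
    return {0: "sag", 1: "cor", 2: "ax"}[thickest_axis]

# Median voxel sizes reported above, used here only as example inputs.
print(classify_orientation([0.57291669, 0.5625, 3.29999995]))      # ax
print(classify_orientation([2.99999976, 0.57291669, 0.57291669]))  # sag
print(classify_orientation([1.00000072, 1.0, 1.0]))                # iso
```

Under this rule an image is labeled coronal only when its thickest voxel dimension lies along the posterior-anterior axis, which is rare for spinal cord scans and therefore a reasonable trigger for a manual check.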

@plbenveniste
Collaborator Author

plbenveniste commented Oct 22, 2024

I have adapted the code to take only 20 images from umass (5 per site) and 20 images from beijing. I also now compute the orientation more accurately than before.
Output:

Number of images:  2871
Number of images for training:  1712
Number of images for validation:  590
Number of images for testing:  569
Number of images per contrast:  {'UNIT1': 265, 'T2w': 1773, 'STIR': 72, 'PSIR': 286, 'T2star': 474, 'T1w': 1}
PSIR are 2D sagital images: count PSIR images: 286
STIR are 2D sagital images: count STIR images: 72
UNIT1 are 3D images: count UNIT1 images: 265
T1w are 3D images: count T1w images: 1
For T2w, we have only 2D images:  1234  axial images and  539  sagital images
For T2star, we have only 2D images:  459  axial images and  15  sagital images
Total number of sagital images:  912
Total number of axial images:  1693
Total number of 3D images:  266
Number of subjects:  1541
Average resolution:  [1.25201238 0.54277958 2.93960629]
Std resolution:  [1.17157561 0.23521095 1.95652873]
Median resolution:  [0.57291669 0.5625     3.29999995]
Minimum pixel dimension:  0.1874999850988388
Maximum pixel dimension:  9.541563034057617
-------------------------------------
Number of images in ms-basel-2018:  46
Contrast in ms-basel-2018:  {'T1w', 'T2w'}
Number of images per contrast in ms-basel-2018:  {'T1w': 22, 'T2w': 24}
Number of images in ms-basel-2020:  31
Contrast in ms-basel-2020:  {'PD'}
Number of images per contrast in ms-basel-2020:  {'PD': 31}
Total number of images: 77
2D sagital images: 55
3D images: 22
Number of subjects in ms-basel-2018:  23
Number of subjects in ms-basel-2020:  16
Average resolution:  [2.43636375 0.61377165 0.61377165]
Std resolution:  [0.90967523 0.25884103 0.25884103]
Median resolution:  [2.99999976 0.57291669 0.57291669]
Minimum pixel dimension:  0.3385416567325592
Maximum pixel dimension:  3.300001859664917
-------------------------------------
Number of images in umass:  20
Contrast in umass:  {'T1w', 'PD', 'T2w'}
Number of images per contrast in umass:  {'T1w': 9, 'PD': 2, 'T2w': 9}
Number of subjects in umass:  20
For umass, we have  13  axial images,  7  sagital images and  0  3D images
Average resolution:  [1.55297141 0.51953438 2.68904921]
Std resolution:  [1.3874418  0.17851401 1.65879995]
Median resolution:  [0.78125    0.42969374 3.62499213]
Minimum pixel dimension:  0.35159996151924133
Maximum pixel dimension:  5.000114440917969
-------------------------------------
Number of images in beijing:  20
Contrast in beijing:  {'T1w'}
Number of images per contrast in beijing:  {'T1w': 20}
For beijing, we have  2  axial images,  2  sagital images and  16  3D images
Number of subjects in beijing:  11
Average resolution:  [1.24874925 0.93906249 1.62031279]
Std resolution:  [0.69777444 0.12440287 1.96204985]
Median resolution:  [1.00000021 1.         1.        ]
Minimum pixel dimension:  0.625
Maximum pixel dimension:  7.500004768371582
-------------------------------------
-------------------------------------
Total number of images:  2988
Total number of subjects:  1611
Total number of sagital images:  976
Total number of axial images:  1708
Total number of 3D images:  304
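
Drawing a fixed number of images per site, as done for umass (5 per site across 4 sites), can be sketched as follows. The site names are the four umass datasets mentioned earlier in this thread; the function name, the seed and the file names are hypothetical.

```python
import random
from collections import defaultdict

def sample_per_site(images, n_per_site=5, seed=42):
    """Pick up to n_per_site images from each site, reproducibly."""
    by_site = defaultdict(list)
    for site, path in images:
        by_site[site].append(path)
    rng = random.Random(seed)
    picked = {}
    for site in sorted(by_site):  # sorted so the draw order is stable
        paths = sorted(by_site[site])
        picked[site] = rng.sample(paths, min(n_per_site, len(paths)))
    return picked

sites = ["umass-ms-ge-hdxt1.5", "umass-ms-ge-pioneer3",
         "umass-ms-siemens-espree1.5", "umass-ms-ge-excite1.5"]
# Hypothetical file names, 8 per site.
images = [(s, f"{s}/sub-{i:03d}_T2w.nii.gz") for s in sites for i in range(8)]
picked = sample_per_site(images)
print(sum(len(v) for v in picked.values()))  # 4 sites x 5 images = 20
```

The log above reports 20 subjects for the 20 umass images, so the real selection apparently also keeps one image per subject; this sketch does not enforce that.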

@plbenveniste
Collaborator Author

plbenveniste commented Nov 8, 2024

I have aggregated all the annotated data from the following datasets: basel-mp2rage, bavaria-quebec-spine-ms-unstitched, canproco, ms-basel-2018, ms-basel-2020, ms-karolinska-2020, nih-ms-mp2rage, ms-nyu and sct-testing-large.

This amounts to 4824 MRI scans from 2019 subjects.
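
Scan and subject counts of this kind can be obtained from BIDS-style paths by extracting the `sub-` label. The sketch below assumes paths of the form `<dataset>/sub-<id>/...` and counts a subject once per dataset; the function name is hypothetical, and the fourth example path is invented to show two scans from one subject.

```python
import re
from collections import defaultdict

SUBJECT_RE = re.compile(r"(sub-[A-Za-z0-9]+)")

def count_scans_and_subjects(paths):
    """Return (number of scans, number of unique (dataset, subject) pairs)."""
    subjects = defaultdict(set)
    for path in paths:
        dataset = path.split("/")[0]  # first path component is the dataset name
        match = SUBJECT_RE.search(path)
        if match:
            subjects[dataset].add(match.group(1))
    return len(paths), sum(len(s) for s in subjects.values())

paths = [
    "umass-ms-ge-pioneer3/sub-ms1115/ses-01/anat/sub-ms1115_ses-01_acq-ax_ce-gad_T1w.nii.gz",
    "umass-ms-ge-pioneer3/sub-ms1098/ses-01/anat/sub-ms1098_ses-01_acq-ax_ce-gad_T1w.nii.gz",
    "umass-ms-ge-pioneer3/sub-ms1234/ses-03/anat/sub-ms1234_ses-03_acq-ax_ce-gad_T1w.nii.gz",
    # Hypothetical second scan of the same subject.
    "umass-ms-ge-pioneer3/sub-ms1234/ses-03/anat/sub-ms1234_ses-03_acq-sag_T2w.nii.gz",
]
print(count_scans_and_subjects(paths))  # (4, 3)
```

Keying subjects by dataset avoids merging two different people who happen to share the same `sub-` label in different datasets.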

@plbenveniste
Collaborator Author

After a discussion about the difficulty of seeing lesions in PDw images, we decided not to include them in the study.
See #40 (comment) for more details.
