MHub / GC - Add grt123 Model for lung cancer prediction based on lung nodules #27

Merged: 22 commits merged into MHubAI:main on Feb 28, 2024

silvandeleemput
Contributor

@silvandeleemput silvandeleemput commented Jul 4, 2023

Hi, this PR contains the code required to integrate our modified version of the publicly available grt123 model, winner of the Data Science Bowl 2017 Kaggle challenge, into MHub.

Caveats

  • The PR target is main, but should be something like m-gc-grt123-lung-cancer
  • In the Dockerfile the MHub model integration is currently marked TODO since it requires the creation of the appropriate branch for this code first.
  • This implementation still uses a run.py script because the MHAConverter doesn't have the panimg backend yet.

Algorithm I/O

  • Input should be a CT lung image (MHA). This MHub implementation expects DICOM input, which is converted to MHA using MhaPanImgConverter.
  • Output is a JSON file containing the predicted nodule findings (locations) and predicted cancer probability scores.
  • Test images for this algorithm can be found here

@silvandeleemput
Contributor Author

This PR has been updated for the new base image.

@LennyN95
Member

Some suggestions

  • For the final prediction score, add a ValueOutput to the Runner module
  • Remove the tmp_path config
  • Use requestTempDir for temporary folders (example)
  • Remove custom gpu checks if not vital for the model runner (we may implement such a feature globally in mhubio, let's discuss this in our next meeting!)

@LennyN95 LennyN95 self-assigned this Aug 13, 2023
silvandeleemput and others added 2 commits September 13, 2023 12:00
* default.yml
  * renamed from config.yml
  * added version and description
  * updated pipeline with panimg mhaconverter
* Dockerfile
  * added fixed commit hash for grt123 repo git clone
  * updated entrypoint
* LungCancerClassifierRunner.py
  * removed tmp_path config option
  * added requestTempDir for tmp_path
  * added more comments
* Removed script files and custom PanImgConverter
@silvandeleemput
Contributor Author

silvandeleemput commented Sep 13, 2023

I just updated and cleaned up this PR:

  • default.yml
    • renamed from config.yml
    • added version and description
    • updated pipeline with panimg mhaconverter
  • Dockerfile
    • added fixed commit hash for grt123 repo git clone
    • updated entrypoint
  • LungCancerClassifierRunner.py
    • removed tmp_path config option
    • added requestTempDir for tmp_path
    • added more comments
  • Removed script files and custom PanImgConverter

This satisfies all but two of your suggestions:

  • Remove custom gpu checks if not vital for the model runner (we may implement such a feature globally in mhubio, let's discuss this in our next meeting!)

This check is required for the model to know how many GPUs it has available, so I left it in.

  • For the final prediction score, add a ValueOutput to the Runner module

This is tricky, since the model outputs quite a detailed report with multiple scores per nodule per scan (see below). Maybe it is better to leave it as is? Or we could add only a single cancer probability ValueOutput for the whole scan (the casecancerprobability value under cancerinfo at the end of the report).

{
    "lungcad": {
        "revision": "9a4ca0415c7fc1d3023a16650bf1cdce86f8bb59",
        "name": "grt123",
        "datetimeofexecution": "09/13/2023 12:28:46",
        "coordinatesystem": "World",
        "computationtimeinseconds": 48.483961
    },
    "imageinfo": {
        "dimensions": [
            512,
            512,
            140
        ],
        "voxelsize": [
            0.820312,
            0.820312,
            2.5
        ],
        "origin": [
            -228.800003,
            -210.0,
            -379.0
        ],
        "orientation": [
            1.0,
            0.0,
            0.0,
            0.0,
            1.0,
            0.0,
            0.0,
            0.0,
            1.0
        ],
        "seriesuid": "dicom"
    },
    "findings": [
        {
            "id": 0,
            "x": -46.800003000000004,
            "y": -31.0,
            "z": -168.0,
            "probability": 0.9999904632568359,
            "cancerprobability": 0.75371253490448
        },
        {
            "id": 1,
            "x": 73.199997,
            "y": 79.0,
            "z": -191.0,
            "probability": 0.999943733215332,
            "cancerprobability": 0.6662029027938843
        },
        {
            "id": 2,
            "x": 24.199996999999982,
            "y": -47.0,
            "z": -171.0,
            "probability": 0.9999247789382935,
            "cancerprobability": 0.47204485535621643
        },
        {
            "id": 3,
            "x": -82.800003,
            "y": 108.0,
            "z": -159.0,
            "probability": 0.8182298541069031,
            "cancerprobability": 0.004258527886122465
        },
        {
            "id": 4,
            "x": 49.199996999999996,
            "y": -67.0,
            "z": -275.0,
            "probability": 0.7932767271995544,
            "cancerprobability": 0.014665956608951092
        },
        {
            "id": 5,
            "x": 69.19999700000001,
            "y": -63.0,
            "z": -287.0,
            "probability": 0.6662660241127014,
            "cancerprobability": 0.0054708984680473804
        },
        {
            "id": 6,
            "x": 126.199997,
            "y": 28.0,
            "z": -334.0,
            "probability": 0.6302552819252014,
            "cancerprobability": 0.008180161006748676
        },
        {
            "id": 7,
            "x": -62.800003000000004,
            "y": -4.0,
            "z": -102.99999999999999,
            "probability": 0.48750755190849304,
            "cancerprobability": 0.00867788027971983
        },
        {
            "id": 8,
            "x": 73.199997,
            "y": 80.0,
            "z": -215.0,
            "probability": 0.36484286189079285,
            "cancerprobability": 0.03693072870373726
        }
    ],
    "cancerinfo": {
        "casecancerprobability": 0.9574154615402222,
        "referencenoduleids": [
            0,
            1,
            2,
            3,
            4
        ]
    }
}
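
As a minimal sketch of how the two kinds of scores could be read from a report with this layout, the snippet below parses a trimmed copy of the JSON above (first two findings only) and pulls out the case-level probability plus the per-nodule cancer probabilities. The helper name summarize_report is hypothetical and not part of the PR; only the JSON keys come from the report itself.

```python
import json

# Trimmed copy of the report above: only the keys needed here, first two findings.
REPORT_JSON = """
{
    "findings": [
        {"id": 0, "x": -46.800003000000004, "y": -31.0, "z": -168.0,
         "probability": 0.9999904632568359, "cancerprobability": 0.75371253490448},
        {"id": 1, "x": 73.199997, "y": 79.0, "z": -191.0,
         "probability": 0.999943733215332, "cancerprobability": 0.6662029027938843}
    ],
    "cancerinfo": {
        "casecancerprobability": 0.9574154615402222,
        "referencenoduleids": [0, 1]
    }
}
"""


def summarize_report(report: dict) -> tuple:
    """Return (case-level probability, {nodule id: cancer probability}).

    Hypothetical helper for illustration; it only relies on the keys
    visible in the report ("findings", "cancerinfo").
    """
    case_prob = report["cancerinfo"]["casecancerprobability"]
    nodule_probs = {f["id"]: f["cancerprobability"] for f in report["findings"]}
    return case_prob, nodule_probs


case_prob, nodule_probs = summarize_report(json.loads(REPORT_JSON))
print(case_prob)
print(nodule_probs)
```

The single case-level value maps naturally onto one fixed output, while the per-nodule dictionary has a length that depends on the scan.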

@LennyN95 LennyN95 added the Requires MHub-IO Update Only for PR with model suggestions which require extended MHub-IO functionality. label Nov 22, 2023
@LennyN95
Member

NOTE: This PR proposes a model currently using a dynamic length output. We are working on a method to export dynamic length outputs and will likely introduce a @IO.Out.Data.Many decorator (similar to our @IO.Output -> @IO.Outputs decorators).

@LennyN95
Member

This is tricky, since the model outputs quite a detailed report with multiple scores per nodule per scan (see below). Maybe it is better to leave it as is? Or we could add only a single cancer probability ValueOutput for the whole scan (the casecancerprobability value under cancerinfo at the end of the report).

Let's do both, or actually all three :)

  • Export the original JSON file
  • Use a Value Output for the overall score
  • Use a second, dynamic Value Output for all findings.

Dynamic Value Outputs are now supported by the newest MHub-IO release :)
The documentation can be found here. Dynamic value outputs are fully supported in our ReportExporter module through an aggregate function (similar to files), with some additional value operations.

An example implementation would look like this:

from typing import List

from mhubio.core import Instance, InstanceData, IO, Module, ValueOutput

@ValueOutput.Name('lnrisk')
@ValueOutput.Label('Lung Nodule Risk-Score.')
@ValueOutput.Type(int)
@ValueOutput.Description('The predicted risk score for a single lung nodule detected by the algorithm.')
class LNRisk(ValueOutput):
   pass

def getLungNodulesRiskScores(dicom_dir) -> List[int]:
   # ... find lung nodules, and report back an array of risk scores
   return lst_scores

class MyModule(Module):

   @IO.Instance()
   @IO.Input('in_data', 'dicom:mod=ct', the='chest CT image')
   @IO.OutputDatas('lnrisks', LNRisk)
   def task(self, instance: Instance, in_data: InstanceData, lnrisks: LNRisk):

      # one risk score per detected nodule; the number of scores varies per scan
      scores = getLungNodulesRiskScores(in_data.abspath)

      for nodule_i, score in enumerate(scores):

         # create value output instance and set the value (we can also modify the description)
         lnrisk = LNRisk()
         lnrisk.description += f" (for nodule {nodule_i})"
         lnrisk.value = score

         # add to collection
         lnrisks.add(lnrisk)

@LennyN95 LennyN95 added +Model: ACTION REQUIRED and removed Requires MHub-IO Update Only for PR with model suggestions which require extended MHub-IO functionality. labels Nov 23, 2023
@silvandeleemput
Contributor Author

@LennyN95 I have added the case level score and the dynamic scores per finding.
I also found that it might be useful for the dynamic scores to set the metadata for associated values (like position and id) per output value, like so:

for finding in results_dict["findings"]:
    nodule_cancer_prob = LNCancerProb()
    nodule_cancer_prob.meta = Meta(id=finding['id'], x=finding['x'], y=finding['y'], z=finding['z'])
    nodule_cancer_prob.description += f" (for nodule {finding['id']} at location ({finding['x']}, {finding['y']}, {finding['z']}))"
    nodule_cancer_prob.value = finding["cancerprobability"]
    lncancerprobs.add(nodule_cancer_prob)

Setting the metadata in this way helps with debugging as well, with debug output like:

├── lncancerprob [Lung Nodule cancer probability score.]
│   The predicted cancer probability score for a single lung nodule detected by the algorithm (for nodule 8 at location (73.199997, 80.0, -215.0))
│   └── Lung Nodule cancer probability score. (0.03693072870373726)
│   ├── id: 8
│   ├── x: 73.199997
│   ├── y: 80.0
│   └── z: -215.0

Do you think this would be of added value?
Furthermore, could you give some feedback on the general implementation of the output values, i.e. is it satisfactory?

@LennyN95
Member

Adding the ID (and coordinates) to the metadata is excellent (because then, when exporting the report, you could technically filter by these values).

However, metadata is used to query files/data and is not available in the report exporter. So if a value is to be exportable, it needs its own value output. We could of course also implement a directive to export data metadata to the report. However, I find that this could create confusion about the ReportExporter, because it becomes less clear when information is stored in metadata and when it is stored as a value output.
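
To illustrate the point that every exportable field needs its own value, here is a rough, MHub-independent sketch that flattens one finding into separate value records instead of attaching the coordinates as metadata. ValueRecord, findings_to_value_records, and the nodule_<id>_<field> naming scheme are all hypothetical stand-ins chosen for this example; they are not MHub-IO classes or conventions.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ValueRecord:
    """Hypothetical stand-in for one exported value (not an MHub-IO class)."""
    name: str
    value: Union[int, float]
    description: str


def findings_to_value_records(findings: List[dict]) -> List[ValueRecord]:
    """Emit one record per exportable field of each finding."""
    records = []
    for finding in findings:
        for key in ("x", "y", "z", "cancerprobability"):
            records.append(ValueRecord(
                name=f"nodule_{finding['id']}_{key}",  # assumed naming scheme
                value=finding[key],
                description=f"{key} of nodule {finding['id']}",
            ))
    return records


# Finding 8 from the report discussed in this thread.
findings = [{"id": 8, "x": 73.199997, "y": 80.0, "z": -215.0,
             "cancerprobability": 0.03693072870373726}]
records = findings_to_value_records(findings)
for record in records:
    print(record.name, record.value)
```

The trade-off is more output values per nodule in exchange for every field being visible to an exporter that ignores metadata.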

Six review comments on models/gc_grt123_lung_cancer/meta.json (outdated, resolved)
@LennyN95
Member

Note: test passed (09.01.2024).

DICOM case from NLST on IDC: 
https://portal.imaging.datacommons.cancer.gov/explore/filters/?collection_id=nlst

Case ID: 100002
StudyInstanceUID: 1.2.840.113654.2.55.68425808326883186792123057288612355322
SeriesInstanceUID: 1.2.840.113654.2.55.229650531101716203536241646069123704792

s5cmd --no-sign-request --endpoint-url https://s3.amazonaws.com cp 's3://idc-open-data/b22ed5a6-ad69-4b00-ba26-ae75e96345f8/*' .

Expected output:
---------------
grt123_lung_cancer_findings.json

├── lncancerprob [Lung Nodule cancer probability score.]
│   The predicted cancer probability score for a single lung nodule detected by the algorithm (for nodule 0 at location (112.39999399999999, -92.0, -260.545013))
│   └── Lung Nodule cancer probability score. (0.01483230385929346)
│   ├── id: 0
│   ├── x: 112.39999399999999
│   ├── y: -92.0
│   └── z: -260.545013
├── lncancerprob [Lung Nodule cancer probability score.]
│   The predicted cancer probability score for a single lung nodule detected by the algorithm (for nodule 1 at location (-47.60000600000001, 75.0, -245.54501299999998))
│   └── Lung Nodule cancer probability score. (0.01104212086647749)
│   ├── id: 1
│   ├── x: -47.60000600000001
│   ├── y: 75.0
│   └── z: -245.54501299999998
├── clcancerprob [Case level cancer probability score.]
│   Case level probability score
│   └── Case level cancer probability score. (0.025710642337799072)
│   ├── min: 0.0
│   ├── max: 1.0
│   └── type: probability

@silvandeleemput
Contributor Author

We have updated the meta.json; please have a look and let us know if you agree. There were no changes to the code, so the test should still pass. We should be able to move on with this.

@LennyN95
Member

There's one last problem showing up here:

When running the model, the following print out is not captured and pollutes the console: (diag-)image_loader not found, loading dicom will not be possible, caused by this line.

The problem is that the print statement is executed at import time, which is before we can technically start capturing prints.

You may be able to resolve the issue by moving the import inside the task() method of the runner Module.
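
The effect of deferring the import can be sketched without MHub at all. The example below writes a small stand-in module that prints at import time, then imports it inside a function while stdout is redirected. It assumes the capture mechanism behaves like contextlib.redirect_stdout, which may differ from what MHub-IO actually does; noisy_model_stub and run_task are hypothetical names used only for this illustration.

```python
import contextlib
import importlib
import io
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for a dependency that prints when it is imported.
NOISY_SOURCE = (
    'print("image_loader not found, loading dicom will not be possible")\n'
    "\n"
    "def predict():\n"
    "    return 0.5\n"
)


def run_task(module_dir: str) -> tuple:
    """Import the noisy module inside the task, after capturing has started."""
    buffer = io.StringIO()
    sys.path.insert(0, module_dir)
    # Drop any cached copy so the import (and its print) runs again on reruns.
    sys.modules.pop("noisy_model_stub", None)
    try:
        with contextlib.redirect_stdout(buffer):
            importlib.invalidate_caches()
            noisy = importlib.import_module("noisy_model_stub")  # deferred import
            result = noisy.predict()
    finally:
        sys.path.remove(module_dir)
    return result, buffer.getvalue()


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "noisy_model_stub.py").write_text(NOISY_SOURCE)
    result, captured = run_task(tmp)

# The import-time message ended up in the buffer instead of on the console.
print(repr(captured))
```

Had the import been placed at the top of the file, the message would have been printed before any redirection was active.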

For the upcoming models, please always check that the print-out is clean and that the output is consistent when running in normal mode (no --print or --debug); any uncaptured output like this one will break the appearance.

@silvandeleemput
Contributor Author

@LennyN95 Good catch. The print statement issue has been addressed by doing as you suggested. I missed the minor glitch when running it in normal mode the first time I checked it. I'll be more aware of the issue with the upcoming models.

Member

@LennyN95 LennyN95 left a comment

Requested changes addressed, tests passed.

Member

@LennyN95 LennyN95 left a comment

Two more changes required to enhance the integration into the MHub model repository and UX when searching / browsing models on mhub.ai/models.

Three review comments on models/gc_grt123_lung_cancer/meta.json (outdated, resolved)
@LennyN95 LennyN95 closed this Jan 16, 2024
@LennyN95 LennyN95 reopened this Jan 16, 2024
@LennyN95 LennyN95 merged commit ada037d into MHubAI:main Feb 28, 2024
1 check passed
@silvandeleemput silvandeleemput deleted the m-grt123 branch March 5, 2024 13:24