Image Segmentation: Data Preprocessing Verification -- Checksum fails #545

Closed

nmcglo opened this issue Apr 19, 2022 · 7 comments

@nmcglo commented Apr 19, 2022

The Image Segmentation (PyTorch UNet3D) benchmark relies on the KITS19 dataset. I've followed the instructions from the KITS19 dataset repository for downloading the dataset and have been trying to run the data preprocessing script (https://github.com/mlcommons/training/blob/master/image_segmentation/pytorch/preprocess_dataset.py).

All cases preprocess just fine, but I get an error when the verify_dataset() function is called: at least one case (case 00043 specifically) has an md5 checksum that does not match the expected value from the MLCommons image segmentation repo (https://github.com/mlcommons/training/blob/master/image_segmentation/pytorch/checksum.json). I haven't exhaustively checked every file, but when I run my own md5 hash on the case files, a random sample of about 10 all match the expected values, while the hash for case 00043 does not.

I have downloaded the dataset using both download scripts a total of 7 times and get exactly the same invalid checksum each time, so it isn't a corrupted download (at least not on my end).
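
For reference, here is a minimal sketch of the manual check described above (assuming checksum.json is a flat map from preprocessed file names like case_00043_x.npy to md5 hex digests, which is what the verify_dataset() assertion suggests):

```python
import hashlib
import json
import os

# Sketch of the manual check (assumption: checksum.json maps preprocessed
# file names, e.g. "case_00043_x.npy", to md5 hex digests).
def check_volume(results_dir, volume, checksum_path="checksum.json"):
    with open(checksum_path) as f:
        expected = json.load(f)
    md5 = hashlib.md5()
    with open(os.path.join(results_dir, volume), "rb") as f:
        # Hash in 1 MiB chunks so large .npy volumes need not fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    digest = md5.hexdigest()
    print(f"{volume}: computed {digest}, expected {expected.get(volume)}")
    return digest == expected.get(volume)

# e.g. check_volume("/data/results", "case_00043_x.npy")
```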

@mmarcinkiewicz (Contributor)

Hi @nmcglohon, is this still a problem for you? I'm going to take a look and try to reproduce it early next week.

@mmarcinkiewicz (Contributor)

I am able to reproduce it. I'll reach out to the dataset owners to ask whether anything has changed.

@nmcglo (Author) commented Dec 2, 2022

Thanks, and apologies for the delayed response; I was away last month.

@sepzjh commented Sep 24, 2023

I have the same problem: I get an error when the verify_dataset() function is called. Has this issue been resolved, or can I skip the function? For example, would a local edit like the sketch below be safe?
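
A hypothetical edit to preprocess_dataset.py (the script calls verify_dataset(args.results_dir) at top level, per the traceback later in this thread) that downgrades the failure to a warning; note that a mismatched file could be genuinely corrupted rather than a stale reference checksum:

```python
# Hypothetical local edit to preprocess_dataset.py: wrap the existing
# top-level verification call so a checksum mismatch prints a warning
# instead of aborting the run. Use with care: the mismatched file may
# be genuinely corrupted, not just a stale entry in checksum.json.
try:
    verify_dataset(args.results_dir)
except AssertionError as err:
    print(f"WARNING: {err} Continuing without full verification.")
```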

@hiwotadese (Contributor)

Closing because we are dropping UNet3D.

@wahabk commented Jul 30, 2024

I have run into this error as well during dataset verification:

```
Case 299. Skipped.
Mean value: -1.850000023841858, std: 0.9800000190734863, d: 256.0, h: 333.0, w: 333.0
  0%|▊         | 2/420 [00:00<01:08,  6.12it/s]
Traceback (most recent call last):
  File "preprocess_dataset.py", line 147, in <module>
    verify_dataset(args.results_dir)
  File "preprocess_dataset.py", line 132, in verify_dataset
    assert md5_hash == source[volume], f"Invalid hash for {volume}."
AssertionError: Invalid hash for case_00183_x.npy.
```

This time it failed on case_00183_x.npy.
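
In case it helps others debugging this, here is a sketch that would list every mismatching file rather than stopping at the first assertion (assuming checksum.json is the filename-to-md5 map read as `source` in verify_dataset, as implied by the traceback above):

```python
import hashlib
import json
import os

# Sketch (assumption: checksum.json is the filename -> md5 map loaded as
# `source` in verify_dataset): report all mismatches instead of asserting.
def report_all_mismatches(results_dir, checksum_path="checksum.json"):
    with open(checksum_path) as f:
        source = json.load(f)
    for volume, ref in sorted(source.items()):
        path = os.path.join(results_dir, volume)
        if not os.path.isfile(path):
            print(f"{volume}: missing")
            continue
        md5 = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                md5.update(chunk)
        if md5.hexdigest() != ref:
            print(f"{volume}: computed {md5.hexdigest()}, expected {ref}")
```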

@hiwotadese Can I please ask why UNet3D is being dropped? Which parts of the MLCommons WG work on training and inference?

@ShriyaPalsamudram (Contributor)

Multiple reasons were taken into consideration before dropping UNet3D from the training benchmark suite. If you are interested, the training WG meets weekly, and decisions about which benchmarks to keep and which to drop are discussed in that forum.

This table lists all the current benchmarks for Training v4.1.

Note that UNet3D is still part of the inference benchmark suite, as listed in this table.
