
Unable to reproduce reported performance on AAPM16 dataset #3

JixiangChen-Jimmy opened this issue Feb 5, 2025 · 3 comments

@JixiangChen-Jimmy

Dear authors, thank you so much for your fantastic work and for open-sourcing the code.
I’m reaching out because I’m having trouble reproducing the 18-view AAPM16 results from Table 2 (reported PSNR/SSIM: 37.91/0.9458). Following the code closely, my best results so far are 34.80/0.9165. My only modifications were changing the batch size to 16 and training on a single NVIDIA 3090 GPU.

Could you share any tips for aligning closer to the paper’s performance? For example:

  • Would you recommend adjusting hyperparameters (e.g., learning rate) when scaling down the batch size? (A common heuristic is sketched after this list.)
  • Are there any subtle implementation details or data prep steps that might impact results?
  • Environment/package nuances: If there are specific versions or environment details critical for reproducibility, I’d love to confirm!
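
On the batch-size point above, one common heuristic (not something stated in the paper) is the linear scaling rule: scale the learning rate in proportion to the batch size. A minimal sketch, where the base values are placeholders rather than the repository's actual hyperparameters:

```python
# Hypothetical sketch of the linear scaling rule; base_lr and base_batch_size
# below are placeholders, not the paper's actual settings.

def scaled_lr(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Scale the learning rate linearly with the batch size."""
    return base_lr * new_batch_size / base_batch_size

# e.g. if the reference run used lr=1e-4 at batch size 32 (assumed),
# a single-GPU run at batch size 16 would suggest:
print(scaled_lr(1e-4, base_batch_size=32, new_batch_size=16))  # 5e-05
```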
@longzilicart (Owner)


Hi,

Thank you for your attention. I agree that the impact of hyperparameters should generally be minimal, usually less than 0.5 dB across settings if training is stable. It sounds like a step might have been missed in your process. Specifically, as described in the article, "we finetune each model for another ten epochs on AAPM dataset to bridge the domain gap following the same setting." We do this because the AAPM dataset is relatively small, and many networks require more data for effective training.
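
For concreteness, that finetuning step amounts to loading the checkpoint pretrained on DeepLesion and continuing training on AAPM for about ten more epochs. A minimal PyTorch sketch, where the tiny model, random tensors, and checkpoint path are all illustrative placeholders rather than the repository's actual API:

```python
# Hedged sketch of the finetuning step: start from a pretrained checkpoint,
# then train ~10 more epochs on AAPM. Model and data below are placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 1, 3, padding=1))   # stand-in for the real network
# model.load_state_dict(torch.load("pretrained_deeplesion.pth", map_location="cpu"))  # illustrative path

# stand-in for AAPM16 (sparse-view input, full-view target) pairs
aapm_loader = DataLoader(TensorDataset(torch.randn(64, 1, 64, 64),
                                       torch.randn(64, 1, 64, 64)), batch_size=16)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # a reduced LR is common when finetuning
criterion = nn.MSELoss()

for epoch in range(10):                                    # "another ten epochs" on AAPM
    for sparse_view, full_view in aapm_loader:
        loss = criterion(model(sparse_view), full_view)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```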

Please let me know if this step was overlooked in your procedure.

Best regards.

@JixiangChen-Jimmy (Author)

Thank you so much for the quick and helpful clarification—this makes a lot of sense!

To confirm: The results in Table 2 are based on a pre-trained model (DeepLesion) → finetuned on AAPM16 for 10 epochs, whereas our current implementation trained directly on AAPM16 from scratch (without pre-training). This explains the performance gap we observed!

Just to ensure we’re aligned:

  1. Did your team ever benchmark the model’s performance without pre-training (i.e., training solely on AAPM16)? If so, do our results (~34.8 PSNR) fall within the expected range for this setup?
  2. For future reference, would you recommend domain-specific pre-training as a critical step for smaller datasets like AAPM16?

Thanks again for your patience and guidance—it’s been a huge help!

@longzilicart (Owner)


Hi, you are welcome. I think we are aligned.

  1. I've just checked the logs but could not find a run on AAPM without pretraining, so I can't say for certain whether your result is in the expected range, though it seems plausible. For reference, here is a log on another dataset with N = 36. In any case, you can report what you got.
  2. Yes. Just to note, I only did it this way to save computational cost, since running so many methods on two datasets was a heavy workload for me at the time.

[Attached image: training log on another dataset with N = 36]

By the way, I personally haven't kept up with this area for some time, but I do have some findings that may benefit your study:

  1. A global skip connection can significantly enhance performance for all methods (a minimal wrapper is sketched after this list). I also found that adding distillation on a smaller dataset may not be as useful, because the teacher itself may not be well trained; in other words, distillation may not be a good strategy for a quick performance gain.
  2. Keep in mind that the experimental setting in my paper is somewhat of a toy setup (all datasets share the same simulation process, so their distributions match), so pretraining on a large dataset is effective; it may be less useful when the geometries of different CT scanners differ, which is the more realistic setting. In fact, we found that directly evaluating all methods zero-shot on AAPM still works well and surpasses the model trained only on AAPM. Just to mention, I am not sure such a data-centric setting aligns with real clinical use.
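
Regarding the global skip mentioned in point 1, here is a minimal sketch of the idea, assuming the backbone predicts a residual that is added back to the input; the wrapper and placeholder backbone below are illustrative, not the repository's code:

```python
# Minimal sketch of a global skip (global residual) connection around a backbone.
import torch
import torch.nn as nn

class GlobalSkip(nn.Module):
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # the network only learns a correction; the input is added back at the output
        return x + self.backbone(x)

# usage with a placeholder backbone
net = GlobalSkip(nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                               nn.Conv2d(16, 1, 3, padding=1)))
out = net(torch.randn(2, 1, 64, 64))   # output has the same shape as the input
```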

Best regards.
