
Unable to reproduce reported performance on AAPM16 dataset #3

JixiangChen-Jimmy opened this issue Feb 5, 2025 · 3 comments

@JixiangChen-Jimmy

Dear authors, thank you so much for your fantastic work and for open-sourcing the code.
I’m reaching out because I’m having trouble reproducing the 18-view AAPM16 results from Table 2 (reported PSNR/SSIM: 37.91/0.9458). Following the code closely, my best results so far are 34.80/0.9165. My only modifications were changing the batch size to 16 and training on a single NVIDIA 3090 GPU.

Could you share any tips for aligning closer to the paper’s performance? For example:

  • Would you recommend adjusting hyperparameters (e.g., learning rate) when scaling down the batch size? (A common heuristic is sketched after this list.)
  • Are there any subtle implementation details or data prep steps that might impact results?
  • Environment/package nuances: If there are specific versions or environment details critical for reproducibility, I’d love to confirm!
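
On the batch-size point above, one common heuristic (not something stated in the paper) is the linear scaling rule: scale the learning rate in proportion to the batch size. A minimal sketch, where the base values are placeholders rather than the repository's actual hyperparameters:

```python
# Hypothetical sketch of the linear scaling rule; base_lr and base_batch_size
# below are placeholders, not the paper's actual settings.

def scaled_lr(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Scale the learning rate linearly with the batch size."""
    return base_lr * new_batch_size / base_batch_size

# e.g. if the reference run used lr=1e-4 at batch size 32 (assumed),
# a single-GPU run at batch size 16 would suggest:
print(scaled_lr(1e-4, base_batch_size=32, new_batch_size=16))  # 5e-05
```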
@longzilicart (Owner)


Hi,

Thank you for your attention. I agree that the impact of hyperparameters should generally be minimal, usually less than 0.5 dB across settings if training is stable. It sounds like a step might have been missed in your process. Specifically, as described in the article, "we finetune each model for another ten epochs on AAPM dataset to bridge the domain gap following the same setting." We do this because the AAPM dataset is relatively small, and many networks require more data for effective training.
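
For concreteness, that finetuning step amounts to loading the checkpoint pretrained on DeepLesion and continuing training on AAPM for about ten more epochs. A minimal PyTorch sketch, where the tiny model, random tensors, and checkpoint path are all illustrative placeholders rather than the repository's actual API:

```python
# Hedged sketch of the finetuning step: start from a pretrained checkpoint,
# then train ~10 more epochs on AAPM. Model and data below are placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 1, 3, padding=1))   # stand-in for the real network
# model.load_state_dict(torch.load("pretrained_deeplesion.pth", map_location="cpu"))  # illustrative path

# stand-in for AAPM16 (sparse-view input, full-view target) pairs
aapm_loader = DataLoader(TensorDataset(torch.randn(64, 1, 64, 64),
                                       torch.randn(64, 1, 64, 64)), batch_size=16)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # a reduced LR is common when finetuning
criterion = nn.MSELoss()

for epoch in range(10):                                    # "another ten epochs" on AAPM
    for sparse_view, full_view in aapm_loader:
        loss = criterion(model(sparse_view), full_view)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```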

Please let me know if this step was overlooked in your procedure.

Best regards.

@JixiangChen-Jimmy (Author)

Thank you so much for the quick and helpful clarification—this makes a lot of sense!

To confirm: The results in Table 2 are based on a pre-trained model (DeepLesion) → finetuned on AAPM16 for 10 epochs, whereas our current implementation trained directly on AAPM16 from scratch (without pre-training). This explains the performance gap we observed!

Just to ensure we’re aligned:

  1. Did your team ever benchmark the model’s performance without pre-training (i.e., training solely on AAPM16)? If so, do our results (~34.8 PSNR) fall within the expected range for this setup?
  2. For future reference, would you recommend domain-specific pre-training as a critical step for smaller datasets like AAPM16?

Thanks again for your patience and guidance—it’s been a huge help!

@longzilicart (Owner)


Hi, you are welcome. I think we are aligned.

  1. I've just checked the logs but could not find a run on AAPM without pretraining, so I can't say for certain whether your result is in the expected range, though it seems plausible. For reference, here is a log on another dataset with N = 36. In any case, you can report what you got.
  2. Yes. Just to note, I only did it this way to save computational cost, since running so many methods on two datasets was a heavy workload for me at the time.

[Attached image: training log on another dataset with N = 36]

By the way, I personally haven't kept up with this area for some time, but I do have some findings that may benefit your study:

  1. A global skip connection can significantly enhance performance for all methods (a minimal wrapper is sketched after this list). I also found that adding distillation on a smaller dataset may not be as useful, because the teacher itself may not be well trained; in other words, distillation may not be a good strategy for a quick performance gain.
  2. Keep in mind that the experimental setting in my paper is somewhat of a toy setup (all datasets share the same simulation process, so their distributions match), so pretraining on a large dataset is effective; it may be less useful when the geometries of different CT scanners differ, which is the more realistic setting. In fact, we found that directly evaluating all methods zero-shot on AAPM still works well and surpasses the model trained only on AAPM. Just to mention, I am not sure such a data-centric setting aligns with real clinical use.
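
Regarding the global skip mentioned in point 1, here is a minimal sketch of the idea, assuming the backbone predicts a residual that is added back to the input; the wrapper and placeholder backbone below are illustrative, not the repository's code:

```python
# Minimal sketch of a global skip (global residual) connection around a backbone.
import torch
import torch.nn as nn

class GlobalSkip(nn.Module):
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # the network only learns a correction; the input is added back at the output
        return x + self.backbone(x)

# usage with a placeholder backbone
net = GlobalSkip(nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                               nn.Conv2d(16, 1, 3, padding=1)))
out = net(torch.randn(2, 1, 64, 64))   # output has the same shape as the input
```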

Best regards.
