
Error When Running image_train.py #3

Open
hwan-sig opened this issue Dec 30, 2024 · 3 comments
Comments

@hwan-sig

Description:
I attempted to run the following command to train the model:

OPENAI_LOGDIR='./Logs' PYTHONPATH='.' CUDA_VISIBLE_DEVICES=0 python scripts/image_train.py --optimizer adamw --image_size 32 --num_channels 128 --num_res_blocks 3 --diffusion_steps 1000 --noise_schedule cosine --lr 1e-4 --batch_size 128 --learn_sigma True --eps_scaler=0 --lr_anneal_steps 100000 &

However, the script produced an error as shown in the attached screenshot.
Could you help identify the cause of this issue and provide guidance on how to resolve it? Thank you!

[Screenshot attached: 2024-12-30 2:11 PM, showing the error output]
@forever208
Owner

@hwan-sig It is very likely that the training dataset is not clean. Which dataset are you training on?

@hwan-sig
Author

hwan-sig commented Jan 2, 2025

I used the CIFAR-10 dataset.
When I run the code without --learn_sigma True, it works fine (it reports all of the loss, mse, and vb values).
However, with --learn_sigma True, the error occurs. Thank you!
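
For context, in an improved-diffusion-style codebase (which this repo appears to follow; this is an assumption, not something stated in the thread), --learn_sigma True makes the UNet output twice as many channels and adds a variational-bound (vb) term on top of the plain MSE loss, so NaNs that only appear with this flag usually originate in that extra term. A minimal sketch of what the flag changes:

```python
# Rough sketch, assuming an improved-diffusion-style setup; the function names
# and the vb weighting below are illustrative, not taken from this repository.
import torch as th

def split_learned_sigma(model_output, channels):
    # With learn_sigma, the model emits 2*C channels: the eps prediction and a
    # per-pixel coefficient used to interpolate the log-variance.
    eps_pred, var_coeff = th.split(model_output, channels, dim=1)
    return eps_pred, var_coeff

def hybrid_loss(mse_term, vb_term, vb_weight=1.0):
    # The total loss adds the vb term to the MSE term; a NaN in vb (e.g. from a
    # degenerate variance or a corrupted input batch) makes the whole loss NaN.
    return mse_term + vb_weight * vb_term
```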

@forever208
Owner

forever208 commented Jan 3, 2025

@hwan-sig It is a bit strange that this problem happens on CIFAR-10. Does the NaN occur immediately at the beginning of training? Can you also print out the eps and vlb losses (when using --learn_sigma) for debugging and check whether the loss scale is unusual? Also, consider re-downloading the dataset.
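
One way to do that printing (a minimal sketch, assuming the training loop exposes a per-term losses dict with keys such as "loss", "mse", and "vb", as in improved-diffusion-style training loops; the helper below is hypothetical):

```python
import torch as th

def report_losses(losses, step):
    # Print the mean of each loss term and stop as soon as any term goes
    # non-finite, so the first bad batch/step can be inspected.
    for name, term in losses.items():
        value = term.mean().item()
        print(f"step {step}: {name} = {value:.6f}")
        if not th.isfinite(term).all():
            raise RuntimeError(f"non-finite values in '{name}' at step {step}")
```

Calling this right after the losses dict is computed in the training loop (where exactly depends on the codebase) should show whether the vb term blows up immediately or only after some number of steps.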
