
RuntimeError: Error(s) in loading state_dict for DistributedDataParallel #7

Open
Wanyidon opened this issue Dec 26, 2023 · 9 comments


@Wanyidon

Thank you very much for your great work on open source. I encountered the following problem when training the model with the training pipeline, dataset, and weights you provided:
RuntimeError: Error(s) in loading state_dict for DistributedDataParallel: Unexpected key(s) in state_dict: "module.text_encoder.clip_text_model.text_model.embeddings.position_ids".
I sincerely hope to receive your reply.
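For reference, the unexpected key is the `position_ids` buffer, which is not a learned parameter, so a safer workaround than `strict=False` is to drop just that key from the checkpoint before loading. A minimal sketch (the checkpoint filename and state-dict layout are assumptions, not confirmed by the maintainers):

```python
# 'position_ids' is a non-learnable buffer that newer 'transformers'
# releases no longer register on the text model, so it can be dropped
# from the checkpoint without losing any trained weights.
def strip_position_ids(state_dict):
    """Return a copy of state_dict without any '*position_ids' buffer keys."""
    return {k: v for k, v in state_dict.items()
            if not k.endswith("position_ids")}

# Usage sketch with torch (paths and dict layout are assumptions):
# ckpt = torch.load("SegVol_v1.pth", map_location="cpu")
# model.load_state_dict(strip_position_ids(ckpt), strict=True)
```

This keeps `strict=True`, so any genuinely missing or mismatched parameter still raises an error.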

@Yuxin-Du-Lab
Collaborator

Yuxin-Du-Lab commented Dec 27, 2023

Could you provide more details about the version of your 'transformers' package? We recommend running SegVol with 'transformers==4.18.0'.

@Wanyidon
Author

Thank you very much for your reply. I resolved the issue following your suggestion, and I also found that passing strict=False to load_state_dict works around it.
However, during training only gpu7 is heavily utilized (about 60 GB of GPU memory), while gpu0-6 each use only about 5 GB.

@Yuxin-Du-Lab
Collaborator

I strongly recommend not using 'strict=False' in load_state_dict: any parameter whose key is skipped silently keeps its random initialization, which can quietly degrade the model.
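A tiny self-contained demonstration of this risk (using a toy `nn.Linear`, not SegVol itself): when the checkpoint is missing a key, `strict=False` skips it without an error and the parameter stays at its random initial value.

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 2)
before = model.weight.clone()

# Checkpoint missing the 'weight' key: strict=False silently skips it.
partial = {"bias": torch.zeros(2)}
result = model.load_state_dict(partial, strict=False)

assert result.missing_keys == ["weight"]   # skipped without raising
assert torch.equal(model.weight, before)   # still randomly initialized
```

The `load_state_dict` return value lists `missing_keys` and `unexpected_keys`, so if you must use `strict=False`, at least inspect that result to see exactly which parameters were not loaded.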
As for the GPU memory imbalance, I suggest double-checking whether other programs are running or there are unkilled zombie processes in the background.
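Beyond stray processes, two common causes of one rank grabbing far more memory under DistributedDataParallel are every process defaulting to the same device and checkpoints being deserialized onto a single GPU. A general PyTorch sketch, offered as an assumption rather than a confirmed diagnosis of this thread's imbalance:

```python
import os
import torch

# 1) Pin each process to its own device. LOCAL_RANK is set by torchrun;
#    defaulting to 0 here is only for single-process testing.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
device = torch.device(f"cuda:{local_rank}")

# 2) Load checkpoints with an explicit map_location; otherwise every rank
#    may deserialize the tensors onto whichever GPU saved the checkpoint.
# ckpt = torch.load("SegVol_v1.pth", map_location=device)
```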

@Wanyidon
Author

Thank you for your advice. I have checked my program, and my current GPU status is as shown in the screenshot below. I wonder if this is correct.
[screenshot: per-GPU memory usage]

@Yuxin-Du-Lab
Collaborator

Have you fixed the bug yet? I'm sorry I can't reproduce it. I don't know if anyone else has run into a similar situation.🤦

@kennyWJB

kennyWJB commented Jun 5, 2024

Hello, I am trying to reproduce the demo and hit the same error. I am on Windows with cuda==12.2, pytorch==2.0.1, and monai==1.3.1 (because 0.9.0 failed to install); everything else matches the recommended versions. For now I am also working around the problem with strict=False.

@Yuxin-Du-Lab
Collaborator

Please make sure you are on 'transformers==4.18.0' and that you are loading SegVol_v1.pth.

@mrokuss

mrokuss commented Jun 13, 2024

transformers==4.25.1 also works and installs without issues via pip ;)

@Yuxin-Du-Lab
Collaborator


The problem may be that the batch_size is too large relative to the dataset size. Try reducing the batch_size.
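For context, the batch_size passed to a DataLoader under DDP is per process, so the effective global batch is batch_size × world_size, and reducing the per-process value reduces activation memory on every GPU. A minimal single-process sketch (the tensor shapes and batch sizes are illustrative, not SegVol's actual configuration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative dataset; under DDP each process would also wrap it in a
# DistributedSampler so ranks see disjoint shards:
# from torch.utils.data.distributed import DistributedSampler
dataset = TensorDataset(torch.randn(64, 3))

# batch_size here is per process; halving it (e.g. 8 -> 4) roughly halves
# the activation memory each GPU needs per step.
loader = DataLoader(dataset, batch_size=4, shuffle=False)

(batch,) = next(iter(loader))
assert batch.shape == (4, 3)
assert len(loader) == 16   # 64 samples / 4 per batch
```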
