
RuntimeError: Error(s) in loading state_dict for DistributedDataParallel #7

Open
Wanyidon opened this issue Dec 26, 2023 · 9 comments


@Wanyidon

Thank you very much for your great work on open source. I encountered the following problem when training the model with the training pipeline, dataset, and weights you provided:
RuntimeError: Error(s) in loading state_dict for DistributedDataParallel: Unexpected key(s) in state_dict: "module.text_encoder.clip_text_model.text_model.embeddings.position_ids".
I sincerely hope to receive your reply.
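For reference, the unexpected key is the `position_ids` buffer, which is not a learned parameter, so a safer workaround than `strict=False` is to drop just that key from the checkpoint before loading. A minimal sketch (the checkpoint filename and state-dict layout are assumptions, not confirmed by the maintainers):

```python
# 'position_ids' is a non-learnable buffer that newer 'transformers'
# releases no longer register on the text model, so it can be dropped
# from the checkpoint without losing any trained weights.
def strip_position_ids(state_dict):
    """Return a copy of state_dict without any '*position_ids' buffer keys."""
    return {k: v for k, v in state_dict.items()
            if not k.endswith("position_ids")}

# Usage sketch with torch (paths and dict layout are assumptions):
# ckpt = torch.load("SegVol_v1.pth", map_location="cpu")
# model.load_state_dict(strip_position_ids(ckpt), strict=True)
```

This keeps `strict=True`, so any genuinely missing or mismatched parameter still raises an error.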

@Yuxin-Du-Lab
Collaborator

Yuxin-Du-Lab commented Dec 27, 2023

Could you provide more details about the version of your 'transformers' package? We recommend running SegVol with 'transformers==4.18.0'.

@Wanyidon
Author

Thank you very much for your reply. I resolved the issue following your suggestion, and I also found that passing strict=False to load_state_dict works around it.
However, during training only gpu7 is heavily utilized (about 60 GB of GPU memory), while gpu0-6 each use only about 5 GB.

@Yuxin-Du-Lab
Collaborator

I strongly recommend not using 'strict=False' in load_state_dict: any parameter whose key is skipped silently keeps its random initialization, which can quietly degrade the model.
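A tiny self-contained demonstration of this risk (using a toy `nn.Linear`, not SegVol itself): when the checkpoint is missing a key, `strict=False` skips it without an error and the parameter stays at its random initial value.

```python
import torch
import torch.nn as nn

model = nn.Linear(2, 2)
before = model.weight.clone()

# Checkpoint missing the 'weight' key: strict=False silently skips it.
partial = {"bias": torch.zeros(2)}
result = model.load_state_dict(partial, strict=False)

assert result.missing_keys == ["weight"]   # skipped without raising
assert torch.equal(model.weight, before)   # still randomly initialized
```

The `load_state_dict` return value lists `missing_keys` and `unexpected_keys`, so if you must use `strict=False`, at least inspect that result to see exactly which parameters were not loaded.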
As for the GPU memory imbalance, I suggest double-checking whether other programs are running or there are unkilled zombie processes in the background.
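Beyond stray processes, two common causes of one rank grabbing far more memory under DistributedDataParallel are every process defaulting to the same device and checkpoints being deserialized onto a single GPU. A general PyTorch sketch, offered as an assumption rather than a confirmed diagnosis of this thread's imbalance:

```python
import os
import torch

# 1) Pin each process to its own device. LOCAL_RANK is set by torchrun;
#    defaulting to 0 here is only for single-process testing.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
device = torch.device(f"cuda:{local_rank}")

# 2) Load checkpoints with an explicit map_location; otherwise every rank
#    may deserialize the tensors onto whichever GPU saved the checkpoint.
# ckpt = torch.load("SegVol_v1.pth", map_location=device)
```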

@Wanyidon
Author

Thank you for your advice. I have checked my program, and my current GPU status is as shown in the screenshot below. I wonder if this is correct.
[screenshot: per-GPU memory usage]

@Yuxin-Du-Lab
Collaborator

Have you fixed the bug yet? I'm sorry I can't reproduce it. I don't know if anyone else has run into a similar situation.🤦

@kennyWJB

kennyWJB commented Jun 5, 2024

Hello, I am trying to reproduce the demo and hit the same error. I am on Windows with cuda==12.2, pytorch==2.0.1, and monai==1.3.1 (because 0.9.0 failed to install); everything else matches the recommended versions. For now I am also working around the problem with strict=False.

@Yuxin-Du-Lab
Collaborator

Please make sure you are on 'transformers==4.18.0' and that you are loading SegVol_v1.pth.

@mrokuss

mrokuss commented Jun 13, 2024

transformers==4.25.1 also works and installs without issues via pip ;)

@Yuxin-Du-Lab
Collaborator


The problem may be that the batch_size is too large relative to the dataset size. Try reducing the batch_size.
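For context, the batch_size passed to a DataLoader under DDP is per process, so the effective global batch is batch_size × world_size, and reducing the per-process value reduces activation memory on every GPU. A minimal single-process sketch (the tensor shapes and batch sizes are illustrative, not SegVol's actual configuration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative dataset; under DDP each process would also wrap it in a
# DistributedSampler so ranks see disjoint shards:
# from torch.utils.data.distributed import DistributedSampler
dataset = TensorDataset(torch.randn(64, 3))

# batch_size here is per process; halving it (e.g. 8 -> 4) roughly halves
# the activation memory each GPU needs per step.
loader = DataLoader(dataset, batch_size=4, shuffle=False)

(batch,) = next(iter(loader))
assert batch.shape == (4, 3)
assert len(loader) == 16   # 64 samples / 4 per batch
```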
