Deployment fails on two V100s #362

Cocoalate · 2023-08-16T10:02:17Z

My environment:
Two V100s (32 GB × 2)
CUDA 11.0
PyTorch 1.7.1

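My PyTorch version is fairly old and cannot support the quantized version, so I chose to deploy the fnlp/moss-moon-003-sft model. With fp16 precision, however, it raises the following error: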
  File "/root/anaconda3/envs/mossgpu/lib/python3.8/site-packages/torch/tensor.py", line 547, in __rpow__
    return torch.tensor(other, dtype=dtype, device=self.device) ** self
RuntimeError: "pow" not implemented for 'Half'
So I had no choice but to change it to:
raw_model = MossForCausalLM._from_config(config, torch_dtype=torch.float32)
Then I ran:
python moss_cli_demo.py --model_name fnlp/moss-moon-003-sft --gpu 0,2
which fails with the following error:
Traceback (most recent call last):
  File "moss_cli_demo.py", line 48, in <module>
    raw_model = MossForCausalLM._from_config(config, torch_dtype=torch.float32)
  File "/root/anaconda3/envs/mossgpu/lib/python3.8/site-packages/transformers/modeling_utils.py", line 1024, in _from_config
    model = cls(config, **kwargs)
  File "/data_a/keke/workspace/MOSS/models/modeling_moss.py", line 607, in __init__
    self.transformer = MossModel(config)
  File "/data_a/keke/workspace/MOSS/models/modeling_moss.py", line 401, in __init__
    self.h = nn.ModuleList([MossBlock(config) for _ in range(config.n_layer)])
  File "/data_a/keke/workspace/MOSS/models/modeling_moss.py", line 401, in <listcomp>
    self.h = nn.ModuleList([MossBlock(config) for _ in range(config.n_layer)])
  File "/data_a/keke/workspace/MOSS/models/modeling_moss.py", line 256, in __init__
    self.mlp = MossMLP(inner_dim, config)
  File "/data_a/keke/workspace/MOSS/models/modeling_moss.py", line 235, in __init__
    self.fc_in = nn.Linear(embed_dim, intermediate_size)
  File "/root/anaconda3/envs/mossgpu/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 78, in __init__
    self.weight = Parameter(torch.Tensor(out_features, in_features))
  File "/root/anaconda3/envs/mossgpu/lib/python3.8/site-packages/torch/nn/modules/module.py", line 796, in __setattr__
    self.register_parameter(name, value)
  File "/root/anaconda3/envs/mossgpu/lib/python3.8/site-packages/accelerate/big_modeling.py", line 108, in register_empty_parameter
    module._parameters[name] = param_cls(module._parameters[name].to(device), **kwargs)
RuntimeError: CUDA out of memory. Tried to allocate 576.00 MiB (GPU 0; 31.75 GiB total capacity; 30.01 GiB already allocated; 548.00 MiB free; 30.02 GiB reserved in total by PyTorch)

Does anyone know how to fix this?


lizhixi212 · 2023-08-19T15:22:17Z

RuntimeError: CUDA out of memory. Tried to allocate 576.00 MiB (GPU 0; 31.75 GiB total capacity; 30.01 GiB already allocated; 548.00 MiB free; 30.02 GiB reserved in total by PyTorch)
You ran out of GPU memory.
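
A back-of-the-envelope estimate suggests why the float32 load is unlikely to fit. The sketch below assumes roughly 16 billion parameters for the model (an approximate figure, not stated in this thread) and counts only weight storage, ignoring activations, the CUDA context, and allocator overhead. `weight_memory_gib` is a hypothetical helper written for illustration; the 31.75 GiB per-card capacity is taken from the error message.

```python
# Rough estimate of the memory needed just to hold the model weights.
# ASSUMPTION: about 16 billion parameters; adjust NUM_PARAMS if that is wrong.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

NUM_PARAMS = 16e9          # assumed parameter count (approximate)
GIB_PER_GPU = 31.75        # capacity reported in the CUDA OOM message
NUM_GPUS = 2


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the GiB needed to store `num_params` weights in `dtype`."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    total = GIB_PER_GPU * NUM_GPUS
    for dtype in ("float32", "float16"):
        need = weight_memory_gib(NUM_PARAMS, dtype)
        print(f"{dtype}: ~{need:.1f} GiB of weights "
              f"vs {GIB_PER_GPU} GiB per GPU, {total} GiB in total")
```

Under these assumptions the float32 weights alone take close to the combined capacity of both cards and far more than a single card, while float16 halves the requirement, which is consistent with fp16 being the setting that eventually worked.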

Cocoalate · 2023-08-28T07:13:22Z

RuntimeError: CUDA out of memory. Tried to allocate 576.00 MiB (GPU 0; 31.75 GiB total capacity; 30.01 GiB already allocated; 548.00 MiB free; 30.02 GiB reserved in total by PyTorch) You ran out of GPU memory.

Thanks, I have it working now. I ended up using fp16 after all.
