GPU training issue #8
Hi, please try specifying a single GPU instead of running on multiple cards.
Training configuration:
GPU: [0]
VERBOSE: False
MODEL:
# Optimization arguments.
OPTIM:
  EPOCH_DECAY: [10]
  LR_INITIAL: 0.0008
  BETA1: 0.9
TRAINING:
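Following the single-GPU suggestion above, here is a minimal sketch (not the repository's own code) of forcing the run onto one card by limiting CUDA visibility before torch initializes; the GPU: [0] entry in the yaml config then refers to the only visible device.

```python
# Minimal sketch: expose only one GPU so nn.DataParallel cannot spread the
# model across multiple cards. Must run before torch initializes CUDA.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # keep consistent with GPU: [0] in the yaml

import torch
print(torch.cuda.device_count())  # expected to print 1
```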
Hi, thank you very much for your reply. After changing the torch version and then switching mamba_ssm to version 1.2.0.post1, the problem was solved. Thanks for your help.
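For reference, a small check (assumed, not from the repository) to confirm which torch and mamba_ssm builds are actually installed in the active environment:

```python
# Print the installed versions so mismatches like the one above are easy to spot.
import torch
import mamba_ssm

print("torch:", torch.__version__, "CUDA:", torch.version.cuda)
print("mamba_ssm:", mamba_ssm.__version__)  # the fix above used 1.2.0.post1
```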
During training, the loss becomes NaN after the fourth epoch. Is this normal?
The parameters in the config have already been tuned through repeated training on our side; if the lr value is too large, the loss will go to NaN.
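Beyond lowering the learning rate, a common safeguard is to skip updates when the loss is non-finite and to clip gradients. The helper below is a hypothetical sketch, not part of the WalMaFa training script:

```python
import torch

def safe_step(loss: torch.Tensor,
              model: torch.nn.Module,
              optimizer: torch.optim.Optimizer,
              max_norm: float = 1.0) -> bool:
    """Back-propagate and step only when the loss is finite.

    Returns True if an optimizer step was taken, False if the batch was skipped.
    """
    if not torch.isfinite(loss):
        optimizer.zero_grad(set_to_none=True)  # drop this batch entirely
        return False
    loss.backward()
    # Clipping keeps an aggressive LR_INITIAL from producing exploding updates.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return True
```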
Hello author, I ran into the following problem when running your training. Is there a way to solve it?
/home/amax/anaconda3/bin/conda run -n WalMaFa --no-capture-output python /data1/WalMaFa/train.py
load training yaml file: ./configs/LOL/train/training_LOL.yaml
==> Build the model
Let's use 3 GPUs!
/home/amax/anaconda3/envs/WalMaFa/lib/python3.8/site-packages/torch/optim/lr_scheduler.py:139: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
==> Loading datasets
==> Training details:
==> Training start:
0%| | 0/41 [00:04<?, ?it/s]
Traceback (most recent call last):
File "/data1/WalMaFa/train.py", line 191, in
restored = model_restored(input_)
File "/home/amax/anaconda3/envs/WalMaFa/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/amax/anaconda3/envs/WalMaFa/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/amax/anaconda3/envs/WalMaFa/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/amax/anaconda3/envs/WalMaFa/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
output.reraise()
File "/home/amax/anaconda3/envs/WalMaFa/lib/python3.8/site-packages/torch/_utils.py", line 644, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/amax/anaconda3/envs/WalMaFa/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
output = module(*input, **kwargs)
File "/home/amax/anaconda3/envs/WalMaFa/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/data1/WalMaFa/model/Walmafa.py", line 444, in forward
out_enc_level1_0 = self.decoder_level1_0(inp_enc_level1_0)
File "/home/amax/anaconda3/envs/WalMaFa/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/amax/anaconda3/envs/WalMaFa/lib/python3.8/site-packages/torch/nn/modules/container.py", line 217, in forward
input = module(input)
File "/home/amax/anaconda3/envs/WalMaFa/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/data1/WalMaFa/model/Walmafa.py", line 286, in forward
input_high = self.mb(input_high)
File "/home/amax/anaconda3/envs/WalMaFa/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/data1/WalMaFa/model/Walmafa.py", line 260, in forward
y = self.model1(x).permute(0, 2, 1)
File "/home/amax/anaconda3/envs/WalMaFa/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/amax/anaconda3/envs/WalMaFa/lib/python3.8/site-packages/mamba_ssm/modules/mamba_simple.py", line 189, in forward
y = selective_scan_fn(
File "/home/amax/anaconda3/envs/WalMaFa/lib/python3.8/site-packages/mamba_ssm/ops/selective_scan_interface.py", line 88, in selective_scan_fn
return SelectiveScanFn.apply(u, delta, A, B, C, D, z, delta_bias, delta_softplus, return_last_state)
File "/home/amax/anaconda3/envs/WalMaFa/lib/python3.8/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/amax/anaconda3/envs/WalMaFa/lib/python3.8/site-packages/mamba_ssm/ops/selective_scan_interface.py", line 42, in forward
out, x, *rest = selective_scan_cuda.fwd(u, delta, A, B, C, D, z, delta_bias, delta_softplus)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.