I get this error constantly when training a new model with a batch size of 3 or more.
Initially I started training with the default batch size of 4, but it crashed very quickly.
When the problem occurs, my screen goes black for a couple of seconds and then the cmd window prints the error:
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
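For what it's worth, my understanding of the CUDA_LAUNCH_BLOCKING hint is that the variable has to be set before the process initializes CUDA, so I would either run `set CUDA_LAUNCH_BLOCKING=1` in the same cmd window before launching, or put it at the very top of the entry script, roughly like this (just a sketch based on the error message, not code from RVC):

# Sketch: force synchronous kernel launches so the Python stack trace points at
# the kernel that actually faulted, instead of a later, unrelated API call.
# Must run before torch touches the GPU (i.e. before the first .cuda()/.to("cuda")).
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported only after the environment variable is set

The traceback below may therefore point at the wrong call, as the message itself warns.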
I also noticed that training crashed when left running for a long time (with the batch size set to 4), and simply moving the mouse a bit was enough to make it error out instantly.
I think this time it might have crashed because I was using the laptop in the background: simple chatting, monitoring the TensorBoard graphs, and watching RVC tutorials on YouTube. However, it is not an out-of-memory error like the ones others have previously reported.
My laptop hardware:
CPU: Intel Core i7-11800H
RAM: 32 GB
GPU: RTX 3070 8GB
Dataset length (single file): 53 minutes
Dataset format: WAV
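Going back to the "not an out-of-memory error" point above: I am going by the error text itself, which reports an illegal memory access rather than the usual "CUDA out of memory" message. If it helps, this is roughly how I would watch VRAM headroom while training runs (my own sketch, not something from the RVC code):

# Sketch: print free vs. total VRAM, plus what torch itself has allocated.
# torch.cuda.mem_get_info returns (free_bytes, total_bytes) for the device.
import torch

free_b, total_b = torch.cuda.mem_get_info(0)
alloc_b = torch.cuda.memory_allocated(0)
print(f"VRAM free {free_b / 1024**3:.2f} GiB / {total_b / 1024**3:.2f} GiB, "
      f"allocated by torch {alloc_b / 1024**3:.2f} GiB")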
The log:
write filelist done
use gpus: 0
runtime\python.exe train_nsf_sim_cache_sid_load_pretrain.py -e Den4ikSeekersV3 -sr 40k -f0 1 -bs 3 -g 0 -te 50 -se 5 -pg pretrained_v2/f0G40k.pth -pd pretrained_v2/f0D40k.pth -l 1 -c 0 -sw 1 -v v2 -li 279
INFO:Den4ikSeekersV3:{'train': {'log_interval': 279, 'seed': 1234, 'epochs': 20000, 'learning_rate': 0.0001, 'betas': [0.8, 0.99], 'eps': 1e-09, 'batch_size': 3, 'fp16_run': True, 'lr_decay': 0.999875, 'segment_size': 12800, 'init_lr_ratio': 1, 'warmup_epochs': 0, 'c_mel': 45, 'c_kl': 1.0}, 'data': {'max_wav_value': 32768.0, 'sampling_rate': 40000, 'filter_length': 2048, 'hop_length': 400, 'win_length': 2048, 'n_mel_channels': 125, 'mel_fmin': 0.0, 'mel_fmax': None, 'training_files': './logs\\Den4ikSeekersV3/filelist.txt'}, 'model': {'inter_channels': 192, 'hidden_channels': 192, 'filter_channels': 768, 'n_heads': 2, 'n_layers': 6, 'kernel_size': 3, 'p_dropout': 0, 'resblock': '1', 'resblock_kernel_sizes': [3, 7, 11], 'resblock_dilation_sizes': [[1,3, 5], [1, 3, 5], [1, 3, 5]], 'upsample_rates': [10, 10, 2, 2], 'upsample_initial_channel': 512, 'upsample_kernel_sizes': [16, 16, 4, 4], 'use_spectral_norm': False, 'gin_channels': 256, 'spk_embed_dim': 109}, 'model_dir': './logs\\Den4ikSeekersV3', 'experiment_dir': './logs\\Den4ikSeekersV3', 'save_every_epoch': 5, 'name': 'Den4ikSeekersV3', 'total_epoch': 50, 'pretrainG': 'pretrained_v2/f0G40k.pth', 'pretrainD': 'pretrained_v2/f0D40k.pth', 'version': 'v2', 'gpus': '0', 'sample_rate': '40k', 'if_f0': 1, 'if_latest': 1, 'save_every_weights': '1', 'if_cache_data_in_gpu': 0}
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
gin_channels: 256 self.spk_embed_dim: 109
INFO:Den4ikSeekersV3:loaded pretrained pretrained_v2/f0G40k.pth
<All keys matched successfully>
INFO:Den4ikSeekersV3:loaded pretrained pretrained_v2/f0D40k.pth
<All keys matched successfully>
D:\INSTALLS\RVC\runtime\lib\site-packages\torch\functional.py:641: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\SpectralOps.cpp:867.)
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
D:\INSTALLS\RVC\runtime\lib\site-packages\torch\functional.py:641: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\SpectralOps.cpp:867.)
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
D:\INSTALLS\RVC\runtime\lib\site-packages\torch\functional.py:641: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\SpectralOps.cpp:867.)
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
D:\INSTALLS\RVC\runtime\lib\site-packages\torch\functional.py:641: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\SpectralOps.cpp:867.)
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
D:\INSTALLS\RVC\runtime\lib\site-packages\torch\functional.py:641: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\SpectralOps.cpp:867.)
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
D:\INSTALLS\RVC\runtime\lib\site-packages\torch\autograd\__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [64, 1, 4], strides() = [4, 1, 1]
bucket_view.sizes() = [64, 1, 4], strides() = [4, 4, 1] (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\reducer.cpp:337.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
INFO:Den4ikSeekersV3:Train Epoch: 1 [0%]
INFO:Den4ikSeekersV3:[0, 0.0001]
INFO:Den4ikSeekersV3:loss_disc=3.713, loss_gen=3.484, loss_fm=11.058,loss_mel=25.817, loss_kl=5.464
DEBUG:matplotlib:matplotlib data path: D:\INSTALLS\RVC\runtime\lib\site-packages\matplotlib\mpl-data
DEBUG:matplotlib:CONFIGDIR=C:\Users\Omem\.matplotlib
DEBUG:matplotlib:interactive is False
DEBUG:matplotlib:platform is win32
INFO:torch.nn.parallel.distributed:Reducer buckets have been rebuilt in this iteration.
max value is tensor(1.0843)
INFO:Den4ikSeekersV3:====> Epoch: 1 [2024-03-12 18:49:07] | (0:02:22.698938)
INFO:Den4ikSeekersV3:Train Epoch: 2 [2%]
INFO:Den4ikSeekersV3:[279, 9.99875e-05]
INFO:Den4ikSeekersV3:loss_disc=4.333, loss_gen=3.062, loss_fm=5.908,loss_mel=18.143, loss_kl=1.504
INFO:Den4ikSeekersV3:====> Epoch: 2 [2024-03-12 18:51:24] | (0:02:17.632440)
INFO:Den4ikSeekersV3:Train Epoch: 3 [4%]
INFO:Den4ikSeekersV3:[558, 9.99750015625e-05]
INFO:Den4ikSeekersV3:loss_disc=3.556, loss_gen=3.042, loss_fm=9.422,loss_mel=20.361, loss_kl=1.660
INFO:Den4ikSeekersV3:====> Epoch: 3 [2024-03-12 18:53:37] | (0:02:13.120967)
INFO:Den4ikSeekersV3:Train Epoch: 4 [5%]
INFO:Den4ikSeekersV3:[837, 9.996250468730469e-05]
INFO:Den4ikSeekersV3:loss_disc=4.042, loss_gen=3.086, loss_fm=10.286,loss_mel=17.957, loss_kl=1.465
INFO:Den4ikSeekersV3:====> Epoch: 4 [2024-03-12 18:55:53] | (0:02:15.803684)
INFO:Den4ikSeekersV3:Train Epoch: 5 [7%]
INFO:Den4ikSeekersV3:[1116, 9.995000937421877e-05]
INFO:Den4ikSeekersV3:loss_disc=4.149, loss_gen=2.704, loss_fm=9.035,loss_mel=18.226, loss_kl=1.469
INFO:Den4ikSeekersV3:Saving model and optimizer state at epoch 5 to ./logs\Den4ikSeekersV3\G_2333333.pth
INFO:Den4ikSeekersV3:Saving model and optimizer state at epoch 5 to ./logs\Den4ikSeekersV3\D_2333333.pth
INFO:Den4ikSeekersV3:saving ckpt Den4ikSeekersV3_e5:Success.
INFO:Den4ikSeekersV3:====> Epoch: 5 [2024-03-12 18:58:08] | (0:02:14.966068)
INFO:Den4ikSeekersV3:Train Epoch: 6 [9%]
INFO:Den4ikSeekersV3:[1395, 9.993751562304699e-05]
INFO:Den4ikSeekersV3:loss_disc=4.231, loss_gen=3.099, loss_fm=10.232,loss_mel=19.459, loss_kl=1.666
INFO:Den4ikSeekersV3:====> Epoch: 6 [2024-03-12 19:00:21] | (0:02:12.955729)
INFO:Den4ikSeekersV3:Train Epoch: 7 [11%]
INFO:Den4ikSeekersV3:[1674, 9.99250234335941e-05]
INFO:Den4ikSeekersV3:loss_disc=4.177, loss_gen=3.267, loss_fm=10.140,loss_mel=19.995, loss_kl=1.684
INFO:Den4ikSeekersV3:====> Epoch: 7 [2024-03-12 19:02:31] | (0:02:09.405485)
INFO:Den4ikSeekersV3:Train Epoch: 8 [13%]
INFO:Den4ikSeekersV3:[1953, 9.991253280566489e-05]
INFO:Den4ikSeekersV3:loss_disc=3.863, loss_gen=2.904, loss_fm=12.747,loss_mel=21.048, loss_kl=1.518
Process Process-1:
Traceback (most recent call last):
File "multiprocessing\process.py", line 315, in _bootstrap
File "multiprocessing\process.py", line 108, in run
File "D:\INSTALLS\RVC\train_nsf_sim_cache_sid_load_pretrain.py", line 225, in run
train_and_evaluate(
File "D:\INSTALLS\RVC\train_nsf_sim_cache_sid_load_pretrain.py", line 461, in train_and_evaluate
scaler.step(optim_g)
File "D:\INSTALLS\RVC\runtime\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 370, in step
retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
File "D:\INSTALLS\RVC\runtime\lib\site-packages\torch\cuda\amp\grad_scaler.py", line 290, in _maybe_opt_step
retval = optimizer.step(*args, **kwargs)
File "D:\INSTALLS\RVC\runtime\lib\site-packages\torch\optim\lr_scheduler.py", line 69, in wrapper
return wrapped(*args, **kwargs)
File "D:\INSTALLS\RVC\runtime\lib\site-packages\torch\optim\optimizer.py", line 280, in wrapper
out = func(*args, **kwargs)
File "D:\INSTALLS\RVC\runtime\lib\site-packages\torch\optim\optimizer.py", line 33, in _use_grad
ret = func(self, *args, **kwargs)
File "D:\INSTALLS\RVC\runtime\lib\site-packages\torch\optim\adamw.py", line 171, in step
adamw(
File "D:\INSTALLS\RVC\runtime\lib\site-packages\torch\optim\adamw.py", line 321, in adamw
func(
File "D:\INSTALLS\RVC\runtime\lib\site-packages\torch\optim\adamw.py", line 564, in _multi_tensor_adamw
exp_avg_sq_sqrt = torch._foreach_sqrt(device_exp_avg_sqs)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
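From the traceback, the failure is raised inside scaler.step(optim_g), i.e. during the mixed-precision optimizer step (fp16_run is true in the config above), down in AdamW's torch._foreach_sqrt on the GPU. For anyone skimming, my rough understanding of the step that dies looks like this (a simplified, self-contained sketch with a dummy model and loss, not the actual code from train_nsf_sim_cache_sid_load_pretrain.py):

# Simplified sketch of the fp16/AMP generator update that the traceback points at.
import torch

model = torch.nn.Linear(16, 16).cuda()                 # stand-in for the generator
optim_g = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.8, 0.99))
scaler = torch.cuda.amp.GradScaler(enabled=True)       # fp16_run: True in the config

x = torch.randn(3, 16, device="cuda")                  # batch size 3, like the run above
with torch.cuda.amp.autocast(enabled=True):
    loss = model(x).pow(2).mean()                      # stand-in for the generator losses

optim_g.zero_grad()
scaler.scale(loss).backward()   # scaled backward pass
scaler.unscale_(optim_g)        # unscale before any gradient clipping
scaler.step(optim_g)            # <- the illegal memory access is raised in here
scaler.update()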
I'm new to RVC training, so any help would be greatly appreciated! Thanks :3