Flux LoRA training crash randomly - Segmentation fault (core dumped)
#3082
How much RAM (physical + swap) is being used while training? A segfault might indicate a lack of RAM.
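For reference, memory pressure during a run can be watched from a second terminal; a minimal sketch using psutil (already in the pip list below; the 5-second interval is arbitrary):

# Sketch: print RAM and swap usage every 5 seconds while training runs (stop with Ctrl-C).
import time
import psutil

while True:
    vm = psutil.virtual_memory()
    sw = psutil.swap_memory()
    print(f"RAM {vm.used / 2**30:.1f}/{vm.total / 2**30:.1f} GiB ({vm.percent}%), "
          f"swap {sw.used / 2**30:.1f} GiB ({sw.percent}%)")
    time.sleep(5)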
Well, I've got 128 GB of RAM and an equal amount of swap; it's a custom-built PC.
Edit: new traceback (this time trying DreamBooth): Traceback (most recent call last):
Are you using a recent Intel CPU, specifically 13th or 14th generation? Those chips are known to have instability issues if the BIOS isn't up to date. At least one user here has run into random crashes with their kohya training.
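As a quick sanity check, the CPU model and the microcode revision the kernel is actually running can be read from /proc/cpuinfo; a minimal sketch (Linux only), useful for confirming that a BIOS update really shipped newer microcode:

# Sketch: print CPU model and microcode revision from /proc/cpuinfo (Linux only).
with open("/proc/cpuinfo") as f:
    cpuinfo = f.read()

for field in ("model name", "microcode"):
    for line in cpuinfo.splitlines():
        if line.startswith(field):
            print(line)
            break  # all cores normally report the same value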
Yeah, I've read this comment before opening this issue. I've got an i9-13900 and the BIOS is up to date. In the meantime, I've tried the same installation pipeline on RunPod and it works fine.
root@1f3693863cca:/workspace# pip list
Package Version
--------------------------------- ---------------
anyio 4.2.0
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
arrow 1.3.0
asttokens 2.4.1
async-lru 2.0.4
attrs 23.2.0
Babel 2.14.0
beautifulsoup4 4.12.3
bleach 6.1.0
blinker 1.4
certifi 2024.2.2
cffi 1.16.0
charset-normalizer 3.3.2
comm 0.2.1
cryptography 3.4.8
dbus-python 1.2.18
debugpy 1.8.0
decorator 5.1.1
defusedxml 0.7.1
distro 1.7.0
entrypoints 0.4
exceptiongroup 1.2.0
executing 2.0.1
fastjsonschema 2.19.1
filelock 3.13.1
fqdn 1.5.1
fsspec 2024.2.0
h11 0.14.0
httpcore 1.0.2
httplib2 0.20.2
httpx 0.26.0
idna 3.6
importlib-metadata 4.6.4
ipykernel 6.29.0
ipython 8.21.0
ipython-genutils 0.2.0
ipywidgets 8.1.1
isoduration 20.11.0
jedi 0.19.1
jeepney 0.7.1
Jinja2 3.1.3
json5 0.9.14
jsonpointer 2.4
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
jupyter-archive 3.4.0
jupyter_client 7.4.9
jupyter_contrib_core 0.4.2
jupyter_contrib_nbextensions 0.7.0
jupyter_core 5.7.1
jupyter-events 0.9.0
jupyter-highlight-selected-word 0.2.0
jupyter-lsp 2.2.2
jupyter-nbextensions-configurator 0.6.3
jupyter_server 2.12.5
jupyter_server_terminals 0.5.2
jupyterlab 4.1.0
jupyterlab_pygments 0.3.0
jupyterlab_server 2.25.2
jupyterlab-widgets 3.0.9
keyring 23.5.0
launchpadlib 1.10.16
lazr.restfulclient 0.14.4
lazr.uri 1.0.6
lxml 5.1.0
MarkupSafe 2.1.5
matplotlib-inline 0.1.6
mistune 3.0.2
more-itertools 8.10.0
mpmath 1.3.0
nbclassic 1.0.0
nbclient 0.9.0
nbconvert 7.14.2
nbformat 5.9.2
nest-asyncio 1.6.0
networkx 3.2.1
notebook 6.5.5
notebook_shim 0.2.3
numpy 1.26.3
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.19.3
nvidia-nvjitlink-cu12 12.3.101
nvidia-nvtx-cu12 12.1.105
oauthlib 3.2.0
overrides 7.7.0
packaging 23.2
pandocfilters 1.5.1
parso 0.8.3
pexpect 4.9.0
pillow 10.2.0
pip 24.0
platformdirs 4.2.0
prometheus-client 0.19.0
prompt-toolkit 3.0.43
psutil 5.9.8
ptyprocess 0.7.0
pure-eval 0.2.2
pycparser 2.21
Pygments 2.17.2
PyGObject 3.42.1
PyJWT 2.3.0
pyparsing 2.4.7
python-apt 2.4.0+ubuntu2
python-dateutil 2.8.2
python-json-logger 2.0.7
PyYAML 6.0.1
pyzmq 24.0.1
referencing 0.33.0
requests 2.31.0
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rpds-py 0.17.1
SecretStorage 3.3.1
Send2Trash 1.8.2
setuptools 69.0.3
six 1.16.0
sniffio 1.3.0
soupsieve 2.5
stack-data 0.6.3
sympy 1.12
terminado 0.18.0
tinycss2 1.2.1
tomli 2.0.1
torch 2.2.0
torchaudio 2.2.0
torchvision 0.17.0
tornado 6.4
traitlets 5.14.1
triton 2.2.0
types-python-dateutil 2.8.19.20240106
typing_extensions 4.9.0
uri-template 1.3.0
urllib3 2.2.0
wadllib 1.3.6
wcwidth 0.2.13
webcolors 1.13
webencodings 0.5.1
websocket-client 1.7.0
wheel 0.42.0
widgetsnbextension 4.0.9
zipp 1.0.0
Somehow, this layer addon stabilizes the training process a little: now 50%–70% of the training runs reach the end, whereas before none of them did. I'll look into updating the BIOS.
Hello, I'm experiencing random crashes while training a LoRA on FLUX.
The crash occurs randomly during training, and I never manage to complete a run. The output is usually just:
Segmentation fault (core dumped)
Sometimes the error is different, which makes it a real pain!
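One way to get at least a Python-level traceback out of a hard segfault is the standard-library faulthandler module; as a minimal sketch, enable it at the very top of the training entry point (or set PYTHONFAULTHANDLER=1 in the environment, which does the same without editing the script):

# Sketch: dump the Python traceback of every thread when a fatal signal
# (SIGSEGV, SIGABRT, ...) arrives; output goes to stderr.
import faulthandler
faulthandler.enable()

The dump only shows where Python was when native code crashed, but that is often enough to tell whether the fault comes from the data loader, the optimizer step, or checkpoint saving.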
I'm working on an Arch Linux machine with two 4090s, but I only train on one of them, GPU 0.
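Since only GPU 0 is used, it may be worth hiding the second card from the process entirely so the driver never touches it; a sketch (the variable must be set before CUDA is initialized):

# Sketch: expose only GPU 0 to this process; must run before the first CUDA call.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
assert torch.cuda.device_count() == 1  # only GPU 0 is visible now

The same effect can be had by exporting CUDA_VISIBLE_DEVICES=0 in the shell before launching the script.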
Here is my nvidia-smi output:
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    752900      C   python                                      14164MiB |
+-----------------------------------------------------------------------------------------+
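To rule out VRAM exhaustion near the crash point, GPU memory can be logged continuously alongside the run; a minimal sketch that shells out to nvidia-smi's query mode (the 1-second interval and the log file name are arbitrary):

# Sketch: append per-GPU memory usage to gpu_mem.log once per second (stop with Ctrl-C).
import subprocess
import time

with open("gpu_mem.log", "a") as log:
    while True:
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=timestamp,index,memory.used,memory.total",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
        log.write(out)
        log.flush()
        time.sleep(1)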
Here is my .toml file, and I'm launching from sd-scripts because it crashes even with the GUI:
The .json file:
The latest crash is this:
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 107/107 [00:00<00:00, 4487905.28it/s]
INFO prepare dataset train_util.py:948
INFO preparing accelerator train_network.py:373
accelerator device: cuda
INFO Checking the state dict: Diffusers or BFL, dev or schnell flux_utils.py:43
INFO Building Flux model dev from BFL checkpoint flux_utils.py:101
INFO Loading state dict from /home/ste/projects/flux_kohya_02_11/kohya_ss/models/flux1-dev.safetensors flux_utils.py:118
INFO Loaded Flux: flux_utils.py:137
INFO Building CLIP-L flux_utils.py:163
INFO Loading state dict from /home/ste/projects/flux_kohya_02_11/kohya_ss/models/clip_l.safetensors flux_utils.py:259
INFO Loaded CLIP-L: flux_utils.py:262
INFO Loading state dict from /home/ste/projects/flux_kohya_02_11/kohya_ss/models/t5xxl_fp16.safetensors flux_utils.py:314
INFO Loaded T5xxl: flux_utils.py:317
INFO Building AutoEncoder flux_utils.py:144
INFO Loading state dict from /home/ste/projects/flux_kohya_02_11/kohya_ss/models/ae.safetensors flux_utils.py:149
INFO Loaded AE: flux_utils.py:152
import network module: networks.lora_flux
INFO [Dataset 0] train_util.py:2493
INFO caching latents with caching strategy. train_util.py:1048
INFO caching latents... train_util.py:1097
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 107/107 [00:00<00:00, 25678.92it/s]
INFO move vae and unet to cpu to save memory flux_train_network.py:205
INFO move text encoders to gpu flux_train_network.py:213
2025-02-12 17:16:04 INFO [Dataset 0] train_util.py:2515
INFO caching Text Encoder outputs with caching strategy. train_util.py:1231
INFO checking cache validity... train_util.py:1242
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 107/107 [00:00<00:00, 14259.99it/s]
INFO no Text Encoder outputs to cache train_util.py:1269
INFO cache Text Encoder outputs for sample prompt: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/sample/prompt.txt flux_train_network.py:229
INFO cache Text Encoder outputs for prompt: David Caruso a man in a blue shirt and black pants carrying a suitcase flux_train_network.py:240
INFO cache Text Encoder outputs for prompt: flux_train_network.py:240
INFO move CLIP-L back to cpu flux_train_network.py:251
INFO move t5XXL back to cpu flux_train_network.py:253
2025-02-12 17:16:06 INFO move vae and unet back to original device flux_train_network.py:258
INFO create LoRA network. base dim (rank): 16, alpha: 16 lora_flux.py:594
INFO neuron dropout: p=None, rank dropout: p=None, module dropout: p=None lora_flux.py:595
INFO train all blocks only lora_flux.py:605
INFO create LoRA for Text Encoder 1: lora_flux.py:741
INFO create LoRA for Text Encoder 1: 72 modules. lora_flux.py:744
2025-02-12 17:16:07 INFO create LoRA for FLUX all blocks: 304 modules. lora_flux.py:765
INFO enable LoRA for U-Net: 304 modules lora_flux.py:916
FLUX: Gradient checkpointing enabled. CPU offload: False
prepare optimizer, data loader etc.
INFO use 8-bit AdamW optimizer | {} train_util.py:4605
enable fp8 training for U-Net.
enable fp8 training for Text Encoder.
INFO set U-Net weight dtype to torch.float8_e4m3fn, device to cuda train_network.py:598
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 1605
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 1605
num epochs / epoch数: 7
batch size per device / バッチサイズ: 1
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 10000
steps: 0%| | 0/10000 [00:00<?, ?it/s]2025-02-12 17:16:19 INFO text_encoder is not needed for training. deleting to save memory. train_network.py:1067
INFO unet dtype: torch.float8_e4m3fn, device: cuda:0 train_network.py:1089
epoch 1/7
INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:715
/home/ste/projects/flux_kohya_02_11/kohya_ss/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
steps: 2%|███▌ | 250/10000 [03:30<2:16:55, 1.19it/s, avr_loss=0.438]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00000250.safetensors
/home/ste/projects/flux_kohya_02_11/kohya_ss/sd-scripts/networks/lora_flux.py:861: FutureWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/main/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
return super().state_dict(destination, prefix, keep_vars)
steps: 5%|███████▏ | 500/10000 [07:04<2:14:34, 1.18it/s, avr_loss=0.44]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00000500.safetensors
steps: 8%|██████████▊ | 750/10000 [10:39<2:11:24, 1.17it/s, avr_loss=0.44]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00000750.safetensors
steps: 10%|██████████████▏ | 1000/10000 [14:13<2:08:00, 1.17it/s, avr_loss=0.439]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00001000.safetensors
steps: 12%|█████████████████▉ | 1250/10000 [17:47<2:04:32, 1.17it/s, avr_loss=0.44]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00001250.safetensors
steps: 15%|█████████████████████▎ | 1500/10000 [21:21<2:01:03, 1.17it/s, avr_loss=0.438]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00001500.safetensors
steps: 16%|██████████████████████▊ | 1605/10000 [22:51<1:59:35, 1.17it/s, avr_loss=0.437]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-000001.safetensors
epoch 2/7
2025-02-12 17:39:11 INFO epoch is incremented. current_epoch: 1, epoch: 2 train_util.py:715
steps: 18%|████████████████████████▊ | 1750/10000 [24:56<1:57:33, 1.17it/s, avr_loss=0.438]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00001750.safetensors
steps: 20%|████████████████████████████▍ | 2000/10000 [28:30<1:54:01, 1.17it/s, avr_loss=0.438]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00002000.safetensors
steps: 22%|███████████████████████████████▉ | 2250/10000 [32:04<1:50:29, 1.17it/s, avr_loss=0.439]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00002250.safetensors
steps: 25%|███████████████████████████████████▌ | 2500/10000 [35:39<1:46:57, 1.17it/s, avr_loss=0.437]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00002500.safetensors
steps: 28%|███████████████████████████████████████ | 2750/10000 [39:13<1:43:24, 1.17it/s, avr_loss=0.436]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00002750.safetensors
steps: 30%|██████████████████████████████████████████▌ | 3000/10000 [42:47<1:39:51, 1.17it/s, avr_loss=0.437]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00003000.safetensors
steps: 32%|█████████████████████████████████████████████▌ | 3210/10000 [45:47<1:36:52, 1.17it/s, avr_loss=0.439]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-000002.safetensors
epoch 3/7
2025-02-12 18:02:07 INFO epoch is incremented. current_epoch: 2, epoch: 3 train_util.py:715
steps: 32%|██████████████████████████████████████████████▏ | 3250/10000 [46:22<1:36:18, 1.17it/s, avr_loss=0.439]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00003250.safetensors
steps: 35%|█████████████████████████████████████████████████▋ | 3500/10000 [49:56<1:32:45, 1.17it/s, avr_loss=0.436]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00003500.safetensors
steps: 38%|█████████████████████████████████████████████████████▎ | 3750/10000 [53:31<1:29:11, 1.17it/s, avr_loss=0.436]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00003750.safetensors
steps: 40%|████████████████████████████████████████████████████████▊ | 4000/10000 [57:05<1:25:38, 1.17it/s, avr_loss=0.436]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00004000.safetensors
steps: 42%|███████████████████████████████████████████████████████████▌ | 4250/10000 [1:00:39<1:22:04, 1.17it/s, avr_loss=0.436]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00004250.safetensors
steps: 45%|███████████████████████████████████████████████████████████████ | 4500/10000 [1:04:14<1:18:30, 1.17it/s, avr_loss=0.438]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00004500.safetensors
steps: 48%|██████████████████████████████████████████████████████████████████▌ | 4750/10000 [1:07:48<1:14:56, 1.17it/s, avr_loss=0.438]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00004750.safetensors
steps: 48%|███████████████████████████████████████████████████████████████████▍ | 4815/10000 [1:08:44<1:14:01, 1.17it/s, avr_loss=0.438]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-000003.safetensors
epoch 4/7
2025-02-12 18:25:04 INFO epoch is incremented. current_epoch: 3, epoch: 4 train_util.py:715
steps: 50%|██████████████████████████████████████████████████████████████████████ | 5000/10000 [1:11:23<1:11:23, 1.17it/s, avr_loss=0.439]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00005000.safetensors
steps: 52%|█████████████████████████████████████████████████████████████████████████▌ | 5250/10000 [1:14:57<1:07:49, 1.17it/s, avr_loss=0.437]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00005250.safetensors
steps: 55%|█████████████████████████████████████████████████████████████████████████████ | 5500/10000 [1:18:31<1:04:15, 1.17it/s, avr_loss=0.439]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00005500.safetensors
steps: 57%|████████████████████████████████████████████████████████████████████████████████▌ | 5750/10000 [1:22:06<1:00:41, 1.17it/s, avr_loss=0.439]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00005750.safetensors
steps: 60%|█████████████████████████████████████████████████████████████████████████████████████▏ | 6000/10000 [1:25:40<57:06, 1.17it/s, avr_loss=0.443]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00006000.safetensors
steps: 62%|█████████████████████████████████████████████████████████████████████████████████████████▍ | 6250/10000 [1:29:14<53:32, 1.17it/s, avr_loss=0.44]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00006250.safetensors
steps: 63%|██████████████████████████████████████████████████████████████████████████████████████████▍ | 6322/10000 [1:30:16<52:31, 1.17it/s, avr_loss=0.44]
Segmentation fault (core dumped)
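As an aside, the FutureWarning from torch/utils/checkpoint.py earlier in the log comes from PyTorch's own gradient-checkpointing code and is harmless, but for anyone hitting the same deprecation in their own code, the replacement spelling is (a sketch):

import torch

# Deprecated: torch.cpu.amp.autocast(args...)  -> emits the FutureWarning above.
# Current spelling:
with torch.amp.autocast("cpu"):
    pass  # autocast-eligible CPU ops run under mixed precision here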
I am on the sd3-flux1 branch and have tried to:
This is my pip list
Thanks a lot for the help!