
Flux LoRA training crash randomly - Segmentation fault (core dumped) #3082

Open
ingTutui opened this issue Feb 12, 2025 · 5 comments


ingTutui commented Feb 12, 2025

Hello, I'm experiencing random crashes during the training of a LoRA on FLUX.
The crash occurs at a random point during training, and I never manage to complete a run. The output is mostly Segmentation fault (core dumped).
Sometimes the error is different, which makes it even more frustrating.
I'm working on an Arch Linux machine with two RTX 4090s, but I train on only one of them, GPU 0.
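In case it matters, the run can be pinned to GPU 0 with the generic CUDA environment variable (just a sketch of how I'd launch it; the config below also sets gpu_ids to "0"):

    # Restrict the process to GPU 0 only, so GPU 1 is never touched.
    export CUDA_VISIBLE_DEVICES=0
    python sd-scripts/flux_train_network.py --config_file dataset/c4rr4r4_p4tt3rn/formatted/model/config_lora-20250212-170027.toml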

Here is my nvidia-smi output:

nvidia-smi
Wed Feb 12 16:04:22 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.77                 Driver Version: 565.77         CUDA Version: 12.7      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
| 69%   83C    P0            423W /  450W |   14174MiB /  24564MiB |    100%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off |   00000000:02:00.0 Off |                  Off |
|  0%   28C    P8             20W /  450W |       4MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                               |
|  GPU   GI   CI              PID   Type   Process name                       GPU Memory  |
|        ID   ID                                                              Usage       |
|=========================================================================================|
|    0   N/A  N/A          752900      C   python                               14164MiB  |
+-----------------------------------------------------------------------------------------+
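If it helps to correlate a crash with GPU state, something like this (standard nvidia-smi query flags, nothing kohya-specific) can log temperature, power and memory alongside training:

    # Log GPU 0 state every 30 seconds to a CSV until interrupted.
    nvidia-smi --query-gpu=timestamp,temperature.gpu,power.draw,memory.used,utilization.gpu \
               --format=csv -l 30 -i 0 >> gpu_state.csv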

Here is my .toml file; I'm launching from sd-scripts directly because it crashes even with the GUI:

.toml
ae = "/home/ste/projects/flux_kohya_02_11/kohya_ss/models/ae.safetensors" bucket_reso_steps = 64 cache_latents = true cache_latents_to_disk = true cache_text_encoder_outputs = true cache_text_encoder_outputs_to_disk = true caption_extension = ".txt" clip_l = "/home/ste/projects/flux_kohya_02_11/kohya_ss/models/clip_l.safetensors" clip_skip = 1 discrete_flow_shift = 3.0 dynamo_backend = "no" epoch = 1 flip_aug = true fp8_base = true gradient_accumulation_steps = 1 gradient_checkpointing = true guidance_scale = 1.0 highvram = true huber_c = 0.1 huber_schedule = "snr" logging_dir = "/home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/log" loss_type = "l2" lr_scheduler = "linear" lr_scheduler_args = [] lr_scheduler_num_cycles = 1 lr_scheduler_power = 1 max_bucket_reso = 2048 max_data_loader_n_workers = 0 max_grad_norm = 1 max_timestep = 1000 max_train_steps = 10000 mem_eff_attn = true min_bucket_reso = 256 mixed_precision = "bf16" model_prediction_type = "raw" network_alpha = 16 network_args = [ "train_double_block_indices=all", "train_single_block_indices=all",] network_dim = 16 network_module = "networks.lora_flux" network_train_unet_only = true noise_offset_type = "Original" optimizer_args = [] optimizer_type = "AdamW8bit" output_dir = "/home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model" output_name = "flux_base_caruso" pretrained_model_name_or_path = "/home/ste/projects/flux_kohya_02_11/kohya_ss/models/flux1-dev.safetensors" prior_loss_weight = 1 resolution = "512,512" sample_prompts = "/home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/sample/prompt.txt" sample_sampler = "euler" save_every_n_epochs = 1 save_every_n_steps = 250 save_model_as = "safetensors" save_precision = "bf16" sdpa = true seed = 42 t5xxl = "/home/ste/projects/flux_kohya_02_11/kohya_ss/models/t5xxl_fp16.safetensors" t5xxl_max_token_length = 512 text_encoder_lr = [] timestep_sampling = "sigmoid" train_batch_size = 1 train_data_dir = "/home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/img" unet_lr = 0.0002 wandb_run_name = "flux_base_caruso"

The .json file:

.json
{ "LoRA_type": "Flux1", "LyCORIS_preset": "full", "adaptive_noise_scale": 0, "additional_parameters": "", "ae": "/home/ste/projects/flux_kohya_02_11/kohya_ss/models/ae.safetensors", "apply_t5_attn_mask": false, "async_upload": false, "block_alphas": "", "block_dims": "", "block_lr_zero_threshold": "", "bucket_no_upscale": false, "bucket_reso_steps": 64, "bypass_mode": false, "cache_latents": true, "cache_latents_to_disk": true, "caption_dropout_every_n_epochs": 0, "caption_dropout_rate": 0, "caption_extension": ".txt", "clip_l": "/home/ste/projects/flux_kohya_02_11/kohya_ss/models/clip_l.safetensors", "clip_skip": 1, "color_aug": false, "constrain": 0, "conv_alpha": 1, "conv_block_alphas": "", "conv_block_dims": "", "conv_dim": 1, "cpu_offload_checkpointing": false, "dataset_config": "", "debiased_estimation_loss": false, "decompose_both": false, "dim_from_weights": false, "discrete_flow_shift": 3, "dora_wd": false, "down_lr_weight": "", "dynamo_backend": "no", "dynamo_mode": "default", "dynamo_use_dynamic": false, "dynamo_use_fullgraph": false, "enable_all_linear": false, "enable_bucket": false, "epoch": 1, "extra_accelerate_launch_args": "", "factor": -1, "flip_aug": true, "flux1_cache_text_encoder_outputs": true, "flux1_cache_text_encoder_outputs_to_disk": true, "flux1_checkbox": true, "fp8_base": true, "fp8_base_unet": false, "full_bf16": false, "full_fp16": false, "gpu_ids": "0", "gradient_accumulation_steps": 1, "gradient_checkpointing": true, "guidance_scale": 1, "highvram": true, "huber_c": 0.1, "huber_schedule": "snr", "huggingface_path_in_repo": "", "huggingface_repo_id": "", "huggingface_repo_type": "", "huggingface_repo_visibility": "", "huggingface_token": "", "img_attn_dim": "", "img_mlp_dim": "", "img_mod_dim": "", "in_dims": "", "ip_noise_gamma": 0, "ip_noise_gamma_random_strength": false, "keep_tokens": 0, "learning_rate": 0.0002, "log_config": false, "log_tracker_config": "", "log_tracker_name": "", "log_with": "", "logging_dir": "/home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/log", "loraplus_lr_ratio": 0, "loraplus_text_encoder_lr_ratio": 0, "loraplus_unet_lr_ratio": 0, "loss_type": "l2", "lowvram": false, "lr_scheduler": "linear", "lr_scheduler_args": "", "lr_scheduler_num_cycles": 1, "lr_scheduler_power": 1, "lr_scheduler_type": "", "lr_warmup": 0, "lr_warmup_steps": 0, "main_process_port": 0, "masked_loss": false, "max_bucket_reso": 2048, "max_data_loader_n_workers": 0, "max_grad_norm": 1, "max_resolution": "512,512", "max_timestep": 1000, "max_token_length": 75, "max_train_epochs": 0, "max_train_steps": 10000, "mem_eff_attn": true, "mem_eff_save": false, "metadata_author": "", "metadata_description": "", "metadata_license": "", "metadata_tags": "", "metadata_title": "", "mid_lr_weight": "", "min_bucket_reso": 256, "min_snr_gamma": 0, "min_timestep": 0, "mixed_precision": "bf16", "model_list": "custom", "model_prediction_type": "raw", "module_dropout": 0, "multi_gpu": false, "multires_noise_discount": 0.3, "multires_noise_iterations": 0, "network_alpha": 16, "network_dim": 16, "network_dropout": 0, "network_weights": "", "noise_offset": 0, "noise_offset_random_strength": false, "noise_offset_type": "Original", "num_cpu_threads_per_process": 1, "num_machines": 1, "num_processes": 1, "optimizer": "AdamW8bit", "optimizer_args": "", "output_dir": "/home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model", "output_name": "flux_base_caruso", "persistent_data_loader_workers": false, "pretrained_model_name_or_path": 
"/home/ste/projects/flux_kohya_02_11/kohya_ss/models/flux1-dev.safetensors", "prior_loss_weight": 1, "random_crop": false, "rank_dropout": 0, "rank_dropout_scale": false, "reg_data_dir": "", "rescaled": false, "resume": "", "resume_from_huggingface": "", "sample_every_n_epochs": 0, "sample_every_n_steps": 0, "sample_prompts": "David Caruso a man in a blue shirt and black pants carrying a suitcase --w 832 --h 1216 --s 20 --l 4 --d 42", "sample_sampler": "euler", "save_every_n_epochs": 1, "save_every_n_steps": 250, "save_last_n_epochs": 0, "save_last_n_epochs_state": 0, "save_last_n_steps": 0, "save_last_n_steps_state": 0, "save_model_as": "safetensors", "save_precision": "bf16", "save_state": false, "save_state_on_train_end": false, "save_state_to_huggingface": false, "scale_v_pred_loss_like_noise_pred": false, "scale_weight_norms": 0, "sdxl": false, "sdxl_cache_text_encoder_outputs": true, "sdxl_no_half_vae": true, "seed": 42, "shuffle_caption": false, "single_dim": "", "single_mod_dim": "", "skip_cache_check": false, "split_mode": false, "split_qkv": false, "stop_text_encoder_training": 0, "t5xxl": "/home/ste/projects/flux_kohya_02_11/kohya_ss/models/t5xxl_fp16.safetensors", "t5xxl_lr": 0, "t5xxl_max_token_length": 512, "text_encoder_lr": 0, "timestep_sampling": "sigmoid", "train_batch_size": 1, "train_blocks": "all", "train_data_dir": "/home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/img", "train_double_block_indices": "all", "train_norm": false, "train_on_input": true, "train_single_block_indices": "all", "train_t5xxl": false, "training_comment": "", "txt_attn_dim": "", "txt_mlp_dim": "", "txt_mod_dim": "", "unet_lr": 0.0002, "unit": 1, "up_lr_weight": "", "use_cp": false, "use_scalar": false, "use_tucker": false, "v2": false, "v_parameterization": false, "v_pred_like_loss": 0, "vae": "", "vae_batch_size": 0, "wandb_api_key": "", "wandb_run_name": "", "weighted_captions": false, "xformers": "sdpa" }

The latest crash is this:

crash
(venv) [ste@SHML1-ALNX-MO kohya_ss]$ python sd-scripts/flux_train_network.py --config_file dataset/c4rr4r4_p4tt3rn/formatted/model/config_lora-20250212-170027.toml
/home/ste/projects/flux_kohya_02_11/kohya_ss/venv/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  torch.utils._pytree._register_pytree_node(
2025-02-12 17:16:01.455243: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-02-12 17:16:01.474055: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2025-02-12 17:16:01.474073: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2025-02-12 17:16:01.474632: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-02-12 17:16:01.477524: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2025-02-12 17:16:01.851990: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/ste/projects/flux_kohya_02_11/kohya_ss/venv/lib/python3.10/site-packages/diffusers/utils/outputs.py:63: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
  torch.utils._pytree._register_pytree_node(
2025-02-12 17:16:02 INFO     Loading settings from dataset/c4rr4r4_p4tt3rn/formatted/model/config_lora-20250212-170027.toml...                                                                  train_util.py:4451
                    INFO     dataset/c4rr4r4_p4tt3rn/formatted/model/config_lora-20250212-170027                                                                                               train_util.py:4470
                    INFO     highvram is enabled / highvramが有効です                                                                                                                            train_util.py:4122
2025-02-12 17:16:02 INFO     Checking the state dict: Diffusers or BFL, dev or schnell                                                                                                            flux_utils.py:43
                    INFO     t5xxl_max_token_length: 512                                                                                                                                 flux_train_network.py:152
/home/ste/projects/flux_kohya_02_11/kohya_ss/venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
You are using the default legacy behaviour of the . This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
2025-02-12 17:16:03 INFO     Using DreamBooth method.                                                                                                                                          train_network.py:319
                    INFO     prepare images.                                                                                                                                                    train_util.py:1969
                    INFO     get image size from name of cache files                                                                                                                            train_util.py:1886
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 107/107 [00:00<00:00, 476929.36it/s]
                    INFO     set image size from cache files: 107/107                                                                                                                           train_util.py:1914
                    INFO     found directory /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/img/15_c4rr4r4 texture contains 107 image files                     train_util.py:1916
read caption: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 107/107 [00:00<00:00, 124146.76it/s]
                    INFO     1605 train images with repeating.                                                                                                                                  train_util.py:2010
                    INFO     0 reg images.                                                                                                                                                      train_util.py:2013
                    WARNING  no regularization images / 正則化画像が見つかりませんでした                                                                                                          train_util.py:2018
                    INFO     [Dataset 0]                                                                                                                                                       config_util.py:567
                               batch_size: 1
                               resolution: (512, 512)
                               enable_bucket: False
                               network_multiplier: 1.0
                           [Subset 0 of Dataset 0]                                                                                                                                                           
                             image_dir: "/home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/img/15_c4rr4r4 texture"                                                              
                             image_count: 107                                                                                                                                                                
                             num_repeats: 15                                                                                                                                                                 
                             shuffle_caption: False                                                                                                                                                          
                             keep_tokens: 0                                                                                                                                                                  
                             keep_tokens_separator:                                                                                                                                                          
                             caption_separator: ,                                                                                                                                                            
                             secondary_separator: None                                                                                                                                                       
                             enable_wildcard: False                                                                                                                                                          
                             caption_dropout_rate: 0.0                                                                                                                                                       
                             caption_dropout_every_n_epochs: 0                                                                                                                                               
                             caption_tag_dropout_rate: 0.0                                                                                                                                                   
                             caption_prefix: None                                                                                                                                                            
                             caption_suffix: None                                                                                                                                                            
                             color_aug: False                                                                                                                                                                
                             flip_aug: True                                                                                                                                                                  
                             face_crop_aug_range: None                                                                                                                                                       
                             random_crop: False                                                                                                                                                              
                             token_warmup_min: 1                                                                                                                                                             
                             token_warmup_step: 0                                                                                                                                                            
                             alpha_mask: False                                                                                                                                                               
                             custom_attributes: {}                                                                                                                                                           
                             is_reg: False                                                                                                                                                                   
                             class_tokens: c4rr4r4 texture                                                                                                                                                   
                             caption_extension: .txt                                                                                                                                                         
                                                                                                                                                                                                             
                                                                                                                                                                                                             
                INFO     [Dataset 0]                                                                                                                                                       config_util.py:573
                INFO     loading image sizes.                                                                                                                                               train_util.py:923

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 107/107 [00:00<00:00, 4487905.28it/s]
INFO prepare dataset train_util.py:948
INFO preparing accelerator train_network.py:373
accelerator device: cuda
INFO Checking the state dict: Diffusers or BFL, dev or schnell flux_utils.py:43
INFO Building Flux model dev from BFL checkpoint flux_utils.py:101
INFO Loading state dict from /home/ste/projects/flux_kohya_02_11/kohya_ss/models/flux1-dev.safetensors flux_utils.py:118
INFO Loaded Flux: flux_utils.py:137
INFO Building CLIP-L flux_utils.py:163
INFO Loading state dict from /home/ste/projects/flux_kohya_02_11/kohya_ss/models/clip_l.safetensors flux_utils.py:259
INFO Loaded CLIP-L: flux_utils.py:262
INFO Loading state dict from /home/ste/projects/flux_kohya_02_11/kohya_ss/models/t5xxl_fp16.safetensors flux_utils.py:314
INFO Loaded T5xxl: flux_utils.py:317
INFO Building AutoEncoder flux_utils.py:144
INFO Loading state dict from /home/ste/projects/flux_kohya_02_11/kohya_ss/models/ae.safetensors flux_utils.py:149
INFO Loaded AE: flux_utils.py:152
import network module: networks.lora_flux
INFO [Dataset 0] train_util.py:2493
INFO caching latents with caching strategy. train_util.py:1048
INFO caching latents... train_util.py:1097
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 107/107 [00:00<00:00, 25678.92it/s]
INFO move vae and unet to cpu to save memory flux_train_network.py:205
INFO move text encoders to gpu flux_train_network.py:213
2025-02-12 17:16:04 INFO [Dataset 0] train_util.py:2515
INFO caching Text Encoder outputs with caching strategy. train_util.py:1231
INFO checking cache validity... train_util.py:1242
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 107/107 [00:00<00:00, 14259.99it/s]
INFO no Text Encoder outputs to cache train_util.py:1269
INFO cache Text Encoder outputs for sample prompt: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/sample/prompt.txt flux_train_network.py:229
INFO cache Text Encoder outputs for prompt: David Caruso a man in a blue shirt and black pants carrying a suitcase flux_train_network.py:240
INFO cache Text Encoder outputs for prompt: flux_train_network.py:240
INFO move CLIP-L back to cpu flux_train_network.py:251
INFO move t5XXL back to cpu flux_train_network.py:253
2025-02-12 17:16:06 INFO move vae and unet back to original device flux_train_network.py:258
INFO create LoRA network. base dim (rank): 16, alpha: 16 lora_flux.py:594
INFO neuron dropout: p=None, rank dropout: p=None, module dropout: p=None lora_flux.py:595
INFO train all blocks only lora_flux.py:605
INFO create LoRA for Text Encoder 1: lora_flux.py:741
INFO create LoRA for Text Encoder 1: 72 modules. lora_flux.py:744
2025-02-12 17:16:07 INFO create LoRA for FLUX all blocks: 304 modules. lora_flux.py:765
INFO enable LoRA for U-Net: 304 modules lora_flux.py:916
FLUX: Gradient checkpointing enabled. CPU offload: False
prepare optimizer, data loader etc.
INFO use 8-bit AdamW optimizer | {} train_util.py:4605
enable fp8 training for U-Net.
enable fp8 training for Text Encoder.
INFO set U-Net weight dtype to torch.float8_e4m3fn, device to cuda train_network.py:598
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 1605
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 1605
num epochs / epoch数: 7
batch size per device / バッチサイズ: 1
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 10000
steps: 0%| | 0/10000 [00:00<?, ?it/s]2025-02-12 17:16:19 INFO text_encoder is not needed for training. deleting to save memory. train_network.py:1067
INFO unet dtype: torch.float8_e4m3fn, device: cuda:0 train_network.py:1089

epoch 1/7
INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:715
/home/ste/projects/flux_kohya_02_11/kohya_ss/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: torch.cpu.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cpu', args...) instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
steps: 2%|███▌ | 250/10000 [03:30<2:16:55, 1.19it/s, avr_loss=0.438]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00000250.safetensors
/home/ste/projects/flux_kohya_02_11/kohya_ss/sd-scripts/networks/lora_flux.py:861: FutureWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/main/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
return super().state_dict(destination, prefix, keep_vars)
steps: 5%|███████▏ | 500/10000 [07:04<2:14:34, 1.18it/s, avr_loss=0.44]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00000500.safetensors
steps: 8%|██████████▊ | 750/10000 [10:39<2:11:24, 1.17it/s, avr_loss=0.44]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00000750.safetensors
steps: 10%|██████████████▏ | 1000/10000 [14:13<2:08:00, 1.17it/s, avr_loss=0.439]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00001000.safetensors
steps: 12%|█████████████████▉ | 1250/10000 [17:47<2:04:32, 1.17it/s, avr_loss=0.44]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00001250.safetensors
steps: 15%|█████████████████████▎ | 1500/10000 [21:21<2:01:03, 1.17it/s, avr_loss=0.438]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00001500.safetensors
steps: 16%|██████████████████████▊ | 1605/10000 [22:51<1:59:35, 1.17it/s, avr_loss=0.437]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-000001.safetensors

epoch 2/7
2025-02-12 17:39:11 INFO epoch is incremented. current_epoch: 1, epoch: 2 train_util.py:715
steps: 18%|████████████████████████▊ | 1750/10000 [24:56<1:57:33, 1.17it/s, avr_loss=0.438]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00001750.safetensors
steps: 20%|████████████████████████████▍ | 2000/10000 [28:30<1:54:01, 1.17it/s, avr_loss=0.438]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00002000.safetensors
steps: 22%|███████████████████████████████▉ | 2250/10000 [32:04<1:50:29, 1.17it/s, avr_loss=0.439]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00002250.safetensors
steps: 25%|███████████████████████████████████▌ | 2500/10000 [35:39<1:46:57, 1.17it/s, avr_loss=0.437]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00002500.safetensors
steps: 28%|███████████████████████████████████████ | 2750/10000 [39:13<1:43:24, 1.17it/s, avr_loss=0.436]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00002750.safetensors
steps: 30%|██████████████████████████████████████████▌ | 3000/10000 [42:47<1:39:51, 1.17it/s, avr_loss=0.437]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00003000.safetensors
steps: 32%|█████████████████████████████████████████████▌ | 3210/10000 [45:47<1:36:52, 1.17it/s, avr_loss=0.439]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-000002.safetensors

epoch 3/7
2025-02-12 18:02:07 INFO epoch is incremented. current_epoch: 2, epoch: 3 train_util.py:715
steps: 32%|██████████████████████████████████████████████▏ | 3250/10000 [46:22<1:36:18, 1.17it/s, avr_loss=0.439]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00003250.safetensors
steps: 35%|█████████████████████████████████████████████████▋ | 3500/10000 [49:56<1:32:45, 1.17it/s, avr_loss=0.436]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00003500.safetensors
steps: 38%|█████████████████████████████████████████████████████▎ | 3750/10000 [53:31<1:29:11, 1.17it/s, avr_loss=0.436]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00003750.safetensors
steps: 40%|████████████████████████████████████████████████████████▊ | 4000/10000 [57:05<1:25:38, 1.17it/s, avr_loss=0.436]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00004000.safetensors
steps: 42%|███████████████████████████████████████████████████████████▌ | 4250/10000 [1:00:39<1:22:04, 1.17it/s, avr_loss=0.436]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00004250.safetensors
steps: 45%|███████████████████████████████████████████████████████████████ | 4500/10000 [1:04:14<1:18:30, 1.17it/s, avr_loss=0.438]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00004500.safetensors
steps: 48%|██████████████████████████████████████████████████████████████████▌ | 4750/10000 [1:07:48<1:14:56, 1.17it/s, avr_loss=0.438]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00004750.safetensors
steps: 48%|███████████████████████████████████████████████████████████████████▍ | 4815/10000 [1:08:44<1:14:01, 1.17it/s, avr_loss=0.438]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-000003.safetensors

epoch 4/7
2025-02-12 18:25:04 INFO epoch is incremented. current_epoch: 3, epoch: 4 train_util.py:715
steps: 50%|██████████████████████████████████████████████████████████████████████ | 5000/10000 [1:11:23<1:11:23, 1.17it/s, avr_loss=0.439]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00005000.safetensors
steps: 52%|█████████████████████████████████████████████████████████████████████████▌ | 5250/10000 [1:14:57<1:07:49, 1.17it/s, avr_loss=0.437]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00005250.safetensors
steps: 55%|█████████████████████████████████████████████████████████████████████████████ | 5500/10000 [1:18:31<1:04:15, 1.17it/s, avr_loss=0.439]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00005500.safetensors
steps: 57%|████████████████████████████████████████████████████████████████████████████████▌ | 5750/10000 [1:22:06<1:00:41, 1.17it/s, avr_loss=0.439]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00005750.safetensors
steps: 60%|█████████████████████████████████████████████████████████████████████████████████████▏ | 6000/10000 [1:25:40<57:06, 1.17it/s, avr_loss=0.443]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00006000.safetensors
steps: 62%|█████████████████████████████████████████████████████████████████████████████████████████▍ | 6250/10000 [1:29:14<53:32, 1.17it/s, avr_loss=0.44]
saving checkpoint: /home/ste/projects/flux_kohya_02_11/kohya_ss/dataset/c4rr4r4_p4tt3rn/formatted/model/flux_base_caruso-step00006250.safetensors
steps: 63%|██████████████████████████████████████████████████████████████████████████████████████████▍ | 6322/10000 [1:30:16<52:31, 1.17it/s, avr_loss=0.44]
Segmentation fault (core dumped)

I am on the sd3-flux1 branch and have tried to:

  • Manually configure accelerate
  • Downgrade torch to 2.4.1+cu124
  • Update/downgrade accelerate (currently 0.33.0)
  • Update/downgrade bitsandbytes (currently 0.43.3)
  • Upgrade/downgrade xformers (currently 0.0.28.post1)
  • Train in full bf16 and full fp16 (on the up-to-date branch)
  • Check out the commit from November 11th (7edcbb0) to get back to a known-working version, but the crash persists. I'm currently on this commit.
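Since the only output is usually just "Segmentation fault (core dumped)", the next thing I want to try is pulling a native backtrace out of the core dump. A rough sketch of what I have in mind, assuming systemd-coredump is handling dumps on Arch:

    # Allow core dumps in this shell, then re-run training until it crashes.
    ulimit -c unlimited
    python sd-scripts/flux_train_network.py --config_file dataset/c4rr4r4_p4tt3rn/formatted/model/config_lora-20250212-170027.toml

    # After the crash, list recent python dumps and open the latest one in gdb.
    coredumpctl list python
    coredumpctl gdb        # then run "bt" inside gdb for the backtrace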

This is my pip list:

pip list
Package Version Editable project location ---------------------------- ------------ ------------------------------------------------------- absl-py 2.1.0 accelerate 0.33.0 aiofiles 23.2.1 aiohappyeyeballs 2.4.6 aiohttp 3.11.12 aiosignal 1.3.2 altair 4.2.2 annotated-types 0.7.0 antlr4-python3-runtime 4.9.3 anyio 4.8.0 astunparse 1.6.3 async-timeout 5.0.1 attrs 25.1.0 bitsandbytes 0.43.3 cachetools 5.5.1 certifi 2025.1.31 charset-normalizer 3.4.1 click 8.1.8 coloredlogs 15.0.1 dadaptation 3.2 diffusers 0.25.0 docker-pycreds 0.4.0 easygui 0.98.3 einops 0.7.0 entrypoints 0.4 exceptiongroup 1.2.2 fairscale 0.4.13 fastapi 0.115.8 ffmpy 0.5.0 filelock 3.17.0 flatbuffers 25.2.10 frozenlist 1.5.0 fsspec 2025.2.0 ftfy 6.1.1 gast 0.6.0 gitdb 4.0.12 GitPython 3.1.44 google-auth 2.38.0 google-auth-oauthlib 1.2.1 google-pasta 0.2.0 gradio 5.4.0 gradio_client 1.4.2 grpcio 1.70.0 h11 0.14.0 h5py 3.12.1 httpcore 1.0.7 httpx 0.28.1 huggingface-hub 0.25.2 humanfriendly 10.0 idna 3.10 imagesize 1.4.1 importlib_metadata 8.6.1 invisible-watermark 0.2.0 Jinja2 3.1.5 jsonschema 4.23.0 jsonschema-specifications 2024.10.1 keras 2.15.0 libclang 18.1.1 library 0.0.0 /home/ste/projects/flux_kohya_02_11/kohya_ss/sd-scripts lightning-utilities 0.12.0 lion-pytorch 0.0.6 lycoris_lora 3.1.0 Markdown 3.7 markdown-it-py 3.0.0 MarkupSafe 2.1.5 mdurl 0.1.2 ml-dtypes 0.2.0 mpmath 1.3.0 multidict 6.1.0 networkx 3.4.2 numpy 1.26.4 nvidia-cublas-cu12 12.4.2.65 nvidia-cuda-cupti-cu12 12.4.99 nvidia-cuda-nvrtc-cu12 12.4.99 nvidia-cuda-runtime-cu12 12.4.99 nvidia-cudnn-cu12 9.1.0.70 nvidia-cufft-cu12 11.2.0.44 nvidia-curand-cu12 10.3.5.119 nvidia-cusolver-cu12 11.6.0.99 nvidia-cusparse-cu12 12.3.0.142 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.4.99 nvidia-nvtx-cu12 12.4.99 oauthlib 3.2.2 omegaconf 2.3.0 onnx 1.16.1 onnxruntime-gpu 1.19.2 open-clip-torch 2.20.0 opencv-python 4.10.0.84 opt_einsum 3.4.0 orjson 3.10.15 packaging 24.2 pandas 2.2.3 pillow 11.1.0 pip 23.0.1 platformdirs 4.3.6 prodigyopt 1.0 propcache 0.2.1 protobuf 3.20.3 psutil 6.1.1 pyasn1 0.6.1 pyasn1_modules 0.4.1 pydantic 2.10.6 pydantic_core 2.27.2 pydub 0.25.1 Pygments 2.19.1 python-dateutil 2.9.0.post0 python-multipart 0.0.12 pytorch-lightning 1.9.0 pytz 2025.1 PyWavelets 1.8.0 PyYAML 6.0.2 referencing 0.36.2 regex 2024.11.6 requests 2.32.3 requests-oauthlib 2.0.0 rich 13.9.4 rpds-py 0.22.3 rsa 4.9 ruff 0.9.6 safehttpx 0.1.6 safetensors 0.4.4 schedulefree 1.2.7 scipy 1.11.4 semantic-version 2.10.0 sentencepiece 0.2.0 sentry-sdk 2.21.0 setproctitle 1.3.4 setuptools 65.5.0 shellingham 1.5.4 six 1.17.0 smmap 5.0.2 sniffio 1.3.1 starlette 0.45.3 sympy 1.13.1 tensorboard 2.15.2 tensorboard-data-server 0.7.2 tensorflow 2.15.0.post1 tensorflow-estimator 2.15.0 tensorflow-io-gcs-filesystem 0.37.1 termcolor 2.5.0 timm 0.6.12 tk 0.1.0 tokenizers 0.19.1 toml 0.10.2 tomlkit 0.12.0 toolz 1.0.0 torch 2.4.1+cu124 torchmetrics 1.6.1 torchvision 0.19.1+cu124 tqdm 4.67.1 transformers 4.44.2 triton 3.0.0 typer 0.15.1 typing_extensions 4.12.2 tzdata 2025.1 urllib3 2.3.0 uvicorn 0.34.0 voluptuous 0.13.1 wandb 0.18.0 wcwidth 0.2.13 websockets 12.0 Werkzeug 3.1.3 wheel 0.45.1 wrapt 1.14.1 xformers 0.0.28.post1 yarl 1.18.3 zipp 3.21.0
I don't know what else to do! What could it be? Are there some tweaks I'm missing, or a compatibility issue between libraries that I'm not aware of?

Thanks a lot for the help!

@b-fission (Contributor)

How much RAM (physical + swap) is being used while training? A segfault might indicate a lack of RAM.
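If you're not sure, logging memory in a second terminal while training runs would show whether RAM or swap fills up right before the crash (plain coreutils, nothing specific to this repo):

    # Print memory and swap usage every 10 seconds with a timestamp.
    while true; do date; free -h; sleep 10; done | tee mem_log.txt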


ingTutui commented Feb 13, 2025

Well, I've got 128 GB of RAM and an equal amount of swap; it's a custom-built PC.

EDIT:
I'm using ComfyUI every day, and I trained a 35k-step SDXL LoRA last month with the master branch (another project, another folder, another venv). Everything went smoothly! Only this exact project fails, across many fresh installations!


ingTutui commented Feb 13, 2025

Edit: new traceback (this time trying DreamBooth):

Traceback (most recent call last):
  File "/home/ste/projects/flux_kohya_02_13/kohya_ss/venv/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ste/projects/flux_kohya_02_13/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/home/ste/projects/flux_kohya_02_13/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1106, in launch_command
    simple_launcher(args)
  File "/home/ste/projects/flux_kohya_02_13/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 704, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ste/projects/flux_kohya_02_13/kohya_ss/venv/bin/python3.10', '/home/ste/projects/flux_kohya_02_13/kohya_ss/sd-scripts/flux_train.py', '--config_file', '/home/ste/projects/flux_kohya_02_13/kohya_ss/dataset/c4rr4r4_finetuning/formatted/model/config_dreambooth-20250213-175633.toml']' died with <Signals.SIGSEGV: 11>.
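Next I'll try launching the script directly with Python's faulthandler enabled, so a Python-level stack trace gets printed when the fatal signal arrives (standard CPython feature; the paths are just the ones from the traceback above):

    # faulthandler dumps the Python stack of all threads on fatal signals such as SIGSEGV.
    PYTHONFAULTHANDLER=1 python /home/ste/projects/flux_kohya_02_13/kohya_ss/sd-scripts/flux_train.py \
        --config_file /home/ste/projects/flux_kohya_02_13/kohya_ss/dataset/c4rr4r4_finetuning/formatted/model/config_dreambooth-20250213-175633.toml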


b-fission commented Feb 15, 2025

Are you using a recent Intel CPU, specifically 13th or 14th generation? Those chips are known to have instability issues if the BIOS isn't up to date. At least one user here has run into random crashes with their kohya training.
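For reference, the installed BIOS version and the CPU microcode revision loaded by the kernel can be checked from Linux with standard tools, roughly:

    # BIOS version and release date as reported by the firmware tables.
    sudo dmidecode -s bios-version
    sudo dmidecode -s bios-release-date
    # Microcode revision currently loaded for the CPU.
    grep -m1 microcode /proc/cpuinfo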


ingTutui commented Feb 17, 2025

Yeah, I'd read that comment before opening this issue. I've got an i9-13900 and the BIOS is up to date.

In the meantime, I've tried the same installation pipeline on RunPod and it works fine.
So I switched to Docker on my PC, building from the RunPod image (runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04) and using this pip list:

pip list
root@1f3693863cca:/workspace# pip list Package Version --------------------------------- --------------- anyio 4.2.0 argon2-cffi 23.1.0 argon2-cffi-bindings 21.2.0 arrow 1.3.0 asttokens 2.4.1 async-lru 2.0.4 attrs 23.2.0 Babel 2.14.0 beautifulsoup4 4.12.3 bleach 6.1.0 blinker 1.4 certifi 2024.2.2 cffi 1.16.0 charset-normalizer 3.3.2 comm 0.2.1 cryptography 3.4.8 dbus-python 1.2.18 debugpy 1.8.0 decorator 5.1.1 defusedxml 0.7.1 distro 1.7.0 entrypoints 0.4 exceptiongroup 1.2.0 executing 2.0.1 fastjsonschema 2.19.1 filelock 3.13.1 fqdn 1.5.1 fsspec 2024.2.0 h11 0.14.0 httpcore 1.0.2 httplib2 0.20.2 httpx 0.26.0 idna 3.6 importlib-metadata 4.6.4 ipykernel 6.29.0 ipython 8.21.0 ipython-genutils 0.2.0 ipywidgets 8.1.1 isoduration 20.11.0 jedi 0.19.1 jeepney 0.7.1 Jinja2 3.1.3 json5 0.9.14 jsonpointer 2.4 jsonschema 4.21.1 jsonschema-specifications 2023.12.1 jupyter-archive 3.4.0 jupyter_client 7.4.9 jupyter_contrib_core 0.4.2 jupyter_contrib_nbextensions 0.7.0 jupyter_core 5.7.1 jupyter-events 0.9.0 jupyter-highlight-selected-word 0.2.0 jupyter-lsp 2.2.2 jupyter-nbextensions-configurator 0.6.3 jupyter_server 2.12.5 jupyter_server_terminals 0.5.2 jupyterlab 4.1.0 jupyterlab_pygments 0.3.0 jupyterlab_server 2.25.2 jupyterlab-widgets 3.0.9 keyring 23.5.0 launchpadlib 1.10.16 lazr.restfulclient 0.14.4 lazr.uri 1.0.6 lxml 5.1.0 MarkupSafe 2.1.5 matplotlib-inline 0.1.6 mistune 3.0.2 more-itertools 8.10.0 mpmath 1.3.0 nbclassic 1.0.0 nbclient 0.9.0 nbconvert 7.14.2 nbformat 5.9.2 nest-asyncio 1.6.0 networkx 3.2.1 notebook 6.5.5 notebook_shim 0.2.3 numpy 1.26.3 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-nccl-cu12 2.19.3 nvidia-nvjitlink-cu12 12.3.101 nvidia-nvtx-cu12 12.1.105 oauthlib 3.2.0 overrides 7.7.0 packaging 23.2 pandocfilters 1.5.1 parso 0.8.3 pexpect 4.9.0 pillow 10.2.0 pip 24.0 platformdirs 4.2.0 prometheus-client 0.19.0 prompt-toolkit 3.0.43 psutil 5.9.8 ptyprocess 0.7.0 pure-eval 0.2.2 pycparser 2.21 Pygments 2.17.2 PyGObject 3.42.1 PyJWT 2.3.0 pyparsing 2.4.7 python-apt 2.4.0+ubuntu2 python-dateutil 2.8.2 python-json-logger 2.0.7 PyYAML 6.0.1 pyzmq 24.0.1 referencing 0.33.0 requests 2.31.0 rfc3339-validator 0.1.4 rfc3986-validator 0.1.1 rpds-py 0.17.1 SecretStorage 3.3.1 Send2Trash 1.8.2 setuptools 69.0.3 six 1.16.0 sniffio 1.3.0 soupsieve 2.5 stack-data 0.6.3 sympy 1.12 terminado 0.18.0 tinycss2 1.2.1 tomli 2.0.1 torch 2.2.0 torchaudio 2.2.0 torchvision 0.17.0 tornado 6.4 traitlets 5.14.1 triton 2.2.0 types-python-dateutil 2.8.19.20240106 typing_extensions 4.9.0 uri-template 1.3.0 urllib3 2.2.0 wadllib 1.3.6 wcwidth 0.2.13 webcolors 1.13 webencodings 0.5.1 websocket-client 1.7.0 wheel 0.42.0 widgetsnbextension 4.0.9 zipp 1.0.0
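For completeness, a container from that image can be started with GPU access roughly like this (the mount path here is just an example, not my exact setup):

    # Run the RunPod image with all GPUs visible and the project mounted at /workspace.
    docker run --gpus all -it --rm \
        -v /home/ste/projects/flux_kohya_02_11/kohya_ss:/workspace \
        runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04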

Somehow, this extra layer stabilizes the training process a little. Now 50-70% of the training runs make it to the end, whereas before none of them did.

I will look into updating the BIOS.
