
Cannot find the best model after training #31734

Closed · 1 of 4 tasks
aladinggit opened this issue Jul 1, 2024 · 21 comments

aladinggit commented Jul 1, 2024

System Info

  • transformers version: 4.40.2
  • Platform: Linux-5.15.0
  • Python version: 3.10.0
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: One node with 8 A100 40G GPUs

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am using the SFTTrainer to fully fine-tune the meta-llama/Meta-Llama-3-8B model. My SFT config and training arguments are below.

import torch
from datasets import load_dataset, DatasetDict
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("allenai/c4", data_files="en/c4-train.00000-of-01024.json.gz")
model_name = "meta-llama/Meta-Llama-3-8B"
train_testvalid = dataset["train"].train_test_split(test_size=0.99, seed=42) 
valid_test = train_testvalid["test"].train_test_split(test_size=0.999, seed=42) 

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # alternatively torch.float32
    device_map="auto",
)

model.train()

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token # tokenizer.pad_token == None
tokenizer.padding_side = "left"

dataset = DatasetDict({
    'train': train_testvalid['train'],
    'validation': valid_test['train'],
    'test': valid_test['test']})

repository_id = "llama3-tune"

sft_config = SFTConfig(
    dataset_text_field="text",
    output_dir=repository_id,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    max_seq_length=1024,
    # fp16_full_eval=True, # Overflows with fp16
    learning_rate=1e-4,
    num_train_epochs=1,
    optim="adamw_torch",
    warmup_ratio=0.1,
    # logging & evaluation strategies
    logging_dir=f"{repository_id}/logs",
    logging_strategy="steps",
    logging_steps=0.1,
    logging_first_step=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=0.1,
    save_total_limit=10,
    load_best_model_at_end=True,
    eval_accumulation_steps=2,
    eval_steps=0.1,
)



trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    args=sft_config,
)

trainer.train()
model.save_pretrained(repository_id)
tokenizer.save_pretrained(repository_id)

Expected behavior

At the end of training, I assume it should load the best model and save it in the output directory. However, a message always pops up saying:

Could not locate the best model at checkpoint-207/pytorch_model.bin, if you are running a distributed training on multiple nodes, you should activate `--save_on_each_node`.

I am only using one node for training. I am not sure whether the best model was actually saved and loaded, or whether the model written at the end is simply the one from the final step. Is this a bug related to safetensors? Could you please help me figure this out? Thanks!
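
For what it's worth, below is a small sanity check (not part of the run above, just a hedged sketch reusing the trainer object and imports from my script) to see where the best checkpoint went and what files it actually contains:

import os

best_ckpt = trainer.state.best_model_checkpoint
print("best checkpoint:", best_ckpt)
if best_ckpt is not None:
    # With save_safetensors at its default, this typically lists
    # model.safetensors (or sharded model-0000x-of-0000y.safetensors files
    # plus an index) rather than pytorch_model.bin.
    print(os.listdir(best_ckpt))
    # Reloading the checkpoint by hand works as a fallback if
    # load_best_model_at_end did not restore it.
    best_model = AutoModelForCausalLM.from_pretrained(
        best_ckpt, torch_dtype=torch.bfloat16, device_map="auto"
    )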

@amyeroberts
Collaborator

cc @muellerzr @SunMarc

github-actions bot

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ArthurZucker
Collaborator

I think this was fixed in the latest version of transformers, could you try with pip install -U transformers

@yaolu-zjut

I think this was fixed in the latest version of transformers, could you try with pip install -U transformers

Hi, I am hitting the same problem. When I fine-tune gemma2-2b-it, it reports 'No such file or directory: finetuned_lora_alpaca-llama/checkpoint-1400/pytorch_model.bin'. See horseee/LLM-Pruner#74 (comment) for details. The transformers version is 4.44.0.

@ArthurZucker
Collaborator

cc @muellerzr !

@irislin1006

Hi, just FYI, I would like to claim this issue :). This is also mentioned in #33345 (comment).

@irislin1006

irislin1006 commented Sep 7, 2024

Hi @ArthurZucker,

I attempted to reproduce this issue locally using google/gemma-2b and found that the problem appears to be resolved with the latest versions. For those encountering a similar issue, here are the package versions I used:

- PyTorch version: 2.4.1+cu121
- transformers version: 4.45.0.dev0

Here’s the code I used for testing, utilizing a small subset of the C4 dataset and configuring checkpointing and evaluation every few steps:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset, DatasetDict
from trl import SFTConfig, SFTTrainer

# Load dataset
raw_dataset = load_dataset('allenai/c4', data_files="en/c4-train.00000-of-01024.json.gz", split='train[:1%]')

# Split dataset
train_testvalid = raw_dataset.train_test_split(test_size=0.99, seed=42)
valid_test = train_testvalid["test"].train_test_split(test_size=0.999, seed=42)

dataset = DatasetDict({
    'train': train_testvalid['train'],
    'validation': valid_test['train'],
    'test': valid_test['test']
})

model_name = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)

# Training and evaluation setup
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

repository_id = "gemma2b-tune"

sft_config = SFTConfig(
    dataset_text_field="text",
    output_dir=repository_id,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    max_seq_length=1024,
    learning_rate=1e-4,
    num_train_epochs=1,
    optim="adamw_torch",
    warmup_ratio=0.1,
    max_steps=6,
    logging_dir=f"{repository_id}/logs",
    logging_strategy="steps",
    logging_steps=2,
    logging_first_step=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=2,
    save_total_limit=10,
    load_best_model_at_end=True,
    eval_accumulation_steps=2,
    eval_steps=2,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    args=sft_config,
)

# Train and save model
trainer.train()
model.save_pretrained(f"{repository_id}/best_model")

print(f"Trainer loaded best model loc: {trainer.state.best_model_checkpoint}")

Additionally, I tested with save_safetensors=False as mentioned in horseee/LLM-Pruner#74, and can confirm that pytorch_model.bin files were saved as expected.
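
For reference, a minimal sketch of that variant: save_safetensors is inherited from TrainingArguments, so it can be set directly on SFTConfig (only the checkpointing-related arguments are shown here; everything else stays as in the config above):

sft_config = SFTConfig(
    dataset_text_field="text",
    output_dir=repository_id,
    save_strategy="steps",
    save_steps=2,
    save_total_limit=10,
    load_best_model_at_end=True,
    save_safetensors=False,  # write legacy pytorch_model.bin instead of model.safetensors
    # ... remaining arguments as in the config above
)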

This setup successfully saved and loaded the best model without any issues. Let me know if you need further details or clarifications!

@SunMarc
Member

SunMarc commented Sep 9, 2024

Thanks for your help @irislin1006! Can you confirm, @aladinggit @yaolu-zjut, that the issue is solved?

@yaolu-zjut

Hi @irislin1006, thanks for your help; I think your solution is great. However, I think you are using the SFTTrainer from trl rather than the Trainer from transformers, which is why the problem does not occur. Following your example, I have switched to trl for training and no longer see the error.

@SunMarc
Member

SunMarc commented Sep 10, 2024

The original issue is with SFTTrainer, @yaolu-zjut. Did you have the issue with Trainer but not with SFTTrainer?

@yaolu-zjut

Yes


github-actions bot commented Oct 5, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@Ashutoshjangam

Hi, I want to work on it.

@ArthurZucker
Collaborator

Feel free to open a PR!

@muhd360

muhd360 commented Oct 8, 2024

Hi @ArthurZucker, can I open a PR regarding the same?

@muellerzr
Contributor

@muhd360 feel free!

@muhd360

muhd360 commented Oct 9, 2024

ok

@muhd360

muhd360 commented Oct 9, 2024

@muellerzr wouldn't this consist of making changes in huggingface/trl rather than in this repo?

@muellerzr
Contributor

@muhd360 no, as the report above states, it works for TRL but not for the Trainer.
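
For whoever picks this up, here is a rough sketch of the plain-Trainer path to reproduce against. It mirrors the setup above but uses transformers.Trainer with TrainingArguments; the tokenization step and split sizes are my own assumptions, not taken from the reports:

import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Small slice of C4, tokenized for causal LM training.
raw = load_dataset("allenai/c4", data_files="en/c4-train.00000-of-01024.json.gz", split="train[:1%]")
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=raw.column_names,
)
split = tokenized.train_test_split(test_size=0.01, seed=42)

args = TrainingArguments(
    output_dir="gemma2b-trainer-repro",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    max_steps=6,
    evaluation_strategy="steps",
    eval_steps=2,
    save_strategy="steps",
    save_steps=2,
    load_best_model_at_end=True,
    # save_safetensors defaults to True; the "could not locate ... pytorch_model.bin"
    # message suggests the best-model lookup is expecting the legacy .bin name.
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()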

@muhd360

muhd360 commented Oct 10, 2024

OK, got confused; I'm on it, will check.


github-actions bot commented Nov 4, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
