Cannot find the best model after training #31734
Comments
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I think this was fixed in the latest version of transformers, could you try with
Hi, I am running into the same problem. When I finetune gemma2-2b-it, it reports 'No such file or directory: finetuned_lora_alpaca-llama/checkpoint-1400/pytorch_model.bin'. See horseee/LLM-Pruner#74 (comment) for details. The transformers version is 4.44.0.
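For what it's worth, listing the contents of the checkpoint directory from the error shows what was actually written to disk. My understanding is that recent versions save model.safetensors (or adapter_model.safetensors for LoRA) rather than pytorch_model.bin, which may be why the hard-coded .bin path is not found. A quick check (path taken from the error message above):

```python
import os

# List what was actually written for this checkpoint; the path is the one
# from the error message above, adjust as needed.
ckpt_dir = "finetuned_lora_alpaca-llama/checkpoint-1400"
print(sorted(os.listdir(ckpt_dir)))
```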
cc @muellerzr!
Hi, just FYI, I would like to claim this issue :). Also mentioned in #33345 (comment).
Hi @ArthurZucker, I attempted to reproduce this issue locally using
Here’s the code I used for testing, utilizing a small subset of the C4 dataset and configuring checkpointing and evaluation at each step:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset, DatasetDict
from trl import SFTConfig, SFTTrainer

# Load dataset
raw_dataset = load_dataset('allenai/c4', data_files="en/c4-train.00000-of-01024.json.gz", split='train[:1%]')

# Split dataset
train_testvalid = raw_dataset.train_test_split(test_size=0.99, seed=42)
valid_test = train_testvalid["test"].train_test_split(test_size=0.999, seed=42)
dataset = DatasetDict({
    'train': train_testvalid['train'],
    'validation': valid_test['train'],
    'test': valid_test['test']
})

model_name = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)

# Training and evaluation setup
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

repository_id = "gemma2b-tune"
sft_config = SFTConfig(
    dataset_text_field="text",
    output_dir=repository_id,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    max_seq_length=1024,
    learning_rate=1e-4,
    num_train_epochs=1,
    optim="adamw_torch",
    warmup_ratio=0.1,
    max_steps=6,
    logging_dir=f"{repository_id}/logs",
    logging_strategy="steps",
    logging_steps=2,
    logging_first_step=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=2,
    save_total_limit=10,
    load_best_model_at_end=True,
    eval_accumulation_steps=2,
    eval_steps=2,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    args=sft_config,
)

# Train and save model
trainer.train()
model.save_pretrained(f"{repository_id}/best_model")
print(f"Trainer loaded best model loc: {trainer.state.best_model_checkpoint}")
```

Additionally, I tested with
This setup successfully saved and loaded the best model without any issues. Let me know if you need further details or clarifications!
Thanks for your help @irislin1006! Can you confirm, @aladinggit @yaolu-zjut, that the issue is solved?
Hi @irislin1006, thanks for your help; I think your solution is great. However, I think you used the SFTTrainer from trl rather than the Trainer from transformers, so this problem does not occur. Following your example, I have switched to trl for training and no longer get errors.
The original issue is with SFTTrainer @yaolu-zjut. Did you have the issue with Trainer but not with SFTTrainer?
Yes
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi, I want to work on it.
Feel free to open a PR!
Hi @ArthurZucker, can I open a PR regarding the same?
@muhd360 feel free!
Ok
@muellerzr wouldn't this consist of making changes in huggingface/trl rather than this repo?
@muhd360 no, as the above report states, it works with TRL but not with the Trainer.
Ok, got confused; I'm on it and will check.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers version: 4.40.2
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I am using the SFTTrainer to fully finetune the Meta-Llama-3-8B model. My SFT config and training arguments are below.
Expected behavior
At the end of training, I assume it should load the best model and save it in the output directory. However, a message always pops up saying that "
I am only using one node for training. I am not sure whether the best model has been saved and loaded, or whether it only saved the model after the whole run finished. Is this a bug related to safetensors? Could you please help me figure this out? Thanks!
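In the meantime, would a manual fallback like the following be reasonable, assuming the best checkpoint was actually written to disk? (The output path here is just an example.)

```python
from transformers import AutoModelForCausalLM

# Possible fallback (untested): reload the checkpoint the Trainer recorded as best
# and save it explicitly, assuming that directory really contains the weights.
best_ckpt = trainer.state.best_model_checkpoint
if best_ckpt is not None:
    best_model = AutoModelForCausalLM.from_pretrained(best_ckpt)
    best_model.save_pretrained("best_model_manual")  # example output path
```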