
Cannot find the best model after training #31734

Closed · 1 of 4 tasks
aladinggit opened this issue Jul 1, 2024 · 21 comments

aladinggit commented Jul 1, 2024

System Info

  • transformers version: 4.40.2
  • Platform: Linux-5.15.0
  • Python version: 3.10.0
  • Huggingface_hub version: 0.22.2
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: One node with 8 A100 40G GPUs

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am using the SFTTrainer to fully fine-tune the meta-llama/Meta-Llama-3-8B model. My SFT config and training arguments are below.

import torch
from datasets import load_dataset, DatasetDict
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("allenai/c4", data_files="en/c4-train.00000-of-01024.json.gz")
model_name = "meta-llama/Meta-Llama-3-8B"
train_testvalid = dataset["train"].train_test_split(test_size=0.99, seed=42) 
valid_test = train_testvalid["test"].train_test_split(test_size=0.999, seed=42) 

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # alternatively torch.float32
    device_map="auto",
)

model.train()

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token # tokenizer.pad_token == None
tokenizer.padding_side = "left"

dataset = DatasetDict({
    'train': train_testvalid['train'],
    'validation': valid_test['train'],
    'test': valid_test['test']})

repository_id = "llama3-tune"

sft_config = SFTConfig(
    dataset_text_field="text",
    output_dir=repository_id,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    max_seq_length=1024,
    # fp16_full_eval=True, # Overflows with fp16
    learning_rate=1e-4,
    num_train_epochs=1,
    optim="adamw_torch",
    warmup_ratio=0.1,
    # logging & evaluation strategies
    logging_dir=f"{repository_id}/logs",
    logging_strategy="steps",
    logging_steps=0.1,
    logging_first_step=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=0.1,
    save_total_limit=10,
    load_best_model_at_end=True,
    eval_accumulation_steps=2,
    eval_steps=0.1,
)



trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    args=sft_config,
)

trainer.train()
model.save_pretrained(repository_id)
tokenizer.save_pretrained(repository_id)

Expected behavior

At the end of training, I assume it should load the best model and save it in the output directory. However, a message always pops up saying:

Could not locate the best model at checkpoint-207/pytorch_model.bin, if you are running a distributed training on multiple nodes, you should activate `--save_on_each_node`.

I am only using one node for training. I am not sure whether the best model was actually saved and loaded, or whether the model written at the end is simply the one from the final step. Is this a bug related to safetensors? Could you please help me figure this out? Thanks!
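
For what it's worth, below is a small sanity check (not part of the run above, just a hedged sketch reusing the trainer object and imports from my script) to see where the best checkpoint went and what files it actually contains:

import os

best_ckpt = trainer.state.best_model_checkpoint
print("best checkpoint:", best_ckpt)
if best_ckpt is not None:
    # With save_safetensors at its default, this typically lists
    # model.safetensors (or sharded model-0000x-of-0000y.safetensors files
    # plus an index) rather than pytorch_model.bin.
    print(os.listdir(best_ckpt))
    # Reloading the checkpoint by hand works as a fallback if
    # load_best_model_at_end did not restore it.
    best_model = AutoModelForCausalLM.from_pretrained(
        best_ckpt, torch_dtype=torch.bfloat16, device_map="auto"
    )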

@amyeroberts
Collaborator

cc @muellerzr @SunMarc

github-actions bot

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ArthurZucker
Collaborator

I think this was fixed in the latest version of transformers, could you try with pip install -U transformers

@yaolu-zjut

I think this was fixed in the latest version of transformers, could you try with pip install -U transformers

Hi, I am hitting the same problem. When I fine-tune gemma2-2b-it, it reports 'No such file or directory: finetuned_lora_alpaca-llama/checkpoint-1400/pytorch_model.bin'. See horseee/LLM-Pruner#74 (comment) for details. The transformers version is 4.44.0.

@ArthurZucker
Collaborator

cc @muellerzr !

@irislin1006

Hi, just FYI, I would like to claim this issue :). This is also mentioned in #33345 (comment).

@irislin1006

irislin1006 commented Sep 7, 2024

Hi @ArthurZucker,

I attempted to reproduce this issue locally using google/gemma-2b and found that the problem appears to be resolved with the latest versions. For those encountering a similar issue, here are the package versions I used:

- PyTorch version: 2.4.1+cu121
- transformers version: 4.45.0.dev0

Here’s the code I used for testing, utilizing a small subset of the C4 dataset and configuring checkpointing and evaluation every few steps:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset, DatasetDict
from trl import SFTConfig, SFTTrainer

# Load dataset
raw_dataset = load_dataset('allenai/c4', data_files="en/c4-train.00000-of-01024.json.gz", split='train[:1%]')

# Split dataset
train_testvalid = raw_dataset.train_test_split(test_size=0.99, seed=42)
valid_test = train_testvalid["test"].train_test_split(test_size=0.999, seed=42)

dataset = DatasetDict({
    'train': train_testvalid['train'],
    'validation': valid_test['train'],
    'test': valid_test['test']
})

model_name = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)

# Training and evaluation setup
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

repository_id = "gemma2b-tune"

sft_config = SFTConfig(
    dataset_text_field="text",
    output_dir=repository_id,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    max_seq_length=1024,
    learning_rate=1e-4,
    num_train_epochs=1,
    optim="adamw_torch",
    warmup_ratio=0.1,
    max_steps=6,
    logging_dir=f"{repository_id}/logs",
    logging_strategy="steps",
    logging_steps=2,
    logging_first_step=True,
    evaluation_strategy="steps",
    save_strategy="steps",
    save_steps=2,
    save_total_limit=10,
    load_best_model_at_end=True,
    eval_accumulation_steps=2,
    eval_steps=2,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    args=sft_config,
)

# Train and save model
trainer.train()
model.save_pretrained(f"{repository_id}/best_model")

print(f"Trainer loaded best model loc: {trainer.state.best_model_checkpoint}")

Additionally, I tested with save_safetensors=False as mentioned in horseee/LLM-Pruner#74, and can confirm that pytorch_model.bin files were saved as expected.
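
For reference, a minimal sketch of that variant: save_safetensors is inherited from TrainingArguments, so it can be set directly on SFTConfig (only the checkpointing-related arguments are shown here; everything else stays as in the config above):

sft_config = SFTConfig(
    dataset_text_field="text",
    output_dir=repository_id,
    save_strategy="steps",
    save_steps=2,
    save_total_limit=10,
    load_best_model_at_end=True,
    save_safetensors=False,  # write legacy pytorch_model.bin instead of model.safetensors
    # ... remaining arguments as in the config above
)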

This setup successfully saved and loaded the best model without any issues. Let me know if you need further details or clarifications!

@SunMarc
Member

SunMarc commented Sep 9, 2024

Thanks for your help @irislin1006! Can you confirm, @aladinggit @yaolu-zjut, that the issue is solved?

@yaolu-zjut

Hi @irislin1006, thanks for your help; I think your solution is great. However, I think you are using the SFTTrainer from trl rather than the Trainer from transformers, which is why the problem does not occur. Following your example, I have switched to trl for training and no longer see the error.

@SunMarc
Member

SunMarc commented Sep 10, 2024

The original issue is with SFTTrainer, @yaolu-zjut. Did you have the issue with Trainer but not with SFTTrainer?

@yaolu-zjut

Yes


github-actions bot commented Oct 5, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@Ashutoshjangam

Hi, I want to work on it.

@ArthurZucker
Collaborator

Feel free to open a PR!

@muhd360

muhd360 commented Oct 8, 2024

Hi @ArthurZucker, can I open a PR regarding the same?

@muellerzr
Contributor

@muhd360 feel free!

@muhd360

muhd360 commented Oct 9, 2024

ok

@muhd360

muhd360 commented Oct 9, 2024

@muellerzr wouldn't this consist of making changes in huggingface/trl rather than in this repo?

@muellerzr
Contributor

@muhd360 no, as the report above states, it works for TRL but not for the Trainer.
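
For whoever picks this up, here is a rough sketch of the plain-Trainer path to reproduce against. It mirrors the setup above but uses transformers.Trainer with TrainingArguments; the tokenization step and split sizes are my own assumptions, not taken from the reports:

import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Small slice of C4, tokenized for causal LM training.
raw = load_dataset("allenai/c4", data_files="en/c4-train.00000-of-01024.json.gz", split="train[:1%]")
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=raw.column_names,
)
split = tokenized.train_test_split(test_size=0.01, seed=42)

args = TrainingArguments(
    output_dir="gemma2b-trainer-repro",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    max_steps=6,
    evaluation_strategy="steps",
    eval_steps=2,
    save_strategy="steps",
    save_steps=2,
    load_best_model_at_end=True,
    # save_safetensors defaults to True; the "could not locate ... pytorch_model.bin"
    # message suggests the best-model lookup is expecting the legacy .bin name.
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()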

@muhd360

muhd360 commented Oct 10, 2024

OK, got confused; I'm on it, will check.


github-actions bot commented Nov 4, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
