
finetuning using notebook on custom dataset #788

Open · 1 of 2 tasks
amoghskanda opened this issue Nov 14, 2024 · 2 comments

Comments

@amoghskanda commented Nov 14, 2024

System Info

python 3.10.15
torch 2.5.1
transformers 4.46.2
tokenizers 0.20.3

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

I had to finetune Llama 3.2 11B Vision Instruct, so I downloaded the model from Hugging Face (https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct). I'm trying to finetune the model on a custom dataset of mine by following the finetuning notebook. When I start finetuning, I run into a list-to-tensor conversion error, which I'm guessing means the dataset is not in the right format. Could anybody suggest the correct dataset format?
I have ~4k images and a metadata.csv with 20 columns covering all the information about the images, plus a single prompt used for finetuning.
The code I used for generating the dataset:

import os
import pandas as pd
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer
from PIL import Image
from torchvision import transforms
import torch

image_folder = 'path to images folder'
csv_file = 'path to metadata.csv'
prompt = "The prompt used for FT"
metadata = pd.read_csv(csv_file)
metadata['image_path'] = metadata['file_name'].apply(lambda x: os.path.join(image_folder, x))


def load_image(image_path):
    image = Image.open(image_path).convert("RGB")
    return image

def preprocess_image(image):
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    return transform(image)


def tokenize_prompt(prompt, tokenizer):
    return tokenizer(prompt, return_tensors="pt", padding="max_length", truncation=True, max_length=512)


tokenizer = AutoTokenizer.from_pretrained("path to llama model")

data = []
for idx, row in metadata.iterrows():
    # 'image_path' already holds the full path (joined above), so don't
    # prepend image_folder a second time
    image_path = row["image_path"]
    image = load_image(image_path)
    image = preprocess_image(image)

    tokenized_prompt = tokenize_prompt(prompt, tokenizer)

    data_entry = {
        # Arrow can't store torch tensors natively, so the image tensor and
        # the token id lists are serialized as (nested) Python lists on disk
        "image": image,
        "text": prompt,
        "input_ids": tokenized_prompt["input_ids"].squeeze().tolist(),
        "attention_mask": tokenized_prompt["attention_mask"].squeeze().tolist(),
        "metadata": row.to_dict()
    }
    data.append(data_entry)

dataset = Dataset.from_pandas(pd.DataFrame(data))

dataset_dict = DatasetDict({
    "train": dataset
})

dataset_dict.save_to_disk("train_dataset")
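
For reference, the 11B Vision Instruct checkpoint normally receives images through its own processor rather than a torchvision Resize/ToTensor pipeline, and the processor also inserts the <|image|> placeholder token the model expects. Below is a minimal sketch of preparing a single example that way, following the model card's usage; the image path is hypothetical:

from transformers import AutoProcessor
from PIL import Image

processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

image = Image.open("example.jpg").convert("RGB")  # hypothetical image path
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "The prompt used for FT"},
    ]}
]
text = processor.apply_chat_template(messages, add_generation_prompt=False)

# Every value in `inputs` (input_ids, attention_mask, pixel_values, ...)
# comes back as a torch tensor, which is what train() expects when it
# moves each batch entry to the GPU with .to('cuda:0')
inputs = processor(image, text, return_tensors="pt")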

Error logs

{
"name": "AttributeError",
"message": "'list' object has no attribute 'to'",
"stack": "---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[9], line 15
12 scheduler = StepLR(optimizer, step_size=1, gamma=train_config.gamma)
14 # Start the training process
---> 15 results = train(
16 model,
17 train_dataloader['train'],
18 eval_dataloader['test'],
19 tokenizer,
20 optimizer,
21 scheduler,
22 train_config.gradient_accumulation_steps,
23 train_config,
24 None,
25 None,
26 None,
27 wandb_run=None,
28 )

File ~/anaconda3/envs/llama/lib/python3.10/site-packages/llama_recipes/utils/train_utils.py:151, in train(model, train_dataloader, eval_dataloader, tokenizer, optimizer, lr_scheduler, gradient_accumulation_steps, train_config, fsdp_config, local_rank, rank, wandb_run)
149 batch[key] = batch[key].to('xpu:0')
150 elif torch.cuda.is_available():
--> 151 batch[key] = batch[key].to('cuda:0')
152 with autocast():
153 loss = model(**batch).loss

AttributeError: 'list' object has no attribute 'to'"
}

I have tried keeping input_ids and attention_mask as PyTorch tensors, but then there was a problem converting the tensors to Arrow objects during dataset creation.
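
If the lists must stay in the Arrow file (Arrow can't store torch tensors natively), one workaround is the datasets format API, which converts the stored lists back to tensors on access. A sketch against the dataset saved above:

from datasets import load_from_disk

dataset = load_from_disk("train_dataset")["train"]
# Indexing the dataset (e.g. from a DataLoader) now yields torch tensors
# for these columns instead of Python lists
dataset.set_format(type="torch", columns=["image", "input_ids", "attention_mask"])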

Expected behavior

A guide on how to create a dataset compatible with Llama 3.2 11B Vision Instruct from images, metadata, and a prompt.

@HamidShojanazeri (Contributor) commented

cc: @wukaixingxp

@wukaixingxp (Contributor) commented

@amoghskanda You need to convert the list into a tensor, something like batch["labels"] = torch.tensor(label_list). Please check this example of how to convert the dialogs into tokens.
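
In dataloader terms, that suggestion amounts to a collate function that rebuilds tensors from the stored lists. A rough sketch (the key names follow the dataset built above; whether the model wants the image under "pixel_values" is an assumption):

import torch

def collate_fn(batch):
    # Stack per-example lists back into batched torch tensors so that
    # batch[key].to('cuda:0') in train() succeeds
    return {
        "pixel_values": torch.stack([torch.tensor(ex["image"]) for ex in batch]),
        "input_ids": torch.tensor([ex["input_ids"] for ex in batch]),
        "attention_mask": torch.tensor([ex["attention_mask"] for ex in batch]),
    }

# Hypothetical usage:
# train_dataloader = DataLoader(dataset["train"], batch_size=4, collate_fn=collate_fn)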
