
finetuning using notebook on custom dataset #788

Open · 1 of 2 tasks
amoghskanda opened this issue Nov 14, 2024 · 2 comments

Comments

@amoghskanda commented Nov 14, 2024

System Info

python 3.10.15
torch 2.5.1
transformers 4.46.2
tokenizers 0.20.3

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

I had to finetune Llama 3.2 11B Vision Instruct, so I downloaded the model from Hugging Face (https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct). I'm trying to finetune the model on a custom dataset of mine by following the finetuning notebook. When I start finetuning, I run into a list-to-tensor conversion error, which I'm guessing means the dataset is not in the right format. Could anybody suggest the correct dataset format?
I have ~4k images and a metadata.csv with 20 columns covering all the information about the images, plus a single prompt used for finetuning.
The code I used for generating the dataset:

import os
import pandas as pd
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer
from PIL import Image
from torchvision import transforms
import torch

image_folder = 'path to images folder'
csv_file = 'path to metadata.csv'
prompt = "The prompt used for FT"
metadata = pd.read_csv(csv_file)
metadata['image_path'] = metadata['file_name'].apply(lambda x: os.path.join(image_folder, x))


def load_image(image_path):
    image = Image.open(image_path).convert("RGB")
    return image

def preprocess_image(image):
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    return transform(image)


def tokenize_prompt(prompt, tokenizer):
    return tokenizer(prompt, return_tensors="pt", padding="max_length", truncation=True, max_length=512)


tokenizer = AutoTokenizer.from_pretrained("path to llama model")

data = []
for idx, row in metadata.iterrows():
    # 'image_path' already holds the full path (joined above), so don't
    # prepend image_folder a second time
    image_path = row["image_path"]
    image = load_image(image_path)
    image = preprocess_image(image)

    tokenized_prompt = tokenize_prompt(prompt, tokenizer)

    data_entry = {
        # Arrow can't store torch tensors natively, so the image tensor and
        # the token id lists are serialized as (nested) Python lists on disk
        "image": image,
        "text": prompt,
        "input_ids": tokenized_prompt["input_ids"].squeeze().tolist(),
        "attention_mask": tokenized_prompt["attention_mask"].squeeze().tolist(),
        "metadata": row.to_dict()
    }
    data.append(data_entry)

dataset = Dataset.from_pandas(pd.DataFrame(data))

dataset_dict = DatasetDict({
    "train": dataset
})

dataset_dict.save_to_disk("train_dataset")
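
For reference, the 11B Vision Instruct checkpoint normally receives images through its own processor rather than a torchvision Resize/ToTensor pipeline, and the processor also inserts the <|image|> placeholder token the model expects. Below is a minimal sketch of preparing a single example that way, following the model card's usage; the image path is hypothetical:

from transformers import AutoProcessor
from PIL import Image

processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

image = Image.open("example.jpg").convert("RGB")  # hypothetical image path
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "The prompt used for FT"},
    ]}
]
text = processor.apply_chat_template(messages, add_generation_prompt=False)

# Every value in `inputs` (input_ids, attention_mask, pixel_values, ...)
# comes back as a torch tensor, which is what train() expects when it
# moves each batch entry to the GPU with .to('cuda:0')
inputs = processor(image, text, return_tensors="pt")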

Error logs

{
"name": "AttributeError",
"message": "'list' object has no attribute 'to'",
"stack": "---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[9], line 15
12 scheduler = StepLR(optimizer, step_size=1, gamma=train_config.gamma)
14 # Start the training process
---> 15 results = train(
16 model,
17 train_dataloader['train'],
18 eval_dataloader['test'],
19 tokenizer,
20 optimizer,
21 scheduler,
22 train_config.gradient_accumulation_steps,
23 train_config,
24 None,
25 None,
26 None,
27 wandb_run=None,
28 )

File ~/anaconda3/envs/llama/lib/python3.10/site-packages/llama_recipes/utils/train_utils.py:151, in train(model, train_dataloader, eval_dataloader, tokenizer, optimizer, lr_scheduler, gradient_accumulation_steps, train_config, fsdp_config, local_rank, rank, wandb_run)
149 batch[key] = batch[key].to('xpu:0')
150 elif torch.cuda.is_available():
--> 151 batch[key] = batch[key].to('cuda:0')
152 with autocast():
153 loss = model(**batch).loss

AttributeError: 'list' object has no attribute 'to'"
}

I have tried keeping input_ids and attention_mask as PyTorch tensors, but then there was a problem converting the tensors to Arrow objects during dataset creation.
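
If the lists must stay in the Arrow file (Arrow can't store torch tensors natively), one workaround is the datasets format API, which converts the stored lists back to tensors on access. A sketch against the dataset saved above:

from datasets import load_from_disk

dataset = load_from_disk("train_dataset")["train"]
# Indexing the dataset (e.g. from a DataLoader) now yields torch tensors
# for these columns instead of Python lists
dataset.set_format(type="torch", columns=["image", "input_ids", "attention_mask"])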

Expected behavior

A guide on how to create a dataset compatible with Llama 3.2 11B Vision Instruct from images, metadata, and a prompt.

@HamidShojanazeri (Contributor) commented

cc: @wukaixingxp

@wukaixingxp (Contributor) commented

@amoghskanda You need to convert the list into a tensor, something like batch["labels"] = torch.tensor(label_list). Please check this example of how to convert the dialogs into tokens.
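
In dataloader terms, that suggestion amounts to a collate function that rebuilds tensors from the stored lists. A rough sketch (the key names follow the dataset built above; whether the model wants the image under "pixel_values" is an assumption):

import torch

def collate_fn(batch):
    # Stack per-example lists back into batched torch tensors so that
    # batch[key].to('cuda:0') in train() succeeds
    return {
        "pixel_values": torch.stack([torch.tensor(ex["image"]) for ex in batch]),
        "input_ids": torch.tensor([ex["input_ids"] for ex in batch]),
        "attention_mask": torch.tensor([ex["attention_mask"] for ex in batch]),
    }

# Hypothetical usage:
# train_dataloader = DataLoader(dataset["train"], batch_size=4, collate_fn=collate_fn)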
