Merging lora adapter with Llama 3.2 vision #702
Hi! Can you show me what is inside of
|
Lastly, |
I was able to merge it without any issue. Here is my code that might help:

import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor
from peft import PeftModel

# Load the base model and processor
base_model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Run inference with the base model first, for comparison
prompt = f"<|image|><|begin_of_text|>question:{question}"
inputs = processor(image, prompt, return_tensors="pt").to(base_model.device)
output = base_model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0]))

# Load the LoRA adapter and merge it into the base weights
lora_model = PeftModel.from_pretrained(base_model, adapters_name)
model = lora_model.merge_and_unload()

processor = AutoProcessor.from_pretrained(model_id)
processor.bos_token_id = 1

# Run inference again with the merged model
inputs = processor(image, prompt, return_tensors="pt").to(model.device)
output2 = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output2[0]))

However, the inference result is identical to the base model. @wukaixingxp, any suggestion? I also checked the LoRA parameters:

lora_params = {n: p for n, p in lora_model.named_parameters() if "lora" in n}
for n, p in lora_params.items():
    print(n, p.sum())
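One additional check that may help narrow this down (a sketch reusing lora_model from the snippet above, so treat it as an assumption about your setup rather than a confirmed diagnosis): in PEFT, lora_B weights are initialized to zeros, so if every lora_B tensor is still all zeros the adapter contributes nothing and the merged model will behave exactly like the base model.

import torch

# Sketch: if all lora_B weights are zero, merge_and_unload() is effectively a no-op.
lora_b = {n: p for n, p in lora_model.named_parameters() if "lora_B" in n}
all_zero = all(torch.count_nonzero(p).item() == 0 for p in lora_b.values())
print(f"{len(lora_b)} lora_B tensors found, all zero: {all_zero}")

It is also worth confirming that PeftModel.from_pretrained did not warn about missing or unexpected keys when it loaded the adapter.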
Thank you for getting back to me. Do you happen to have a further explanation of why CPU must be used instead of GPU for this usage?
@tymoma01 I use |
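For reference, here is a minimal sketch of what merging on CPU and only moving to GPU afterwards can look like; this is my assumption about the workflow being suggested, not something spelled out above, and model_id / adapters_name are placeholders as in the earlier snippets.

import torch
from transformers import MllamaForConditionalGeneration
from peft import PeftModel

# Keep the full bf16 model on CPU for the merge step.
base_model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cpu",
)
lora_model = PeftModel.from_pretrained(base_model, adapters_name)
merged_model = lora_model.merge_and_unload()
merged_model.save_pretrained("llama-merged")

# Move the merged model to GPU only for inference (assumes enough VRAM for the full model).
merged_model = merged_model.to("cuda")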
I merged the LoRA adapter with the base model (Llama 3.2 11B Vision Instruct) by following these commands:

adapters_name = "path/to/peft/model"
lora_model = PeftModel.from_pretrained(base_model, adapters_name)
merged_model = lora_model.merge_and_unload()
merged_model.save_pretrained("llama-merged")

However, I'm unable to load the merged model: I'm getting an error. I also tried defining a quantization_config where I set ...
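In case it helps, here is a sketch of how the merged folder would normally be reloaded; the paths are the ones used above, and the note about quantization_config is an assumption on my part since the exact error is not shown.

import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

# Reload the merged checkpoint. If the base model had been loaded with a
# bitsandbytes quantization_config before merging, the saved config may carry
# that quantization_config along and change how this load behaves.
merged_model = MllamaForConditionalGeneration.from_pretrained(
    "llama-merged",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# model.save_pretrained() does not write the processor/tokenizer files, so save
# them into the merged folder too (model_id is the original base model id).
processor = AutoProcessor.from_pretrained(model_id)
processor.save_pretrained("llama-merged")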
System Info
CUDA Version: 12.4
GPU: A6000
🐛 Describe the bug
After finetuning Llama 3.2 Vision using FSDP + PEFT LoRA with this command:

A folder is created at PATH/to/save/PEFT/model containing the adapter files. I want to merge the adapter with the base model for inference.
To do that I used this code:
However, this code generates only 3 safetensors files in the output folder, whereas the base model originally had 5.
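The shard count by itself is not necessarily a problem: save_pretrained() re-shards the weights according to max_shard_size and the dtype being saved, so the merged model can legitimately end up with a different number of files. A quick way to check whether anything is actually missing is to compare the safetensors index files; the paths below are placeholders.

import json

def index_summary(folder):
    # Sharded checkpoints ship a model.safetensors.index.json mapping each
    # weight name to the shard file that contains it.
    with open(f"{folder}/model.safetensors.index.json") as f:
        index = json.load(f)
    return set(index["weight_map"]), index["metadata"]["total_size"]

base_keys, base_size = index_summary("path/to/base/model")
merged_keys, merged_size = index_summary("./model")

print("missing from merged:", base_keys - merged_keys)
print("extra in merged:", merged_keys - base_keys)
print("total bytes (base vs merged):", base_size, merged_size)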
Error logs
When trying to run inference on this merged model using:

python multi_modal_infer.py --image_path "<image_path>" --prompt_text "Describe this image" --temperature 0.1 --top_p 0.8 --model_name ./model --hf_token <hf_token>

I encounter the following error:
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [92,0,0], thread: [0,0,0] Assertion -sizes[i] <= index && index < sizes[i] && "index out of bounds" failed.
Full Traceback:
Expected behavior
Has anyone encountered this error while merging LoRA adapters for inference? Is this a tensor size mismatch issue or a problem with quantization (BitsAndBytes)? What might cause the reduced number of safetensors files, and how could I solve this?
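One way to narrow down whether this is an index/size mismatch rather than a quantization problem is the debugging sketch below (not a confirmed fix; it assumes the processor files are available in ./model, otherwise load them from the original model id).

import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

# Load the merged model on CPU: a device-side CUDA assert becomes an ordinary
# Python IndexError there, with a readable traceback.
model = MllamaForConditionalGeneration.from_pretrained(
    "./model", torch_dtype=torch.float32, device_map="cpu"
)
processor = AutoProcessor.from_pretrained("./model")

# Compare the embedding table size with the tokenizer's id range: an input id
# greater than or equal to the number of embedding rows would trigger exactly
# this kind of index-out-of-bounds assert during generate().
print("embedding rows:", model.get_input_embeddings().num_embeddings)
print("tokenizer size:", len(processor.tokenizer))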