Skip to content

Latest commit

 

History

History
236 lines (161 loc) · 7.94 KB

File metadata and controls

236 lines (161 loc) · 7.94 KB

QLoRA Finetuning with IPEX-LLM

This example ports Alpaca-LoRA to IPEX-LLM (using QLoRA algorithm) on Intel GPU.

Note: You could also refer to simple QLoRA example to try related usage.

0. Requirements

To run this example with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to here for more information.

1. Install

conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
pip install transformers==4.36.0 datasets
pip install fire peft==0.10.0
pip install oneccl_bind_pt==2.1.100 --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ # necessary to run distributed finetuning
pip install bitsandbytes scipy
# configures OneAPI environment variables
source /opt/intel/oneapi/setvars.sh # necessary to run before installing deepspeed
pip install git+https://github.com/microsoft/DeepSpeed.git@78c518e
pip install git+https://github.com/intel/intel-extension-for-deepspeed.git@ec33277

2. Configures OneAPI environment variables

source /opt/intel/oneapi/setvars.sh

3. QLoRA Finetune

Here, we provide example usages on different hardware. Please refer to the appropriate script based on your device and model:

Show LLaMA2-7B examples
Finetuning LLaMA2-7B on single Arc A770
bash qlora_finetune_llama2_7b_arc_1_card.sh
Finetuning LLaMA2-7B on two Arc A770
bash qlora_finetune_llama2_7b_arc_2_card.sh
Finetuning LLaMA2-7B on single Data Center GPU Flex 170
bash qlora_finetune_llama2_7b_flex_170_1_card.sh
Finetuning LLaMA2-7B on three Data Center GPU Flex 170
bash qlora_finetune_llama2_7b_flex_170_3_card.sh
Finetuning LLaMA2-7B on single Intel Data Center GPU Max 1100
bash qlora_finetune_llama2_7b_pvc_1100_1_card.sh
Finetuning LLaMA2-7B on four Intel Data Center GPU Max 1100
bash qlora_finetune_llama2_7b_pvc_1100_4_card.sh
Finetuning LLaMA2-7B on single Intel Data Center GPU Max 1550
bash qlora_finetune_llama2_7b_pvc_1550_1_card.sh
Finetuning LLaMA2-7B on four Intel Data Center GPU Max 1550
bash qlora_finetune_llama2_7b_pvc_1550_4_card.sh
Show LLaMA2-13B examples
Finetuning LLaMA2-13B on single tile of Intel Data Center GPU Max 1550
bash qlora_finetune_llama2_13b_pvc_1550_1_tile.sh
Finetuning LLaMA2-13B on single Intel Data Center GPU Max 1550
bash qlora_finetune_llama2_13b_pvc_1550_1_card.sh
Finetuning LLaMA2-13B on four Intel Data Center GPU Max 1550
bash qlora_finetune_llama2_13b_pvc_1550_4_card.sh
Show LLaMA2-70B examples

Different from LLaMA2-7B and LLaMA2-13B, it is recommonded to save the model with ipex-llm low-bit optimization first to avoid large amount of CPU memory usage. And DeepSpeed ZeRO2 technology is used during finetuning.

Finetuning LLaMA2-70B on one Intel Data Center GPU Max 1550
bash qlora_finetune_llama2_70b_pvc_1550_1_card.sh
Finetuning LLaMA2-70B on four Intel Data Center GPU Max 1550
bash qlora_finetune_llama2_70b_pvc_1550_4_card.sh
Show LLaMA3-8B examples
Finetuning LLaMA3-8B on single Arc A770
bash qlora_finetune_llama3_8b_arc_1_card.sh
Show ChatGLM3-6B examples
Finetuning ChatGLM3-6B examples on single Arc A770
bash qlora_finetune_chatglm3_6b_arc_1_card.sh
Show Qwen-1.5-7B examples
Finetuning Qwen-1.5-7B examples on single Arc A770

Install transformers 4.37.0

pip install transformers==4.37.0
bash qlora_finetune_qwen15_7b_arc_1_card.sh
Show Baichuan2-7B examples
Finetuning Baichuan2-7B examples on single Arc A770

Please download Baichuan2-7B-Chat. Modify modeling_baichuan.py in model dir. Add following 2 lines into Line 234. This change fixes Baichuan2 type mismatch issue.

if(attention_mask.dtype != query_states.dtype):
    attention_mask = attention_mask.to(query_states.dtype)

After modification, line 234-236 should look like below.

with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=True, enable_mem_efficient=True):
    if(attention_mask.dtype != query_states.dtype):
        attention_mask = attention_mask.to(query_states.dtype)
    attn_output = F.scaled_dot_product_attention(query_states, key_states, value_states, attn_mask = attention_mask)

Modify --base_model in qlora_finetune_baichuan2_7b_arc_1_card.sh. Then, launch finetune.

bash qlora_finetune_baichuan2_7b_arc_1_card.sh

4. (Optional) Resume Training

If you fail to complete the whole finetuning process, it is suggested to resume training from a previously saved checkpoint by specifying resume_from_checkpoint to the local checkpoint folder as following:**

python ./alpaca_qlora_finetuning.py \
    --base_model "meta-llama/Llama-2-7b-hf" \
    --data_path "yahma/alpaca-cleaned" \
    --output_dir "./ipex-llm-qlora-alpaca" \
    --resume_from_checkpoint "./ipex-llm-qlora-alpaca/checkpoint-1100"

5. Sample Output

{'loss': 1.9231, 'learning_rate': 2.9999945367033285e-05, 'epoch': 0.0}                                                                                                                          
{'loss': 1.8622, 'learning_rate': 2.9999781468531096e-05, 'epoch': 0.01}                                                                                                                         
{'loss': 1.9043, 'learning_rate': 2.9999508305687345e-05, 'epoch': 0.01}                                                                                                                         
{'loss': 1.8967, 'learning_rate': 2.999912588049185e-05, 'epoch': 0.01}                                                                                                                          
{'loss': 1.9658, 'learning_rate': 2.9998634195730358e-05, 'epoch': 0.01}                                                                                                                         
{'loss': 1.8386, 'learning_rate': 2.9998033254984483e-05, 'epoch': 0.02}                                                                                                                         
{'loss': 1.809, 'learning_rate': 2.999732306263172e-05, 'epoch': 0.02}                                                                                                                           
{'loss': 1.8552, 'learning_rate': 2.9996503623845395e-05, 'epoch': 0.02}                                                                                                                         
  1%|█                                                                                                                                                         | 8/1164 [xx:xx<xx:xx:xx, xx s/it]

6. Merge the adapter into the original model

python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --adapter_path ./outputs/checkpoint-200 --output_path ./outputs/checkpoint-200-merged

Then you can use ./outputs/checkpoint-200-merged as a normal huggingface transformer model to do inference.

7. Troubleshooting

Please refer to here for solutions of common issues during finetuning.