This example demonstrates how to finetune a llama2-7b model using BigDL-LLM 4bit optimizations on Intel CPUs.
- Single node with single socket: simple example or alpaca example
- Single node with multiple sockets
- Multiple nodes with multiple sockets
This example is ported from bnb-4bit-training.
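The core idea is to load the base model with 4bit (NF4) weights and then attach small trainable LoRA adapters. Below is a minimal sketch of that setup, assuming the bigdl.llm.transformers.qlora helpers and a llama2-7b checkpoint such as meta-llama/Llama-2-7b-hf; the actual qlora_finetuning_cpu.py may differ in details.

```python
from transformers import LlamaTokenizer
from peft import LoraConfig
from bigdl.llm.transformers import AutoModelForCausalLM
from bigdl.llm.transformers.qlora import get_peft_model, prepare_model_for_kbit_training

base_model = "meta-llama/Llama-2-7b-hf"  # assumption: any llama2-7b checkpoint works

tokenizer = LlamaTokenizer.from_pretrained(base_model)

# Load the base model with 4bit NF4 weights for QLoRA finetuning on CPU.
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_low_bit="nf4",
    optimize_model=False,   # keep the original module structure for training
)
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters; only these small matrices are trained.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj"],  # assumption: attention projections
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```

To set up the environment for the example: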
conda create -n llm python=3.9
conda activate llm
pip install --pre --upgrade bigdl-llm[all]
pip install transformers==4.34.0
pip install peft==0.5.0
pip install datasets
pip install accelerate==0.23.0
pip install bitsandbytes scipy
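After installing, a quick sanity check (a hypothetical helper, not part of the example scripts) can confirm that the pinned versions were picked up:

```python
import transformers, peft, accelerate, datasets
import bigdl.llm  # raises ImportError if bigdl-llm[all] is not installed

print("transformers:", transformers.__version__)  # expect 4.34.0
print("peft:", peft.__version__)                  # expect 0.5.0
print("accelerate:", accelerate.__version__)      # expect 0.23.0
print("datasets:", datasets.__version__)
```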
If the machine does not have enough memory, you can try setting use_gradient_checkpointing=True here. Gradient checkpointing improves memory efficiency but slows training by approximately 20%.
We recommend a micro_batch_size of 8 for better performance on 48 cores in this example (see the sketch below). You can refer to this guide for more details.
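A hedged sketch of where these two knobs typically appear; the real script may expose them differently (for example as command line arguments):

```python
from bigdl.llm.transformers.qlora import prepare_model_for_kbit_training

# model is the 4bit model loaded as in the sketch above.
# Trade roughly 20% training speed for lower peak memory.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

# micro_batch_size is the per-device batch size; the effective batch size is
# reached through gradient accumulation (128 is an assumption for illustration).
micro_batch_size = 8
batch_size = 128
gradient_accumulation_steps = batch_size // micro_batch_size
```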
Remember to source bigdl-llm-init before you start finetuning, which can accelerate the job:
source bigdl-llm-init -t
python ./qlora_finetuning_cpu.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --dataset DATASET
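For reference, here is a rough sketch (under assumptions, not the literal contents of qlora_finetuning_cpu.py) of a dataset and Trainer setup that runs 200 steps with a learning rate decaying from 2e-4 to 0, as in the log below. The Abirate/english_quotes dataset is an assumption carried over from the ported bnb-4bit-training notebook; model, tokenizer, micro_batch_size, and gradient_accumulation_steps come from the sketches above.

```python
import transformers
from datasets import load_dataset

# Assumption: the same quotes dataset as the ported bnb-4bit-training notebook.
data = load_dataset("Abirate/english_quotes")
tokenizer.pad_token = tokenizer.eos_token  # llama has no pad token by default
train_data = data["train"].map(lambda s: tokenizer(s["quote"]), batched=True)

trainer = transformers.Trainer(
    model=model,            # the LoRA-wrapped 4bit model from the sketch above
    train_dataset=train_data,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=micro_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_steps=20,
        max_steps=200,
        learning_rate=2e-4,
        logging_steps=20,
        save_steps=100,
        output_dir="outputs",
        optim="adamw_torch",
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False  # avoid a warning during training; re-enable for inference
trainer.train()
```

The finetuning run prints a training log similar to: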
{'loss': 2.5668, 'learning_rate': 0.0002, 'epoch': 0.03}
{'loss': 1.6988, 'learning_rate': 0.00017777777777777779, 'epoch': 0.06}
{'loss': 1.3073, 'learning_rate': 0.00015555555555555556, 'epoch': 0.1}
{'loss': 1.3495, 'learning_rate': 0.00013333333333333334, 'epoch': 0.13}
{'loss': 1.1746, 'learning_rate': 0.00011111111111111112, 'epoch': 0.16}
{'loss': 1.0794, 'learning_rate': 8.888888888888889e-05, 'epoch': 0.19}
{'loss': 1.2214, 'learning_rate': 6.666666666666667e-05, 'epoch': 0.22}
{'loss': 1.1698, 'learning_rate': 4.4444444444444447e-05, 'epoch': 0.26}
{'loss': 1.2044, 'learning_rate': 2.2222222222222223e-05, 'epoch': 0.29}
{'loss': 1.1516, 'learning_rate': 0.0, 'epoch': 0.32}
{'train_runtime': xxx, 'train_samples_per_second': xxx, 'train_steps_per_second': xxx, 'train_loss': 1.3923714351654053, 'epoch': 0.32}
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [xx:xx<xx:xx, xxxs/it]
TrainOutput(global_step=200, training_loss=1.3923714351654053, metrics={'train_runtime': xx, 'train_samples_per_second': xx, 'train_steps_per_second': xx, 'train_loss': 1.3923714351654053, 'epoch': 0.32})
Use export_merged_model.py to merge the adapter into the base model:
python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --adapter_path ./outputs/checkpoint-200 --output_path ./outputs/checkpoint-200-merged
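Conceptually, the merge step loads the base model in full precision, applies the LoRA adapter from ./outputs/checkpoint-200, folds the adapter weights into the base layers, and saves the result. export_merged_model.py is the supported way to do this; the snippet below only sketches the same idea with standard transformers/peft APIs, assuming the same base checkpoint used for finetuning.

```python
import torch
from transformers import AutoModelForCausalLM, LlamaTokenizer
from peft import PeftModel

base_model_path = "meta-llama/Llama-2-7b-hf"  # assumption: the base checkpoint used for finetuning
adapter_path = "./outputs/checkpoint-200"
output_path = "./outputs/checkpoint-200-merged"

# Load the base model in full precision, apply the adapter, and merge it into the weights.
base = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()

merged.save_pretrained(output_path)
LlamaTokenizer.from_pretrained(base_model_path).save_pretrained(output_path)
```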
Then you can use ./outputs/checkpoint-200-merged as a normal Hugging Face Transformers model for inference.
Train more steps and try an input sentence like ['quote'] -> [?] to verify the fine-tuned model. For example, you can run inference with the prompt “QLoRA fine-tuning using BigDL-LLM 4bit optimizations on Intel CPU is Efficient and convenient” ->: using the BigDL-LLM llama2 example (link), after updating LLAMA2_PROMPT_FORMAT = "{prompt}" in that example.
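Alternatively, a minimal standalone inference sketch (assuming the merged checkpoint path above and BigDL-LLM's 4bit loading) looks like this:

```python
from transformers import LlamaTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

model_path = "./outputs/checkpoint-200-merged"
prompt = ("“QLoRA fine-tuning using BigDL-LLM 4bit optimizations on Intel CPU "
          "is Efficient and convenient” ->: ")

tokenizer = LlamaTokenizer.from_pretrained(model_path)
# Load the merged model with 4bit optimizations for fast CPU inference.
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Or run the provided generate.py directly: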
python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt "“QLoRA fine-tuning using BigDL-LLM 4bit optimizations on Intel CPU is Efficient and convenient” ->:" --n-predict 20
Base_model output
Inference time: xxx s
-------------------- Prompt --------------------
“QLoRA fine-tuning using BigDL-LLM 4bit optimizations on Intel CPU is Efficient and convenient” ->:
-------------------- Output --------------------
“QLoRA fine-tuning using BigDL-LLM 4bit optimizations on Intel CPU is Efficient and convenient” ->: 💻 Fine-tuning a language model on a powerful device like an Intel CPU
Merged_model output
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Inference time: xxx s
-------------------- Prompt --------------------
“QLoRA fine-tuning using BigDL-LLM 4bit optimizations on Intel CPU is Efficient and convenient” ->:
-------------------- Output --------------------
“QLoRA fine-tuning using BigDL-LLM 4bit optimizations on Intel CPU is Efficient and convenient” ->: ['bigdl'] ['deep-learning'] ['distributed-computing'] ['intel'] ['optimization'] ['training'] ['training-speed']