This document provides detailed instructions on how to convert and quantize large language models (LLMs) for efficient deployment on the Huawei Kunpeng 920 platform.
- Introduction
- Prerequisites
- ChatGLM Model Conversion
- Baichuan Model Conversion
- Qwen Model Conversion
- GGML Format Conversion
- Quantization Techniques
- Verifying Converted Models
Model conversion and quantization are crucial steps in deploying large language models on hardware with limited resources. These processes help reduce model size and improve inference speed while maintaining acceptable performance.
Before proceeding, ensure you have:
- Python 3.8+ installed
- Required libraries: transformers, torch, sentencepiece, accelerate
- Sufficient disk space for model weights
- Access to the original model weights (usually downloaded from Hugging Face or other model hubs)
```bash
pip install transformers torch sentencepiece accelerate
```
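After installing, a quick sanity check that the required libraries import correctly can save time before downloading large model weights. A minimal sketch (package names as listed above):

```python
import torch, transformers, sentencepiece, accelerate

# Print the installed versions of the libraries required for conversion.
print("torch        :", torch.__version__)
print("transformers :", transformers.__version__)
print("sentencepiece:", sentencepiece.__version__)
print("accelerate   :", accelerate.__version__)
```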
To convert ChatGLM models, use the following steps:
```bash
cd chatglm.cpp
python3 chatglm_cpp/convert.py -i /path/to/chatglm/model -t q4_0 -o chatglm-ggml.bin
```
Replace `/path/to/chatglm/model` with the actual path to your ChatGLM model.

The `-t q4_0` flag specifies the quantization type. You can choose a different quantization type based on your needs:

- `q4_0`: 4-bit quantization (smallest size, fastest inference, some quality loss)
- `q4_1`: 4-bit quantization with a different algorithm
- `q5_0`: 5-bit quantization
- `q8_0`: 8-bit quantization (larger size, slower inference, better quality)
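If you are unsure which type to use, it can help to convert the model with several quantization types and compare the resulting file sizes before testing quality. A minimal sketch that wraps the `convert.py` command shown above (paths are placeholders; adjust them to your setup):

```python
import pathlib
import subprocess

# Run the conversion once per quantization type and report the output file size.
model_dir = "/path/to/chatglm/model"  # placeholder: your local ChatGLM weights
for qtype in ["q4_0", "q4_1", "q5_0", "q8_0"]:
    out = f"chatglm-ggml-{qtype}.bin"
    subprocess.run(
        ["python3", "chatglm_cpp/convert.py", "-i", model_dir, "-t", qtype, "-o", out],
        check=True,
    )
    size_mib = pathlib.Path(out).stat().st_size / 2**20
    print(f"{qtype}: {size_mib:.0f} MiB")
```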
For Baichuan models, use a similar process:
```bash
python3 chatglm_cpp/convert.py -i /path/to/baichuan/model -t q4_0 -o baichuan-ggml.bin
```
If you're converting a fine-tuned Baichuan model, you can specify the path to the fine-tuned weights:
```bash
python3 chatglm_cpp/convert.py -i /path/to/baichuan/model -t q4_0 -o baichuan-finetuned-ggml.bin -l /path/to/finetuned/weights
```
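If the fine-tuned weights are a LoRA adapter trained with `peft`, an alternative is to merge the adapter into the base model first and then convert the merged checkpoint with the standard command. A minimal sketch, assuming a `peft`-style adapter (all paths are placeholders):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_dir = "/path/to/baichuan/model"        # placeholder: original base weights
adapter_dir = "/path/to/finetuned/weights"  # placeholder: LoRA adapter directory
merged_dir = "/path/to/baichuan-merged"     # placeholder: output directory

# Load the base model, apply the adapter, and fold the LoRA weights back in.
base = AutoModelForCausalLM.from_pretrained(base_dir, torch_dtype=torch.float16, trust_remote_code=True)
merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
merged.save_pretrained(merged_dir)

# Keep the tokenizer alongside the merged weights so the converter can find it.
AutoTokenizer.from_pretrained(base_dir, trust_remote_code=True).save_pretrained(merged_dir)
```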
For Qwen models, the process is slightly different:
```bash
cd qwen.cpp
python3 qwen_cpp/convert.py -i /path/to/qwen/model -t q4_0 -o qwen-ggml.bin
```
Note that when running the converted model, Qwen models may require their tiktoken tokenizer file to be passed explicitly (the `main` binary must be built beforehand; see the qwen.cpp repository for build instructions):
```bash
./build/bin/main -m qwen-ggml.bin --tiktoken "/path/to/qwen/tokenizer/qwen.tiktoken" -i
```
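If you are not sure where the tokenizer file lives, it usually ships alongside the downloaded Qwen weights. A small helper to locate it (the file name may vary between model versions):

```python
from pathlib import Path

# Look for the tiktoken tokenizer file next to the downloaded Qwen weights.
model_dir = Path("/path/to/qwen/model")  # placeholder: your local Qwen weights
tokenizer_files = sorted(model_dir.glob("*.tiktoken"))
print(tokenizer_files or "no .tiktoken file found -- check the model download")
```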
The GGML format is commonly used for efficient inference on CPU. The conversion scripts provided above typically handle the conversion to GGML format automatically.
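As a basic sanity check, confirm that the converted file exists and is substantially smaller than the original weights. The sketch below also prints the first four bytes of the file; the exact header layout depends on the converting project, so treat the printed magic as informational only:

```python
from pathlib import Path

path = Path("chatglm-ggml.bin")  # placeholder: your converted file
with path.open("rb") as f:
    magic = f.read(4)  # file magic written by the converter
print(f"size : {path.stat().st_size / 2**30:.2f} GiB")
print(f"magic: {magic!r}")
```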
Quantization reduces model precision to decrease size and increase inference speed. Common quantization types include:
- INT8: 8-bit integer quantization
- INT4: 4-bit integer quantization
- Mixed precision: Combination of different quantization levels for different parts of the model
The `-t` flag in the conversion scripts specifies the quantization type. Experiment with different types to find the best balance between model size, inference speed, and output quality for your use case.
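To make the size/quality trade-off concrete, the sketch below illustrates block-wise symmetric 8-bit quantization: each block of weights is stored as 8-bit integers plus one floating-point scale. This mirrors the general idea behind formats such as `q8_0`, but it is not the exact on-disk layout used by the GGML converters:

```python
import numpy as np

def quantize_8bit(weights: np.ndarray, block_size: int = 32):
    """Block-wise symmetric 8-bit quantization (conceptual example only)."""
    blocks = weights.reshape(-1, block_size)
    # One scale per block, chosen so the largest value maps to +/-127.
    scales = np.maximum(np.abs(blocks).max(axis=1, keepdims=True) / 127.0, 1e-12)
    q = np.round(blocks / scales).astype(np.int8)
    return q, scales

def dequantize_8bit(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_8bit(w)
w_hat = dequantize_8bit(q, s)
print("max abs error:", np.abs(w - w_hat).max())
print("storage      : 8-bit ints + one float scale per 32 weights (vs. 32-bit floats)")
```

Lower-bit formats such as `q4_0` push the same idea further: fewer bits per weight, smaller files, and faster inference, at the cost of larger rounding error.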
After conversion, it's crucial to verify the model's performance:
```bash
# For ChatGLM and Baichuan models
./build/bin/main -m model-ggml.bin -i

# For Qwen models
./build/bin/main -m qwen-ggml.bin --tiktoken "/path/to/qwen/tokenizer/qwen.tiktoken" -i
```
This will start an interactive session where you can test the model's outputs.
Remember to compare the outputs of the quantized model with the original model to ensure acceptable quality. You may need to adjust the quantization parameters if you observe significant degradation in performance.
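One convenient way to do this comparison is to run the same prompt through both models from Python. A minimal sketch, assuming the `chatglm-cpp` Python bindings (`pip install chatglm-cpp`) and ChatGLM's `transformers` remote code; the exact `chat` signatures vary between versions, so adapt as needed:

```python
import chatglm_cpp
from transformers import AutoModel, AutoTokenizer

prompt = "Explain quantization in one sentence."

# Quantized model via the chatglm-cpp bindings (chat API may differ between versions).
quantized = chatglm_cpp.Pipeline("chatglm-ggml.bin")
reply = quantized.chat([chatglm_cpp.ChatMessage(role="user", content=prompt)])
print("quantized:", reply.content)

# Original full-precision model via transformers (ChatGLM's remote code exposes .chat()).
model_dir = "/path/to/chatglm/model"  # placeholder: original weights
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True).float().eval()
response, _ = model.chat(tokenizer, prompt, history=[])
print("original :", response)
```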
Note: The exact paths and filenames may vary depending on your specific setup and the versions of the models you're using. Always refer to the most recent documentation of the respective model repositories for the most up-to-date instructions.