I got tired of manually downloading models from HuggingFace with git-lfs, authenticating every time, waiting for that to finish, then FINALLY typing the model source and destination into a Python script by hand, and then waiting for THAT to finish inside a Docker container before moving on to the next one and starting the process all over again.
As a result, I wrote these two scripts to download models from HuggingFace, convert them, pull all .json files from the original repo, and then insert a block of text into the model card (see below) before finally uploading everything to HuggingFace.
Original repo is here: RKLLM
Initial testing was done using Pelochus' EZ RKNN-LLM container found in this repo: ezrknn-llm
For more information and useful links, please check out the RockchipNPU subreddit
Conversion tests done on consumer grade hardware:
- AMD Ryzen 3 1200 Quad-Core Processor
- GA-AX370-Gaming K5 Motherboard
- 2 x G.SKILL Ripjaws V Series DDR4 RAM 32GB, 64GB total
- NVIDIA GeForce GTX 780 (not used in this experiment)
Conversion was also performed on an Intel Xeon X5650, which uses DDR3 RAM and does not support AVX.
There are two scripts in here: an interactive and a non-interactive version. I have also included version 1.1.0 of RKLLM-Toolkit, since it contains all of the dependencies required to run these scripts (except for inquirer, which you can install with pip).
To get started, clone this repository:
git clone https://github.com/c0zaut/ez-er-rkllm.git
To do a one-shot conversion in an interactive shell:
cd docker-interactive
docker build -t $(whoami)/rkllm-interactive . && docker run -it --rm $(whoami)/rkllm-interactive
To convert a batch of models across multiple quant types, with or without optimization mode, and over a range of hybrid quant ratios, edit non_interactive.py and set the model_ids, qtypes, optimizations, and hybrid_rates variables.
For example, to convert all three versions of chatglm3-6b (8K, 32K, and 128K context windows) with and without optimization, using w8a8 and w8a8_g128 quantizations with hybrid ratios of 0.0, 0.5, and 1.0:
model_ids = {"THUDM/chatglm3-6b", "THUDM/chatglm3-6b-32k", "THUDM/chatglm3-6b-128k"}
qtypes = {"w8a8", "w8a8_g128"}
hybrid_rates = {"0.0", "0.5", "1.0"}
optimizations = {"0", "1"}
Save your changes, and then run the following from the root of the repo directory:
cd docker-noninteractive
docker build -t $(whoami)/rkllm-noninteractive . && docker run -it --rm $(whoami)/rkllm-noninteractive
This version of the script performs one large upload after all conversions are done.
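If you're curious what that final bulk upload roughly looks like, huggingface_hub's `upload_folder` can push a whole export directory in a single call. This is just a sketch with made-up paths and repo names, not necessarily how the script does it internally:

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes you've already authenticated, e.g. with `huggingface-cli login`

# Placeholder folder and repo names for illustration; the script builds these itself.
api.upload_folder(
    folder_path="./exports/chatglm3-6b-rk3588-w8a8",
    repo_id="your-username/chatglm3-6b-rk3588-w8a8",
    repo_type="model",
)
```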
Of course, feel free to adjust the model card template in the HubHelpers class, which is present in both scripts:
def build_card(self, export_path):
    """
    Inserts text into the README.md file of the original model, after the model data.
    Using the HF built-in functions kept omitting the card's model data,
    so gonna do this old school.
    """
    self.model_name = self.model_id.split("/", 1)[1]
    self.card_in = ModelCard.load(self.model_id)
    self.card_out = export_path + "README.md"
    self.template = f'---\n' + \
                    f'{self.card_in.data.to_yaml()}\n' + \
                    f'---\n' + \
                    f'# {self.model_name}-{self.platform.upper()}-{self.rkllm_version}\n\n' + \
                    f'This version of {self.model_name} has been converted to run on the {self.platform.upper()} NPU using {self.qtype} quantization.\n\n' + \
                    f'This model has been optimized with the following LoRA: {self.lora_id}\n\n' + \
                    f'Compatible with RKLLM version: {self.rkllm_version}\n\n' + \
                    f'### Useful links:\n' + \
                    f'[Official RKLLM GitHub](https://github.com/airockchip/rknn-llm) \n\n' + \
                    f'[RockchipNPU Reddit](https://reddit.com/r/RockchipNPU) \n\n' + \
                    f'[EZRKNN-LLM](https://github.com/Pelochus/ezrknn-llm/) \n\n' + \
                    f'Pretty much anything by these folks: [marty1885](https://github.com/marty1885) and [happyme531](https://huggingface.co/happyme531) \n\n' + \
                    f'# Original Model Card for base model, {self.model_name}, below:\n\n' + \
                    f'{self.card_in.text}'
    try:
        ModelCard.save(self.template, self.card_out)
    except RuntimeError as e:
        print(f"Runtime Error: {e}")
    except RuntimeWarning as w:
        print(f"Runtime Warning: {w}")
    else:
        print(f"Model card successfully exported to {self.card_out}!")
        c = open(self.card_out, 'r')
        print(c.read())
        c.close()
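For reference, a hypothetical call into that method might look like the snippet below. The attribute names match what build_card reads, but the way HubHelpers is actually constructed in the scripts may differ, so treat this purely as an illustration:

```python
# Hypothetical setup for illustration only; the real HubHelpers constructor
# and attribute wiring in the scripts may look different.
helper = HubHelpers()
helper.model_id = "THUDM/chatglm3-6b"
helper.platform = "rk3588"
helper.rkllm_version = "1.1.1"
helper.qtype = "w8a8"
helper.lora_id = None

# export_path is joined to "README.md" by plain string concatenation,
# so it needs a trailing slash to land inside the export directory.
helper.build_card("./chatglm3-6b-rk3588-1.1.1/")
```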
Model conversion uses anywhere from 2-4x the size of the original model in memory, so plan on having at least that much RAM (or RAM plus swap) available. I compensated for this with swap files of varying sizes. Since I just leave the process running overnight (I have low upload speeds), the performance hit from using swap files instead of swap partitions doesn't bother me much. If performance is critical, I would recommend 192GB - 512GB of DDR4 RAM and plenty of cores to handle especially large models. For evaluation and chat simulation, a CPU with AVX* support is also recommended.
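If you want a rough idea of how much memory a particular conversion will need before kicking it off, you can total up the original repo's weight files with huggingface_hub and apply the 2-4x rule of thumb. This is just a back-of-the-envelope helper, not part of the scripts:

```python
from huggingface_hub import HfApi

def estimate_memory_gb(repo_id: str) -> None:
    """Rough memory estimate for converting repo_id, using the 2-4x rule of thumb."""
    info = HfApi().model_info(repo_id, files_metadata=True)
    # Sum the sizes of the weight files in the original repo.
    weight_bytes = sum(
        s.size or 0
        for s in info.siblings
        if s.rfilename.endswith((".safetensors", ".bin"))
    )
    size_gb = weight_bytes / 1024**3
    print(f"{repo_id}: ~{size_gb:.1f} GB of weights, "
          f"plan for roughly {2 * size_gb:.0f}-{4 * size_gb:.0f} GB of RAM + swap")

estimate_memory_gb("THUDM/chatglm3-6b")
```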
Models converted using the Python 3.10 and RKLLM v1.1.1 packages do appear to be backwards compatible with the v1.1.0 runtime! So far, only Llama 3.2 3B Instruct has been tested. Check out u/DimensionUnlucky4046's pipeline in this Reddit thread.
To do:
- Test with LoRA
- Test with full RAG pipeline
- Test with multimodal models