
EZ-ER-RKLLM-Toolkit

Backstory

I got tired of manually downloading models from HuggingFace with git-lfs, authenticating every time, waiting for that to finish, FINALLY typing the model source and destination into a Python script by hand, and then waiting for THAT to finish inside a Docker container before moving on to the next model and starting the process all over again.

As a result, I wrote these two scripts to download models from HuggingFace, convert them, pull all .json files from the original repo, and then insert a block of text into the model card (see below) before finally uploading everything to HuggingFace.
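
If you are curious what that pipeline roughly looks like in code, here is a minimal sketch. It is not copied from the scripts: the RKLLM-Toolkit method names and build parameters are a best guess and vary between toolkit versions, and the model ID and output paths are just examples.

    # Rough sketch of the conversion flow, NOT the actual script.
    # The rkllm.api calls below are assumptions based on RKLLM-Toolkit examples;
    # exact method names and parameters vary between toolkit versions.
    from huggingface_hub import snapshot_download
    from rkllm.api import RKLLM  # provided by the bundled RKLLM-Toolkit wheel

    model_id = "THUDM/chatglm3-6b"            # example source repo
    local_dir = snapshot_download(model_id)   # 1. download the original model

    llm = RKLLM()
    llm.load_huggingface(model=local_dir)     # 2. load it into the toolkit
    llm.build(do_quantization=True,
              quantized_dtype="w8a8",
              optimization_level=1,
              target_platform="rk3588")       # 3. quantize/optimize for the NPU
    llm.export_rkllm("./export/chatglm3-6b.rkllm")  # 4. write the .rkllm file

    # 5. copy the original *.json configs plus an annotated README into the
    #    export folder, then upload everything back to HuggingFace (see below)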

Original repo is here: [RKLLM](https://github.com/airockchip/rknn-llm)

Initial testing was done using Pelochus' EZ RKNN-LLM container, found in this repo: [ezrknn-llm](https://github.com/Pelochus/ezrknn-llm/)

For more information and useful links, please check out the [RockchipNPU subreddit](https://reddit.com/r/RockchipNPU).

Conversion tests were done on consumer-grade hardware:

  • AMD Ryzen 3 1200 Quad-Core Processor
  • GA-AX370-Gaming K5 Motherboard
  • 2 x 32GB G.SKILL Ripjaws V Series DDR4 RAM (64GB total)
  • NVIDIA GeForce GTX 780 (not used in this experiment)

Conversion was also performed on an Intel Xeon X5650, a DDR3-era CPU that does not support AVX.

How to use

There are two scripts in here - an interactive and a non-interactive version. I have also included version 1.1.0 of RKLLM-Toolkit, since it contains all of the dependencies required to run these scripts (except for inquirer).
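
The interactive version leans on inquirer for its prompts. If you want a feel for how that kind of prompt flow works, here is a small illustration - the question names, choices, and defaults below are made up for the example and are not taken from the actual script.

    # Illustrative inquirer prompt flow; field names and choices are assumptions.
    import inquirer

    questions = [
        inquirer.Text("model_id", message="HuggingFace model ID (e.g. THUDM/chatglm3-6b)"),
        inquirer.List("qtype", message="Quantization type", choices=["w8a8", "w8a8_g128"]),
        inquirer.List("optimization", message="Optimization", choices=["0", "1"]),
        inquirer.Text("hybrid_rate", message="Hybrid quant ratio (0.0 - 1.0)", default="0.0"),
    ]
    answers = inquirer.prompt(questions)
    print(answers)  # e.g. {'model_id': 'THUDM/chatglm3-6b', 'qtype': 'w8a8', ...}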

To get started, clone this repository:

git clone https://github.com/c0zaut/ez-er-rkllm-toolkit.git

To do a one-shot conversion in an interactive shell:

cd docker-interactive
docker build -t $(whoami)/rkllm-interactive . && docker run -it --rm $(whoami)/rkllm-interactive

To convert a batch of models across several quant types, with or without optimization mode, and a range of hybrid quant ratios, edit non_interactive.py and set the model, qtype, optimization, and hybrid quant ratio lists.

For example, to convert all three versions of chatglm3-6b (8K, 32K, and 128K context windows) with and without optimization, using w8a8 and w8a8_g128 quantizations with hybrid ratios of 0.0, 0.5, and 1.0:

    model_ids = {"THUDM/chatglm3-6b", "THUDM/chatglm3-6b-32k", "THUDM/chatglm3-6b-128k"}
    qtypes = {"w8a8", "w8a8_g128"}
    hybrid_rates = {"0.0", "0.5", "1.0"}
    optimizations = {"0", "1"}
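
That example works out to 3 models x 2 qtypes x 3 ratios x 2 optimization settings = 36 conversions. The batch run presumably just walks every combination of those lists, along the lines of the loop below, where convert() is a hypothetical stand-in for whatever non_interactive.py actually calls per conversion:

    # Illustrative only: iterate over every combination of the settings above.
    # convert() is a hypothetical placeholder, not a function from the repo.
    from itertools import product

    for model_id, qtype, hybrid_rate, optimization in product(
            model_ids, qtypes, hybrid_rates, optimizations):
        print(f"Converting {model_id}: {qtype}, ratio={hybrid_rate}, opt={optimization}")
        # convert(model_id, qtype, hybrid_rate, optimization)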

Save your changes, and then run the following from the root of the repo directory:

cd docker-noninteractive
docker build -t $(whoami)/rkllm-noninteractive . && docker run -it --rm $(whoami)/rkllm-noninteractive

This version of the script performs one large upload after all conversions are done.
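
For reference, huggingface_hub can push a whole export directory in a single call, which is roughly what that final upload step amounts to. The folder path and repo name below are placeholders, not the script's actual values:

    # One-shot upload of an export directory (paths/repo names are placeholders).
    from huggingface_hub import HfApi

    api = HfApi()  # assumes you are already logged in, e.g. via `huggingface-cli login`
    api.upload_folder(
        folder_path="./export/chatglm3-6b-rk3588",
        repo_id="your-username/chatglm3-6b-rk3588-rkllm",
        repo_type="model",
    )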

Changing the model card template

Of course, feel free to adjust the model card template in the HubHelpers class, which is available in both scripts:

    def build_card(self, export_path):
        """
        Inserts text into the README.md file of the original model, after the model data. 
        Using the HF built-in functions kept omitting the card's model data,
        so gonna do this old school.
        """
        self.model_name = self.model_id.split("/", 1)[1]
        self.card_in = ModelCard.load(self.model_id)
        self.card_out = export_path + "README.md"
        self.template = f'---\n' + \
            f'{self.card_in.data.to_yaml()}\n' + \
            f'---\n' + \
            f'# {self.model_name}-{self.platform.upper()}-{self.rkllm_version}\n\n' + \
            f'This version of {self.model_name} has been converted to run on the {self.platform.upper()} NPU using {self.qtype} quantization.\n\n' + \
            f'This model has been optimized with the following LoRA: {self.lora_id}\n\n' + \
            f'Compatible with RKLLM version: {self.rkllm_version}\n\n' + \
            f'### Useful links:\n' + \
            f'[Official RKLLM GitHub](https://github.com/airockchip/rknn-llm) \n\n' + \
            f'[RockchipNPU Reddit](https://reddit.com/r/RockchipNPU) \n\n' + \
            f'[EZRKNN-LLM](https://github.com/Pelochus/ezrknn-llm/) \n\n' + \
            f'Pretty much anything by these folks: [marty1885](https://github.com/marty1885) and [happyme531](https://huggingface.co/happyme531) \n\n' + \
            f'# Original Model Card for base model, {self.model_name}, below:\n\n' + \
            f'{self.card_in.text}'
        try:
            ModelCard.save(self.template, self.card_out)
        except RuntimeError as e:
            print(f"Runtime Error: {e}")
        except RuntimeWarning as w:
            print(f"Runtime Warning: {w}")
        else:
            print(f"Model card successfully exported to {self.card_out}!")
            # Echo the finished card so the log shows exactly what will be uploaded
            with open(self.card_out, 'r') as c:
                print(c.read())
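
After editing the template, a quick sanity check is to load the exported README back with ModelCard and make sure the YAML front matter still parses. The path below is a placeholder:

    # Verify the exported card still parses (path is a placeholder).
    from huggingface_hub import ModelCard

    card = ModelCard.load("./export/chatglm3-6b/README.md")
    print(card.data.to_yaml())  # the metadata block between the --- markers
    print(card.text[:200])      # start of the card body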

Utilization

Model conversion uses anywhere from 2-4x the size of the original model in memory, so you need at least that much RAM (or RAM plus swap) available. I compensated for this with swap files of varying sizes. Since I just leave the process running overnight (I have low upload speeds), the performance hit from using swap files instead of swap partitions doesn't bother me much. If performance is critical, I would recommend 192GB - 512GB of DDR4 RAM and a lot of cores to handle especially large models. For evaluation and chat simulation, a CPU with AVX (or newer) support is also recommended.
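
As a rough planning aid, you can turn that 2-4x observation into a memory estimate by summing the sizes of the original model's weight files. The multipliers below just restate the rule of thumb above, and the repo ID is only an example:

    # Back-of-the-envelope RAM + swap estimate based on the 2-4x rule of thumb.
    from huggingface_hub import HfApi

    def estimate_peak_gb(repo_id: str) -> tuple[float, float]:
        info = HfApi().model_info(repo_id, files_metadata=True)
        weight_bytes = sum(
            (f.size or 0)
            for f in info.siblings
            if f.rfilename.endswith((".safetensors", ".bin"))
        )
        weights_gb = weight_bytes / 1024**3
        return 2 * weights_gb, 4 * weights_gb  # optimistic / pessimistic

    low, high = estimate_peak_gb("THUDM/chatglm3-6b")
    print(f"Plan for roughly {low:.0f}-{high:.0f} GB of RAM + swap")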

Compatibility and Testing

Models converted using the Python 3.10 and RKLLM v1.1.1 packages do appear to be backwards compatible with the v1.1.0 runtime! So far, only Llama 3.2 3B Instruct has been tested. Check out u/DimensionUnlucky4046's pipeline in this Reddit thread.

To do

  • Test with LoRA
  • Test with full RAG pipeline
  • Test with multimodal models

About

Automated script to convert Huggingface and GGUF models to rkllm format for running on Rockchip NPU