I got tired of manually downloading models from HuggingFace with git-lfs, authenticating every time, waiting for that to finish, then FINALLY typing the model source and destination into a Python script by hand, and then waiting for THAT to finish inside a Docker container before moving on to the next one and starting the process all over again.
As a result, I wrote these two scripts to download models from HuggingFace, convert them, pull all .json files from the original repo, and then insert a block of text into the model card (see below) before finally uploading everything to HuggingFace.
Original repo is here: RKLLM
Initial testing was done using Pelochus' EZ RKNN-LLM container found in this repo: ezrknn-llm
For more information and useful links, please check out the RockchipNPU subreddit
Conversion tests done on consumer grade hardware:
- AMD Ryzen 3 1200 Quad-Core Processor
- GA-AX370-Gaming K5 Motherboard
- 2 x G.SKILL Ripjaws V Series DDR4 RAM 32GB, 64GB total
- NVIDIA GeForce GTX 780 (not used in this experiment)
Conversion was also performed on an Intel Xeon X5650, which uses DDR3 RAM and does not support AVX.
There are two scripts in here: an interactive and a non-interactive version. I have also included version 1.1.0 of RKLLM-Toolkit, since it contains all of the dependencies required to run these scripts (except for inquirer, which you can install with pip).
To get started, clone this repository:
git clone https://github.com/c0zaut/ez-er-rkllm.git
To do a one-shot conversion in an interactive shell:
cd docker-interactive
docker build -t $(whoami)/rkllm-interactive . && docker run -it --rm $(whoami)/rkllm-interactive
To convert a batch of models across multiple quant types, with or without optimization mode, and over a range of hybrid quant ratios, edit non_interactive.py and set the model_ids, qtypes, optimizations, and hybrid_rates variables.
For example, to convert all three versions of chatglm3-6b (8K, 32K, and 128K context windows) with and without optimization, using w8a8 and w8a8_g128 quantizations with hybrid ratios of 0.0, 0.5, and 1.0:
model_ids = {"THUDM/chatglm3-6b", "THUDM/chatglm3-6b-32k", "THUDM/chatglm3-6b-128k"}
qtypes = {"w8a8", "w8a8_g128"}
hybrid_rates = {"0.0", "0.5", "1.0"}
optimizations = {"0", "1"}
Save your changes, and then run the following from the root of the repo directory:
cd docker-noninteractive
docker build -t $(whoami)/rkllm-noninteractive . && docker run -it --rm $(whoami)/rkllm-noninteractive
This version of the script performs one large upload after all conversions are done.
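If you're curious what that final bulk upload roughly looks like, huggingface_hub's `upload_folder` can push a whole export directory in a single call. This is just a sketch with made-up paths and repo names, not necessarily how the script does it internally:

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes you've already authenticated, e.g. with `huggingface-cli login`

# Placeholder folder and repo names for illustration; the script builds these itself.
api.upload_folder(
    folder_path="./exports/chatglm3-6b-rk3588-w8a8",
    repo_id="your-username/chatglm3-6b-rk3588-w8a8",
    repo_type="model",
)
```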
Of course, feel free to adjust the model card template in the HubHelpers class, which is present in both scripts:
def build_card(self, export_path):
    """
    Inserts text into the README.md file of the original model, after the model data.
    Using the HF built-in functions kept omitting the card's model data,
    so gonna do this old school.
    """
    self.model_name = self.model_id.split("/", 1)[1]
    self.card_in = ModelCard.load(self.model_id)
    self.card_out = export_path + "README.md"
    self.template = f'---\n' + \
                    f'{self.card_in.data.to_yaml()}\n' + \
                    f'---\n' + \
                    f'# {self.model_name}-{self.platform.upper()}-{self.rkllm_version}\n\n' + \
                    f'This version of {self.model_name} has been converted to run on the {self.platform.upper()} NPU using {self.qtype} quantization.\n\n' + \
                    f'This model has been optimized with the following LoRA: {self.lora_id}\n\n' + \
                    f'Compatible with RKLLM version: {self.rkllm_version}\n\n' + \
                    f'### Useful links:\n' + \
                    f'[Official RKLLM GitHub](https://github.com/airockchip/rknn-llm) \n\n' + \
                    f'[RockchipNPU Reddit](https://reddit.com/r/RockchipNPU) \n\n' + \
                    f'[EZRKNN-LLM](https://github.com/Pelochus/ezrknn-llm/) \n\n' + \
                    f'Pretty much anything by these folks: [marty1885](https://github.com/marty1885) and [happyme531](https://huggingface.co/happyme531) \n\n' + \
                    f'# Original Model Card for base model, {self.model_name}, below:\n\n' + \
                    f'{self.card_in.text}'
    try:
        ModelCard.save(self.template, self.card_out)
    except RuntimeError as e:
        print(f"Runtime Error: {e}")
    except RuntimeWarning as w:
        print(f"Runtime Warning: {w}")
    else:
        print(f"Model card successfully exported to {self.card_out}!")
        c = open(self.card_out, 'r')
        print(c.read())
        c.close()
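For reference, a hypothetical call into that method might look like the snippet below. The attribute names match what build_card reads, but the way HubHelpers is actually constructed in the scripts may differ, so treat this purely as an illustration:

```python
# Hypothetical setup for illustration only; the real HubHelpers constructor
# and attribute wiring in the scripts may look different.
helper = HubHelpers()
helper.model_id = "THUDM/chatglm3-6b"
helper.platform = "rk3588"
helper.rkllm_version = "1.1.1"
helper.qtype = "w8a8"
helper.lora_id = None

# export_path is joined to "README.md" by plain string concatenation,
# so it needs a trailing slash to land inside the export directory.
helper.build_card("./chatglm3-6b-rk3588-1.1.1/")
```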
Model conversion uses anywhere from 2-4x the size of the original model in memory, so plan on having at least that much RAM (or RAM plus swap) available. I compensated for this with swap files of varying sizes. Since I just leave the process running overnight (I have low upload speeds), the performance hit from using swap files instead of swap partitions doesn't bother me much. If performance is critical, I would recommend 192GB - 512GB of DDR4 RAM and plenty of cores to handle especially large models. For evaluation and chat simulation, a CPU with AVX* support is also recommended.
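If you want a rough idea of how much memory a particular conversion will need before kicking it off, you can total up the original repo's weight files with huggingface_hub and apply the 2-4x rule of thumb. This is just a back-of-the-envelope helper, not part of the scripts:

```python
from huggingface_hub import HfApi

def estimate_memory_gb(repo_id: str) -> None:
    """Rough memory estimate for converting repo_id, using the 2-4x rule of thumb."""
    info = HfApi().model_info(repo_id, files_metadata=True)
    # Sum the sizes of the weight files in the original repo.
    weight_bytes = sum(
        s.size or 0
        for s in info.siblings
        if s.rfilename.endswith((".safetensors", ".bin"))
    )
    size_gb = weight_bytes / 1024**3
    print(f"{repo_id}: ~{size_gb:.1f} GB of weights, "
          f"plan for roughly {2 * size_gb:.0f}-{4 * size_gb:.0f} GB of RAM + swap")

estimate_memory_gb("THUDM/chatglm3-6b")
```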
Models converted using the Python 3.10 and RKLLM v1.1.1 packages do appear to be backwards compatible with the v1.1.0 runtime! So far, only Llama 3.2 3B Instruct has been tested. Check out u/DimensionUnlucky4046's pipeline in this Reddit thread.
To do:
- Test with LoRA
- Test with full RAG pipeline
- Test with multimodal models