This repo is a learning playground for fine-tuning LLMs (currently Llama 3.2 1B / 3B Instruct) and running them locally (with Transformers.js v3 or Web-LLM).
- Downloaded full Google Chat transcript between my wife and me using Google Takeout
- Processed the conversation data (roughly sketched below) and created train.yaml pointing to it
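A minimal sketch of the processing step, assuming Takeout's Google Chat export layout (a messages.json with creator.name and text fields) and an OpenAI-style chat format for torchtune; the file paths, field names, window size, and MY_NAME placeholder are all assumptions, not the exact script used:

```js
// Hypothetical sketch: flatten a Google Chat Takeout export into JSON-lines
// chat examples for fine-tuning. Field names below are assumptions about the
// Takeout format, not guaranteed.
import { readFileSync, writeFileSync } from "node:fs";

const MY_NAME = "My Name"; // placeholder: the sender whose turns become "assistant"

const raw = JSON.parse(
  readFileSync("Takeout/Google Chat/Groups/DM/messages.json", "utf8")
);

// Map each message to a chat turn; my own messages become "assistant" turns.
const turns = raw.messages
  .filter((m) => m.text)
  .map((m) => ({
    role: m.creator.name === MY_NAME ? "assistant" : "user",
    content: m.text,
  }));

// Write one training example per fixed-size window of turns (window size is arbitrary here).
const lines = [];
for (let i = 0; i < turns.length; i += 20) {
  lines.push(JSON.stringify({ messages: turns.slice(i, i + 20) }));
}
writeFileSync("data/chat_train.jsonl", lines.join("\n"));
```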
- Used torchtune's full_finetune_single_device recipe with the train.yaml config to fine-tune Llama 3.2 1B Instruct on my data
tune run full_finetune_single_device --config train.yaml
- Renamed the output hf_model_0001_2.pt file to pytorch_model.bin so Optimum can find it
- Used Transformers.js v3's convert.py script (after copying in the quantize.py script and removing a duplicate enum to avoid an import issue) to convert the model to ONNX files
python transformers/scripts/convert.py --model_id "Meta-Llama3.2-1B-Instruct-FT" --task "text-generation"
- Used Transformers.js v3's quantize.py script to quantize the ONNX files in all modes (could also pass specific modes, for example --modes "q4fp16" "int8")
python transformers/scripts/quantize.py --input_folder "models/elisheldon/Meta-Llama3.2-1B-Instruct-FT" --output_folder "models/elisheldon/Meta-Llama3.2-1B-Instruct-FT/onnx"
- Published the model to https://huggingface.co/elisheldon/Meta-Llama3.2-1B-Instruct-FT, which is consumed by this repo's web code (see the sketch below)
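A minimal sketch of how the published model can be loaded with Transformers.js v3; the device and dtype values shown here are assumptions and may differ from what this repo actually uses:

```js
// Hypothetical sketch: load the fine-tuned model with Transformers.js v3.
// device can be "wasm" or "webgpu"; dtype "q4f16" is assumed to correspond
// to the q4fp16-style quantization produced above.
import { pipeline } from "@huggingface/transformers";

const generator = await pipeline(
  "text-generation",
  "elisheldon/Meta-Llama3.2-1B-Instruct-FT",
  {
    device: "wasm", // "webgpu" is also possible, but see the WebGPU issues below
    dtype: "q4f16",
  }
);

const messages = [{ role: "user", content: "Hey, how was your day?" }];
const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text.at(-1).content);
```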
- WebGPU issues. The current WebGPU implementation is mostly broken. Only the q4fp16-quantized model runs on WebGPU without a memory error or garbage output (at least on an M2 Pro), and even then the response quality is poor compared with running the exact same model on WASM.
- WASM speed. The current WASM implementation runs very slowly, although it does produce reasonable responses.
- Small base model. Llama 3.2 1B Instruct was chosen to stay under the 2 GB protobuf limit in the ONNX stack, but fine-tuning results are much better with a fine-tuned Llama 3.2 3B Instruct run as GGUF outside of the browser context.
- Started with the full finetune of Llama 3.2 1B Instruct from above, prior to any ONNX conversion, optimization, or quantization
- Followed the steps here to convert the weights to MLC format, then uploaded them to HF
mlc_llm convert_weight Meta-Llama3.2-1B-Instruct-FT/ --quantization q4f16_1 -o Meta-Llama3.2-1B-Instruct-FT-q4f16_1-MLC
mlc_llm gen_config Meta-Llama3.2-1B-Instruct-FT/ --quantization q4f16_1 --conv-template llama-3_1 -o Meta-Llama3.2-1B-Instruct-FT-q4f16_1-MLC
- Created a basic chat app using WebLLM and deployed it to https://eli-chat.pages.dev/ (see the sketch below)
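A minimal sketch of loading the MLC-converted weights with WebLLM; the HF repo URL and especially the model_lib value are assumptions (a custom model entry needs a compatible prebuilt or self-compiled model library wasm), so this is not necessarily the exact setup behind the deployed app:

```js
// Hypothetical sketch: serve the MLC-converted weights through WebLLM.
// model points at an assumed HF repo with the q4f16_1 weights; model_lib must be
// a Llama 3.2 1B q4f16_1 model library .wasm (placeholder below).
import * as webllm from "@mlc-ai/web-llm";

const appConfig = {
  model_list: [
    {
      model: "https://huggingface.co/elisheldon/Meta-Llama3.2-1B-Instruct-FT-q4f16_1-MLC",
      model_id: "Meta-Llama3.2-1B-Instruct-FT-q4f16_1-MLC",
      model_lib: "<URL to a Llama 3.2 1B q4f16_1 model library .wasm>", // placeholder, not a real URL
    },
  ],
};

const engine = await webllm.CreateMLCEngine("Meta-Llama3.2-1B-Instruct-FT-q4f16_1-MLC", {
  appConfig,
  initProgressCallback: (report) => console.log(report.text),
});

const reply = await engine.chat.completions.create({
  messages: [{ role: "user", content: "Hey, how was your day?" }],
});
console.log(reply.choices[0].message.content);
```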
- Small base model. This issue persists: repeating the above flow with the Llama 3.2 3B Instruct model caused memory issues on an M2 Pro.