diff --git a/README.md b/README.md
index 41a22d0..e4b1a8e 100644
--- a/README.md
+++ b/README.md
@@ -86,11 +86,11 @@ python3 examples/chat.py
 Chat demo on GPU (A100, LLaMa3.1 8B)
 
-
+
 
 Chat demo on Apple M4 (Phi3 3.8B)
 
-
+
 
 #### Option 2: Chat with ChatUI
 
 Install ChatUI and its dependencies:
@@ -234,7 +234,7 @@ asyncio.run(benchmark())
 Candle-vllm now supports GPTQ (Marlin kernel), you may supply the `quant` (marlin) parameter if you have `Marlin` format quantized weights, such as:
 
 ```
-cargo run --release -- --port 2000 --dtype f16 --weight-path /home/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4-Marlin/ llama3 --quant marlin --temperature 0. --penalty 1.
+cargo run --release --features cuda -- --port 2000 --dtype f16 --weight-path /home/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4-Marlin/ llama3 --quant marlin --temperature 0. --penalty 1.
 ```
 
 You may also use `AutoGPTQ` to transform a model to marlin format by loading the (quantized) model, supplying the `use_marlin=True` in `AutoGPTQ` and resaving it with "save_pretrained".
@@ -269,10 +269,12 @@ Options for `quant` parameters: ["q4_0", "q4_1", "q5_0", "q5_1", "q8_0", "q2k",
 ## Usage Help
 For general configuration help, run `cargo run -- --help`.
 
-For model-specific help, run `cargo run --features -- --port 2000 --help`
+For model-specific help, run `cargo run --features <PLATFORM> -- --port 2000 --help`
 
 For local model weights, run `cargo run --release --features cuda -- --port 2000 --weight-path /home/llama2_7b/ llama`, change the path when needed.
 
+`MODE`=["debug", "release"]
+
 `PLATFORM`=["cuda", "metal"]
 
 `MODEL_TYPE` = ["llama", "llama3", "mistral", "phi2", "phi3", "qwen2", "gemma", "yi", "stable-lm"]
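The AutoGPTQ-to-Marlin conversion mentioned in the second hunk can be sketched as below. This is a rough illustration, not part of the patch: the source path is a placeholder, and it assumes an AutoGPTQ build with Marlin kernel support; only `use_marlin=True` and `save_pretrained` come from the README text itself.

```python
# Rough sketch: repack an existing GPTQ checkpoint into Marlin format with AutoGPTQ.
# Paths are placeholders; adjust them to your local checkpoints.
from auto_gptq import AutoGPTQForCausalLM

src = "/home/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4/"         # existing GPTQ weights (placeholder)
dst = "/home/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4-Marlin/"  # Marlin-format output for --weight-path

# Loading the quantized model with use_marlin=True repacks the weights for the Marlin kernel.
model = AutoGPTQForCausalLM.from_quantized(src, device="cuda:0", use_marlin=True)

# Re-save the repacked weights so candle-vllm can serve them with `--quant marlin`.
model.save_pretrained(dst)
```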