Quantization failed #1237
Also, the JSON files in the example are no longer supported by the Intel Neural Compressor: as of version 3.0, it reports that this key-value pair is invalid.
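If INC 3.0 rejects a key in the shipped JSON, one quick way to isolate the offending entry is to load the file with the standard library and diff its keys against the set your INC version actually documents. The allow-list below is a placeholder assumption for illustration, not INC's real schema:

```python
import json

# Hypothetical allow-list -- replace with the keys your INC 3.0 docs accept.
ALLOWED_KEYS = {"method", "mode", "observer", "scale_method", "dump_stats_path"}

def find_unknown_keys(config_text: str) -> list[str]:
    """Return top-level config keys not present in the allow-list."""
    config = json.loads(config_text)
    return sorted(k for k in config if k not in ALLOWED_KEYS)

# Example: a config containing a key the allow-list does not know about.
sample = '{"method": "HOOKS", "mode": "QUANTIZE", "legacy_key": 1}'
print(find_unknown_keys(sample))
```

Running this against the real `maxabs_quant.json` (with the allow-list taken from the INC 3.0 docs) would pinpoint which pair the validator is complaining about.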
root@8fb421541c5d:~/optimum-habana/examples/text-generation# QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py \
tokenizer_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████| 50.5k/50.5k [00:00<00:00, 891kB/s]

SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false ENABLE_EXPERIMENTAL_FLAGS=1 python run_lm_eval.py -o llama_405b_load_uint4_model.txt --model_name_or_path hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4 --use_hpu_graphs --use_kv_cache --trim_logits --batch_size 1 --bf16 --attn_softmax_bf16 --bucket_size=128 --bucket_internal
Traceback (most recent call last):

SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false ENABLE_EXPERIMENTAL_FLAGS=1 python run_lm_eval.py -o acc_load_uint4_model.txt --model_name_or_path hugging-quants/Meta-Llama-3.1-405B-Instruct-GPTQ-INT4 --use_hpu_graphs --use_kv_cache --trim_logits --batch_size 1 --bf16 --attn_softmax_bf16 --bucket_size=128 --bucket_internal --load_quantized_model
System Info
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
QUANT_CONFIG=./quantization_config/maxabs_quant.json TQDM_DISABLE=1 python run_generation.py --model_name_or_path meta-llama/Meta-Llama-3.1-70B-Instruct --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --limit_hpu_graphs --bucket_size=128 --bucket_internal --max_new_tokens 2048 --max_input_tokens 2048 --bf16 --batch_size 1 --disk_offload --use_flash_attention --flash_attention_recompute
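The reproduction above depends on `QUANT_CONFIG` being visible to the launched process. A minimal sketch of building the same environment and argument list programmatically (script name and flags copied from the command above; `subprocess.run` itself is commented out since the script and model are not available in this sketch):

```python
import os
import shlex

# Reproduce the environment the shell one-liner sets up.
env = dict(os.environ, QUANT_CONFIG="./quantization_config/maxabs_quant.json")

# A subset of the flags from the reproduction command, split shell-style.
cmd = shlex.split(
    "python run_generation.py "
    "--model_name_or_path meta-llama/Meta-Llama-3.1-70B-Instruct "
    "--bf16 --batch_size 1 --max_new_tokens 2048"
)

print(env["QUANT_CONFIG"], cmd[:2])
# subprocess.run(cmd, env=env)  # would launch the actual reproduction
```

Checking `env["QUANT_CONFIG"]` this way is a quick sanity check that the config path the script will see is the one you intended, before debugging the quantization failure itself.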
Expected behavior
To be able to use quantized Llama 3.1 70B models.