During calibration of Llama 3.1 70B with tensor parallelism = 2 (Gaudi2), step 4 (quantize scales) failed with an out-of-memory error.
Adding a line to the scripts to enable model offloading to CPU mitigates this issue.
After reducing the batch size to 16 and enabling CPU offloading, the calibration completed successfully.
Steps to reproduce:
Thank you for your input! :) We agree that the proposed batch size is too high for this model with TP=2. In the vLLM logs we can even find an entry which states:
INFO 01-20 18:42:43 executor_base.py:101] Maximum concurrency for 2048 tokens per request: 72.31x
so the maximum batch size in this case is about 72.
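For reference, here is a minimal sketch of where that figure comes from, assuming vLLM derives it as the number of KV-cache blocks times the block size divided by the maximum model length; the block count below is hypothetical, chosen only to roughly reproduce the logged value:

```python
# Sketch: how the "Maximum concurrency ... 72.31x" figure bounds the batch size.
# Assumption: vLLM reports num_kv_cache_blocks * block_size / max_model_len,
# i.e. how many full-length (2048-token) requests fit in the KV cache at once.

def max_concurrency(num_kv_cache_blocks: int, block_size: int, max_model_len: int) -> float:
    return num_kv_cache_blocks * block_size / max_model_len

# Hypothetical block count chosen to roughly match the log entry above:
print(max_concurrency(9255, 16, 2048))  # ~72.3 -> calibration batch size should stay <= ~72
```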
Your suggestion to load the model weights on CPU is also a great idea. It reduces memory pressure on the HPU by keeping the model's BF16 weights in RAM and storing only the FP8 weights in HPU memory.
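A rough back-of-envelope estimate of the saving (weights only, ignoring KV cache, activations, and calibration buffers; the only inputs are the parameter count and the dtype sizes):

```python
# Rough weight-memory estimate for Llama 3.1 70B under TP=2 (weights only).
params = 70e9
tp = 2
bf16_per_hpu = params * 2 / tp / 1e9   # 2 bytes/param -> ~70 GB per card
fp8_per_hpu  = params * 1 / tp / 1e9   # 1 byte/param  -> ~35 GB per card
print(f"BF16 shard per HPU: ~{bf16_per_hpu:.0f} GB, FP8 shard per HPU: ~{fp8_per_hpu:.0f} GB")
# Keeping the BF16 copy in host RAM and materializing only FP8 shards on the HPUs
# roughly halves the per-card weight footprint during calibration.
```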
A fix to address it:
Batch size adjustment:
./calibrate_model.sh -m $model_path -d $data_path -o $model_output_path -b 16 -t 2 -l 4096
step-4-quantize-scales.py: add CPU offloading (line 27).
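For illustration, a minimal sketch of what the CPU-offloading change could look like, assuming the script loads the checkpoint through Hugging Face transformers; the model id is a placeholder and the surrounding script structure is hypothetical, so the actual change at line 27 of step-4-quantize-scales.py may differ:

```python
# Hypothetical sketch only; the real step-4-quantize-scales.py may differ.
# Idea: keep the full-precision BF16 checkpoint in host RAM ("cpu") so that only
# quantized FP8 tensors ever occupy HPU memory during scale quantization.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B-Instruct",  # placeholder for $model_path
    torch_dtype=torch.bfloat16,
    device_map="cpu",            # load weights to host memory instead of the HPU
    low_cpu_mem_usage=True,      # stream weights in to avoid a second full copy in RAM
)
```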