[Bug]: Meta-Llama-3-3-70B-Instruct Outputs "!!!!" With Context Length above 10k #738
Comments
@ppatel-eng thank you for submitting the issue. For now, Llama 3.3 is not fully validated by the team. Any feedback is valuable, but we need some time to put the model on the official list of supported models. Please collect your environment details with:
$ wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
For security purposes, please feel free to check the contents of collect_env.py before running it.
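A minimal sketch of the full environment-collection step, assuming the standard vLLM workflow of downloading the script and then running it with the same Python interpreter that vLLM is installed into:
$ wget https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py
$ python collect_env.py   # prints hardware, driver, and package versions to paste back into the issue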
Understood, thanks! The results from collect_env.py are below:
Hi @ppatel-eng, thank you for the update. It seems that the Gaudi vllm-fork is not installed. Please go through the installation steps, starting with:
$ git clone https://github.com/HabanaAI/vllm-fork.git
then run the test once again and see if that helps.
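A minimal sketch of a from-source install of the Gaudi fork, assuming a standard editable-install flow; the HPU requirements file name is an assumption, so check the fork's README for the exact build steps:
$ git clone https://github.com/HabanaAI/vllm-fork.git
$ cd vllm-fork
$ pip install -r requirements-hpu.txt   # assumption: the fork ships an HPU-specific requirements file
$ pip install -e .                      # editable install so this Gaudi-enabled build replaces upstream vLLM
$ python -c "import vllm; print(vllm.__version__)"   # verify which vLLM build is now active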
Your current environment
Environment Details
Running in a Kubernetes environment with Habana Gaudi2 accelerators:
Hardware: Habana Gaudi2 accelerators
Deployment: Kubernetes cluster
Node Resources:
Habana Gaudi Version: 1.18
vLLM Version: 0.6.2+geb0d42fc
Python Version: 3.10
How would you like to use vllm
I would like to serve the Meta-Llama-3-3-70B-Instruct model.
Current Configuration
Meta-Llama-3-3-70B-Instruct:
  arguments:
    - --gpu-memory-utilization 0.90
    - --max-logprobs 5
    - --enable-auto-tool-choice
    - --tool-call-parser llama3_json
    - --download-dir /data
    - --tensor-parallel-size 4
    - --chat-template /data/chat_templates/tool_chat_template_llama31_json.jinja
  gpuLimit: 1
  numGPU: 4
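For reference, the arguments above correspond roughly to the following single vllm serve invocation. This is a sketch only: the Hugging Face model identifier and the listen port are assumptions, and the Kubernetes deployment may inject additional flags.
$ vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --gpu-memory-utilization 0.90 \
    --max-logprobs 5 \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json \
    --download-dir /data \
    --tensor-parallel-size 4 \
    --chat-template /data/chat_templates/tool_chat_template_llama31_json.jinja \
    --port 8000   # assumption: default OpenAI-compatible server port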
Model Input Dumps
No response
🐛 Describe the bug
When we provide a context of over 10k tokens (and sometimes with as little as 3k tokens), the model starts outputting exclamation points instead of a normal response. We tested the same script with the same model on NVIDIA A100s, serving it with the exact same vLLM settings (vLLM version 0.6.2), and did not see this issue with contexts of up to 60k tokens.
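A minimal reproduction sketch against the OpenAI-compatible endpoint, assuming the server listens on localhost:8000 and that the served model name matches the config above; the repeated filler text is a stand-in for a real >10k-token context:
$ python -c "import json; msg = 'repeat this sentence. ' * 3000 + 'Now summarize the text above.'; body = {'model': 'Meta-Llama-3-3-70B-Instruct', 'messages': [{'role': 'user', 'content': msg}], 'max_tokens': 100}; open('req.json', 'w').write(json.dumps(body))"   # build a long-context request (filler well above 10k tokens)
$ curl -s http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d @req.json   # per the report, on Gaudi2 the returned content degenerates into runs of "!"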
Example Response: