「Question」Support 24GB 4090 inference with multiple nodes #205
Comments
We didn't try running it on 32 RTX 4090 GPUs. I think you might try reducing …
I tried it on 8×80 GB (H800); it also OOMs.

Changing it does not help; still OOM.
The model has 61 layers, which is not divisible by 4; try specifying VLLM_PP_LAYER_PARTITION="16,15,15,15".
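As a minimal sketch of how that suggestion could be applied, assuming the variable is set before the vLLM engine is created (exporting it in the launching shell instead would have the same effect); the 4-stage split matches the 4-node setup described in this issue:

```python
import os

# Split DeepSeek-V3's 61 layers as 16 + 15 + 15 + 15 across the 4
# pipeline-parallel stages; must be set before the vLLM engine is built.
os.environ["VLLM_PP_LAYER_PARTITION"] = "16,15,15,15"
```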
Thanks for the reply, but I still get OOM...
The OOM happens during sampling, which is proportional to …
Your GPU has 24 GiB of memory, and the model weights already take 20.5 GiB, which is quite tight. Suggestion: try with … It is also possible to reduce the model weight memory by using … I'm preparing a blog post to explain the memory footprint, so please stay tuned.
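As a rough sanity check on that 20.5 GiB figure, here is a back-of-the-envelope estimate assuming roughly 671B parameters stored in FP8 (1 byte each) and split evenly across 32 GPUs; both the parameter count and the even split are assumptions, not measurements from this issue:

```python
# Back-of-the-envelope weight-memory estimate.
params = 671e9           # assumed total parameter count
bytes_per_param = 1      # FP8 weights
num_gpus = 32            # 4 nodes x 8 GPUs

total_gib = params * bytes_per_param / 2**30
per_gpu_gib = total_gib / num_gpus
print(f"total weights ~ {total_gib:.0f} GiB, per GPU ~ {per_gpu_gib:.1f} GiB")
# -> total weights ~ 625 GiB, per GPU ~ 19.5 GiB before any replication or
#    runtime overhead, consistent with the ~20.5 GiB quoted above.
```

That leaves only a few GiB per 24 GiB card for activations, the KV cache, and sampling buffers, which is why even a 128-token context can push the GPUs over the edge.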
Is your feature request related to a problem? Please describe.
Currently, the only consumer-grade GPU that supports FP8 is the RTX 4090. I am attempting to run DeepSeek V3 across 4 nodes, each with 8 GPUs, but even with a very small context size (128), I encounter an “Out of Memory” error.
I want to confirm whether this issue is due to my configuration or if a model of this scale simply cannot run even with 32 RTX 4090 GPUs.
Here is my vLLM script, and I am using the latest version (0.6.6):
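As an illustrative sketch (not the author's actual script; all parameter values below are assumptions based on the setup described above), a 4-node × 8-GPU offline-inference launch with vLLM might look like this:

```python
from vllm import LLM, SamplingParams

# Illustrative multi-node setup: tensor parallelism within each node,
# pipeline parallelism across the 4 nodes, driven through Ray.
llm = LLM(
    model="deepseek-ai/DeepSeek-V3",     # assumed model id
    tensor_parallel_size=8,              # 8 GPUs per node
    pipeline_parallel_size=4,            # 4 nodes
    distributed_executor_backend="ray",  # multi-node execution
    max_model_len=128,                   # the small context size mentioned above
    gpu_memory_utilization=0.95,
    trust_remote_code=True,
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```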
Here are some of the outputs:
Describe the solution you'd like
A clear and concise description of what you want to happen.
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.