[Docs] inference DeepSeek-V3 with LMDeploy #2960
Comments
What GPU did you test on? Can this run on an 8*A100-80GB machine? |
8*H200 |
I have 6 DGX servers, each with 8*H100. How can I make it run across multiple machines? |
Sorry, LMDeploy doesn't support pipeline parallelism yet. |
Is FP8 supported in LMDeploy now? As the code snippet above mentions: deepseek-ai/DeepSeek-V3-FP8 |
PR #2967 |
I wonder if we can run an AWQ quant version of that big model. |
With 8*H200 processing a request, how many tokens can be generated per second? |
Also want to know this. |
When trying online deployment using the command below, I get this error: Looks like DeepSeek-V3-FP8 model doesn't exist in the HF hub (https://huggingface.co/api/models) |
RuntimeError: Can not found rewrite for auto_map: DeepseekV3ForCausalLM |
Use this: deepseek-ai/DeepSeek-V3 |
How can I deploy it on multiple nodes (A100)? |
Deploying DSV3 on multiple nodes isn't supported yet. |
Hi, is it possible to run this on a single AWS ec2 instance with an |
You can use sglang. |
No, it can't. |
Is it possible to run it on consumer hardware? I have a computer with an AMD Radeon 7900 XTX, and another with an NVIDIA 4070 Ti Super. This is purely for educational purposes, and I want to attempt to run models locally. |
Hi, is it possible to run this on a single AWS ec2 instance with 7 NVIDIA L40S GPUs having 48*7 GB VRAM? Thanks! |
The model weights for this model are not available on Hugging Face. |
Did you read the doc? https://hf-mirror.com/deepseek-ai/DeepSeek-V3 |
Hi, everyone. I found some errors in the Offline Inference Pipeline. Here is a corrected version:

```python
from lmdeploy import pipeline, PytorchEngineConfig
from lmdeploy.model import ChatTemplateConfig

if __name__ == "__main__":
    pipe = pipeline("deepseek-ai/DeepSeek-V3",
                    backend_config=PytorchEngineConfig(tp=8),
                    chat_template_config=ChatTemplateConfig(model_name='deepseek-r1'))

    messages_list = [
        [{"role": "user",
          "content": "Translate the following content into Chinese directly: "
                     "DeepSeek-V3 adopts innovative architectures to guarantee "
                     "economical training and efficient inference."}],
    ]

    output = pipe(messages_list)
    print(output[0].text)
```

Tested on 8*H20; peak GPU memory is 83 GB. |
I also tested the tokens/s on 8*H20: about 20 tokens/s. |
@yinfan98 Thanks for quickly updating the pipeline code. Just to confirm: is the code running the model at fp16? Also, are you sure peak memory is 83 GB? Each H20 has 96 GB of memory, so 8 of them would make it about 770 GB? |
Hi. Regarding your questions, let me clarify two points. First, while I haven't delved deeply into DeepSeekV3's PytorchEngineConfig implementation, I can confirm that LMDeploy supports FP8 blockwise GEMM/GroupGEMM operations, so I believe the model is running with FP8 precision. Second, I can confirm the peak memory usage of 83 GB or higher per GPU. This was validated through testing on 8 H800 GPUs, where we encountered out-of-memory (OOM) errors. This aligns with expectations: FP8 requires approximately 1 GB of memory per billion parameters, while BF16 requires 2 GB per billion parameters. Given that DeepSeekV3 has 671B parameters, the FP8 weights alone need ~671 GB (671B × 1 GB/B), i.e. roughly 83-84 GB × 8 = 672 GB. |
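A tiny sketch of that back-of-the-envelope estimate (assumed rule of thumb from the comment above: weight memory ≈ parameter count × bytes per parameter, split evenly across GPUs; KV cache and activations are extra):

```python
def weight_memory_per_gpu_gb(params_billion: float, bytes_per_param: float, num_gpus: int) -> float:
    """Rough per-GPU weight memory in GB; ignores KV cache, activations, and framework overhead."""
    total_gb = params_billion * bytes_per_param  # ~1 byte/param for FP8, ~2 for BF16
    return total_gb / num_gpus

# DeepSeek-V3: 671B parameters in FP8 across 8 GPUs -> ~84 GB of weights per GPU
print(weight_memory_per_gpu_gb(671, 1.0, 8))
```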
Thanks for the detailed clarification about the memory computation; it is clear to me now. And also for confirming that the code runs the model at fp8. |
@yinfan98 Thank you for your message! |
Update: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 2 has a total capacity of 79.14 GiB of which 348.81 MiB is free. Process 420123 has 78.80 GiB memory in use. Of the allocated memory 78.07 GiB is allocated by PyTorch, and 60.70 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) |
You can't fit this model in 8 H800s; you will probably need 10, 12, or more of them to make it work. Based on how the model is divided across GPU memory, the full memory is not utilized, so you will need extra headroom. In the meantime, you can try the quantized models released by Unsloth, which will fit into 8 H800s. |
Hi @lvhan028 and Team - any timeline on when/if we will be able to deploy DSV3 on multiple nodes? Thank you. |
Does Turbomind have any plans to support DeepSeek v3? |
Does it support INT4 KV cache for DeepSeek-V3 or DeepSeek-R1? |
Yes, but it will be a long journey. |
If I use an instance of 2 nodes, each node with 8*H20, how do I implement it? |
📚 The doc issue
LMDeploy, a flexible and high-performance inference and serving framework tailored for large language models, now supports DeepSeek-V3. It offers both offline pipeline processing and online deployment capabilities, seamlessly integrating with PyTorch-based workflows.
Installation
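The installation steps are not preserved in this copy of the doc. A minimal sketch, assuming LMDeploy is built from the support-dsv3 branch linked at the end of this issue (the exact commands may differ):

```shell
# Assumption: DSV3 support lives on the support-dsv3 branch referenced below
git clone -b support-dsv3 https://github.com/InternLM/lmdeploy.git
cd lmdeploy
pip install -e .
```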
Offline Inference Pipeline
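The original pipeline snippet is missing from this copy. A minimal sketch based on the corrected version posted in the comment thread; the model repo name deepseek-ai/DeepSeek-V3-FP8 follows the snippet referenced earlier in the thread, and tp=8 and the example prompt are assumptions:

```python
from lmdeploy import pipeline, PytorchEngineConfig

if __name__ == "__main__":
    # Model repo as referenced in the thread; the comments suggest switching to
    # "deepseek-ai/DeepSeek-V3" if the FP8 repo cannot be found on the hub.
    pipe = pipeline("deepseek-ai/DeepSeek-V3-FP8",
                    backend_config=PytorchEngineConfig(tp=8))

    messages_list = [
        [{"role": "user", "content": "Who are you?"}],
    ]
    outputs = pipe(messages_list)
    print(outputs[0].text)
```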
Online Serving
```shell
# run
lmdeploy serve api_server deepseek-ai/DeepSeek-V3-FP8 --tp 8 --backend pytorch
```
To access the service, you can use the official OpenAI Python package (`pip install openai`). Below is an example demonstrating how to use the `v1/chat/completions` endpoint.
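The example itself is not preserved here. A minimal sketch using the OpenAI Python client, assuming the api_server above listens on its default port 23333 and that any API key string is accepted (adjust host, port, and prompt to your deployment):

```python
from openai import OpenAI

# Assumption: the api_server started above is reachable at localhost:23333
client = OpenAI(api_key="not-needed", base_url="http://0.0.0.0:23333/v1")

# Ask the server which model it is serving, then send a chat completion request
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Who are you?"}],
    temperature=0.8,
    top_p=0.8,
)
print(response.choices[0].message.content)
```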
For more information, please refer to the following link: https://github.com/InternLM/lmdeploy/tree/support-dsv3
Suggest a potential alternative/fix
No response