[Docs] inference DeepSeek-V3 with LMDeploy #2960

Open
haswelliris opened this issue Dec 26, 2024 · 35 comments

@haswelliris

📚 The doc issue

LMDeploy, a flexible and high-performance inference and serving framework tailored for large language models, now supports DeepSeek-V3. It offers both offline pipeline processing and online deployment capabilities, seamlessly integrating with PyTorch-based workflows.

Installation

git clone -b support-dsv3 https://github.com/InternLM/lmdeploy.git
cd lmdeploy
pip install -e .
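As a quick sanity check that the editable install picked up the support-dsv3 branch (a minimal sketch, assuming the standard lmdeploy CLI entry points are available on this branch):

python -c "import lmdeploy; print(lmdeploy.__version__)"
lmdeploy check_env   # reports the detected Python/PyTorch/CUDA environment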

Offline Inference Pipeline

from lmdeploy import pipeline, PytorchEngineConfig

if __name__ == "__main__":
    pipe = pipeline("deepseek-ai/DeepSeek-V3-FP8", backend_config=PytorchEngineConfig(tp=8))
    messages_list = [
        [{"role": "user", "content": "Who are you?"}],
        [{"role": "user", "content": "Translate the following content into Chinese directly: DeepSeek-V3 adopts innovative architectures to guarantee economical training and efficient inference."}],
        [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
    ]
    output = pipe(messages_list)
    print(output)

Online Serving

# run
lmdeploy serve api_server deepseek-ai/DeepSeek-V3-FP8 --tp 8 --backend pytorch

To access the service, you can use the official OpenAI Python package (pip install openai). Below is an example demonstrating how to use the v1/chat/completions endpoint.

from openai import OpenAI

client = OpenAI(
    api_key='YOUR_API_KEY',
    base_url="http://0.0.0.0:23333/v1"
)
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[
        {"role": "user", "content": "Write a piece of quicksort code in C++."}
    ],
    temperature=0.8,
    top_p=0.8
)
print(response)

For more information, please refer to the following link: https://github.com/InternLM/lmdeploy/tree/support-dsv3

Suggest a potential alternative/fix

No response

@lvhan028 lvhan028 pinned this issue Dec 26, 2024
@DragonFive

What GPUs did you test on? Can this run on an 8*A100-80GB machine?

@lvhan028
Collaborator

8*H200

@shuson

shuson commented Dec 27, 2024

I have 6 DGX servers with 8*H100 each. How can I make it run across multiple machines?

@lvhan028
Collaborator

Sorry, LMDeploy doesn't support pipeline parallelism yet.

@Tushar-ml

Tushar-ml commented Dec 27, 2024

Is FP8 supported in LMDeploy now? The code snippet above mentions deepseek-ai/DeepSeek-V3-FP8.
@lvhan028

@lvhan028
Collaborator

PR #2967
It hasn't been merged to main yet.

@QwertyJack
Contributor

I wonder if we can run an AWQ quant version of that big model.

@tracyCzf

With 8*H200 processing a request, how many tokens can be generated per second?

8*H200

@bb33bb

bb33bb commented Dec 31, 2024

H200

I also want to know this, please.

@shashank-sensehq

When trying the online deployment using the command below:

# run
lmdeploy serve api_server deepseek-ai/DeepSeek-V3-FP8 --tp 8 --backend pytorch

I get this error:
404 Client Error: Not Found for url: https://huggingface.co/api/models/deepseek-ai/DeepSeek-V3-FP8/revision/main

Looks like the DeepSeek-V3-FP8 model doesn't exist on the HF Hub (https://huggingface.co/api/models).

@haizeiwanglf

RuntimeError: Can not found rewrite for auto_map: DeepseekV3ForCausalLM

@QwertyJack
Contributor

Looks like the DeepSeek-V3-FP8 model doesn't exist on the HF Hub.

Use this: deepseek-ai/DeepSeek-V3
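For reference, here is the serving command from the issue body with the repository id replaced by the one that actually exists on the Hub (flags unchanged; as later comments note, a chat template may also need to be configured):

# run
lmdeploy serve api_server deepseek-ai/DeepSeek-V3 --tp 8 --backend pytorch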

@janelu9

janelu9 commented Jan 9, 2025

How can I deploy it on multiple nodes (A100)?

@lvhan028
Collaborator

lvhan028 commented Jan 9, 2025

Deploying DSV3 on multiple nodes isn't supported yet.

@cbasavaraj

Hi, is it possible to run this on a single AWS ec2 instance with an NVIDIA A10G GPU having 24GB VRAM? Thanks

@yekx1912

How can I deploy it on multiple nodes (A100)?

You can use SGLang.

@lvhan028
Collaborator

Hi, is it possible to run this on a single AWS ec2 instance with an NVIDIA A10G GPU having 24GB VRAM? Thanks

No, it can't.

@javag97

javag97 commented Jan 28, 2025

Is it possible to run on consumer hardware? I have a computer with an AMD Radeon 7900 XTX, and another with an NVIDIA 4070 Ti Super. This is purely for educational purposes, and I want to try running models locally.

@ZhangSJ515

Hi, is it possible to run this on a single AWS ec2 instance with 7 NVIDIA L40S GPUs having 48*7 GB VRAM? Thanks!

@AvisP

AvisP commented Jan 31, 2025

The model weights for deepseek-ai/DeepSeek-V3-FP8 are not available on Hugging Face. Where do I get the weights? And also FP8 for R1?

@QwertyJack
Contributor

The model weights for deepseek-ai/DeepSeek-V3-FP8 are not available on Hugging Face. Where do I get the weights? And also FP8 for R1?

Did you read the doc? https://hf-mirror.com/deepseek-ai/DeepSeek-V3

@yinfan98
Contributor

Hi, everyone. I found some errors in the Offline Inference Pipeline. Here is a corrected version:

from lmdeploy import pipeline, PytorchEngineConfig
from lmdeploy.model import ChatTemplateConfig

if __name__ == "__main__":
    pipe = pipeline("deepseek-ai/DeepSeek-V3", 
                    backend_config=PytorchEngineConfig(tp=8), 
                    chat_template_config=ChatTemplateConfig(model_name='deepseek-r1'))
    messages_list = [
        [{"role": "user", 
          "content": "Translate the following content into Chinese directly: \
          DeepSeek-V3 adopts innovative architectures to guarantee economical training and efficient inference."}],
    ]
    output = pipe(messages_list)
    print(output[0].text)

Using 8*H20, peak GPU memory is 83 GB per GPU.

@yinfan98
Contributor

I also tested the throughput on 8*H20: about 20 tokens/s.

@AvisP

AvisP commented Jan 31, 2025

@yinfan98 Thanks for updating the pipeline code so quickly. Just to confirm: is the code running the model at fp16, since dtype is set to auto in PytorchEngineConfig (per the reference)? Is deepseek-ai/DeepSeek-V3 an fp16 model?

Also, are you sure peak memory is 83 GB? Each H20 has 96 GB of memory, so 8 of them would give about 770 GB in total.

@yinfan98
Contributor

@yinfan98 Thanks for updating the pipeline code so quickly. Just to confirm: is the code running the model at fp16, since dtype is set to auto in PytorchEngineConfig (per the reference)? Is deepseek-ai/DeepSeek-V3 an fp16 model?

Also, are you sure peak memory is 83 GB? Each H20 has 96 GB of memory, so 8 of them would give about 770 GB in total.

Hi. Regarding your questions, let me clarify two points:

First, while I haven't delved deeply into DeepSeekV3's PytorchEngineConfig implementation, I can confirm that LMDeploy supports FP8 blockwise GEMM/GroupGEMM operations. Based on this, I believe the model is running with FP8 precision.

Second, I can confirm the peak memory usage of 83GB or higher. This was validated through testing on 8 H800 GPUs, where we encountered out-of-memory (OOM) errors. This aligns with our expectations: FP8 dtype requires approximately 1GB of memory per billion parameters, while BF16 requires 2GB per billion parameters. Given that DeepSeekV3 has 671B parameters, the memory requirements would be:

For FP8: ~671 GB (671B × 1 GB/B), i.e. roughly 83-84 GB per GPU × 8 GPUs ≈ 672 GB
For BF16: ~1,342GB (671B × 2GB/B)
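The back-of-the-envelope numbers above can be reproduced with a few lines of Python (the helper name is only for illustration):

# rough estimate: N bytes per parameter is roughly N GB per billion parameters
def weight_mem_gb(params_billion, bytes_per_param):
    return params_billion * bytes_per_param

fp8 = weight_mem_gb(671, 1)    # ~671 GB total
bf16 = weight_mem_gb(671, 2)   # ~1342 GB total
print(fp8, fp8 / 8)            # ~671 GB, about 84 GB per GPU with tp=8
print(bf16)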

@AvisP

AvisP commented Jan 31, 2025

Thanks for the detailed clarification about the memory computation; it's clear to me now. And also for confirming that the code runs the model at fp8. The PytorchEngineConfig is from lmdeploy, but it looks like if I leave dtype at auto it automatically works at fp8.
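For completeness, a minimal sketch of what that looks like in the pipeline call, with dtype left at its default of 'auto' (assuming, per the discussion above, that the FP8 precision is picked up from the checkpoint itself rather than from this field):

from lmdeploy import pipeline, PytorchEngineConfig

# dtype stays 'auto'; the FP8 precision is assumed to follow the checkpoint (see the thread above)
backend_config = PytorchEngineConfig(tp=8, dtype='auto')
pipe = pipeline("deepseek-ai/DeepSeek-V3", backend_config=backend_config)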

@AnyangAngus

@yinfan98 Thank you for your message!
Since I have only one node with 8 H800 GPUs, is there any way to run DeepSeek-V3 on a single 8*H800 node? The FP8 model needs ~672 GB and one 8*H800 node has 640 GB, so the gap is not large, only about 32 GB.

@AnyangAngus

Update: OOM error on one 8*H800 node:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 2 has a total capacity of 79.14 GiB of which 348.81 MiB is free. Process 420123 has 78.80 GiB memory in use. Of the allocated memory 78.07 GiB is allocated by PyTorch, and 60.70 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
2025-02-04 06:47:37,735 - lmdeploy - ERROR - model_agent.py:510 - TP process 3 failed with exitcode 1
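For what it's worth, a hedged mitigation sketch that combines the allocator hint from the traceback with LMDeploy's cache_max_entry_count setting (the fraction of free GPU memory reserved for the KV cache); as the next comment notes, the FP8 weights alone are close to the 640 GB available on 8*H800, so this may still OOM:

import os
# allocator hint suggested by the PyTorch error message; set before CUDA is initialized
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from lmdeploy import pipeline, PytorchEngineConfig

# shrink the KV-cache share of free GPU memory (LMDeploy's default is 0.8)
backend_config = PytorchEngineConfig(tp=8, cache_max_entry_count=0.4)
pipe = pipeline("deepseek-ai/DeepSeek-V3", backend_config=backend_config)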

@AvisP

AvisP commented Feb 4, 2025

You can't fit this model on 8 H800s; you will probably need 10, 12, or more to make it work. Because of how the model is partitioned across GPU memory, the full capacity is not utilized, so you need extra headroom.

You can try the quantized models released by Unsloth in the meantime, which will fit on 8 H800s.

@groklab

groklab commented Feb 4, 2025

Deploying DSV3 on multiple nodes isn't supported yet.

Hi @lvhan028 and team, is there any timeline for when/if we will be able to deploy DSV3 on multiple nodes?

Thank you.

@snippetzero

Does TurboMind have any plans to support DeepSeek-V3?

@jifa513

jifa513 commented Feb 8, 2025

Does it support an INT4 KV cache for DeepSeek-V3 or DeepSeek-R1?

@lvhan028
Collaborator

lvhan028 commented Feb 8, 2025

Does TurboMind have any plans to support DeepSeek-V3?

Yes, but it will be a long journey.

@lvhan028
Collaborator

lvhan028 commented Feb 8, 2025

Hi @lvhan028 and team, is there any timeline for when/if we will be able to deploy DSV3 on multiple nodes?

2025 Q2. @grimoire is working on refactoring the PyTorch engine's distributed module.
Hopefully it can be done by the end of April.

@fwensen

fwensen commented Feb 8, 2025

If I use 2 nodes per instance, with 8*H20 per node, how can I implement that?
