[Docs] inference DeepSeek-V3 with LMDeploy #2960
Comments
What GPU did you test on? Can this run on an 8*A100-80GB machine? |
8*H200 |
I have 6 DGX servers, each with 8*H100. How can I make it run across multiple machines? |
Sorry, LMDeploy doesn't support pipeline parallelism yet. |
Is FP8 supported in LMDeploy now? As the code snippet above mentions: deepseek-ai/DeepSeek-V3-FP8 |
PR #2967 |
I wonder if we can run an AWQ quant version of that big model. |
With 8*H200 processing a request, how many tokens can be generated per second? |
Also want to know this. |
When trying online deployment using the command below, I get this error: Looks like DeepSeek-V3-FP8 model doesn't exist in the HF hub (https://huggingface.co/api/models) |
RuntimeError: Can not found rewrite for auto_map: DeepseekV3ForCausalLM |
Use this: deepseek-ai/DeepSeek-V3 |
How can I deploy it on multiple nodes (A100)? |
Deploying DSV3 on multiple nodes isn't supported yet. |
Hi, is it possible to run this on a single AWS ec2 instance with an |
You can use sglang. |
No, it can't. |
Is it possible to run it on consumer hardware? I have a computer with an AMD Radeon 7900 XTX, and another with an NVIDIA 4070 Ti Super. This is purely for educational purposes, and I want to attempt to run models locally. |
Hi, is it possible to run this on a single AWS ec2 instance with 7 NVIDIA L40S GPUs having 48*7 GB VRAM? Thanks! |
The model weights for this model are not available on Hugging Face. |
Did you read the doc? https://hf-mirror.com/deepseek-ai/DeepSeek-V3 |
Hi, everyone. I found some errors in the Offline Inference Pipeline. Here is a corrected version:

```python
from lmdeploy import pipeline, PytorchEngineConfig
from lmdeploy.model import ChatTemplateConfig

if __name__ == "__main__":
    pipe = pipeline("deepseek-ai/DeepSeek-V3",
                    backend_config=PytorchEngineConfig(tp=8),
                    chat_template_config=ChatTemplateConfig(model_name='deepseek-r1'))

    messages_list = [
        [{"role": "user",
          "content": "Translate the following content into Chinese directly: "
                     "DeepSeek-V3 adopts innovative architectures to guarantee "
                     "economical training and efficient inference."}],
    ]

    output = pipe(messages_list)
    print(output[0].text)
```

Tested on 8*H20; peak GPU memory is 83 GB. |
I also tested the tokens/s on 8*H20: about 20 tokens/s. |
@yinfan98 Thanks for quickly updating the pipeline code. Just to confirm: is the code running the model at fp16? Also, are you sure peak memory is 83 GB? Each H20 has 96 GB of memory, so 8 of them would make it about 770 GB? |
Hi. Regarding your questions, let me clarify two points. First, while I haven't delved deeply into DeepSeekV3's PytorchEngineConfig implementation, I can confirm that LMDeploy supports FP8 blockwise GEMM/GroupGEMM operations, so I believe the model is running with FP8 precision. Second, I can confirm the peak memory usage of 83 GB or higher per GPU. This was validated through testing on 8 H800 GPUs, where we encountered out-of-memory (OOM) errors. This aligns with expectations: FP8 requires approximately 1 GB of memory per billion parameters, while BF16 requires 2 GB per billion parameters. Given that DeepSeekV3 has 671B parameters, the FP8 weights alone need ~671 GB (671B × 1 GB/B), i.e. roughly 83-84 GB × 8 = 672 GB. |
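A tiny sketch of that back-of-the-envelope estimate (assumed rule of thumb from the comment above: weight memory ≈ parameter count × bytes per parameter, split evenly across GPUs; KV cache and activations are extra):

```python
def weight_memory_per_gpu_gb(params_billion: float, bytes_per_param: float, num_gpus: int) -> float:
    """Rough per-GPU weight memory in GB; ignores KV cache, activations, and framework overhead."""
    total_gb = params_billion * bytes_per_param  # ~1 byte/param for FP8, ~2 for BF16
    return total_gb / num_gpus

# DeepSeek-V3: 671B parameters in FP8 across 8 GPUs -> ~84 GB of weights per GPU
print(weight_memory_per_gpu_gb(671, 1.0, 8))
```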
Thanks for the detailed clarification about the memory computation; it is clear to me now. And also for confirming that the code runs the model at fp8. |
@yinfan98 Thank you for your message! |
Update: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 2 has a total capacity of 79.14 GiB of which 348.81 MiB is free. Process 420123 has 78.80 GiB memory in use. Of the allocated memory 78.07 GiB is allocated by PyTorch, and 60.70 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) |
You can't fit this model in 8 H800s; you will probably need 10, 12, or more of them to make it work. Based on how the model is divided across GPU memory, the full memory is not utilized, so you will need extra headroom. In the meantime, you can try the quantized models released by Unsloth, which will fit into 8 H800s. |
Hi @lvhan028 and Team - any timeline on when/if we will be able to deploy DSV3 on multiple nodes? Thank you. |
Does Turbomind have any plans to support DeepSeek v3? |
Does it support INT4 KV cache for DeepSeek-V3 or DeepSeek-R1? |
Yes, but it will be a long journey. |
If I use an instance of 2 nodes, each node with 8*H20, how do I implement it? |
📚 The doc issue
LMDeploy, a flexible and high-performance inference and serving framework tailored for large language models, now supports DeepSeek-V3. It offers both offline pipeline processing and online deployment capabilities, seamlessly integrating with PyTorch-based workflows.
Installation
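The installation steps are not preserved in this copy of the doc. A minimal sketch, assuming LMDeploy is built from the support-dsv3 branch linked at the end of this issue (the exact commands may differ):

```shell
# Assumption: DSV3 support lives on the support-dsv3 branch referenced below
git clone -b support-dsv3 https://github.com/InternLM/lmdeploy.git
cd lmdeploy
pip install -e .
```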
Offline Inference Pipeline
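The original pipeline snippet is missing from this copy. A minimal sketch based on the corrected version posted in the comment thread; the model repo name deepseek-ai/DeepSeek-V3-FP8 follows the snippet referenced earlier in the thread, and tp=8 and the example prompt are assumptions:

```python
from lmdeploy import pipeline, PytorchEngineConfig

if __name__ == "__main__":
    # Model repo as referenced in the thread; the comments suggest switching to
    # "deepseek-ai/DeepSeek-V3" if the FP8 repo cannot be found on the hub.
    pipe = pipeline("deepseek-ai/DeepSeek-V3-FP8",
                    backend_config=PytorchEngineConfig(tp=8))

    messages_list = [
        [{"role": "user", "content": "Who are you?"}],
    ]
    outputs = pipe(messages_list)
    print(outputs[0].text)
```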
Online Serving
```shell
# run
lmdeploy serve api_server deepseek-ai/DeepSeek-V3-FP8 --tp 8 --backend pytorch
```
To access the service, you can use the official OpenAI Python package (`pip install openai`). Below is an example demonstrating how to use the `v1/chat/completions` endpoint.
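The example itself is not preserved here. A minimal sketch using the OpenAI Python client, assuming the api_server above listens on its default port 23333 and that any API key string is accepted (adjust host, port, and prompt to your deployment):

```python
from openai import OpenAI

# Assumption: the api_server started above is reachable at localhost:23333
client = OpenAI(api_key="not-needed", base_url="http://0.0.0.0:23333/v1")

# Ask the server which model it is serving, then send a chat completion request
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{"role": "user", "content": "Who are you?"}],
    temperature=0.8,
    top_p=0.8,
)
print(response.choices[0].message.content)
```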
For more information, please refer to the following link: https://github.com/InternLM/lmdeploy/tree/support-dsv3
Suggest a potential alternative/fix
No response