
[Misc]: How to build on Ascend NPU #1

Open
wangshuai09 opened this issue Sep 10, 2024 · 6 comments

Comments

wangshuai09 (Owner) commented Sep 10, 2024

  1. Pull the dev docker image
    docker pull ascendai/pytorch:2.1.0-ubuntu22.04
  2. Start and enter the container
    docker run -p 2022:22 --name test-vllm --device /dev/davinci0 --device /dev/davinci_manager --device /dev/devmm_svm --device /dev/hisi_hdc -v /usr/local/dcmi:/usr/local/dcmi -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info -itd ascendai/pytorch:2.1.0-ubuntu22.04 bash
  3. Download the vllm project
    yum install git
    pip uninstall torch_npu
    git clone https://github.com/wangshuai09/vllm
    cd vllm
    git checkout npu_support
  4. Install vllm
    VLLM_TARGET_DEVICE=npu pip install -e .
  5. Test a model (a minimal sketch of such a script is shown below)
    python examples/offline_inference_npu.py
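
For reference, a minimal offline-inference script along the lines of examples/offline_inference_npu.py could look like the following sketch, using vLLM's standard LLM API (the model name and sampling settings are illustrative placeholders, not taken from the repository's example):

# Minimal offline-inference sketch; assumes the npu_support branch was
# installed with VLLM_TARGET_DEVICE=npu as described above.
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated: {output.outputs[0].text!r}")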

WeChat Group

Scan the QR code via WeChat to join the vLLM-NPU discussion group.
(QR code image)

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
beardog6 commented:

Does this currently only support offline inference? Starting the OpenAI server endpoint fails with the following error:
INFO 09-12 10:08:00 selector.py:237] Cannot use _Backend.FLASH_ATTN backend on NPU.
INFO 09-12 10:08:00 selector.py:161] Using ASCEND_TORCH backend.
Process SpawnProcess-1:
Traceback (most recent call last):
File "/root/miniconda3/envs/Python310/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/root/miniconda3/envs/Python310/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/vllm/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
File "/vllm/vllm/entrypoints/openai/rpc/server.py", line 34, in init
self.engine = AsyncLLMEngine.from_engine_args(
File "/vllm/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
engine = cls(
File "/vllm/vllm/engine/async_llm_engine.py", line 615, in init
self.engine = self._init_engine(*args, **kwargs)
File "/vllm/vllm/engine/async_llm_engine.py", line 835, in _init_engine
return engine_class(*args, **kwargs)
File "/vllm/vllm/engine/async_llm_engine.py", line 262, in init
super().init(*args, **kwargs)
File "/vllm/vllm/engine/llm_engine.py", line 324, in init
self.model_executor = executor_class(
File "/vllm/vllm/executor/executor_base.py", line 47, in init
self._init_executor()
File "/vllm/vllm/executor/gpu_executor.py", line 38, in _init_executor
self.driver_worker = self._create_worker()
File "/vllm/vllm/executor/gpu_executor.py", line 105, in _create_worker
return create_worker(**self._get_create_worker_kwargs(
File "/vllm/vllm/executor/gpu_executor.py", line 24, in create_worker
wrapper.init_worker(**kwargs)
File "/vllm/vllm/worker/worker_base.py", line 449, in init_worker
self.worker = worker_class(*args, **kwargs)
File "/vllm/vllm/worker/worker.py", line 99, in init
self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
File "/vllm/vllm/worker/model_runner.py", line 888, in init
self.attn_state = self.attn_backend.get_state_cls()(
File "/vllm/vllm/attention/backends/abstract.py", line 43, in get_state_cls
raise NotImplementedError
NotImplementedError
ERROR 09-12 10:08:02 api_server.py:188] RPCServer process died before responding to readiness probe
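
For context, the NotImplementedError is raised by vLLM's abstract attention-backend base class: get_state_cls() has no default implementation, and the in-progress ASCEND_TORCH backend does not override it yet. A simplified, self-contained illustration of the pattern (the class bodies below are illustrative, not vLLM's actual code):

# Simplified illustration of the missing override; not vLLM's actual code.
class AttentionBackend:
    @staticmethod
    def get_state_cls():
        # Abstract default: every concrete backend must override this.
        raise NotImplementedError

class AscendTorchBackend(AttentionBackend):
    # No get_state_cls() override yet, so building the model runner hits the
    # abstract default above and the RPC server process dies.
    pass

try:
    AscendTorchBackend.get_state_cls()
except NotImplementedError:
    print("get_state_cls() is not implemented for the Ascend backend yet")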

wangshuai09 (Owner, Author) commented:

@beardog6 This is still in an early development stage and these features have not been debugged yet. You are welcome to collaborate on the development; the development branch is npu_support.

wangshuai09 (Owner, Author) commented:

@beardog6 Is this the scenario you tested?

# start server
vllm serve facebook/opt-125m

# request
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 20,
    "temperature": 0
}'

# output
{"id":"cmpl-862bb9206aa84004a55c625b75e6dfea","object":"text_completion","created":1726649591,"model":"facebook/opt-125m","choices":[{"index":0,"text":" great place to live.  I've lived in San Francisco for a few years now and I've","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":25,"completion_tokens":20}}

beardog6 commented:

Yes, though the startup arguments are a bit different @wangshuai09

wangshuai09 (Owner, Author) commented:

Yes, though the startup arguments are a bit different @wangshuai09

My test above passes. Could you pull the latest code and see whether it runs with your arguments?

beardog6 commented Sep 20, 2024

Yes, though the startup arguments are a bit different @wangshuai09

My test above passes. Could you pull the latest code and see whether it runs with your arguments?

Qwen2 and Qwen2.5 pass single-card tests.
Deepseek-v2-lite-chat fails (probably because forward_npu is not implemented for the MoE model):
File "/root/miniconda3/envs/Python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/vllm/vllm/model_executor/models/deepseek_v2.py", line 504, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
File "/root/miniconda3/envs/Python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/envs/Python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/vllm/vllm/model_executor/models/deepseek_v2.py", line 461, in forward
hidden_states, residual = layer(positions, hidden_states,
File "/root/miniconda3/envs/Python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/envs/Python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/vllm/vllm/model_executor/models/deepseek_v2.py", line 401, in forward
hidden_states = self.mlp(hidden_states)
File "/root/miniconda3/envs/Python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/envs/Python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/vllm/vllm/model_executor/models/deepseek_v2.py", line 148, in forward
final_hidden_states = self.experts(
File "/root/miniconda3/envs/Python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/miniconda3/envs/Python310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 442, in forward
final_hidden_states = self.quant_method.apply(
File "/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 78, in apply
return self.forward(x=x,
File "/vllm/vllm/model_executor/custom_op.py", line 14, in forward
return self._forward_method(*args, **kwargs)
File "/vllm/vllm/model_executor/custom_op.py", line 58, in forward_npu
return self.forward_native(*args, **kwargs)
File "/vllm/vllm/model_executor/custom_op.py", line 23, in forward_native
raise NotImplementedError
NotImplementedError
ERROR 09-20 07:51:16 api_server.py:188] RPCServer process died before responding to readiness probe
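
For context, vLLM's CustomOp dispatches to a device-specific forward_* method; on NPU the fused-MoE op falls back to forward_native, which it does not implement, hence the NotImplementedError. A simplified, self-contained illustration of that dispatch (not vLLM's actual code):

# Simplified illustration of the CustomOp dispatch seen in the traceback above.
class CustomOp:
    def forward(self, *args, **kwargs):
        # vLLM resolves a device-specific method (forward_cuda, forward_npu, ...)
        # and calls it; only the NPU path is shown here.
        return self.forward_npu(*args, **kwargs)

    def forward_native(self, *args, **kwargs):
        # Default: no device-independent implementation provided.
        raise NotImplementedError

    def forward_npu(self, *args, **kwargs):
        # The NPU path falls back to the native path unless an op overrides it,
        # which is why the fused-MoE op fails above.
        return self.forward_native(*args, **kwargs)

try:
    CustomOp().forward()
except NotImplementedError:
    print("forward_npu falls back to an unimplemented forward_native")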
