
RuntimeError: XPU out of memory on WSL2 vLLM running Qwen2.5-7B-Instruct, sym_int4, Arc A770 #12584

Open
nkt-dk opened this issue Dec 19, 2024 · 1 comment

nkt-dk commented Dec 19, 2024

Hello everyone,

I'm trying to get vLLM running inside a Docker container on Windows 11. I followed the Docker Windows quickstart guide.

I was using the same Docker configuration files on Ubuntu before, where everything worked. On Windows, however, I'm not able to run any model larger than Qwen2.5-1.5B-Instruct. The 7B model should easily fit in VRAM, but I get an XPU out-of-memory error. It seems to happen as soon as a single allocation of more than 1 GiB of VRAM is attempted. I have seen an issue with a similar error message and behavior elsewhere, but it was never resolved.

My guess is that this is related to WSL2, since the same setup works on Ubuntu.
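
To narrow this down, here is a minimal allocation probe that could be run inside the container (a sketch; it assumes the torch + intel_extension_for_pytorch stack shipped in the image, and the size steps are arbitrary) to check whether single allocations above ~1 GiB already fail outside vLLM:

# Allocation probe (sketch): try progressively larger single fp16 tensors on the XPU.
python3 - <<'EOF'
import torch
import intel_extension_for_pytorch  # noqa: F401  # registers the 'xpu' device
for size_mib in (256, 512, 1024, 2048):
    n = size_mib * 1024 * 1024 // 2  # number of float16 elements (2 bytes each)
    try:
        t = torch.empty(n, dtype=torch.float16, device="xpu")
        print(f"{size_mib} MiB: ok")
        del t
        torch.xpu.empty_cache()
    except RuntimeError as e:
        print(f"{size_mib} MiB: failed -> {e}")
EOF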

Here is the detailed log:

intel-gpu-vllm  | 2024-12-20 00:15:16,084 - INFO - intel_extension_for_pytorch auto imported
intel-gpu-vllm  | WARNING 12-20 00:15:17 config.py:1656] Casting torch.bfloat16 to torch.float16.
intel-gpu-vllm  | 2024-12-20 00:15:19,642       INFO worker.py:1821 -- Started a local Ray instance.
intel-gpu-vllm  | INFO 12-20 00:15:20 llm_engine.py:226] Initializing an LLM engine (v0.6.2+ipexllm) with config: model='/llm/models/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='/llm/models/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=xpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen2.5-7B-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None)
intel-gpu-vllm  | INFO 12-20 00:15:20 ray_gpu_executor.py:135] use_ray_spmd_worker: False
intel-gpu-vllm  | (pid=425) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
intel-gpu-vllm  | (pid=425)   warn(
intel-gpu-vllm  | (pid=425) 2024-12-20 00:15:23,011 - INFO - intel_extension_for_pytorch auto imported
intel-gpu-vllm  | observability_config is ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False)
intel-gpu-vllm  | INFO 12-20 00:15:24 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
intel-gpu-vllm  | INFO 12-20 00:15:24 selector.py:138] Using IPEX attention backend.
intel-gpu-vllm  | INFO 12-20 00:15:24 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
intel-gpu-vllm  | INFO 12-20 00:15:24 selector.py:138] Using IPEX attention backend.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:10<00:31, 10.36s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:18<00:17,  8.94s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:26<00:08,  8.54s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:33<00:00,  8.17s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:33<00:00,  8.49s/it]
intel-gpu-vllm  |
intel-gpu-vllm  | 2024-12-20 00:15:58,504 - INFO - Converting the current model to sym_int4 format......
intel-gpu-vllm  | 2024-12-20 00:15:58,505 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
intel-gpu-vllm  | 2024-12-20 00:16:06,683 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464] Error executing method load_model. This might cause deadlock in distributed execution.
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464] Traceback (most recent call last):
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464]   File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker_base.py", line 456, in execute_method
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464]     return executor(*args, **kwargs)
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464]   File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker.py", line 183, in load_model
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464]     self.model_runner.load_model()
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464]   File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/model_convert.py", line 110, in _ipex_llm_load_model
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464]     self.model = self.model.to(device=self.device_config.device,
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464]   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1160, in to
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464]     return self._apply(convert)
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464]   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 810, in _apply
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464]     module._apply(fn)
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464]   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 810, in _apply
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464]     module._apply(fn)
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464]   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 833, in _apply
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464]     param_applied = fn(param)
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464]                     ^^^^^^^^^
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464]   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1158, in convert
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464]     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
intel-gpu-vllm  | ERROR 12-20 00:16:06 worker_base.py:464] RuntimeError: XPU out of memory. Tried to allocate 1.02 GiB (GPU 0; 15.56 GiB total capacity; 0 bytes already allocated; 0 bytes reserved in total by PyTorch)
intel-gpu-vllm  | Process SpawnProcess-33:
intel-gpu-vllm  | Traceback (most recent call last):
intel-gpu-vllm  |   File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
intel-gpu-vllm  |     self.run()
intel-gpu-vllm  |   File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
intel-gpu-vllm  |     self._target(*self._args, **self._kwargs)
intel-gpu-vllm  |   File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 145, in run_mp_engine
intel-gpu-vllm  |     engine = IPEXLLMMQLLMEngine.from_engine_args(engine_args=engine_args,
intel-gpu-vllm  |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
intel-gpu-vllm  |   File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 133, in from_engine_args
intel-gpu-vllm  |     return super().from_engine_args(engine_args, usage_context, ipc_path)
intel-gpu-vllm  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
intel-gpu-vllm  |   File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
intel-gpu-vllm  |     return cls(
intel-gpu-vllm  |            ^^^^
intel-gpu-vllm  |   File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
intel-gpu-vllm  |     self.engine = LLMEngine(*args,
intel-gpu-vllm  |                   ^^^^^^^^^^^^^^^^
intel-gpu-vllm  |   File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 325, in __init__
intel-gpu-vllm  |     self.model_executor = executor_class(
intel-gpu-vllm  |                           ^^^^^^^^^^^^^^^
intel-gpu-vllm  |   File "/usr/local/lib/python3.11/dist-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
intel-gpu-vllm  |     super().__init__(*args, **kwargs)
intel-gpu-vllm  |   File "/usr/local/lib/python3.11/dist-packages/vllm/executor/xpu_executor.py", line 55, in __init__
intel-gpu-vllm  |     self._init_executor()
intel-gpu-vllm  |   File "/usr/local/lib/python3.11/dist-packages/vllm/executor/ray_gpu_executor.py", line 65, in _init_executor
intel-gpu-vllm  |     self._init_workers_ray(placement_group)
intel-gpu-vllm  |   File "/usr/local/lib/python3.11/dist-packages/vllm/executor/ray_gpu_executor.py", line 281, in _init_workers_ray
intel-gpu-vllm  |     self._run_workers("load_model",
intel-gpu-vllm  |   File "/usr/local/lib/python3.11/dist-packages/vllm/executor/ray_gpu_executor.py", line 520, in _run_workers
intel-gpu-vllm  |     self.driver_worker.execute_method(method, *driver_args,
intel-gpu-vllm  |   File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker_base.py", line 465, in execute_method
intel-gpu-vllm  |     raise e
intel-gpu-vllm  |   File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker_base.py", line 456, in execute_method
intel-gpu-vllm  |     return executor(*args, **kwargs)
intel-gpu-vllm  |            ^^^^^^^^^^^^^^^^^^^^^^^^^
intel-gpu-vllm  |   File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker.py", line 183, in load_model
intel-gpu-vllm  |     self.model_runner.load_model()
intel-gpu-vllm  |   File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/model_convert.py", line 110, in _ipex_llm_load_model
intel-gpu-vllm  |     self.model = self.model.to(device=self.device_config.device,
intel-gpu-vllm  |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
intel-gpu-vllm  |   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1160, in to
intel-gpu-vllm  |     return self._apply(convert)
intel-gpu-vllm  |            ^^^^^^^^^^^^^^^^^^^^
intel-gpu-vllm  |   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 810, in _apply
intel-gpu-vllm  |     module._apply(fn)
intel-gpu-vllm  |   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 810, in _apply
intel-gpu-vllm  |     module._apply(fn)
intel-gpu-vllm  |   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 833, in _apply
intel-gpu-vllm  |     param_applied = fn(param)
intel-gpu-vllm  |                     ^^^^^^^^^
intel-gpu-vllm  |   File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1158, in convert
intel-gpu-vllm  |     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
intel-gpu-vllm  |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
intel-gpu-vllm  | RuntimeError: XPU out of memory. Tried to allocate 1.02 GiB (GPU 0; 15.56 GiB total capacity; 0 bytes already allocated; 0 bytes reserved in total by PyTorch)

This is my Dockerfile:

FROM intelanalytics/ipex-llm-serving-xpu:latest

COPY start-vllm-service-Qwen2.5-1.5B-Instruct.sh /llm/start-vllm-service-Qwen2.5-1.5B-Instruct.sh
COPY start-vllm-service-Qwen2.5-7B-Instruct.sh /llm/start-vllm-service-Qwen2.5-7B-Instruct.sh
COPY start-vllm-service-Qwen2.5-7B-Instruct-AWQ.sh /llm/start-vllm-service-Qwen2.5-7B-Instruct-AWQ.sh
COPY start-vllm-service-Qwen2.5-32B-Instruct-AWQ.sh /llm/start-vllm-service-Qwen2.5-32B-Instruct-AWQ.sh

# Set executable permissions for the scripts
RUN chmod +x /llm/start-vllm-service-Qwen2.5-1.5B-Instruct.sh
RUN chmod +x /llm/start-vllm-service-Qwen2.5-7B-Instruct.sh
RUN chmod +x /llm/start-vllm-service-Qwen2.5-7B-Instruct-AWQ.sh
RUN chmod +x /llm/start-vllm-service-Qwen2.5-32B-Instruct-AWQ.sh

WORKDIR /llm/

ENTRYPOINT ["./start-vllm-service-Qwen2.5-7B-Instruct.sh"]
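
For completeness, the container is started following the Windows quickstart guide; a rough sketch of the equivalent docker run command (the --device and /usr/lib/wsl mount are per that guide; the image name, model path, and shm size below are placeholders, not my exact setup):

docker run -itd \
  --net=host \
  --device=/dev/dxg \
  -v /usr/lib/wsl:/usr/lib/wsl \
  -v /path/to/models:/llm/models \
  --shm-size="16g" \
  --name intel-gpu-vllm \
  my-vllm-xpu-image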

This is my bash script to start serving vLLM:

#!/bin/bash
model="/llm/models/Qwen2.5-7B-Instruct"
served_model_name="Qwen2.5-7B-Instruct"

export ZES_ENABLE_SYSMAN=1

export SYCL_CACHE_PERSISTENT=1

export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export ONEAPI_DEVICE_SELECTOR=level_zero:0

source /opt/intel/1ccl-wks/setvars.sh

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $served_model_name \
  --port 8000 \
  --model $model \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --device xpu \
  --dtype float16 \
  --load-in-low-bit sym_int4 \
  --max-model-len 1024 \
  --max-num-batched-tokens 1024 \
  --max-num-seqs 4 \
  --tensor-parallel-size 1 \
  --enforce-eager \
  --disable-async-output-proc \
  --distributed-executor-backend ray
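
For reference, once the server does come up (it does for the 1.5B variant, never for this 7B one), the OpenAI-compatible endpoint it exposes can be queried roughly like this (model name matches --served-model-name above; prompt and token limit are just an example):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'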

I'd be thankful for any suggestions on what I could try.

hzjane (Contributor) commented Dec 23, 2024

We currently only enable this (running on WSL2) with the 2.1.0 version, and the performance will be very poor. We recommend switching the physical system to Ubuntu and running the latest image there.
