Hello everyone,

I'm trying to get vLLM running inside a Docker container on Windows 11. I followed the Docker on Windows quickstart guide.

I was using the same configuration files with Docker on Ubuntu before, and everything worked. On Windows, however, I'm not able to run any model larger than Qwen2.5-1.5B-Instruct. The 7B model should easily fit into VRAM, but I get an XPU out-of-memory error. It seems to happen as soon as more than about 1 GB of VRAM is allocated at once. I saw an issue with a similar error message and behavior elsewhere, but it was never resolved.

My guess is that it is related to WSL2, since the same setup worked on Ubuntu.
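For reference, a rough back-of-the-envelope check of the claim that the 7B model should fit (a sketch only; the parameter count and the ~4-bit sym_int4 size are approximations, not measurements):

# Rough estimate: Qwen2.5-7B-Instruct weights after sym_int4 conversion
# (~4 bits per weight) versus the 15.56 GiB of capacity the error below reports.
params = 7.6e9                 # approximate parameter count of Qwen2.5-7B
bytes_per_weight = 0.5         # sym_int4 ~ 4 bits per weight
weights_gib = params * bytes_per_weight / 2**30
print(f"~{weights_gib:.1f} GiB of weights vs. 15.56 GiB reported capacity")
# -> roughly 3.5 GiB for the weights alone, so plenty of headroom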
Here is the detailed log:
intel-gpu-vllm | 2024-12-20 00:15:16,084 - INFO - intel_extension_for_pytorch auto imported
intel-gpu-vllm | WARNING 12-20 00:15:17 config.py:1656] Casting torch.bfloat16 to torch.float16.
intel-gpu-vllm | 2024-12-20 00:15:19,642 INFO worker.py:1821 -- Started a local Ray instance.
intel-gpu-vllm | INFO 12-20 00:15:20 llm_engine.py:226] Initializing an LLM engine (v0.6.2+ipexllm) with config: model='/llm/models/Qwen2.5-7B-Instruct', speculative_config=None, tokenizer='/llm/models/Qwen2.5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=xpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen2.5-7B-Instruct, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=True, mm_processor_kwargs=None)
intel-gpu-vllm | INFO 12-20 00:15:20 ray_gpu_executor.py:135] use_ray_spmd_worker: False
intel-gpu-vllm | (pid=425) /usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libpng16.so.16: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
intel-gpu-vllm | (pid=425) warn(
intel-gpu-vllm | (pid=425) 2024-12-20 00:15:23,011 - INFO - intel_extension_for_pytorch auto imported
intel-gpu-vllm | observability_config is ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False)
intel-gpu-vllm | INFO 12-20 00:15:24 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
intel-gpu-vllm | INFO 12-20 00:15:24 selector.py:138] Using IPEX attention backend.
intel-gpu-vllm | INFO 12-20 00:15:24 selector.py:193] Cannot use _Backend.FLASH_ATTN backend on XPU.
intel-gpu-vllm | INFO 12-20 00:15:24 selector.py:138] Using IPEX attention backend.
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:10<00:31, 10.36s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:18<00:17, 8.94s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:26<00:08, 8.54s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:33<00:00, 8.17s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:33<00:00, 8.49s/it]
intel-gpu-vllm |
intel-gpu-vllm | 2024-12-20 00:15:58,504 - INFO - Converting the current model to sym_int4 format......
intel-gpu-vllm | 2024-12-20 00:15:58,505 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
intel-gpu-vllm | 2024-12-20 00:16:06,683 - INFO - Only HuggingFace Transformers models are currently supported for further optimizations
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] Error executing method load_model. This might cause deadlock in distributed execution.
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] Traceback (most recent call last):
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker_base.py", line 456, in execute_method
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] return executor(*args, **kwargs)
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker.py", line 183, in load_model
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] self.model_runner.load_model()
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/model_convert.py", line 110, in _ipex_llm_load_model
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] self.model = self.model.to(device=self.device_config.device,
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1160, in to
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] return self._apply(convert)
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 810, in _apply
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] module._apply(fn)
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 810, in _apply
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] module._apply(fn)
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 833, in _apply
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] param_applied = fn(param)
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] ^^^^^^^^^
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1158, in convert
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
intel-gpu-vllm | ERROR 12-20 00:16:06 worker_base.py:464] RuntimeError: XPU out of memory. Tried to allocate 1.02 GiB (GPU 0; 15.56 GiB total capacity; 0 bytes already allocated; 0 bytes reserved in total by PyTorch)
intel-gpu-vllm | Process SpawnProcess-33:
intel-gpu-vllm | Traceback (most recent call last):
intel-gpu-vllm | File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
intel-gpu-vllm | self.run()
intel-gpu-vllm | File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
intel-gpu-vllm | self._target(*self._args, **self._kwargs)
intel-gpu-vllm | File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 145, in run_mp_engine
intel-gpu-vllm | engine = IPEXLLMMQLLMEngine.from_engine_args(engine_args=engine_args,
intel-gpu-vllm | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
intel-gpu-vllm | File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/engine/engine.py", line 133, in from_engine_args
intel-gpu-vllm | return super().from_engine_args(engine_args, usage_context, ipc_path)
intel-gpu-vllm | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
intel-gpu-vllm | File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
intel-gpu-vllm | return cls(
intel-gpu-vllm | ^^^^
intel-gpu-vllm | File "/usr/local/lib/python3.11/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
intel-gpu-vllm | self.engine = LLMEngine(*args,
intel-gpu-vllm | ^^^^^^^^^^^^^^^^
intel-gpu-vllm | File "/usr/local/lib/python3.11/dist-packages/vllm/engine/llm_engine.py", line 325, in __init__
intel-gpu-vllm | self.model_executor = executor_class(
intel-gpu-vllm | ^^^^^^^^^^^^^^^
intel-gpu-vllm | File "/usr/local/lib/python3.11/dist-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
intel-gpu-vllm | super().__init__(*args, **kwargs)
intel-gpu-vllm | File "/usr/local/lib/python3.11/dist-packages/vllm/executor/xpu_executor.py", line 55, in __init__
intel-gpu-vllm | self._init_executor()
intel-gpu-vllm | File "/usr/local/lib/python3.11/dist-packages/vllm/executor/ray_gpu_executor.py", line 65, in _init_executor
intel-gpu-vllm | self._init_workers_ray(placement_group)
intel-gpu-vllm | File "/usr/local/lib/python3.11/dist-packages/vllm/executor/ray_gpu_executor.py", line 281, in _init_workers_ray
intel-gpu-vllm | self._run_workers("load_model",
intel-gpu-vllm | File "/usr/local/lib/python3.11/dist-packages/vllm/executor/ray_gpu_executor.py", line 520, in _run_workers
intel-gpu-vllm | self.driver_worker.execute_method(method, *driver_args,
intel-gpu-vllm | File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker_base.py", line 465, in execute_method
intel-gpu-vllm | raise e
intel-gpu-vllm | File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker_base.py", line 456, in execute_method
intel-gpu-vllm | return executor(*args, **kwargs)
intel-gpu-vllm | ^^^^^^^^^^^^^^^^^^^^^^^^^
intel-gpu-vllm | File "/usr/local/lib/python3.11/dist-packages/vllm/worker/worker.py", line 183, in load_model
intel-gpu-vllm | self.model_runner.load_model()
intel-gpu-vllm | File "/usr/local/lib/python3.11/dist-packages/ipex_llm/vllm/xpu/model_convert.py", line 110, in _ipex_llm_load_model
intel-gpu-vllm | self.model = self.model.to(device=self.device_config.device,
intel-gpu-vllm | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
intel-gpu-vllm | File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1160, in to
intel-gpu-vllm | return self._apply(convert)
intel-gpu-vllm | ^^^^^^^^^^^^^^^^^^^^
intel-gpu-vllm | File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 810, in _apply
intel-gpu-vllm | module._apply(fn)
intel-gpu-vllm | File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 810, in _apply
intel-gpu-vllm | module._apply(fn)
intel-gpu-vllm | File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 833, in _apply
intel-gpu-vllm | param_applied = fn(param)
intel-gpu-vllm | ^^^^^^^^^
intel-gpu-vllm | File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1158, in convert
intel-gpu-vllm | return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
intel-gpu-vllm | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
intel-gpu-vllm | RuntimeError: XPU out of memory. Tried to allocate 1.02 GiB (GPU 0; 15.56 GiB total capacity; 0 bytes already allocated; 0 bytes reserved in total by PyTorch)
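The traceback shows the failure happens in a plain parameter copy during model.to("xpu") right after the sym_int4 conversion, not in vLLM's own memory profiling, so it can probably be reproduced without vLLM at all. A minimal sketch of such a check, assuming the torch.cuda-style torch.xpu API that intel_extension_for_pytorch exposes inside the serving image (the tensor sizes are just examples):

# Probe single allocations of increasing size on the XPU; the error above
# suggests anything much over ~1 GiB fails under WSL2 even though the
# device reports 15.56 GiB of capacity.
import torch
import intel_extension_for_pytorch  # noqa: F401  (registers the "xpu" device)

props = torch.xpu.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 2**30:.2f} GiB reported")

for mib in (256, 512, 1024, 2048):
    try:
        n = mib * 2**20 // 2                     # number of float16 elements
        t = torch.empty(n, dtype=torch.float16, device="xpu")
        print(f"{mib} MiB allocation: OK")
        del t
        torch.xpu.empty_cache()
    except RuntimeError as exc:
        print(f"{mib} MiB allocation: FAILED ({exc})")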
This is my Dockerfile:
FROM intelanalytics/ipex-llm-serving-xpu:latest
COPY start-vllm-service-Qwen2.5-1.5B-Instruct.sh /llm/start-vllm-service-Qwen2.5-1.5B-Instruct.sh
COPY start-vllm-service-Qwen2.5-7B-Instruct.sh /llm/start-vllm-service-Qwen2.5-7B-Instruct.sh
COPY start-vllm-service-Qwen2.5-7B-Instruct-AWQ.sh /llm/start-vllm-service-Qwen2.5-7B-Instruct-AWQ.sh
COPY start-vllm-service-Qwen2.5-32B-Instruct-AWQ.sh /llm/start-vllm-service-Qwen2.5-32B-Instruct-AWQ.sh
# Set executable permissions for the start scripts
RUN chmod +x /llm/start-vllm-service-Qwen2.5-1.5B-Instruct.sh
RUN chmod +x /llm/start-vllm-service-Qwen2.5-7B-Instruct.sh
RUN chmod +x /llm/start-vllm-service-Qwen2.5-7B-Instruct-AWQ.sh
RUN chmod +x /llm/start-vllm-service-Qwen2.5-32B-Instruct-AWQ.sh
WORKDIR /llm/
ENTRYPOINT ["./start-vllm-service-Qwen2.5-7B-Instruct.sh"]
This is my bash script to start serving vLLM:

I'm thankful for any suggestions what I could try.

We currently only enable it on the 2.1.0 version, and the performance will be very poor. We recommend switching the physical system to Ubuntu and running the latest image.