
[Fix] Getting GPU memory usage by a worker process correctly. #2807

Open

sh1ng wants to merge 4 commits into main
Conversation

sh1ng (Contributor) commented Feb 7, 2024

Currently, it's not possible to properly share GPU resources among multiple running vLLM instances.

If I split the GPU memory 50/50, the second process fails:

python3 -u -m vllm.entrypoints.openai.api_server        --host 0.0.0.0        --model h2oai/h2ogpt-4096-llama2-7b-chat  --tensor-parallel-size 2 --enforce-eager --gpu-memory-utilization 0.5
python3 -u -m vllm.entrypoints.openai.api_server        --host 0.0.0.0 --port 8888       --model h2oai/h2ogpt-4096-llama2-7b-chat  --tensor-parallel-size 2 --enforce-eager --gpu-memory-utilization 0.5
INFO 02-07 11:38:58 api_server.py:209] args: Namespace(host='0.0.0.0', port=8888, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.5, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=True, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2024-02-07 11:39:00,440	INFO worker.py:1724 -- Started a local Ray instance.
INFO 02-07 11:39:01 llm_engine.py:72] Initializing an LLM engine with config: model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, seed=0)
WARNING 02-07 11:39:05 custom_all_reduce.py:44] Custom allreduce is disabled because your platform lacks GPU P2P capability. To slience this warning, specifydisable_custom_all_reduce=True explicitly.
(RayWorkerVllm pid=88356) WARNING 02-07 11:39:05 custom_all_reduce.py:44] Custom allreduce is disabled because your platform lacks GPU P2P capability. To slience this warning, specifydisable_custom_all_reduce=True explicitly.
(RayWorkerVllm pid=88356) INFO 02-07 11:39:06 weight_utils.py:164] Using model weights format ['*.safetensors']
INFO 02-07 11:39:06 weight_utils.py:164] Using model weights format ['*.safetensors']
INFO 02-07 11:39:09 llm_engine.py:322] # GPU blocks: 0, # CPU blocks: 1024
Traceback (most recent call last):
  File "/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/sh1ng/dev/vllm/vllm/entrypoints/openai/api_server.py", line 217, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/sh1ng/dev/vllm/vllm/engine/async_llm_engine.py", line 623, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/home/sh1ng/dev/vllm/vllm/engine/async_llm_engine.py", line 319, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/sh1ng/dev/vllm/vllm/engine/async_llm_engine.py", line 364, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/sh1ng/dev/vllm/vllm/engine/llm_engine.py", line 114, in __init__
    self._init_cache()
  File "/home/sh1ng/dev/vllm/vllm/engine/llm_engine.py", line 326, in _init_cache
    raise ValueError("No available memory for the cache blocks. "
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.

With this change, GPU memory consumption is measured correctly for each worker process.
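For reference, here is a minimal sketch of the per-process idea, assuming the `pynvml` package; the helper name and structure are illustrative and not the exact diff in this PR:

```python
# Illustrative sketch only (not the exact change in this PR): ask NVML which
# compute processes are resident on the device and sum the memory attributed
# to this worker's PID, instead of relying on device-wide free/used memory
# that also counts every other process on the GPU.
import os

import pynvml


def gpu_memory_used_by_this_process(device_index: int) -> int:
    """Return the GPU memory (in bytes) used by the current process on one device."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        my_pid = os.getpid()
        # usedGpuMemory may be None if the driver cannot report a value.
        return sum(p.usedGpuMemory or 0 for p in procs if p.pid == my_pid)
    finally:
        pynvml.nvmlShutdown()
```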

sh1ng (Contributor, Author) commented Feb 7, 2024

FYI @pseudotensor

sh1ng changed the title from "[Fix] Getting memory usage by a worker process correctly." to "[Fix] Getting GPU memory usage by a worker process correctly." on Feb 7, 2024
sh1ng (Contributor, Author) commented Feb 8, 2024

It's related to gpuopenanalytics/pynvml#36.

So if vLLM is running inside Docker, we can't rely on the PID.
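To illustrate the limitation (a hypothetical check, not part of the PR): inside a container started without `--pid host`, NVML reports host-namespace PIDs while `os.getpid()` returns the container-namespace PID, so a per-process lookup finds nothing.

```python
# Hypothetical illustration of the PID-namespace mismatch in a container
# started WITHOUT `--pid host`: NVML reports host PIDs, while os.getpid()
# returns the container-local PID, so the two never match.
import os

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
nvml_pids = {p.pid for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle)}
print("pid inside container:  ", os.getpid())
print("pids reported by NVML: ", sorted(nvml_pids))
# With an isolated PID namespace, the worker's own PID is not in nvml_pids,
# so a per-process memory lookup would silently report 0 bytes.
pynvml.nvmlShutdown()
```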

@pseudotensor

Thanks!

sh1ng (Contributor, Author) commented Feb 8, 2024

@pseudotensor we will have to use `--pid host` for Docker; it's a limitation of NVML.

sh1ng (Contributor, Author) commented Feb 28, 2024

I believe this PR is still valuable even after #2863, as it measures memory usage per worker process and is not affected by races in which GPU memory gets allocated and freed by other processes (see the sketch below).

cc @WoosukKwon @simon-mo
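For contrast, a rough sketch of why a device-wide measurement is exposed to that race; this is not vLLM's actual code, and the dummy allocation merely stands in for the profiling forward pass:

```python
# Rough sketch, not vLLM code: estimating a worker's peak usage from
# device-wide free memory is racy, because any other process can allocate
# or free GPU memory between the two reads, and that churn gets attributed
# to this worker.
import torch

free_before, total = torch.cuda.mem_get_info()

# Stand-in for the profiling forward pass: allocate some workspace.
workspace = torch.empty(256 << 20, dtype=torch.uint8, device="cuda")

free_after, _ = torch.cuda.mem_get_info()
peak_usage = free_before - free_after  # also counts other processes' activity

# With --gpu-memory-utilization 0.5, anything another instance allocated in
# the meantime is subtracted here too, which is how the KV-cache budget can
# drop to zero blocks.
kv_cache_budget = 0.5 * total - peak_usage
print(f"apparent peak usage: {peak_usage / 2**30:.2f} GiB, "
      f"KV-cache budget: {kv_cache_budget / 2**30:.2f} GiB")
```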


This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

github-actions bot added the stale label on Oct 30, 2024

mergify bot commented Oct 30, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @sh1ng please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
