
[Fix] Getting GPU memory usage by a worker process correctly. #2807

Open

sh1ng wants to merge 4 commits into main
Conversation

sh1ng (Contributor) commented Feb 7, 2024

Currently, it's not possible to properly share GPU resources among multiple running vLLM instances.

If I split the GPU memory 50/50, the second process fails:

python3 -u -m vllm.entrypoints.openai.api_server        --host 0.0.0.0        --model h2oai/h2ogpt-4096-llama2-7b-chat  --tensor-parallel-size 2 --enforce-eager --gpu-memory-utilization 0.5
python3 -u -m vllm.entrypoints.openai.api_server        --host 0.0.0.0 --port 8888       --model h2oai/h2ogpt-4096-llama2-7b-chat  --tensor-parallel-size 2 --enforce-eager --gpu-memory-utilization 0.5
INFO 02-07 11:38:58 api_server.py:209] args: Namespace(host='0.0.0.0', port=8888, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.5, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=True, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
2024-02-07 11:39:00,440	INFO worker.py:1724 -- Started a local Ray instance.
INFO 02-07 11:39:01 llm_engine.py:72] Initializing an LLM engine with config: model='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer='h2oai/h2ogpt-4096-llama2-7b-chat', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, seed=0)
WARNING 02-07 11:39:05 custom_all_reduce.py:44] Custom allreduce is disabled because your platform lacks GPU P2P capability. To slience this warning, specifydisable_custom_all_reduce=True explicitly.
(RayWorkerVllm pid=88356) WARNING 02-07 11:39:05 custom_all_reduce.py:44] Custom allreduce is disabled because your platform lacks GPU P2P capability. To slience this warning, specifydisable_custom_all_reduce=True explicitly.
(RayWorkerVllm pid=88356) INFO 02-07 11:39:06 weight_utils.py:164] Using model weights format ['*.safetensors']
INFO 02-07 11:39:06 weight_utils.py:164] Using model weights format ['*.safetensors']
INFO 02-07 11:39:09 llm_engine.py:322] # GPU blocks: 0, # CPU blocks: 1024
Traceback (most recent call last):
  File "/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/sh1ng/miniconda3/envs/vllm-py310/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/sh1ng/dev/vllm/vllm/entrypoints/openai/api_server.py", line 217, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/home/sh1ng/dev/vllm/vllm/engine/async_llm_engine.py", line 623, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/home/sh1ng/dev/vllm/vllm/engine/async_llm_engine.py", line 319, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/home/sh1ng/dev/vllm/vllm/engine/async_llm_engine.py", line 364, in _init_engine
    return engine_class(*args, **kwargs)
  File "/home/sh1ng/dev/vllm/vllm/engine/llm_engine.py", line 114, in __init__
    self._init_cache()
  File "/home/sh1ng/dev/vllm/vllm/engine/llm_engine.py", line 326, in _init_cache
    raise ValueError("No available memory for the cache blocks. "
ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.

With this change, GPU memory consumption is measured correctly for each worker process.
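For reference, here is a minimal sketch of the per-process idea, assuming the `pynvml` package; the helper name and structure are illustrative and not the exact diff in this PR:

```python
# Illustrative sketch only (not the exact change in this PR): ask NVML which
# compute processes are resident on the device and sum the memory attributed
# to this worker's PID, instead of relying on device-wide free/used memory
# that also counts every other process on the GPU.
import os

import pynvml


def gpu_memory_used_by_this_process(device_index: int) -> int:
    """Return the GPU memory (in bytes) used by the current process on one device."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        my_pid = os.getpid()
        # usedGpuMemory may be None if the driver cannot report a value.
        return sum(p.usedGpuMemory or 0 for p in procs if p.pid == my_pid)
    finally:
        pynvml.nvmlShutdown()
```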

sh1ng (Contributor, Author) commented Feb 7, 2024

FYI @pseudotensor

sh1ng changed the title from "[Fix] Getting memory usage by a worker process correctly." to "[Fix] Getting GPU memory usage by a worker process correctly." on Feb 7, 2024
sh1ng (Contributor, Author) commented Feb 8, 2024

It's related to gpuopenanalytics/pynvml#36.

So if vLLM is running inside Docker, we can't rely on the PID.
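To illustrate the limitation (a hypothetical check, not part of the PR): inside a container started without `--pid host`, NVML reports host-namespace PIDs while `os.getpid()` returns the container-namespace PID, so a per-process lookup finds nothing.

```python
# Hypothetical illustration of the PID-namespace mismatch in a container
# started WITHOUT `--pid host`: NVML reports host PIDs, while os.getpid()
# returns the container-local PID, so the two never match.
import os

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
nvml_pids = {p.pid for p in pynvml.nvmlDeviceGetComputeRunningProcesses(handle)}
print("pid inside container:  ", os.getpid())
print("pids reported by NVML: ", sorted(nvml_pids))
# With an isolated PID namespace, the worker's own PID is not in nvml_pids,
# so a per-process memory lookup would silently report 0 bytes.
pynvml.nvmlShutdown()
```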

@pseudotensor

Thanks!

sh1ng (Contributor, Author) commented Feb 8, 2024

@pseudotensor we will have to use `--pid host` for Docker; it's a limitation of NVML.

sh1ng (Contributor, Author) commented Feb 28, 2024

I believe this PR is still valuable even after #2863, as it measures memory usage per worker process and is not affected by races in which GPU memory gets allocated and freed by other processes (see the sketch below).

cc @WoosukKwon @simon-mo
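For contrast, a rough sketch of why a device-wide measurement is exposed to that race; this is not vLLM's actual code, and the dummy allocation merely stands in for the profiling forward pass:

```python
# Rough sketch, not vLLM code: estimating a worker's peak usage from
# device-wide free memory is racy, because any other process can allocate
# or free GPU memory between the two reads, and that churn gets attributed
# to this worker.
import torch

free_before, total = torch.cuda.mem_get_info()

# Stand-in for the profiling forward pass: allocate some workspace.
workspace = torch.empty(256 << 20, dtype=torch.uint8, device="cuda")

free_after, _ = torch.cuda.mem_get_info()
peak_usage = free_before - free_after  # also counts other processes' activity

# With --gpu-memory-utilization 0.5, anything another instance allocated in
# the meantime is subtracted here too, which is how the KV-cache budget can
# drop to zero blocks.
kv_cache_budget = 0.5 * total - peak_usage
print(f"apparent peak usage: {peak_usage / 2**30:.2f} GiB, "
      f"KV-cache budget: {kv_cache_budget / 2**30:.2f} GiB")
```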


This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

github-actions bot added the stale label on Oct 30, 2024

mergify bot commented Oct 30, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @sh1ng please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork
