
Errors cause the instance to run indefinitely #29

Open
gabewillen opened this issue Dec 27, 2023 · 23 comments

Comments

@gabewillen

gabewillen commented Dec 27, 2023

Any error caused by the payload leaves the instance hanging in an error state indefinitely. You have to terminate the instance manually, or you'll rack up a hefty bill if you have several running that hit an error.

@alpayariyak
Contributor

Are you still facing this issue?

@dannysemi

I had this issue yesterday. Used up all of my credits overnight.

@bartlettD

bartlettD commented Jan 29, 2024

I've seen this as well, but more from the perspective that if vLLM runs into an error, the worker keeps retrying the job over and over.

I can get this to happen if I do the following

  1. Try to load a model with a context size larger than will fit in memory (see the sketch after this list).
  2. Send a request.
  3. Container logs show vLLM quitting with an out-of-memory error.
  4. vLLM restarts the job and fails again.
  5. Repeat.
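
A minimal sketch of what step 1 amounts to, assuming vLLM's offline LLM API (worker-vllm uses the async engine, but the failure mode is the same); the model name and context length are placeholders:

    from vllm import LLM

    # Illustration only: on an under-sized GPU, asking for a context window far
    # larger than the card can hold makes the engine raise during initialization
    # (CUDA OOM or a KV-cache capacity error) before any request is served.
    # The serverless worker then restarts and fails the same way, in a loop.
    llm = LLM(
        model="mistralai/Mistral-7B-v0.1",  # placeholder model
        max_model_len=32768,                # deliberately larger than the GPU can fit
        gpu_memory_utilization=0.95,
    )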

@ashleykleynhans

This is not a vLLM-specific thing; it happens when my other workers get errors too. They just keep running over and over and spawning more and more workers until you scale your workers down to zero. This seems to be some kind of issue with the backend or the RunPod SDK.

@gabewillen
Author

This is why we abandoned the serverless vLLM worker. We are now using a custom TGI serverless worker that hasn't experienced this issue.

@dannysemi

I'm going to try polling the health check for retries and cancel the job if I get more than one or two retries.
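
For reference, a rough sketch of that approach against the RunPod serverless HTTP API, assuming the /health response exposes a retried count and using /cancel to stop the job; the endpoint and job IDs are placeholders:

    import os
    import time
    import requests

    API_KEY = os.environ["RUNPOD_API_KEY"]
    ENDPOINT_ID = "your-endpoint-id"  # placeholder
    JOB_ID = "your-job-id"            # placeholder
    BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
    HEADERS = {"Authorization": f"Bearer {API_KEY}"}
    MAX_RETRIES = 2

    while True:
        # Poll endpoint health; the "retried" field is an assumption about the payload.
        health = requests.get(f"{BASE}/health", headers=HEADERS, timeout=10).json()
        retried = health.get("jobs", {}).get("retried", 0)
        if retried > MAX_RETRIES:
            # Cancel the job so the worker stops being re-queued and billed.
            requests.post(f"{BASE}/cancel/{JOB_ID}", headers=HEADERS, timeout=10)
            break
        time.sleep(30)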

@alpayariyak
Contributor

@bartlettD Could you provide an example model and GPU model please?

@preemware

Same problem. Entire balance was wiped from:

    File "/vllm-installation/vllm/transformers_utils/config.py", line 23, in get_config
      config = AutoConfig.from_pretrained(
    File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 1100, in from_pretrained
      config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 634, in get_config_dict
      config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 689, in _get_config_dict
      resolved_config_file = cached_file(
    File "/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py", line 356, in cached_file
      raise EnvironmentError(
    OSError: /models/huggingface-cache/hub/models--anthonylx--Proximus-2x7B-v1/snapshots/43bf1965176b15634df97107863d4e3972eecebb does not appear to have a file named config.json. Checkout 'https://huggingface.co//models/huggingface-cache/hub/models--anthonylx--Proximus-2x7B-v1/snapshots/43bf1965176b15634df97107863d4e3972eecebb/None' for available files.

(worker 5hhq44ockiqu67, logged 2024-02-09 21:00:26)

using the build command docker build -t anthony001/proximus-worker:1.0.0 --build-arg MODEL_NAME="anthonylx/Proximus-2x7B-v1" --build-arg BASE_PATH="/models" .

@preemware

This is why we abandoned the serverless vLLM worker. We are now using a custom TGI serverless worker that hasn't experienced this issue.

Link? Because I've lost a lot of money from trying to use this one.

@alpayariyak
Contributor

Same problem. Entire balance was wiped from: […] (quoting the traceback and build command above)

Like @ashleykleynhans said, this is a problem with RunPod Serverless in general, not something specific to worker-vllm - the team is working on a solution.

It seems like your endpoint was not working from the start, so in the future I'd recommend confirming that with at least one test request before leaving it running, to avoid getting your balance wiped. vLLM is faster than TGI but has a lot of moving parts, so you need to make sure your deployment is successful, tweaking your configuration as necessary or reporting the issue if it's a bug in the worker.
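
A minimal sketch of such a smoke test, assuming the standard /runsync route, the COMPLETED status value, and a worker-vllm-style input payload (endpoint ID and prompt are placeholders):

    import os
    import requests

    API_KEY = os.environ["RUNPOD_API_KEY"]
    ENDPOINT_ID = "your-endpoint-id"  # placeholder

    # One synchronous test request; fail loudly if the endpoint cannot serve it.
    resp = requests.post(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"input": {"prompt": "Say hello.", "sampling_params": {"max_tokens": 16}}},
        timeout=120,
    )
    resp.raise_for_status()
    result = resp.json()
    print(result)
    # If this is not COMPLETED, scale the endpoint to zero and investigate
    # before leaving it running unattended.
    assert result.get("status") == "COMPLETED", f"Endpoint unhealthy: {result}"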

@preemware

(quoting the traceback and @alpayariyak's reply above)

It should exit on exception; that isn't impossible to implement. This used to work perfectly for a long time when only using vLLM's generate. The code should be tested before being tagged as a release.

@alpayariyak
Contributor

@anthonyllx
The issue is that Serverless will keep restarting the worker even when it breaks or raises an exception. The same would happen even when only using vLLM's generate, since you need to start the vLLM engine to use generate, which is where the exception occurs.

The latest commit fixes the error you're facing, thank you for reporting it.

@alpayariyak
Contributor

We will be adding a maximum number of worker restarts and job length limits to RunPod Serverless next week; this should solve the issue.

@preemware

We will be adding a maximum number of worker restarts and job length limits to RunPod Serverless next week; this should solve the issue.

Thank you. This would solve the problem.

@willsamu
Contributor

willsamu commented Mar 3, 2024

We will be adding a maximum number of worker restarts and job length limits to RunPod Serverless next week; this should solve the issue.

@alpayariyak When will this be introduced? I cannot find a setting to configure it in the UI. I'm somewhat afraid to use serverless endpoints in prod scenarios until this is solved.

@avacaondata

@gabewillen Could you please provide a link to the repo implementing the TGI custom worker?

@dpkirchner

@alpayariyak Just checking to see if this feature is now available and, if so, how to enable it. Is it an environment variable?

@DireLines

The cause of this has been identified and we are implementing a fix, which should be out by the end of next week. For now, you should know that this error will always happen when the handler code exits before running runpod.serverless.start(handler), which in turn mostly happens because of some error in the initialization phase. For example, in the stack trace you posted, @preemware, the error happened during initialization of the vLLM engine because of a missing config file on the model.

The fix is for runpod's backend to monitor the handler process for completion and terminate the pod if that process completes either successfully or unsuccessfully.
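
For illustration, a stripped-down sketch of that failure mode using the runpod Python SDK; the engine construction stands in for worker-vllm's real initialization, and the model name is just the example from this thread:

    import runpod
    from vllm import AsyncEngineArgs, AsyncLLMEngine

    # Module-level initialization runs before runpod.serverless.start() below.
    # If it raises (missing config.json, CUDA OOM, ...), the process exits without
    # ever registering a handler, and the platform simply starts a fresh worker
    # that fails the same way.
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="anthonylx/Proximus-2x7B-v1")
    )

    async def handler(job):
        # Only reached if the initialization above succeeded.
        ...

    runpod.serverless.start({"handler": handler})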

@willsamu
Contributor

willsamu commented May 8, 2024

@DireLines Thank you for the update. Is it implemented now? How does this work together with FlashBoot enabled? For example, a Mixtral finetune ran just fine for me on an RTX 6000 for dozens of requests until, suddenly, it threw an out-of-memory error during initialization with FlashBoot (due to the KV cache filling up, if I remember correctly).

Does that mean we need to wrap the vLLM initialization phase in a try-catch block and continue, so that it only fails once it reaches the handler?
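
That pattern might look something like the sketch below: defer engine construction and report the initialization error from inside the handler, assuming the SDK's convention that a returned error key marks the job as failed. This is a workaround sketch, not something worker-vllm does:

    import runpod

    _engine = None
    _init_error = None

    def _get_engine():
        # Lazily build the vLLM engine so an initialization failure is caught here
        # rather than killing the process before runpod.serverless.start() runs.
        global _engine, _init_error
        if _engine is None and _init_error is None:
            try:
                from vllm import AsyncEngineArgs, AsyncLLMEngine
                _engine = AsyncLLMEngine.from_engine_args(
                    AsyncEngineArgs(model="anthonylx/Proximus-2x7B-v1")  # placeholder
                )
            except Exception as exc:
                _init_error = str(exc)
        return _engine

    def handler(job):
        engine = _get_engine()
        if engine is None:
            # The job fails, but the worker process stays alive instead of
            # crashing and being restarted in a loop.
            return {"error": f"engine initialization failed: {_init_error}"}
        ...  # run generation with `engine` here

    runpod.serverless.start({"handler": handler})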

@7flash

7flash commented Jun 22, 2024

I also have this issue, balance wiped out

@dannysemi how did you implement health check?

@Permafacture

@DireLines any update?

@DireLines

It took longer than expected, but the logic flagging workers that fail during initialization as unhealthy is done and will be activated in the next release of one of our repos. It's already deployed, but for now it only logs to us when it happens, so we can confirm it behaves as expected before flipping the switch.

Once released, workers that are flagged in this way will be shown as "unhealthy" in the serverless UI and automatically stopped and removed from the endpoint. New ones will scale up to take their place, which means the money drain is slowed but not stopped. This is because a failure during initialization can also be caused by a temporary outage of a dependency needed at import time, and we don't want a temporary outage to turn into a permanent one. In a later iteration, we will implement better retry logic so that the money drain is stopped completely, and figure out some alerting/notification so that you, as the maintainer of an endpoint, know when failures of this type happen.

Thanks for your patience, this is definitely a bad behavior for serverless to exhibit and not at all an intended UX. I hope this prevents similar problems to what you've experienced in the future.

@DireLines

This change is now released.
