Errors cause the instance to run indefinitely #29
Comments
Are you still facing this issue currently?
I had this issue yesterday. Used up all of my credits overnight.
I've seen this as well, but more from the perspective that if vLLM runs into an error, the worker just keeps retrying the job over and over. I can get this to happen if I do the following:
This is not a vLLM-specific thing; it happens when my other workers get errors too. They just keep running over and over, spawning more and more workers, until you scale your workers down to zero. This seems to be some kind of issue with the backend or the RunPod SDK.
This is why we abandoned the serverless vLLM worker. We are now using a custom TGI serverless worker that hasn't experienced this issue.
I'm going to try polling the health check for retries and cancelling the job if I see more than one or two retries.
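A minimal sketch of that idea, assuming the public RunPod endpoint routes `/status/{job_id}` and `/cancel/{job_id}`; the re-queue-counting heuristic and the `MAX_REQUEUES` threshold are assumptions for illustration, not anything the SDK exposes directly:

```python
import os
import time
import requests

API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

MAX_REQUEUES = 2      # assumed threshold: cancel after the job re-enters the queue this many times
POLL_INTERVAL_S = 10


def watch_and_cancel(job_id: str) -> None:
    """Poll a job's status and cancel it if it appears stuck in a retry loop."""
    requeues = 0
    last_status = None
    while True:
        status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS, timeout=30).json().get("status")
        if status in ("COMPLETED", "FAILED", "CANCELLED"):
            print(f"job {job_id} finished with status {status}")
            return
        # Heuristic: dropping back from IN_PROGRESS to IN_QUEUE suggests the worker
        # died and the job was requeued for a retry.
        if last_status == "IN_PROGRESS" and status == "IN_QUEUE":
            requeues += 1
            if requeues > MAX_REQUEUES:
                requests.post(f"{BASE}/cancel/{job_id}", headers=HEADERS, timeout=30)
                print(f"cancelled job {job_id} after {requeues} apparent retries")
                return
        last_status = status
        time.sleep(POLL_INTERVAL_S)
```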
@bartlettD Could you provide an example model and GPU model, please?
Same problem. My entire balance was wiped after using the build command.
Link? Because I've lost a lot of money trying to use this one.
Like @ashleykleynhans said, this is a problem with RunPod Serverless in general, not something specific to worker-vllm; the team is working on a solution. It seems like your endpoint was not working from the start, so in the future I'd recommend verifying that with at least one test request before leaving it running, to avoid getting your balance wiped. vLLM is faster than TGI, but it has a lot of moving parts, so you need to ensure that your deployment is successful, tweaking your configuration as necessary or reporting the issue if it's a bug in the worker.
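A minimal smoke-test sketch along those lines, assuming the standard `runsync` route and a `prompt`/`sampling_params` input shape for the worker (adjust to your deployment):

```python
import os
import requests

API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]

# Send one synchronous test request and fail loudly if the endpoint is broken,
# instead of leaving workers to retry and burn credits.
resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Say hello.", "sampling_params": {"max_tokens": 16}}},
    timeout=300,
)
resp.raise_for_status()
result = resp.json()
assert result.get("status") == "COMPLETED", f"endpoint not healthy: {result}"
print(result)
```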
It should exit on exception; that isn't impossible to implement. This used to work perfectly for a long time when only using vLLM's generate. The code should be tested before being tagged as a release.
@anthonyllx The latest commit fixes the error you're facing, thank you for reporting it.
We will be adding a maximum number of worker restarts and job length limits to RunPod Serverless next week; this should solve the issue.
Thank you. This would solve the problem.
@alpayariyak When will this be introduced? I cannot find a setting to configure it in the UI. I'm somewhat afraid to use serverless endpoints in prod scenarios until this is solved.
@gabewillen Could you please provide a link to the repo implementing the custom TGI worker?
@alpayariyak Just checking to see if this feature is now available, and if so, how do I enable it? Is it an environment variable?
The cause for this is identified and we are implementing a fix for it, which should be out by the end of next week. For now, you should know that this error will always happen when the handler code exits before running runpod.serverless.start(handler), which in turn mostly happens because of some error in the initialization phase. For example, in the stack trace you posted, @preemware, the error happened during initialization of the vLLM engine because of some missing config on the model. The fix is for RunPod's backend to monitor the handler process for completion and terminate the pod if that process completes either successfully or unsuccessfully.
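A rough sketch of the workaround this implies on the worker side: guard the fragile initialization so an engine-init failure surfaces as a failed job instead of the process exiting before `runpod.serverless.start` is ever called. This is an illustration only; `init_engine` and `run_inference` are hypothetical placeholders, not the worker's actual code:

```python
import runpod

llm_engine = None
init_error = None

try:
    # Placeholder for the expensive/fragile part, e.g. constructing the vLLM engine.
    llm_engine = init_engine()   # hypothetical helper, stands in for the real engine setup
except Exception as e:           # record any init failure instead of crashing before start()
    init_error = e


def handler(job):
    # If initialization failed, fail the job instead of letting the worker process die,
    # which would otherwise be restarted (and billed) over and over.
    if init_error is not None:
        return {"error": f"worker failed to initialize: {init_error}"}
    return {"output": run_inference(llm_engine, job["input"])}  # hypothetical inference call


runpod.serverless.start({"handler": handler})
```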
@DireLines Thank you for the update. Is it implemented now, and how does it work? Does that mean we need to wrap the vLLM initialization phase in a try-catch block and continue successfully, so that it will only fail once it reaches the handler?
I also have this issue; my balance was wiped out. @dannysemi, how did you implement the health check?
@DireLines any update?
It took longer than expected, but the logic flagging workers that fail during initialization as unhealthy is done and will be activated in the next release for one of our repos. It's already deployed, but only logging to us when it happens, so we can confirm it behaves as expected before flipping the switch.

Once released, workers that are flagged in this way will be shown as "unhealthy" on the serverless UI, and automatically stopped and then removed from the endpoint. New ones will scale up to take their place, which means the money drain is slowed but not stopped. This is because a failure during initialization can also be caused by a temporary outage of a dependency needed at import time, and we don't want a temporary outage to turn into a permanent one. In a later iteration, we will implement better retry logic so that the money drain is stopped completely, and figure out some alerting/notification so you, as the maintainer of an endpoint, can know when failures of this type happen.

Thanks for your patience. This is definitely bad behavior for serverless to exhibit and not at all an intended UX. I hope this prevents similar problems to what you've experienced in the future.
This change is now released.
Any error caused by the payload causes the instance to hang in an error state indefinitely. You have to manually terminate the instance, or you'll rack up a hefty bill if you have several running in an error state.