Errors cause the instance to run indefinitely #29
Comments
Are you still facing this issue currently?
I had this issue yesterday. Used up all of my credits overnight.
I've seen this as well, but more from the perspective that if vLLM runs into an error, the worker just keeps retrying the job over and over. I can get this to happen if I do the following:
This is not a vLLM-specific thing; it happens when my other workers get errors too. They just keep running over and over, spawning more and more workers, until you scale your workers down to zero. This seems to be some kind of issue with the backend or the RunPod SDK.
This is why we abandoned the serverless vLLM worker. We are now using a custom TGI serverless worker that hasn't experienced this issue.
I'm going to try polling the health check for retries and cancelling the job if I see more than one or two retries.
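A minimal sketch of that idea, assuming the public RunPod endpoint routes `/status/{job_id}` and `/cancel/{job_id}`; the re-queue-counting heuristic and the `MAX_REQUEUES` threshold are assumptions for illustration, not anything the SDK exposes directly:

```python
import os
import time
import requests

API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]
BASE = f"https://api.runpod.ai/v2/{ENDPOINT_ID}"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

MAX_REQUEUES = 2      # assumed threshold: cancel after the job re-enters the queue this many times
POLL_INTERVAL_S = 10


def watch_and_cancel(job_id: str) -> None:
    """Poll a job's status and cancel it if it appears stuck in a retry loop."""
    requeues = 0
    last_status = None
    while True:
        status = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS, timeout=30).json().get("status")
        if status in ("COMPLETED", "FAILED", "CANCELLED"):
            print(f"job {job_id} finished with status {status}")
            return
        # Heuristic: dropping back from IN_PROGRESS to IN_QUEUE suggests the worker
        # died and the job was requeued for a retry.
        if last_status == "IN_PROGRESS" and status == "IN_QUEUE":
            requeues += 1
            if requeues > MAX_REQUEUES:
                requests.post(f"{BASE}/cancel/{job_id}", headers=HEADERS, timeout=30)
                print(f"cancelled job {job_id} after {requeues} apparent retries")
                return
        last_status = status
        time.sleep(POLL_INTERVAL_S)
```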
@bartlettD Could you provide an example model and GPU model, please?
Same problem. My entire balance was wiped after using the build command.
Link? Because I've lost a lot of money trying to use this one.
Like @ashleykleynhans said, this is a problem with RunPod Serverless in general, not something specific to worker-vllm; the team is working on a solution. It seems like your endpoint was not working from the start, so in the future I'd recommend verifying that with at least one test request before leaving it running, to avoid getting your balance wiped. vLLM is faster than TGI, but it has a lot of moving parts, so you need to ensure that your deployment is successful, tweaking your configuration as necessary or reporting the issue if it's a bug in the worker.
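A minimal smoke-test sketch along those lines, assuming the standard `runsync` route and a `prompt`/`sampling_params` input shape for the worker (adjust to your deployment):

```python
import os
import requests

API_KEY = os.environ["RUNPOD_API_KEY"]
ENDPOINT_ID = os.environ["RUNPOD_ENDPOINT_ID"]

# Send one synchronous test request and fail loudly if the endpoint is broken,
# instead of leaving workers to retry and burn credits.
resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"prompt": "Say hello.", "sampling_params": {"max_tokens": 16}}},
    timeout=300,
)
resp.raise_for_status()
result = resp.json()
assert result.get("status") == "COMPLETED", f"endpoint not healthy: {result}"
print(result)
```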
It should exit on exception; that isn't impossible to implement. This used to work perfectly for a long time when only using vLLM's generate. The code should be tested before being tagged as a release.
@anthonyllx The latest commit fixes the error you're facing, thank you for reporting it.
We will be adding a maximum number of worker restarts and job length limits to RunPod Serverless next week; this should solve the issue.
Thank you. This would solve the problem.
@alpayariyak When will this be introduced? I cannot find a setting to configure it in the UI. I'm somewhat afraid to use serverless endpoints in prod scenarios until this is solved.
@gabewillen Could you please provide a link to the repo implementing the custom TGI worker?
@alpayariyak Just checking to see if this feature is now available, and if so, how do I enable it? Is it an environment variable?
The cause for this is identified and we are implementing a fix for it, which should be out by the end of next week. For now, you should know that this error will always happen when the handler code exits before running runpod.serverless.start(handler), which in turn mostly happens because of some error in the initialization phase. For example, in the stack trace you posted, @preemware, the error happened during initialization of the vLLM engine because of some missing config on the model. The fix is for RunPod's backend to monitor the handler process for completion and terminate the pod if that process completes either successfully or unsuccessfully.
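A rough sketch of the workaround this implies on the worker side: guard the fragile initialization so an engine-init failure surfaces as a failed job instead of the process exiting before `runpod.serverless.start` is ever called. This is an illustration only; `init_engine` and `run_inference` are hypothetical placeholders, not the worker's actual code:

```python
import runpod

llm_engine = None
init_error = None

try:
    # Placeholder for the expensive/fragile part, e.g. constructing the vLLM engine.
    llm_engine = init_engine()   # hypothetical helper, stands in for the real engine setup
except Exception as e:           # record any init failure instead of crashing before start()
    init_error = e


def handler(job):
    # If initialization failed, fail the job instead of letting the worker process die,
    # which would otherwise be restarted (and billed) over and over.
    if init_error is not None:
        return {"error": f"worker failed to initialize: {init_error}"}
    return {"output": run_inference(llm_engine, job["input"])}  # hypothetical inference call


runpod.serverless.start({"handler": handler})
```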
@DireLines Thank you for the update. Is it implemented now, and how does it work? Does that mean we need to wrap the vLLM initialization phase in a try-catch block and continue successfully, so that it will only fail once it reaches the handler?
I also have this issue; my balance was wiped out. @dannysemi, how did you implement the health check?
@DireLines any update?
It took longer than expected, but the logic flagging workers that fail during initialization as unhealthy is done and will be activated in the next release for one of our repos. It's already deployed, but only logging to us when it happens, so we can confirm it behaves as expected before flipping the switch.

Once released, workers that are flagged in this way will be shown as "unhealthy" on the serverless UI, and automatically stopped and then removed from the endpoint. New ones will scale up to take their place, which means the money drain is slowed but not stopped. This is because a failure during initialization can also be caused by a temporary outage of a dependency needed at import time, and we don't want a temporary outage to turn into a permanent one. In a later iteration, we will implement better retry logic so that the money drain is stopped completely, and figure out some alerting/notification so you, as the maintainer of an endpoint, can know when failures of this type happen.

Thanks for your patience. This is definitely bad behavior for serverless to exhibit and not at all an intended UX. I hope this prevents similar problems to what you've experienced in the future.
This change is now released.
Any error caused by the payload causes the instance to hang in an error state indefinitely. You have to manually terminate the instance, or you'll rack up a hefty bill if you have several running in an error state.