"Timed out waiting for the condition" when binding task volumes #4050
Comments
I think it's a good catch: a custom bind timeout setting makes sense, since the actual bind time depends on the underlying storage :) Actually, Volcano schedules and retries every reverted task very quickly, so a short bind timeout should not be a bottleneck by itself. I think the issue here is simply that the PV provisioning time is almost exactly the same as the bind timeout, so the bind attempt ends first and the task is then processed in the next scheduling cycle. I hope this information helps.
/milestone v1.12
Thanks @Monokaix !
Exactly, that was my impression as well, if the volume binding done by the external provisioner exceeds the 30s default timeout, then Volcano considers the volume binding to be unsuccessful, and retries again in the next scheduling session.
While we wait for v1.12, would it be advisable for us to change the bind timeout from 30s to a longer period? We can fork Volcano in our project and change it manually while we wait for the volume binding refactoring.
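As a rough sketch of what such a change could look like in a fork, the hard-coded value could be replaced by a setting read at startup. Everything here is an assumption for illustration: the flag name `--bind-timeout`, the function `parseBindTimeout`, and the constant `defaultBindTimeout` are hypothetical and are not part of Volcano's actual configuration; only the 30s default comes from this issue.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"time"
)

// defaultBindTimeout mirrors the 30s default discussed in this issue.
const defaultBindTimeout = 30 * time.Second

// parseBindTimeout reads a hypothetical --bind-timeout flag from args and
// falls back to defaultBindTimeout when the flag is absent.
func parseBindTimeout(args []string) (time.Duration, error) {
	fs := flag.NewFlagSet("scheduler", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep parse errors out of stderr; they are returned instead
	timeout := fs.Duration("bind-timeout", defaultBindTimeout,
		"how long to wait for task volumes to be bound (hypothetical flag)")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	// A zero or negative timeout would make every bind attempt fail immediately.
	if *timeout <= 0 {
		return 0, fmt.Errorf("bind-timeout must be positive, got %s", *timeout)
	}
	return *timeout, nil
}
```

With this sketch, `parseBindTimeout([]string{"--bind-timeout=2m"})` yields a two-minute timeout, and passing no arguments keeps the 30s default.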
I think it's OK.
Great scenario! Volume binding has to be refactored in this version, so this scenario is a good catch.
Please describe your problem in detail
Hello,
I'm experiencing an issue with Volcano version 1.8.0 when a PodGroup task references a PVC that has its volume dynamically created by a CSI driver.
Environment:
The PVC is created before the VolcanoJob is launched in the Kubernetes cluster. It uses a StorageClass that has the "WaitForFirstConsumer" policy.
Volcano seems to have a pretty aggressive timeout parameter when waiting for the volumes to be bound. After finding a node for the task, it waits for the volume to be bound, and times out a few seconds later. Here are the observed events:
1 - Volcano finds a node for the task, and starts the binding process:
2 - A Kubernetes event for "PV provisioning" is observed a few seconds after that, thanks to the "WaitForFirstConsumer" SC policy:
3 - At 11:18:20, Volcano reports a timeout for the volume binding process:
4 - But a few seconds later, at 11:18:23, we notice that the external volume provisioner (AWS EBS) was able to provision the volume:
If Volcano had waited a few seconds more, I believe the "binding volumes" step would have succeeded for the task. Notice that the external provisioner doesn't even take long to provision the PV, only ~9s.
5 - Thanks to this timeout, and also thanks to the number of VcJobs in the corresponding queue, fair scheduling, etc., Volcano "discards" the task, and only picks it up again several minutes later, at 11:43:58.
From what I understand, if the binding volume fails for the task, it gets unassigned from that node, and has to wait for another scheduling loop to be allocated to another node, is that correct?
6 - Finally, the pod can be started in the node after that second attempt.
My question is: can we tweak the "timeout parameter" that Volcano uses to assert the binding of task volumes? I checked the code, and it seems to be 30s by default:
volcano/pkg/scheduler/cache/cache.go
Line 587 in aa57168
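For illustration only, the behaviour described in this issue can be sketched with the Go standard library: poll an "is the volume bound?" check until it succeeds or a timeout elapses. This is not Volcano's code. `waitForBound`, `errBindTimeout`, and `provisionedAfter` are hypothetical names, the error text is simply the message from the issue title, and the durations are meant to be scaled down so the sketch runs quickly.

```go
package main

import (
	"errors"
	"time"
)

// errBindTimeout is a hypothetical sentinel error carrying the message
// reported in this issue's title.
var errBindTimeout = errors.New("timed out waiting for the condition")

// waitForBound polls isBound every interval until it reports true, returns
// an error, or the next poll would fall after the timeout.
func waitForBound(timeout, interval time.Duration, isBound func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		done, err := isBound()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		// Give up if there is no time left for another poll.
		if time.Now().Add(interval).After(deadline) {
			return errBindTimeout
		}
		time.Sleep(interval)
	}
}

// provisionedAfter simulates an external provisioner whose volume becomes
// bound once delay has elapsed since this function was called.
func provisionedAfter(delay time.Duration) func() (bool, error) {
	ready := time.Now().Add(delay)
	return func() (bool, error) {
		return !time.Now().Before(ready), nil
	}
}
```

Calling `waitForBound(30*time.Millisecond, time.Millisecond, provisionedAfter(500*time.Millisecond))` returns `errBindTimeout`, while a timeout longer than the simulated provisioning delay returns `nil`. That is the effect a larger bind timeout would be expected to have on the first scheduling attempt.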
Some further questions that I would like to check with Volcano experts:
Any other relevant information
No response