container "pytorch" is waiting to start: PodInitializing #348
Comments
Could you please run
Can you show more about it? Especially the events section.
Seems that the init container is pending. Can you show its log?
Can you try kubectl debug to run an ephemeral container, then run
It's weird.
I put the program to sleep for a while and found that the worker can run. Is there any restriction on the creation order of service and pod in pytorchjob?
The likely cause is that the master finishes too quickly, so the service's endpoints (ep) end up as none and the worker cannot obtain the master's IP address.
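To illustrate the failure mode described above, here is a minimal sketch of a worker-side wait loop. It assumes the worker's init step simply polls DNS for the master's service name until it resolves; the function name `wait_for_master` and its parameters are hypothetical and are not taken from pytorch-operator.

```python
import socket
import time


def wait_for_master(host, resolve=socket.gethostbyname, attempts=30,
                    delay=2.0, sleep=time.sleep):
    """Poll until `host` resolves; return its IP, or None if it never does.

    Hypothetical helper: `resolve` and `sleep` are injectable so the loop
    can be exercised without real DNS or real waiting.
    """
    for _ in range(attempts):
        try:
            return resolve(host)
        except OSError:
            # socket.gaierror is a subclass of OSError: the name is not
            # resolvable (yet), so wait and try again.
            sleep(delay)
    return None


# Example call (the service name is a placeholder):
#   ip = wait_for_master("xxx-jxosi-master-0")
```

If the master has already completed and its endpoints are gone by the time a loop like this starts, the name never resolves and the worker stays in PodInitializing, which would match the behaviour reported here.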
Interesting. /cc @johnugeorge
But in that case, the master should not start the job until the workers are up. Are you actually using the distributed setup in your code?
I did not use distributed steps in the code. After the master runs, it goes to the Completed state. I ran "kubectl get ep -ntest" and found that the ep is none.
If the code does not sleep, the master is only briefly in the Running state, while the worker is still in the Init state.
If you are not using distributed PyTorch in the code, this can happen. The master can start executing and complete before the worker starts. Can you confirm whether you are using the distributed APIs?
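For reference, a minimal sketch of what using the distributed APIs could look like. It assumes the operator injects `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE` and `RANK` into each pod; the helper `rendezvous_settings`, the fallback defaults and the `gloo` backend are illustrative choices, not something taken from this thread.

```python
import os


def rendezvous_settings(env=os.environ):
    """Collect the rendezvous variables the operator is assumed to inject.

    The fallback values are placeholders for running outside the cluster.
    """
    return {
        "master_addr": env.get("MASTER_ADDR", "localhost"),
        "master_port": int(env.get("MASTER_PORT", "23456")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "rank": int(env.get("RANK", "0")),
    }


def main():
    # Imported here so the helper above can be used without PyTorch installed.
    import torch.distributed as dist

    s = rendezvous_settings()
    # Blocks until all `world_size` processes have joined (or a timeout hits),
    # so the master cannot run to completion before the workers are up.
    dist.init_process_group(
        backend="gloo",
        init_method=f"tcp://{s['master_addr']}:{s['master_port']}",
        rank=s["rank"],
        world_size=s["world_size"],
    )
    try:
        dist.barrier()  # every rank waits here for the others
        # ... training code would go here ...
    finally:
        dist.destroy_process_group()
```

With a rendezvous like this in the training script, the master waits at `init_process_group` instead of exiting early, so the ordering problem discussed in this thread should not arise.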
That is the issue. Is there any reason for using pytorch-operator without the distributed version? Example:
How do I keep the pods created by a pytorchjob from being removed automatically after completion? Is that done with cleanpolicy? How do I set it up? Thanks.
When the master is finished running, the worker is still initializing.
worker log:
Error from server (BadRequest): container "pytorch" in pod "xxx-jxosi-worker-0" is waiting to start: PodInitializing
What is the reason for this?