This repository has been archived by the owner on Sep 19, 2022. It is now read-only.
Unable to spawn PyTorchJob due to image alpine dependency of pytorch-operator #319
Hi @gaocegege, is there a plan to look at or comment on this issue?
Yeah, we are trying to use Amazon's new public Docker registry. Ref kubeflow/training-operator#1205
Hi @Jeffwan, I am still getting the alpine image not found error when we apply a PyTorchJob yaml, even with the kubeflow 1.3.0 manifests.
I applied this PyTorchJob yaml. I also used the kubeflow 1.3.0 manifests and kustomize to generate the pytorch-operator CRDs and operator yamls, and applied them. The pytorch-operator logs show that the operator is running fine.
Hi Team,
I am trying to use the PyTorch Operator to spawn distributed PyTorch jobs. The image mentioned in pytorch-operator/manifests/kustomization.yaml (line 13 in 6293efc) is 809251082950.dkr.ecr.us-west-2.amazonaws.com/pytorch-operator. However, that repo is not accessible from inside our network, so I switched to gcr.io/kubeflow-images-public/pytorch-operator:latest instead.
I cloned this pytorch-operator repo and generated the pytorch operator using kustomize build manifests/ | kubectl apply -f - which generates the following yaml (I also customized the namespace). I applied the above yaml and verified that the operator is running successfully.
I then applied the following yaml to create a distributed PyTorchJob.
I see the worker pods failing with ImagePullBackOff errors:
Failed to pull image "alpine:3.10": rpc error: code = Unknown desc = Error reading manifest 3.10 in OUR_AWS_ACCOUNT.dkr.ecr.us-west-2.amazonaws.com/alpine: name unknown: The repository with name 'alpine' does not exist in the registry with id 'OUR_AWS_ACCOUNT'
Since the Docker images are fully materialized, why would it fail looking for alpine:3.10?
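One way to narrow this down is to list every image a failing worker pod references, including init containers, since a pod can pull images that never appear in the PyTorchJob yaml. The sketch below is not from this thread: it assumes (without confirmation here) that the alpine image comes from a container the operator adds to the worker pod, and the sample manifest, container names, and training image are hypothetical placeholders. Only the spec.initContainers and spec.containers fields are standard Kubernetes pod fields.

```python
def list_pod_images(pod):
    """Return (section, container_name, image) for every container in a pod manifest."""
    spec = pod.get("spec", {})
    rows = []
    # Init containers are pulled and run before the main containers,
    # so a missing init image blocks the pod with ImagePullBackOff.
    for section in ("initContainers", "containers"):
        for container in spec.get(section) or []:
            rows.append((section, container.get("name"), container.get("image")))
    return rows


# Hypothetical worker pod manifest, for illustration only.
# The names and the training image are assumptions, not taken from the issue.
SAMPLE_POD = {
    "metadata": {"name": "example-worker-0"},
    "spec": {
        "initContainers": [
            {"name": "init-pytorch", "image": "alpine:3.10"},
        ],
        "containers": [
            {"name": "pytorch", "image": "example.com/my-training-image:latest"},
        ],
    },
}

if __name__ == "__main__":
    for section, name, image in list_pod_images(SAMPLE_POD):
        print(f"{section}: {name} -> {image}")
```

To run this against a real pod, the manifest could be obtained with kubectl get pod POD_NAME -o json and loaded with json.load in place of SAMPLE_POD. If alpine:3.10 shows up under initContainers, the image would need to be reachable from the cluster's registry, or the operator would need to be configured to use a different image.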