V1: Error syncing deployment - Operation cannot be fulfilled on replicasets.apps #5082

Open
stephen37 opened this issue Aug 10, 2023 · 3 comments
@stephen37
Contributor

Describe the bug

When deploying a new SeldonDeployment and expecting the pods to be restarted with the new version, not all pods are recreated with the new version, and kube-controller-manager complains with:

"Error syncing deployment" deployment="analytics/name_of_the_sdep" err="Operation cannot be fulfilled on replicasets.apps \"pod-name-7894b6f9dd\": the object has been modified; please apply your changes to the latest version and try again"

I haven't checked with Seldon Core V2. This is on Seldon Core V1, and it has been happening since Seldon Core 1.16.0.
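
For context, the quoted message is the Kubernetes API server's generic optimistic-concurrency error (HTTP 409 Conflict): a writer submitted an update carrying a `resourceVersion` older than the one currently stored. The toy model below only illustrates that mechanism and the usual read-modify-write retry. It is a sketch, not the Seldon operator's or kube-controller-manager's code, and `Store`, `update_with_retry`, and `competing_writer` are hypothetical names invented for the illustration.

```python
class Conflict(Exception):
    """Stands in for the API server's 409 Conflict response."""


class Store:
    """Toy single-object store that checks resourceVersion on every write."""

    def __init__(self, spec):
        self.spec = dict(spec)
        self.resource_version = 1

    def get(self):
        # Return a copy plus the version it was read at.
        return dict(self.spec), self.resource_version

    def update(self, spec, resource_version):
        # Reject writes that were based on a stale read.
        if resource_version != self.resource_version:
            raise Conflict(
                "the object has been modified; please apply your changes "
                "to the latest version and try again"
            )
        self.spec = dict(spec)
        self.resource_version += 1


def update_with_retry(store, mutate, attempts=5, before_write=None):
    """Re-read and re-apply `mutate` until the write is accepted."""
    for attempt in range(1, attempts + 1):
        spec, version = store.get()
        mutate(spec)
        if before_write is not None:
            before_write(attempt)  # hook used only to simulate a second writer
        try:
            store.update(spec, version)
            return attempt
        except Conflict:
            continue
    raise Conflict(f"still conflicting after {attempts} attempts")


store = Store({"replicas": 3, "image": "model:v1"})


def competing_writer(attempt):
    # Hypothetical second writer that changes the object once,
    # between our read and our write.
    if attempt == 1:
        spec, version = store.get()
        spec["replicas"] = 6
        store.update(spec, version)


attempts_used = update_with_retry(
    store,
    lambda spec: spec.update(image="model:v2"),
    before_write=competing_writer,
)
print(attempts_used, store.spec)
```

A single conflict of this kind is normally harmless because the writer retries against the latest version. The symptom reported here is that some pods never get replaced, which the sketch does not explain.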

To reproduce

  1. Deploy a model in K8s with multiple pods running
  2. Redeploy the model with a change, for example a different Docker image tag
  3. Observe that not all pods are updated
  4. kube-controller-manager logs the error mentioned above

Expected behaviour

All pods should be updated to the latest version of the Docker image, but that is not the case:

> kubectl get pods | grep <name>
name-0-main-7894b6f9dd-2nqs7            2/2     Running       0              31h
name-0-main-7894b6f9dd-8knxc            2/2     Running       0              21m
name-0-main-7894b6f9dd-b7d9m            2/2     Running       0              31h
name-0-main-7894b6f9dd-b9tlr            2/2     Running       0              31h
name-0-main-7894b6f9dd-ddsp9            2/2     Running       0              21m
name-0-main-7894b6f9dd-dlq9j            2/2     Running       0              21m

They should all be 21m old, but for some of them the deployment sync has errored.
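
The check described above (every pod should be no older than the rollout) can be scripted against the pasted listing. This is a minimal sketch: `age_to_seconds` and `stale_pods` are hypothetical helper names, and it assumes the age is the last whitespace-separated column, as in the default `kubectl get pods` output.

```python
import re

# Listing copied from the output above.
PODS_OUTPUT = """\
name-0-main-7894b6f9dd-2nqs7            2/2     Running       0              31h
name-0-main-7894b6f9dd-8knxc            2/2     Running       0              21m
name-0-main-7894b6f9dd-b7d9m            2/2     Running       0              31h
name-0-main-7894b6f9dd-b9tlr            2/2     Running       0              31h
name-0-main-7894b6f9dd-ddsp9            2/2     Running       0              21m
name-0-main-7894b6f9dd-dlq9j            2/2     Running       0              21m
"""

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def age_to_seconds(age):
    """Convert an age such as '21m', '31h' or '3m47s' to seconds."""
    parts = re.findall(r"(\d+)([smhd])", age)
    if not parts or "".join(num + unit for num, unit in parts) != age:
        raise ValueError(f"unrecognised age: {age!r}")
    return sum(int(num) * _UNIT_SECONDS[unit] for num, unit in parts)


def stale_pods(output, rollout_age):
    """Return the names of pods older than the given rollout age."""
    limit = age_to_seconds(rollout_age)
    stale = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # blank or unexpected line
        try:
            age = age_to_seconds(fields[-1])
        except ValueError:
            continue  # e.g. the NAME/READY/STATUS header row
        if age > limit:
            stale.append(fields[0])
    return stale


for pod in stale_pods(PODS_OUTPUT, "21m"):
    print(pod)
```

With the listing above, the three pods reported as 31h old are flagged.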

Environment

  • Cloud Provider: EKS
  • Kubernetes Cluster Version: 1.24 / 1.25
  • Deployed Seldon System: 1.16.0

Model Details

  • Images of your model: [Output of: kubectl get seldondeployment -n <yourmodelnamespace> <seldondepname> -o yaml | grep image: where <yourmodelnamespace>]

  • Logs of your model: [You can get the logs of your model by running kubectl logs -n <yourmodelnamespace> <seldonpodname> <container>]

  • Logs of seldon-controller-manager:

{"volumes":[{"name":"regressor-provision-location","emptyDir":{}},{"name":"seldon-podinfo","downwardAPI":{"items":[{"path":"annotations","fieldRef":{"apiVersion":"v1","fieldPath":"metadata.annotations"}}],"defaultMode":420}}],"initContainers":[{"name":"regressor-model-initializer","image":"kserve/storage-initializer:v0.9.0","args":["s3://amazing-bucket-name/amazing-model-name/amazing-model-name/2023-08-07T01-30-00/","/mnt/models"],"resources":{},"volu
{"level":"info","ts":1691681150.4964962,"logger":"controllers.SeldonDeployment","msg":"The deployments are the same - api server defaults ignored","SeldonDeployment":"analytics/amazing-model-name"}
{"level":"info","ts":1691681150.4965105,"logger":"controllers.SeldonDeployment","msg":"Found identical deployment","SeldonDeployment":"analytics/amazing-model-name","namespace":"analytics","name":"amazing-model-name-prediction-pipeline-0-regressor","status":{"observedGeneration":93042,"replicas":3,"updatedReplicas":3,"readyReplicas":3,"availableReplicas":3,"conditions":[{"type":"Available","status":"True","lastUpdateTime":"2023-08-09T07:21:30Z","lastTransitionTime":"2023-08-09T07
{"level":"info","ts":1691681150.496532,"logger":"controllers.SeldonDeployment","msg":"Deployment status","SeldonDeployment":"analytics/amazing-model-name","name":"amazing-model-name-prediction-pipeline-0-regressor","status":{"observedGeneration":93042,"replicas":3,"updatedReplicas":3,"readyReplicas":3,"availableReplicas":3,"conditions":[{"type":"Available","status":"True","lastUpdateTime":"2023-08-09T07:21:30Z","lastTransitionTime":"2023-08-09T07:21:30Z","reason":"MinimumReplicas
{"level":"info","ts":1691681150.4965816,"logger":"controllers.SeldonDeployment","msg":"Found identical Service","SeldonDeployment":"analytics/amazing-model-name","all":true,"namespace":"analytics","name":"amazing-model-name-prediction-pipeline-regressor","status":{"loadBalancer":{}}}
{"level":"info","ts":1691681150.496616,"logger":"controllers.SeldonDeployment","msg":"Found identical Service","SeldonDeployment":"analytics/amazing-model-name","all":true,"namespace":"analytics","name":"amazing-model-name-prediction-pipeline","status":{"loadBalancer":{}}}
{"level":"info","ts":1691681150.496658,"logger":"controllers.SeldonDeployment","msg":"Removing unused services","SeldonDeployment":"analytics/amazing-model-name"}
{"level":"info","ts":1691681150.4968553,"logger":"controllers.SeldonDeployment","msg":"Reconcile called","SeldonDeployment":"analytics/ml-prep-times"}
{"level":"info","ts":1691681150.49691,"logger":"seldondeployment","msg":"Defaulting Seldon Deployment called","name":"ml-prep-times"}
{"level":"info","ts":1691681150.4969313,"logger":"controllers.SeldonDeployment","msg":"pSvcName","SeldonDeployment":"analytics/ml-prep-times","val":"ml-prep-times-graph-main"}
{"level":"info","ts":1691681150.4972584,"logger":"controllers.SeldonDeployment","msg":"Found identical Service","SeldonDeployment":"analytics/ml-prep-times","all":false,"namespace":"analytics","name":"ml-prep-times-graph-main-main","status":{"loadBalancer":{}}}
{"level":"info","ts":1691681150.4972935,"logger":"controllers.SeldonDeployment","msg":"Found identical Service","SeldonDeployment":"analytics/ml-prep-times","all":false,"namespace":"analytics","name":"ml-prep-times-graph-main","status":{"loadBalancer":{}}}
{"level":"info","ts":1691681150.4973407,"logger":"controllers.SeldonDeployment","msg":"Found identical HPA","SeldonDeployment":"analytics/ml-prep-times","namespace":"analytics","name":"ml-prep-times-graph-main-0-main","status":{"lastScaleTime":"2023-08-09T07:11:01Z","currentReplicas":2,"desiredReplicas":2,"currentMetrics":[{"type":"Resource","resource":{"name":"memory","current":{"averageValue":"429568k","averageUtilization":40}}},{"type":"Resource","resource":{"name":"cpu","current":{
{"level":"info","ts":1691681150.49742,"logger":"controllers.SeldonDeployment","msg":"Scheme","SeldonDeployment":"analytics/ml-prep-times","r.scheme":{}}
{"level":"info","ts":1691681150.4974318,"logger":"controllers.SeldonDeployment","msg":"createDeployments","SeldonDeployment":"analytics/ml-prep-times","deploy":{"namespace":"analytics","name":"ml-prep-times-graph-main-0-main"}}
{"level":"info","ts":1691681150.5027092,"logger":"controllers.SeldonDeployment","msg":"Updating Deployment","SeldonDeployment":"analytics/ml-prep-times","namespace":"analytics","name":"ml-prep-times-graph-main-0-main"}
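
Logs like the ones above are easier to read when grouped by SeldonDeployment and message. The following is a rough sketch, with `summarize` as a hypothetical helper name: it counts `(SeldonDeployment, msg)` pairs and skips any line that is not valid JSON, since several lines in the paste are cut off. The sample reuses three complete lines from the logs and one line deliberately cut short.

```python
import json
from collections import Counter

SAMPLE_LINES = [
    '{"level":"info","ts":1691681150.4972584,"logger":"controllers.SeldonDeployment","msg":"Found identical Service","SeldonDeployment":"analytics/ml-prep-times","all":false,"namespace":"analytics","name":"ml-prep-times-graph-main-main","status":{"loadBalancer":{}}}',
    '{"level":"info","ts":1691681150.4972935,"logger":"controllers.SeldonDeployment","msg":"Found identical Service","SeldonDeployment":"analytics/ml-prep-times","all":false,"namespace":"analytics","name":"ml-prep-times-graph-main","status":{"loadBalancer":{}}}',
    # Deliberately cut short, like the truncated lines in the paste.
    '{"level":"info","ts":1691681150.4973407,"logger":"controllers.SeldonDeployment","msg":"Found identical HPA","SeldonDeployment":"analytics/ml-prep-times","namespace":"analytics","name":"ml-prep-times-graph-main-0-main","status":{"lastScaleTime":"2023-08-09T07:11:01Z"',
    '{"level":"info","ts":1691681150.5027092,"logger":"controllers.SeldonDeployment","msg":"Updating Deployment","SeldonDeployment":"analytics/ml-prep-times","namespace":"analytics","name":"ml-prep-times-graph-main-0-main"}',
]


def summarize(lines):
    """Count (SeldonDeployment, msg) pairs; return the counts and the number of skipped lines."""
    counts = Counter()
    skipped = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1  # truncated or non-JSON line
            continue
        if not isinstance(entry, dict):
            skipped += 1
            continue
        key = (entry.get("SeldonDeployment", "?"), entry.get("msg", "?"))
        counts[key] += 1
    return counts, skipped


counts, skipped = summarize(SAMPLE_LINES)
for (sdep, msg), n in counts.most_common():
    print(f"{n:4d}  {sdep}  {msg}")
print(f"skipped {skipped} unparseable line(s)")
```

Run over a full log file, a high count of "Updating Deployment" for a SeldonDeployment whose spec has not changed would point at a reconcile loop worth investigating.
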
@stephen37 stephen37 added the bug label Aug 10, 2023
@stephen37 stephen37 changed the title Error syncing deployment - Operation cannot be fulfilled on replicasets.apps V1: Error syncing deployment - Operation cannot be fulfilled on replicasets.apps Aug 10, 2023
@ukclivecox ukclivecox added the v1 label Sep 7, 2023
@Vavinash-github

Hi @stephen37 @cliveseldon, I'm facing a similar issue: old pods are not getting deleted when new pods are up. Is this issue resolved in later versions of Seldon?

@stephen37
Contributor Author

Hey,

The solution we found is to always define the number of replicas in the SeldonDeployment. That way the pods are always updated.
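
A quick way to apply this workaround consistently is to check manifests before deploying them. The sketch below assumes the predictor-level `replicas` field of the v1 SeldonDeployment spec is the one meant, which the comment above does not spell out. `predictors_without_replicas` is a hypothetical helper, and the manifest names are made up for illustration.

```python
def predictors_without_replicas(manifest):
    """Return the names of predictors that do not set `replicas` explicitly."""
    predictors = manifest.get("spec", {}).get("predictors", [])
    return [
        predictor.get("name", "<unnamed>")
        for predictor in predictors
        if "replicas" not in predictor
    ]


# Hypothetical manifest, already loaded into a dict (for example from YAML).
manifest = {
    "apiVersion": "machinelearning.seldon.io/v1",
    "kind": "SeldonDeployment",
    "metadata": {"name": "example-model", "namespace": "analytics"},
    "spec": {
        "predictors": [
            {
                "name": "default",
                # Assumption: no "replicas" key here, the case the workaround avoids.
                "graph": {"name": "main", "type": "MODEL"},
            }
        ]
    },
}

print(predictors_without_replicas(manifest))

# Applying the workaround: pin the replica count explicitly.
manifest["spec"]["predictors"][0]["replicas"] = 3
print(predictors_without_replicas(manifest))
```

Note that a later comment in this thread reports the problem persisting even with a fixed replica count, so this check covers the workaround only.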

@bcvanmeurs

bcvanmeurs commented Mar 8, 2024

We have a very similar issue with the same logs: the controller keeps saying that the deployments are the same and then tries to reconcile anyway. Nothing actually seems to happen and the model seems fine, though Argo CD says the deployment is stuck. I don't fully understand what is going on; the deployment is similar to all our other deployments. We have set the number of replicas to a fixed number, but it still happens. Any thoughts on why the operator might think there are duplicate deployments and services?

When I describe the SeldonDeployment I see this:

  Type    Reason   Age                        From                       Message
  ----    ------   ----                       ----                       -------
  Normal  Updated  3m47s (x1451532 over 20h)  seldon-controller-manager  Updated SeldonDeployment "xxx"
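
To put the event count in perspective, the Age column can be turned into an average rate. This is a small sketch with hypothetical helper names (`age_to_seconds`, `event_rate`). The "over 20h" window is rounded by kubectl, so the result is approximate.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def age_to_seconds(age):
    """Convert an age such as '20h' or '3m47s' to seconds."""
    parts = re.findall(r"(\d+)([smhd])", age)
    if not parts or "".join(num + unit for num, unit in parts) != age:
        raise ValueError(f"unrecognised age: {age!r}")
    return sum(int(num) * _UNIT_SECONDS[unit] for num, unit in parts)


def event_rate(age_column):
    """Average events per second from an Age value like '3m47s (x1451532 over 20h)'."""
    match = re.search(r"\(x(\d+) over (\w+)\)", age_column)
    if match is None:
        raise ValueError(f"no repeat count in: {age_column!r}")
    count = int(match.group(1))
    window = age_to_seconds(match.group(2))
    return count / window


rate = event_rate("3m47s (x1451532 over 20h)")
print(f"{rate:.2f} events per second")
```

For the event above this works out to roughly 20 "Updated" events per second, sustained over 20 hours, which is consistent with a tight reconcile loop rather than occasional updates.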

edit: I did find the issue and reported it here: #5435


4 participants