
ClearML Agent fails scheduling tasks in Kubernetes after updating to v1.9.3 #223

Open
dberkerdem opened this issue Jan 22, 2025 · 9 comments

@dberkerdem

After updating the ClearML Agent running in Kubernetes to v1.9.3, we started encountering the following error on newly scheduled tasks:

ERROR: Could not push back task [2667c614e10a46df90882ae3aa3ca7c8] to k8s pending queue k8s_scheduler [45bfd703d59245acbe3f3982fdb5f2d2], error: Validation error (Cannot skip setting execution queue for a task that is not enqueued or does not have execution queue set)

@mb-ii

mb-ii commented Jan 22, 2025

Seeing the same issue using Helm deployment (charts "clearml-7.14.1" and "clearml-agent-5.3.1")

@jkhenning
Member

Hi @dberkerdem and @mb-ii,

Can you share the exact server version/build from the UI's profile page?

@dberkerdem
Author

WebApp: 2.0.0-613 • Server: 2.0.0-613 • API: 2.31

@mb-ii

mb-ii commented Jan 23, 2025

WebApp: 2.0.0-613 • Server: 2.0.0-613 • API: 2.31

@mb-ii

mb-ii commented Jan 23, 2025

@jkhenning here's the full Helm configuration for easier debugging:

values.yaml
clearml:
  elasticsearch:
    enabled: false
  mongodb:
    enabled: false
  redis:
    enabled: false

  apiserver:
    replicaCount: 1
    service:
      nodePort: 30040
    resources:
      requests:
        memory: 500Mi
        cpu: "500m"
      limits:
        memory: 500Mi
    ingress:
      enabled: true
      hostName: "***"

  fileserver:
    replicaCount: 1
    service:
      nodePort: 30041
    resources:
      requests:
        memory: 500Mi
        cpu: "500m"
      limits:
        memory: 500Mi
    storage:
      data:
        class: clearml-fileserver
        size: 10Gi
      enabled: true
    ingress:
      enabled: true
      hostName: "***"

  webserver:
    replicaCount: 1
    service:
      nodePort: 30042
    resources:
      requests:
        memory: 500Mi
        cpu: "500m"
      limits:
        memory: 500Mi
    ingress:
      enabled: true
      hostName: "***"

  externalServices:
    elasticsearchConnectionString: "***"
    redisHost: "***"
    redisPort: 6379
    mongodbConnectionStringAuth: '***'
    mongodbConnectionStringBackend: '***'

clearml-agent:
  agentk8sglue:
    webServerUrlReference: "***"
    apiServerUrlReference: "***"
    fileServerUrlReference: "***"
  clearml:
    agentk8sglueKey: "GGS9F4M6XB2DXJ5AFT9F"
    agentk8sglueSecret: "2oGujVFhPfaozhpuz2GzQfA5OyxmMsR3WVJpsCR5hrgHFs20PO"
Chart.yaml

apiVersion: v2
name: clearml-dev
version: 0.0.0
dependencies:
  - name: clearml
    version: 7.14.2
    repository: https://clearml.github.io/clearml-helm-charts
  - name: clearml-agent
    version: 5.3.1
    repository: https://clearml.github.io/clearml-helm-charts

@mb-ii

mb-ii commented Jan 27, 2025

May I ask what was the last Helm chart version this was working with for you, @dberkerdem?

@dberkerdem
Author

dberkerdem commented Jan 27, 2025

I use 5.3.1, but the problem is not related to the chart version. It is caused by the agent dynamically installing the latest ClearML version on ClearML Agent startup. To solve it, set the following environment variable in the agent's environment (see behavior):

agentk8sglue:
  extraEnvs:
    - name: CLEARML_AGENT_UPDATE_VERSION
      value: "==1.9.2"
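For context, a sketch of how this override fits into the clearml-agent block of the values.yaml shown earlier (the *** placeholders stand for the same environment-specific URLs as above; the pin is a pip-style version specifier):

```yaml
# Sketch: pinning the auto-installed agent version, assuming the
# values.yaml layout shown earlier in this thread.
clearml-agent:
  agentk8sglue:
    webServerUrlReference: "***"   # environment-specific, as above
    apiServerUrlReference: "***"
    fileServerUrlReference: "***"
    extraEnvs:
      # pip-style version spec; stops the glue agent from pulling the latest (1.9.3)
      - name: CLEARML_AGENT_UPDATE_VERSION
        value: "==1.9.2"
```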

@mb-ii

mb-ii commented Jan 27, 2025

Thanks, it helped! @dberkerdem

@illuser-maker

I also hit this issue with clearml-agent 1.9.3. Downgrading to 1.9.2 helped. It seems to be a bug in version 1.9.3 :/
