
Orphaned allocations when stopping a job with failures #8475

Closed
threemachines opened this issue Jul 20, 2020 · 4 comments

threemachines commented Jul 20, 2020

Nomad version

Nomad v0.12.0 (8f7fbc8)

Operating system and Environment details

Ubuntu 18.04 on AWS. Nomad clients are m5.large instances. We have six clients, but I don't think that matters due to the default binpacking placement strategy.

Issue

After deploying a job in which some tasks are stuck failing, and then stopping that job, some allocations are left running. The job status is dead and the allocations' DesiredStatus is "stop", but these orphaned allocations have not actually been stopped (one way to spot them is sketched below).
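A sketch of that check, assuming curl and jq against the default local API address; it lists allocations the scheduler wants stopped but the client still reports as running:

  # Allocations with DesiredStatus "stop" whose ClientStatus is still "running".
  curl -s http://127.0.0.1:4646/v1/job/example/allocations \
    | jq -r '.[] | select(.DesiredStatus == "stop" and .ClientStatus == "running") | .ID'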

Reproduction steps

  1. Deploy a job where some tasks will consistently fail. In this case, we achieve this by deploying a lot of small allocations, such that they mostly get placed on the same client, and some of the allocations fail due to Linux open file limits. I don't imagine the failure mechanism would matter, but who knows?
  2. Stop the job.
  3. Check the remaining allocations. (Forcing GC makes it easier to find them; a CLI sketch follows this list.)
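
A rough CLI version of those steps, assuming the job file below is saved as example.nomad:

  nomad job run example.nomad   # step 1: deploy; some allocs fail on open-file limits
  nomad job stop example        # step 2: stop the job
  nomad system gc               # force garbage collection so orphans are easier to find
  nomad job status example      # step 3: orphaned allocations still show as running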

Job file (if appropriate)

This is basically just the example redis job, except with a much higher count, and lower allocation requirements (so more can be packed into a single client).

job "example" {
  datacenters = ["sandbox"]
  type = "service"
  update {
    max_parallel = 50
    min_healthy_time = "10s"
    healthy_deadline = "3m"
    progress_deadline = "10m"
    auto_revert = false
    canary = 0
  }
  migrate {
    max_parallel = 50
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }
  group "cache" {
    count = 300
    restart {
      attempts = 2
      interval = "30m"
      delay = "15s"
      mode = "fail"
    }
    task "redis" {
      driver = "docker"
      config {
        image = "redis:3.2"
        port_map {
          db = 6379
        }
      }
      resources {
        cpu    = 20
        memory = 10
        network {
          mbits = 1
          port "db" {}
        }
      }
      service {
        name = "redis-cache"
        tags = ["global", "cache"]
        port = "db"
        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

Nomad logs (if appropriate)

Will send to support email.

notnoop (Contributor) commented Jul 21, 2020

Thank you so much for reaching out! I suspect this is a manifestation of issue #6557. Do you only observe this when the allocs are failing due to host resource utilization (e.g. file limits)? What about a consistently failing job (e.g. one that always exits 1)?

threemachines (Author) commented

You're right, that sounds almost exactly like #6557.

I don't think my issue could be reproduced if the job always fails, since it needs some successfully running instances to be orphaned, but I might try writing a toy app that has a 50% chance to fail on start and see if that helps us.
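
Such a toy app could be as small as a wrapper entrypoint like this (a hypothetical sketch; redis-server just stands in for any long-running process):

  #!/usr/bin/env bash
  # Coin-flip entrypoint: exits nonzero on start about half the time,
  # otherwise execs a long-lived process so some instances run successfully.
  if [ $((RANDOM % 2)) -eq 0 ]; then
    echo "simulated startup failure" >&2
    exit 1
  fi
  exec redis-server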

notnoop (Contributor) commented Jul 28, 2020

Thank you again for the report! I'll close this ticket as a duplicate, then.

notnoop closed this as completed Jul 28, 2020
github-actions bot commented Nov 4, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 4, 2022