
Orphaned allocations when stopping a job with failures #8475

Closed
threemachines opened this issue Jul 20, 2020 · 4 comments

threemachines commented Jul 20, 2020

Nomad version

Nomad v0.12.0 (8f7fbc8)

Operating system and Environment details

Ubuntu 18.04 on AWS. Nomad clients are m5.large instances. We have six clients, but I don't think that matters due to the default binpacking placement strategy.

Issue

After deploying a job in which some tasks are stuck failing, and then stopping that job, some allocations are left running. The job status is dead and the allocations' DesiredStatus is "stop", but these orphaned allocations have not actually been stopped (one way to spot them is sketched below).
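A sketch of that check, assuming curl and jq against the default local API address; it lists allocations the scheduler wants stopped but the client still reports as running:

  # Allocations with DesiredStatus "stop" whose ClientStatus is still "running".
  curl -s http://127.0.0.1:4646/v1/job/example/allocations \
    | jq -r '.[] | select(.DesiredStatus == "stop" and .ClientStatus == "running") | .ID'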

Reproduction steps

  1. Deploy a job where some tasks will consistently fail. In this case, we achieve this by deploying a lot of small allocations, such that they mostly get placed on the same client, and some of the allocations fail due to Linux open file limits. I don't imagine the failure mechanism would matter, but who knows?
  2. Stop the job.
  3. Check the remaining allocations. (Forcing GC makes it easier to find them; a CLI sketch follows this list.)
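
A rough CLI version of those steps, assuming the job file below is saved as example.nomad:

  nomad job run example.nomad   # step 1: deploy; some allocs fail on open-file limits
  nomad job stop example        # step 2: stop the job
  nomad system gc               # force garbage collection so orphans are easier to find
  nomad job status example      # step 3: orphaned allocations still show as running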

Job file (if appropriate)

This is basically just the example redis job, except with a much higher count, and lower allocation requirements (so more can be packed into a single client).

job "example" {
  datacenters = ["sandbox"]
  type = "service"
  update {
    max_parallel = 50
    min_healthy_time = "10s"
    healthy_deadline = "3m"
    progress_deadline = "10m"
    auto_revert = false
    canary = 0
  }
  migrate {
    max_parallel = 50
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }
  group "cache" {
    count = 300
    restart {
      attempts = 2
      interval = "30m"
      delay = "15s"
      mode = "fail"
    }
    task "redis" {
      driver = "docker"
      config {
        image = "redis:3.2"
        port_map {
          db = 6379
        }
      }
      resources {
        cpu    = 20
        memory = 10
        network {
          mbits = 1
          port "db" {}
        }
      }
      service {
        name = "redis-cache"
        tags = ["global", "cache"]
        port = "db"
        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}

Nomad logs (if appropriate)

Will send to support email.

notnoop (Contributor) commented Jul 21, 2020

Thank you so much for reaching out! I suspect this is a manifestation of issue #6557. Do you only observe this when the allocs are failing due to host resource utilization (e.g. file limits)? What about a consistently failing job (e.g. one that always exits 1)?

threemachines (Author) commented

You're right, that sounds almost exactly like #6557.

I don't think my issue could be reproduced if the job always fails, since it needs some successfully running instances to be orphaned, but I might try writing a toy app that has a 50% chance to fail on start and see if that helps us.
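
Such a toy app could be as small as a wrapper entrypoint like this (a hypothetical sketch; redis-server just stands in for any long-running process):

  #!/usr/bin/env bash
  # Coin-flip entrypoint: exits nonzero on start about half the time,
  # otherwise execs a long-lived process so some instances run successfully.
  if [ $((RANDOM % 2)) -eq 0 ]; then
    echo "simulated startup failure" >&2
    exit 1
  fi
  exec redis-server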

notnoop (Contributor) commented Jul 28, 2020

Thank you again for the report! I'll close this ticket as a duplicate, then.

notnoop closed this as completed Jul 28, 2020
github-actions bot commented Nov 4, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 4, 2022