This repository has been archived by the owner on Jul 16, 2020. It is now read-only.

Limit number of parallel starts #8

Open
markdryan opened this issue Apr 7, 2016 · 1 comment


@markdryan
Contributor

markdryan commented Apr 7, 2016

Currently, there is no limit on the number of instances that ciao-launcher will start in parallel. Spawning a large number of instances at once places extreme load on the compute node and can lead to various errors. It may be better for launcher to introduce some sort of semaphore that limits the number of instances launched in parallel to some function of the number of cores on the machine. We could also return a special STATUS, e.g., throttle, to indicate that launcher is overloaded but not full.

When launching a large number of instances, e.g., 10000, we often see failures and timeouts in qemu and networking. Reducing the load on the compute node may prevent these failures.
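
For illustration only, here is a minimal Go sketch of the semaphore idea, using a buffered channel as a counting semaphore. Everything in it is an assumption rather than ciao-launcher code: the `startLimiter` type, the `runStarts` helper, and the `2 * runtime.NumCPU()` limit are hypothetical, and the "function of the number of cores" would need tuning on real hardware. The non-blocking `tryAcquire` shows one way a launcher could detect that it should report a throttle-style status instead of queuing the start.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// startLimiter is a hypothetical counting semaphore: the channel's
// capacity is the maximum number of starts allowed in parallel.
type startLimiter struct {
	slots chan struct{}
}

func newStartLimiter(max int) *startLimiter {
	if max < 1 {
		max = 1
	}
	return &startLimiter{slots: make(chan struct{}, max)}
}

// acquire blocks until a start slot is free.
func (l *startLimiter) acquire() { l.slots <- struct{}{} }

// tryAcquire claims a slot without blocking. A false result is where a
// launcher could report "overloaded but not full" (the throttle idea).
func (l *startLimiter) tryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot once the instance start has finished.
func (l *startLimiter) release() { <-l.slots }

// runStarts pushes n simulated starts through the limiter and returns
// the highest number that were in progress at the same time.
func runStarts(l *startLimiter, n int) int64 {
	var wg sync.WaitGroup
	var active, peak int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.acquire()
			defer l.release()
			cur := atomic.AddInt64(&active, 1)
			for { // record the peak concurrency seen so far
				old := atomic.LoadInt64(&peak)
				if cur <= old || atomic.CompareAndSwapInt64(&peak, old, cur) {
					break
				}
			}
			runtime.Gosched() // stand-in for the real instance launch
			atomic.AddInt64(&active, -1)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	limit := 2 * runtime.NumCPU() // assumed policy, not taken from ciao
	peak := runStarts(newStartLimiter(limit), 1000)
	fmt.Printf("limit=%d peak=%d\n", limit, peak)
}
```

With this shape, 1000 start requests can arrive at once, but the number of launches actually in progress never exceeds the configured limit.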

@markdryan markdryan self-assigned this Apr 7, 2016
@markdryan markdryan added the P2 label Jun 6, 2016
@amyleeland amyleeland added this to the Sprint 1 milestone Jun 9, 2016
@tpepper

tpepper commented Jun 22, 2016

Relates to issue #99 ... Even if a node isn't launching a tonne of things, its CPU load could be really high. The same could be true of all nodes, temporarily. Scheduler could queue for a bit and meter start commands out to nodes, acting as a higher-level throttle than launcher alone. Both issues have merit, but we end up with a few more degrees of freedom in the flow that is START.
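
As a rough sketch of the scheduler-side idea, the Go snippet below queues start commands and hands them out only to nodes that have spare start capacity. It is not based on ciao's scheduler: the `scheduler` type, its methods, the node names, and the per-node in-flight cap are all hypothetical. The comment above talks about CPU load, and the count of in-flight starts is used here only as a simple stand-in for that signal.

```go
package main

import "fmt"

// scheduler is a hypothetical higher-level throttle: it queues START
// commands and meters them out to nodes.
type scheduler struct {
	maxInFlight int            // assumed cap on concurrent starts per node
	nodes       []string       // known compute nodes, in a fixed order
	inFlight    map[string]int // starts sent to a node but not yet finished
	queue       []string       // instances waiting to be started
}

func newScheduler(nodes []string, maxInFlight int) *scheduler {
	return &scheduler{
		maxInFlight: maxInFlight,
		nodes:       nodes,
		inFlight:    make(map[string]int),
	}
}

// submit queues a start request instead of sending it immediately.
func (s *scheduler) submit(instance string) {
	s.queue = append(s.queue, instance)
}

// pickNode returns the least-loaded node that is still under the cap.
func (s *scheduler) pickNode() (string, bool) {
	best := ""
	for _, n := range s.nodes {
		if s.inFlight[n] >= s.maxInFlight {
			continue
		}
		if best == "" || s.inFlight[n] < s.inFlight[best] {
			best = n
		}
	}
	return best, best != ""
}

// dispatch assigns queued starts while any node has capacity and
// returns the instance-to-node assignments made in this round.
func (s *scheduler) dispatch() map[string]string {
	assigned := make(map[string]string)
	for len(s.queue) > 0 {
		node, ok := s.pickNode()
		if !ok {
			break // every node is busy; the rest stay queued
		}
		instance := s.queue[0]
		s.queue = s.queue[1:]
		s.inFlight[node]++
		assigned[instance] = node
	}
	return assigned
}

// started records that a node finished one start, freeing capacity.
func (s *scheduler) started(node string) {
	if s.inFlight[node] > 0 {
		s.inFlight[node]--
	}
}

func main() {
	s := newScheduler([]string{"node-a", "node-b"}, 2)
	for i := 0; i < 5; i++ {
		s.submit(fmt.Sprintf("instance-%d", i))
	}
	fmt.Println(len(s.dispatch()), "dispatched,", len(s.queue), "queued")
}
```

A launcher-level limit and a scheduler-level queue like this are complementary: the first protects a single node, the second decides where and when start commands are sent at all.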

markdryan pushed a commit to markdryan/ciao that referenced this issue Jul 6, 2016
This commit limits the number of parallel starts to a function
of the number of CPUs present in the node.  There really isn't
much point in allowing 1000 instances to be started on the same
node at the same time.  Doing so won't increase the start times
much and will increase the likelihood of failure due to the
resource exhaustion caused by the heavy demands of instance
startup.

Fixes ciao-project#8

Signed-off-by: Mark Ryan <[email protected]>
@amyleeland amyleeland removed the ready label Jul 6, 2016
markdryan pushed a commit to markdryan/ciao that referenced this issue Jul 7, 2016
@tpepper tpepper modified the milestones: Sprint 3, Sprint 1 Jul 19, 2016
kaccardi added a commit to kaccardi/ciao that referenced this issue Aug 11, 2016
@amyleeland amyleeland removed this from the Sprint 3 milestone Sep 6, 2016