Skip to content

Commit

Permalink
ciao-launcher: Limit the number of parallel starts
Browse files Browse the repository at this point in the history
This commit limits the number of parallel starts to a function
of the number of CPUs present in the node.  There really isn't
much point in allowing 1000 instances to be started on the same
node at the same time.  Doing so won't increase the start times
much and will increase the likelihood of failure due to the
resource exhaustion caused by the heavy demands of instance
startup.

Fixes ciao-project#8

Signed-off-by: Mark Ryan <[email protected]>
  • Loading branch information
Mark Ryan committed Jul 6, 2016
1 parent b68ed47 commit 3fbd92d
Showing 1 changed file with 15 additions and 0 deletions.
15 changes: 15 additions & 0 deletions ciao-launcher/instance.go
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@ package main

import (
"path"
"runtime"
"sync"
"time"

Expand Down Expand Up @@ -61,6 +62,12 @@ type insDeleteCmd struct {
type insStopCmd struct{}
type insMonitorCmd struct{}

var startSemaphore chan struct{}

func init() {
startSemaphore = make(chan struct{}, runtime.NumCPU()*2)
}

/*
This functions asks the server loop to kill the instance. An instance
needs to request that the server loop kill it if Start fails completly.
Expand Down Expand Up @@ -103,7 +110,15 @@ func (id *instanceData) startCommand(cmd *insStartCmd) {
startErr.send(id.ac.conn, id.instance)
return
}

select {
case startSemaphore <- struct{}{}:
case <-id.doneCh:
glog.Warningf("Abandoning instance %s start due to shutdown", id.instance)
return
}
st, startErr := processStart(cmd, id.instanceDir, id.vm, id.ac.conn)
_ = <-startSemaphore
if startErr != nil {
glog.Errorf("Unable to start instance[%s]: %v", string(startErr.code), startErr.err)
startErr.send(id.ac.conn, id.instance)
Expand Down

1 comment on commit 3fbd92d

@tpepper
Copy link

@tpepper tpepper commented on 3fbd92d Jul 6, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me though I'd do:

runtime.NumCPU()*20

There's IO during start up and things block. My rough sense has been that more than 20x overcommit or so during boot was where we started to see slowdown. If you do only 2x, we're liable to run slower than currently in the face of a huge herd of launches, where I take the patch's intent is to speed things.

This argues at having it a defined variable with a default value and an override via a --cpu-overcommit config option.

Please sign in to comment.