Fixes for 'concurrency' issues in JobQueue and JobManager #1
Fixes autolab#182 and indirectly fixes autolab#142
Changes proposed in this PR:
- Check that the VM returned from the pool is not `None` before attempting to do the logging. When there are not enough VMs available in the pool, `None` can be returned; this previously caused an exception to be thrown and the job to be considered Dead instead.
- Simplified `getNextPendingJob`: we now simply pop off the FIFO queue. Some implications:
  - This indirectly helps avoid the possible starvation problem mentioned in the issue "jobs with high ids stay waiting when the job id counter rolls over" (autolab/Tango#142), since jobs are run based on their order in the FIFO queue rather than arbitrarily by job ID.
  - I've also used a blocking `lpop` on the Redis list, so `getNextPendingJob` blocks until the next job is available in the job queue. Previously the jobManager appears to have spun when there were no jobs, which is less desirable.
  - To accommodate existing jobManager API methods like `delJob`, I have added methods to the `TangoQueue` object for removing items from the queue.
- Simplified the `getNextPendingJobReuse` method to make it less expensive. Previously it looped through the hash table to find a job and initialize a pool of VMs for it; now we simply retrieve the job and pass it to a helper function that performs the initialization.
- Changed `initializeVM` to `createVM` in the `Worker` -- refer to the note below for more details.
- Added QA tests to check that the `server` and the `jobManager` behave as expected together; we may be writing more of these.

*Note:*
Some code paths in the `Worker` are not tested or used at all; this is the case when a worker is started with `preVM` set to `None`. It seems like the current behavior of the worker is to initialize a new VM whenever there is no pre-allocated VM for it. This might cause issues and unexpected behavior when `Config.REUSE_VMMS` is set to true: when there is a flood of jobs to the autograder, many of these VMs can be created in this worker. Because the worker calls `initializeVM` to immediately initialize and create a new VM instance, that VM is not added to the pool and is not tracked anywhere else; based on my limited understanding, it will never be terminated. Where ec2ssh is used, for example, this is undesirable. I have replaced this call with `createVM`, at least, so that the pool keeps track of the new VM; the preallocator will then manage the pool by terminating instances when there are too many later on.

However, a more important question is: what is the desired behavior of the autograder when there are not enough VMs? Perhaps we should simply block the job there and wait for another VM to be freed instead of always creating new VMs, since the current approach can create many more VMs during peak usage.
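For illustration, the `None` check described in the first bullet can be sketched as follows. All names here (`VMPool`, `allocate`, `get_vm_for_job`) are hypothetical stand-ins, not the actual Tango code:

```python
from typing import List, Optional

class VMPool:
    """Toy stand-in for the preallocator pool (hypothetical names)."""

    def __init__(self, vms: List[str]):
        self._vms = list(vms)

    def allocate(self) -> Optional[str]:
        # Returns None when the pool has no free VMs, mirroring the
        # case the PR guards against.
        return self._vms.pop() if self._vms else None

def get_vm_for_job(pool: VMPool, job_name: str) -> Optional[str]:
    vm = pool.allocate()
    # Guard before logging: previously, logging attributes of vm when
    # it was None raised an exception and the job was marked Dead.
    if vm is None:
        print("No VM available for job %s" % job_name)
        return None
    print("Assigned VM %s to job %s" % (vm, job_name))
    return vm
```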
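The FIFO-pop behavior of `getNextPendingJob` and the removal helpers added to `TangoQueue` can be sketched with an in-memory queue: the blocking pop here plays the role of the blocking Redis list pop, and `remove` plays the role of the new removal methods (the class and method names are illustrative, not the actual Tango API):

```python
import threading
from collections import deque

class FIFOJobQueue:
    """In-memory sketch of the Redis-backed queue semantics."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()

    def push(self, job_id):
        with self._cond:
            self._items.append(job_id)
            self._cond.notify()

    def blocking_pop(self, timeout=None):
        # Blocks until a job is available, like a blocking pop on a
        # Redis list, instead of spinning while the queue is empty.
        with self._cond:
            while not self._items:
                if not self._cond.wait(timeout=timeout):
                    return None  # timed out with no job available
            return self._items.popleft()

    def remove(self, job_id):
        # Supports jobManager operations like delJob that must drop a
        # specific queued item rather than pop from the head.
        with self._cond:
            try:
                self._items.remove(job_id)
                return True
            except ValueError:
                return False
```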
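The difference between `initializeVM` and `createVM` described in the note comes down to whether the new VM is recorded anywhere it can later be reaped. A minimal sketch of that idea, under hypothetical names (the real preallocator logic is more involved):

```python
class Preallocator:
    """Toy model of pool tracking: only tracked VMs can be reaped."""

    def __init__(self):
        self.pool = []
        self._next_id = 0

    def initialize_vm(self):
        # Old path: the VM is spun up but never registered, so the
        # preallocator has no handle with which to terminate it later.
        self._next_id += 1
        return "vm-%d" % self._next_id

    def create_vm(self):
        # New path: the VM is added to the pool, so the preallocator
        # can reap it when the pool grows too large.
        vm = self.initialize_vm()
        self.pool.append(vm)
        return vm

    def shrink(self, max_size):
        # Terminate tracked VMs beyond max_size; untracked VMs created
        # via initialize_vm() are invisible here.
        terminated = []
        while len(self.pool) > max_size:
            terminated.append(self.pool.pop())
        return terminated
```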