
Fixes for 'concurrency' issues in JobQueue and JobManager #1

Open · wants to merge 8 commits into master
Conversation

@mojojojo99 (Owner) commented on Dec 11, 2020

Fixes autolab#182 and indirectly fixes autolab#142

Changes proposed in this PR:

  • The most glaring issue fixed was here: I added a check for whether the pre-allocated VM in jobManager.py is None before attempting to log it. When there are not enough VMs available in the pool, None can be returned. This previously caused an exception to be thrown and the job to be marked Dead instead. (A short sketch of this check appears after this list.)
  • Added a TangoQueue for unassigned jobs. This keeps a FIFO queue of live jobs that have been added but not yet assigned to a worker or VM; getNextPendingJob now simply pops the head of that queue. (A rough sketch of such a queue appears after this list.) Some implications:
    - This indirectly avoids the possible starvation problem described in autolab/Tango#142 ("jobs with high ids stay waiting when the job id counter rolls over"), since jobs are assigned and run in FIFO order rather than arbitrarily by job ID.
    - I've also used the blocking form of the Redis list pop, so getNextPendingJob blocks until the next job is available in the queue. Previously the jobManager appeared to spin when there were no jobs, which is less desirable.
    - To accommodate existing jobManager API methods such as delJob, I have added methods to the TangoQueue object for removing items from the queue.
  • Removed some methods in the jobQueue that are currently unused and do not seem useful. I also made getNextPendingJobReuse less expensive: previously it looped through the hash table to find a job and initialize a pool of VMs for it, but this can be simplified by retrieving the job directly and passing it to a helper function that performs the initialization.
  • Changed initializeVM to createVM in the Worker; refer to the note below for more details.
  • Added more comments and docs to the functions within jobQueue. Several methods carried implicit assumptions (or required preconditions), and I made them explicit in these comments.
  • Added some minor error handling within jobQueue. I checked most of the functions to make sure that error codes are being handled.
  • Fixed some of the unit tests for jobQueue so that they align with the modified API.
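For illustration, here is a minimal sketch of the kind of guard described in the first bullet. The function name dispatch_job, the allocVM call, and the logger setup are hypothetical stand-ins rather than the actual jobManager.py code.

```python
import logging

log = logging.getLogger("jobManager")

def dispatch_job(job, preallocator):
    """Hypothetical sketch of the None check described above; the
    identifiers here are illustrative, not exact jobManager.py code."""
    # The preallocator may return None when the pool has no free VMs.
    pre_vm = preallocator.allocVM(job.vm.name)
    if pre_vm is None:
        # Previously the logging below raised on None and the job ended up
        # marked Dead; handle the empty-pool case explicitly instead.
        log.info("No preallocated VM available for job %s", job.name)
        return None
    log.info("Assigned preallocated VM %s to job %s", pre_vm.id, job.name)
    return pre_vm
```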

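As a rough illustration of the unassigned-job queue and the blocking pop described above, here is a self-contained sketch using the redis-py client. The class and method names (UnassignedJobQueue, enqueue, blocking_dequeue, remove) are hypothetical and only approximate what TangoQueue does.

```python
import redis

class UnassignedJobQueue:
    """Minimal sketch of a FIFO queue of unassigned job ids backed by a
    Redis list, as described above. Illustrative only; the real
    TangoQueue keys and method names may differ."""

    def __init__(self, name, host="localhost", port=6379):
        self.name = name
        self.r = redis.StrictRedis(host=host, port=port, decode_responses=True)

    def enqueue(self, job_id):
        # New live jobs go to the tail of the list, preserving FIFO order.
        self.r.rpush(self.name, job_id)

    def blocking_dequeue(self):
        # BLPOP blocks until an item is available, so the caller does not
        # have to spin while the queue is empty.
        _, job_id = self.r.blpop(self.name)
        return job_id

    def remove(self, job_id):
        # Needed to support API calls like delJob, which must be able to
        # pull a specific job out of the middle of the queue.
        self.r.lrem(self.name, 0, job_id)
```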
QA

  • More tests for this are in the works. We realize there are currently no integration tests that make sure the server and the jobManager behave as expected together, so we may be writing more of those.

Note:

  • I have previously raised this in the issue comments, but because of the exception thrown during logging in jobManager, several branches of code in the Worker are never exercised. This is the case when a worker is started with preVM set to None. The current behavior of the worker is to initialize a new VM whenever there is no pre-allocated VM for it. This can cause issues and unexpected behavior when Config.REUSE_VMMS is set to true: when there is a flood of jobs to the autograder, many of these VMs may be created in the worker. Because the worker calls initializeVM to immediately create a new VM instance, that VM is never added to the pool and is not tracked anywhere else; based on my limited understanding, it will never be terminated. In the case where ec2ssh is used, for example, this might be undesirable. I have replaced this call with createVM, so that the pool at least keeps track of the new VM and the preallocator can later manage the pool by terminating instances when there are too many. (A rough sketch of this swap appears after this note.)

However, a more important question is: what is the desired behavior of the autograder when there are not enough VMs? Perhaps we could consider simply blocking the job and waiting for another VM to be freed instead of always creating new VMs, which can potentially cause many more VMs to be created during peak usage.
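To make the worker-side change concrete, here is a rough sketch written under the assumption that the preallocator exposes createVM and allocVM; the function name ensure_vm and the surrounding structure are hypothetical, not the actual Worker code.

```python
def ensure_vm(pre_vm, job, preallocator):
    """Hypothetical sketch of the swap described in the note above."""
    if pre_vm is not None:
        return pre_vm

    # Old behavior (untracked): calling vmms.initializeVM(job.vm) spun up a
    # VM that never entered the pool, so nothing ever terminated it.
    #
    # New behavior: create the VM through the preallocator so the pool
    # tracks it and can terminate instances later when there are too many.
    preallocator.createVM(job.vm)
    return preallocator.allocVM(job.vm.name)
```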

@mojojojo99 changed the title from "Jojo" to "Fixes for 'concurrency' issues in JobQueue and JobManager" on Dec 12, 2020
Successfully merging this pull request may close these issues.

  • Tango concurrency and locking issues
  • jobs with high ids stay waiting when the job id counter rolls over