
Fixes for 'concurrency' issues in JobQueue and JobManager #1

Open · wants to merge 8 commits into master
Conversation

@mojojojo99 (Owner) commented on Dec 11, 2020

Fixes autolab#182 and indirectly fixes autolab#142

Changes proposed in this PR:

  • The most glaring issue fixed was here: I added a check for whether the pre-allocated VM in jobManager.py is None before attempting to log it. When there are not enough VMs available in the pool, None can be returned. This previously caused an exception to be thrown and the job to be marked Dead instead. (A short sketch of this check appears after this list.)
  • Added a TangoQueue for unassigned jobs. This keeps a FIFO queue of live jobs that have been added but not yet assigned to a worker or VM; getNextPendingJob now simply pops the head of that queue. (A rough sketch of such a queue appears after this list.) Some implications:
    - This indirectly avoids the possible starvation problem described in autolab/Tango#142 ("jobs with high ids stay waiting when the job id counter rolls over"), since jobs are assigned and run in FIFO order rather than arbitrarily by job ID.
    - I've also used the blocking form of the Redis list pop, so getNextPendingJob blocks until the next job is available in the queue. Previously the jobManager appeared to spin when there were no jobs, which is less desirable.
    - To accommodate existing jobManager API methods such as delJob, I have added methods to the TangoQueue object for removing items from the queue.
  • Removed some methods in the jobQueue that are currently unused and do not seem useful. I also made getNextPendingJobReuse less expensive: previously it looped through the hash table to find a job and initialize a pool of VMs for it, but this can be simplified by retrieving the job directly and passing it to a helper function that performs the initialization.
  • Changed initializeVM to createVM in the Worker; refer to the note below for more details.
  • Added more comments and docs to the functions within jobQueue. Several methods carried implicit assumptions (or required preconditions), and I made them explicit in these comments.
  • Added some minor error handling within jobQueue. I checked most of the functions to make sure that error codes are being handled.
  • Fixed some of the unit tests for jobQueue so that they align with the modified API.
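For illustration, here is a minimal sketch of the kind of guard described in the first bullet. The function name dispatch_job, the allocVM call, and the logger setup are hypothetical stand-ins rather than the actual jobManager.py code.

```python
import logging

log = logging.getLogger("jobManager")

def dispatch_job(job, preallocator):
    """Hypothetical sketch of the None check described above; the
    identifiers here are illustrative, not exact jobManager.py code."""
    # The preallocator may return None when the pool has no free VMs.
    pre_vm = preallocator.allocVM(job.vm.name)
    if pre_vm is None:
        # Previously the logging below raised on None and the job ended up
        # marked Dead; handle the empty-pool case explicitly instead.
        log.info("No preallocated VM available for job %s", job.name)
        return None
    log.info("Assigned preallocated VM %s to job %s", pre_vm.id, job.name)
    return pre_vm
```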

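As a rough illustration of the unassigned-job queue and the blocking pop described above, here is a self-contained sketch using the redis-py client. The class and method names (UnassignedJobQueue, enqueue, blocking_dequeue, remove) are hypothetical and only approximate what TangoQueue does.

```python
import redis

class UnassignedJobQueue:
    """Minimal sketch of a FIFO queue of unassigned job ids backed by a
    Redis list, as described above. Illustrative only; the real
    TangoQueue keys and method names may differ."""

    def __init__(self, name, host="localhost", port=6379):
        self.name = name
        self.r = redis.StrictRedis(host=host, port=port, decode_responses=True)

    def enqueue(self, job_id):
        # New live jobs go to the tail of the list, preserving FIFO order.
        self.r.rpush(self.name, job_id)

    def blocking_dequeue(self):
        # BLPOP blocks until an item is available, so the caller does not
        # have to spin while the queue is empty.
        _, job_id = self.r.blpop(self.name)
        return job_id

    def remove(self, job_id):
        # Needed to support API calls like delJob, which must be able to
        # pull a specific job out of the middle of the queue.
        self.r.lrem(self.name, 0, job_id)
```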
QA

  • More tests for this are in the works. We realize there are currently no integration tests that make sure the server and the jobManager behave as expected together, so we may be writing more of those.

Note:

  • I have previously raised this in the issue comments, but because of the exception thrown during logging in jobManager, several branches of code in the Worker are never exercised. This is the case when a worker is started with preVM set to None. The current behavior of the worker is to initialize a new VM whenever there is no pre-allocated VM for it. This can cause issues and unexpected behavior when Config.REUSE_VMMS is set to true: when there is a flood of jobs to the autograder, many of these VMs may be created in the worker. Because the worker calls initializeVM to immediately create a new VM instance, that VM is never added to the pool and is not tracked anywhere else; based on my limited understanding, it will never be terminated. In the case where ec2ssh is used, for example, this might be undesirable. I have replaced this call with createVM, so that the pool at least keeps track of the new VM and the preallocator can later manage the pool by terminating instances when there are too many. (A rough sketch of this swap appears after this note.)

However, a more important question is: what is the desired behavior of the autograder when there are not enough VMs? Perhaps we could consider simply blocking the job and waiting for another VM to be freed instead of always creating new VMs, which can potentially cause many more VMs to be created during peak usage.
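To make the worker-side change concrete, here is a rough sketch written under the assumption that the preallocator exposes createVM and allocVM; the function name ensure_vm and the surrounding structure are hypothetical, not the actual Worker code.

```python
def ensure_vm(pre_vm, job, preallocator):
    """Hypothetical sketch of the swap described in the note above."""
    if pre_vm is not None:
        return pre_vm

    # Old behavior (untracked): calling vmms.initializeVM(job.vm) spun up a
    # VM that never entered the pool, so nothing ever terminated it.
    #
    # New behavior: create the VM through the preallocator so the pool
    # tracks it and can terminate instances later when there are too many.
    preallocator.createVM(job.vm)
    return preallocator.allocVM(job.vm.name)
```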

@mojojojo99 changed the title from "Jojo" to "Fixes for 'concurrency' issues in JobQueue and JobManager" on Dec 12, 2020
Successfully merging this pull request may close these issues.

  • Tango concurrency and locking issues
  • jobs with high ids stay waiting when the job id counter rolls over