NoneType Exception When Queuing Jobs #138
Comments
Just saw this bug. Had the same problem myself. Not a fix for the root problem, but this PR avoids the crash and logs the error.
It turns out the "solution" to this issue is a necessary but entirely undocumented command. Tango keeps track of "pools" (that aren't actually pools) of VMs and attempts to "allocate" machines from them. For some reason the pool for the first Docker image name given is created automatically, but you can't use a different Docker image without manually creating its pool with the following command:
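The exact command wasn't captured in this thread. As a rough, hedged illustration, pool pre-allocation can be requested through Tango's REST `prealloc` call, along these lines; the host, key, image name, and pool size below are placeholder assumptions, not values from this issue:

```python
# Hypothetical illustration of the pool pre-allocation step.
# Tango exposes a REST "prealloc" endpoint; everything below
# (host, key, image, size) is a placeholder.
import requests

TANGO_HOST = "http://localhost:3000"  # assumed Tango server address
KEY = "test"                          # API key from Tango's config
IMAGE = "myCustomImage"               # Docker image the pool should use
POOL_SIZE = 2                         # number of VMs to pre-allocate

# POST /prealloc/<key>/<image>/<num>/ asks Tango to create a pool of
# <num> instances backed by <image>; without this, only the first
# image's pool exists.
resp = requests.post(f"{TANGO_HOST}/prealloc/{KEY}/{IMAGE}/{POOL_SIZE}/")
print(resp.status_code, resp.text)
```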
With VM re-use disabled (having it enabled was causing problems for me), the pool size is a useless parameter: Tango will simply spawn as many jobs as it can, as fast as it can.
Indeed there seems to be an issue with VM reuse, and it still has not been resolved. Bumping this thread so the devs notice.
Hi, a quick update for those viewing this thread later; read the TLDR below if you don't want to read through the explanation.

Firstly, looking at this line in the codebase, the Docker containers are already destroyed after copying out the feedback. I tested commenting out the line where localDocker calls destroyVM when it finishes its job, and yet the Docker container still exited. It seems that Docker containers are meant to exit when their main command finishes (https://stackoverflow.com/questions/28212380/why-docker-container-exits-immediately). So even without the destroy command, the container exits anyway. In fact, localDocker depends on the container exiting to obtain its output: when I stopped the Docker container from exiting, localDocker was not able to obtain the feedback.

So truly re-using Docker containers would require a different implementation of localDocker, but I don't think that is needed anyway, because it seems the correct way to "reuse" Docker containers is to let them exit when they aren't being used.

TLDR: Applying @bstriner's patch does create a limited pool, capping the number of Docker containers that can be spun up at any one time, but it does not actually enable reuse of Docker containers (reuse might still be possible with other VM backends). From what I can see, not reusing Docker containers is the correct way to use them.
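To make the container lifecycle described above concrete, here is a minimal standalone sketch (plain `docker` CLI driven from Python's subprocess, not Tango code; the `ubuntu` image and container name are arbitrary choices) showing that a container exits as soon as its main command finishes, and that its output is collected after exit:

```python
# Minimal sketch: a container exits when its main command finishes,
# and output is available after exit. Requires a local Docker daemon.
import subprocess

# Run a container whose main command terminates immediately.
run = subprocess.run(
    ["docker", "run", "--name", "demo", "ubuntu", "echo", "feedback"],
    capture_output=True, text=True,
)
print("container output:", run.stdout.strip())  # available once it exits

# The container is now in the Exited state even though nothing
# explicitly destroyed it; exiting is the normal end of its lifecycle.
status = subprocess.run(
    ["docker", "ps", "-a", "--filter", "name=demo", "--format", "{{.Status}}"],
    capture_output=True, text=True,
)
print("container status:", status.stdout.strip())

# Clean up, analogous to Tango's destroyVM step.
subprocess.run(["docker", "rm", "demo"], capture_output=True)
```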
Expected Behavior
When using Redis with VM ReUse enabled, the server should continue to process jobs.
Actual Behavior
After some combination of job submissions takes place, the jobManager produces the following error:
Steps to Reproduce the Behavior
Run the Tango job manager with Redis and VM ReUse enabled; a sketch of the relevant configuration follows.
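A hedged sketch of what the relevant part of Tango's config.py might look like for this setup; the option names are assumptions based on Tango's config template and may differ between versions:

```python
# Assumed Tango config.py settings reproducing this setup.
USE_REDIS = True   # persist job queues in Redis instead of in memory
REUSE_VMS = True   # re-use VMs between jobs (the setting at fault here)
POOL_SIZE = 2      # size of each VM pool when pre-allocating
```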
Additional Details
The error appears to be the result of `getNextPendingJobReuse` returning `(None, None)` at jobManager.py:66. The next line then tries to retrieve a job using `None` as an ID, which raises an exception and leaves the variable `job` holding the value `None`. Next, the exception handler at jobManager.py:105 attempts to mark the job as dead. Unfortunately, because `job` is `None`, the above error is raised.
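For anyone hitting this, here is a hedged sketch of the kind of guard the patch mentioned above applies. This is not the actual PR; the `job_queue` method names and the loop shape are simplified stand-ins for the real jobManager.py code:

```python
# Sketch (not the actual PR): tolerate getNextPendingJobReuse returning
# (None, None), and only mark a job dead if one was actually fetched.
import logging

logger = logging.getLogger("jobManager")


def run_job_loop(job_queue):
    while True:
        job = None
        try:
            job_id, vm = job_queue.getNextPendingJobReuse()
            if job_id is None:  # the (None, None) case from this issue
                logger.error("getNextPendingJobReuse returned no job; retrying")
                continue
            job = job_queue.get(job_id)
            job_queue.assignJob(job.id)
            # ... dispatch the job to a worker here ...
        except Exception as err:
            # Guard the handler itself: with job still None, calling
            # makeDead(job.id, ...) is what raised the original error.
            if job is not None:
                job_queue.makeDead(job.id, str(err))
            else:
                logger.error("Job-queue exception with no job in hand: %s", err)
```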