vine: scheduling inefficiency b/c task resource pre-allocation usually fails #3995
The checking fails mostly because too much disk was allocated.
@JinZhou5042 What do you mean? Without caching files, and with each task using one core, auto memory should be 2560 and disk 12800. If this is not the case, then this might be a bug. How much is allocated per task so that the check fails? Is it because files are already cached at the worker and the disk available for the proportion is less?
I logged the resource allocation function from a run comprising ~8k tasks. The first figure shows that the manager made 458k attempts at task resource allocation; 96.4% of them failed because the task was allocated more resources than the worker had available and thus could not be scheduled on that particular worker. The second figure shows the three types of failure among those unsuccessful attempts. Surprisingly, every single failure involves a disk overallocation, meaning that the scheduler might allocate disk too aggressively. The number of core overallocations equals the number of memory overallocations, probably indicating normal contention: the worker is busy with other tasks and has no extra free resources. The log is attached here. One typical failure looks like this:
The worker has 1 core, 2560 memory, and 12459 disk available; however, the scheduler allocates 1 core, 2560 memory, and 12683 disk. The disk is overallocated, so the task has to be considered against another worker. My hypothesis is that task outputs are not properly accounted for when allocating disk space.
Two possible improvements here:
Per our discussion earlier this week, we decided that the fundamental issue is that the manager is attempting to allocate all of the disk space at the worker, which results in trouble when the cache expands. Assuming the following:
Then, when the manager does not have a good prediction of the space needed by a task, it should allocate the following:
The result of this is that (in the absence of other information) the manager will seek to allocate one half of the available disk space, and give each task storage in proportion to the number of cores requested. This will allow the cache (or other task sandboxes) some room to grow while still allocating at least one half of the total disk.
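A minimal sketch of that rule as I read it (the bulleted assumptions above did not survive in this thread, so the variable names here are guesses rather than the agreed notation):

```c
#include <stdint.h>

/* Reserve half of the disk currently available for the cache (and other
 * sandboxes) to grow into, and give each task a share of that half in
 * proportion to the cores it requests. */
static int64_t disk_for_task(int64_t disk_available, /* disk currently free on the worker */
                             int64_t worker_cores,   /* total cores on the worker */
                             int64_t task_cores)     /* cores requested by the task */
{
    return (disk_available / 2) * task_cores / worker_cores;
}
```

With the numbers from the failure above (12459 free, a 1-core task on a hypothetical 5-core worker), this would ask for roughly 1245 of disk instead of 12683, leaving the remainder for the cache to grow into.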
For [...] I think A should be [...]. The dividing by two should be done after the fact, in case the proportion is larger than [...]. Further, the divide by 2 can be applied to all resources when needed, not only disk. With this, some allocations will fail, but I think that's ok. If the allocation does not fit even after /2, I think that's a cache management issue rather than an allocation one. For disk we could put a bound on the proportion: say, the proportion should not be larger than 3 times the maximum observed sandbox, or something like that. If tasks fail because of disk, then the maximum observed sandbox increases, and the task can be retried appropriately.
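A rough sketch of this alternative (several inline values in the comment above were lost, so the condition that should guard the halving is not shown; the 3x bound on the maximum observed sandbox is as stated):

```c
#include <stdint.h>

/* Compute the plain proportional disk first, cap it at three times the
 * largest sandbox observed so far, and apply the divide by two only after
 * the proportional value has been computed. */
static int64_t disk_for_task_alt(int64_t proportional_disk,   /* full proportional guess */
                                 int64_t max_observed_sandbox) /* largest sandbox seen so far */
{
    int64_t d = proportional_disk;
    if (max_observed_sandbox > 0 && d > 3 * max_observed_sandbox)
        d = 3 * max_observed_sandbox;
    return d / 2;
}
```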
No, the divide by two is specific to the disk resource, and should not be applied to the others. All the other resource types (cores, gpus, memory) are used only by tasks, and there is no harm in consuming all of the resource: once the task is done, the resource is freed. Disk is different because it is consumed by both the cache directory and the task sandboxes, so our objective is to leave some disk unused so that the cache directory can grow.
The divide by two for other resources is simply making a less conservative guess. What Jin is observing with failed allocations because of disk also happens with memory when tasks specify different quantities for other resources (i.e., different categories): a proportion may dictate using more memory than is available, but the proportion is only a guess and tasks may run fine with less memory. I would prefer the allocation code to deal just with allocations. Leaving space for the cache to grow sounds to me like a storage policy. The problem with implementing it in the proportional allocations is that they only kick in when a resource is not specified; if I specify disk, then the available/2 does not come into play for the cache. Further, using available space for disk may give the wrong proportion for the other resources: if a task specifies to use 100% of the available disk, then it will want to use 100% of the cores. If we really always want to divide by two the proportional allocation for disk, it should be done after the proportional value has been computed, and the proportion should be computed over [...].
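As a side note on the "only kick in when a resource is not specified" point, a minimal sketch of that behavior (the names are illustrative, not the cctools API):

```c
#include <stdint.h>

/* The proportional path only fills in resources the task did not specify
 * (modeled here as a negative request), so any cache-headroom policy that
 * lives inside it is bypassed as soon as the task declares its own disk. */
static int64_t resolve_resource(int64_t requested, int64_t proportional_guess)
{
    return requested >= 0 ? requested : proportional_guess;
}
```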
In `check_worker_against_task`, we check whether a task is able to run on a worker. This includes two essential steps:

1. `vine_manager_choose_resources_for_task`. By default, we use a proportional technique, which chooses the max proportional cpu/memory/disk/gpu resources for the task.
2. `check_worker_have_enough_resources`. This compares the chosen cpu/memory/disk/gpu against the actual or current resource usage state on the worker.

However, the chosen resources are usually larger than the available resources, which results in the second step constantly failing until some tasks complete on the worker and release resources. Because the first step tends to choose a larger portion of resources than is available, the scheduling overhead seems to dominate the latency. I observed this by printing some debug messages in the terminal.
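For readers unfamiliar with the code, below is a minimal, self-contained sketch of those two steps. This is not the cctools source: the struct, field names, and the 5-core worker totals are illustrative, and the real functions do considerably more. It reproduces the kind of disk failure reported above, where the proportional guess is computed from the worker's total disk while the check runs against the disk still free after the cache has grown.

```c
#include <inttypes.h>
#include <stdio.h>

/* Illustrative resource bundle; the real code carries more fields. */
struct res { int64_t cores, memory, disk; };

/* Step 1, in the spirit of vine_manager_choose_resources_for_task: a task
 * that only declares cores gets memory and disk proportional to its share
 * of the worker's *total* resources. */
static struct res choose_resources(struct res worker_total, int64_t task_cores)
{
    struct res chosen;
    chosen.cores  = task_cores;
    chosen.memory = worker_total.memory * task_cores / worker_total.cores;
    chosen.disk   = worker_total.disk   * task_cores / worker_total.cores;
    return chosen;
}

/* Step 2, in the spirit of check_worker_have_enough_resources: compare the
 * chosen values against what is currently free on the worker. */
static int have_enough(struct res chosen, struct res available)
{
    return chosen.cores  <= available.cores
        && chosen.memory <= available.memory
        && chosen.disk   <= available.disk;
}

int main(void)
{
    /* Hypothetical 5-core worker sized at 2560 memory / 12800 disk per
     * core, roughly matching the numbers in the discussion above. */
    struct res total = { .cores = 5, .memory = 12800, .disk = 64000 };

    /* Currently free: one idle core, the full memory share, but the cache
     * has already consumed part of the disk, leaving 12459. */
    struct res avail = { .cores = 1, .memory = 2560, .disk = 12459 };

    struct res chosen = choose_resources(total, 1);
    printf("chosen %" PRId64 "/%" PRId64 "/%" PRId64
           " vs free %" PRId64 "/%" PRId64 "/%" PRId64 ": %s\n",
           chosen.cores, chosen.memory, chosen.disk,
           avail.cores, avail.memory, avail.disk,
           have_enough(chosen, avail) ? "fits" : "fails (disk overallocated)");
    return 0;
}
```

Running this prints a failure on disk even though cores and memory fit, which is the pattern seen in the vast majority of the failed attempts in the log above.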
To be more clear, I ran a small version of DV5 with the current version (without any change) and with my test version (which selects a much smaller portion of resources in `vine_manager_choose_resources_for_task`), separately. The differences:

- current version: the resource allocation is expensive and tasks run very sparsely
- test version: the resource allocation mostly succeeds and task concurrency looks good
In total there are 8801 tasks, and 99% of them finish within 10s. Factory configuration:
That said, I think there is room to reduce the scheduling latency by combining a set of techniques: