
vine: scheduling inefficiency b/c task resource pre-allocation usually fails #3995

Open
JinZhou5042 opened this issue Nov 27, 2024 · 8 comments · May be fixed by #4006

@JinZhou5042
Member

In check_worker_against_task, we check whether a task is able to run on a worker. This involves two essential steps (a simplified sketch follows the list):

  1. Estimate the resources to be allocated for this task, in vine_manager_choose_resources_for_task. By default, we use a proportional technique that chooses the maximum proportional cpu/memory/disk/gpu allocation for the task.
  2. Check whether the chosen resources fit the actual worker, in check_worker_have_enough_resources. This compares the chosen cpu/memory/disk/gpu against the resources currently free on the worker.
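
For reference, a simplified sketch of that two-step flow (the types, field names, and proportional rule below are illustrative, not the actual cctools code):

#include <stdint.h>

struct task_resources {
	int64_t cores;
	int64_t memory; /* MB */
	int64_t disk;   /* MB */
	int64_t gpus;
};

/* Step 1 (illustrative): choose a proportional share of the worker's total
   resources, e.g. a 1-core task on a 4-core worker gets 1/4 of its memory and disk. */
static struct task_resources choose_resources_for_task(struct task_resources worker_total, int64_t task_cores)
{
	struct task_resources chosen;
	chosen.cores  = task_cores;
	chosen.memory = worker_total.memory * task_cores / worker_total.cores;
	chosen.disk   = worker_total.disk   * task_cores / worker_total.cores;
	chosen.gpus   = 0;
	return chosen;
}

/* Step 2 (illustrative): the chosen allocation must fit what is currently free
   on the worker; if any dimension does not fit, the task is skipped on this
   worker and considered against the next one. */
static int worker_has_enough_resources(struct task_resources chosen, struct task_resources available)
{
	return chosen.cores  <= available.cores
	    && chosen.memory <= available.memory
	    && chosen.disk   <= available.disk
	    && chosen.gpus   <= available.gpus;
}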

However, the chosen resources are usually larger than what is currently available on the worker, so the second step fails repeatedly until some tasks complete on the worker and release resources. Because the first step tends to choose a larger share of resources than is actually available, this scheduling overhead seems to dominate the latency. I observed this by printing debug messages in the terminal.

To make this clearer, I separately ran a small version of DV5 with the current code (without any change) and with my test version (which selects a much smaller portion of resources in vine_manager_choose_resources_for_task). The results below show the differences:

  • current version: resource allocation is expensive and tasks run sparsely
    [screenshot]

  • test version: resource allocation mostly succeeds and task concurrency looks good
    [screenshot]

In total there are 8801 tasks; 99% of them finish within 10 s. Factory configuration:

{
    "manager-name": "jzhou24-hgg2",
    "max-workers": 32,
    "min-workers": 16,
    "workers-per-cycle": 16,
    "condor-requirements":"((has_vast))",
    "cores": 4,
    "memory": 10*1024,
    "disk": 50*1024,
    "timeout": 36000
}

That said, I think there is room to reduce the scheduling latency by combining several techniques:

  • Maybe allocate the resources for a task based on the resources currently available on the worker (combining the first and second steps)?
  • Track the globally usable cores across all workers, and consider a task only if there is at least one available core in the cluster.
  • Improve the proportional resource allocation.
@JinZhou5042 JinZhou5042 changed the title vine: scheduling inefficiency b/c task resource estimation and allocation vine: scheduling inefficiency b/c task resource pre-allocation usually fails Nov 27, 2024
@JinZhou5042
Member Author

The check fails mostly because too much disk is allocated.

@btovar
Member

btovar commented Dec 2, 2024

@JinZhou5042 What do you mean? Without cached files, and with each task using one core, the automatic allocation should be 2560 memory and 12800 disk. If this is not the case, then this might be a bug. How much is allocated per task so that the check fails? Is it because files are already cached at the worker and the disk available for the proportion is less?
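
(For context, these numbers appear to follow from the factory configuration above: a 4-core worker with 10*1024 MB of memory and 50*1024 MB of disk gives a one-core proportional share of

memory: 10240 MB / 4 cores = 2560 MB
disk:   51200 MB / 4 cores = 12800 MB

when the cache is empty.)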

@JinZhou5042
Member Author

JinZhou5042 commented Dec 2, 2024

I logged the resource allocation function from a run comprising ~8k tasks. The first figure shows that the manager made 458k task resource allocation attempts; 96.4% of them failed because the task was allocated more resources than were available and thus could not be scheduled on that particular worker.

[figure: allocation attempt outcomes]

The second figure breaks those unsuccessful attempts down into three failure types. Surprisingly, every single failure involves a disk overallocation, which suggests that the scheduler allocates disk too aggressively. The number of core overallocations equals the number of memory overallocations, probably reflecting ordinary contention: the worker is busy with other tasks and simply has no free resources left.

[figure: overallocation failures by resource type]

The log is attached here.

One typical failure looks like this:

task_id: 861

allocated_cores: 1
available_cores: 1

allocated_memory: 2560
available_memory: 2560

allocated_disk: 12683
available_disk: 12459

The worker has 1 core, 2560 memory, and 12459 disk available; however, the scheduler allocates 1 core, 2560 memory, and 12683 disk. The disk is overallocated, so the task has to be considered against another worker.

My hypothesis is that task outputs are not properly considered while allocating disk space.

@JinZhou5042
Member Author

Two possible improvements here:

  • Track the total number of available cores, and consider a task only if there is at least one usable core
  • Maybe don't allocate so much disk?

@JinZhou5042 JinZhou5042 self-assigned this Dec 10, 2024
@JinZhou5042 JinZhou5042 linked a pull request Dec 11, 2024 that will close this issue
@dthain
Member

dthain commented Dec 13, 2024

Per our discussion earlier this week, we decided that the fundamental issue is that the manager is attempting to allocate all of the disk space at the worker, which results in trouble when the cache expands.

Assuming the following:

T = total disk space at worker
C = disk used by worker cache
S_i = disk used by task sandbox i
A = available (unused) space at the worker

C + sum(S_i) + A = T

Then, when the manager does not have a good prediction of the space needed by a task, it should allocate the following:

T_disk = (A/2) * (T_cores / W_cores)

The result of this is that (in the absence of other information) the manager will seek to allocate one half of the available disk space, and give each task storage in proportion to the number of cores requested.

This will allow the cache (or other task sandboxes) some room to grow while still allocating at least one half of the total disk.
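
For instance, plugging in the numbers from the failed allocation logged above (roughly A = 12459 MB of free disk, a 1-core task on a 4-core worker):

T_disk = (12459 / 2) * (1 / 4) ≈ 1557 MB

which fits easily, rather than the 12683 MB the current proportional rule requested.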

@btovar
Member

btovar commented Dec 13, 2024

Regarding A = available (unused) space at the worker and (A/2):

I think A should be (total worker disk - disk in cache use). Using unused space is unfair to tasks scheduled to the worker later.

The division by two should be done after the fact, in case the proportion is larger than (total worker disk - disk in cache use - sandboxes). We do not want to divide by 2 first, because that hurts tasks that use, say, all of the cores at a worker.

Further, the divide by 2 can be applied to all resources when needed, not only disk.

With this, some allocations will fail, but I think that's ok. If the allocation does not fit, even after /2, I think that's a cache management issue rather than an allocation one.

For disk we could put a bound on the proportion. Say, the proportion should not be larger than 3 times the maximum observed sandbox, or something like that. If a task fails because of disk, then the maximum observed sandbox increases, and the task can be retried appropriately.
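
A minimal sketch of that combination (the function, parameter names, and the tracked max_observed_sandbox statistic are illustrative assumptions, not actual cctools code):

#include <stdint.h>

/* Illustrative only: proportional disk computed over (total - cache), halved
   after the fact if the guess does not fit, then capped at 3x the largest
   sandbox observed so far. */
static int64_t choose_task_disk(int64_t total_disk, int64_t cache_disk,
                                int64_t sandboxes_disk, int64_t task_cores,
                                int64_t worker_cores, int64_t max_observed_sandbox)
{
	int64_t proportion = (total_disk - cache_disk) * task_cores / worker_cores;
	int64_t free_disk  = total_disk - cache_disk - sandboxes_disk;

	if (proportion > free_disk) {
		proportion /= 2; /* divide by two only after the proportion is computed */
	}
	if (max_observed_sandbox > 0 && proportion > 3 * max_observed_sandbox) {
		proportion = 3 * max_observed_sandbox; /* bound the guess by observed sandbox sizes */
	}
	return proportion;
}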

@dthain
Member

dthain commented Dec 13, 2024

No, the divide by two is specific to the disk resource, and should not be applied to the others.

All the other resource types (cores, gpus, memory) are used only by tasks. And there is no harm in consuming all of the resource. Once the task is done, the resource is freed.

Disk is different because it is consumed by both the cache directory as well as the task sandboxes. And so our objective is to leave some disk unused so that the cache directory can grow.

@btovar
Member

btovar commented Dec 13, 2024

The divide by two for other resources is simply making a less conservative guess. What Jin is observing with failed allocations because of disk also happens with memory when tasks specify different quantities for other resources (i.e. different categories): a proportion may dictate using more memory than is available, but the proportion is only a guess, and tasks may run fine with less memory.

I would prefer the allocation code to deal only with allocations. Leaving space for the cache to grow sounds to me like a storage policy. The problem with implementing it in the proportional allocations is that they only kick in when a resource is not specified; if I specify disk, then the available/2 does not come into play for the cache. Further, using available space for disk may give the wrong proportion for the other resources: if a task specifies using 100% of the available disk, then it will want to use 100% of the cores.

If we really always want to divide the proportional disk allocation by two, it should be done after the proportional value has been computed, and the proportion should be computed over (worker total - in cache). We let the scheduling checks reject the allocation if it is too large.
