
vine: scheduling inefficiency b/c task resource pre-allocation usually fails #3995

Open
JinZhou5042 opened this issue Nov 27, 2024 · 8 comments · May be fixed by #4006

@JinZhou5042
Member

In check_worker_against_task, we check whether a task is able to run on a worker. This involves two essential steps (a simplified sketch follows the list):

  1. Estimate the resources to be allocated for this task, in vine_manager_choose_resources_for_task. By default, we use a proportional technique that chooses the maximum proportional cpu/memory/disk/gpu allocation for the task.
  2. Check whether the chosen resources fit the actual worker, in check_worker_have_enough_resources. This compares the chosen cpu/memory/disk/gpu against the resources currently free on the worker.
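
For reference, a simplified sketch of that two-step flow (the types, field names, and proportional rule below are illustrative, not the actual cctools code):

#include <stdint.h>

struct task_resources {
	int64_t cores;
	int64_t memory; /* MB */
	int64_t disk;   /* MB */
	int64_t gpus;
};

/* Step 1 (illustrative): choose a proportional share of the worker's total
   resources, e.g. a 1-core task on a 4-core worker gets 1/4 of its memory and disk. */
static struct task_resources choose_resources_for_task(struct task_resources worker_total, int64_t task_cores)
{
	struct task_resources chosen;
	chosen.cores  = task_cores;
	chosen.memory = worker_total.memory * task_cores / worker_total.cores;
	chosen.disk   = worker_total.disk   * task_cores / worker_total.cores;
	chosen.gpus   = 0;
	return chosen;
}

/* Step 2 (illustrative): the chosen allocation must fit what is currently free
   on the worker; if any dimension does not fit, the task is skipped on this
   worker and considered against the next one. */
static int worker_has_enough_resources(struct task_resources chosen, struct task_resources available)
{
	return chosen.cores  <= available.cores
	    && chosen.memory <= available.memory
	    && chosen.disk   <= available.disk
	    && chosen.gpus   <= available.gpus;
}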

However, the chosen resources are usually larger than what is currently available on the worker, so the second step fails repeatedly until some tasks complete on the worker and release resources. Because the first step tends to choose a larger share of resources than is actually available, this scheduling overhead seems to dominate the latency. I observed this by printing debug messages in the terminal.

To make this clearer, I separately ran a small version of DV5 with the current code (without any change) and with my test version (which selects a much smaller portion of resources in vine_manager_choose_resources_for_task). The results below show the differences:

  • current version: resource allocation is expensive and tasks run sparsely
    [screenshot]

  • test version: resource allocation mostly succeeds and task concurrency looks good
    [screenshot]

In total there are 8801 tasks; 99% of them finish within 10 s. Factory configuration:

{
    "manager-name": "jzhou24-hgg2",
    "max-workers": 32,
    "min-workers": 16,
    "workers-per-cycle": 16,
    "condor-requirements":"((has_vast))",
    "cores": 4,
    "memory": 10*1024,
    "disk": 50*1024,
    "timeout": 36000
}

That said, I think there is room to reduce the scheduling latency by combining several techniques:

  • Maybe allocate the resources for a task based on the resources currently available on the worker (combining the first and second steps)?
  • Track the globally usable cores across all workers, and consider a task only if there is at least one available core in the cluster.
  • Improve the proportional resource allocation.
@JinZhou5042 JinZhou5042 changed the title vine: scheduling inefficiency b/c task resource estimation and allocation vine: scheduling inefficiency b/c task resource pre-allocation usually fails Nov 27, 2024
@JinZhou5042
Member Author

The check fails mostly because too much disk is allocated.

@btovar
Member

btovar commented Dec 2, 2024

@JinZhou5042 What do you mean? Without cached files, and with each task using one core, the automatic allocation should be 2560 memory and 12800 disk. If this is not the case, then this might be a bug. How much is allocated per task so that the check fails? Is it because files are already cached at the worker and the disk available for the proportion is less?
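
(For context, these numbers appear to follow from the factory configuration above: a 4-core worker with 10*1024 MB of memory and 50*1024 MB of disk gives a one-core proportional share of

memory: 10240 MB / 4 cores = 2560 MB
disk:   51200 MB / 4 cores = 12800 MB

when the cache is empty.)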

@JinZhou5042
Member Author

JinZhou5042 commented Dec 2, 2024

I logged the resource allocation function from a run comprising ~8k tasks. The first figure shows that the manager made 458k task resource allocation attempts; 96.4% of them failed because the task was allocated more resources than were available and thus could not be scheduled on that particular worker.

[figure: allocation attempt outcomes]

The second figure breaks those unsuccessful attempts down into three failure types. Surprisingly, every single failure involves a disk overallocation, which suggests that the scheduler allocates disk too aggressively. The number of core overallocations equals the number of memory overallocations, probably reflecting ordinary contention: the worker is busy with other tasks and simply has no free resources left.

[figure: overallocation failures by resource type]

The log is attached here.

One typical failure looks like this:

task_id: 861

allocated_cores: 1
available_cores: 1

allocated_memory: 2560
available_memory: 2560

allocated_disk: 12683
available_disk: 12459

The worker has 1 core, 2560 memory, and 12459 disk available; however, the scheduler allocates 1 core, 2560 memory, and 12683 disk. The disk is overallocated, so the task has to be considered against another worker.

My hypothesis is that task outputs are not properly considered while allocating disk space.

@JinZhou5042
Member Author

Two possible improvements here:

  • Track the total number of available cores, and consider a task only if there is at least one usable core
  • Maybe don't allocate so much disk?

@JinZhou5042 JinZhou5042 self-assigned this Dec 10, 2024
@JinZhou5042 JinZhou5042 linked a pull request Dec 11, 2024 that will close this issue
@dthain
Member

dthain commented Dec 13, 2024

Per our discussion earlier this week, we decided that the fundamental issue is that the manager is attempting to allocate all of the disk space at the worker, which results in trouble when the cache expands.

Assuming the following:

T = total disk space at worker
C = disk used by worker cache
S_i = disk used by task sandbox i
A = available (unused) space at the worker

C + sum(S_i) + A = T

Then, when the manager does not have a good prediction of the space needed by a task, it should allocate the following:

T_disk = (A/2) * (T_cores / W_cores)

The result of this is that (in the absence of other information) the manager will seek to allocate one half of the available disk space, and give each task storage in proportion to the number of cores requested.

This will allow the cache (or other task sandboxes) some room to grow while still allocating at least one half of the total disk.
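
For instance, plugging in the numbers from the failed allocation logged above (roughly A = 12459 MB of free disk, a 1-core task on a 4-core worker):

T_disk = (12459 / 2) * (1 / 4) ≈ 1557 MB

which fits easily, rather than the 12683 MB the current proportional rule requested.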

@btovar
Member

btovar commented Dec 13, 2024

Regarding A = available (unused) space at the worker and (A/2):

I think A should be (total worker disk - disk in cache use). Using unused space is unfair to tasks scheduled to the worker later.

The division by two should be done after the fact, in case the proportion is larger than (total worker disk - disk in cache use - sandboxes). We do not want to divide by 2 first, because that hurts tasks that use, say, all of the cores at a worker.

Further, the divide by 2 can be applied to all resources when needed, not only disk.

With this, some allocations will fail, but I think that's ok. If the allocation does not fit, even after /2, I think that's a cache management issue rather than an allocation one.

For disk we could put a bound on the proportion. Say, the proportion should not be larger than 3 times the maximum observed sandbox, or something like that. If a task fails because of disk, then the maximum observed sandbox increases, and the task can be retried appropriately.
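
A minimal sketch of that combination (the function, parameter names, and the tracked max_observed_sandbox statistic are illustrative assumptions, not actual cctools code):

#include <stdint.h>

/* Illustrative only: proportional disk computed over (total - cache), halved
   after the fact if the guess does not fit, then capped at 3x the largest
   sandbox observed so far. */
static int64_t choose_task_disk(int64_t total_disk, int64_t cache_disk,
                                int64_t sandboxes_disk, int64_t task_cores,
                                int64_t worker_cores, int64_t max_observed_sandbox)
{
	int64_t proportion = (total_disk - cache_disk) * task_cores / worker_cores;
	int64_t free_disk  = total_disk - cache_disk - sandboxes_disk;

	if (proportion > free_disk) {
		proportion /= 2; /* divide by two only after the proportion is computed */
	}
	if (max_observed_sandbox > 0 && proportion > 3 * max_observed_sandbox) {
		proportion = 3 * max_observed_sandbox; /* bound the guess by observed sandbox sizes */
	}
	return proportion;
}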

@dthain
Member

dthain commented Dec 13, 2024

No, the divide by two is specific to the disk resource, and should not be applied to the others.

All the other resource types (cores, gpus, memory) are used only by tasks. And there is no harm in consuming all of the resource. Once the task is done, the resource is freed.

Disk is different because it is consumed by both the cache directory as well as the task sandboxes. And so our objective is to leave some disk unused so that the cache directory can grow.

@btovar
Member

btovar commented Dec 13, 2024

The divide by two for other resources is simply making a less conservative guess. What Jin is observing with failed allocations because of disk also happens with memory when tasks specify different quantities for other resources (i.e. different categories): a proportion may dictate using more memory than is available, but the proportion is only a guess, and tasks may run fine with less memory.

I would prefer the allocation code to deal only with allocations. Leaving space for the cache to grow sounds to me like a storage policy. The problem with implementing it in the proportional allocations is that they only kick in when a resource is not specified; if I specify disk, then the available/2 does not come into play for the cache. Further, using available space for disk may give the wrong proportion for the other resources: if a task specifies using 100% of the available disk, then it will want to use 100% of the cores.

If we really always want to divide the proportional disk allocation by two, it should be done after the proportional value has been computed, and the proportion should be computed over (worker total - in cache). We let the scheduling checks reject the allocation if it is too large.
