
JobManager: create & start in parallel #719

Open
jdries opened this issue Jan 31, 2025 · 10 comments
@jdries
Collaborator

jdries commented Jan 31, 2025

Creating and starting a job takes some time, which means there's an interval during which the job manager is creating new jobs while resources potentially sit unused.
If we can start jobs in parallel, we can shorten that interval.
Do note that we typically have rate limiting in place on backends, so we have to be resilient to that.
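The rate-limit resilience mentioned here could be sketched as a retry with exponential backoff. Everything below is illustrative, not the actual client code: `RateLimited` stands in for a backend HTTP 429 response, and `start_job` is whatever callable performs the real request.

```python
import random
import time


class RateLimited(Exception):
    """Stand-in for a backend HTTP 429 (rate limit) response."""


def start_with_backoff(start_job, max_attempts=5, base_delay=1.0):
    """Call start_job, retrying with exponential backoff when rate-limited."""
    for attempt in range(max_attempts):
        try:
            return start_job()
        except RateLimited:
            # Wait base_delay * 2^attempt plus a little jitter before retrying.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError("backend kept rate-limiting us")
```

With parallel starts, each worker would wrap its start call like this so a burst of 429s slows the workers down instead of failing the batch.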

@HansVRP
Contributor

HansVRP commented Jan 31, 2025

@jdries @soxofaan could we run the start job across multiple threads, e.g.:

    from threading import Thread

    threads = []
    for backend_name in self.backends:
        backend_load = per_backend.get(backend_name, 0)
        available_slots = ...  # e.g. the per-backend job limit minus backend_load
        for i in not_started.index[:available_slots]:
            thread = Thread(target=self._launch_job, args=(...))
            thread.start()
            threads.append((thread, i))

    for thread, i in threads:
        thread.join()

Another option might be to look into asyncio, which would run a single thread but jump between the various job creations/starts
without waiting. asyncio might be the more scalable solution in case we want to start 100s of jobs at once.
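For reference, the asyncio variant might look like the sketch below. `create_and_start` is a made-up coroutine that simulates a slow create+start request with `asyncio.sleep`; it is not an openeo API call.

```python
import asyncio


async def create_and_start(job_id: int) -> str:
    # Simulated network latency of one create+start request.
    await asyncio.sleep(0.1)
    return f"job-{job_id} started"


async def start_all(n: int) -> list:
    # Launch all job starts concurrently on a single thread.
    return await asyncio.gather(*(create_and_start(i) for i in range(n)))


results = asyncio.run(start_all(5))
```

Because `asyncio.gather` interleaves the waits, the 5 simulated starts complete in roughly the time of one, and the results come back in submission order.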

@HansVRP HansVRP self-assigned this Jan 31, 2025
@soxofaan
Member

soxofaan commented Jan 31, 2025

Threading won't work I'm afraid, because in Python only one thread can be active at a time and the current "start_job" requests are blocking. So the execution would not actually be parallel.

We would have to use a non-blocking request library like https://www.python-httpx.org, or use multiprocessing, to get effective parallelism.

multiprocessing might be the easiest route for now (I'm not so sure how easy it would be to switch from our classic "requests"-based implementation to httpx).

@soxofaan
Member

> Another option might be to look into asyncio, which would run a single thread but jump between the various job creations/starts without waiting. asyncio might be the more scalable solution in case we want to start 100s of jobs at once.

Indeed, that would probably be a more modern approach, but it's not trivial to migrate everything (or at least a well chosen subset) we already have to this new paradigm

@HansVRP
Contributor

HansVRP commented Jan 31, 2025

Reading a bit deeper into it: if we really want full performance, we need to make sure that all network requests, database queries, etc. can run asynchronously.

This might make the code overly complex, given that as a standard we only support 2 parallel jobs...

@soxofaan
Member

> Threading won't work I'm afraid, because in Python only one thread can be active at a time and the current "start_job" requests are blocking. So the execution would not actually be parallel.

Ok, I did some testing with requests in threads, and apparently it does work to do requests in parallel that way. I was probably confusing it with another threading problem I had before.
So yes, basic threading is probably the easiest solution here.
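That matches how CPython behaves: the GIL is released while a thread waits on blocking I/O, so several blocking calls in separate threads finish in roughly the time of one. A quick sanity check, with `time.sleep` (which also releases the GIL) standing in for the blocking HTTP request:

```python
import time
from threading import Thread


def fake_start_job():
    # time.sleep releases the GIL, just like a blocking socket read does.
    time.sleep(0.5)


threads = [Thread(target=fake_start_job) for _ in range(5)]
t0 = time.monotonic()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - t0  # roughly 0.5s, not 5 * 0.5s
```

Only CPU-bound work in pure Python fails to parallelize this way; I/O-bound work like starting jobs over HTTP is exactly the case where threads help.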

@HansVRP
Contributor

HansVRP commented Jan 31, 2025

Do we know the upper limit on the number of threads we could use? Being able to add 20 jobs in parallel would already make a big difference; LCFM would probably prefer 100 at once.

@soxofaan
Member

Note that doing too much in parallel might also be counter-productive:
flooding/saturating the openeo workers and resources, triggering rate-limit middleware errors that are not retried by the client (because they don't follow openeo conventions), etc.

I would default to something like 5, and maybe scale up a bit if you know what you are doing.

Threading tutorials typically point to thread pools (with a fixed limit) as an easy way to enforce this.
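With `concurrent.futures` from the standard library, capping parallelism at 5 is essentially a one-liner. The `launch` function below is a placeholder for the real start-job request, not the job manager's actual `_launch_job`:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def launch(job_id: int) -> str:
    # Placeholder for the real (blocking) start-job request.
    return f"started {job_id}"


# At most 5 job starts run concurrently; further submissions queue up
# inside the executor until a worker thread frees up.
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = {pool.submit(launch, i): i for i in range(20)}
    results = [f.result() for f in as_completed(futures)]
```

`as_completed` yields results as jobs finish (not in submission order), which fits the job manager's pattern of reacting to each started job individually.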

@HansVRP
Contributor

HansVRP commented Feb 3, 2025

Before, we needed 2 minutes and 10 seconds to start all jobs:

(screenshot: job start timeline before the change)

Now it is reduced to 22 seconds. Notice how initially 5 jobs were started together (equal to the thread pool maximum):

(screenshot: job start timeline after the change)

@JorisCod

JorisCod commented Feb 3, 2025

Just to be sure: is this with the STAC-based implementation of the job manager?

Pandas dataframes are not thread-safe, so unless you explicitly add locks, this might run awry at scale.
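One way to guard a shared dataframe is a single lock around every mutation. This is a sketch of the idea, not the job manager's actual code; the dataframe, column name, and `mark_started` helper are all made up:

```python
import threading

import pandas as pd

df = pd.DataFrame({"status": ["not_started"] * 10})
df_lock = threading.Lock()


def mark_started(i: int) -> None:
    # Every writer takes the same lock, so .loc updates never interleave.
    with df_lock:
        df.loc[i, "status"] = "started"


threads = [threading.Thread(target=mark_started, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The lock serializes the dataframe writes while the slow part (the HTTP request) can still happen outside the critical section, so the parallelism gain is preserved.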

@soxofaan
Member

soxofaan commented Feb 4, 2025

Same concern here (and in PR #723) about pandas. While pandas as a kind of database API was handy in the proof-of-concept phase of the job manager, I have the feeling it is now actually making progress harder than it needs to be. We had various struggles in the past when implementing new job manager features, and it is now making threading-based features quite challenging. I think we should try to get/keep pandas out of the code paths that we want to run in threads.
