
JobManager: create & start in parallel #719

Open
jdries opened this issue Jan 31, 2025 · 10 comments
@jdries
Collaborator

jdries commented Jan 31, 2025

Creating and starting a job takes some time, which means there's an interval during which the job manager is creating new jobs while resources potentially sit unused.
If we can start jobs in parallel, we can shorten that interval.
Do note that we typically have rate limiting in place on backends, so we have to be resilient to that.
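The rate-limit resilience mentioned here could be sketched as a retry with exponential backoff. Everything below is illustrative, not the actual client code: `RateLimited` stands in for a backend HTTP 429 response, and `start_job` is whatever callable performs the real request.

```python
import random
import time


class RateLimited(Exception):
    """Stand-in for a backend HTTP 429 (rate limit) response."""


def start_with_backoff(start_job, max_attempts=5, base_delay=1.0):
    """Call start_job, retrying with exponential backoff when rate-limited."""
    for attempt in range(max_attempts):
        try:
            return start_job()
        except RateLimited:
            # Wait base_delay * 2^attempt plus a little jitter before retrying.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError("backend kept rate-limiting us")
```

With parallel starts, each worker would wrap its start call like this so a burst of 429s slows the workers down instead of failing the batch.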

@HansVRP
Contributor

HansVRP commented Jan 31, 2025

@jdries @soxofaan could we run the start job across multiple threads, e.g.:

    from threading import Thread

    threads = []
    for backend_name in self.backends:
        backend_load = per_backend.get(backend_name, 0)
        available_slots = ...  # e.g. the per-backend job limit minus backend_load
        for i in not_started.index[:available_slots]:
            thread = Thread(target=self._launch_job, args=(...))
            thread.start()
            threads.append((thread, i))

    for thread, i in threads:
        thread.join()

Another option might be to look into asyncio, which would run a single thread but jump between the various job creations/starts
without waiting. asyncio might be the more scalable solution in case we want to start 100s of jobs at once.
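For reference, the asyncio variant might look like the sketch below. `create_and_start` is a made-up coroutine that simulates a slow create+start request with `asyncio.sleep`; it is not an openeo API call.

```python
import asyncio


async def create_and_start(job_id: int) -> str:
    # Simulated network latency of one create+start request.
    await asyncio.sleep(0.1)
    return f"job-{job_id} started"


async def start_all(n: int) -> list:
    # Launch all job starts concurrently on a single thread.
    return await asyncio.gather(*(create_and_start(i) for i in range(n)))


results = asyncio.run(start_all(5))
```

Because `asyncio.gather` interleaves the waits, the 5 simulated starts complete in roughly the time of one, and the results come back in submission order.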

@HansVRP HansVRP self-assigned this Jan 31, 2025
@soxofaan
Member

soxofaan commented Jan 31, 2025

Threading won't work I'm afraid, because in Python only one thread can be active at a time and the current "start_job" requests are blocking. So the execution would not actually be parallel.

We would have to use a non-blocking request library like https://www.python-httpx.org, or use multiprocessing, to get effective parallelism.

multiprocessing might be the easiest route for now (I'm not so sure how easy it would be to switch from our classic "requests"-based implementation to httpx).

@soxofaan
Member

> Another option might be to look into asyncio, which would run a single thread but jump between the various job creations/starts without waiting. asyncio might be the more scalable solution in case we want to start 100s of jobs at once.

Indeed, that would probably be a more modern approach, but it's not trivial to migrate everything (or at least a well chosen subset) we already have to this new paradigm

@HansVRP
Contributor

HansVRP commented Jan 31, 2025

Reading a bit deeper into it: if we really want full performance, we need to make sure that all network requests, database queries, etc. can run asynchronously.

This might make the code overly complex, given that as a standard we only support 2 parallel jobs...

@soxofaan
Member

> Threading won't work I'm afraid, because in Python only one thread can be active at a time and the current "start_job" requests are blocking. So the execution would not actually be parallel.

Ok, I did some testing with requests in threads, and apparently it does work to do requests in parallel that way. I was probably confusing it with another threading problem I had before.
So yes, basic threading is probably the easiest solution here.
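That matches how CPython behaves: the GIL is released while a thread waits on blocking I/O, so several blocking calls in separate threads finish in roughly the time of one. A quick sanity check, with `time.sleep` (which also releases the GIL) standing in for the blocking HTTP request:

```python
import time
from threading import Thread


def fake_start_job():
    # time.sleep releases the GIL, just like a blocking socket read does.
    time.sleep(0.5)


threads = [Thread(target=fake_start_job) for _ in range(5)]
t0 = time.monotonic()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - t0  # roughly 0.5s, not 5 * 0.5s
```

Only CPU-bound work in pure Python fails to parallelize this way; I/O-bound work like starting jobs over HTTP is exactly the case where threads help.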

@HansVRP
Contributor

HansVRP commented Jan 31, 2025

Do we know the upper limit on the number of threads we could use? Being able to add 20 jobs in parallel would already make a big difference; LCFM would probably prefer 100 at once.

@soxofaan
Member

Note that doing too much in parallel might also be counter-productive:
flooding/saturating the openeo workers and resources, triggering rate-limit middleware errors that are not retried by the client (because they don't follow openeo conventions), etc.

I would default to something like 5, and maybe scale up a bit if you know what you are doing.

Threading tutorials typically point to thread pools (with a fixed limit) as an easy way to enforce this.
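With `concurrent.futures` from the standard library, capping parallelism at 5 is essentially a one-liner. The `launch` function below is a placeholder for the real start-job request, not the job manager's actual `_launch_job`:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def launch(job_id: int) -> str:
    # Placeholder for the real (blocking) start-job request.
    return f"started {job_id}"


# At most 5 job starts run concurrently; further submissions queue up
# inside the executor until a worker thread frees up.
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = {pool.submit(launch, i): i for i in range(20)}
    results = [f.result() for f in as_completed(futures)]
```

`as_completed` yields results as jobs finish (not in submission order), which fits the job manager's pattern of reacting to each started job individually.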

@HansVRP
Contributor

HansVRP commented Feb 3, 2025

Before, we needed 2 minutes and 10 seconds to start all jobs:

(screenshot: job start timeline before the change)

Now it is reduced to 22 seconds. Notice how initially 5 jobs were started together (equal to the thread pool maximum):

(screenshot: job start timeline after the change)

@JorisCod

JorisCod commented Feb 3, 2025

Just to be sure: is this with the STAC-based implementation of the job manager?

Pandas dataframes are not thread-safe, so unless you explicitly add locks, this might run awry at scale.
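One way to guard a shared dataframe is a single lock around every mutation. This is a sketch of the idea, not the job manager's actual code; the dataframe, column name, and `mark_started` helper are all made up:

```python
import threading

import pandas as pd

df = pd.DataFrame({"status": ["not_started"] * 10})
df_lock = threading.Lock()


def mark_started(i: int) -> None:
    # Every writer takes the same lock, so .loc updates never interleave.
    with df_lock:
        df.loc[i, "status"] = "started"


threads = [threading.Thread(target=mark_started, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The lock serializes the dataframe writes while the slow part (the HTTP request) can still happen outside the critical section, so the parallelism gain is preserved.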

@soxofaan
Member

soxofaan commented Feb 4, 2025

Same concern here (and in PR #723) about pandas. While pandas as a kind of database API was handy in the proof-of-concept phase of the job manager, I have the feeling it is now actually making progress harder than it needs to be. We had various struggles in the past when implementing new job manager features, and it is now making threading-based features quite challenging. I think we should try to get/keep pandas out of the code paths that we want to run in threads.
