Issue #719 paralleljobthreading #723

Open · wants to merge 22 commits into master
Conversation

@HansVRP (Contributor) commented Feb 3, 2025

No description provided.

@HansVRP (Contributor Author) commented Feb 3, 2025

@soxofaan @jdries

I also did a small integration test with 10 jobs.

Before, we needed 2 minutes and 10 seconds to start all jobs:

(screenshot)

Now it is reduced to 22 seconds. Notice how initially 5 jobs were started together (equal to the maximum size of the thread pool):

(screenshot)

@HansVRP requested review from soxofaan and jdries on February 3, 2025, 14:34
@soxofaan (Member) left a comment


some initial feedback

openeo/extra/job_management/__init__.py — 4 resolved review comments on outdated code
def job_worker(i, backend_name):
    with semaphore:
        try:
            self._launch_job(start_job, not_started, i, backend_name, stats)
Member:

Shouldn't the db lock also apply to this launch job callable (because _launch_job has access to the db dataframe)?

However, this points to a bit of a problem: if you lock around _launch_job, then you effectively lose parallelism again.

Member:

Unless the db lock is only to protect the job_db.persist calls. But then _launch_job should be given a read-only version of the dataframe row. I'm not sure if that is compatible with how users use _launch_job.
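
A minimal sketch of that lock scoping, using made-up stand-ins (df, db_lock, persist, launch_job) rather than the actual MultiBackendJobManager internals: the launch callable only ever sees a copy of its row, and only the shared-dataframe update plus the persist step is serialized.

import threading
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

# Hypothetical stand-ins for the job manager's shared state (not the real API):
df = pd.DataFrame({"backend_name": ["foo", "foo", "bar"], "status": ["not_started"] * 3})
db_lock = threading.Lock()  # guards only the shared dataframe / persist step

def persist(dataframe: pd.DataFrame) -> None:
    # placeholder for job_db.persist(); e.g. write the dataframe to disk
    pass

def launch_job(row: pd.Series) -> str:
    # user-provided start logic: it receives a copy, so it cannot mutate shared state
    return f"job-{row.name}"

def job_worker(i: int) -> None:
    row = df.loc[i].copy()  # read-only hand-off to the launch callable
    job_id = launch_job(row)
    with db_lock:  # only the shared-state update is locked
        df.loc[i, "status"] = "queued"
        df.loc[i, "id"] = job_id
        persist(df)

with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(job_worker, df.index))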

Contributor Author:

I indeed made the db lock purely for the persist calls. Compatible in what way?

@soxofaan (Member) Feb 3, 2025

I'm not sure if that is compatible with how users use _launch_job.

Maybe some users update some fields in the pandas row from within their _launch_job implementation, expecting it to be persisted. But guaranteeing that undermines the opportunity for thread-safe and effective parallelism.

@HansVRP (Contributor Author) Feb 3, 2025

So do we then want to avoid the locking, which may lead to concurrency issues?

Or will we not support altering the _launch_job functionality and document that changes to the dataframe must occur within the persist function?

@soxofaan (Member) Feb 4, 2025

do we then want to avoid the locking

Indeed, the goal of this feature is to exploit parallelism for more efficient use of time, and every lock that would be needed to ensure consistency undermines that goal. Instead of sharing state (e.g. pandas dataframes) between threads (which requires locks), I think we should aim for a design with as little state sharing (e.g. dataframes) as possible between the main thread and the worker threads.

For example, as the scope here is mainly to offload job starting to side-threads, these threads should be able to do their work given just the job id as a single string (and probably a valid access token, again one string, to address auth). All the other state/objects drag in concurrency risks.
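
A rough sketch of that minimal-state direction, with purely illustrative names and endpoint (start_job_on_backend, the token value, the example URL): each worker thread only receives two strings and returns the job id, while all dataframe bookkeeping stays in the main thread.

from concurrent.futures import ThreadPoolExecutor

import requests

def start_job_on_backend(job_id: str, access_token: str) -> str:
    # The worker only ever sees two strings: the job id and a valid bearer token.
    # (Illustrative plain REST call; the real client would go through openeo.Connection.)
    resp = requests.post(
        f"https://openeo.example/jobs/{job_id}/results",
        headers={"Authorization": f"Bearer {access_token}"},
        timeout=60,
    )
    resp.raise_for_status()
    return job_id

job_ids = ["j-001", "j-002", "j-003"]  # collected by the main thread
token = "example-access-token"  # obtained/refreshed by the main thread

with ThreadPoolExecutor(max_workers=5) as pool:
    futures = [pool.submit(start_job_on_backend, j, token) for j in job_ids]
    started = [f.result() for f in futures]  # main thread updates the dataframe afterwards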

@HansVRP requested a review from soxofaan on February 4, 2025, 12:10
@HansVRP (Contributor Author) commented Feb 4, 2025

Made an update on how the threading and queuing work together. I believe we now no longer send out batches of jobs, but continuously add jobs whenever a thread becomes available.

I did have to add a lock to ensure the unit tests would pass.
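
For reference, a bare-bones version of such a queue-plus-worker-threads setup, with invented names (start_single_job, MAX_WORKERS) rather than the PR's actual code: job indices are fed continuously and each one is picked up as soon as a worker thread is free.

import queue
import threading

MAX_WORKERS = 5  # size of the worker pool
job_queue = queue.Queue()

def start_single_job(i: int) -> None:
    # placeholder for the actual per-job start logic
    print(f"starting job {i}")

def worker() -> None:
    while True:
        i = job_queue.get()
        if i is None:  # sentinel: shut this worker down
            job_queue.task_done()
            break
        try:
            start_single_job(i)
        finally:
            job_queue.task_done()

threads = [threading.Thread(target=worker, daemon=True) for _ in range(MAX_WORKERS)]
for t in threads:
    t.start()

# Feed job indices as they become eligible, instead of submitting fixed-size batches.
for i in range(10):
    job_queue.put(i)

job_queue.join()  # block until every queued job has been handled
for _ in threads:
    job_queue.put(None)  # stop the workers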

@HansVRP (Contributor Author) commented Feb 5, 2025

The lock indeed causes a bottleneck in the queue implementation:

(screenshot)

Will need to discuss how best to proceed.

soxofaan added a commit that referenced this pull request Feb 10, 2025
@soxofaan changed the title from Issue717 paralleljobthreading to Issue #719 paralleljobthreading on Feb 10, 2025
@soxofaan linked an issue on Feb 10, 2025 that may be closed by this pull request
Labels: None yet
Projects: None yet
Development: Successfully merging this pull request may close these issues: JobManager: create & start in parallel
2 participants