
Rebalance JWDs #891

Merged
kysrpex merged 1 commit into usegalaxy-eu:master from rebalance_jwds on Sep 6, 2023

Conversation

@kysrpex (Contributor) commented on Sep 6, 2023

Have a look at the storage stats: jwd05e is almost full while jwd02f is almost empty, despite both having the same weight (something must be wrong with the job distribution).

I spent some time this morning familiarizing myself with the mechanism that chooses which storage backend new jobs are sent to, and at first glance the code looks correct, so I do not know what is causing the skew. There is a mechanism to exclude storage backends that are almost full, but the feature seems to be unfinished: I tried it, and some storage backends (e.g. S3) do not implement the function that computes the free space.
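
For context, weight-proportional backend selection with an "exclude almost-full backends" guard generally looks something like the sketch below. This is only an illustration of the mechanism described above, not Galaxy's actual implementation; the `Backend` class, `pick_backend`, and the 10% free-space threshold are made up for this example.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Backend:
    name: str
    weight: float
    # Free space as a fraction of capacity, or None when the backend
    # cannot report it (the comment above mentions S3 as an example).
    free_fraction: Optional[float] = None


def pick_backend(backends: list[Backend], min_free: float = 0.10) -> Backend:
    """Pick a backend with probability proportional to its weight,
    skipping backends that report less than `min_free` free space."""
    candidates = [
        b for b in backends
        # Backends without a free-space implementation are never excluded,
        # which is one way an "exclude when almost full" feature can end up
        # only partially effective.
        if b.free_fraction is None or b.free_fraction >= min_free
    ]
    if not candidates:
        raise RuntimeError("all storage backends are (almost) full")
    return random.choices(candidates, weights=[b.weight for b in candidates], k=1)[0]


# Equal weights should yield a ~50/50 split of new jobs; the weights from
# this PR should yield roughly 70/30 in favour of jwd02f.
backends = [Backend("jwd02f", weight=70), Backend("jwd05e", weight=30)]
counts = {b.name: 0 for b in backends}
for _ in range(10_000):
    counts[pick_backend(backends).name] += 1
print(counts)  # e.g. {'jwd02f': 7031, 'jwd05e': 2969}
```

With a scheme like this, an almost-full backend only drops out of the draw if it can report its free space; backends that cannot report it (as noted above for S3) are never protected by the guard.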

For a few hours now (since about 13:08) we have been operating with the configuration from this PR without issues: sending 70% of jobs to jwd02f and 30% to jwd05e.

The change seems to work properly.

```console
$ journalctl -u "galaxy-handler@*" --since "2023-09-06 13:08:00" | grep "files23" | wc -l
5090
$ journalctl -u "galaxy-handler@*" --since "2023-09-06 13:08:00" | grep "files24" | wc -l
2066
```
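
As a rough sanity check of those counts (assuming, purely for this illustration, that the `files23` matches correspond to jwd02f and the `files24` matches to jwd05e; the mapping is not spelled out above):

```python
# journalctl counts from above; the files23 -> jwd02f and files24 -> jwd05e
# mapping is an assumption made only for this back-of-the-envelope check.
jwd02f_jobs, jwd05e_jobs = 5090, 2066
total = jwd02f_jobs + jwd05e_jobs

print(f"jwd02f share: {jwd02f_jobs / total:.1%}")  # 71.1%, target 70%
print(f"jwd05e share: {jwd05e_jobs / total:.1%}")  # 28.9%, target 30%
```

That is roughly a 71/29 split against the configured 70/30, although job counts are only a coarse proxy for storage usage, since jobs differ in how much data they write.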

Which is arguably weird, because if the weighting works, the storage distribution should not have been skewed in the first place when both backends had equal weights. Maybe we are storing something in jwd05e that we should not be storing?

If you want, we can keep this in place for a day or two and then revert once we find out what is going on; otherwise we may hit the storage limit soon. Sadly, I have already cleaned up the job working directories of failed jobs 😞.

Send 70% of jobs to jwd02f and 30% to jwd05e.
@kysrpex kysrpex added the bug label Sep 6, 2023
@kysrpex kysrpex self-assigned this Sep 6, 2023
@bgruening (Member)

@jmchilton do you maybe have an idea here? We are running 23.1.

@kysrpex kysrpex merged commit 8d5d31f into usegalaxy-eu:master Sep 6, 2023
2 checks passed
@kysrpex kysrpex deleted the rebalance_jwds branch September 6, 2023 14:32
@kysrpex (Contributor, Author) commented on Sep 6, 2023

I was too quick to say "without issues"; let's keep an eye on this. Typically, storage speed problems (which could arise from this PR) lead to high numbers of unprocessed jobs rather than processed jobs. It could just be heavy load on the cluster, but as said, let's keep in mind that this is happening.

@kysrpex kysrpex mentioned this pull request Sep 7, 2023