
Rebalance JWDs #891

Merged
kysrpex merged 1 commit into usegalaxy-eu:master from rebalance_jwds on Sep 6, 2023

Conversation

@kysrpex (Contributor) commented on Sep 6, 2023

Have a look at the storage stats: jwd05e is almost full while jwd02f is almost empty, despite both having the same weight (something must be wrong with the job distribution).

I spent some time this morning familiarizing myself with the mechanism that chooses which storage backend new jobs are sent to, and at first glance the code looks correct, so I do not know what is causing the skew. There is a mechanism to exclude storage backends that are almost full, but the feature seems to be unfinished: I tried it, and some storage backends (e.g. S3) do not implement the function that computes the free space.
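
For context, weight-proportional backend selection with an "exclude almost-full backends" guard generally looks something like the sketch below. This is only an illustration of the mechanism described above, not Galaxy's actual implementation; the `Backend` class, `pick_backend`, and the 10% free-space threshold are made up for this example.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Backend:
    name: str
    weight: float
    # Free space as a fraction of capacity, or None when the backend
    # cannot report it (the comment above mentions S3 as an example).
    free_fraction: Optional[float] = None


def pick_backend(backends: list[Backend], min_free: float = 0.10) -> Backend:
    """Pick a backend with probability proportional to its weight,
    skipping backends that report less than `min_free` free space."""
    candidates = [
        b for b in backends
        # Backends without a free-space implementation are never excluded,
        # which is one way an "exclude when almost full" feature can end up
        # only partially effective.
        if b.free_fraction is None or b.free_fraction >= min_free
    ]
    if not candidates:
        raise RuntimeError("all storage backends are (almost) full")
    return random.choices(candidates, weights=[b.weight for b in candidates], k=1)[0]


# Equal weights should yield a ~50/50 split of new jobs; the weights from
# this PR should yield roughly 70/30 in favour of jwd02f.
backends = [Backend("jwd02f", weight=70), Backend("jwd05e", weight=30)]
counts = {b.name: 0 for b in backends}
for _ in range(10_000):
    counts[pick_backend(backends).name] += 1
print(counts)  # e.g. {'jwd02f': 7031, 'jwd05e': 2969}
```

With a scheme like this, an almost-full backend only drops out of the draw if it can report its free space; backends that cannot report it (as noted above for S3) are never protected by the guard.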

For a few hours now (since about 13:08) we have been operating with the configuration from this PR without issues: sending 70% of jobs to jwd02f and 30% to jwd05e.

The change seems to work properly.

```console
$ journalctl -u "galaxy-handler@*" --since "2023-09-06 13:08:00" | grep "files23" | wc -l
5090
$ journalctl -u "galaxy-handler@*" --since "2023-09-06 13:08:00" | grep "files24" | wc -l
2066
```
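
As a rough sanity check of those counts (assuming, purely for this illustration, that the `files23` matches correspond to jwd02f and the `files24` matches to jwd05e; the mapping is not spelled out above):

```python
# journalctl counts from above; the files23 -> jwd02f and files24 -> jwd05e
# mapping is an assumption made only for this back-of-the-envelope check.
jwd02f_jobs, jwd05e_jobs = 5090, 2066
total = jwd02f_jobs + jwd05e_jobs

print(f"jwd02f share: {jwd02f_jobs / total:.1%}")  # 71.1%, target 70%
print(f"jwd05e share: {jwd05e_jobs / total:.1%}")  # 28.9%, target 30%
```

That is roughly a 71/29 split against the configured 70/30, although job counts are only a coarse proxy for storage usage, since jobs differ in how much data they write.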

Which is arguably weird, because if the weighting works, the storage distribution should not have been skewed in the first place when both backends had equal weights. Maybe we are storing something in jwd05e that we should not be storing?

If you want, we can keep this in place for a day or two and then revert once we find out what is going on; otherwise we may hit the storage limit soon. Sadly, I have already cleaned up the job working directories of failed jobs 😞.

Send 70% of jobs to jwd02f and 30% to jwd05e.
@kysrpex kysrpex added the bug label Sep 6, 2023
@kysrpex kysrpex self-assigned this Sep 6, 2023
@bgruening (Member)

@jmchilton do you maybe have an idea here? We are running 23.1.

@kysrpex kysrpex merged commit 8d5d31f into usegalaxy-eu:master Sep 6, 2023
2 checks passed
@kysrpex kysrpex deleted the rebalance_jwds branch September 6, 2023 14:32
@kysrpex (Contributor, Author) commented on Sep 6, 2023

I was too quick to say "without issues"; let's keep an eye on this. Typically, storage speed problems (which could arise from this PR) lead to high numbers of unprocessed jobs rather than processed jobs. It could just be heavy load on the cluster, but as said, let's keep in mind that this is happening.

@kysrpex kysrpex mentioned this pull request Sep 7, 2023