Really slow gateway responses on a RHEL cluster #456

atb1r21 · 2025-01-24T16:13:47Z

Hi,

I've been looking at slurm-web for a couple of weeks now, as I have two production clusters I'd like to deploy it on.

I have started off by deploying a localhost instance on each cluster separately, to make sure everything is stable.

On one cluster it has worked perfectly and I have no issues, from following the Quickstart guide.

On the second cluster, using the same major version of RHEL, and the same method, everything starts but the slurm-web-gateway has issues.

In this gateway, when I navigate between pages, I constantly get the following error:

Server error: Request error: canceled

I've tried running the gateway interactively in debug and I am seeing no issues, and similarly I can't see any error messages coming from the agent in this cluster.

I can't find any reference to these errors in the command line and was wondering if I have encountered a bug?

Could you please let me know any information I need to provide to help diagnose this?

Thanks.

atb1r21 · 2025-01-24T16:16:28Z

Author

Additionally I have also been seeing:

Server error: Request error: Network Error

0 replies

rezib · 2025-01-25T10:06:54Z

Maintainer

Hello @atb1r21, I just converted the issue you opened (#455) into this discussion, as this is more of a support request than an actual bug, at least for now.

Did you enable the Redis cache on the agent? For reference, see: https://docs.rackslab.io/slurm-web/install/quickstart.html#cache

If not, you should definitely do that first, as it saves many requests to slurmrestd and speeds up page rendering a lot in most cases.

Does this second cluster have more jobs than the first one? With thousands of jobs in the slurmctld queue, slurmrestd can be slow to render the list of jobs. You can test it with:

$ time curl --silent --unix-socket /run/slurmrestd/slurmrestd.socket http://slurm/slurm/v0.0.40/jobs

If this command takes many seconds to complete, this is probably the root cause.

If not, I would then look in the browser's developer console to identify which network requests are slow to get a response.
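A single timing can be misleading (a cold cache or a busy controller can skew it), so it may help to repeat the curl test a few times. The following is only a sketch: `time_runs` is a hypothetical helper, not part of Slurm or Slurm-web, and it assumes GNU `date` for sub-second precision.

```shell
# Hypothetical helper: run a command N times and print the wall-clock
# duration of each run in seconds, one per line.
# Assumes GNU date (for %N), as shipped on RHEL.
time_runs() {
    runs=$1
    shift
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s.%N)
        "$@" > /dev/null 2>&1      # discard the response, only timing matters
        end=$(date +%s.%N)
        awk -v s="$start" -v e="$end" 'BEGIN { printf "%.3f\n", e - s }'
        i=$((i + 1))
    done
}

# Example invocation, reusing the socket path and API version quoted above:
#   time_runs 5 curl --silent --unix-socket /run/slurmrestd/slurmrestd.socket http://slurm/slurm/v0.0.40/jobs
```

If the durations are consistently high, the slowness is likely structural rather than a one-off.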

2 replies

atb1r21 Jan 27, 2025
Author

Hello,

I have tried Redis caching and it helps a bit, but the issue still occurs whenever the cluster information is updated.

I have also run the above command to time the curl request against the socket.

Cluster without issues:
real 0m4.320s
user 0m0.002s
sys 0m0.068s

Cluster with issues:
real 0m14.090s
user 0m0.002s
sys 0m0.028s

So there's certainly a discrepancy there, as it's about 3.3 times slower.

The thing is that the slower cluster only has about 100 jobs in the queue, whereas the faster cluster has 496.
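To put those figures in perspective, the time per job can be worked out from the numbers above. This is a rough sketch: `per_job_ms` is a hypothetical helper, and the count of 100 jobs is approximate.

```shell
# Hypothetical helper: average milliseconds spent per job.
# $1 = wall-clock seconds reported by `time`, $2 = number of jobs in the queue
per_job_ms() {
    awk -v t="$1" -v n="$2" 'BEGIN { printf "%.1f\n", t * 1000 / n }'
}

per_job_ms 4.320 496    # cluster without issues, prints 8.7
per_job_ms 14.090 100   # cluster with issues (about 100 jobs), prints 140.9
```

Per job, the slow cluster is therefore roughly 16 times slower, a much larger gap than the raw wall-clock ratio suggests.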

However, the above really does indicate that it is slow to query slurmrestd, so you're 100% right that your software is not the issue here; it seems entirely related to querying that socket.

Do you have any tips? This is my first time with slurmrestd so anything you could advise on would be helpful.

rezib Jan 27, 2025
Maintainer

Indeed, 14 seconds for ~100 jobs is very slow. I don't want this channel to become an unofficial Slurm community support channel; the best place to ask is probably the Slurm users mailing list.

Just to give you some leads, I would first check whether the difference is visible with squeue as well, to make sure the slowness does not come from slurmctld. Then, I would check whether slurmrestd is just burning CPU time or is waiting for something else (e.g. NSS resolution).
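As a rough illustration of these checks, the sketch below compares the CPU time slurmrestd consumed during a request with the wall-clock time of that request. The helper name `classify_wait` and the 50% threshold are assumptions made for illustration, not anything defined by Slurm.

```shell
# Hypothetical helper: decide whether a request was mostly CPU work or
# mostly waiting. The 0.5 threshold is an arbitrary assumption.
# $1 = wall-clock seconds of the request
# $2 = CPU seconds consumed by slurmrestd during that request
classify_wait() {
    awk -v wall="$1" -v cpu="$2" 'BEGIN {
        if (wall <= 0) { print "invalid"; exit }
        r = (cpu / wall >= 0.5) ? "cpu-bound" : "waiting"
        print r
    }'
}

# The inputs could be gathered along these lines:
#   time squeue > /dev/null                  # is slurmctld itself slow?
#   ps -o time= -p "$(pidof slurmrestd)"     # cumulative CPU time, read before and after the curl test
#   time getent passwd "$USER"               # is NSS resolution slow?

# Example: 14.090 s wall-clock from the thread, with a made-up 0.5 s of CPU time
classify_wait 14.090 0.5   # prints "waiting"
```

A "waiting" result would point towards something external such as name resolution, while "cpu-bound" would point at slurmrestd's own processing.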
