
Min:1 Max:1 - still multiple runners spawned #168

Closed
gc-nathanh opened this issue Feb 6, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@gc-nathanh

We have been testing runner-manager for our CI workloads, but for various reasons we cannot support any sort of concurrency for some of our runner configurations - there can be only a single runner available.

Despite setting min: 1 max: 1, we still sometimes see the scenario where multiple runners are spawned.

Is this a known issue?

tcarmet (Contributor) commented Feb 27, 2023

Hello @gc-nathanh,

My apologies for the late response,

Could you share an idea or example of what the runner pool configuration looks like on your end? If possible, could you also describe the workflow scenario, what is happening, and what outcome you are expecting? It might help me understand why you are having this concurrency issue.

Regarding the issue with min: 1 max: 1: we recycle runners after they have been assigned a workflow, so for a brief moment 2 runners may be up while one is getting deleted. However, if you see more than 2, then I believe we may have a bug.
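
Roughly (a simplified sketch with hypothetical names, not the actual runner-manager code), the create-before-delete recycle amounts to:

# Hypothetical sketch of a create-before-delete recycle; names are illustrative.
class Pool:
    def __init__(self):
        self.runners = []

    def create_runner(self):
        runner = f"runner-{len(self.runners)}"
        self.runners.append(runner)
        return runner

    def delete_runner(self, runner):
        self.runners.remove(runner)

def recycle(pool, old_runner):
    new_runner = pool.create_runner()  # the replacement registers first
    # At this point both runners exist: this is the brief 2-runner window.
    pool.delete_runner(old_runner)     # the old runner is only removed afterwards
    return new_runner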

gc-nathanh (Author) commented Feb 28, 2023

Hi @tcarmet - no problem!

A little more information on what we're trying to achieve: we're using the runner manager to stand up instances for our CI in our OpenStack cluster, but we test against specific physical hardware (not managed by the runner manager, but connected into the OpenStack project), of which we have only one per runner pool.

If we have two runners that both register to GitHub and run against the same physical hardware, one (or both) jobs will fail because the hardware doesn't support concurrency. We've observed that although the pool is set to min: 1 max: 1, there are times when additional runners are created and register. Right now we've reduced the project quota so far that it can only support one runner at a time, but that means we've had to put each runner pool into its own OpenStack project.

I've forked the runner manager as I've had to make some specific changes to how VMs are created (we have some specific requirements for additional metadata and a particular network interface configuration), but none of this should affect the scheduling logic.

I have also optimised our startup time by baking as much as I can into the runner image, so that runners come up as fast as possible.

An example of the pool config is:

runner_pool:
   - config:
       flavor: 'amdvcpu.small'
       image: 'ubuntu20.04-runner'
       availability_zone: ''
       rnic_network_name: "dmzvpod4-rnic"
       vipu_ipaddr: "10.3.3.189"
       partition_name: "dmzvpod4-4ipu"
       vipu_port: "8090"
     quantity:
       min: 1
       max: 1
     tags:
       - Ubuntu20.04
       - pod4
       - amd
       - public
       - M2000
       - dmzvpod4

where you can see the extra params. I'll see if I can collect some logs that capture the problem.

tcarmet (Contributor) commented Feb 28, 2023

Thank you for providing additional context, it's very helpful! I'm glad you had the idea to pre-install some dependencies inside the runner image to optimize startup time 👌
The configuration also looks good to me; nothing raises any concern as far as I can tell.

I can see why you needed to fork the project; I believe there are also some leftovers in our code from our own OpenStack infra. And I can confirm that, as far as I know, modifications inside the OpenStack cloud backend shouldn't impact the scheduling.

This project was initially built around the logic of pre-creating runners, due to how self-hosted runners used to work on GitHub before the --ephemeral flag was added.
I believe you are facing the following scenario:

  • The runner-manager creates a pool with one runner.
  • The runner-manager sees that the runner inside that pool is currently busy (attached to a job).
  • The runner-manager starts pre-creating a second runner (this is where you are seeing the issue).

I think we can agree that this second runner pre-creation should not happen if max is reached, even if a job is running.
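
As a rough illustration (a simplified sketch with hypothetical Runner/Pool shapes, not the real runner-manager models), the pre-creation check should respect max along these lines:

from dataclasses import dataclass, field
from typing import List

# Hypothetical models for illustration only.
@dataclass
class Runner:
    is_online: bool = True
    is_busy: bool = False

@dataclass
class Quantity:
    min: int
    max: int

@dataclass
class Pool:
    quantity: Quantity
    runners: List[Runner] = field(default_factory=list)

def should_create_runner(pool: Pool) -> bool:
    online = [r for r in pool.runners if r.is_online]
    idle = [r for r in online if not r.is_busy]
    # Never go above the configured maximum, even when every runner is busy.
    if len(online) >= pool.quantity.max:
        return False
    # Otherwise, top the pool back up to its minimum of available runners.
    return len(idle) < pool.quantity.min

With min: 1 max: 1 and a single busy runner, should_create_runner() would return False, so no second runner gets pre-created until the busy one is recycled.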

Are you able to confirm that this seems like the scenario you are facing?

PS: Sorry, I don't know how, but I accidentally edited your comment instead of replying to it.

tcarmet added the bug label Feb 28, 2023
gc-nathanh (Author)

I don't have any logs that can say for sure, but I think your supposition is correct. The alternative approach I'd been looking at is to put some logic into the runner registration, but I suspect that would be challenging.

tcarmet (Contributor) commented Mar 3, 2023

With the runner-manager in its current state, it might be, yes. I'd recommend looking at the point where the runner-manager receives the webhook for a runner that has started a job (busy/in_progress) and going from there to see if there's any condition missing.
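
For example (a rough sketch with hypothetical handler and attribute names, not our actual code), the missing condition could live where the in_progress event is handled:

# Hypothetical sketch; handler and attribute names are illustrative and not
# taken from the runner-manager codebase.
def on_workflow_job_in_progress(pool) -> None:
    # Called when GitHub reports a workflow job as in_progress on a runner.
    online = [r for r in pool.runners if r.is_online]
    # Only pre-create a replacement if it keeps the pool within quantity.max;
    # with max: 1 the busy runner is simply left to finish and be recycled.
    if len(online) >= pool.quantity.max:
        return
    pool.create_runner()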

Worth noting that we do have plans to rework this part in Q2 and make it easier to test and maintain. However, we're still in planning, so I can't make any promises as to whether we will be working on it or not. I can update you here if that ends up being the case.

tcarmet (Contributor) commented Oct 4, 2024

Better late than never: thanks to #685 and the 1.x release of the runner-manager, this issue is now solved.

tcarmet closed this as completed Oct 4, 2024