
Min:1 Max:1 - still multiple runners spawned #168

Closed
gc-nathanh opened this issue Feb 6, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@gc-nathanh

We have been testing runner-manager for our CI workloads, but for various reasons we cannot support any sort of concurrency for some of our runner configurations - there can be only a single runner available.

Despite setting min: 1 max: 1, we still sometimes see the scenario where multiple runners are spawned.

Is this a known issue?

tcarmet (Contributor) commented Feb 27, 2023

Hello @gc-nathanh,

My apologies for the late response,

Could you share an idea or example of what the runner pool configuration looks like on your end? If possible, could you also describe the workflow scenario, what is happening, and what outcome you are expecting? It might help me understand why you are having this concurrency issue.

Regarding the issue with min: 1 max: 1: we recycle runners after they have been assigned a workflow, so for a brief moment 2 runners may be up while one is getting deleted. However, if you see more than 2, then I believe we may have a bug.
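
Roughly (a simplified sketch with hypothetical names, not the actual runner-manager code), the create-before-delete recycle amounts to:

# Hypothetical sketch of a create-before-delete recycle; names are illustrative.
class Pool:
    def __init__(self):
        self.runners = []

    def create_runner(self):
        runner = f"runner-{len(self.runners)}"
        self.runners.append(runner)
        return runner

    def delete_runner(self, runner):
        self.runners.remove(runner)

def recycle(pool, old_runner):
    new_runner = pool.create_runner()  # the replacement registers first
    # At this point both runners exist: this is the brief 2-runner window.
    pool.delete_runner(old_runner)     # the old runner is only removed afterwards
    return new_runner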

gc-nathanh (Author) commented Feb 28, 2023

Hi @tcarmet - no problem!

A little more information on what we're trying to achieve: we're using the runner manager to stand up instances for our CI in our OpenStack cluster, but we test against specific physical hardware (not managed by the runner manager, but connected into the OpenStack project), of which we have only one per runner pool.

If we have two runners that both register to GitHub and run against the same physical hardware, one (or both) jobs will fail because the hardware doesn't support concurrency. We've observed that although the pool is set to min: 1 max: 1, there are times when additional runners are created and register. Right now we've reduced the project quota so far that it can only support one runner at a time, but that means we've had to put each runner pool into its own OpenStack project.

I've forked the runner manager as I've had to make some specific changes to how VMs are created (we have some specific requirements for additional metadata and a particular network interface configuration), but none of this should affect the scheduling logic.

I have also optimised our startup time by baking as much as I can into the runner image, so that runners come up as fast as possible.

An example of the pool config is:

runner_pool:
   - config:
       flavor: 'amdvcpu.small'
       image: 'ubuntu20.04-runner'
       availability_zone: ''
       rnic_network_name: "dmzvpod4-rnic"
       vipu_ipaddr: "10.3.3.189"
       partition_name: "dmzvpod4-4ipu"
       vipu_port: "8090"
     quantity:
       min: 1
       max: 1
     tags:
       - Ubuntu20.04
       - pod4
       - amd
       - public
       - M2000
       - dmzvpod4

where you can see the extra params. I'll see if I can collect some logs that capture the problem.

tcarmet (Contributor) commented Feb 28, 2023

Thank you for providing additional context, it's very helpful! I'm glad you had the idea to pre-install some dependencies inside the runner image to optimize startup time 👌
The configuration also looks good to me; nothing raises any concern as far as I can tell.

I can see why you needed to fork the project; I believe there are also some leftovers in our code from our own OpenStack infra. And I can confirm that, as far as I know, modifications inside the OpenStack cloud backend shouldn't impact the scheduling.

This project was initially built around the logic of pre-creating runners, due to how self-hosted runners used to work on GitHub before the --ephemeral flag was added.
I believe you are facing the following scenario:

  • The runner-manager creates a pool with one runner.
  • The runner-manager sees that the runner inside that pool is currently busy (attached to a job).
  • The runner-manager starts pre-creating a second runner (this is where you are seeing the issue).

I think we can agree that this second runner pre-creation should not happen if max is reached, even if a job is running.
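
As a rough illustration (a simplified sketch with hypothetical Runner/Pool shapes, not the real runner-manager models), the pre-creation check should respect max along these lines:

from dataclasses import dataclass, field
from typing import List

# Hypothetical models for illustration only.
@dataclass
class Runner:
    is_online: bool = True
    is_busy: bool = False

@dataclass
class Quantity:
    min: int
    max: int

@dataclass
class Pool:
    quantity: Quantity
    runners: List[Runner] = field(default_factory=list)

def should_create_runner(pool: Pool) -> bool:
    online = [r for r in pool.runners if r.is_online]
    idle = [r for r in online if not r.is_busy]
    # Never go above the configured maximum, even when every runner is busy.
    if len(online) >= pool.quantity.max:
        return False
    # Otherwise, top the pool back up to its minimum of available runners.
    return len(idle) < pool.quantity.min

With min: 1 max: 1 and a single busy runner, should_create_runner() would return False, so no second runner gets pre-created until the busy one is recycled.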

Are you able to confirm that this seems like the scenario you are facing?

PS: Sorry, I don't know how, but I accidentally edited your comment instead of replying to it.

tcarmet added the bug label Feb 28, 2023
gc-nathanh (Author)

I don't have any logs that can say for sure, but I think your supposition is correct. The alternative approach I'd been looking at is to put some logic into the runner registration, but I suspect that would be challenging.

tcarmet (Contributor) commented Mar 3, 2023

With the runner-manager in its current state, it might be, yes. I'd recommend looking at the point where the runner-manager receives the webhook for a runner that has started a job (busy/in_progress) and going from there to see if there's any condition missing.
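
For example (a rough sketch with hypothetical handler and attribute names, not our actual code), the missing condition could live where the in_progress event is handled:

# Hypothetical sketch; handler and attribute names are illustrative and not
# taken from the runner-manager codebase.
def on_workflow_job_in_progress(pool) -> None:
    # Called when GitHub reports a workflow job as in_progress on a runner.
    online = [r for r in pool.runners if r.is_online]
    # Only pre-create a replacement if it keeps the pool within quantity.max;
    # with max: 1 the busy runner is simply left to finish and be recycled.
    if len(online) >= pool.quantity.max:
        return
    pool.create_runner()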

Worth noting that we do have plans to rework this part in Q2 and make it easier to test and maintain. However, we're still in planning, so I can't make any promises as to whether we will be working on it or not. I can update you here if that ends up being the case.

tcarmet (Contributor) commented Oct 4, 2024

Better late than never: thanks to #685 and the 1.x release of the runner-manager, this issue is now solved.

tcarmet closed this as completed Oct 4, 2024