RFC: Clustering (refactor pub-sub) #2304
Replies: 10 comments
-
Yes, I've been thinking about clustering and HA for quite a while, and the two things holding us back are: how to cluster gRPC & pub-sub.
-
Clustering pub-sub should easily be possible by writing a Redis / RabbitMQ implementation of this interface. gRPC can probably be clustered simply by putting a load-balancer in front of it. The queue is an additional part which might need some refactoring for clustering, but may already be covered by having a central database.
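As a rough illustration, a Redis-backed implementation could look something like the sketch below. This is a minimal sketch only: the `Message`, `Receiver` and `Publisher` shapes here are simplified stand-ins, not the actual Woodpecker pub-sub types.

```go
package pubsub

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// Message and Receiver are simplified stand-ins for whatever the real interface uses.
type Message []byte
type Receiver func(Message)

// Publisher is a simplified stand-in for Woodpecker's pub-sub interface.
type Publisher interface {
	Publish(ctx context.Context, topic string, msg Message) error
	Subscribe(ctx context.Context, topic string, recv Receiver) error
}

// redisPubsub backs the interface with Redis channels.
type redisPubsub struct {
	client *redis.Client
}

func NewRedis(addr string) Publisher {
	return &redisPubsub{client: redis.NewClient(&redis.Options{Addr: addr})}
}

func (p *redisPubsub) Publish(ctx context.Context, topic string, msg Message) error {
	return p.client.Publish(ctx, topic, []byte(msg)).Err()
}

func (p *redisPubsub) Subscribe(ctx context.Context, topic string, recv Receiver) error {
	sub := p.client.Subscribe(ctx, topic)
	defer sub.Close()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case m := <-sub.Channel():
			recv(Message(m.Payload))
		}
	}
}
```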
-
I would like to have a Redis-based implementation of pub-sub, and also of the queue (if we still need it after refactoring ...), so that Redis can be used, but it should be opt-in! We could additionally look at NATS ...
-
Hm. I always wonder about implementing messaging. Some questions about the current state:
Other thoughts: my model of thinking is to reduce elements. In that spirit we could also do the following. Trigger side:
Runner:
Cron:
-
As you write, there are redundant structures, a legacy from the Drone codebase, that we will refactor away.
-
Maybe add a message queue and split the server into a server component (user/UI/API) and a control component. The control component can be a singleton, but the API server can run as multiple instances.
-
But this would be a single point of failure again. In addition, I would currently prefer to stick to a single server binary, as it makes deployment much simpler (think of Raspberry Pi users, etc.).
-
I'm assuming from the state of this discussion and other related issues that HA remains unavailable, which is disappointing, as this is one of the few things holding me back from going "all-in" on recommending Woodpecker to my organization. I would like to propose (and possibly implement) the following "baby step": add a new env to the server (maybe call it ...).
The idea here is to have a single server binary that can be a stateful all-in-one, a stateless HA UI-only server, or a stateful non-HA "queue" server. The UI servers would publish events to the queue server, which would be polled by the agent servers and would perform any synchronized tasks without the risk of collision. While this would still have a single point of failure, it would at least allow for an HA UI behind your load-balancer of choice that remains partially functional while the rest of the system is in a degraded state, without needing to add any additional third-party services.
From there, it would likely be much easier to implement support for an HA queuing system such as NATS, Kafka, Redis, etc., with that singleton server left over as a single stateless component to handle functionality requiring synchronization. That component would not need any HA beyond automatic restarts, as it would not need to be available to answer network requests.
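A minimal sketch of what the startup logic for such a mode switch might look like. The variable name `WOODPECKER_SERVER_MODE` and the mode names are hypothetical, made up for illustration; the proposal above leaves the actual env name open.

```go
// Hypothetical sketch: one binary picks its role at startup from an env var.
package main

import (
	"log"
	"os"
)

func main() {
	// WOODPECKER_SERVER_MODE is a made-up name for illustration only.
	switch mode := os.Getenv("WOODPECKER_SERVER_MODE"); mode {
	case "", "all-in-one": // default: today's behaviour, stateful single binary
		startUI()
		startQueue()
	case "ui": // stateless, HA-capable UI/API server that publishes events to the queue server
		startUI()
	case "queue": // singleton server owning the queue, cron and other synchronized tasks
		startQueue()
	default:
		log.Fatalf("unknown server mode %q", mode)
	}
	select {} // block forever; real code would wait on the started services
}

func startUI()    { /* serve UI/API, forward events to the queue server */ }
func startQueue() { /* run queue, cron and other tasks needing synchronization */ }
```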
-
Don't know what you mean by ...
-
I would suggest using https://nats.io as the pub-sub implementation. It is written in Go and could be compiled into woodpecker-ci (so no extra central Redis, RabbitMQ or similar is needed). I believe both the communication between server and agent, and between server and UI, could be handled with NATS (independent of the replica count). Only making sure that cron jobs are not created by multiple servers at the same time would have to be handled with logic in code.
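If that route were taken, embedding a NATS server in the woodpecker-ci process and publishing/subscribing over it might look roughly like the sketch below. The subject name `pipeline.updates` is made up for illustration.

```go
package main

import (
	"log"
	"time"

	natsserver "github.com/nats-io/nats-server/v2/server"
	"github.com/nats-io/nats.go"
)

func main() {
	// Start an embedded NATS server inside the process,
	// so no extra external broker has to be deployed.
	ns, err := natsserver.NewServer(&natsserver.Options{Port: 4222})
	if err != nil {
		log.Fatal(err)
	}
	go ns.Start()
	if !ns.ReadyForConnections(5 * time.Second) {
		log.Fatal("embedded NATS server did not start in time")
	}

	// Connect a client; agents and UI subscribers would do the same.
	nc, err := nats.Connect(ns.ClientURL())
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Subject name is hypothetical, for illustration only.
	if _, err := nc.Subscribe("pipeline.updates", func(m *nats.Msg) {
		log.Printf("got event: %s", string(m.Data))
	}); err != nil {
		log.Fatal(err)
	}
	if err := nc.Publish("pipeline.updates", []byte(`{"type":"pipeline-step"}`)); err != nil {
		log.Fatal(err)
	}
	nc.Flush()
	time.Sleep(100 * time.Millisecond) // give the async handler a moment in this demo
}
```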
-
Clustering (multiple servers)
To be able to cluster we need to change some things:
Refactor pub-sub
We currently have an "interesting" pub-sub mechanism in Woodpecker which works, but could use some love. I would like to refactor it a bit to have a more generic event system. For example, at the moment, if a build step is updated, the event contains the complete repo, build and build-step data, which seems quite heavy to me. Instead I would like some kind of
{ type: 'pipeline-step', data: ... }
structure, even if this approach would require us to send two events where we currently use one event like { repo: ..., build: ..., proc: ... }
In addition, we could think about a clean interface for the pub-sub implementation so that we could add / use an external / professional pub-sub system like RabbitMQ or Redis, which would allow scaling etc. in the future.
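A minimal sketch of what such a slimmer, generic event structure could look like. Type and field names here are illustrative, not the actual Woodpecker types.

```go
package pubsub

import "encoding/json"

type EventType string

const (
	EventTypePipeline     EventType = "pipeline"
	EventTypePipelineStep EventType = "pipeline-step"
)

// Event carries only the type and the data that actually changed,
// instead of the full repo + build + step payload used today.
type Event struct {
	Type EventType       `json:"type"`
	Data json.RawMessage `json:"data"`
}

// NewStepEvent packs just the updated step; a repo or pipeline update
// would be sent as a second, separate event.
func NewStepEvent(step any) (Event, error) {
	data, err := json.Marshal(step)
	if err != nil {
		return Event{}, err
	}
	return Event{Type: EventTypePipelineStep, Data: data}, nil
}
```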