[Architecture] Scaling federation throughput via message queues and workers #3230
I see a change in activitypub-federation-rust (line 70) which may increase the signature duration to 5 minutes. This change is welcome and should help alleviate the current delay issue. I don't know if that will suffice if the bulk of Reddit were to migrate to Lemmy, so I'll leave this issue open for now, and we can re-evaluate when we see more growth of the network.
Thank you for chiming in! I agree, we need queues, and just so I'm clear: I'm not prescribing a specific solution such as RabbitMQ, just that we'd need a separate, independently scalable component. The separation is important in my mind, as it allows for independent maintenance windows and scaling adjustments. Also, I'm glad to see I'm not the only one thinking about this! I'll try to add more thoughts on your Lemmy post instead of cluttering things up here.
Another solution I thought about: suppose an instance has to send an update to 10k other federated instances. Instead of making 10k HTTP requests itself, the main instance hands the update to a set of "forward nodes" (say 10, each covering roughly 1k instances in this case), and each forward node either splits its batch further among its own federated instances or delivers everything directly, depending on a threshold (if > 1000, split; if < 1000, send). So you get a kind of job scheduling between instances.
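The threshold logic above could be sketched like this (the function name and the 1000 cutoff are illustrative, not any actual Lemmy API):

```rust
// Hypothetical fan-out planner: below the threshold, deliver directly;
// above it, split the recipient list into one batch per forward node.
// Each forward node would then apply the same logic recursively.
const THRESHOLD: usize = 1000;

fn plan_delivery(recipients: Vec<String>, forward_nodes: usize) -> Vec<Vec<String>> {
    if recipients.len() <= THRESHOLD || forward_nodes == 0 {
        // Small enough: the instance sends everything itself.
        vec![recipients]
    } else {
        // Split recipients into roughly equal batches, one per forward node.
        let chunk = (recipients.len() + forward_nodes - 1) / forward_nodes;
        recipients.chunks(chunk).map(|c| c.to_vec()).collect()
    }
}

fn main() {
    let recipients: Vec<String> =
        (0..10_000).map(|i| format!("instance{i}.example")).collect();
    let batches = plan_delivery(recipients, 10);
    // 10k recipients split across 10 forward nodes -> 10 batches of 1000.
    assert_eq!(batches.len(), 10);
    assert_eq!(batches[0].len(), 1000);
    println!("{} batches", batches.len());
}
```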
Hey! Since no one has mentioned it I wanted to say that the outgoing federation logic has been replaced with a different, hopefully better queue here: LemmyNet/activitypub-federation-rust#42
With debug mode off it was already a (suboptimal) queue in 0.17, and in 0.18 it should be a pretty efficient (in-memory) queue. I'd also say that right now there's probably a lot of low-hanging fruit still to be picked before moving to much more complex systems that involve horizontal scaling across multiple servers. Of course that also has to happen at some point, but it's a huge amount of effort compared to the improvements that can still be made to single-instance scaling. If you just move to an external message queue, you're not actually reducing the amount of work that has to be done per unit of time. Rust with a well-written in-memory queue will already reach the maximum scalability you can get out of one server. I think the problems are mostly
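For reference, a single-process in-memory queue of the kind described here can be sketched with nothing but the standard library. All names are illustrative; a real worker would HTTP-sign and POST each activity, with retry and backoff:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Sketch of an in-memory outgoing-activity queue: producers push inbox URLs
// onto a channel, a small worker pool drains it. The shared Receiver behind a
// Mutex is the classic std-only worker-pool pattern.
fn spawn_workers(
    n: usize,
) -> (mpsc::Sender<String>, Arc<AtomicUsize>, Vec<thread::JoinHandle<()>>) {
    let (tx, rx) = mpsc::channel::<String>();
    let rx = Arc::new(Mutex::new(rx));
    let delivered = Arc::new(AtomicUsize::new(0));
    let mut handles = Vec::new();
    for _ in 0..n {
        let rx = rx.clone();
        let delivered = delivered.clone();
        handles.push(thread::spawn(move || loop {
            // Lock only long enough to pull one job off the shared channel.
            let job = rx.lock().unwrap().recv();
            match job {
                Ok(_inbox_url) => {
                    // Real code: sign + HTTP POST the activity here.
                    delivered.fetch_add(1, Ordering::SeqCst);
                }
                Err(_) => break, // channel closed: all senders dropped
            }
        }));
    }
    (tx, delivered, handles)
}

fn main() {
    let (tx, delivered, handles) = spawn_workers(4);
    for i in 0..100 {
        tx.send(format!("https://instance{i}.example/inbox")).unwrap();
    }
    drop(tx); // close the queue so workers exit once it is drained
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(delivered.load(Ordering::SeqCst), 100);
    println!("delivered {}", delivered.load(Ordering::SeqCst));
}
```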
Just as another example of low-hanging fruit that we should pick before looking at exciting and complex new systems: in 0.17 the PEM key was loaded and parsed for every outgoing HTTP request (so basically a million times for a single popular post). The Rust code also does a ton of unnecessary SQL selects that it then doesn't even use. So I'd encourage everyone to work on improving the plumbing (and tooling) before thinking about adding new complexity.
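A minimal sketch of the fix for the PEM issue: parse the key once per process and reuse the parsed value. `ParsedKey` and `parse_pem` are stand-ins for whatever the actual crypto crate provides:

```rust
use std::sync::OnceLock;

// Stand-in for a parsed private key from a real crypto library.
struct ParsedKey {
    der: Vec<u8>,
}

// Stand-in for the expensive base64/ASN.1 parse a real library performs.
fn parse_pem(pem: &str) -> ParsedKey {
    ParsedKey { der: pem.as_bytes().to_vec() }
}

// Parsed exactly once for the lifetime of the process, instead of once
// per outgoing HTTP request.
static SIGNING_KEY: OnceLock<ParsedKey> = OnceLock::new();

fn signing_key() -> &'static ParsedKey {
    SIGNING_KEY.get_or_init(|| parse_pem("-----BEGIN PRIVATE KEY-----..."))
}

fn main() {
    // Both calls return a reference to the same cached parse.
    assert!(std::ptr::eq(signing_key(), signing_key()));
    assert!(!signing_key().der.is_empty());
    println!("key parsed once");
}
```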
@RocketDerp I am going to assume it's being called from this code: https://github.com/LemmyNet/lemmy/blob/main/crates/db_views_actor/src/person_view.rs and I am going to assume that when federation messages come in for new comments/posts/edits/etc., the database needs to look up the actor ID to translate it into a local person ID. I'd say the best way to mitigate that load is to bundle Redis with Lemmy and use it to cache actor_ids and other IDs, removing that load from the database, especially for instances processing hundreds of thousands of federation messages in a short time. Redis scales well and could excel at that. It's very easy to get up and going too.
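To illustrate the caching idea without pulling in Redis, here is a minimal TTL cache sitting in front of the actor-ID lookup. All names are hypothetical; in production the map would live in Redis so multiple processes can share it:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Tiny TTL cache mapping actor_id URLs to local person ids, standing in
// for a Redis (or in-process) cache layer in front of the person query.
struct ActorCache {
    ttl: Duration,
    map: HashMap<String, (i32, Instant)>,
}

impl ActorCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, map: HashMap::new() }
    }

    // `lookup` represents the database query; it only runs on a miss.
    fn get_or_lookup(&mut self, actor_id: &str, lookup: impl FnOnce() -> i32) -> i32 {
        if let Some((person_id, cached_at)) = self.map.get(actor_id) {
            if cached_at.elapsed() < self.ttl {
                return *person_id; // cache hit: no database round trip
            }
        }
        let person_id = lookup(); // cache miss: hit the database once
        self.map.insert(actor_id.to_string(), (person_id, Instant::now()));
        person_id
    }
}

fn main() {
    let mut cache = ActorCache::new(Duration::from_secs(60));
    let mut db_hits = 0;
    let a = cache.get_or_lookup("https://a.example/u/alice", || { db_hits += 1; 42 });
    let b = cache.get_or_lookup("https://a.example/u/alice", || { db_hits += 1; 42 });
    // Second call is served from the cache: one database hit total.
    assert_eq!((a, b, db_hits), (42, 42, 1));
    println!("db hits: {db_hits}");
}
```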
I don't know if we'd necessarily need Redis at this point, since the query should be fairly light for now: probably six figures' worth of person ID records, which Postgres should be able to handle easily. However, long term, Redis will be very useful as a query cache layer for all frequent queries, especially busy threads on public instances.
I think since the main devs are pretty overloaded, your best bet will be to start with small, "inoffensive" PRs and go from there, e.g. just add a really minimal admin API endpoint with the stats you mentioned; I think they would merge that.
Should be resolved with #3605
Requirements
(checking that checkbox even though it is not a front-end issue, and I will not be using the lemmy-ui repo for this)
Is your proposal related to a problem?
Larger instances are currently all backed up with outbound federation. Lemmy.world has bumped the federation worker count up to 10,000, and outbound federation events are still reaching other servers outside of the 10-second window. While I have no direct proof that this is the issue at present, the symptoms suggest outbound federation events are backed up, whether due to simply too many messages to send out (i.e. every "write" interaction = 1 event × however many federated servers) or, worse yet, previously federated servers shutting down, so that federation workers must wait for a network timeout before they can proceed with sending the next federation event message.
Describe the solution you'd like.
Federation workers need to be scalable independently of the Lemmy back-end, perhaps managed via some sort of message bus/queue. Instead of immediately emitting an outgoing federation event message to the federated server, the event gets put into a message queue (for example, a separate RabbitMQ container). Then have a separate, scalable container representing a federation worker, which can be spun up across multiple servers/VMs or simply as multiple instances, and which fetches federation event messages and delivers them to the intended federated servers. This way, larger instances can run a fleet of workers not bound by per-server TCP "half open" limits, and scale outbound federation event messages better.
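A rough sketch of that split, with a `VecDeque` standing in for the external broker. The trait and type names are made up for illustration; a real deployment would back `EventQueue` with RabbitMQ, Redis streams, or similar, so workers in separate containers can all pull from it:

```rust
use std::collections::VecDeque;

// One outbound delivery job: where to send it, and what to send.
#[derive(Clone, Debug, PartialEq)]
struct FederationEvent {
    inbox_url: String,
    activity_json: String,
}

// The back-end only enqueues; independently scaled workers dequeue
// and deliver. Implementations could wrap RabbitMQ, Redis, etc.
trait EventQueue {
    fn enqueue(&mut self, ev: FederationEvent);
    fn dequeue(&mut self) -> Option<FederationEvent>;
}

// In-memory stand-in so this sketch runs on its own.
struct InMemoryQueue(VecDeque<FederationEvent>);

impl EventQueue for InMemoryQueue {
    fn enqueue(&mut self, ev: FederationEvent) {
        self.0.push_back(ev);
    }
    fn dequeue(&mut self) -> Option<FederationEvent> {
        self.0.pop_front()
    }
}

// A worker process would loop on dequeue() and POST each activity to its
// inbox_url; here it just counts what it drains.
fn run_worker(q: &mut dyn EventQueue) -> usize {
    let mut sent = 0;
    while let Some(_ev) = q.dequeue() {
        // real worker: HTTP-sign and deliver, retrying with backoff on failure
        sent += 1;
    }
    sent
}

fn main() {
    let mut q = InMemoryQueue(VecDeque::new());
    q.enqueue(FederationEvent {
        inbox_url: "https://b.example/inbox".into(),
        activity_json: "{}".into(),
    });
    q.enqueue(FederationEvent {
        inbox_url: "https://c.example/inbox".into(),
        activity_json: "{}".into(),
    });
    assert_eq!(run_worker(&mut q), 2);
    println!("queue drained");
}
```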
Describe alternatives you've considered.
As a temporary solution, increasing the 10-second expiration window might help these larger communities scale a little more before we need more advanced solutions such as the one described above.