Persistent, performant, reliable federation queue #3605
Conversation
The general approach looks very good. Among other things it also gives us the per-instance failure limit which I tried to implement (LemmyNet/activitypub-federation-rust#60). The main question for me is why you decided to implement this in Lemmy, and not in the federation library. It seems to make more sense to keep it encapsulated there, and only add a trait so that Lemmy can handle db storage. The federate crate looks like it needs to run as an entirely separate process. That's too complicated, especially for small instances. Better to run this logic from the lemmy_server crate, and provide a command line option to enable/disable activity sending. cc @cetra3
Mainly for simplicity. I don't think the activitypub-federation crate should depend on PostgreSQL specifically, and the way I fetch updates for instances and for communities is tightly coupled with how Lemmy stores that in the database. I can try to see how the trait would look if most of this code was in the ap crate, but it might be pretty hard to make it generic enough to actually work for other use cases (like what @colatkinson wants to build here).
It would of course be possible to allow this to be run from the main process, but I kind of disagree: most admins, especially small instance admins, do the absolute minimum effort to set up an instance and are then confused when it runs badly. They don't understand PostgreSQL tuning, configuration changes, etc. I've also seen multiple people use a script like lemmony to subscribe to every single community in existence and then wonder why their 2 GB RAM server can't handle the federation traffic (I know that's just incoming traffic, but still).

Also, the way it's implemented is "optimized" for running in a separate tokio pool. Similar to the issues we have with the existing in-memory queue, this code spawns one tokio task per instance (e.g. 1000), which I think will dominate scheduling against the at most 1-100 other / API query tasks if the process is under load. tokio doesn't have any task prioritization, so I don't know how this could otherwise be prevented.

So IMO the default setup should be a very good and performant one, because admins don't understand or want to bother with optional tweaks. Lemmy already needs two processes (ui and server); I don't think adding a third one would increase the effort much? It can be in the same container as the other one by default. If you're adamant about this I can make it integrable into the main process, but it should definitely be an option to have it separate, one that all instance admins who federate with 100+ other instances should take.
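For illustration, here is a minimal, hypothetical sketch (not this PR's actual code) of what "a separate tokio pool" could mean: running the federation workers on their own tokio runtime and OS threads so they never compete with API request tasks for the main scheduler. `start_federation_workers` is a placeholder name.

```rust
use std::thread;
use tokio::runtime::Builder;

// Hypothetical sketch: run the per-instance federation workers on a dedicated
// tokio runtime (and dedicated OS threads), so they never share a scheduler
// with the API request tasks of the main server runtime.
fn spawn_federation_runtime() -> thread::JoinHandle<()> {
    thread::spawn(|| {
        let rt = Builder::new_multi_thread()
            .worker_threads(2)
            .thread_name("federation")
            .enable_all()
            .build()
            .expect("failed to build federation runtime");
        rt.block_on(async {
            // start_federation_workers().await; // placeholder for the worker loops
        });
    })
}
```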
@phiresky I think it would be good if we had the option here. It wouldn't be too hard to spawn some sort of background tokio task in the embedded case and then have a thin wrapper in the "separate" process case.
Ah, another reason is that I add signal handlers for clean shutdown, since that can take multiple seconds (up to the http timeout). So if it was in the same process, that would conflict with the actix shutdown handlers (no idea if it's possible to merge those) and also cause more downtime when updating / restarting processes.
@phiresky You can't really inject your own listener for the shutdown in actix-web; however, you can do the reverse: signal the actix HTTP server to shut down from your own custom signal handler. The way you do this is roughly:
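A minimal sketch of that approach, assuming actix-web 4 and a tokio runtime with the `signal` feature enabled; the bind address and the federation-queue shutdown step are placeholders:

```rust
use actix_web::{App, HttpServer};

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    // Disable actix's built-in signal handling so our own handler is in charge.
    let server = HttpServer::new(|| App::new())
        .disable_signals()
        .bind(("127.0.0.1", 8536))?
        .run();

    // A handle that can stop the server from another task.
    let handle = server.handle();

    tokio::spawn(async move {
        tokio::signal::ctrl_c()
            .await
            .expect("failed to listen for ctrl-c");
        // Flush / stop the federation queue here first, then stop the
        // HTTP server gracefully (waits for in-flight requests).
        handle.stop(true).await;
    });

    server.await
}
```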
One option might be to have the Lemmy process spawn the activity sender as a child process by default. Then it's a separate process with separate tokio, but doesn't require any changes from instance admins, and both binaries can be included in the same Dockerfile. I suppose having the queue in Lemmy is fine, but I don't want to maintain two different queues. So if we use this approach, then I would get rid of the queue in the federation library and only provide a simple method for sign + send on the current thread (no workers nor retry). Then that logic doesn't have to be reimplemented here. Later the queue could still be upstreamed.
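A minimal sketch of the child-process idea, assuming the sender ships as a `lemmy_federate` binary alongside `lemmy_server`; the function name is illustrative:

```rust
use std::io;
use std::process::{Child, Command};

// Hypothetical sketch: the main server spawns the federation sender as a child
// process, so admins get a separate process (and tokio runtime) without any
// extra setup. The parent should also watch the child and restart it if needed.
fn spawn_federation_sender() -> io::Result<Child> {
    Command::new("lemmy_federate").spawn()
}
```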
Real efficiency/performance would require federation to be done using an efficient binary format like speedy. But I understand why diverging from apub, even if only optionally and only for Lemmy-to-Lemmy instance communication, is something the project may not want to support.
@AppleSheeple honestly, JSON serialization is a tiny sliver of the perf issues that relate to apub comms. The biggest contributing factor when I've benchmarked this before is HTTP signatures; they take up about 70% of the processing time.
A small difference becomes bigger at scale, and the small difference here covers all three of size, memory, and processing power needed. More relevantly, if the project were open to non-apub Lemmy-to-Lemmy federation, then the sky would be the limit. You could do batching however you want. You could even turn the architecture into a fully pull-based one. You could... Creating a queue that can truly scale to reddit scale while sticking to apub is an unobtainable goal, was my point. The message format was just the easiest clear source of inefficiency to point out. Lemmy can obviously still do much better than the status quo, and @phiresky's efforts are greatly appreciated, not that they need to hear it from a random githubber.
about = "A link aggregator for the fediverse", | ||
long_about = "A link aggregator for the fediverse.\n\nThis is the Lemmy backend API server. This will connect to a PostgreSQL database, run any pending migrations and start accepting API requests." | ||
)] | ||
pub struct CmdArgs { |
You should also make all the fields `pub` to ensure that `CmdArgs` can be constructed outside Lemmy.
That seems like a separate decision, though: whether to actually make this part of the public interface.
Well, it already is part of the public interface, as it is given to the `start_lemmy_server` function. How should someone who wants to embed Lemmy call `start_lemmy_server` if they can't construct the `CmdArgs`? (And they don't want to parse them from command line args, obviously.)
Maybe it should also be called `LemmyArgs`, since they may not actually come from the command line.
Is starting it as a library actually supported / documented anywhere?
It was done in this one: #2618.
And of course it is documented on docs.rs: https://docs.rs/lemmy_server/latest/lemmy_server/
Agree with this. Lemmybb can run Lemmy as an embedded library, although that project isn't maintained now because I don't have time. Anyway, I don't see any downside to making them pub.
Just the one thing from me: `domain` -> `instance_id`.
fetcher: () => Promise<T>,
checker: (t: T) => boolean,
retries = 10,
delaySeconds = 2,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems more stable anyway, thx.
crates/db_schema/src/schema.rs
federation_queue_state (id) {
    id -> Int4,
    #[max_length = 255]
    domain -> Varchar,
I agree that `instance_id` would be much better, for normalization purposes.
Perfect, thanks.
There's one more issue: if a user is deleted, it is no longer found and the federation worker can error out. I'll fix that by both skipping internal errors when sending activities (so the queue doesn't get stuck) and restarting workers when they exit. Then I'll merge (when tests pass).
This PR implements a new outgoing federation queue. The end goal is to create a queue that can scale to reddit scale (that is, 100-1000 activities per second, sent to each federated instance).
The basic idea is to make the target instance the primary division of the federation queue: federation to each instance is mostly handled separately.
The queue works as follows:

- The main `lemmy_server` process with its `send_lemmy_activity` function only stores the sent_activity in the db (like currently), with an addition of the send targets.
- There is a new table `federation_queue_state (domain, last_successful_id, retries)` that tracks the state of outgoing federation per instance.
- One or more `lemmy_federate` processes pick up the activities from the db and send them out. `lemmy_federate` works as follows:
  - All known allow/non-blocklisted instances are read from the database every 60s.
  - A worker tokio task is started / stopped per federated instance, and its `federation_queue_state` is stored to the database (a rough sketch of such a worker follows this list).
  - A separate task logs the current progress of each domain once per minute. Example output:
- If a signal is received (ctrl+c, SIGINT, SIGTERM), all the workers are gracefully stopped and their most current state is stored in the db.
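To make the worker structure above concrete, here is a rough, hypothetical sketch of a per-instance worker with stubbed-out db and HTTP helpers; apart from the `federation_queue_state` fields, all names are illustrative and not Lemmy's actual API:

```rust
use std::time::Duration;

// Hypothetical types and stubbed helpers for illustration only.
struct Activity { id: i64 }
struct FederationQueueState { domain: String, last_successful_id: i64, retries: i32 }

async fn next_activity_after(_id: i64) -> Option<Activity> { None /* db query stub */ }
async fn send_activity(_domain: &str, _a: &Activity) -> Result<(), ()> { Ok(()) /* HTTP send stub */ }
async fn persist_state(_s: &FederationQueueState) { /* db update stub */ }

// Exponential backoff per instance, capped at one hour.
fn backoff(retries: i32) -> Duration {
    Duration::from_secs(2_u64.pow(retries.clamp(0, 12) as u32).min(3600))
}

// One of these runs per federated instance.
async fn instance_worker(mut state: FederationQueueState) {
    loop {
        match next_activity_after(state.last_successful_id).await {
            // Nothing new to send yet; wait a bit before polling again.
            None => tokio::time::sleep(Duration::from_secs(10)).await,
            Some(activity) => {
                match send_activity(&state.domain, &activity).await {
                    Ok(()) => {
                        state.last_successful_id = activity.id;
                        state.retries = 0;
                    }
                    Err(()) => {
                        state.retries += 1;
                        tokio::time::sleep(backoff(state.retries)).await;
                    }
                }
                // Persist progress so a restart resumes from last_successful_id.
                persist_state(&state).await;
            }
        }
    }
}
```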
This implementation has the following advantages:
It has the following disadvantages:
The approach of one worker per remote instance should scale to reddit scale imo (~100-1000 activities per second). The details will of course need tweaking in the future when bottlenecks become clearer.
I've tested this so far only with my own very low activity instance and the basics work as expected.
Here's an example of how the federation_queue_state table looks:
And here's an example of how the activity table looks (for sendable activities):