Swarm switching not clearing obsolete messages #470
I hate writing regex, so I put a copy of my quick-and-dirty regex script here in case it helps anyone extract parameters from the storage stats log into a csv file. (To be honest, most of the regexes were written or drafted by ChatGPT.)
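The script itself isn't reproduced above, so here is a sketch in the same spirit; the log line shape in the pattern is a made-up placeholder and will need adjusting to the actual storage stats output:

```python
import csv
import re
import sys

# Hypothetical log line shape (adjust the pattern to the real stats output):
#   [2023-06-01 12:00:00] ... users: 1965 ... messages: 48213
LINE_RE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*?"
    r"users:\s*(?P<users>\d+).*?"
    r"messages:\s*(?P<messages>\d+)"
)

def log_to_csv(log_path: str, csv_path: str) -> None:
    with open(log_path) as log, open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["timestamp", "users", "messages"])
        for line in log:
            m = LINE_RE.search(line)
            if m:
                writer.writerow([m["timestamp"], m["users"], m["messages"]])

if __name__ == "__main__":
    log_to_csv(sys.argv[1], sys.argv[2])
```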
Updated: examples from the hourly log showing the duplication; each row represents one hour. Example 1: nearly triple the user count.
Example 2: nearly double the user count.
Side note: I also have anecdotal evidence of a few freshly registered nodes having double the usual user count at the beginning, then gradually falling back to normal after 14 days.
In theory the storage server could clean up messages belonging to the old swarm before downloading messages belonging to the new swarm, so the above argument might not be sufficient. I briefly checked the code, but I can't find any evidence that the storage server does any such cleanup.
I've created a script to illustrate my comment at #470 (comment) (written by ChatGPT; ignore some of the silly comments in the Python code). @jagerman, the next time you start a fresh new OPTF node, you can try running this script. Upon examining the csv file, I found that the "correct cluster" had 1859 IDs on the first day, while the "incorrect cluster" had 1471 IDs. Over the following 14 days, the number of IDs in the "correct cluster" remained within a stable range, whereas the number of IDs in the "incorrect cluster" decreased to zero. Note that this case is not a swarm switch: it's a fresh new node, starting with zero messages, that suddenly owns messages associated with two swarm IDs. I wonder if there is a race condition causing two competing swarms to store messages to the new node at the same time.
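For reference, here is a rough sketch of the partitioning check such a script can perform. The XOR-of-8-byte-chunks mapping into swarm space and the nearest-ID assignment reflect my reading of the storage server source, but treat them as assumptions rather than the canonical algorithm and verify against the real code:

```python
def pubkey_to_swarm_space(session_id_hex: str) -> int:
    # Map a pubkey into 64-bit swarm space by XORing its four 8-byte chunks
    # (assumption based on my reading of the storage server source).
    raw = bytes.fromhex(session_id_hex)
    if len(raw) == 33:      # Session IDs carry a one-byte 0x05 prefix
        raw = raw[1:]
    res = 0
    for i in range(0, 32, 8):
        res ^= int.from_bytes(raw[i:i + 8], "big")
    return res

def circular_distance(a: int, b: int) -> int:
    # Distance on the ring of 64-bit swarm space
    d = (a - b) % 2**64
    return min(d, 2**64 - d)

def assign_swarm(session_id_hex: str, swarm_ids: list[int]) -> int:
    # A user belongs to the swarm whose ID is nearest in swarm space
    space = pubkey_to_swarm_space(session_id_hex)
    return min(swarm_ids, key=lambda sid: circular_distance(space, sid))
```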
Another hypothesis: those unexpected messages, presumed to belong to another swarm but surprisingly observed on our new node, might be a consequence of the "swarm switching not clearing obsolete messages" issue. This could occur if a neighboring node within the same swarm happened to undergo a swarm switch within the last 14 days and hit that bug. As a result, the neighbor may unintentionally transmit its obsolete messages to our new node during initialization, contaminating our node with messages from a different swarm even though the initialization process itself is legitimate. I don't have any evidence to prove or disprove this hypothesis.
Swarm membership is (overly) complex, which means, unfortunately, that it's also hard to track which nodes were in and moved between swarms without essentially replaying the entire blockchain. When a node joins a swarm there should only be a one-way transfer of messages from existing swarm members to the new member, not the other way around. It is, perhaps, possible that a new node joined a swarm and received stale messages from another swarm member that transferred into that swarm recently (and so has a hangover of messages it didn't clean up). It should probably be rejecting those (since they are for the wrong swarm), but quite possibly isn't. Is there an observable pattern to the expiry times of the messages here? E.g. if the messages with invalid swarm ids stop appearing a few days prior to your node joining this swarm, that would suggest this is what is happening.
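One rough way to look for that pattern, assuming the roughly 14-day message lifetime discussed in this thread and a list of per-message expiry timestamps (both assumptions):

```python
from collections import Counter
from datetime import datetime, timedelta

TTL = timedelta(days=14)  # the ~14-day message lifetime discussed in this thread

def implied_store_dates(expiry_unix_timestamps):
    # If messages were stored with the maximum TTL, expiry - TTL approximates
    # the time each message was originally stored.
    return Counter(
        (datetime.fromtimestamp(ts) - TTL).date()
        for ts in expiry_unix_timestamps
    )

# If the implied store dates of the wrong-swarm messages all fall a few days
# before this node joined the swarm, that supports the "hangover from a
# recently-transferred member" explanation.
```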
This is what I meant in my previous comment; maybe I didn't explain it clearly, but yes, I think this is very likely.
I'll take a look and report back.
This is the code to draw the chart:
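The snippet isn't reproduced here, so below is a minimal sketch of what such a plotting script looks like; the file name and column names are assumptions standing in for whatever the clustering script actually writes out:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical csv produced by the clustering script:
# columns day, correct, incorrect (counts of distinct session IDs per day).
df = pd.read_csv("clusters.csv")

plt.plot(df["day"], df["correct"], label="correct cluster")
plt.plot(df["day"], df["incorrect"], label="incorrect cluster")
plt.xlabel("days since node start")
plt.ylabel("distinct session IDs")
plt.legend()
plt.show()
```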
Actually, my long-term node recently and suddenly onboarded 1344 "illegal immigrants" (I manually verified their session IDs with the partitioning algorithm sketched above), increasing from 1965 to 3309 users in just an hour, even though my swarm ID had not changed.
(Unconfirmed) Reportedly, when a node switches swarms its stored message count roughly doubles (then slowly returns to normal after 14 days). The new incoming messages make sense (we sync on swarm change), but the doubling suggests that we aren't clearing messages that belong to the old swarm, and we should be: we can't serve such messages, because we'll give a "wrong swarm" response if the owning user tries to fetch from us, so this is just dead data.
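For illustration, a rough sketch of the cleanup this implies, written in Python rather than the storage server's C++ and using hypothetical db helpers: on a swarm change, any owner who no longer maps to our swarm can be dropped.

```python
def prune_after_swarm_change(db, my_swarm_id, current_swarm_ids):
    # db.owners() and db.delete_messages_owned_by() are hypothetical helpers
    # standing in for queries against the message store; assign_swarm() is the
    # partitioning sketch from earlier in the thread.
    dead_owners = [
        pk for pk in db.owners()
        if assign_swarm(pk, current_swarm_ids) != my_swarm_id
    ]
    for pk in dead_owners:
        # These messages can never be served from here: the owner would get a
        # "wrong swarm" response, so they are dead data until the TTL expires.
        db.delete_messages_owned_by(pk)
```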