Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VSS Integration Tracking #246

Open
8 of 12 tasks
tnull opened this issue Feb 7, 2024 · 3 comments
Open
8 of 12 tasks

VSS Integration Tracking #246

tnull opened this issue Feb 7, 2024 · 3 comments
Assignees
Milestone

Comments

@tnull
Copy link
Collaborator

tnull commented Feb 7, 2024

Most of the work is already merged, here we add some tracking for the remaining tasks:

(cc @G8XSU)

@tnull tnull added this to the 0.3 milestone Feb 7, 2024
@tnull tnull mentioned this issue Feb 7, 2024
@tnull tnull modified the milestones: 0.3, 0.4 Feb 7, 2024
@tnull tnull mentioned this issue Feb 7, 2024
8 tasks
@tnull tnull modified the milestones: 0.3, 0.4 Feb 7, 2024
@G8XSU
Copy link
Contributor

G8XSU commented May 15, 2024

event-handling-panic-persistence-failure

event-handling-panic-persistence-failure could be treated the same as persistence failure during any other persistence, where we currently panic and restart. I agree we need to fix this, but in any case, this holds true for any remote persistence that can be implemented by user for ldk/ldk-node afaiu, so this shouldn't block vss integration, and can be tracked separately.

Mid- to longer term we need to switch to a primary/secondary storage model

There might be some misunderstanding on how primary/secondary storage model will work.
Even in primary/secondary, we will always first store to primary storage and if it is successful, we can store it in secondary.
This will not help in case of transient connection failures to primary, it is only useful for vss-phase-2/multi-device.
If we choose to do it the other way around(secondary-before-primary), we compromise on the promise of live backups and cannot ensure safe multi-device usage.

need to add some huge warning that VSS is considered experimental and that either data might be lost at any time or that
LDK Node / the app using it might crash at any time if the connection is lost.

IIUC, we will replay the events if there was a panic during event handling, so there shouldn't be any data loss. but yes the node will crash if there is persistence failure, this is true for channel-manager and channel-monitor persistence as well.
Retries in persistence should address some of the pain, but can we move forward vss-alpha in current state of ldk-node/ldk, and we can address the fallible events in parallel?
If vss-integration is not there, and user wants to use remote live backup then they are forced to rollout their own kvStore, which sadly will still be plagued by this same issue.

@tnull
Copy link
Collaborator Author

tnull commented May 16, 2024

event-handling-panic-persistence-failure

event-handling-panic-persistence-failure could be treated the same as persistence failure during any other persistence, where we currently panic and restart.

For one, we generally try to avoid it and return an error wherever possible. But more importantly, persistence failure is currently ~never expected (i.e., only ever on hardware failure essentially), while with remote storage it will be very common, especially on mobile.

I agree we need to fix this, but in any case, this holds true for any remote persistence that can be implemented by user for ldk/ldk-node afaiu, so this shouldn't block vss integration, and can be tracked separately.

I disagree. We need to find a solution for this before we deem VSS integration 'stable'/ fully expose it in the stable API (and hence in bindings, as UniFFI doesn't support cfg flags or similar). We can't just treat this as an independent issue.

There might be some misunderstanding on how primary/secondary storage model will work. Even in primary/secondary, we will always first store to primary storage and if it is successful, we can store it in secondary. This will not help in case of transient connection failures to primary, it is only useful for vss-phase-2/multi-device. If we choose to do it the other way around(secondary-before-primary), we compromise on the promise of live backups and cannot ensure safe multi-device usage.

This might be true for the channel monitor updates where proceeding without being sure an update was persisted is crucial. For basically everything else it would be highly beneficial to write to a local cache, proceed, and try to sync the cache with the remote server, restarting/resetting in case of any conflicts if they can't be resolved. I'm indeed a bit dubious what the benefit is of the secondary local storage, if we can't ever use it in the first place. I think we discussed this many times before and I still keep my hopes up we'd eventually get a better solution for it. But granted, it won't be solved by VSS primary/secondary model as currently planned (and required for monitor persistence).

IIUC, we will replay the events if there was a panic during event handling, so there shouldn't be any data loss. but yes the node will crash if there is persistence failure, this is true for channel-manager and channel-monitor persistence as well. Retries in persistence should address some of the pain, but can we move forward vss-alpha in current state of ldk-node/ldk, and we can address the fallible events in parallel?

Well, yeah, I started working on fallible event handling in lightningdevkit/rust-lightning#2995 and intend to get it into the next LDK release. But as mentioned above, I don't think we can deem VSS stable/fully expose as long as we just crash and burn on persistence failure.

If vss-integration is not there, and user wants to use remote live backup then they are forced to rollout their own kvStore, which sadly will still be plagued by this same issue.

The keyword here is live. I kind of assume that at least for many of our users having a regular remote backup (while the node is stopped/no payments are inflight) of the SQLite store might be preferable and safer than exposing their users to arbitrary crashes, especially in environments where Lightning integration/LDK Node is only one of the things going on. Imagine a social media or gaming app just arbitrarily crashing because we disconnected from the internet for a second: unacceptable.

@tnull
Copy link
Collaborator Author

tnull commented Oct 17, 2024

Addressed a good chunk of the actionable items in 0.4 milestone, remaining items will slip to 0.5 (+).

@tnull tnull modified the milestones: 0.4, 0.5 Oct 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants