-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
routing: fix stuck inflight payments #8174
routing: fix stuck inflight payments #8174
Conversation
1435e5b
to
2793c53
Compare
d044321
to
a80524d
Compare
cc: @Roasbeef for approach ack |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass done, looks really good already 🔥. I'll let this run on a node that has stuck inflight payments.
@@ -562,4 +562,8 @@ var allTestCases = []*lntest.TestCase{ | |||
Name: "payment succeeded htlc remote swept", | |||
TestFunc: testPaymentSucceededHTLCRemoteSwept, | |||
}, | |||
{ | |||
Name: "send to route failed htlc timeout", | |||
TestFunc: testSendToRouteFailHTLCTimeout, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the repeated itest needed? What's the main difference for SendPaymentV2
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I follow, but it's testing a different RPC? Tho SendToRouteV2
looks similar to SendPaymentV2
, it's nonetheless an independent RPC that can diverge someday.
htlcswitch/switch.go
Outdated
@@ -978,7 +978,7 @@ func (s *Switch) handleLocalResponse(pkt *htlcPacket) { | |||
s.cfg.HtlcNotifier.NotifySettleEvent(key, htlc.PaymentPreimage, | |||
eventType) | |||
|
|||
case *lnwire.UpdateFailHTLC: | |||
case *lnwire.UpdateFailHTLC, *lnwire.UpdateFailMalformedHTLC: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is that change needed? I think for the same reason mentioned below, we don't expect UpdateFailMalformedHTLC
here because of the conversion, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Yeah based on the callsites I don't think this is needed, now removed.
htlcs[a.AttemptID] = hash | ||
} | ||
|
||
payHash := payment.Info.PaymentIdentifier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: payHash
can be used more to compress some lines below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
lightninglabs-deploy mute 336h |
!lightninglabs-deploy mute 336h30m |
1 similar comment
!lightninglabs-deploy mute 336h30m |
Why are you guys muting this? This happens in prod and imo serious |
Hi @kilrau the notifications for reviews were muted as YY focused on other prs, but as you can see the pr is still tagged for 18.3 and will be picked up again soon |
Getting this into 0.18.3 sounds good 🙏 |
a80524d
to
d4ff44e
Compare
Important Review skippedAuto reviews are limited to specific labels. Labels to auto review (1)
Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configuration File (
|
Exchanged with the guys from the BOLTZ team, and it seems like they are suffering from another issue in regards of Inflight Payments which I am not sure this PR will solve. There seems to be an edge case where we fail a payment ( exit When a service tries to use trackpayment => this call remains pending although we already canceled all resultCollector goroutines, so it will pending until the next restart. Moreover this payment will be inflight until the next restart in general, where we try to recollect all the Inflight HTLCs of a payment. I think I might have found the culprit based on a logs of a payment attempt which got stuck: So basically we are in the process of adding a new HTLC Attempt to the payment here: Code parts taken from this PR.
This will however return an error if the payment is failed already in
hence the Payment Lifecycle will exit with (as seen in the picture):
I think the problem is For MPP when we hit the MPP Timeout error we would mark a payment prematurely as failed although HTLCs are still pending in Because we would fail a payment as soon as the final hop reported the failure which is the case for the MPP Timeout for example. I think my analysis describes whats happening in the above log picture, although it's not a full log but just a grep of the payment hash. However I think we really need to be carefull to not prematurely kill the payment lifecycle until all HTLCs reported back, or when killing the payment flow, we should at least also kill the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass done, mostly nits. Except from my other comment related to the other problem with Inflight HTLCs.
htlcswitch/switch.go
Outdated
@@ -3041,3 +2949,162 @@ func (s *Switch) handlePacketAdd(packet *htlcPacket, | |||
|
|||
return destination.handleSwitchPacket(packet) | |||
} | |||
|
|||
// handlePacketAdd handles forwarding a settle packet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: handlePacketAdd => handlePacketSettle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
htlcswitch/switch.go
Outdated
// ErrUnknownCircuit error upon the inner call to CloseCircuit. | ||
// | ||
// NOTE: We can only get a nil circuit when it has already been deleted | ||
// when `UpdateFulfillHTLC` is received. After which `RevokeAndAck` is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: ... and when UpdateFulfillHTLC
is received.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
htlcswitch/switch.go
Outdated
return s.mailOrchestrator.Deliver(packet.incomingChanID, packet) | ||
} | ||
|
||
// handlePacketAdd handles forwarding a fail packet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: godocs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
itest/lnd_payment_test.go
Outdated
payStream := alice.RPC.TrackPaymentV2(payHash[:]) | ||
dustPayStream := alice.RPC.TrackPaymentV2(dustPayHash[:]) | ||
|
||
// Check that the dust payment is failed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Maybe add a comment why both checks, the stream and the normal findpayment command should be done here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
// handlePacketAdd handles forwarding a settle packet. | ||
func (s *Switch) handlePacketSettle(packet *htlcPacket) error { | ||
// If the source of this packet has not been set, use the circuit map | ||
// to lookup the origin. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: when can this happen, would be good to describe it in the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool let me dig a bit...
@@ -33,6 +33,9 @@ | |||
created](https://github.com/lightningnetwork/lnd/pull/8921) when multiple RPC | |||
calls are made concurrently. | |||
|
|||
* [Fixed](https://github.com/lightningnetwork/lnd/pull/8174) old payments that | |||
are stuck inflight. Check more info [here](https://github.com/lightningnetwork/lnd/issues/8146). | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wondering whether the three commits on top of this PR are related ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean the last three? yeah they are related as they are required to get the itest pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe switch the order so we have the release at the end ?
Interesting case @ziggie1984! Cool could you open a new issue to track this instead - think I know what happened, but don't have a good fix, plus I prefer not growing the scope of this PR to stay focused. In short it's due to async fetching of the same payment, and it happens when an invoice times out exactly during the period while the node is still restarting. Would also be great to have raw logs instead of screenshots, as the timestamps were messed up and difficult to analyze. |
ae6ed43
to
06146ce
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, looks very close 🚀, just a few nits and questions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 🐝, still had some general questions about this code change but all non-blocking
} | ||
|
||
log.Debugf("Cleaning network result store.") | ||
if err := r.cfg.Payer.CleanStore(toKeep); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we clean the store only once on startup ? Would it make sense to run this periodically:
from the comment of the clean function:
This should be called
// preiodically to let the switch clean up payment results that we have
// handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good q - yeah I think it should be cleaned more frequently. I also remember we can fetch whatever is stored here from our open channel, which makes this store redundant. Will make an issue to track it - either clean it more often or just don't store it.
a := a | ||
|
||
// We check whether the individual attempts have their | ||
// HTLC hash set, if not we'll fall back to the overall |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: Is the hash set equal of AMP being used ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I think so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: maybe add it to the comment then, I would have found it useful ?
// We pass in a non-timeout context, to indicate we don't need | ||
// it to timeout. It will stop immediately after the existing | ||
// attempt has finished anyway. We also set a zero fee limit, | ||
// as no more routes should be tried. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be => can we be sure that INFLIGHT HTLCs will never have payment status (e.g. StatusInFlight
) which could lead us to retry the payment and send more HTCL attempts ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think once there are inflight HTLCs the payment should always be StatusInFlight
tho,
lnd/channeldb/payment_status.go
Lines 142 to 153 in ab96de3
// Based on the above variables, we derive the status using the following | |
// table, | |
// | inflight | settled | htlc failed | payment failed | status | | |
// |:--------:|:-------:|:-----------:|:--------------:|:--------------------:| | |
// | true | true | true | true | StatusInFlight | | |
// | true | true | true | false | StatusInFlight | | |
// | true | true | false | true | StatusInFlight | | |
// | true | true | false | false | StatusInFlight | | |
// | true | false | true | true | StatusInFlight | | |
// | true | false | true | false | StatusInFlight | | |
// | true | false | false | true | StatusInFlight | | |
// | true | false | false | false | StatusInFlight | |
Since the fee limit is zero, it means there are no routes to be tried, unless we can find a zero-fee route.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would the zero-fee route cause a problem in terms of not setting a timeout here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there exists a zero-fee route (which seems impossible?), then whether to send another attempt or not is determined by the remaining payment amt, if it's not zero, then another attempt will be tried.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A question is whether we really would like to set a zero fee limit here (and starve the payment), as in principle we may want to have payments to continue after a restart, right? I think that's why resumePayment
was added. Do we have an itest for a payment that continues its pathfinding after a restart?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! I think we need to revisit the restart process a bit - atm, there's the existing issue re how we should go and handle the payment timeout from the sender side. The second issue now is the one you mentioned above. Say we have this case,
- invoice has a timeout of 60s.
- the sender tries to fulfill the invoice via MPP, say
htlc1
andhtlc2
. htlc1
is sent, but if the sender restarts before sendinghtlc2
(or even before the creation of it, as long as there's a remaining amount), and,- the sender's restart finishes before the invoice times out.
- the payment is now resumed, with only
htlc1
sent, and some remaining amount to try.
In this case, this payment is def gonna fail since it won't try another route. I also don't think this is an urgent issue, also guess that's why we log errors during the shutdown process if we still have inflight payments. There are multiple ways to fix it and think it's worthy having a tracking issue.
@@ -33,6 +33,9 @@ | |||
created](https://github.com/lightningnetwork/lnd/pull/8921) when multiple RPC | |||
calls are made concurrently. | |||
|
|||
* [Fixed](https://github.com/lightningnetwork/lnd/pull/8174) old payments that | |||
are stuck inflight. Check more info [here](https://github.com/lightningnetwork/lnd/issues/8146). | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe switch the order so we have the release at the end ?
@@ -48,6 +48,9 @@ | |||
bumping an anchor channel closing was not possible when no HTLCs were on the | |||
commitment when the channel was force closed. | |||
|
|||
* [Fixed](https://github.com/lightningnetwork/lnd/pull/8174) old payments that | |||
are stuck inflight. Check more info [here](https://github.com/lightningnetwork/lnd/issues/8146). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add in the description that the culprit of this problem is still not detected ? (Meaning why those payments are still inflight and did not resolve their corresponding payment attempt)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool updated.
// Double check whether the reported closed channel has indeed finished | ||
// closing. | ||
// | ||
// NOTE: There are misalignments regarding when a channel's FC is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you find out which misalignments in particular ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what the TODO
is for...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok sure, was just asking what misalignment means in this context, afaict it means somehow we mark channels prematurely as closed although they are still pending ?
06146ce
to
ba3741c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM ⚡
This commit adds a new test case to validate that when an HTLC has timed out, the corresponding payment is marked as failed.
Simply moves the code into a new method so it's easier to follow the method `handlePacketForward`.
This commit adds two methods, `handlePacketFail` and `handlePacketSettle` to handle the settle and fail packets differently.
This commit adds a check during router's startup and fails the inflight HTLCs if they are routing using channels unknown to us. The channels are unknown because they are already closed, usually long time ago.
The method `FetchClosedChannels` sometimes prematurely mark a pending force closing channel as finalized, therefore we need to furthur check `FetchPendingChannels` to make sure the channel is indeed finalized.
To reflect the new sweeping behavior, also makes it easier to be used as we need to method to quickly cleanup force closes without concerning the details when we are not testing the force close behavior.
ba3741c
to
bc31a8b
Compare
Replaces #8150, this PR attempts to fix #8146.
Timed out HTLCs are properly handle
The original assumption was, when the HTLC timed out, the attempt result wasn't sent back the payment lifecycle, thus causing the payment to be in stuck. However, this assumption is not true, as shown in the newly added itest - when an outgoing HTLC has timed out, the payment will be marked as failed. Moreover, if the outgoing HTLC is swept by the remote via the preimage path, the payment will be marked as succeeded. All these are expected behavior.
HTLC Attempt result is not saved
If there's already a result in db,
GetAttemptResult
will return it, so the result must not be found in db. Also notice that, since there's no result in db, the only line we can hit is:lnd/htlcswitch/switch.go
Line 466 in e6fbaaf
as otherwise hitting this line will return the error
ErrPaymentIDNotFound
:lnd/htlcswitch/switch.go
Line 456 in e6fbaaf
And the method
GetAttemptResult
is blocking at line:lnd/htlcswitch/switch.go
Line 456 in e6fbaaf
On the other hand, these network results will be cleaned when
ChannelRouter
starts here:lnd/routing/router.go
Line 657 in e6fbaaf
But this cleanup only applied to terminated payments. For inflight payments, we always keep the network results. This rules out the possibility that the HTLC attempt once had a result but was later removed.
This seems to leave us only one possibility - that the network result was never written to disk, causing an HTLC attempt waiting for its result forever, hence the payment stayed inflight, which leads to suspicion that this line is hit, thus the line
go s.handleLocalResponse(packet)
is skipped so the result is not saved:lnd/htlcswitch/switch.go
Lines 1289 to 1293 in e02fd39
This PR refactors
handlePacketForward
and makes sure the network result is always saved for locally-initiated packet.Fixes,