-
Notifications
You must be signed in to change notification settings - Fork 186
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
neutrino: improve perf in high latency networks #300
neutrino: improve perf in high latency networks #300
Conversation
dee0742
to
d4c7630
Compare
d4c7630
to
33c1869
Compare
OK, it appears that |
33c1869
to
c8fd04d
Compare
Cool, so I fixed the unit test. Now, the Blixt team will do some testing to make sure that we haven't also rolled back the perf improvements. |
@@ -956,7 +956,7 @@ func (rs *rescanState) handleBlockConnected(ntfn *blockntfns.Connected) error { | |||
|
|||
// Otherwise, we'll attempt to fetch the filter to retrieve the relevant | |||
// transactions and notify them. | |||
queryOptions := NumRetries(0) | |||
queryOptions := NumRetries(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it not make sense for high latency environments to instead make the timeout longer instead of bumping num retries? (or both lol)
also - wondering if we should make this configurable rather than hard code it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- So the timeouts and retries are directly connected to one another, as when we have a timeout we'll do a retry (bounded by the retry count). What we found from trace logs was that the networks didn't consistently timeout, but rather during higher bandwidth utilization there was a chance of hitting the low retry count limit. That's why we raised the retries, since raising the timeout would also indirectly influence the amount of extra timeout added (it doubles each time it fails).
In my highly non-scientific testing there seemed to be a trend where if we bumped timeouts alone, reliability would still be bad. So this is just kind of the solution that we converged.
- Yes, I've thought about making them configurable settings. However, we did a grid search and these were the best settings that balanced perf for slow and fast networks. I figured a follow-up PR where we break out these as configurables would be good, but let me know if we should bundle the changes. This PR is mostly focused on "saner" defaults.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it not make sense for high latency environments to instead make the timeout longer instead of bumping num retries? (or both lol)
Increasing timeout is kinda a bigger change as the number gets increased exponentially after each try.
I think increasing both retry and timeout makes sense. But I could look into that in a follow-up PR instead.
also - wondering if we should make this configurable rather than hard code it?
Yeah, I've thought of that too. It would be a very nice thing to have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gotcha gotcha
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool LGTM - before the final ✅ though, can you open an LND PR that points to this so that we can just check the CI on that side?
Created here: lightningnetwork/lnd#8797 |
cool thanks - will ACK this once the neutrino test on that side is green |
@djkazic, remember to re-request review from reviewers when ready |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good on the LND side - thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question about disk performance on fast systems, otherwise LGTM.
After working with @hsjoberg and @niteshbalusu11 to troubleshoot Blixt performance we realized two things were consistently going wrong in networks with > around 300ms latency:
After investigating, we decided to experiment with changing some constants in neutrino to work better.
This PR addresses both issues by:
The result is that we can now use Blixt with networks that we could not previously.