-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce peer message traffic for ledger data #5126
base: develop
Are you sure you want to change the base?
Conversation
* Also log as warning when the state lowers
c357abb
to
9e5487c
Compare
09c4156
to
c69b443
Compare
* Allow a retry after 30s in case of peer or network congestion. * Addresses RIPD-1870 * (Changes levelization. That is not desirable, and will need to be fixed.)
* Allow a retry after 15s in case of peer or network congestion. * Collate duplicate TMGetLedger requests: * The requestCookie is ignored when computing the hash, thus increasing the chances of detecting duplicate messages. * With duplicate messages, keep track of the different requestCookies (or lack of cookie). When work is finally done for a given request, send the response to all the peers that are waiting on the request, sending a separate message for each requestCookie. * Addresses RIPD-1871
* Addresses RIPD-1869 --------- Co-authored-by: Valentin Balaschenko <[email protected]> Co-authored-by: Ed Hennis <[email protected]>
* When work is done for a given TMGetLedger request, send the response to all the peers that are waiting on the request, sending one message per peer, including all the cookies and a "directResponse" flag indicating the data is intended for the sender, too.
c69b443
to
e490e57
Compare
include/xrpl/basics/CanProcess.h
Outdated
insert() | ||
{ | ||
std::unique_lock<Mutex> lock_(mtx_); | ||
bool exists = collection_.contains(item_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would avoid extra lookup
auto [_, inserted] = collection_.insert(item_);
return inserted;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Fixed.
* Avoid an unnecessary lookup in CanProcess
* upstream/develop: Set version to 2.3.0-b4 feat(SQLite): allow configurable database pragma values (5135) refactor: re-order PRAGMA statements (5140) fix(book_changes): add "validated" field and reduce RPC latency (5096) chore: fix typos in comments (5094) Set version to 2.2.3 Update SQLite3 max_page_count to match current defaults (5114)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only minor comments. Happy to approve once addressed.
src/xrpld/overlay/detail/PeerImp.cpp
Outdated
return ledger; | ||
} | ||
|
||
JLOG(p_journal_.trace()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nit]: Should this be a warn
instead of trace
? Not having a peer to relay the request may indicate some configuration or environment issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily, though. It could mean that the node has already sent the request to all it's peers. Also not that the original message is trace
.
But there's something odd here. It looks like this code block was somehow duplicated! I must have messed up resolving a conflict when I rebased from master
to develop
. I've removed the duplicate.
src/xrpld/overlay/detail/PeerImp.cpp
Outdated
@@ -2936,7 +3078,9 @@ getPeerWithLedger( | |||
void | |||
PeerImp::sendLedgerBase( | |||
std::shared_ptr<Ledger const> const& ledger, | |||
protocol::TMLedgerData& ledgerData) | |||
protocol::TMLedgerData& ledgerData, | |||
std::map<std::shared_ptr<Peer>, std::set<std::optional<uint64_t>>> const& |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::map<std::shared_ptr<Peer>, std::set<std::optional<uint64_t>>>
is mentioned in four places. It would be more readable to define an alias:
using PeerCookieMap = std::map<std::shared_ptr<Peer>, std::set<std::optional<uint64_t>>>;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion! Fixed.
* Add unit tests for to_short_string(base_uint * Remove duplicated code * Use type aliases for cookie maps
* That's what I get for rushing to push
* upstream/develop: Set version to 2.3.0 refactor(AMMClawback): move tfClawTwoAssets check (5201) Add a new serialized type: STNumber (5121) fix: check for valid ammID field in amm_info RPC (5188)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Several notes for now, also I noticed there's a bunch of Added line not covered by tests
annotations from codecov, perhaps worth checking these.
include/xrpl/basics/CanProcess.h
Outdated
return canProcess_; | ||
} | ||
|
||
operator bool() const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use explicit
here; it would be consistent with std::optional
and most other operator bool
inside the project.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use
explicit
here; it would be consistent withstd::optional
and most otheroperator bool
inside the project.
Fixed
include/xrpl/basics/CanProcess.h
Outdated
} | ||
|
||
bool | ||
canProcess() const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given that it is not covered by tests, perhaps we do not need this function ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given that it is not covered by tests, perhaps we do not need this function ?
It's used in the call path for network traffic, most of which is not covered by tests. But I changed that call to use the boolean conversion, and everything seems fine.
include/xrpl/basics/CanProcess.h
Outdated
if (canProcess_) | ||
{ | ||
std::unique_lock<Mutex> lock_(mtx_); | ||
collection_.erase(item_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about replacing the (likely) O(log N)
element lookup with O(1)
erase of an iterator ? The iterator which is returned from insert
to be specific, and currently ignored. In this case the Item
wouldn't have to be stored inside CanProcess
object, so that's also one less template parameter and probably also smaller object size (hashes are larger than iterators I guess).
Could even go one step further and replace all data members with std::function<void()> cleanup_
which would capture all that it needs if insert
succeeded, or is empty if insert
failed. In this case no template parameters would be needed at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about replacing the (likely)
O(log N)
element lookup withO(1)
erase of an iterator ? The iterator which is returned frominsert
to be specific, and currently ignored. In this case theItem
wouldn't have to be stored insideCanProcess
object, so that's also one less template parameter and probably also smaller object size (hashes are larger than iterators I guess).
I didn't really consider using an iterator because I didn't want to consider the risks of it getting invalidated for any valid Collection
type. For example, it looks like unordered_map
would fit the template, but can invalidate iterators. One option is to force the collection to be std::set
, which the current callers use.
Could even go one step further and replace all data members with
std::function<void()> cleanup_
which would capture all that it needs ifinsert
succeeded, or is empty ifinsert
failed. In this case no template parameters would be needed at all.
So only the ctor and insert
would have template parameters? That could work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was incorrect in that last comment. unordered_map
doesn't fit the template, but unordered_set
does, and it also says it can potentially invalidate operators. So, I wrote a generic insert()
that doesn't use iterators, and a specialized one that does, and wrote tests for both. It may be overkill....
return true; | ||
return false; | ||
}(); | ||
assert( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like unnecessary repetition of the condition used to initialise shouldAcquire
but there is also a subtle bug here. We are reading getOPs().isNeedNetworkLedger()
more than once, but this function returns an atomic (I mean its implementation in NetworkOPs.cpp
) so in principle each time we call it, we might get a different response. I am not certain this condition can actually happen, but at the very least it is brittle code.
Also please note 3rd call to getOPs().isNeedNetworkLedger()
inside logging below, the same applies as it could generate some confusing logs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ugh. I should have removed that assert
before opening the PR, but forgot. I was just double-checking that I had rewritten old code into new code. I've done that, and stored the result of isNeedNetworkLedger
into a local variable, so that all the uses are guaranteed consistent.
<< ": " << e.what(); | ||
} | ||
catch (...) | ||
if (CanProcess check{acquiresMutex_, pendingAcquires_, hash}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good 👍
* upstream/develop: test: Add more test cases for Base58 parser (5174) test: Check for some unlikely null dereferences in tests (5004) Add Antithesis intrumentation (5042)
* upstream/develop: refactor: clean up `LedgerEntry.cpp` (5199)
* upstream/develop: Enforce levelization in libxrpl with CMake (5111)
* upstream/develop: Set version to 2.4.0-b1 fix: Add header for set_difference (5197) fix: allow overlapping types in `Expected` (5218) Add MPTIssue to STIssue (5200) Antithesis instrumentation improvements (5213)
* upstream/develop: chore: add macos dependency installation (5233) prefix Uint384 and Uint512 with Hash in server_definitions (5231) refactor: add `rpcName` to `LEDGER_ENTRY` macro (5202)
Co-authored-by: Bronek Kozicki <[email protected]>
* Rewrite CanProcess: Remove class template parameters. Erase collection items using iterator if appropriate. Remove the redundant canProcess() function. * Store the result of isNeedNetworkLedger() to prevent possible inconsistencies. * Make all the uses of CanProcess const objects,
* Older clang versions can't handle structured binding variables passed to lambas
* upstream/develop: chore: update deprecated Github Actions (5241) XLS-46: DynamicNFT (5048)
* upstream/develop: chore: Update Visual Studio CI to VS 2022, and add VS Debug builds (5240) Add [validator_list_threshold] to validators.txt to improve UNL security (5112) Switch from assert to XRPL_ASSERT (5245) Add missing space character to a log message (5251) Cleanup API-CHANGELOG.md (5207) test: Unit tests to recreate invalid index logic error (5242) Update branch management and merge / release processes (5215) fix: Error consistency in LedgerEntry::parsePermissionedDomains() (5252) Set version to 2.4.0-b2 fix: Use consistent CMake settings for all modules (5228) Fix levelization script to ignore commented includes (5194) Fix the flag processing of NFTokenModify (5246) Fix failing assert in `connect` RPC (5235) Permissioned Domains (XLS-80d) (5161)
* upstream/develop: Fix CI unit tests (5196) Update secp256k1 library to 0.6.0 (5254)
* upstream/develop: Set version to 2.4.0-b3 Set version to 2.3.1 Update conan in the "nix" CI jobs Add RPC "simulate" to execute a dry run of a transaction (5069) Set version to 2.3.1-rc1 Reduce the peer charges for well-behaved peers:
* upstream/develop: Add deep freeze feature (XLS-77d) (5187)
* upstream/develop: Amendment `fixFrozenLPTokenTransfer` (5227) Improve git commit hash lookup (5225)
* upstream/develop: Updates Conan dependencies (5256)
* upstream/develop: fix: Do not allow creating Permissioned Domains if credentials are not enabled (5275) fix: issues in `simulate` RPC (5265)
* upstream/develop: fix: Amendment to add transaction flag checking functionality for Credentials (5250) fix: Omit superfluous setCurrentThreadName call in GRPCServer.cpp (5280)
case CONSENSUS: | ||
return "CONSENSUS"; | ||
default: | ||
assert(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please replace with UNREACHABLE
and a nice & short description why (see CONTRIBUTING.md
, section " Contracts and instrumentation" for naming guideline)
{ | ||
// Done. Something else (probably consensus) built the ledger | ||
// locally while waiting for data (or possibly before requesting) | ||
assert(isDone()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please replace with XRPL_ASSERT
and a short description why; see CONTRIBUTING.md
for contracts and instrumentation naming guidelines
|
||
for (auto const peerID : s.peekPeerSet()) | ||
{ | ||
if (!callback(peerID)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How fast are these callbacks - we are running them over a loop inside critical section. Might need to think of an alternative solution, to avoid hogging the mutex_
. Not a major issue, because as indicated in first sentence - I do not know if these callbacks are fast or slow (or we don't know ?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 10 assert
added here, can you please turn them to XRPL_ASSERT
, following naming guideline in CONTRIBUTING.md
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only minor remarks related to use of assert
, once fixed feel free to merge
Note to self: add unit tests for PeerImp.cpp
in a future PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I realized I added a comment re. use of a loop and callback inside critical section. A sufficient explanation why this is safe might be sufficient to address this comment, but for now this is not approved.
* upstream/develop: chore: Rename missing-commits job, and combine nix job files (5268) docs: Add a summary of the git commit message rules (5283)
High Level Overview of Change
Several changes to help reduce message traffic and improve logging and visibility.
TMGetLedger
andTMLedgerData
messages, reducing the overhead of processing those messages.TMLedgerData
message, which allows multiple identical requests to be replied to with one message.These changes are organized into several commits which are organized logically separating each functional operation. They can be merged as-is, or squashed.
Context of Change
Analysis of the issue that led to #5115 identified heavy
TMGetLedger
request andTMLedgerData
response traffic between nodes leading up to the syncing incidents. It was later determined that those messages were more a symptom of the problem, and not the root cause. However, leading up to identification of the root cause, these changes were being implemented to cut down on those messages, detect duplicates, etc. That reduction in unnecessary traffic is still valuable, so it's being included here.Type of Change
API Impact
None.
Test Plan
Future Tasks
I am still working on a follow-up to #4764 that makes use of these changes and other improvements to reduce the number of requests initiated by a given node in the first place.