I reviewed all our metrics for mainnet servers to find points of improvement. I'm opening this discussion to list everything I find concerning, discuss it, and eventually open issues for the items that others agree are problems.
### Items from metrics (mainnet node with +64 keys)
- Uneven utilization of BLS thread pool workers
- Attempt to get good latency from the workers; latency may be poor because the main thread is very busy?
- Block processor job time must be close to zero
- The `beacon_block_production_seconds_bucket` metric is dead
- Op pool sizes are too big? Do that many objects take a lot of memory?
- Review the full block processing flow to understand whether we are processing blocks quickly or not
- Why do we receive much more data than we send?
- Why do we send more subscriptions than we receive?
- There is a slow leak in discv5. We must have an alert for leaks on workers (Faith)
- Execution engine job time is too high
- The node should eventually stop searching for peers so aggressively
- Why do 2 regen jobs run on average?
- Block production is too slow, and we need more metrics to understand the process (see the timing-metrics sketch after this list)
- Node performance is not good enough
- DB growth on mainnet is 600 MB/day (~219 GB/year); that's not OK
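Regarding the block production timing item above, here is a minimal sketch of what per-step timing metrics could look like, assuming prom-client is the metrics library; the metric name, buckets, step names, and wrapped functions are placeholders for illustration, not Lodestar's actual API.

```typescript
// Minimal sketch of per-step block production timings, assuming prom-client
// as the metrics library. Metric name, buckets, and step names are placeholders.
import { Histogram, Registry } from "prom-client";

const registry = new Registry();

// One histogram labelled by step, instead of separate avg/min/max gauges.
const blockProductionStepSeconds = new Histogram({
  name: "beacon_block_production_step_seconds",
  help: "Time spent in each step of block production",
  labelNames: ["step"],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2],
  registers: [registry],
});

// Hypothetical helper that wraps one step of the block production flow.
async function timedStep<T>(step: string, fn: () => Promise<T>): Promise<T> {
  const stopTimer = blockProductionStepSeconds.startTimer({ step });
  try {
    return await fn();
  } finally {
    stopTimer(); // records elapsed seconds into the histogram
  }
}

// Usage sketch (function names are placeholders for the real steps):
// await timedStep("produce_state", () => regenStateForBlockProduction(slot));
// await timedStep("pack_attestations", () => opPool.getAttestationsForBlock(state));
// await timedStep("get_execution_payload", () => executionEngine.getPayload(payloadId));
```

A single histogram labelled by step also fits the avg/min/max deprecation item below, since averages and quantiles can be derived from the buckets at query time.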
### Items from CPU profile
`0418_mainnet_cache_attestation_data.cpuprofile` from @tuyennhv (beta lg1k), top-down tree:

- 19.72% in `get rootHashObject` (@chainsafe/persistent-merkle-tree/lib/node.js)
  - This number must be wrong; it makes no sense. Could it be due to the extremely deep nested call stack of this function?
### Misc items

- Completely deprecate avg/min/max metrics once and for all
- Reduce the number of metrics per packet in gossip
- Be a good gossip network participant:
  - 0% of attestation objects are dropped in normal cases
  - Wait time for gossip objects close to or below 10 ms
  - Queues are never full during normal operation
  - Async validation must be fast enough to keep the mcache miss rate close to 0%
  - Time from block received to processed should be < 100 ms; it is currently ~600 ms
  - Mesh performance must be very stable; currently it fluctuates significantly
  - Understand the memory cost of a 600,000-item gossip seenCache and fastIdCache (see the cache-size measurement sketch below)
- Add a recurring task that exports all dashboards from our Grafana instance, prepares a commit, and pushes it. Use the Grafana HTTP API with an auth token for this (a sketch of such an export script is included below).
- Audit and debug all missed attestations that happen on production nodes. Methodically categorize and understand them. Figure out a fast workflow for this.
- Cache the aggregate pubkeys of attestation committees to avoid re-aggregating them over and over. If an attestation has >80% participation, compute the full aggregate pubkey, cache it, and later recompute the variant by subtracting the missing keys (see the pubkey-caching sketch below).
- De-duplicate execution payloads in the local database
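For the seenCache/fastIdCache memory question, a rough standalone estimate is sketched below, assuming message ids are 32-byte hashes stored as hex strings in a Map; this is not the real gossipsub cache structure, just a way to get a ballpark heap number.

```typescript
// Rough standalone estimate of the heap cost of a 600,000-entry message-id cache.
// This is NOT the real gossipsub seenCache/fastIdCache, just a Map with similarly
// shaped keys (hex message ids) and numeric values (timestamps).
// Run with `node --expose-gc` so the before/after heap samples are comparable.
import { randomBytes } from "node:crypto";

function heapUsedMb(): number {
  if (typeof global.gc === "function") global.gc();
  return process.memoryUsage().heapUsed / 1024 / 1024;
}

const N = 600_000;
const before = heapUsedMb();

const cache = new Map<string, number>();
for (let i = 0; i < N; i++) {
  // 32-byte message id rendered as a 64-char hex string, like gossip msgIds
  cache.set(randomBytes(32).toString("hex"), Date.now());
}

const after = heapUsedMb();
console.log(`~${(after - before).toFixed(1)} MB for ${cache.size} entries`);
```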
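For the Grafana export task, a minimal sketch using the standard Grafana HTTP API (`GET /api/search?type=dash-db` and `GET /api/dashboards/uid/:uid`) with a bearer token; the env var names, output directory, and commit step are assumptions for illustration.

```typescript
// Sketch of a dashboard export script using the Grafana HTTP API with a bearer
// token. Env var names, output directory, and the commit step are assumptions.
import { mkdir, writeFile } from "node:fs/promises";

const GRAFANA_URL = process.env.GRAFANA_URL ?? "http://localhost:3000";
const GRAFANA_TOKEN = process.env.GRAFANA_TOKEN ?? "";
const OUT_DIR = "./dashboards";

async function grafanaGet<T>(path: string): Promise<T> {
  const res = await fetch(`${GRAFANA_URL}${path}`, {
    headers: { Authorization: `Bearer ${GRAFANA_TOKEN}` },
  });
  if (!res.ok) throw new Error(`${path}: HTTP ${res.status}`);
  return (await res.json()) as T;
}

async function exportDashboards(): Promise<void> {
  await mkdir(OUT_DIR, { recursive: true });
  // List every dashboard (folders excluded), then fetch each one by uid
  const items = await grafanaGet<{ uid: string; title: string }[]>("/api/search?type=dash-db");
  for (const { uid, title } of items) {
    const { dashboard } = await grafanaGet<{ dashboard: unknown }>(`/api/dashboards/uid/${uid}`);
    const fileName = `${title.replace(/[^a-z0-9_-]+/gi, "_")}.json`;
    await writeFile(`${OUT_DIR}/${fileName}`, JSON.stringify(dashboard, null, 2));
  }
  // Committing the files and opening a PR would follow, e.g. via git in CI.
}

exportDashboards().catch((err) => {
  console.error(err);
  process.exit(1);
});
```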
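For the committee pubkey caching item, the control flow could look like the sketch below. The `BlsOps` interface (`aggregate`/`subtract`) is a hypothetical stand-in for whatever the BLS binding exposes (subtracting a key means adding its negation to the aggregate), and the cache key shape is made up; this is not Lodestar's actual API.

```typescript
// Sketch of the committee pubkey caching idea. `BlsOps` is a hypothetical
// stand-in for whatever the BLS binding exposes; "subtract" means adding the
// negation of the given keys to the aggregate. Cache key shape is made up.
interface PublicKey {} // opaque aggregate-able BLS public key

interface BlsOps {
  aggregate(keys: PublicKey[]): PublicKey;
  subtract(sum: PublicKey, keys: PublicKey[]): PublicKey;
}

// Full-committee aggregates, keyed e.g. by `${epoch}:${committeeIndex}` (hypothetical)
const committeeAggCache = new Map<string, PublicKey>();

function getAttestationAggregatePubkey(
  bls: BlsOps,
  cacheKey: string,
  committeePubkeys: PublicKey[],
  participation: boolean[]
): PublicKey {
  const participants = participation.filter(Boolean).length;

  // Dense attestation (>80% participation): start from the cached full-committee
  // aggregate and subtract the few missing keys instead of re-aggregating everything.
  if (participants / committeePubkeys.length > 0.8) {
    let fullAgg = committeeAggCache.get(cacheKey);
    if (fullAgg === undefined) {
      fullAgg = bls.aggregate(committeePubkeys);
      committeeAggCache.set(cacheKey, fullAgg);
    }
    const missing = committeePubkeys.filter((_, i) => !participation[i]);
    return missing.length === 0 ? fullAgg : bls.subtract(fullAgg, missing);
  }

  // Sparse attestation: aggregating only the participants is cheaper.
  return bls.aggregate(committeePubkeys.filter((_, i) => participation[i]));
}
```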