I reviewed all our metrics for mainnet servers to find points of improvement. I'm opening this discussion to list everything I find concerning, discuss it, and eventually open issues for the items that others agree are problems.
### Items from metrics (mainnet node with +64 keys)
- Uneven utilization of BLS thread pool workers
- Attempt to get good latency from the workers; latency may be poor because the main thread is very busy?
- Block processor job time must be close to zero
- The `beacon_block_production_seconds_bucket` metric is dead
- Op pool sizes are too big? Do that many objects take a lot of memory?
- Review the full block processing flow to understand whether we are processing blocks quickly or not
- Why do we receive much more data than we send?
- Why do we send more subscriptions than we receive?
- There is a slow leak in discv5. We must have an alert for leaks on workers (Faith)
- Execution engine job time is too high
- The node should eventually stop searching for peers so aggressively
- Why do 2 regen jobs run on average?
- Block production is too slow, and we need more metrics to understand the process (see the timing-metrics sketch after this list)
- Node performance is not good enough
- DB growth on mainnet is 600 MB/day (~219 GB/year); that's not OK
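Regarding the block production timing item above, here is a minimal sketch of what per-step timing metrics could look like, assuming prom-client is the metrics library; the metric name, buckets, step names, and wrapped functions are placeholders for illustration, not Lodestar's actual API.

```typescript
// Minimal sketch of per-step block production timings, assuming prom-client
// as the metrics library. Metric name, buckets, and step names are placeholders.
import { Histogram, Registry } from "prom-client";

const registry = new Registry();

// One histogram labelled by step, instead of separate avg/min/max gauges.
const blockProductionStepSeconds = new Histogram({
  name: "beacon_block_production_step_seconds",
  help: "Time spent in each step of block production",
  labelNames: ["step"],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2],
  registers: [registry],
});

// Hypothetical helper that wraps one step of the block production flow.
async function timedStep<T>(step: string, fn: () => Promise<T>): Promise<T> {
  const stopTimer = blockProductionStepSeconds.startTimer({ step });
  try {
    return await fn();
  } finally {
    stopTimer(); // records elapsed seconds into the histogram
  }
}

// Usage sketch (function names are placeholders for the real steps):
// await timedStep("produce_state", () => regenStateForBlockProduction(slot));
// await timedStep("pack_attestations", () => opPool.getAttestationsForBlock(state));
// await timedStep("get_execution_payload", () => executionEngine.getPayload(payloadId));
```

A single histogram labelled by step also fits the avg/min/max deprecation item below, since averages and quantiles can be derived from the buckets at query time.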
### Items from CPU profile
`0418_mainnet_cache_attestation_data.cpuprofile` from @tuyennhv (beta lg1k), top-down tree:

- 19.72% in `get rootHashObject` (@chainsafe/persistent-merkle-tree/lib/node.js)
  - This number must be wrong; it makes no sense. Could it be due to the extremely deep nested call stack of this function?
### Misc items

- Completely deprecate avg/min/max metrics once and for all
- Reduce the number of metrics per packet in gossip
- Be a good gossip network participant:
  - 0% of attestation objects are dropped in normal cases
  - Wait time for gossip objects close to or below 10 ms
  - Queues are never full during normal operation
  - Async validation must be fast enough to keep the mcache miss rate close to 0%
  - Time from block received to processed should be < 100 ms; it is currently ~600 ms
  - Mesh performance must be very stable; currently it fluctuates significantly
  - Understand the memory cost of a 600,000-item gossip seenCache and fastIdCache (see the cache-size measurement sketch below)
- Add a recurring task that exports all dashboards from our Grafana instance, prepares a commit, and pushes it. Use the Grafana HTTP API with an auth token for this (a sketch of such an export script is included below).
- Audit and debug all missed attestations that happen on production nodes. Methodically categorize and understand them. Figure out a fast workflow for this.
- Cache the aggregate pubkeys of attestation committees to avoid re-aggregating them over and over. If an attestation has >80% participation, compute the full aggregate pubkey, cache it, and later recompute the variant by subtracting the missing keys (see the pubkey-caching sketch below).
- De-duplicate execution payloads in the local database
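For the seenCache/fastIdCache memory question, a rough standalone estimate is sketched below, assuming message ids are 32-byte hashes stored as hex strings in a Map; this is not the real gossipsub cache structure, just a way to get a ballpark heap number.

```typescript
// Rough standalone estimate of the heap cost of a 600,000-entry message-id cache.
// This is NOT the real gossipsub seenCache/fastIdCache, just a Map with similarly
// shaped keys (hex message ids) and numeric values (timestamps).
// Run with `node --expose-gc` so the before/after heap samples are comparable.
import { randomBytes } from "node:crypto";

function heapUsedMb(): number {
  if (typeof global.gc === "function") global.gc();
  return process.memoryUsage().heapUsed / 1024 / 1024;
}

const N = 600_000;
const before = heapUsedMb();

const cache = new Map<string, number>();
for (let i = 0; i < N; i++) {
  // 32-byte message id rendered as a 64-char hex string, like gossip msgIds
  cache.set(randomBytes(32).toString("hex"), Date.now());
}

const after = heapUsedMb();
console.log(`~${(after - before).toFixed(1)} MB for ${cache.size} entries`);
```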
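For the Grafana export task, a minimal sketch using the standard Grafana HTTP API (`GET /api/search?type=dash-db` and `GET /api/dashboards/uid/:uid`) with a bearer token; the env var names, output directory, and commit step are assumptions for illustration.

```typescript
// Sketch of a dashboard export script using the Grafana HTTP API with a bearer
// token. Env var names, output directory, and the commit step are assumptions.
import { mkdir, writeFile } from "node:fs/promises";

const GRAFANA_URL = process.env.GRAFANA_URL ?? "http://localhost:3000";
const GRAFANA_TOKEN = process.env.GRAFANA_TOKEN ?? "";
const OUT_DIR = "./dashboards";

async function grafanaGet<T>(path: string): Promise<T> {
  const res = await fetch(`${GRAFANA_URL}${path}`, {
    headers: { Authorization: `Bearer ${GRAFANA_TOKEN}` },
  });
  if (!res.ok) throw new Error(`${path}: HTTP ${res.status}`);
  return (await res.json()) as T;
}

async function exportDashboards(): Promise<void> {
  await mkdir(OUT_DIR, { recursive: true });
  // List every dashboard (folders excluded), then fetch each one by uid
  const items = await grafanaGet<{ uid: string; title: string }[]>("/api/search?type=dash-db");
  for (const { uid, title } of items) {
    const { dashboard } = await grafanaGet<{ dashboard: unknown }>(`/api/dashboards/uid/${uid}`);
    const fileName = `${title.replace(/[^a-z0-9_-]+/gi, "_")}.json`;
    await writeFile(`${OUT_DIR}/${fileName}`, JSON.stringify(dashboard, null, 2));
  }
  // Committing the files and opening a PR would follow, e.g. via git in CI.
}

exportDashboards().catch((err) => {
  console.error(err);
  process.exit(1);
});
```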
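For the committee pubkey caching item, the control flow could look like the sketch below. The `BlsOps` interface (`aggregate`/`subtract`) is a hypothetical stand-in for whatever the BLS binding exposes (subtracting a key means adding its negation to the aggregate), and the cache key shape is made up; this is not Lodestar's actual API.

```typescript
// Sketch of the committee pubkey caching idea. `BlsOps` is a hypothetical
// stand-in for whatever the BLS binding exposes; "subtract" means adding the
// negation of the given keys to the aggregate. Cache key shape is made up.
interface PublicKey {} // opaque aggregate-able BLS public key

interface BlsOps {
  aggregate(keys: PublicKey[]): PublicKey;
  subtract(sum: PublicKey, keys: PublicKey[]): PublicKey;
}

// Full-committee aggregates, keyed e.g. by `${epoch}:${committeeIndex}` (hypothetical)
const committeeAggCache = new Map<string, PublicKey>();

function getAttestationAggregatePubkey(
  bls: BlsOps,
  cacheKey: string,
  committeePubkeys: PublicKey[],
  participation: boolean[]
): PublicKey {
  const participants = participation.filter(Boolean).length;

  // Dense attestation (>80% participation): start from the cached full-committee
  // aggregate and subtract the few missing keys instead of re-aggregating everything.
  if (participants / committeePubkeys.length > 0.8) {
    let fullAgg = committeeAggCache.get(cacheKey);
    if (fullAgg === undefined) {
      fullAgg = bls.aggregate(committeePubkeys);
      committeeAggCache.set(cacheKey, fullAgg);
    }
    const missing = committeePubkeys.filter((_, i) => !participation[i]);
    return missing.length === 0 ? fullAgg : bls.subtract(fullAgg, missing);
  }

  // Sparse attestation: aggregating only the participants is cheaper.
  return bls.aggregate(committeePubkeys.filter((_, i) => participation[i]));
}
```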