Update Hydras to new HTTP Delegated Routing #180
Comments
Main code change is in #185. I have also turned off OpenSSL in the Docker build since it keeps causing problems; it's now using Go's crypto. I'll monitor perf around that. I've deployed this to the test instance, see libp2p/hydra-booster-infra#14, and I've also updated the dashboards with the new metrics. I'll let it bake overnight; if everything looks good tomorrow, I'll deploy to the whole fleet.
Hi @guseggert. Did the prod deployment happen? Are there client-side (Hydra) and server-side (cid.contact) graphs you're monitoring?
No, not yet; it was getting late Friday and I didn't want to deploy late on a Friday. Today I looked into why CPU usage was much higher than expected (almost 2x). I expected something related to disabling OpenSSL, but CPU profiles showed most time spent in GC, and allocation profiles showed the top allocations were in the libp2p resource manager's metric publishing, which generates a ton of garbage in the tags that it adds to metrics. So I disabled that; we don't use it anyway, since hydra calculates its own resource manager metrics. That's now deployed to test, and CPU usage looks much better, as does long-tail latency on cid.contact requests. This became an issue now because I also upgraded libp2p to the latest version to pick up all the security updates. Letting this bake again tonight and will take a look in the AM. I will also open an issue with go-libp2p to reduce the garbage generated by the resource manager metrics.
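For reference, a minimal sketch of running a resource manager without any metrics or trace reporting, using the go-libp2p v0.24-era rcmgr API as I understand it; the limits and wiring here are placeholders, not the actual hydra change:

```go
package main

import (
	"log"

	"github.com/libp2p/go-libp2p"
	rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
)

func main() {
	// Scale the default limits to this machine, then build a fixed limiter.
	// No metrics or trace-reporter options are passed to NewResourceManager,
	// so the resource manager publishes no metrics of its own; the idea is
	// that hydra keeps computing its own resource metrics instead.
	limits := rcmgr.DefaultLimits.AutoScale()
	mgr, err := rcmgr.NewResourceManager(rcmgr.NewFixedLimiter(limits))
	if err != nil {
		log.Fatal(err)
	}

	host, err := libp2p.New(libp2p.ResourceManager(mgr))
	if err != nil {
		log.Fatal(err)
	}
	defer host.Close()
}
```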
@guseggert: how is this looking? Also, please share the issue with go-libp2p when you have it.
I was able to grab another profile showing the OpenCensus tag allocations and opened an issue with go-libp2p here: libp2p/go-libp2p#1955. I've been fighting with the resource manager, and I have given up on it and turned it off; things are looking better now. Every time I would fix one limit, another would pop up and cause some degenerate behavior somewhere else, and chasing down the root cause of throttles is non-trivial. We need to move forward here, so I am just disabling the resource manager for now.
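For completeness, a minimal sketch of what disabling the resource manager looks like at the libp2p API level, assuming a go-libp2p version where core/network exposes NullResourceManager as a struct (in some earlier versions it is a package-level value instead); the actual hydra change may be wired differently:

```go
package main

import (
	"log"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/network"
)

func main() {
	// Swap the default resource manager for the no-op implementation, so no
	// limits are enforced and no requests are throttled.
	host, err := libp2p.New(libp2p.ResourceManager(&network.NullResourceManager{}))
	if err != nil {
		log.Fatal(err)
	}
	defer host.Close()
}
```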
@guseggert: can you also point to how you were configuring the resource manager? (I'm asking so I can learn what pain another integrator experienced.) I would have expected us to only have limits like Kubo's strategy.
Each hydra host is effectively running many Kubo nodes at the same time, and they also don't handle Bitswap traffic, so the traffic pattern is pretty different from a single Kubo node. We have high-traffic gateway hosts to compare with, but they are even more different (e.g. they use the accelerated DHT client). The RM config currently deployed to prod hydras is here: https://github.com/libp2p/hydra-booster/blob/master/head/head.go#L82. Note that those are per-head limits. After upgrading from go-libp2p v0.21 to v0.24, there was significantly more throttling, so I've been tweaking them locally and in a branch. As part of that, I pulled the resource manager and connection manager out to be shared across heads instead, which makes reasoning about limits easier (see the sketch below). When RM throttling was interfering, the DHT processed far fewer requests but memory usage and goroutine counts were much higher, with most goroutines stuck on the identify handshake. I didn't trace through the code, but I suspect they were stuck due to RM throttling, since everything's running fine now with RM off.
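As an illustration of the "shared across heads" idea (a sketch under my own assumptions, not the branch code): one connection manager and one resource manager are built once and handed to every head's libp2p constructor, so limits apply to the host as a whole rather than per head. The limit values below are placeholders.

```go
package main

import (
	"log"
	"time"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/host"
	rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
	"github.com/libp2p/go-libp2p/p2p/net/connmgr"
)

// newHeads builds n heads that all share one connection manager and one
// resource manager, so limits are reasoned about per host, not per head.
func newHeads(n int) ([]host.Host, error) {
	cm, err := connmgr.NewConnManager(5000, 10000, connmgr.WithGracePeriod(time.Minute))
	if err != nil {
		return nil, err
	}
	rm, err := rcmgr.NewResourceManager(rcmgr.NewFixedLimiter(rcmgr.DefaultLimits.AutoScale()))
	if err != nil {
		return nil, err
	}

	heads := make([]host.Host, 0, n)
	for i := 0; i < n; i++ {
		h, err := libp2p.New(
			libp2p.ConnectionManager(cm),
			libp2p.ResourceManager(rm),
		)
		if err != nil {
			return nil, err
		}
		heads = append(heads, h)
	}
	return heads, nil
}

func main() {
	heads, err := newHeads(4)
	if err != nil {
		log.Fatal(err)
	}
	for _, h := range heads {
		defer h.Close()
	}
}
```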
Coordinated with @masih this morning to flip the full Hydra fleet over to the HTTP API. Things are looking fine. The p50 cid.contact latency has dropped from ~36 ms (via Reframe) to ~18 ms (via HTTP API).
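For context, the HTTP API in question is the IPIP-337 delegated routing interface; below is a minimal provider-lookup sketch against it using only Go's standard library. The endpoint path and JSON Accept header follow my reading of the spec, and the CID is just an example value; this is not hydra's client code.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Look up providers for a CID via the IPIP-337 HTTP delegated routing API
	// exposed by cid.contact.
	cid := "bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi"
	url := "https://cid.contact/routing/v1/providers/" + cid

	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Accept", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("status=%d body=%s\n", resp.StatusCode, body)
}
```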
Resolving since the done criteria are satisfied.
Done Criteria
Hydras are using the HTTP Delegated Routing version compatible with ipfs/specs#337 in production.
Why Important
See motivation in ipfs/specs#337
Notes