
Update Hydras to new HTTP Delegated Routing #180

Closed
2 tasks
BigLep opened this issue Nov 18, 2022 · 9 comments

Comments

@BigLep

BigLep commented Nov 18, 2022

Done Criteria

Hydras are using an HTTP Delegated Routing implementation compatible with ipfs/specs#337 in production.
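
For context, a minimal sketch of what a provider lookup against that API looks like from a client, assuming the `/routing/v1/providers/{cid}` endpoint shape from the IPIP-337 draft and cid.contact as the delegated router; the CID below is just a placeholder:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Placeholder CID used purely for illustration.
	cid := "bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi"

	// IPIP-337-style provider lookup over plain HTTP (path per the spec draft).
	resp, err := http.Get("https://cid.contact/routing/v1/providers/" + cid)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body))
}
```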

Why Important

See motivation in ipfs/specs#337

Notes

@guseggert
Contributor

guseggert commented Dec 8, 2022

Main code change in #185

I have also turned off OpenSSL in the Docker build since it keeps causing problems; the build now uses Go's standard crypto. I'll monitor perf around that.

I've deployed this to the test instance, see libp2p/hydra-booster-infra#14. I've also updated the dashboards with the new metrics.

I'll let it bake overnight, if everything looks good tomorrow then I'll deploy to the whole fleet.

@BigLep
Author

BigLep commented Dec 9, 2022

Hi @guseggert. Did the prod deployment happen? Are there client-side (Hydra) and server-side (cid.contact) graphs you're monitoring?

@guseggert
Contributor

No, not yet; it was getting late on Friday and I didn't want to deploy late on a Friday. Today I looked into why CPU usage was much higher than expected (almost 2x). I expected something related to disabling OpenSSL, but CPU profiles showed most of the time spent in GC, and allocation profiles showed that the top allocations were in the libp2p resource manager's metric publishing, which generates a ton of garbage in the tags it adds to metrics. So I disabled that; we don't use it anyway, since hydra calculates its own resource manager metrics. That's now deployed to test, and CPU usage looks much better, as does long-tail latency on cid.contact requests.
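
For anyone retracing this kind of investigation, a minimal sketch of exposing and pulling those profiles with the standard net/http/pprof handler; the port is a placeholder and hydra's actual pprof wiring may differ:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on the default mux
)

func main() {
	// Serve pprof on a side port (port is a placeholder).
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// With this running, the profiles mentioned above can be pulled with:
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30              # CPU
	//   go tool pprof -sample_index=alloc_space http://localhost:6060/debug/pprof/heap  # allocations
	select {}
}
```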

This became an issue now because I also upgraded libp2p to the latest version to pick up all the security updates.

Letting this bake again tonight and will take a look in the AM. Will also open an issue w/ go-libp2p to reduce the garbage generated by the resource manager metrics.

@BigLep
Author

BigLep commented Dec 14, 2022

@guseggert : how is this looking?

Also, please share the issue with go-libp2p when you have it.

@guseggert
Contributor

I was able to grab another profile showing the tag allocations coming from OpenCensus, and opened an issue with go-libp2p here: libp2p/go-libp2p#1955

I've been fighting with the resource manager, and I've given up on it and turned it off; things are looking better now. Every time I fixed one limit, another would pop up and cause some degenerate behavior somewhere else, and chasing down the root cause of throttles is non-trivial. We need to move forward here, so I'm just disabling the resource manager for now.
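
Roughly what turning it off looks like when constructing a libp2p host; this is a sketch, not the actual hydra change, and the exact no-op symbol varies between go-libp2p releases:

```go
package main

import (
	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/network"
)

func main() {
	// NullResourceManager never throttles; on newer go-libp2p releases the
	// no-op manager may be a struct (&network.NullResourceManager{}) rather
	// than a package-level variable.
	h, err := libp2p.New(
		libp2p.ResourceManager(network.NullResourceManager),
	)
	if err != nil {
		panic(err)
	}
	defer h.Close()
}
```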

@BigLep
Author

BigLep commented Dec 14, 2022

@guseggert: can you also point to how you were configuring the resource manager? (I'm asking so I can learn what pain another integrator experienced.) I would have expected us to only have limits like Kubo's strategy.

@guseggert
Contributor

guseggert commented Dec 14, 2022

Each hydra host is effectively running many Kubo nodes at the same time, and hydras also don't handle Bitswap traffic, so the traffic pattern is pretty different from a single Kubo node. We have high-traffic gateway hosts to compare with, but they are even more different (e.g. the accelerated DHT client).

The RM config currently deployed to prod hydras is at https://github.com/libp2p/hydra-booster/blob/master/head/head.go#L82 (note that those are per-head limits). After upgrading from go-libp2p v0.21 to v0.24 there was significantly more throttling, so I've been tweaking them locally and in a branch. As part of that, I pulled the resource manager and connection manager out to be shared across heads instead, which makes reasoning about limits easier. When RM throttling was interfering, the DHT processed far fewer requests but memory usage and goroutine counts were much higher, with most goroutines stuck on the identify handshake. I didn't trace through the code, but I suspect they were somehow stuck due to RM throttling, since everything runs fine now with RM off.
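
A minimal sketch of the shared-manager shape described above, assuming the go-libp2p v0.24-era rcmgr API; the limit scaling, head count, and options are placeholders rather than hydra's real configuration:

```go
package main

import (
	"github.com/libp2p/go-libp2p"
	rcmgr "github.com/libp2p/go-libp2p/p2p/host/resource-manager"
)

func main() {
	// One limiter/manager for the whole process, instead of one per head.
	limits := rcmgr.DefaultLimits.AutoScale()
	mgr, err := rcmgr.NewResourceManager(rcmgr.NewFixedLimiter(limits))
	if err != nil {
		panic(err)
	}

	const nHeads = 10 // placeholder head count
	for i := 0; i < nHeads; i++ {
		h, err := libp2p.New(libp2p.ResourceManager(mgr)) // all heads share mgr
		if err != nil {
			panic(err)
		}
		defer h.Close()
	}
}
```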

@guseggert
Contributor

guseggert commented Dec 16, 2022

Coordinated with @masih this morning to flip the full Hydra fleet over to the HTTP API. Things are looking fine. The p50 cid.contact latency has dropped from ~36 ms (via reframe) to ~18 ms (via HTTP API).

@BigLep
Author

BigLep commented Jan 20, 2023

Resolving since the done criteria are satisfied.

@BigLep BigLep closed this as completed Jan 20, 2023
@github-project-automation github-project-automation bot moved this from 🏃‍♀️ In Progress to 🎉 Done in IPFS Shipyard Team Jan 20, 2023