storage/oasis: Cache init failure is no longer fatal #619

mitjat · 2024-01-29T22:20:03Z

When the node RPC cache (backed by a pogreb db) fails to initialize, our current code prevents the entire analyzer from starting. This PR makes it so that a failure to initialize the cache merely reports a warning, and continues without the cache.
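The behavior proposed here can be sketched roughly as follows. This is a minimal illustration, not the actual diff: `Cache`, `openCache`, and `openCacheOrWarn` are hypothetical names, and a plain directory check stands in for opening the pogreb db.

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// Cache is a hypothetical stand-in for the pogreb-backed node RPC cache.
type Cache struct {
	dir string
}

// openCache is a hypothetical initializer. As a stand-in for opening the
// real db, it only checks that dir exists and is a directory.
func openCache(dir string) (*Cache, error) {
	info, err := os.Stat(dir)
	if err != nil {
		return nil, fmt.Errorf("opening cache dir %q: %w", dir, err)
	}
	if !info.IsDir() {
		return nil, fmt.Errorf("cache path %q is not a directory", dir)
	}
	return &Cache{dir: dir}, nil
}

// openCacheOrWarn logs a warning and returns nil when the cache cannot be
// initialized, so the caller can continue without a cache.
func openCacheOrWarn(dir string) *Cache {
	cache, err := openCache(dir)
	if err != nil {
		log.Printf("WARN: cache init failed, continuing without cache: %v", err)
		return nil
	}
	return cache
}

func main() {
	if cache := openCacheOrWarn("/does-not-exist"); cache == nil {
		fmt.Println("analyzer running without cache")
	}
}
```

Callers would then have to treat a nil cache as "always fetch from the node".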

Testing: Manual.

Ran the analyzer with a functioning cache; works.
Ran the analyzer with cache_dir: /does-not-exist, produced a warning and continued as expected.

mitjat · 2024-01-30T21:58:03Z

Aborting this PR. As discussed internally, I'd prefer to keep cache failure handling explicit (= one has to disable cache in the config) so we are more aware of caching issues. The team agrees.

Mitigations for the root cause will follow in future PRs.
The problematic situation arises when
a) the pogreb store needs reindexing and
b) the sapphire or emerald nodes are unavailable.

Because of (b), nexus initialization keeps failing in a fatal way. (Nexus is lazy about connecting to the node, but not for the runtimes, because there it uses the SDK in addition to the raw RPC connection.) But by the time that fatal error occurs, pogreb has already started the recovery process, including .bac creation. Then k8s keeps restarting nexus. At this point, if the runtime node became available, all would be good. But it typically stays unavailable for a while, so the .bac files pile up until eventually the pogreb recovery fails (because it cannot create backups) and nexus terminates before it even tries to connect to the runtime node. From then on, every restart fails, so when the runtime node eventually comes back up, nexus never learns about it.

We should ideally stop using the SDK, or change it so it can connect lazily. We should also pre-clean-up the .bac files to prevent pogreb from failing just due to filename buildup.

mitjat force-pushed the mitjat/cache-optional branch from 2574723 to bd526a2 Compare January 29, 2024 22:31

storage/oasis: Cache init failure is no longer fatal

5d713f0

mitjat force-pushed the mitjat/cache-optional branch from bd526a2 to 5d713f0 Compare January 29, 2024 22:39

mitjat closed this Jan 30, 2024

mitjat deleted the mitjat/cache-optional branch February 6, 2024 22:35
