storage/oasis: Cache init failure is no longer fatal #619

Closed

@mitjat mitjat commented Jan 29, 2024

When the node RPC cache (backed by a pogreb db) fails to initialize, our current code prevents the entire analyzer from starting. This PR makes it so that a failure to initialize the cache merely reports a warning, and continues without the cache.

Testing: Manual.

  • Ran the analyzer with a functioning cache; works.
  • Ran the analyzer with cache_dir: /does-not-exist, produced a warning and continued as expected.

@mitjat mitjat force-pushed the mitjat/cache-optional branch from 2574723 to bd526a2 Compare January 29, 2024 22:31
@mitjat mitjat force-pushed the mitjat/cache-optional branch from bd526a2 to 5d713f0 Compare January 29, 2024 22:39

mitjat commented Jan 30, 2024

Aborting this PR. As discussed internally, I'd prefer to keep cache failure handling explicit (= one has to disable cache in the config) so we are more aware of caching issues. The team agrees.

Mitigations for the root cause will follow in future PRs.
The problematic situation arises when
a) the pogreb store needs reindexing and
b) the sapphire or emerald nodes are unavailable.

Because of (b), nexus initialization keeps failing in a fatal way. (Nexus is lazy about connecting to the node, but not for the runtimes, because it uses the SDK in addition to the raw RPC connection.) But by the time that fatal error occurs, pogreb has already started the recovery process, including .bac creation. Then k8s keeps restarting nexus. At this point, if the runtime node became available, all would be good. But it typically stays unavailable for a while, so the .bac files pile up until eventually the pogreb recovery fails (because it cannot create backups) and nexus terminates before it even tries to connect to the runtime node. At that point, every restart will fail, so when the runtime node eventually comes back up, nexus never learns about it.
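To illustrate what "lazy about connecting" means here, a minimal Go sketch follows. It is an assumption about the general pattern, not the SDK's or nexus's actual code: `Conn`, `lazyConn`, and `errNodeUnavailable` are hypothetical names. The point is that construction never dials, and a failed dial is not remembered, so a later call can succeed once the node is back.

```go
package main

import (
	"errors"
	"sync"
)

// errNodeUnavailable is a hypothetical error for an unreachable runtime node.
var errNodeUnavailable = errors.New("runtime node unavailable")

// Conn is a hypothetical stand-in for a runtime client connection.
type Conn struct {
	addr string
}

// lazyConn defers dialing until the connection is first needed.
type lazyConn struct {
	mu   sync.Mutex
	dial func() (*Conn, error) // supplied by the caller
	conn *Conn
}

// get returns the established connection, dialing on demand.
// A dial error is returned to the caller but not cached, so the
// next call retries instead of failing forever.
func (l *lazyConn) get() (*Conn, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.conn != nil {
		return l.conn, nil
	}
	c, err := l.dial()
	if err != nil {
		return nil, err
	}
	l.conn = c
	return c, nil
}
```

With a wrapper like this, an unavailable runtime node would surface as an error on individual requests rather than as a fatal error during initialization.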

We should ideally stop using the SDK, or change it so it can connect lazily. We should also pre-clean-up the .bac files to prevent pogreb from failing just due to filename buildup.

@mitjat mitjat closed this Jan 30, 2024
@mitjat mitjat deleted the mitjat/cache-optional branch February 6, 2024 22:35