Internal watchdog functionality #2010

Open
wants to merge 8 commits into
base: main

shawn-zil
Contributor

@shawn-zil shawn-zil commented Dec 13, 2024

An application-layer restart for internal nodes. This is in contrast to #2004, which tried to use an OS-level restart mechanism.

This internal watchdog mechanism is more modest. Since a service may run multiple nodes, this mechanism restarts only the affected node and not the entire service. It can be used alongside any infra-level mechanism that detects failures and restarts the entire service.

2024-12-13T02:54:08.912071Z  INFO zilliqa::consensus: 487: We are on view: 1 but we are not a validator, so we are waiting.
2024-12-13T02:54:09.422499Z  INFO zilliqa::consensus: 487: We are on view: 1 but we are not a validator, so we are waiting.
2024-12-13T02:54:09.536248Z  WARN zilliqa::node_launcher: 234: WDT node stuck at self_highest=0 remote_highest=7025
2024-12-13T02:54:09.536304Z  INFO zilliqa::node_launcher: 385: WDT restarting 33468.
2024-12-13T02:54:09.536850Z  INFO zilliqa::db: 235: PRAGMA journal_mode="memory" journal_size_limit=33554432 synchronous=1 temp_store=2 page_size=32768 cache_size=8192
2024-12-13T02:54:09.573453Z  INFO zilliqa::node_launcher: 267: WDT period: 30s
2024-12-13T02:54:09.589972Z  INFO zilliqa::consensus: 487: We are on view: 1 but we are not a validator, so we are waiting.
2024-12-13T02:54:10.100579Z  INFO zilliqa::consensus: 487: We are on view: 1 but we are not a validator, so we are waiting.

I've also checked that the joinset is stable: it does not grow across multiple restarts.

The definition of a stuck node is:

  • It has not changed its own highest canonical block number in a while. This is an internal check. If N historical samples are the same, then this condition is met.
  • It has a canonical block number that is smaller than the network's. This is an external check to ensure that we're actually behind. This helps to mitigate against timeouts.
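
To make the two checks concrete, here is a minimal sketch of how they could be combined. It is not the code in this PR: the `Watchdog` type, its `window` field, and the `observe` method are hypothetical names, and it assumes a node is flagged only when both conditions hold at once. The `self_highest` and `remote_highest` parameters mirror the fields shown in the WDT log line above.

```rust
use std::collections::VecDeque;

/// Hypothetical watchdog state (illustrative only, not the PR's actual type).
struct Watchdog {
    /// The last `window` samples of our own highest canonical block number.
    samples: VecDeque<u64>,
    /// N: how many identical samples count as "has not changed in a while".
    window: usize,
}

impl Watchdog {
    fn new(window: usize) -> Self {
        Self {
            samples: VecDeque::with_capacity(window),
            window,
        }
    }

    /// Record one sample and report whether the node looks stuck.
    /// Assumption: both the internal and the external check must pass.
    fn observe(&mut self, self_highest: u64, remote_highest: u64) -> bool {
        // Keep at most `window` samples by dropping the oldest one.
        if self.samples.len() == self.window {
            self.samples.pop_front();
        }
        self.samples.push_back(self_highest);

        // Internal check: a full window of identical samples.
        let unchanged = self.samples.len() == self.window
            && self.samples.iter().all(|&s| s == self_highest);

        // External check: we are behind the highest block seen on the network.
        let behind = self_highest < remote_highest;

        unchanged && behind
    }
}
```

With a window of 3, three consecutive samples of `self_highest=0` against `remote_highest=7025` would flag the node, while a node that is unchanged but level with the network would not be flagged.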

Contributor

github-actions bot commented Dec 13, 2024

🐰 Bencher Report

Branch: 1917-internal-watchdog
Testbed: self-hosted

| Benchmark | Latency: Benchmark Result, nanoseconds (ns) (Result Δ%) | Upper Boundary, nanoseconds (ns) (Limit %) |
|---|---|---|
| process-empty/process-empty | 8,433,200.00 (-7.95%) | 10,637,765.68 (79.28%) |
| produce-full/produce-full | 2,250,100,000.00 (-0.96%) | 3,138,061,085.10 (71.70%) |

@shawn-zil shawn-zil marked this pull request as ready for review December 13, 2024 02:59
@shawn-zil shawn-zil enabled auto-merge December 13, 2024 03:21