Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add reachability check loop to expose reachability failures in node logs #556

Merged
merged 7 commits into from
May 14, 2024

Conversation

pschork
Copy link
Contributor

@pschork pschork commented May 11, 2024

Default check interval is 5min
Minimum check interval is 10sec
Operators can disable by setting NODE_REACHABILITY_POLL_INTERVAL to 0

For backwards compatibility, the reachability check loop is disabled if the NODE_DATAAPI_URL is undefined (ie operator did not update their .env)

See also Layr-Labs/eigenda-operator-setup#123

Checks

  • I've made sure the lint is passing in this PR.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, in that case, please comment that they are not relevant.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

@pschork pschork force-pushed the pschork/node_port_check branch 3 times, most recently from 11fdf44 to 4d26766 Compare May 11, 2024 00:50
@pschork pschork requested review from jianoaix and shrimalmadhur May 11, 2024 00:52
@pschork pschork force-pushed the pschork/node_port_check branch from 4d26766 to 97ff548 Compare May 11, 2024 01:10
@pschork pschork changed the title Add reachability check loop to node to expose reachability failures in node logs Add reachability check loop to expose reachability failures in node logs May 11, 2024
@pschork pschork force-pushed the pschork/node_port_check branch from 97ff548 to 073ba2e Compare May 11, 2024 01:47
@pschork
Copy link
Contributor Author

pschork commented May 11, 2024

Adds auto semantic versioning to node makefile.

> make semver
0.7.1-pschork-node-port-check.1

GitVersion knows that the latest tag release (currently) is v0.7.0 and automatically bumps a patch release (ie 0.7.1) and because this is a feature branch it includes branch name pschork/node-port-check->pschork-node-port-check and finally adds build metadata .1

> make docker-node
> docker images opr-node
REPOSITORY   TAG                               IMAGE ID       CREATED          SIZE
opr-node     0.7.1-pschork-node-port-check.1   86d1ddd34fee   30 minutes ago   33.8MB
opr-node     latest                            86d1ddd34fee   30 minutes ago   33.8MB
> docker run -it opr-node:latest --version
da-node version 0.7.1-pschork-node-port-check.1-073ba2e-1715392025

@pschork pschork force-pushed the pschork/node_port_check branch from e5a5d24 to 3190725 Compare May 11, 2024 04:12
node/node.go Show resolved Hide resolved
@ian-shim ian-shim requested a review from anupsv May 13, 2024 19:34
node/node.go Outdated Show resolved Hide resolved
node/node.go Outdated Show resolved Hide resolved
node/node.go Outdated
}

checkUrl, err := url.Parse(fmt.Sprintf("%s/api/v1/operators-info/port-check?operator_id=%s", n.Config.DataApiUrl, n.Config.ID.Hex()))
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which condition does it fall into if the dataapi itself was down?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not this one. This is just ensuring that the constructed URL parses as valid.
If DataAPI is down/unresolvable

If DataAPI is up but throwing non-200

@pschork pschork marked this pull request as ready for review May 14, 2024 01:42
pschork and others added 6 commits May 14, 2024 10:05
Default check interval in 5min
Minimum check interval is 10sec

For backwards compatibility, the reachability check loop is disabled if
1. NODE_DATA_URL is undefined (ie operator did not update their .env)
1. NODE_REACHABILITY_POLL_INTERVAL is set to 0
…go unnoticed.

- If NODE_DATAAPI_URL is not defined, we will continue to log error every interval.
- If the constructed checkUrl is invalid, we will continue to log error every interval
Disable reachability goroutine if configuration is broken
@pschork pschork force-pushed the pschork/node_port_check branch from 3f03b34 to 6f15896 Compare May 14, 2024 17:05
node/flags/flags.go Outdated Show resolved Hide resolved
node/flags/flags.go Outdated Show resolved Hide resolved
n.Logger.Debug("Calling reachability check", "url", checkUrl.String())

resp, err := http.Get(checkUrl.String())
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we set a value in the gauge in case it gets error? @shrimalmadhur

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should, otherwise it will show last value.

Copy link
Contributor

@jianoaix jianoaix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one comment on if we need to update the gauge, when the call to dataapi fails to get response. @shrimalmadhur may have dealt with an issue for the onchain metric before, please chime in.

@pschork
Copy link
Contributor Author

pschork commented May 14, 2024

LGTM, just one comment on if we need to update the gauge, when the call to dataapi fails to get response. @shrimalmadhur may have dealt with an issue for the onchain metric before, please chime in.

Thanks @jianoaix. If Dataapi is down, this isn't really the operators issue and does not mean they are unreachable. We will log this condition as error in log every interval, so I feel like it wont go unnoticed.

@pschork pschork merged commit d1c657c into master May 14, 2024
6 checks passed
@shrimalmadhur
Copy link
Contributor

LGTM, just one comment on if we need to update the gauge, when the call to dataapi fails to get response. @shrimalmadhur may have dealt with an issue for the onchain metric before, please chime in.

Thanks @jianoaix. If Dataapi is down, this isn't really the operators issue and does not mean they are unreachable. We will log this condition as error in log every interval, so I feel like it wont go unnoticed.

so if the dataapi is down - what's the indicator for operators? like they will still see reachable from their side since the gauge value has not changed. How do they know if data api is down and reachability check is not working fine?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants