Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disabling Maintenance mode does not trigger an event #11330

Open
jakubgs opened this issue Oct 15, 2021 · 8 comments · May be fixed by #11332
Open

Disabling Maintenance mode does not trigger an event #11330

jakubgs opened this issue Oct 15, 2021 · 8 comments · May be fixed by #11332
Assignees
Labels
Hacktoberfest type/enhancement Proposed improvement or new feature

Comments

@jakubgs
Copy link
Contributor

jakubgs commented Oct 15, 2021

Overview of the Issue

I was looking at disabling alerts in consul-alerts when host is in maintenance mode: AcalephStorage/consul-alerts#213

And I found out that while consul-alerts does create an entry in the KV store for _node_maintenance service check when maintenance mode is disabled, it does NOT update that entry once maintenance mode is disabled:

image

So I tried running consul watch myself same as the consul-alerts service does and I found out that disabling maintenance mode does not trigger an event:

 > consul watch -type checks cat | jq '.[] | select(.CheckID == "_node_maintenance")'
{
  "Node": "node-01.example.org",
  "CheckID": "_node_maintenance",
  "Name": "Node Maintenance Mode",
  "Status": "critical",
  "Notes": "TEST",
  "Output": "",
  "ServiceID": "",
  "ServiceName": "",
  "ServiceTags": [],
  "Type": "maintenance",
  "Definition": {
    "Interval": "0s",
    "Timeout": "0s",
    "DeregisterCriticalServiceAfter": "0s",
    "HTTP": "",
    "Header": null,
    "Method": "",
    "Body": "",
    "TLSServerName": "",
    "TLSSkipVerify": false,
    "TCP": ""
  },
  "CreateIndex": 67303625,
  "ModifyIndex": 67303625
}

I receive one event with Status set to critical, but no recovery to passing, since the _node_maintenance check is removed.

For this reason consul-alerts never updates its stage of _node_maintenance check.

Reproduction Steps

  1. Use version 1.10.1
  2. Run consul watch -type checks cat
  3. See that no even is triggered for disabling of maintenance mode

Operating system and Environment details

Ubuntu 20.04.3, x86_64

@jakubgs
Copy link
Contributor Author

jakubgs commented Oct 15, 2021

And there we have it:

consul/agent/agent.go

Lines 3464 to 3483 in cc2abb7

// DisableServiceMaintenance will deregister the fake maintenance mode check
// if the service has been marked as in maintenance.
func (a *Agent) DisableServiceMaintenance(serviceID structs.ServiceID) error {
if a.State.Service(serviceID) == nil {
return fmt.Errorf("No service registered with ID %q", serviceID.String())
}
// Check if maintenance mode is enabled
checkID := serviceMaintCheckID(serviceID)
if a.State.Check(checkID) == nil {
// maintenance mode is not enabled
return nil
}
// Deregister the maintenance check
a.RemoveCheck(checkID, true)
a.logger.Info("Service left maintenance mode", "service", serviceID.String())
return nil
}

consul/agent/agent.go

Lines 3510 to 3517 in cc2abb7

// DisableNodeMaintenance removes a node from maintenance mode
func (a *Agent) DisableNodeMaintenance() {
if a.State.Check(structs.NodeMaintCheckID) == nil {
return
}
a.RemoveCheck(structs.NodeMaintCheckID, true)
a.logger.Info("Node left maintenance mode")
}

We can clearly see that when disabling maintenance mode for a host RemoveCheck is called without changing the state.

@jakubgs
Copy link
Contributor Author

jakubgs commented Oct 15, 2021

I think just calling:

	a.State.UpdateCheck(checkID, api.HealthPassing, "")

Before RemoveCheck should do the trick. I'll make a PR.

@jkirschner-hashicorp
Copy link
Contributor

It may be worth verifying that the event is still triggered even if UpdateCheck and RemoveCheck are called in quick succession. (I'm unfamiliar with this part of the code, so I can't comment on whether I'd expect the event to fire still or not.)

@jkirschner-hashicorp jkirschner-hashicorp added the type/enhancement Proposed improvement or new feature label Oct 15, 2021
@jakubgs
Copy link
Contributor Author

jakubgs commented Oct 15, 2021

Though now that I look at it adding a test to verify the status was changed to passing before removal will be a bit more tricky:

consul/agent/agent_test.go

Lines 3358 to 3362 in cc2abb7

// Leave maintenance mode
a.DisableNodeMaintenance()
// Ensure the check was deregistered
requireCheckMissing(t, a, structs.NodeMaint)

I will have to do some kind of State dark magic to check that. In tests I changed before this was done using mock.NewNotify():
func TestStatusHandlerResetCountersOnNonIdenticalsConsecutiveChecks(t *testing.T) {
t.Parallel()
cid := structs.NewCheckID("foo", nil)
notif := mock.NewNotify()
logger := testutil.Logger(t)
statusHandler := NewStatusHandler(notif, logger, 2, 2, 3)

But not sure how that would work here.

@jakubgs
Copy link
Contributor Author

jakubgs commented Oct 15, 2021

It may be worth verifying that the event is still triggered even if UpdateCheck and RemoveCheck are called in quick succession.

Yes, good point. I'll verify that on a some host from some test fleet in our infra.

@jakubgs
Copy link
Contributor Author

jakubgs commented Oct 15, 2021

You might be onto something. I tried with just UpdateCheck call and it didn't work:

 > consul watch -type checks cat | jq -c '.[] | select(.CheckID == "_node_maintenance") | {CheckID, Status}'   
{"CheckID":"_node_maintenance","Status":"critical"}

But with addition of a.State.SyncChanges() it does work:

 > consul watch -type checks cat | jq -c '.[] | select(.CheckID == "_node_maintenance") | {CheckID, Status}'
{"CheckID":"_node_maintenance","Status":"critical"}                                                                                         
{"CheckID":"_node_maintenance","Status":"passing"}

I also tried a.State.SyncFull() which also worked, but not sure which one is the correct choice in this case.

Thanks for the tip. Though I'm still not sure how to change the tests to take this into account.

@jakubgs
Copy link
Contributor Author

jakubgs commented Oct 15, 2021

As far as I can tell it's impossible to check what was the state of the check after it was removed.

I thought I could get it after it was removed from a.State.AllChecks() since:

// Deleted is true when the service record has been marked as deleted
// but has not been removed on the server yet.
Deleted bool

consul/agent/local/state.go

Lines 334 to 338 in cc2abb7

// To remove the service on the server we need the token.
// Therefore, we mark the service as deleted and keep the
// entry around until it is actually removed.
s.InSync = false
s.Deleted = true

But when I try to print checks after it has been removed:

	fmt.Printf("%v", a.State.AllChecks())

I get nothing:

map[]

@jakubgs jakubgs linked a pull request Oct 15, 2021 that will close this issue
@jakubgs
Copy link
Contributor Author

jakubgs commented Oct 15, 2021

Well, for now I made a PR without tests changed: #11332

Open to suggestions on how I can check for status change in a check that was removed.

jakubgs added a commit to status-im/consul that referenced this issue Oct 15, 2021
Without this update service using `consul watch` to monitor check
changes will never see maintenance mode being disabled.

Resolves: hashicorp#11330

Signed-off-by: Jakub Sokołowski <[email protected]>
jakubgs added a commit to status-im/consul that referenced this issue May 6, 2022
Without this update service using `consul watch` to monitor check
changes will never see maintenance mode being disabled.

Resolves: hashicorp#11330

Signed-off-by: Jakub Sokołowski <[email protected]>
siddarthkay pushed a commit to status-im/consul that referenced this issue Apr 22, 2024
Without this update service using `consul watch` to monitor check
changes will never see maintenance mode being disabled.

Resolves: hashicorp#11330

Signed-off-by: Jakub Sokołowski <[email protected]>
siddarthkay pushed a commit to status-im/consul that referenced this issue Apr 29, 2024
Without this update service using `consul watch` to monitor check
changes will never see maintenance mode being disabled.

Resolves: hashicorp#11330

Signed-off-by: Jakub Sokołowski <[email protected]>
siddarthkay pushed a commit to status-im/consul that referenced this issue May 2, 2024
Without this update service using `consul watch` to monitor check
changes will never see maintenance mode being disabled.

Resolves: hashicorp#11330

Signed-off-by: Jakub Sokołowski <[email protected]>
siddarthkay pushed a commit to status-im/consul that referenced this issue May 2, 2024
Without this update service using `consul watch` to monitor check
changes will never see maintenance mode being disabled.

Resolves: hashicorp#11330

Signed-off-by: Jakub Sokołowski <[email protected]>
siddarthkay pushed a commit to status-im/consul that referenced this issue May 3, 2024
Without this update service using `consul watch` to monitor check
changes will never see maintenance mode being disabled.

Resolves: hashicorp#11330

Signed-off-by: Jakub Sokołowski <[email protected]>
siddarthkay pushed a commit to status-im/consul that referenced this issue May 6, 2024
Without this update service using `consul watch` to monitor check
changes will never see maintenance mode being disabled.

Resolves: hashicorp#11330

Signed-off-by: Jakub Sokołowski <[email protected]>
siddarthkay pushed a commit to status-im/consul that referenced this issue May 11, 2024
Without this update service using `consul watch` to monitor check
changes will never see maintenance mode being disabled.

Resolves: hashicorp#11330

Signed-off-by: Jakub Sokołowski <[email protected]>
siddarthkay pushed a commit to status-im/consul that referenced this issue May 15, 2024
Without this update service using `consul watch` to monitor check
changes will never see maintenance mode being disabled.

Resolves: hashicorp#11330

Signed-off-by: Jakub Sokołowski <[email protected]>
siddarthkay pushed a commit to status-im/consul that referenced this issue May 20, 2024
Without this update service using `consul watch` to monitor check
changes will never see maintenance mode being disabled.

Resolves: hashicorp#11330

Signed-off-by: Jakub Sokołowski <[email protected]>
siddarthkay pushed a commit to status-im/consul that referenced this issue May 30, 2024
Without this update service using `consul watch` to monitor check
changes will never see maintenance mode being disabled.

Resolves: hashicorp#11330

Signed-off-by: Jakub Sokołowski <[email protected]>
siddarthkay pushed a commit to status-im/consul that referenced this issue Aug 2, 2024
Without this update service using `consul watch` to monitor check
changes will never see maintenance mode being disabled.

Resolves: hashicorp#11330

Signed-off-by: Jakub Sokołowski <[email protected]>
siddarthkay pushed a commit to status-im/consul that referenced this issue Dec 5, 2024
Without this update service using `consul watch` to monitor check
changes will never see maintenance mode being disabled.

Resolves: hashicorp#11330

Signed-off-by: Jakub Sokołowski <[email protected]>
siddarthkay pushed a commit to status-im/consul that referenced this issue Feb 5, 2025
Without this update service using `consul watch` to monitor check
changes will never see maintenance mode being disabled.

Resolves: hashicorp#11330

Signed-off-by: Jakub Sokołowski <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Hacktoberfest type/enhancement Proposed improvement or new feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants