update maintenance check to passing before removing #11332

Open
wants to merge 3 commits into
base: main
from

Conversation

jakubgs
Contributor

@jakubgs jakubgs commented Oct 15, 2021

Without this update, a service using consul watch to monitor check changes will never see maintenance mode being disabled.

Resolves: #11330

Question for reviewer:

  • Open to suggestions on how I can check for status change in a check that was removed (so that tests can be updated)
  • Should SyncChanges or SyncFull be used to force the health check update event to fire before the health check is removed? (And are there any negative repercussions of that?)
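
To make the intent of the change concrete, here is a minimal toy model of the ordering it proposes. It is not Consul's code: the `registry` type, its `Watch`, `Update`, and `Remove` methods, and the check ID string are all hypothetical, and the model simply assumes that watchers are notified on status updates but not on removal, which is the behaviour this PR describes.

```go
package main

import "fmt"

// Event is what a hypothetical watcher receives when a check's status changes.
type Event struct {
	CheckID string
	Status  string
}

// registry is a toy stand-in for the agent's local check state. It is NOT
// Consul's implementation: it assumes watchers are notified on status
// updates only, never on removal.
type registry struct {
	checks   map[string]string
	watchers []func(Event)
}

func newRegistry() *registry {
	return &registry{checks: make(map[string]string)}
}

// Watch registers a callback that is invoked on every status update.
func (r *registry) Watch(fn func(Event)) {
	r.watchers = append(r.watchers, fn)
}

// Update sets the status and notifies all watchers.
func (r *registry) Update(id, status string) {
	r.checks[id] = status
	for _, w := range r.watchers {
		w(Event{CheckID: id, Status: status})
	}
}

// Remove deletes the check without notifying anyone (the assumed problem).
func (r *registry) Remove(id string) {
	delete(r.checks, id)
}

// disableMaintenance mirrors the ordering proposed in the PR: mark the check
// as passing first so watchers see a final event, then remove it.
func disableMaintenance(r *registry, id string) {
	r.Update(id, "passing")
	r.Remove(id)
}

// observedStatuses enables and disables maintenance and returns the
// sequence of statuses a watcher saw.
func observedStatuses() []string {
	r := newRegistry()
	var seen []string
	r.Watch(func(e Event) { seen = append(seen, e.Status) })
	r.Update("maint-check", "critical") // enable maintenance (illustrative ID)
	disableMaintenance(r, "maint-check")
	return seen
}

func main() {
	fmt.Println(observedStatuses()) // [critical passing]
}
```

Without the `Update` call inside `disableMaintenance`, the watcher in this model would only ever see `critical`.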

@@ -3475,6 +3475,9 @@ func (a *Agent) DisableServiceMaintenance(serviceID structs.ServiceID) error {
 		return nil
 	}
 
+	// Update check to trigger an event for watchers
+	a.State.UpdateCheck(checkID, api.HealthPassing, "")
+	a.State.SyncChanges()
Contributor

Question to reviewer: should SyncChanges or SyncFull be used here? This is to ensure that the watch events are fired before the check is removed.

(See thread for more context)

Contributor

@jakubgs: I'd wait to address this until you get review feedback from engineering, but as a part of those changes...

I recommend adding a comment explaining why SyncChanges is necessary, because it's non-obvious.

Contributor Author

Good point.

@jakubgs jakubgs force-pushed the update-maint-check-state branch from e03677c to 875ab2a Compare October 15, 2021 17:03
@vercel vercel bot temporarily deployed to Preview – consul October 15, 2021 17:03 Inactive
@jakubgs
Contributor Author

jakubgs commented Oct 21, 2021

Any feedback?

@jkirschner-hashicorp
Contributor

Hi @jakubgs,

The engineering team is taking a look at your questions from the description (e.g., any guidance on how to add a test).

Additionally, they may consider whether this "update to passing before check removal" change should be done in a more central location. Doing so would affect other check types, which could have benefits and/or unintended side effects.

I'll get back to you at the end of this week if you haven't heard from us already by then.

@vercel vercel bot temporarily deployed to Preview – consul-ui-staging November 17, 2021 14:44 Inactive
@dhiaayachi
Collaborator

Hi @jakubgs,
Thank you for the contribution, and sorry for the delay on this. I went through the changes and they look good; the only missing piece would be a test.

I think you can build a test based on the existing tests TestAgent_NodeMaintenance_Enable and TestAgent_NodeMaintenance_Disable (or modify those tests) to check that the check update was correctly propagated.

Let me know if you have questions or need more details on how to test this.

@jakubgs
Contributor Author

jakubgs commented Nov 17, 2021

Yes, that's exactly what I'm asking how to do. How can I possibly check the state of a check that was removed? This is not trivial.

@dhiaayachi
Collaborator

@jakubgs Yes, good point. Let me dig a bit to see if we can intercept the update somehow, or if we can test at a lower level by mocking the update.

@dnephin
Contributor

dnephin commented Nov 17, 2021

If this fix is related to consul watch, maybe we should use that watch machinery and assert that we see the event we expect in the watch stream? I haven't written any test like that, so I'm not sure exactly what function we would use there, but something along those lines seems like what we want.

@jakubgs
Contributor Author

jakubgs commented Nov 17, 2021

Yes, considering the purpose is to have consul watch notice the maintenance mode being turned off, we could try to use that code to detect whether the change has been registered. That makes the unit test less of a "unit" test and more of an "integration" test, but it's worth a try. No idea how that would work though. I might not have time to look into that this week.

@jakubgs
Contributor Author

jakubgs commented Nov 23, 2021

I tried something like this:

diff --git a/agent/agent_test.go b/agent/agent_test.go
index 6638fa35c..d2da02ef5 100644
--- a/agent/agent_test.go
+++ b/agent/agent_test.go
@@ -3355,9 +3355,17 @@ func TestAgent_NodeMaintenanceMode(t *testing.T) {
                t.Fatalf("bad: %#v", check)
        }
 
+       nodeMaintCheck := a.State.Check(structs.NodeMaintCheckID)
+
+       // Ensure the check state was updated
+       require.Equal(t, api.HealthCritical, nodeMaintCheck.Status)
+
        // Leave maintenance mode
        a.DisableNodeMaintenance()
 
+       // Ensure the check state was updated
+       require.Equal(t, api.HealthPassing, nodeMaintCheck.Status)
+
        // Ensure the check was deregistered
        requireCheckMissing(t, a, structs.NodeMaint)

But it fails, since the state is unchanged:

--- FAIL: TestAgent_NodeMaintenanceMode (0.44s)
    agent_test.go:3370: 
        	Error Trace:	agent_test.go:3370
        	Error:      	Not equal: 
        	            	expected: "passing"
        	            	actual  : "critical"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-passing
        	            	+critical
        	Test:       	TestAgent_NodeMaintenanceMode
FAIL
exit status 1
FAIL	github.com/hashicorp/consul/agent	0.469s

Which I guess means I'm either checking too soon, or the instance I'm checking is outdated.

@jakubgs
Contributor Author

jakubgs commented Nov 23, 2021

I also tried moving the second check after requireCheckMissing() but it fails the same way.

@jakubgs
Contributor Author

jakubgs commented Nov 23, 2021

It appears the UpdateCheck() function uses something called TriggerSyncChanges():

consul/agent/local/state.go

Lines 704 to 707 in cc2abb7

c.Check.Status = status
c.Check.Output = output
c.InSync = false
l.TriggerSyncChanges()

Which appears to be set here to a.sync.SyncChanges.Trigger:

consul/agent/agent.go

Lines 558 to 563 in 3666401

// link the state with the consul server/client and the state syncer
// via callbacks. After several attempts this was easier than using
// channels since the event notification needs to be non-blocking
// and that should be hidden in the state syncer implementation.
a.State.Delegate = a.delegate
a.State.TriggerSyncChanges = a.sync.SyncChanges.Trigger

Which appears to be some kind of internal dark magic for service/client syncing.
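
The wiring quoted above is an ordinary Go method value used as a callback. The following is a minimal sketch of that pattern only, not of Consul's syncer: the `notifier` and `state` types are hypothetical stand-ins, and the nil check in `UpdateCheck` is an assumption added so the sketch is safe when no callback is set.

```go
package main

import "fmt"

// notifier is a hypothetical stand-in for the syncer's trigger object.
type notifier struct {
	count int
}

// Trigger records that a sync was requested.
func (n *notifier) Trigger() {
	n.count++
}

// state is a hypothetical stand-in for the local state. It knows nothing
// about the syncer; it only holds a callback.
type state struct {
	TriggerSyncChanges func()
	status             string
}

// UpdateCheck changes the status and then invokes the callback, if one is
// set. The nil check is an assumption of this sketch.
func (s *state) UpdateCheck(status string) {
	s.status = status
	if s.TriggerSyncChanges != nil {
		s.TriggerSyncChanges()
	}
}

// wiredUpdates links a state to a notifier, performs n updates, and returns
// how many times the notifier was triggered.
func wiredUpdates(n int) int {
	syncer := &notifier{}
	st := &state{}
	// A method value: the function stays bound to this particular syncer.
	st.TriggerSyncChanges = syncer.Trigger
	for i := 0; i < n; i++ {
		st.UpdateCheck("passing")
	}
	return syncer.count
}

func main() {
	fmt.Println(wiredUpdates(3)) // 3
}
```

The state only sees a `func()`, so it can notify the syncer without importing or depending on it.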

@jakubgs
Contributor Author

jakubgs commented Nov 23, 2021

consul/agent/ae/ae.go

Lines 76 to 78 in 3666401

// SyncChanges allows triggering an immediate partial sync
// in a non-blocking way.
SyncChanges *Trigger

// Trigger implements a non-blocking event notifier. Events can be
// triggered without blocking and notifications happen only when the
// previous event was consumed.
type Trigger struct {
	ch chan struct{}
}

func NewTrigger() *Trigger {
	return &Trigger{make(chan struct{}, 1)}
}

func (t Trigger) Trigger() {
	select {
	case t.ch <- struct{}{}:
	default:
	}
}

Yeah, I really don't know what I'm looking at.
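
For readers puzzling over the same snippet: a buffered channel of size one plus a `select` with a `default` case gives a send that never blocks and collapses repeated signals into one pending notification. Below is a minimal sketch of that behaviour. The `Trigger` type is copied from the excerpt above; the `drain` helper is hypothetical and exists only so the sketch can count pending notifications.

```go
package main

import "fmt"

// Trigger mirrors the type quoted above.
type Trigger struct {
	ch chan struct{}
}

func NewTrigger() *Trigger {
	return &Trigger{make(chan struct{}, 1)}
}

// Trigger queues a notification unless one is already pending.
func (t Trigger) Trigger() {
	select {
	case t.ch <- struct{}{}:
	default: // buffer is full: a notification is already waiting, so drop this one
	}
}

// drain is a hypothetical helper for this sketch. It consumes and counts
// all pending notifications without blocking.
func drain(t *Trigger) int {
	n := 0
	for {
		select {
		case <-t.ch:
			n++
		default:
			return n
		}
	}
}

// coalesced fires three triggers, drains, fires one more, and drains again.
func coalesced() (int, int) {
	t := NewTrigger()
	t.Trigger()
	t.Trigger()
	t.Trigger()
	first := drain(t)
	t.Trigger()
	second := drain(t)
	return first, second
}

func main() {
	first, second := coalesced()
	fmt.Println(first, second) // 1 1
}
```

Three rapid triggers leave a single pending notification, and a new trigger is accepted again once the previous one has been consumed.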

@jakubgs
Contributor Author

jakubgs commented Nov 23, 2021

One notable thing is that for the TestAgent_NodeMaintenanceMode we create a NewTestAgent:

consul/agent/agent_test.go

Lines 3444 to 3445 in 3666401

t.Parallel()
a := NewTestAgent(t, "")

And while normal Agent defines its TriggerSyncChanges:
a.State.TriggerSyncChanges = a.sync.SyncChanges.Trigger

The TestAgent defines no such thing in its Start function:

consul/agent/testagent.go

Lines 134 to 136 in 3666401

// Start starts a test agent. It returns an error if the agent could not be started.
// If no error is returned, the caller must call Shutdown() when finished.
func (a *TestAgent) Start(t *testing.T) error {

So I'm not sure how this tracking of events is even supposed to work.

@jakubgs
Contributor Author

jakubgs commented Nov 23, 2021

I tried adding some prints in State.UpdateCheck() and found the memory address changes every time because of:

consul/agent/local/state.go

Lines 660 to 670 in 3666401

// Ensure we only mutate a copy of the check state and put the finalized
// version into the checks map when complete.
//
// Note that we are relying upon the earlier deferred mutex unlock to
// happen AFTER this defer. As per the Go spec this is true, but leaving
// this note here for the future in case of any refactorings which may not
// notice this relationship.
c = c.Clone()
defer func(c *CheckState) {
	l.checks[id] = c
}(c)

Which means we can't really monitor a copy previously fetched with a.State.Check().
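
This clone-then-replace behaviour would also explain the failing assertion in the earlier test attempt. Here is a minimal sketch of the effect, using simplified, hypothetical `State` and `CheckState` types rather than Consul's real ones: a pointer fetched before the update keeps pointing at the old copy, so the value has to be re-fetched to observe the change.

```go
package main

import "fmt"

// CheckState is a simplified, hypothetical version of the real type.
type CheckState struct {
	Status string
}

// Clone returns a shallow copy of the check state.
func (c *CheckState) Clone() *CheckState {
	cp := *c
	return &cp
}

// State is a simplified, hypothetical version of the local state.
type State struct {
	checks map[string]*CheckState
}

// Check returns the pointer currently stored for id.
func (s *State) Check(id string) *CheckState {
	return s.checks[id]
}

// UpdateCheck mutates a copy and then swaps it into the map, which is the
// pattern described in the excerpt above.
func (s *State) UpdateCheck(id, status string) {
	c := s.checks[id].Clone()
	c.Status = status
	s.checks[id] = c
}

// staleAndFresh returns the status seen through a pointer fetched before
// the update, and the status seen after re-fetching.
func staleAndFresh() (string, string) {
	s := &State{checks: map[string]*CheckState{
		"maint": {Status: "critical"},
	}}
	before := s.Check("maint")
	s.UpdateCheck("maint", "passing")
	return before.Status, s.Check("maint").Status
}

func main() {
	stale, fresh := staleAndFresh()
	fmt.Println(stale, fresh) // critical passing
}
```

In this PR's scenario the check is removed right after the update, so re-fetching would find nothing, which is why the test needs to observe the update some other way.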

@jakubgs
Contributor Author

jakubgs commented Nov 23, 2021

I'm confused, where in the definition of TestAgent is there State?

// TestAgent encapsulates an Agent with a default configuration and
// startup procedure suitable for testing. It panics if there are errors
// during creation or startup instead of returning errors. It manages a
// temporary data directory which is removed after shutdown.
type TestAgent struct {
// Name is an optional name of the agent.
Name string

I can't find it.

@jakubgs
Contributor Author

jakubgs commented Nov 23, 2021

Yeah. I don't get it, how exactly is State attached to TestAgent?

@jakubgs
Contributor Author

jakubgs commented Nov 24, 2021

@dhiaayachi can you explain to me how exactly State is part of TestAgent? And is there some existing way to watch events in the State?

@dnephin
Contributor

dnephin commented Nov 26, 2021

Hi @jakubgs, thanks for working on these tests!

The TestAgent has an embedded *Agent as the last field, so you can access local.State from TestAgent.Agent.State. But I'm not sure if that is our best option for emulating the behaviour of a watch.
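
As a minimal sketch of how that embedding works in Go (the types here are simplified, hypothetical stand-ins and not the real definitions): fields of an embedded pointer type are promoted, so `ta.State` and `ta.Agent.State` refer to the same value even though `State` never appears in the outer struct's field list.

```go
package main

import "fmt"

// State, Agent, and TestAgent are simplified stand-ins for illustration.
type State struct {
	Name string
}

type Agent struct {
	State *State
}

type TestAgent struct {
	// Name is an optional name of the agent.
	Name string

	// Embedded field: Agent's fields and methods are promoted to TestAgent.
	*Agent
}

// stateName reads State through the promoted field and reports whether it
// is the same pointer as the explicit path through Agent.
func stateName() (string, bool) {
	ta := &TestAgent{
		Name:  "test",
		Agent: &Agent{State: &State{Name: "local"}},
	}
	return ta.State.Name, ta.State == ta.Agent.State
}

func main() {
	name, same := stateName()
	fmt.Println(name, same) // local true
}
```

This is why calls such as `a.State.Check(...)` compile in the tests when `a` is a `*TestAgent`.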

I looked at the watch command, and I found the two API queries it is using here: https://github.com/hashicorp/consul/blob/v1.10.4/api/watch/funcs.go#L197-L199. I'm not sure which one of those two we need to use, but they should both work about the same.

Since we are testing DisableServiceMaintenance, and that is called in only one place by an API handler, it seems like using the HTTP API to drive some of the test is pretty reasonable.

To test the "watch" behaviour I think we'll probably need to use at least one goroutine. We can start a blocking query in a goroutine using httpHandlers.AgentChecks, and a query parameter of index= where the value is the last index of that check. You can call AgentChecks one time without index= to get the initial index.

That should block until the update is received. Then in the main test goroutine we can check to see if the event is received within some time window (probably 100ms is plenty of time to wait). I found an example of something like that here:
https://github.com/hashicorp/consul/blob/v1.10.4/agent/service_checks_test.go#L70-L75

I hope this information helps. Please do continue to ask questions if anything is not clear. We'll do our best to get back to you promptly, but sometimes there may be a few days' lag.

@kisunji kisunji added the waiting-reply Waiting on response from Original Poster or another individual in the thread label Nov 30, 2021
@Amier3
Contributor

Amier3 commented Jan 11, 2022

Hey @jakubgs,

Was the information given above able to help out? As Daniel said, we're happy to answer any additional questions.

@github-actions github-actions bot removed the waiting-reply Waiting on response from Original Poster or another individual in the thread label Jan 11, 2022
@jakubgs
Contributor Author

jakubgs commented Jan 11, 2022

I just came back from a Christmas break, and this is not high priority for me, more like nice-to-have.

Thanks for the help. I'll get to it when I get to it.

@siddarthkay siddarthkay force-pushed the update-maint-check-state branch from 113628c to 02c65d6 Compare May 11, 2024 16:28
@siddarthkay

Thanks for the input, @jakubgs.
I've reverted my changes in the latest commit.
My intent was to get feedback, but I haven't gotten any so far.

For the reviewers: I still need feedback on the approach in the commit here -> 02c65d6

cc @blake @zalimeni @dhiaayachi @dnephin @Amier3 @rboyer

Next steps: I'll add a few failed attempts at writing tests, and maybe we can find a way to prove our change actually works.
OR
We deploy a version of consul on one of our nodes and prove that @jakubgs's work fixes the unnecessary alerts when we put our hosts in maintenance mode.

Thank you!

@siddarthkay siddarthkay force-pushed the update-maint-check-state branch 2 times, most recently from 2337e2b to 0e910fa Compare May 20, 2024 03:41
@siddarthkay siddarthkay force-pushed the update-maint-check-state branch from 0e910fa to 2365f0d Compare May 30, 2024 07:33
@siddarthkay

Can we please get some feedback on this PR?
cc @blake @zalimeni @dhiaayachi @dnephin @Amier3 @rboyer
If you're not the right person, could you tag someone who can review this PR?

Thank you


github-actions bot commented Aug 2, 2024

This pull request has been automatically flagged for inactivity because it has not been acted upon in the last 60 days. It will be closed if no new activity occurs in the next 30 days. Please feel free to re-open to resurrect the change if you feel this has happened by mistake. Thank you for your contributions.

@github-actions github-actions bot added the meta/stale Automatically flagged for inactivity by stalebot label Aug 2, 2024
@jakubgs
Contributor Author

jakubgs commented Aug 2, 2024

It's not inactive, we are waiting for a review from the team.

@siddarthkay siddarthkay force-pushed the update-maint-check-state branch from 2365f0d to 4de87e1 Compare August 2, 2024 06:54
@github-actions github-actions bot removed the meta/stale Automatically flagged for inactivity by stalebot label Aug 6, 2024

github-actions bot commented Oct 5, 2024

This pull request has been automatically flagged for inactivity because it has not been acted upon in the last 60 days. It will be closed if no new activity occurs in the next 30 days. Please feel free to re-open to resurrect the change if you feel this has happened by mistake. Thank you for your contributions.

@github-actions github-actions bot added the meta/stale Automatically flagged for inactivity by stalebot label Oct 5, 2024
@jakubgs
Contributor Author

jakubgs commented Oct 5, 2024

It's not inactive because of us, we're waiting for a review.

@github-actions github-actions bot removed the meta/stale Automatically flagged for inactivity by stalebot label Oct 6, 2024

github-actions bot commented Dec 5, 2024

This pull request has been automatically flagged for inactivity because it has not been acted upon in the last 60 days. It will be closed if no new activity occurs in the next 30 days. Please feel free to re-open to resurrect the change if you feel this has happened by mistake. Thank you for your contributions.

@github-actions github-actions bot added the meta/stale Automatically flagged for inactivity by stalebot label Dec 5, 2024
@siddarthkay siddarthkay force-pushed the update-maint-check-state branch from 4de87e1 to 4e80718 Compare December 5, 2024 07:47
@siddarthkay siddarthkay requested a review from a team as a code owner December 5, 2024 07:47
@siddarthkay

I've just rebased this PR on the latest main. This is not stale; we are waiting on a review from the team at HashiCorp.

@github-actions github-actions bot removed the meta/stale Automatically flagged for inactivity by stalebot label Dec 6, 2024

github-actions bot commented Feb 5, 2025

This pull request has been automatically flagged for inactivity because it has not been acted upon in the last 60 days. It will be closed if no new activity occurs in the next 30 days. Please feel free to re-open to resurrect the change if you feel this has happened by mistake. Thank you for your contributions.

@github-actions github-actions bot added the meta/stale Automatically flagged for inactivity by stalebot label Feb 5, 2025
@jkirschner-hashicorp jkirschner-hashicorp added meta/staleproof Exempt from stalebot automation and removed meta/stale Automatically flagged for inactivity by stalebot labels Feb 5, 2025
jakubgs and others added 3 commits February 5, 2025 12:25
Without this update, a service using `consul watch` to monitor check
changes will never see maintenance mode being disabled.

Resolves: hashicorp#11330

Signed-off-by: Jakub Sokołowski <[email protected]>
Instead of deregistering the check when disabling maintenance mode, it seemed better to just update its status to passing.
This makes it easier to know when maintenance mode was disabled.
@siddarthkay siddarthkay force-pushed the update-maint-check-state branch from 4e80718 to ba79811 Compare February 5, 2025 06:56
Labels
meta/staleproof Exempt from stalebot automation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Disabling Maintenance mode does not trigger an event
8 participants