
Gradual rollout #110

Open
wants to merge 1 commit into main from the gradual-rollout branch

Conversation

@p-strusiewiczsurmacki-mobica (Contributor) commented Mar 27, 2024

This PR implements gradual rollout as described in #98

Two new CRDs are added (a rough sketch of the types follows below the list):

• NodeConfig - represents the configuration of a single node.
• NodeConfigProcess - represents the global state of the configuration process (can be provisioning or provisioned). It is used to detect whether the previous leader failed in the middle of the configuration process; if so, backups are restored.
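For orientation, a minimal sketch of what the Go types behind these two CRDs could look like. All field names here are illustrative assumptions; the actual definitions live in api/v1alpha1:

```go
// Package v1alpha1sketch is an illustrative approximation of the new CRDs,
// not the operator's actual api/v1alpha1 package.
package v1alpha1sketch

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// NodeConfig carries the combined network configuration for a single node.
type NodeConfig struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   NodeConfigSpec   `json:"spec,omitempty"`
	Status NodeConfigStatus `json:"status,omitempty"`
}

// NodeConfigSpec aggregates the per-node view of the cluster-wide resources.
type NodeConfigSpec struct {
	VRFRouteConfigurations      []string `json:"vrfRouteConfigurations,omitempty"`      // hypothetical field
	Layer2NetworkConfigurations []string `json:"layer2NetworkConfigurations,omitempty"` // hypothetical field
	RoutingTables               []string `json:"routingTables,omitempty"`               // hypothetical field
}

// NodeConfigStatus reflects the rollout state of this node's configuration.
type NodeConfigStatus struct {
	// ConfigStatus is one of "provisioning", "provisioned", "invalid".
	ConfigStatus string      `json:"configStatus,omitempty"`
	LastUpdate   metav1.Time `json:"lastUpdate,omitempty"`
}

// NodeConfigProcess records the global rollout state.
type NodeConfigProcess struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec NodeConfigProcessSpec `json:"spec,omitempty"`
}

type NodeConfigProcessSpec struct {
	State string `json:"state,omitempty"` // "provisioning" or "provisioned"
}
```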

A new pod is added - network-operator-configurator. This pod (a DaemonSet) is responsible for fetching vrfrouteconfigurations, layer2networkconfigurations and routingtables and combining them into a NodeConfig for each node.
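A rough, hypothetical sketch of that combining step, reusing the illustrative NodeConfigSpec from the sketch above (the real configurator applies per-node selection logic that is omitted here):

```go
// buildNodeConfigs is purely illustrative: it combines the cluster-wide
// resources into one NodeConfigSpec per node. The real configurator decides
// which resources apply to which node; here every node gets the full set.
func buildNodeConfigs(nodeNames []string, vrfs, layer2s, tables []string) map[string]NodeConfigSpec {
	configs := make(map[string]NodeConfigSpec, len(nodeNames))
	for _, name := range nodeNames {
		configs[name] = NodeConfigSpec{
			VRFRouteConfigurations:      vrfs,
			Layer2NetworkConfigurations: layer2s,
			RoutingTables:               tables,
		}
	}
	return configs
}
```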

The network-operator-worker pod, instead of fetching the separate config resources, now only fetches NodeConfig. After the configuration is applied and connectivity is verified, it backs up the config on disk. If connectivity is lost after deploying a new config, the configuration is restored from the local backup.
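A hedged sketch of that apply/verify/rollback flow (the type and the hook functions are assumptions for illustration, not the worker's actual code):

```go
package worker

import (
	"context"
	"fmt"
)

// NodeConfig stands in for the CRD here; the real type lives in api/v1alpha1.
type NodeConfig struct{ Raw string }

// applyNodeConfig sketches the worker's apply -> verify -> backup/rollback
// flow described above. The four function parameters are hypothetical hooks.
func applyNodeConfig(
	ctx context.Context,
	cfg NodeConfig,
	configure func(context.Context, NodeConfig) error,
	checkConnectivity func(context.Context) error,
	writeBackup func(NodeConfig) error,
	restoreBackup func(context.Context) error,
) error {
	if err := configure(ctx, cfg); err != nil {
		return fmt.Errorf("configuring node: %w", err)
	}
	if err := checkConnectivity(ctx); err != nil {
		// Connectivity lost: roll back to the last known-good config on disk.
		if rerr := restoreBackup(ctx); rerr != nil {
			return fmt.Errorf("rollback failed: %v (after %w)", rerr, err)
		}
		return fmt.Errorf("connectivity check failed, rolled back: %w", err)
	}
	// Connectivity verified: persist this config as the new on-disk backup.
	return writeBackup(cfg)
}
```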

For each node, up to 3 NodeConfig objects can be created:

  • <nodename> - current configuration
  • <nodename>-backup - backup configuration
  • <nodename>-invalid - last known invalid configuration

How does it work:

  1. network-operator-configurator starts and leader election takes place.
  2. The leader checks NodeConfigProcess to see whether any config is in the invalid or provisioning state, i.e. whether the previous leader died in the middle of the configuration process. If so, it reverts the configuration for all nodes using the backup configurations.
  3. When the user deploys vrfrouteconfigurations, layer2networkconfigurations and/or routingtables objects, the configurator will:
     • combine those into a separate NodeConfig for each node
     • set the NodeConfigProcess state to provisioning
  4. The configurator checks the new configs against known invalid configs. If any new config is equal to at least one known invalid config, the deployment is aborted.
  5. The configurator backs up the current config as <nodename>-backup and deploys the new config with status provisioning.
  6. network-operator-worker fetches the new config and configures the node. It then checks connectivity:
     • if connectivity is OK, it stores a backup on disk and updates the status of the config to provisioned
     • if connectivity was lost, it restores the configuration from the local backup and (if possible) updates the config status to invalid
  7. The configurator waits for the outcome of the config provisioning by checking the config status:
     • if the status was set by the worker to provisioned, it proceeds with deploying to the next node(s)
     • if the status was set to invalid, it aborts the deployment and reverts the changes on all nodes that were changed in this iteration
     • if it times out (e.g. the node was unable to update the config state for some reason), it invalidates the config and reverts the changes on all nodes

The configurator can be set to update more than 1 node concurrently. The number of nodes updated concurrently is controlled with the update-limit configuration flag (defaults to 1).
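A rough sketch of how such a batched rollout loop could look. Everything here is illustrative: the function names, polling, and error handling are assumptions, and the "revert all nodes touched in this iteration" step described above is only noted in a comment:

```go
package configurator

import (
	"context"
	"fmt"
	"time"
)

const (
	StatusProvisioning = "provisioning"
	StatusProvisioned  = "provisioned"
	StatusInvalid      = "invalid"
)

// rolloutNodes deploys configs in batches of updateLimit nodes and waits for
// each batch to report provisioned before continuing.
func rolloutNodes(ctx context.Context, nodes []string, updateLimit int,
	deploy func(ctx context.Context, node string) error,
	pollStatus func(ctx context.Context, node string) (string, error),
	timeout time.Duration,
) error {
	for start := 0; start < len(nodes); start += updateLimit {
		end := start + updateLimit
		if end > len(nodes) {
			end = len(nodes)
		}
		batch := nodes[start:end]
		for _, n := range batch {
			if err := deploy(ctx, n); err != nil {
				return fmt.Errorf("deploying config to %s: %w", n, err)
			}
		}
		// Wait until every node in the batch reports provisioned, or abort.
		deadline := time.Now().Add(timeout)
		for _, n := range batch {
			for {
				status, err := pollStatus(ctx, n)
				if err != nil {
					return fmt.Errorf("reading status of %s: %w", n, err)
				}
				if status == StatusProvisioned {
					break
				}
				if status == StatusInvalid || time.Now().After(deadline) {
					// The real configurator also reverts all nodes touched
					// in this iteration; that part is omitted here.
					return fmt.Errorf("rollout aborted: node %s has status %q", n, status)
				}
				time.Sleep(time.Second)
			}
		}
	}
	return nil
}
```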

@p-strusiewiczsurmacki-mobica marked this pull request as ready for review on April 12, 2024 17:27
@p-strusiewiczsurmacki-mobica changed the title from [WIP] Gradual rollout to Gradual rollout on Apr 12, 2024
@chdxD1 (Member) left a comment

I sadly can't comment on the lines directly because of GitHub:

https://github.com/telekom/das-schiff-network-operator/pull/110/files#diff-a96964d7107a0881e826cd2f6ac71e901d612df880362eb04cb0b54eb609b8e5L70-L90
this does not seem required anymore when the Layer2 Network Configurations are handled by the ConfigReconciler

chdxD1 previously requested changes Apr 17, 2024

@chdxD1 (Member) left a comment

Also, what happens when a node joins(!) or leaves the cluster? Especially when the joining or leaving happens in the middle of a rollout?

@p-strusiewiczsurmacki-mobica (Contributor, Author) commented Apr 17, 2024

@chdxD1

Also, what happens when a node joins(!) or leaves the cluster? Especially when the joining or leaving happens in the middle of a rollout?

If a node joins the cluster it should be configured in the next reconciliation loop iteration (I think; I will check that to be sure). But on node removal, the config will be tagged as invalid (as it will time out) and the configuration will be aborted. I'll try to fix that.

@chdxD1 (Member) commented Apr 18, 2024

https://github.com/telekom/das-schiff-network-operator/pull/110/files#diff-a96964d7107a0881e826cd2f6ac71e901d612df880362eb04cb0b54eb609b8e5L70-L90

It would be nice to watch node events in the central manager to create that NodeConfig before the next reconcile loop.

@p-strusiewiczsurmacki-mobica (Contributor, Author) commented Apr 29, 2024

@chdxD1
I had to make tons of changes, but I think the code should be much better now.
Each config now has an owner reference set to its node, so whenever a node is removed, all of its NodeConfig objects should be removed automatically as well.
As for added nodes, I've introduced a node_reconciler which watches nodes; whenever a node is added, it notifies the config_manager so it can trigger updates as soon as possible. It also tags deleted nodes as 'inactive' so they can be skipped if, for example, a node was deleted during the config deployment process.
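A minimal sketch of the owner-reference part, assuming controller-runtime's controllerutil helper is used; the function name and wiring here are illustrative, not the PR's actual code:

```go
package reconciler

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// setNodeOwner makes the given config object owned by its Node, so that
// deleting the Node garbage-collects the config object automatically.
func setNodeOwner(ctx context.Context, c client.Client, node *corev1.Node, cfg client.Object) error {
	if err := controllerutil.SetOwnerReference(node, cfg, c.Scheme()); err != nil {
		return fmt.Errorf("setting owner reference: %w", err)
	}
	if err := c.Update(ctx, cfg); err != nil {
		return fmt.Errorf("updating object: %w", err)
	}
	return nil
}
```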

Review threads (outdated, resolved): api/v1alpha1/nodeconfig_types.go, config/manager/kustomization.yaml, pkg/config_manager/config_manager.go, pkg/config_map/config_map.go
@p-strusiewiczsurmacki-mobica p-strusiewiczsurmacki-mobica force-pushed the gradual-rollout branch 4 times, most recently from 8d37361 to 945b6d3 Compare July 15, 2024 11:01
@schrej (Member) left a comment

Good work overall. I'd like to change a few things though, especially with regard to the gradual rollout mechanism.

Review threads (outdated, resolved): pkg/reconciler/config_reconciler.go, pkg/reconciler/nodeconfig_reconciler.go, cmd/operator/main.go, api/v1alpha1/networkconfigrevision_types.go, api/v1alpha1/nodenetworkconfig_types.go
@p-strusiewiczsurmacki-mobica p-strusiewiczsurmacki-mobica force-pushed the gradual-rollout branch 2 times, most recently from 17c6f79 to bb8c516 Compare August 19, 2024 12:22
@schrej (Member) left a comment

Found a few more things.

One thing I dislike about the current implementation is the lack of clear responsibilities between controllers. Most importantly, only one controller should be responsible for setting the status of a resource. This is not always the case at the moment. It's perfectly fine if the status takes a while to be set after a resource is created, and I'd prefer that over guessing the status when creating the resource.
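As an illustration of that principle (not code from this PR), a sketch where a single reconciler owns the status and writes it only through the status subresource, while every other controller treats it as read-only:

```go
package reconciler

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// nodeNetworkConfigStatusWriter is the one place that writes the status of a
// (hypothetical) NodeNetworkConfig; other controllers only read it and
// tolerate it being empty for a while after the resource is created.
type nodeNetworkConfigStatusWriter struct {
	client.Client
}

func (w *nodeNetworkConfigStatusWriter) setStatus(ctx context.Context, obj client.Object, mutate func()) error {
	mutate()
	// Status().Update touches only the status subresource, so the controller
	// that creates the spec and the controller that owns the status never
	// overwrite each other's fields.
	return w.Status().Update(ctx, obj)
}
```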

Review threads (outdated, resolved): api/v1alpha1/networkconfigrevision_types.go, api/v1alpha1/nodenetworkconfig_types.go, controllers/config_controller.go, pkg/healthcheck/healthcheck.go, pkg/reconciler/nodeconfig_reconciler.go, pkg/reconciler/nodenetworkconfig_reconciler.go
@schrej (Member) left a comment

Way cleaner now, good job.
A few more minor things :)

Review threads (outdated, resolved): pkg/reconciler/config_reconciler.go, pkg/reconciler/configrevision_reconciler.go, pkg/reconciler/nodenetworkconfig_reconciler.go
schrej previously approved these changes Aug 22, 2024

@schrej (Member) left a comment

lgtm now, thanks!
I'll leave it for @chdxD1 for final review

@chdxD1 (Member) left a comment

I am fine with the overall flow and how it has evolved.

I rolled it out in one of our dev clusters and it worked, after kicking it a bit.

For some reason (not logged!) the config revision went into the invalid state.

I would therefore ask you to add some logging, which would ease debugging in a lot of cases: when marking a config as invalid, when creating new revisions, when creating new nodeconfigs, etc.

```go
}

if err := crr.deployConfig(ctx, newConfig, currentConfig, node); err != nil {
	if errors.Is(err, InvalidConfigError) || errors.Is(err, context.DeadlineExceeded) {
```
Member:

a) I think InvalidConfigError is not used anywhere.
b) If context.DeadlineExceeded happens (e.g. the operator cannot communicate with the K8s API), I would prefer a retry instead of marking the revision as invalid.

Contributor Author:

a) Deleted InvalidConfigError.
b) I've explicitly added 3 retries. If it isn't able to connect within 3 tries, a reconciler error will be reported.

Also, all actions like creation/update/deletion of objects etc. should now be logged.
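A minimal sketch of such a retry helper, under assumptions (the helper name, backoff, and wiring are not from the PR; only the count of three attempts matches the comment above):

```go
package reconciler

import (
	"context"
	"fmt"
	"time"
)

// withRetries retries a transient operation (e.g. an API-server call) up to
// maxAttempts times before giving up and surfacing the error to the reconciler.
func withRetries(ctx context.Context, maxAttempts int, backoff time.Duration, op func(context.Context) error) error {
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = op(ctx); lastErr == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("aborted after %d attempt(s): %w", attempt, lastErr)
		case <-time.After(backoff):
			// Wait before the next attempt.
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}
```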

Comment on lines 167 to 184
```go
if wasConfigTimeoutReached(&configs[i]) {
	// If timeout was reached revision is invalid (but still counts as ongoing).
	invalid++
}
```
Member:

Okay, I am struggling with this a bit.

Imagine the following scenario:

  1. A new node joins the cluster (or the network operator is updated, etc.)
  2. network-operator is not running for various reasons for 2 minutes

I am thinking of a better solution to find out if a rollout has reached a timeout.

Some things to consider:

  • If a node cannot roll out the config at all because of other issues, the node should be marked as NotReady by the kubelet anyway, and our automation will take care of removing that node altogether (deleting the nodeconfig anyway)
  • A rollout can only time out once it was started

What would you think of an applyStarted field in the status which is written before a nodeconfig is rolled out (i.e. an agent has received the config but has not yet applied it)? This would also be used to update the lastUpdated timestamp, and if applyStarted is true and lastUpdated has reached the timeout of 2 minutes, the config is marked as failed.

Member:

Actually we already have provisioning, maybe just use that instead of applyStarted

Member:

But wait a second, that's already the case, let me look further...

Contributor Author:

What if, before invalidating the revision, we checked whether the node that 'timed out' is still available (e.g. it was not deleted and is not in the NotReady state)? If it was deleted or is NotReady, we could then simply not use its config's state to determine whether the revision should be invalidated.

Contributor Author:

* A rollout can only time out once it was started

Currently that's true, as only the agent that operates on the node updates the LastUpdate timestamp in the config's status, so if there's any value there, it means that the node started to reconfigure itself.

Member:

I see one issue: ConfigStatus defaults to an empty string and lastUpdate defaults to 0 (i.e. the zero timestamp). If we also check for "" in line 164 (case StatusProvisioning, "":) and then check for the timeout, it will always be true, I guess?

ConfigStatus == "" should be a separate case with a higher timeout, plus a check that lastUpdate is not 0.

Contributor Author:

OK, I've added a longer timeout for configs with an empty status, and added a check against lastUpdate.

Contributor Author:

Also, I made the timeout value configurable using flags.
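One plausible reading of the outcome of this thread, sketched in Go. The flag names, the timeout values, and the exact zero-lastUpdate handling are assumptions, not necessarily what the PR implements:

```go
package reconciler

import (
	"flag"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Timeouts are configurable via flags, as mentioned above; names are illustrative.
var (
	provisioningTimeout = flag.Duration("timeout", 2*time.Minute, "timeout for configs already in the provisioning state")
	emptyStatusTimeout  = flag.Duration("preconfig-timeout", 10*time.Minute, "longer timeout for configs whose status has not been set yet")
)

// configTimedOut sketches the revised check: an empty status is handled as a
// separate case with a longer timeout, and a zero lastUpdate (never touched by
// the agent) is not treated as a timeout.
func configTimedOut(status string, lastUpdate metav1.Time) bool {
	switch status {
	case "provisioning":
		return time.Since(lastUpdate.Time) > *provisioningTimeout
	case "":
		if lastUpdate.IsZero() {
			// The agent has not reported anything yet; do not count this as a timeout.
			return false
		}
		return time.Since(lastUpdate.Time) > *emptyStatusTimeout
	default:
		// provisioned / invalid configs are no longer subject to the timeout.
		return false
	}
}
```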

@p-strusiewiczsurmacki-mobica force-pushed the gradual-rollout branch 4 times, most recently from 8d37361 to 945b6d3, on July 15, 2024 11:01, then 2 times from c1c34e5 to 6ed1c53 on September 30, 2024 13:44
@p-strusiewiczsurmacki-mobica force-pushed the gradual-rollout branch 3 times, most recently from e00cb48 to cc119f7, on November 20, 2024 15:35
Signed-off-by: Patryk Strusiewicz-Surmacki <[email protected]>