Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Snapshot Controller's unbounded VolumeSnapshot list call on startup #1238

Merged
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions cmd/snapshot-controller/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -88,9 +88,9 @@ var version = "unknown"
func ensureCustomResourceDefinitionsExist(client *clientset.Clientset, enableVolumeGroupSnapshots bool) error {
condition := func(ctx context.Context) (bool, error) {
var err error
// List calls should return faster with a limit of 0.
// List calls should return faster with a limit of 1.
// We do not care about what is returned and just want to make sure the CRDs exist.
listOptions := metav1.ListOptions{Limit: 0}
listOptions := metav1.ListOptions{Limit: 1}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List snapshots is an expensive call. We should try to avoid listing all the snapshots.
The purpose for this call here is to ensure the CRD is installed. We don't really need to list the snapshots.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your question here: https://kubernetes.slack.com/archives/C0EG7JC6T/p1733776266305359
It's odd that {Limit: 0} would return 10,000 entries.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, at the very least can we wait for an informer sync of volumesnapshots instead of making a list call?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to #504 which added this ensureCustomResourceDefinitionsExist waiting for informer sync was not sufficient bc the controller showed as Ready even before informer had synced, so ensureCustomResourceDefinitionsExist was added to force the controller to immediately crash. So I guess it's not as simple as removing this function , I think the most correct solution would be to implement readyz probe which waited for informer sync. But that is a lot of effort so I think this PR is the best short term fix.

(As for the limit=0 behavior, indeed it is counterintuitive & I could not find where it's documented in apiserver code if at all It seems that since resourceversion is not set in this list call then the lImit=0 option gets plumbed to etcd list call here and it does document that "// If WithLimit is given a 0 limit, it is treated as no limit." https://github.com/etcd-io/etcd/blob/854bdd646c8ce50b879f79f403726c7ab0dc726c/client/v3/op.go#L345. Anyway, the details are unimportant, the point is we observe that limit=0 does the opposite of intended!!)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just saw that external-provisioner has a healt hz check kubernetes-csi/external-provisioner@52c3575#diff-db1b4e664fc5c6203530702df4e7c6eade0fd481ef5badffbc0bb4a92648d36e. So in short mb snapshot-controller should have also a healthz check which starts serving OK only after informer synced.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed in the CSI Implementation meeting today and decided we should not go with the "health check" fix. If the CRDs are not deployed, informer will returns errors. It's not going to be synced. So the current fix is good.


// scoping to an empty namespace makes `List` work across all namespaces
_, err = client.SnapshotV1().VolumeSnapshots("").List(ctx, listOptions)
Expand Down
Loading