Tests are flaky: ClusterAddonReconciler [AfterEach] Basic test applies the helm chart #257

Open
guettli opened this issue Sep 5, 2024 · 0 comments · May be fixed by #258
guettli commented Sep 5, 2024

/kind bug

In CI a test failed:

https://github.com/SovereignCloudStack/cluster-stack-operator/actions/runs/10717474962/job/29717241629?pr=255

[FAIL] ClusterAddonReconciler [AfterEach] Basic test applies the helm chart
/home/runner/work/cluster-stack-operator/cluster-stack-operator/internal/test/integration/workloadcluster/cluster_addon_test.go:109

After pressing "re-run test" in the GitHub web UI, the test passed.

This means the test is flaky: it fails and succeeds without any code change.

Here is the successful run: https://github.com/SovereignCloudStack/cluster-stack-operator/actions/runs/10717474962?pr=255

What did you expect to happen:

I expect tests to not be flaky.

Fixing flakiness

I used this script to run the tests again and again:

#!/bin/bash

trap 'echo "ERROR: A command has failed. Exiting the script. Line was ($0:$LINENO): $(sed -n "${LINENO}p" "$0")"; exit 3' ERR
set -Eeuo pipefail

i=0
while make test-unit; do
    i=$((i + 1))
    echo "Test run $i"
    date
    sleep 1
    echo "========================================================================================="
    echo "========================================================================================="
    echo
done

The first run worked fine; the second run failed:

Summarizing 1 Failure:
  [FAIL] ClusterStackReconciler Test with provider Tests with multiple versions [It] checks ProviderClusterstackrelease is deleted when version is removed from spec
  /home/guettli/syself/cluster-stack-operator-public/internal/controller/clusterstack_controller_test.go:616

Ran 40 of 40 Specs in 10.203 seconds
FAIL! -- 39 Passed | 1 Failed | 0 Pending | 0 Skipped

From time to time this fails, too:


Summarizing 1 Failure:
  [FAIL] ClusterStackReconciler Test with provider Basic test [It] creates the cluster stack release object with cluster stack auto subscribe false
  /home/guettli/syself/cluster-stack-operator-public/internal/controller/clusterstack_controller_test.go:487

Ran 40 of 40 Specs in 8.498 seconds
FAIL! -- 39 Passed | 1 Failed | 0 Pending | 0 Skipped

Sometimes this fails:


Summarizing 1 Failure:
  [FAIL] ClusterAddonReconciler Basic test [It] creates the clusterAddon object
  /home/guettli/syself/cluster-stack-operator-public/internal/controller/clusteraddon_controller_test.go:112

Ran 40 of 40 Specs in 9.054 seconds
FAIL! -- 39 Passed | 1 Failed | 0 Pending | 0 Skipped

How to make the test always fail

If the test waits until the clusterStackRelease v2 is created, then the test always fails.

			/////////////////////////////////////////////////////////////////////
			// Flaky test
			FIt("checks ProviderClusterstackrelease is deleted when version is removed from spec", func() {
				fmt.Println("itttttttttttttttttttttttttttttttttttttttttt")
				ph, err := patch.NewHelper(clusterStack, testEnv)
				Expect(err).ShouldNot(HaveOccurred())

				// new
				providerclusterStackReleaseRefV2 := &corev1.ObjectReference{
					APIVersion: "infrastructure.clusterstack.x-k8s.io/v1alpha1",
					Kind:       "TestInfrastructureProviderClusterStackRelease",
					Name:       clusterStackReleaseTagV2,
					Namespace:  testNs.Name,
				}

				Eventually(func() bool {
					obj, err := external.Get(ctx, testEnv.GetClient(), providerclusterStackReleaseRefV2, testNs.Name)
					if err != nil {
						fmt.Printf("foundProviderclusterStackReleaseRef is not found. %s\n", err.Error())
						return false
					}
					fmt.Printf("foundProviderclusterStackReleaseRef is found. %s %+v %+v\n",
						obj.GetName(), obj.GetFinalizers(), obj.GetOwnerReferences())
					return true
				}, timeout, interval).Should(BeTrue())

Instead of the Eventually block, you can use time.Sleep(time.Second * 1); it has the same effect. The test will then fail here:

				Eventually(func() bool {
					return apierrors.IsNotFound(testEnv.Get(ctx, clusterStackReleaseTagV2Key, &csov1alpha1.ClusterStackRelease{}))
				}, timeout, interval).Should(BeTrue())

I think the real error is in BeforeEach(): it does not wait until the resources are created.

			BeforeEach(func() {
				clusterStack.Spec = csov1alpha1.ClusterStackSpec{
					Provider:          "docker",
					Name:              "ferrol",
					KubernetesVersion: "1.27",
					Versions:          []string{"v1", "v2"},
					AutoSubscribe:     false,
					ProviderRef: &corev1.ObjectReference{
						APIVersion: "infrastructure.clusterstack.x-k8s.io/v1alpha1",
						Kind:       "TestInfrastructureProviderClusterStackReleaseTemplate",
						Name:       "provider-test1",
						Namespace:  testNs.Name,
					},
				}
				Expect(testEnv.Create(ctx, clusterStack)).To(Succeed())

				cs, err := clusterstack.New(clusterStack.Spec.Provider, clusterStack.Spec.Name, clusterStack.Spec.KubernetesVersion, "v1")
				Expect(err).To(BeNil())
				clusterStackReleaseTagV1 = cs.String()

				cs, err = clusterstack.New(clusterStack.Spec.Provider, clusterStack.Spec.Name, clusterStack.Spec.KubernetesVersion, "v2")
				Expect(err).To(BeNil())
				clusterStackReleaseTagV2 = cs.String()

				clusterStackReleaseTagV1Key = types.NamespacedName{Name: clusterStackReleaseTagV1, Namespace: testNs.Name}
				clusterStackReleaseTagV2Key = types.NamespacedName{Name: clusterStackReleaseTagV2, Namespace: testNs.Name}
			})

@janiskemper what do you think: Does it make sense to wait in BeforeEach, so that the tests get a stable environment?

The good news: the test fails reproducibly with FIt(). This means the flakiness is inside this single test (because no other test gets called).

Current conclusion

The test (It("checks ProviderClusterstackrelease is deleted when version is removed from spec")) has always been wrong. It only failed in some edge cases (roughly every 8th run), when the operations happened to execute in the order that exposes the bug.

The deletionTimestamp does not get propagated from the parent object to the child object, because in envtest the garbage collector is not running.

Related kubebuilder docs: https://book.kubebuilder.io/reference/envtest.html#testing-considerations

I asked in #controller-runtime how other developers handle this.

I will update the code so that it checks the DeletionTimestamp of the parent object and then checks the ownerRef on the child object. I will remove the assertion that the child object gets deleted.

@guettli guettli linked a pull request Sep 9, 2024 that will close this issue