Tests are flaky: ClusterAddonReconciler [AfterEach] Basic test applies the helm chart #257

guettli · 2024-09-05T09:32:41Z

/kind bug

In CI a test failed:

https://github.com/SovereignCloudStack/cluster-stack-operator/actions/runs/10717474962/job/29717241629?pr=255

[FAIL] ClusterAddonReconciler [AfterEach] Basic test applies the helm chart
/home/runner/work/cluster-stack-operator/cluster-stack-operator/internal/test/integration/workloadcluster/cluster_addon_test.go:109

After pressing "re-run test" in the Github web UI, the test was fine.

This means that the test is flaky since it fails and succeeds without any code change.

Here is the successful run: https://github.com/SovereignCloudStack/cluster-stack-operator/actions/runs/10717474962?pr=255

What did you expect to happen:

I expect tests to not be flaky.

Fixing flakyness

I used that script to run tests again and again:

#!/bin/bash

trap 'echo "ERROR: A command has failed. Exiting the script. Line was ($0:$LINENO): $(sed -n "${LINENO}p" "$0")"; exit 3' ERR
set -Eeuo pipefail

i=0
while make test-unit; do
    i=$((i + 1))
    echo "Test run $i"
    date
    sleep 1
    echo "=========================================================================================
    echo "=========================================================================================
    echo
done

Works fine, the second run failed:

Summarizing 1 Failure:
  [FAIL] ClusterStackReconciler Test with provider Tests with multiple versions [It] checks ProviderClusterstackrelease is deleted when version is removed from spec
  /home/guettli/syself/cluster-stack-operator-public/internal/controller/clusterstack_controller_test.go:616

Ran 40 of 40 Specs in 10.203 seconds
FAIL! -- 39 Passed | 1 Failed | 0 Pending | 0 Skipped

From time to time this fails, too:


Summarizing 1 Failure:
  [FAIL] ClusterStackReconciler Test with provider Basic test [It] creates the cluster stack release object with cluster stack auto subscribe false
  /home/guettli/syself/cluster-stack-operator-public/internal/controller/clusterstack_controller_test.go:487

Ran 40 of 40 Specs in 8.498 seconds
FAIL! -- 39 Passed | 1 Failed | 0 Pending | 0 Skipped

sometimes this fails:


Summarizing 1 Failure:
  [FAIL] ClusterAddonReconciler Basic test [It] creates the clusterAddon object
  /home/guettli/syself/cluster-stack-operator-public/internal/controller/clusteraddon_controller_test.go:112

Ran 40 of 40 Specs in 9.054 seconds
FAIL! -- 39 Passed | 1 Failed | 0 Pending | 0 Skipped

How to make the test always fail

If the test waits until the clusterStackRelease v2 is created, then the test always fails.

			/////////////////////////////////////////////////////////////////////
			// Flaky test
			FIt("checks ProviderClusterstackrelease is deleted when version is removed from spec", func() {
				fmt.Println("itttttttttttttttttttttttttttttttttttttttttt")
				ph, err := patch.NewHelper(clusterStack, testEnv)
				Expect(err).ShouldNot(HaveOccurred())

				// new
				providerclusterStackReleaseRefV2 := &corev1.ObjectReference{
					APIVersion: "infrastructure.clusterstack.x-k8s.io/v1alpha1",
					Kind:       "TestInfrastructureProviderClusterStackRelease",
					Name:       clusterStackReleaseTagV2,
					Namespace:  testNs.Name,
				}

				Eventually(func() bool {
					obj, err := external.Get(ctx, testEnv.GetClient(), providerclusterStackReleaseRefV2, testNs.Name)
					if err != nil {
						fmt.Printf("foundProviderclusterStackReleaseRef is not found. %s\n", err.Error())
						return false
					}
					fmt.Printf("foundProviderclusterStackReleaseRef is found. %s %+v %+v\n",
						obj.GetName(), obj.GetFinalizers(), obj.GetOwnerReferences())
					return true
				}, timeout, interval).Should(BeTrue())

Instead of the Eventually block, you can use time.Sleep(time.Second * 1) it has the same effect. This will fail here:

				Eventually(func() bool {
					return apierrors.IsNotFound(testEnv.Get(ctx, clusterStackReleaseTagV2Key, &csov1alpha1.ClusterStackRelease{}))
				}, timeout, interval).Should(BeTrue())

I think the real error is in BeforeEach(). It does not wait until the resources get created.

			BeforeEach(func() {
				clusterStack.Spec = csov1alpha1.ClusterStackSpec{
					Provider:          "docker",
					Name:              "ferrol",
					KubernetesVersion: "1.27",
					Versions:          []string{"v1", "v2"},
					AutoSubscribe:     false,
					ProviderRef: &corev1.ObjectReference{
						APIVersion: "infrastructure.clusterstack.x-k8s.io/v1alpha1",
						Kind:       "TestInfrastructureProviderClusterStackReleaseTemplate",
						Name:       "provider-test1",
						Namespace:  testNs.Name,
					},
				}
				Expect(testEnv.Create(ctx, clusterStack)).To(Succeed())

				cs, err := clusterstack.New(clusterStack.Spec.Provider, clusterStack.Spec.Name, clusterStack.Spec.KubernetesVersion, "v1")
				Expect(err).To(BeNil())
				clusterStackReleaseTagV1 = cs.String()

				cs, err = clusterstack.New(clusterStack.Spec.Provider, clusterStack.Spec.Name, clusterStack.Spec.KubernetesVersion, "v2")
				Expect(err).To(BeNil())
				clusterStackReleaseTagV2 = cs.String()

				clusterStackReleaseTagV1Key = types.NamespacedName{Name: clusterStackReleaseTagV1, Namespace: testNs.Name}
				clusterStackReleaseTagV2Key = types.NamespacedName{Name: clusterStackReleaseTagV2, Namespace: testNs.Name}
			})

@janiskemper what do you think: Does it make sense to wait in BeforeEach, so that the tests get a stable environment?

The good news: The test fails reproducible with FIt() this means. The flakiness is inside this single test (because no other test gets called).

Current conclusion

The test (It("checks ProviderClusterstackrelease is deleted when version is removed from spec")) has always been wrong. In some edge cases (roughly every 8th run), it failed because the correct sequence was executed.

The deletionTimestamp does not get propagated from the parent-object to the child-object, because in envTest the GC is not running.

Related kubebuilder docs: https://book.kubebuilder.io/reference/envtest.html#testing-considerations

I asked at #controller-runtime how other developers handle it.

I will update the code, so that I check for the DeletionTimestamp of the parent-object, and then check the ownerRef at the child-object. I will remove the test that the child-object gets deleted.

The text was updated successfully, but these errors were encountered:

guettli linked a pull request Sep 9, 2024 that will close this issue

🐛 Fix flaky tests #258

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tests are flaky: ClusterAddonReconciler [AfterEach] Basic test applies the helm chart #257

Tests are flaky: ClusterAddonReconciler [AfterEach] Basic test applies the helm chart #257

guettli commented Sep 5, 2024 •

edited

Loading

Tests are flaky: ClusterAddonReconciler [AfterEach] Basic test applies the helm chart #257

Tests are flaky: ClusterAddonReconciler [AfterEach] Basic test applies the helm chart #257

Comments

guettli commented Sep 5, 2024 • edited Loading

Fixing flakyness

How to make the test always fail

Current conclusion

guettli commented Sep 5, 2024 •

edited

Loading