CI: Improve e2e tests reliability #343

Merged
5 commits merged into confidential-containers:main on Feb 6, 2024

Conversation

ldoktor
Contributor

@ldoktor ldoktor commented Jan 30, 2024

There are several improvements to the e2e pipeline to avoid odd return values as well as to interrupt on hangs. Last but not least, I added extra messages that should make the logs easier to read.

I'm definitely open to suggestions to remove/improve individual commits....

@ldoktor ldoktor force-pushed the reliability branch 2 times, most recently from 554b7d5 to d84dc6d on January 30, 2024 18:26
@ldoktor ldoktor requested a review from wainersm January 30, 2024 19:17
Member

@portersrc portersrc left a comment

looks great

undo_changes
fi
[ "$RET" -ne 0 ] && echo && echo "::error:: Testing failed with $RET" || echo "::info:: Testing passed"
Member

When undo is true and RET is nonzero, we'll get a couple "testing failed" error messages (one before undoing changes, and one after). I guess that's OK, though this could be tweaked if you want.

Contributor Author

That was my intention: you get notified that things went wrong (as it might not be obvious from the logs) and that the changes are about to be undone. Then a bunch of cleanup output is spilled, ending with the Ansible message that everything went well, after which this second "Testing failed" message emphasizes that, although Ansible is happy, the testing actually did not go well.

We are seeing stalled jobs; let's ensure our steps won't take longer
than expected.

Signed-off-by: Lukáš Doktor <[email protected]>
Ensure each test won't take longer than the expected threshold by setting a
10m timeout. Note one can only set a single timeout for all tests within a
single file.

Signed-off-by: Lukáš Doktor <[email protected]>
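
For illustration, a minimal sketch of how such a per-file timeout could look, assuming the tests are bats files and bats-core's BATS_TEST_TIMEOUT (recent bats-core versions) is the mechanism; the test name and body are placeholders:

#!/usr/bin/env bats
# Sketch: BATS_TEST_TIMEOUT is read before the tests start, so the single
# value set here applies to every test in this file (there is no
# per-test knob). 600s corresponds to the 10m mentioned above.
export BATS_TEST_TIMEOUT=600

@test "operator pods become ready" {
    # placeholder body; the real tests poll kubectl here
    run true
    [ "$status" -eq 0 ]
}
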
Recent issues in CI indicate that kubectl might sometimes fail, which
results in wait_for_process interrupting the loop. Let's improve the
command to ensure the kubectl command passed and only then grep for the
(un)expected output.

Note the positive commands do not need this treatment, as the output
should not contain the pod names on failure.

Fixes: confidential-containers#339

Signed-off-by: Lukáš Doktor <[email protected]>
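
As a rough sketch of that pattern (names such as op_ns and pods_gone are illustrative, not necessarily the PR's exact code), the idea is to check kubectl's own exit status first and only then grep, so a transient kubectl error keeps the wait loop polling:

op_ns="confidential-containers-system"  # illustrative namespace

# Succeeds only once kubectl itself worked AND the pod is no longer listed.
pods_gone() {
    local out
    out=$(sudo -E kubectl get pods -n "$op_ns") || return 1  # kubectl failed: keep waiting
    ! grep -q -e cc-operator-daemon-install <<< "$out"
}

# Illustrative poll loop; the real scripts delegate this to wait_for_process.
for _ in $(seq 1 120); do
    pods_gone && break
    sleep 5
done
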
On test failure we might still execute a cleanup that spills a bunch of
text, making it not obvious whether the testing passed or failed. Note
the return code is already correct; this change only helps users notice
that things didn't go well.

Signed-off-by: Lukáš Doktor <[email protected]>
Replace the simple "DEBUG|ERROR|INFO" prefixes with the GitHub Actions
workflow commands ("::debug::", ...) as it should improve the GH logs'
readability while leaving the bash output still parsable by humans.

Signed-off-by: Lukáš Doktor <[email protected]>
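
For illustration, small helpers in that spirit (the function names are made up for this sketch; ::debug::, ::notice::, ::warning:: and ::error:: are the workflow commands GitHub Actions recognizes):

# Sketch of log helpers emitting GitHub Actions workflow commands.
# GH renders them as annotations / collapsed debug lines; in a plain
# terminal they remain readable prefixes.
debug() { echo "::debug:: $*"; }
info()  { echo "::notice:: $*"; }
warn()  { echo "::warning:: $*"; }
error() { echo "::error:: $*"; }

info "Starting the e2e run"
debug "runtimeclass=kata-qemu"
error "Testing failed with 1"
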
@ldoktor
Contributor Author

ldoktor commented Jan 31, 2024

Changes:

  • Rebased
  • Amended commit message heading to use "tests.e2e" prefix rather than "ci"

Hopefully there are no other changes; I tried hard not to introduce new "spaces" rather than "tabs" :-)

Member

@stevenhorsman stevenhorsman left a comment

LGTM. Thanks Lukas!

@@ -62,7 +62,7 @@ jobs:
 if [ $RUNNING_INSTANCE = "s390x" ]; then
 args=""
 fi
-./run-local.sh -r "${{ matrix.runtimeclass }}" "${args}"
+./run-local.sh -t -r "${{ matrix.runtimeclass }}" "${args}"
Member

Yeah, it makes sense to have the timeout in the script because we don't map these steps to GitHub job steps; otherwise the timeouts could be set in the GitHub workflows. This scenario might change when we address #309.

run() {
duration=$1; shift
if [ "$timeout" == "true" ]; then
timeout $duration "$@"
Member

What if it prints a friendly message (e.g. "Run timed out after XX") when it times out, i.e. when $? -eq 124?

Contributor Author

That can open a can of worms, as the script itself can return 124, so we'd have to add logic to get the actual elapsed time (we could use $SECONDS, so it's not that extensive, but still) and then report "Run probably timed out after XXXs" when the timeout seems plausible. Do you want me to add it, or are we going to rely on log timestamps only?

Member

Hmm... we'd better leave it as is. If we start seeing too many timeouts and it proves confusing, then we may change it.
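
For reference, the tweak discussed above (and ultimately not adopted) could look roughly like this, assuming $duration is given in plain seconds and $timeout is the existing "true"/"false" flag:

run() {
    local duration=$1; shift
    local start=$SECONDS rc=0
    if [ "$timeout" == "true" ]; then
        timeout "$duration" "$@" || rc=$?
    else
        "$@" || rc=$?
    fi
    # timeout(1) exits 124 when it killed the command, but the command
    # itself may also return 124; only blame the timeout when the
    # elapsed time backs that up.
    if [ "$rc" -eq 124 ] && [ "$((SECONDS - start))" -ge "$duration" ]; then
        echo "::warning:: Run probably timed out after $((SECONDS - start))s"
    fi
    return "$rc"
}
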

#
local cmd="! sudo -E kubectl get pods -n $op_ns |"
cmd+="grep -q -e cc-operator-daemon-install"
# (ensure failing kubectl keeps iterating)
Member

Good catch!

@wainersm
Member

wainersm commented Feb 2, 2024

Hi @ldoktor ! I left one suggestion that you might accept. Everything else looks good.

Member

@wainersm wainersm left a comment

Thanks @ldoktor

@wainersm wainersm merged commit ed63a3d into confidential-containers:main Feb 6, 2024
10 checks passed