
Add email sending step for failed scheduled CI runs #2115

Merged
merged 1 commit into main from scheduled-ci-email on Dec 14, 2024

Conversation

audiodude
Member

We have been running scheduled CI tests every day for quite a while. However, it is widely documented that when these tests fail, the only person who gets an email is the "last person who touched the yml" or something along those lines.

This PR adds a step that sends an email if the action was scheduled and there is a failure, to the comma-separated list of email addresses in the repo secret CI_SCHEDULED_FAIL_EMAIL_TO.
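For context, a rough sketch of what such a step might look like (the action's input names and the MAILGUN_* secret names are assumptions, not copied from this PR's diff; only CI_SCHEDULED_FAIL_EMAIL_TO is from the PR):

```yaml
# Hypothetical sketch of the notification step; input names and MAILGUN_*
# secret names are assumed, not taken verbatim from this PR.
- name: Notify maintainers of scheduled CI failure
  # Run only when an earlier step failed and the workflow was cron-triggered
  if: ${{ failure() && github.event_name == 'schedule' }}
  uses: vineetchoudhary/mailgun-action@master      # ref assumed for illustration
  with:
    api-key: ${{ secrets.MAILGUN_API_KEY }}        # assumed secret name
    domain: ${{ secrets.MAILGUN_DOMAIN }}          # assumed secret name
    to: ${{ secrets.CI_SCHEDULED_FAIL_EMAIL_TO }}  # comma-separated recipients
    subject: "[mwoffliner] Scheduled CI run failed"
    body: "Scheduled CI failed for ${{ github.repository }}: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
```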

@kelson42
Collaborator

@audiodude I guess merging this PR implies that I need to put in the 3 secrets to make this PR helpful?


codecov bot commented Dec 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.24%. Comparing base (1763a23) to head (a55b273).
Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2115      +/-   ##
==========================================
- Coverage   75.43%   75.24%   -0.20%     
==========================================
  Files          41       41              
  Lines        3188     3195       +7     
  Branches      703      706       +3     
==========================================
- Hits         2405     2404       -1     
- Misses        666      674       +8     
  Partials      117      117              

☔ View full report in Codecov by Sentry.

@audiodude
Member Author

audiodude commented Dec 10, 2024

@audiodude I guess merging this PR implies that I need to put in the 3 secrets to make this PR helpful?

@kelson42 No, I already populated them. And I got a test email too. Probably want to edit the body to more clearly indicate that the email represents a CI failure.

@audiodude audiodude marked this pull request as draft December 10, 2024 06:05
@kelson42
Collaborator

@rgaudin @benoit74 I would like your feedback on:

  • the approach to solving a problem we have everywhere
  • the compatibility of the mailer platform and credentials with SPF

@audiodude audiodude force-pushed the scheduled-ci-email branch 4 times, most recently from d2b550e to cdd3f70, December 10, 2024 18:11
@audiodude audiodude marked this pull request as ready for review December 10, 2024 18:34
@benoit74
Contributor

Which mailgun domain is used? This should be the one owned and paid for by Kiwix, from my PoV.

Are we sure this is working? Usually when a step fails in a job, the next steps are cancelled, so I don't get why this mail-sending action would be run in case of failure.

Other than that, I'm not really convinced by this PR's approach from a strategic PoV:

  • how do we maintain a list of proper emails to notify, especially across all openzim and kiwix repos?
  • since the list of emails to notify is a secret, how do we update it wisely? It is very hard to be sure we notify everyone who needs to be notified, and even harder to be sure that we do not remove someone by mistake when we update the secret to add someone (we do not have access to the previous secret value)
  • are we sure we can't use GitHub's notification system? I do not use emails anymore, and I would prefer not to be forced to use them again for these notifications
  • I'm not really convinced by the vineetchoudhary/mailgun-action action, which raises strange warnings in the console, is probably going to be deprecated soon (see Node.js 12 actions are deprecated warning vineetchoudhary/mailgun-action#16, unless I'm mistaken) and comes from a mostly unidentified developer who might get access to our Mailgun API key if they inject a side channel into the action

Do we have any literature on how other organizations are dealing with this issue? I wasn't aware of this problem, and it drives me crazy that it works the way it does. Not being able to be easily notified of workflow failures seems like a nightmare for any serious maintainer.

@audiodude
Member Author

Which mailgun domain is used? This should be the one owned and paid for by Kiwix, from my PoV.

I'm using mwoffliner.0-z-0.com, which I set up just for this purpose. IMO it doesn't matter what domain it comes from, because it's an automated message and it's for developers, not user-facing.

Are we sure this is working? Usually when a step fails in a job, the next steps are cancelled, so I don't get why this mail-sending action would be run in case of failure.

Yes, the if: ${{ failure() }} clause means it gets run if steps before it fail. https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/evaluate-expressions-in-workflows-and-actions#failure
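As a generic illustration (not the actual yml from this PR): steps default to an implicit if: success(), so they are skipped once something fails, while failure() flips that and makes the step run exactly when an earlier step has failed.

```yaml
jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      - name: Run tests
        run: npm test

      # Implicitly `if: success()`: skipped whenever "Run tests" fails.
      - name: Only on success
        run: echo "all green"

      # `failure()` runs the step when any earlier step in the job failed,
      # which is what lets a mail-sending step fire after broken tests.
      - name: Only on failure
        if: ${{ failure() }}
        run: echo "an earlier step failed"
```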

Other than that, I'm not really convinced by this PR's approach from a strategic PoV:

  • how do we maintain a list of proper emails to notify, especially across all openzim and kiwix repos?
  • since the list of emails to notify is a secret, how do we update it wisely? It is very hard to be sure we notify everyone who needs to be notified, and even harder to be sure that we do not remove someone by mistake when we update the secret to add someone (we do not have access to the previous secret value)

I don't think we need to maintain any such list in a central way, though I can see that if a new developer wanted to be added to the list, it would be very difficult. We don't need to use secrets; we can hardcode email addresses into the yml file.
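Concretely, the only difference would be whether the recipient list is read from a secret or written inline; a tiny illustrative snippet (the to input name is assumed, the addresses are made up):

```yaml
with:
  # current approach: recipients kept in a repo secret
  to: ${{ secrets.CI_SCHEDULED_FAIL_EMAIL_TO }}
  # alternative: hardcode recipients directly in the yml
  # to: dev-one@example.org,dev-two@example.org
```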

  • are we sure we can't use GitHub's notification system? I do not use emails anymore, and I would prefer not to be forced to use them again for these notifications

Yes, I have spent time researching this, and it is a limitation of scheduled tasks that failure messages only go to the last person who "touched" the yml file. See https://github.com/orgs/community/discussions/25351

  • I'm not really convinced by the vineetchoudhary/mailgun-action action, which raises strange warnings in the console, is probably going to be deprecated soon (see Node.js 12 actions are deprecated warning vineetchoudhary/mailgun-action#16, unless I'm mistaken) and comes from a mostly unidentified developer who might get access to our Mailgun API key if they inject a side channel into the action

Agreed, it's suboptimal. There are a few options on the marketplace (https://github.com/marketplace?query=mailgun); there's an alternative here: https://github.com/marketplace/actions/action-mailgun-email

Do we have any literature on how other organizations are dealing with this issue? I wasn't aware of this problem, and it drives me crazy that it works the way it does. Not being able to be easily notified of workflow failures seems like a nightmare for any serious maintainer.

The post linked above has people discussing what they do, but the only reliable method is to have a separate email sending step.


Overall, I don't necessarily believe that this is a problem that needs to be "solved" with a properly researched and documented, Kiwix-wide pattern. The problem we're facing is that mwoffliner CI breaks randomly because the tests rely on hardcoded values for fetched data that can change on the actual wikis being tested. When that happens, it's impossible to merge new PRs because CI is broken. So the process of merging a small change involves 1) writing the PR, 2) fixing the unrelated breakage in CI, 3) merging. Instead, if CI runs daily and sends an email on failure, we can proactively fix it outside of the development of new PRs.

This PR works for me for this repo. I agree that it's not appropriate as a general pattern that should be adopted for all of Kiwix.

@audiodude audiodude merged commit 5388d72 into main Dec 14, 2024
6 checks passed
@audiodude audiodude deleted the scheduled-ci-email branch December 14, 2024 16:20