ms-output - do not break out producer and consumer loops #12194
base: master
Conversation
looking good
@mapellidario please find some comments along the code.
In addition, I am curious to know how the document information makes it into the alert. I know there are limits (number of bytes?), so could you please share one possible alert that would be generated (provided it does not contain sensitive information)? Thanks
alertDescription = "wf: {}\n\nmsg: {}\n\nex: {}\n\n{}".format(workflowname, msg, ex_str, document_str) | ||
self.logger.error("%s\n%s\n%s", alertName, alertSummary, alertDescription) | ||
if self.msConfig["sendNotification"]: | ||
self.sendAlert(alertName, alertSeverity, alertSummary, alertDescription, self.alertServiceName) |
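On the byte-limit question above: a hypothetical guard (not part of this PR; the limit value and helper name are assumptions) could truncate the document dump before it is formatted into the description:

```python
# Hypothetical sketch: cap the serialized document so the alert payload stays
# small. MAX_DOC_BYTES is an assumed value; the actual limit is not confirmed.
MAX_DOC_BYTES = 4000

def truncateDoc(documentStr, limit=MAX_DOC_BYTES):
    """Return documentStr cut to at most `limit` bytes, marking any truncation."""
    encoded = documentStr.encode("utf-8")
    if len(encoded) <= limit:
        return documentStr
    return encoded[:limit].decode("utf-8", errors="ignore") + "... [truncated]"
```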
Should we explicitly set an expiration time for this alert? Without it, what is the default value?
The configuration is here https://gitlab.cern.ch/cmsmonitoring/cmsmon-configs/-/blob/master/alertmanager/alertmanager.yaml
and we use this function:
```python
def sendAlert(self, alertName, severity, summary, description, service, tag="wmcore", endSecs=600, generatorURL=""):
```
so:
- we send an alert with the tag `wmcore`
- the route uses the tag to match the alert and redirect it to "dmwm admins" here
- the "dmwm admins" receiver is configured to send an email to our egroup here
- after `endSecs` = 10min the alert is silenced (this overrides the global resolve timeout of 5min)
- if the same alert is sent before `repeat_interval: 2h`, then alertmanager will not send the same notification again
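For completeness, here is what explicitly overriding the expiration could look like at the call site - a sketch only, reusing the signature quoted above; the 1h value is illustrative:

```python
# Sketch: pass endSecs explicitly instead of relying on the 600s default.
# All argument names come from the sendAlert signature quoted above.
self.sendAlert(alertName, alertSeverity, alertSummary, alertDescription,
               self.alertServiceName, tag="wmcore", endSecs=3600)  # expire after 1h
```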
Thank you, Dario. Where did you get the `repeat_interval` configuration from?
I am undecided whether or not we should make this time for re-raising an alert larger, out of fear of spam.
Would 12h be a better option?
it's always the same file, here
we can change it to 12h if:
- we create a new receiver for alertmanager (for example `dmwm-admins-12h`) and override its default values (for example, setting 12h for `repeat_interval`)
- in msoutput.py, we override the value for `tag`, let's say to `wmcore12h` (see the sketch below)
- we create a new route for alertmanager that matches the tag `wmcore12h` and sends the alert to `dmwm-admins-12h`
at this point it would be beneficial to have a broader discussion on our alerts, because we could take this opportunity to improve the situation across the board.
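On the msoutput.py side, the second bullet above would be a small change at the call site - a sketch, where `wmcore12h` is the proposed (not yet existing) tag:

```python
# Sketch of the proposed msoutput.py change: override the default tag so the
# alert matches a dedicated alertmanager route with a 12h repeat_interval.
self.sendAlert(alertName, alertSeverity, alertSummary, alertDescription,
               self.alertServiceName, tag="wmcore12h")
```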
Thank you for these details.
Given that we would have to either change the default `repeat_interval` value or fork it for dmwm, I would suggest leaving it for one of the monitoring-related issues that we are discussing and considering for Q1. I fully agree that a discussion on that is important, so I suggest keeping it out of these developments.
And just so I don't forget, please have a look at the Jenkins report as well (which has failed to be produced so far). Let me try to trigger that.
test this please
@mapellidario these changes are looking good in general, but I left a few concerns along the code that might have to be followed up on - it really depends on the interpretation and/or behavior that we want to report. Hence, I am leaving a comment instead of an approval/request for changes.
alertDescription = "wf: {}\n\nmsg: {}\n\nex: {}\n\n{}".format(workflowname, msg, ex_str, document_str) | ||
self.logger.error("%s\n%s\n%s", alertName, alertSummary, alertDescription) | ||
if self.msConfig["sendNotification"]: | ||
self.sendAlert(alertName, alertSeverity, alertSummary, alertDescription, self.alertServiceName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, Dario. Where did you come with repeat_interval
configuration from?
I am undecided whether we should make this time for re-raising an alert larger or not, as a fear of spam.
Would 12h be a better option?
I added a commit that improves the clarity of the counters; I hope their meaning is a bit clearer now.
@mapellidario please find some comments and suggestions along the code.
Once you make the new changes, feel free to squash the commits right away, so that we can merge it if everything is fine. Thanks
alertDescription = "wf: {}\n\nmsg: {}\n\nex: {}\n\n{}".format(workflowname, msg, ex_str, document_str) | ||
self.logger.error("%s\n%s\n%s", alertName, alertSummary, alertDescription) | ||
if self.msConfig["sendNotification"]: | ||
self.sendAlert(alertName, alertSeverity, alertSummary, alertDescription, self.alertServiceName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for these details.
Given that we would have to either change the default repeat_interval value or fork it for dmwm
, I would suggest to leave it for one of the monitoring-related issues that we are discussing and considering for Q1. I fully agree that a discussion on that is important, so I suggest to keep it out of these developments.
Force-pushed from 2dd6799 to 9a702b2
I implemented the changes and squashed the commits, thanks for the review Alan!
Jenkins results:
Fixes #11941
Status
Description
When the MSOutput producer and consumer encounter a generic error, we do not want the process to terminate. We want to continue and process the next workflow.
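A minimal sketch of the pattern (hypothetical loop and helper names; the actual MSOutput code is more involved):

```python
def msOutputConsumer(self, workflows):
    """Process each workflow; on a generic error, alert and move to the next one."""
    for wflow in workflows:
        try:
            self.processWorkflow(wflow)  # hypothetical per-workflow step
        except Exception as ex:
            # Previously a generic error would `break` out of the loop and stall
            # every remaining workflow. With `continue`, only the failing
            # workflow is skipped, and the error surfaces through logs/alerts.
            self.logger.error("Failed to process workflow %s: %s", wflow, str(ex))
            continue
```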
I know that `continue` instead of `break` is not ideal: it forces us to read alerts to make sure that errors do not slip through the cracks. It is also true that in the past it happened that, even with `break`, nobody noticed that something was not working. I do not consider this the best solution; it is yet another patch that will hopefully improve our life :)

Is it backward compatible (if not, which system it affects?)
yes
Related PRs
none
External dependencies / deployment changes
nope