Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ms-output - do not break out producer and consumer loops #12194

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mapellidario
Copy link
Member

Fixes #11941

Status

  • tested on test11/vosms0262

Description

When the MSOutput producer and consumer encounter a generic error, we do not want the process to terminate. We want to continue and process the next workflow.

I know that continue instead of break is not ideal, that forces us to read alerts to make sure that errors do not slip through the cracks. It is also true that in the past it happened that even with break, nobody noticed that something was not working. I do not consider this the best solution, it is yet another patch that will hopefully improve our life :)

Is it backward compatible (if not, which system it affects?)

yes

Related PRs

none

External dependencies / deployment changes

nope

Copy link
Contributor

@todor-ivanov todor-ivanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mapellidario please find some comments along the code.

In addition, I am curious to know how the document information makes it to the alert. I know there are limits (number of bytes?), so could you please share one possible alert that would be generated (if it does not have sensitive information)? Thanks

src/python/WMCore/MicroService/MSOutput/MSOutput.py Outdated Show resolved Hide resolved
src/python/WMCore/MicroService/MSOutput/MSOutput.py Outdated Show resolved Hide resolved
alertDescription = "wf: {}\n\nmsg: {}\n\nex: {}\n\n{}".format(workflowname, msg, ex_str, document_str)
self.logger.error("%s\n%s\n%s", alertName, alertSummary, alertDescription)
if self.msConfig["sendNotification"]:
self.sendAlert(alertName, alertSeverity, alertSummary, alertDescription, self.alertServiceName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we explicitly set an expiration time for this alert? Without it, what is the default value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The configuration is here https://gitlab.cern.ch/cmsmonitoring/cmsmon-configs/-/blob/master/alertmanager/alertmanager.yaml
and we use this function:

def sendAlert(self, alertName, severity, summary, description, service, tag="wmcore", endSecs=600, generatorURL=""):

so:

  • we send an alert with the tag wmcore
  • the route uses the tag to match the alert and redirect it to "dmwm admins" here
  • the "dmwm admins" receiver is configured to send an email to our egroup here
  • after endSecs = 10min the alert is silenced (this overrides the global resolve timeout of 5min)
  • if the same alert is sent before repeat_interval: 2h, then alertmanager will not send the same notification again

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, Dario. Where did you come with repeat_interval configuration from?

I am undecided whether we should make this time for re-raising an alert larger or not, as a fear of spam.
Would 12h be a better option?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's always the same file, here

we can change to 12h if:

  • we create a new receiver for alertmanager (for example dmwm-admins-12h) , and we override the default values (for example setting 12h for repeat interval)
  • in msoutput.py, we override the value for tag, let's say wmcore12h
  • we create a new redirect for alertmanager that matches the tag wmcore12h and send the alert to dmwm-admins-12h

at this point it would be beneficial to have a broader discussion on our alerts, because we could take this opportunity to improve the situation across the board.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for these details.
Given that we would have to either change the default repeat_interval value or fork it for dmwm, I would suggest to leave it for one of the monitoring-related issues that we are discussing and considering for Q1. I fully agree that a discussion on that is important, so I suggest to keep it out of these developments.

src/python/WMCore/MicroService/MSOutput/MSOutput.py Outdated Show resolved Hide resolved
src/python/WMCore/MicroService/MSOutput/MSOutput.py Outdated Show resolved Hide resolved
src/python/WMCore/MicroService/MSOutput/MSOutput.py Outdated Show resolved Hide resolved
src/python/WMCore/MicroService/MSOutput/MSOutput.py Outdated Show resolved Hide resolved
@amaltaro
Copy link
Contributor

amaltaro commented Dec 4, 2024

And just so I don't forget, please have a look at the Jenkins report as well (which it failed to be produced so far). Let me try to trigger that.

@amaltaro
Copy link
Contributor

amaltaro commented Dec 4, 2024

test this please

1 similar comment
@amaltaro
Copy link
Contributor

amaltaro commented Dec 9, 2024

test this please

@dmwm-bot

This comment was marked as outdated.

@dmwm-bot

This comment was marked as outdated.

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mapellidario these changes are looking good in general, but I left a few concerns along the code that might have to be followed up - it really depends on the interpretation and/or behavior that we want to report. Hence, I am leaving a comment instead of approval/request for changes.

src/python/WMCore/MicroService/MSOutput/MSOutput.py Outdated Show resolved Hide resolved
src/python/WMCore/MicroService/MSOutput/MSOutput.py Outdated Show resolved Hide resolved
alertDescription = "wf: {}\n\nmsg: {}\n\nex: {}\n\n{}".format(workflowname, msg, ex_str, document_str)
self.logger.error("%s\n%s\n%s", alertName, alertSummary, alertDescription)
if self.msConfig["sendNotification"]:
self.sendAlert(alertName, alertSeverity, alertSummary, alertDescription, self.alertServiceName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, Dario. Where did you come with repeat_interval configuration from?

I am undecided whether we should make this time for re-raising an alert larger or not, as a fear of spam.
Would 12h be a better option?

@dmwm-bot

This comment was marked as outdated.

@mapellidario
Copy link
Member Author

I added a commit that improves the clarity of the counters, i hope that their meaning is a bit more clear now

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mapellidario please find some comments and suggestions along the code.
Once you make new changes, feel free to already squash the commits, such that we can merge it in case everything is fine. Thanks

src/python/WMCore/MicroService/MSOutput/MSOutput.py Outdated Show resolved Hide resolved
src/python/WMCore/MicroService/MSOutput/MSOutput.py Outdated Show resolved Hide resolved
src/python/WMCore/MicroService/MSOutput/MSOutput.py Outdated Show resolved Hide resolved
src/python/WMCore/MicroService/MSOutput/MSOutput.py Outdated Show resolved Hide resolved
alertDescription = "wf: {}\n\nmsg: {}\n\nex: {}\n\n{}".format(workflowname, msg, ex_str, document_str)
self.logger.error("%s\n%s\n%s", alertName, alertSummary, alertDescription)
if self.msConfig["sendNotification"]:
self.sendAlert(alertName, alertSeverity, alertSummary, alertDescription, self.alertServiceName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for these details.
Given that we would have to either change the default repeat_interval value or fork it for dmwm, I would suggest to leave it for one of the monitoring-related issues that we are discussing and considering for Q1. I fully agree that a discussion on that is important, so I suggest to keep it out of these developments.

@mapellidario
Copy link
Member Author

I implemented the changes and squashed the commits, thanks for the review alan!

@dmwm-bot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 10 warnings
    • 88 comments to review
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/223/artifact/artifacts/PullRequestReport.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MSOutput Should not silently tolerate exceptions during rule creation
4 participants