ms-output - do not break out producer and consumer loops #12194

mapellidario · 2024-12-04T13:47:25Z

Fixes #11941

Status

tested on test11/vosms0262

Description

When the MSOutput producer and consumer encounter a generic error, we do not want the process to terminate. We want to continue and process the next workflow.

I know that using continue instead of break is not ideal: it forces us to read alerts to make sure that errors do not slip through the cracks. It is also true that in the past, even with break, nobody noticed that something was not working. I do not consider this the best solution; it is yet another patch that will hopefully improve our life :)
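
As a rough illustration of the behaviour change (a sketch only, not the actual MSOutput code): the loop below reports a generic error for one workflow and moves on with `continue` instead of leaving the loop with `break`. The names `process_all`, `handle` and `report` are hypothetical, as are the counter keys.

```python
import logging

logger = logging.getLogger("msoutput-sketch")


def process_all(workflows, handle, report):
    """Process every workflow; a generic error in one does not stop the rest.

    `handle` and `report` are hypothetical callables standing in for the
    real per-workflow logic and for the alerting call.
    """
    counters = {"total": 0, "ok": 0, "failed": 0}
    for name in workflows:
        counters["total"] += 1
        try:
            handle(name)
        except Exception as ex:  # generic error: report it, then keep going
            logger.error("workflow %s failed: %s", name, ex)
            report(name, ex)
            counters["failed"] += 1
            continue  # a `break` here would skip all remaining workflows
        counters["ok"] += 1
    return counters
```

With this shape, a failure is only visible through the log line, the alert and the `failed` counter, which is exactly why someone still has to read the alerts.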

Is it backward compatible (if not, which system it affects?)

yes

Related PRs

none

External dependencies / deployment changes

nope

todor-ivanov

looking good

amaltaro

@mapellidario please find some comments along the code.

In addition, I am curious to know how the document information makes it to the alert. I know there are limits (number of bytes?), so could you please share one possible alert that would be generated (if it does not have sensitive information)? Thanks

amaltaro · 2024-12-04T16:35:52Z

src/python/WMCore/MicroService/MSOutput/MSOutput.py

+        alertDescription = "wf: {}\n\nmsg: {}\n\nex: {}\n\n{}".format(workflowname, msg, ex_str, document_str)
+        self.logger.error("%s\n%s\n%s", alertName, alertSummary, alertDescription)
+        if self.msConfig["sendNotification"]:
+            self.sendAlert(alertName, alertSeverity, alertSummary, alertDescription, self.alertServiceName)


Should we explicitly set an expiration time for this alert? Without it, what is the default value?

The configuration is here https://gitlab.cern.ch/cmsmonitoring/cmsmon-configs/-/blob/master/alertmanager/alertmanager.yaml
and we use this function:

WMCore/src/python/WMCore/Services/AlertManager/AlertManagerAPI.py

Line 32 in 44d2a8a

def sendAlert(self, alertName, severity, summary, description, service, tag="wmcore", endSecs=600, generatorURL=""):

so:

we send an alert with the tag wmcore

the route uses the tag to match the alert and redirect it to "dmwm admins" here

the "dmwm admins" receiver is configured to send an email to our egroup here

after endSecs = 10min the alert is considered resolved (this overrides the global resolve timeout of 5min)

if the same alert is sent before repeat_interval: 2h, then alertmanager will not send the same notification again
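
To make the role of `endSecs` and `tag` concrete, here is a minimal sketch of how an alert with an explicit end time could be assembled. It is not the WMCore `AlertManagerAPI` implementation: `build_alert` is a hypothetical helper and the label names are assumptions. The top-level field names (`labels`, `annotations`, `startsAt`, `endsAt`, `generatorURL`) follow the Alertmanager v2 alerts API.

```python
from datetime import datetime, timedelta, timezone


def build_alert(alert_name, severity, summary, description, service,
                tag="wmcore", end_secs=600, generator_url="", now=None):
    """Return one alert as a dict, ending `end_secs` seconds after `now`."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(seconds=end_secs)
    return {
        "labels": {
            "alertname": alert_name,
            "severity": severity,
            "service": service,
            "tag": tag,  # assumed to be the label that the route matches on
        },
        "annotations": {"summary": summary, "description": description},
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),  # explicit end time instead of the global timeout
        "generatorURL": generator_url,
    }
```

Passing a larger `end_secs` only changes when the alert ends; how often a notification is repeated is decided on the Alertmanager side by `repeat_interval`.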

Thank you, Dario. Where did you get the repeat_interval configuration from?

I am undecided whether we should make this time for re-raising an alert larger or not, for fear of spam.
Would 12h be a better option?

it's always the same file, here

we can change to 12h if:

we create a new receiver for alertmanager (for example dmwm-admins-12h), and we override the default values (for example setting 12h for the repeat interval)

in MSOutput.py, we override the value for tag, let's say wmcore12h

we create a new route for alertmanager that matches the tag wmcore12h and sends the alert to dmwm-admins-12h

at this point it would be beneficial to have a broader discussion on our alerts, because we could take this opportunity to improve the situation across the board.
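
The three steps listed above can be pictured with a toy model of tag-based routing. This is a simplification for discussion only, not how Alertmanager is implemented or configured: `pick_route`, the dict layout, the `default` receiver and the `dmwm-admins` spelling are all hypothetical, while `wmcore12h` and `dmwm-admins-12h` are the example names from this comment.

```python
def pick_route(labels, routes, default):
    """Return the first route whose `match` labels all equal the alert's labels."""
    for route in routes:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            return route
    return default


# Hypothetical configuration: the existing wmcore route plus the proposed 12h one.
DEFAULT_ROUTE = {"receiver": "default", "repeat_interval_h": 2}
ROUTES = [
    {"match": {"tag": "wmcore"}, "receiver": "dmwm-admins", "repeat_interval_h": 2},
    {"match": {"tag": "wmcore12h"}, "receiver": "dmwm-admins-12h", "repeat_interval_h": 12},
]
```

In this model, an alert sent with the tag `wmcore12h` lands on `dmwm-admins-12h` and is repeated at most every 12 hours, while alerts with the tag `wmcore` keep the current behaviour.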

Thank you for these details.
Given that we would have to either change the default repeat_interval value or fork it for dmwm, I suggest leaving it for one of the monitoring-related issues that we are discussing and considering for Q1. I fully agree that a discussion on that is important, so I suggest keeping it out of these developments.

src/python/WMCore/MicroService/MSOutput/MSOutput.py

amaltaro · 2024-12-04T16:41:14Z

And just so I don't forget, please have a look at the Jenkins report as well (which has failed to be produced so far). Let me try to trigger that.

amaltaro · 2024-12-04T16:41:19Z

test this please

amaltaro · 2024-12-09T15:41:11Z

test this please

amaltaro

@mapellidario these changes are looking good in general, but I left a few concerns along the code that might have to be followed up - it really depends on the interpretation and/or behavior that we want to report. Hence, I am leaving a comment instead of approval/request for changes.

mapellidario · 2024-12-16T10:47:41Z

I added a commit that improves the clarity of the counters. I hope that their meaning is a bit clearer now.

amaltaro

@mapellidario please find some comments and suggestions along the code.
Once you make new changes, feel free to already squash the commits, such that we can merge it in case everything is fine. Thanks

mapellidario · 2024-12-17T10:51:05Z

I implemented the changes and squashed the commits, thanks for the review, Alan!

dmwm-bot · 2024-12-17T10:59:36Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 2 changes in unstable tests
Python3 Pylint check: succeeded
- 10 warnings
- 88 comments to review
Pycodestyle check: succeeded
- 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/223/artifact/artifacts/PullRequestReport.html

mapellidario requested review from amaltaro and todor-ivanov December 4, 2024 13:47

todor-ivanov approved these changes Dec 4, 2024

amaltaro requested changes Dec 4, 2024

mapellidario requested a review from amaltaro December 12, 2024 15:06

amaltaro added the PR: squashing needed label Dec 13, 2024

amaltaro reviewed Dec 13, 2024

amaltaro requested changes Dec 16, 2024

ms-output - do not break out producer and consumer loops in case of generic error (9a702b2)

mapellidario force-pushed the 20241127_msoutput branch from 2dd6799 to 9a702b2 Compare December 17, 2024 10:43

mapellidario removed the PR: squashing needed label Dec 17, 2024
