Add setter methods map to WMWorkload and call it for all reqArg parameters #12099

todor-ivanov · 2024-09-17T11:50:05Z

Fixes #12038

Status

Ready

Description

With the current PR we add new functionality to call all the needed WMWorkload setter methods based on the full reqArgs dictionary passed from the upper-level caller, and to set all parameters at once rather than calling every single setter method one by one. This is achieved by creating a proper map between setter methods and possible request arguments, and then validating whether the arguments passed in the reqArgs dictionary properly match the signature of the setter method to be called. In the current implementation only a small set of methods is mapped to request parameters:

setSiteWhitelist
setSiteBlacklist
setPriority
(Mostly single argument parametrized methods, but that's ok for the time being, because those cover perfectly the functionality we need to plug in here)
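A minimal sketch of the map idea described above, under stated assumptions: DemoWorkload is a hypothetical stand-in (not the real WMWorkload), its setters only store values, and a plain ValueError stands in for whatever exception the real code raises. It only illustrates the two passes: validate every argument against the map first, then call the mapped setters.

```python
import inspect
from collections import namedtuple

# Hypothetical stand-in for WMWorkload; only the map/dispatch idea is illustrated.
SetterTuple = namedtuple("SetterTuple", ["reqArg", "setterFunc", "setterSignature"])


class DemoWorkload:
    def __init__(self):
        self.priority = 0
        self.siteWhitelist = []
        self.siteBlacklist = []
        self.settersMap = {}
        for reqArg, func in (("RequestPriority", self.setPriority),
                             ("SiteWhitelist", self.setSiteWhitelist),
                             ("SiteBlacklist", self.setSiteBlacklist)):
            self.settersMap[reqArg] = SetterTuple(reqArg, func, inspect.signature(func))

    def setPriority(self, priority):
        self.priority = int(priority)

    def setSiteWhitelist(self, siteWhitelist):
        self.siteWhitelist = list(siteWhitelist)

    def setSiteBlacklist(self, siteBlacklist):
        self.siteBlacklist = list(siteBlacklist)

    def updateWorkloadArgs(self, reqArgs):
        # First pass: refuse the whole update if any argument has no mapped
        # setter or does not fit the setter's signature.
        for reqArg, argValue in reqArgs.items():
            if reqArg not in self.settersMap:
                raise ValueError(f"Unsupported reqArg: {reqArg}")
            self.settersMap[reqArg].setterSignature.bind(argValue)
        # Second pass: call every mapped setter.
        for reqArg, argValue in reqArgs.items():
            self.settersMap[reqArg].setterFunc(argValue)


workload = DemoWorkload()
workload.updateWorkloadArgs({"RequestPriority": 200000, "SiteWhitelist": ["T1_US_FNAL"]})
```

In this sketch, supporting one more parameter would mean adding one more (reqArg, setter) pair to the tuple in the constructor.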

With this change, a path was opened for updating all possible workqueue elements in a single go. In the WorkQueue service an additional method was developed for fetching all possible workqueue elements and updating them all with the full set of arguments provided through the reqArgs dictionary, at the cost of a single database action per WorkQueue element rather than 3 (or more) separate database calls for updating every element parameter separately. Once all workqueue elements are updated with the new parameters, the WMSpec copy of the given workflow at the workqueue is also updated in a single push using the same logic as above.
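To make the cost argument concrete, here is a rough sketch that counts write actions. FakeElementStore, updateOneByOne and updateInSinglePush are hypothetical names used only for illustration; the store is an in-memory dictionary, not CouchDB, and only the call shape updateElements(*ids, **params) mirrors the one used in this PR.

```python
class FakeElementStore:
    """Hypothetical in-memory stand-in for the WorkQueue elements database."""

    def __init__(self, elements):
        self.elements = elements      # {elementId: {param: value}}
        self.writes = 0               # number of per-element write actions

    def updateElements(self, *elementIds, **updatedParams):
        # One write action per element, regardless of how many params change.
        for elementId in elementIds:
            self.elements[elementId].update(updatedParams)
            self.writes += 1


def updateOneByOne(store, elementIds, params):
    # A separate call (and write) for every single parameter.
    for name, value in params.items():
        store.updateElements(*elementIds, **{name: value})


def updateInSinglePush(store, elementIds, params):
    # All parameters go in one call, so each element is written once.
    store.updateElements(*elementIds, **params)


params = {"Priority": 200000, "SiteWhitelist": ["T1_US_FNAL"], "SiteBlacklist": []}
separate = FakeElementStore({"e1": {}, "e2": {}})
updateOneByOne(separate, ["e1", "e2"], params)
single = FakeElementStore({"e1": {}, "e2": {}})
updateInSinglePush(single, ["e1", "e2"], params)
```

Both stores end up with identical element contents; only the number of write actions differs.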

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

None

cmsdmwmbot · 2024-09-17T11:59:26Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 tests added
- 2 changes in unstable tests
Python3 Pylint check: failed
- 10 warnings and errors that must be fixed
- 9 warnings
- 195 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 16 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15218/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2024-09-17T15:03:41Z

Jenkins results:

Python3 Unit tests: failed
- 2 new failures
Python3 Pylint check: failed
- 11 warnings and errors that must be fixed
- 9 warnings
- 212 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15219/artifact/artifacts/PullRequestReport.html

todor-ivanov · 2024-09-17T15:11:45Z

@amaltaro Please take a look at this PR, I think this one fully covers all the requirements and addresses our fears for affecting scalability due to increased database calls.

cmsdmwmbot · 2024-09-17T15:52:32Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 changes in unstable tests
Python3 Pylint check: failed
- 7 warnings and errors that must be fixed
- 9 warnings
- 211 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15220/artifact/artifacts/PullRequestReport.html

amaltaro · 2024-09-19T02:51:12Z

src/python/WMCore/ReqMgr/Service/Request.py


        # Commit the changes of the current workload object to the database:
        workload.saveCouchUrl(workload.specUrl())

+        # Commit all Global WorkQueue changes per workflow in a single go:
+        self.gq_service.updateElementsByWorkflow(workload.name(), reqArgs, status=['Available'])


With this call, you will actually change the behavior of priority to get only applied to Available elements.

In addition, do you ensure now that reqArgs will always be only the supported parameters?

With this call, you will actually change the behavior of priority to get only applied to Available elements.

Well, does it even matter? As we discussed a few days ago: do we actually know whether we continue to update any WorkQueue Element's parameters at the agent once it is acquired from the GlobalWorkQueue? If we do, then it matters for the site lists as well. So IIRC, during our discussion we decided to stay safe and update only available WQEs, and not to risk getting into race conditions related to input data location etc. If we do not care, I can remove the requirement for status=['Available']. But if we care to propagate the Priority information down to the agent for every WQE status (Acquired et al.), while we want to update site information only at the Global WQ, then we simply cannot have a mechanism to do it in a single push. The two calls should be separated. So we cannot have them both:

a single DB action for updating WQE

different WQE status requirement based on request parameters
(An escape from this is to hardcode it in the WorkQueue module logic of updateElementsByWorkflow, but you know my opinion on hardcoding requirements in the code.)

In addition, do you ensure now that reqArgs will always be only the supported parameters?

As long as we follow the same logic as we did while validating the allowed parameters in the first set of actions in these calls (meaning to call validate_request_update_args before calling the put method and hence _handleNoStatusUpdate) then we are keeping the same logic as before - hence following the map of allowed actions from

WMCore/src/python/WMCore/ReqMgr/DataStructs/RequestStatus.py

Lines 118 to 147 in 7ddd036

ALLOWED_ACTIONS_FOR_STATUS = {
    "new": ["RequestPriority"],
    "assignment-approved": ["RequestPriority", "Team", "SiteWhitelist", "SiteBlacklist",
                            "AcquisitionEra", "ProcessingString", "ProcessingVersion",
                            "Dashboard", "MergedLFNBase", "TrustSitelists",
                            "UnmergedLFNBase", "MinMergeSize", "MaxMergeSize",
                            "MaxMergeEvents", "BlockCloseMaxWaitTime",
                            "BlockCloseMaxFiles", "BlockCloseMaxEvents", "BlockCloseMaxSize",
                            "SoftTimeout", "GracePeriod",
                            "TrustPUSitelists", "CustodialSites",
                            "NonCustodialSites", "Override",
                            "SubscriptionPriority"],
    "assigned": ["RequestPriority"],
    "staging": ["RequestPriority", "SiteWhitelist", "SiteBlacklist"],
    "staged": ["RequestPriority"],
    "acquired": ["RequestPriority", "SiteWhitelist", "SiteBlacklist"],
    "running-open": ["RequestPriority", "SiteWhitelist", "SiteBlacklist"],
    "running-closed": ["RequestPriority"],
    "failed": [],
    "force-complete": [],
    "completed": [],
    "closed-out": [],
    "announced": [],
    "aborted": [],
    "aborted-completed": [],
    "rejected": [],
    "normal-archived": [],
    "aborted-archived": [],
    "rejected-archived": [],
}

For this logic we mostly needed those 3 setter methods, which I've already put in the reqArgs-to-setter-method map at updateWorkloadArgs:

self.settersMap['RequestPriority'] = setterTuple('RequestPriority', self.setPriority, inspect.signature(self.setPriority))
self.settersMap['SiteBlacklist'] = setterTuple('SiteBlacklist', self.setSiteBlacklist, inspect.signature(self.setSiteBlacklist))
self.settersMap['SiteWhitelist'] = setterTuple('SiteWhitelist', self.setSiteWhitelist, inspect.signature(self.setSiteWhitelist))

If we want to be extra detailed and fully exhaustive about the setters for any possible reqArg per status, we may also include in this map the eventual methods for all allowed actions for the assignment-approved status:

"assignment-approved": ["RequestPriority", "Team", "SiteWhitelist", "SiteBlacklist", "AcquisitionEra", "ProcessingString", "ProcessingVersion", "Dashboard", "MergedLFNBase", "TrustSitelists", "UnmergedLFNBase", "MinMergeSize", "MaxMergeSize", "MaxMergeEvents", "BlockCloseMaxWaitTime", "BlockCloseMaxFiles", "BlockCloseMaxEvents", "BlockCloseMaxSize", "SoftTimeout", "GracePeriod", "TrustPUSitelists", "CustodialSites", "NonCustodialSites", "Override", "SubscriptionPriority"],

But for some of them the logic may need to change because some of those setters are parametrized by more than a single argument. While at the same time we already know that for assignment-approved we would never call _handleNoStatusUpdate ... So that's why, to me it seems safe to proceed only with those 3 setters mapped. Of course, if you ask me - I am completely up for moving the whole logic to be implemented here in a more generic way .... for all status updates, then get rid of a big chunk of code covering custom cases ... and only make the proper calls to this generic method here from upstream modules (e.g. Request in the current case) But I do not think we will have the time during this line of development here to do this.
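For illustration, the way the ALLOWED_ACTIONS_FOR_STATUS map quoted above constrains which reqArgs may reach the setters can be sketched as follows. This is only a sketch: splitByAllowedActions is a hypothetical helper, not the actual validate_request_update_args logic, and the map here is a small subset of the real one.

```python
# Subset of the ALLOWED_ACTIONS_FOR_STATUS map quoted above.
ALLOWED_ACTIONS_FOR_STATUS = {
    "new": ["RequestPriority"],
    "staging": ["RequestPriority", "SiteWhitelist", "SiteBlacklist"],
    "running-closed": ["RequestPriority"],
    "completed": [],
}


def splitByAllowedActions(status, reqArgs):
    """Hypothetical helper: split reqArgs into (allowed, rejected) for a status."""
    allowedNames = set(ALLOWED_ACTIONS_FOR_STATUS.get(status, []))
    allowed = {k: v for k, v in reqArgs.items() if k in allowedNames}
    rejected = {k: v for k, v in reqArgs.items() if k not in allowedNames}
    return allowed, rejected


allowed, rejected = splitByAllowedActions(
    "running-closed", {"RequestPriority": 1000, "SiteBlacklist": ["T2_CH_CERN"]})
```

With the subset above, every status other than assignment-approved allows at most the three arguments already mapped to setters, which is the point made in this reply.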

We might be able to distribute this update, but before answering how we can proceed, can you please check the JobUpdater code to check:

in which databases are the WQEs updated?

does it use any status filter? If so, which statuses?

hi @amaltaro,
ok, Here it is:

JobUpdater at the agent uses the very same method updatePriority

WMCore/src/python/WMCore/Services/WorkQueue/WorkQueue.py

Line 240 in 4e0759d

def updatePriority(self, wf, priority):

from the WMCore.Services.WorkQueue to update the workflow priority (just as before), here is the actual call:

WMCore/src/python/WMComponent/JobUpdater/JobUpdaterPoller.py

Line 151 in 4e0759d

self.workqueue.updatePriority(workflow, priorityCache[workflow])

It checks for any diff between the priority in WMBS and couch database before taking actions:

WMCore/src/python/WMComponent/JobUpdater/JobUpdaterPoller.py

Line 149 in 4e0759d

if workqueuePrio != priorityCache[workflow]:

and:

WMCore/src/python/WMComponent/JobUpdater/JobUpdaterPoller.py

Line 166 in 4e0759d

if wmbsPrio != priorityCache[workflow]:

The CouchDB database to be used is taken from the agent configuration for the component (actually the regular local workqueue component, not the JobUpdater config section):

WMCore/src/python/WMComponent/JobUpdater/JobUpdaterPoller.py

Lines 50 to 51 in 4e0759d

self.workqueue = WorkQueue(self.config.WorkQueueManager.couchurl,
                           self.config.WorkQueueManager.dbname)

which in this case is (taken from one agent configuration):

config.WorkQueueManager.dbname = 'workqueue'

So the inbox db is indeed configured at the WorkQueueManager component at the agent, but since I do not see any actions on it during this update in the updatePriority method, I do not think we even touch this DB while updating the WQE priority at the agents.

I found no filters being applied to any of the WQEs, but the workflows considered are the ones returned by:

WMCore/src/python/WMComponent/JobUpdater/JobUpdaterPoller.py

Line 141 in 4e0759d

workflowsToCheck = self.workqueue.getAvailableWorkflows()

and:

WMCore/src/python/WMCore/Services/WorkQueue/WorkQueue.py

Lines 209 to 218 in 4e0759d

def getAvailableWorkflows(self):
    """Get the workflows that have all their elements
       available in the workqueue"""
    data = self.db.loadView('WorkQueue', 'elementsDetailByWorkflowAndStatus',
                            {'reduce': False, 'stale': 'update_after'})
    availableSet = set((x['value']['RequestName'], x['value']['Priority']) for x in data.get('rows', []) if
                       x['key'][1] == 'Available')
    notAvailableSet = set((x['value']['RequestName'], x['value']['Priority']) for x in data.get('rows', []) if
                          x['key'][1] != 'Available')
    return availableSet - notAvailableSet
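The set arithmetic in getAvailableWorkflows can be tried on plain data. In the sketch below the row shape is an assumption: the status is taken from key[1] as in the method above, while placing the workflow name at key[0] and the helper row() are purely illustrative; no CouchDB view is involved.

```python
def availableWorkflows(rows):
    """Same set arithmetic as getAvailableWorkflows, applied to plain rows."""
    availableSet = set((x['value']['RequestName'], x['value']['Priority'])
                       for x in rows if x['key'][1] == 'Available')
    notAvailableSet = set((x['value']['RequestName'], x['value']['Priority'])
                          for x in rows if x['key'][1] != 'Available')
    return availableSet - notAvailableSet


def row(workflow, status, priority):
    # Hypothetical row shape: key is [workflow, status], value carries name and priority.
    return {'key': [workflow, status],
            'value': {'RequestName': workflow, 'Priority': priority}}


rows = [
    row('wfA', 'Available', 100), row('wfA', 'Available', 100),  # all Available
    row('wfB', 'Available', 100), row('wfB', 'Running', 100),    # mixed statuses
    row('wfC', 'Acquired', 100),                                 # none Available
]
result = availableWorkflows(rows)
```

Here only wfA survives the set difference, which matches the reading that JobUpdater considers workflows with all their elements in Available.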

amaltaro · 2024-09-19T03:04:07Z

src/python/WMCore/WMSpec/WMWorkload.py

+        #       call the proper setter methods bellow.
+
+        # populate the current instance settersMap
+        self.settersMap['RequestPriority'] = setterTuple('RequestPriority', self.setPriority, inspect.signature(self.setPriority))


Given that you are already checking the parameter name and associating it to the actual function (plus inspecting its signature), isn't it more clear to simply call the actual method according to the parameter?

But this is exactly what happens here. A few lines below, though, but still the same:

self.settersMap[reqArg].setterFunc(argValue)

Yes, I understand it. What I am saying is that creating a map and passing a pointer to a function does not look as friendly as a classic if-elif dealing with the 3 parameters.

creating a map gives the option of following the same pattern for adding as many setter calls (workflow parameter setter) as one may wish, by providing a generic method of validating and assembling the calls and not changing the logic for the rest of the function (i.e. not forking with hundreds of elifs, and getting into the trap of code growth by number of custom usecases). Here, the addition of a new setter is just a matter of adding yet another line in the map, following the same pattern as the rest in the map i.e.:

a named tuple of: ('reqArg', 'setterFunc', 'setterSignature')

So pretty universal way I'd say. And perfectly serves the exact purpose of what is needed here.

I understand the rationale for implementing something generic, but right now I am not aware of any other spec parameters that will have to be changed, so it's definitely not hundreds of future elifs.

In addition, I am very much a follower of the KISS principle, whenever possible and fitting. To me, this overly complex code (personally I don't know any of those inspect methods) can easily lead to a bug, or to buggy development.

@amaltaro , @anpicci , @todor-ivanov

Given that Python (or any programming language) offers multiple ways to achieve the same outcome, evaluating the best approach can often be subjective, depending on factors such as experience, proficiency, or personal style. Consequently, these discussions can become matters of personal preference.

For example, choosing between a full for-loop or a list comprehension is a matter of choice. While we can debate which one might be "better" based on various criteria, both serve the same functional purpose. In this instance, I feel that this particular conversation is similar, and rather than debating stylistic choices, the focus should be on whether the code achieves its intended purpose. Otherwise, we risk undermining the role of both developer and reviewer by favoring one approach over another, which I find counterproductive.

If there are concerns about correctness or functionality, I would recommend incorporating unit tests for validation instead of continuing to focus on the specific implementation style.

This is not coding style and personal preferences, but I see it as code readability and sustainability.

Anyhow, thank you for sharing your perspective on this matter. I won't bug people anymore on this.

Alan,

This:

try:
    if 'RequestPriority' in reqArgs:
        self.setPriority(reqArgs['RequestPriority'])
    if 'SiteBlacklist' in reqArgs:
        self.setSiteBlacklist(reqArgs['SiteBlacklist'])
    if 'SiteWhitelist' in reqArgs:
        self.setSiteWhitelist(reqArgs['SiteWhitelist'])
except Exception as ex:
    msg = f"Failure to update workload parameters. Details: {str(ex)}"
    raise WMWorkloadException(msg) from None

is simply not equal to this:

for reqArg, argValue in reqArgs.items():
    if not self.settersMap.get(reqArg, None):
        msg = f"Unsupported or missing setter method for updating reqArg: {reqArg}."
        raise WMWorkloadException(msg)
    try:
        self.settersMap[reqArg].setterSignature.bind(argValue)
    except TypeError as ex:
        msg = f"Setter's method signature does not match the method calls we currently support: Error: req{str(ex)}"
        raise WMWorkloadException(msg) from None

# Now go through the reqArg again and call every setter method according to the map
for reqArg, argValue in reqArgs.items():
    try:
        self.settersMap[reqArg].setterFunc(argValue)
    except Exception as ex:
        currFrame = inspect.currentframe()
        argsInfo = inspect.getargvalues(currFrame)
        argVals = {arg: argsInfo.locals.get(arg) for arg in argsInfo.args}
        msg = f"Failure while calling setter method {self.settersMap[reqArg].setterFunc.__name__} "
        msg += f"With arguments: {argVals}"
        msg += f"Full exception string: {str(ex)}"
        raise WMWorkloadException(msg) from None

In the end, both implementations will fail the call if the call to the setter method is wrong, but the latter is much more descriptive. It recognizes (at least 3) different possible conditions of failure and forwards the proper message for later debugging, while the former just fails the call and masks all the relevant information from whoever has to chase eventual problems. That was the exact reason why I went for this implementation. I know you'd always prefer the shorter version, but sometimes it hides/masks valuable information.

Furthermore, since this is supposed to be a method triggered by a user's call rather than an operation triggered by the internals of our system, the latter method inspects the stack and gives you the values of all parameters in the call. That means it shows you the exact (eventual) user mistake.
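For readers not familiar with the inspect calls discussed here, a small standalone sketch of the pre-check may help. Signature.bind only matches arguments against parameters and raises TypeError on a mismatch; it never calls the function. The two setters and the helper fitsSingleArgCall are hypothetical and exist only for this example.

```python
import inspect


def setPriority(priority):
    """Hypothetical single-argument setter."""
    return int(priority)


def setTwoThings(first, second):
    """Hypothetical setter that needs two arguments."""
    return first, second


def fitsSingleArgCall(func, argValue):
    """Return (True, None) if func(argValue) is a well-formed call, else (False, reason)."""
    try:
        # bind() only matches arguments to parameters; it does not call func.
        inspect.signature(func).bind(argValue)
    except TypeError as ex:
        return False, str(ex)
    return True, None


okPriority, _ = fitsSingleArgCall(setPriority, 200000)
okTwo, reason = fitsSingleArgCall(setTwoThings, 200000)
```

The second check fails before any setter runs, and the reason string names the parameter that could not be bound.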

Talking about inspection and validation. Validation of the user parameters is not supposed to happen at this layer (at the setter call). Validation should be performed in the Service/Request class (which is likely calling the Validation utils).

With that said, I just noticed that we are not validating the site lists, as we usually do for workflow assignment.
So we should add that validation in here as well to protect the system and us from unneeded debugging. For reference, this is how we define the site lists and the validation function:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/StdBase.py#L1147
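As a rough idea of the kind of check meant here, and explicitly not the actual WMCore validation function linked above, a site-list consistency check could look like the sketch below. checkSiteLists is a hypothetical name, and the only rule it enforces (no site in both lists) is an assumption made for illustration.

```python
def checkSiteLists(siteWhitelist, siteBlacklist):
    """Hypothetical check: raise ValueError if a site is in both lists."""
    # Accept a single site name as well as a list of names.
    if isinstance(siteWhitelist, str):
        siteWhitelist = [siteWhitelist]
    if isinstance(siteBlacklist, str):
        siteBlacklist = [siteBlacklist]
    overlap = sorted(set(siteWhitelist) & set(siteBlacklist))
    if overlap:
        raise ValueError(f"Sites present in both SiteWhitelist and SiteBlacklist: {overlap}")
    return True


consistent = checkSiteLists(["T1_US_FNAL", "T2_CH_CERN"], ["T2_US_MIT"])
```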

cmsdmwmbot · 2024-09-26T11:12:47Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 tests no longer failing
- 1 changes in unstable tests
Python3 Pylint check: failed
- 7 warnings and errors that must be fixed
- 9 warnings
- 211 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15240/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2024-09-26T11:16:31Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 tests no longer failing
Python3 Pylint check: failed
- 7 warnings and errors that must be fixed
- 9 warnings
- 211 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15241/artifact/artifacts/PullRequestReport.html

todor-ivanov · 2024-09-26T15:11:28Z

hi @amaltaro I have addressed your comments - Mostly, I removed the WQE status filter, based on the investigation I did on the JobUpdater component. You may take another look.

amaltaro

Todor, please find more comments along the code.

Once we are about to converge on this development, I think it is important to test it in a Dev environment (both for central services and WMAgent).

amaltaro · 2024-09-27T02:24:08Z

src/python/WMCore/WMSpec/WMWorkload.py

+        #       call the proper setter methods bellow.
+
+        # populate the current instance settersMap
+        self.settersMap['RequestPriority'] = setterTuple('RequestPriority', self.setPriority, inspect.signature(self.setPriority))


I understand the rationale for implementing something generic, but right now I am not aware of any other spec parameters that will have to be changed, so it's definitely not hundreds of future elifs.

In addition, I am very much a follower of the KISS principle, whenever possible and fitting. To me, this overly complex code (personally I don't know any of those inspect methods) can easily lead to a bug, or to buggy development.

amaltaro · 2024-09-27T02:27:37Z

src/python/WMCore/Services/WorkQueue/WorkQueue.py

+
+        # Update all WorkQueue elements with the parameters provided in a single push
+        if elementsToUpdate:
+            self.updateElements(*elementsToUpdate, **updateParams)


I believe I've seen you answering my question about these updateParams.
Are we completely 100% sure that the only parameters that will reach this update are one and/or a combination of:

RequestPriority

SiteWhitelist

SiteBlacklist
?
Of course, also considering that we only want to update these values when there is an actual update to the value.

In addition, I would like you to test this in WMAgent with a crafted workflow/scenario, please.

I believe I've seen you answering my question about these updateParams

Yes the answer is the second one in line, here from this comment: #12099 (comment)

In addition, I would like you to test this in WMAgent with a crafted workflow/scenario, please.

I had it tested. With a full validation campaign, not a single workflow, in my dev cluster.

I could not find an exact answer to the question above.
Additionally, the code you are deleting had an else statement populating a list of not handled arguments:
https://github.com/dmwm/WMCore/pull/12099/files#diff-120ee6838284a3d1c1799f511da7f147179d0a955f87d0da6fc8b58a8b66c794L440
which makes me believe that it can receive parameters other than only those 3.

About the validation, to properly validate these changes, we need to actually trigger specific scenarios and actions. The standard "campaign-style" validation is not going to expose any issues with this code.

@todor-ivanov @amaltaro I don't know if you have already converged on a common ground, but here is my opinion about map vs if-elif-else approach.

Considering that I am not as experienced as both of you with the WMCore software, to me the map approach looks clearer and more understandable than the if-elif-else approach. Indeed, I can read which parameters are of interest at the very beginning, and from these lines I can see that only one parameter among RequestPriority, SiteWhitelist, SiteBlacklist can be modified. The whole process doesn't require more intellectual effort than the previous if-elif-else approach and, in addition, it sounds to me more robust and easier to debug (where by "debug" I am also referring to "actually fix a potential issue").

Regarding the possible concerns:

I believe I've seen you answering my question about these updateParams. Are we completely 100% sure that the only parameters that will reach this update are one and/or a combination of:
* RequestPriority
* SiteWhitelist
* SiteBlacklist
?
Of course, also considering that we only want to update these values when there is an actual update to the value.

Given the original code, we can assume that this is a complete list of parameters that are supposed to reach this update, provided that what we want to get with this PR doesn't require other parameters to be properly updated

Additionally, the code you are deleting had an else statement populating a list of not handled arguments:
https://github.com/dmwm/WMCore/pull/12099/files#diff-120ee6838284a3d1c1799f511da7f147179d0a955f87d0da6fc8b58a8b66c794L440
which makes me believe that it can receive parameters other than only those 3.

This occurrence should be handled by the try-except in these lines, right @todor-ivanov? @amaltaro do you have any concern that this approach could fail to prevent updating parameters other than RequestPriority, SiteWhitelist, and SiteBlacklist?

@anpicci I think you meant to send this reply in this thread instead https://github.com/dmwm/WMCore/pull/12099/files#r1766081573 ?

To answer again the original question here:

With the latest changes in this commit: dfbda2a, it is now 100% certain that if anything but a supported argument update reaches that point, we will raise the proper exception and it will be handled accordingly in the caller method.

amaltaro · 2024-09-27T02:31:10Z

src/python/WMCore/Services/WorkQueue/WorkQueue.py

+                                 'reduce': False})
+
+        # Fetch only a list of WorkQueue element Ids && Filter them by allowed status
+        if status:


I like that we are not breaking the current behavior of RequestPriority update.

However, thinking twice about this, if we want to make these updates more efficient, we could update it only for elements in Available, Negotiating and Acquired, from the list of potential statuses for WQE:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/DataStructs/WorkQueueElement.py#L14

If you change it, could you please also update the Issue description (because it says only Available).
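The proposed filter can be pictured with a short sketch. The row shape ({'id': ..., 'key': [workflow, status]}) and the helper selectElementIds are assumptions for illustration only; the real method reads rows from a CouchDB view.

```python
ALLOWED_WQE_STATUSES = ['Available', 'Negotiating', 'Acquired']


def selectElementIds(rows, status=None):
    """Hypothetical helper: pick element ids, optionally filtered by status.

    Each row is assumed to look like {'id': <elementId>, 'key': [<workflow>, <status>]}.
    """
    if status:
        allowed = set(status)
        return [row['id'] for row in rows if row['key'][1] in allowed]
    # No status filter given: every element of the workflow is selected.
    return [row['id'] for row in rows]


rows = [
    {'id': 'e1', 'key': ['wfA', 'Available']},
    {'id': 'e2', 'key': ['wfA', 'Acquired']},
    {'id': 'e3', 'key': ['wfA', 'Running']},
    {'id': 'e4', 'key': ['wfA', 'Done']},
]
filtered = selectElementIds(rows, status=ALLOWED_WQE_STATUSES)
unfiltered = selectElementIds(rows)
```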

cmsdmwmbot · 2024-09-27T12:10:18Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
Python3 Pylint check: failed
- 7 warnings and errors that must be fixed
- 9 warnings
- 211 comments to review
Pylint py3k check: succeeded
Pycodestyle check: succeeded
- 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15249/artifact/artifacts/PullRequestReport.html

cmsdmwmbot · 2024-09-30T20:59:10Z

Can one of the admins verify this patch?

anpicci · 2024-10-14T10:39:57Z

@todor-ivanov apart from my comment on @amaltaro's review, I only suggest resolving the conflicts on src/python/WMCore/ReqMgr/Service/Request.py, if you have not yet considered addressing that

anpicci · 2024-10-17T12:19:13Z

@vkuznet thank you for the feedback.
@todor-ivanov @amaltaro , I propose to proceed as follows:

let's stick with Todor's original proposal to use maps;
@amaltaro will focus his review on checking that this implementation is valid and consistent with the rest of the system;
at the same time, @todor-ivanov provides documentation both for the external packages introduced in the PR and in the form of docstrings and in-line comments, where necessary from both the developer's and the reviewer's point of view, so that maintainability of the code is enhanced;
once the Jenkins tests pass and the PR is merged, @todor-ivanov is responsible for fixing issues arising during future release validations that are ascribable to this PR or, at least, for investigating and providing guidance for resolving such issues.

I want to clarify that I would like to adopt these accountability principles to every issue, meaning that this is not a special treatment outlined only for this PR. I will follow up on this during the next group meeting.

Thanks!

dmwm-bot · 2024-11-19T15:17:17Z

Jenkins results:

Python3 Unit tests: failed
- 2 new failures
- 1 tests added
- 1 changes in unstable tests
Python3 Pylint check: failed
- 8 warnings and errors that must be fixed
- 16 warnings
- 260 comments to review
Pycodestyle check: succeeded
- 27 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/87/artifact/artifacts/PullRequestReport.html

dmwm-bot · 2024-11-19T15:51:28Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 tests no longer failing
- 1 tests added
- 2 changes in unstable tests
Python3 Pylint check: failed
- 9 warnings and errors that must be fixed
- 16 warnings
- 260 comments to review
Pycodestyle check: succeeded
- 27 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/88/artifact/artifacts/PullRequestReport.html

amaltaro · 2024-12-03T21:27:07Z

It looks like I forgot to update this PR with some previous reviews that have been done here and there, so here we go.

There are some tasks pending from the other related PR, which have been described/discussed in this thread:
#12120 (review)

dmwm-bot · 2024-12-07T08:39:46Z

Jenkins results:

Python3 Unit tests: succeeded
- 3 changes in unstable tests
Python3 Pylint check: failed
- 10 warnings and errors that must be fixed
- 16 warnings
- 255 comments to review
Pycodestyle check: succeeded
- 20 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/162/artifact/artifacts/PullRequestReport.html

dmwm-bot · 2024-12-07T10:07:40Z

Jenkins results:

Python3 Unit tests: failed
- 1 new failures
- 2 changes in unstable tests
Python3 Pylint check: failed
- 14 warnings and errors that must be fixed
- 17 warnings
- 307 comments to review
Pycodestyle check: succeeded
- 25 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/163/artifact/artifacts/PullRequestReport.html

todor-ivanov · 2024-12-07T10:15:33Z

Hi @amaltaro @vkuznet @anpicci
This Review call is mostly for @amaltaro and also to update the others on the latest development triggered by his review comment here: #12099 (comment) linking to this one: #12120 (review) , which we agreed to address in the current issue, rather than prolonging the transient development in the previous one.

So here is a quick summary:

With this commit: Restore unhandled arguments mechanism
I am restoring the previous behavior of stopping any workflow changes in the case of not supported parameters update calls, but still signaling which are those which we do not handle.
NOTE: any non-status update parameters for changes of workflows in assignment_approved status are still treated separately
With this commit: Call reqMgr api only once && Preserve RequestStatus in the workload.
I am addressing @amaltaro's request to reduce the calls to the ReqMgr APIs to only one, made only on demand and as early as possible, during the validation step only. After that, the workflow status is preserved in the .request section of the workload object and is accessible through the relevant methods throughout the rest of the process
With this commit: Add arguments validation for partial workflow parameters update
I am addressing the request to validate all the values for those partial workflow parameters Updates

dmwm-bot · 2024-12-09T11:02:49Z

Jenkins results:

Python3 Unit tests: succeeded
- 2 changes in unstable tests
Python3 Pylint check: failed
- 13 warnings and errors that must be fixed
- 17 warnings
- 307 comments to review
Pycodestyle check: succeeded
- 25 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/168/artifact/artifacts/PullRequestReport.html

dmwm-bot · 2024-12-09T12:37:30Z

Jenkins results:

Python3 Unit tests: succeeded
- 2 changes in unstable tests
Python3 Pylint check: failed
- 13 warnings and errors that must be fixed
- 17 warnings
- 306 comments to review
Pycodestyle check: succeeded
- 25 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/171/artifact/artifacts/PullRequestReport.html

todor-ivanov · 2024-12-09T16:22:17Z

src/python/WMCore/ReqMgr/Service/Request.py


        # Commit the changes of the current workload object to the database:
        workload.saveCouchUrl(workload.specUrl())

+        # Commit all Global WorkQueue changes per workflow in a single go:
+        self.gq_service.updateElementsByWorkflow(workload.name(), reqArgs, status=['Available', 'Negotiating', 'Acquired'])


hi @amaltaro the line that I was talking about during the WMCore meeting is this one. It triggers a second call to workload.updateWorkloadArgs internally, similarly to what is done on line 431 at WMCore.ReqMgr.Service.Request from the current PR.

The internal call is happening at line 316 at WMCore.Services.WorkQueue again from the very same PR.

You have asked me to implement all the changes to both reqMgr (in this case through the workload object) and all WQEs in a single push. It is now possible. All that needs to happen, is to substitute the first call from above with the second call from the current line. I have the code change prepared. Just let me know if I should proceed with it or we should play it safe and we still do the action in two steps.

vkuznet

I’ve reviewed the code provided in this PR and, from a coding perspective, I don’t have any specific suggestions for improvement. I noticed the ongoing discussion between Alan and Todor on various related topics; however, I believe my input may not add significant value to that part of the conversation.

From the standpoint of pure code review, the implementation looks good and is ready to be merged. I don’t have a strong preference between Todor's map/namedtuple/setter approach or Alan's if/else flow, as I view this as a matter of personal or team preference.

dmwm-bot · 2024-12-10T11:18:21Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 tests no longer failing
- 1 changes in unstable tests
Python3 Pylint check: failed
- 13 warnings and errors that must be fixed
- 17 warnings
- 306 comments to review
Pycodestyle check: succeeded
- 25 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/179/artifact/artifacts/PullRequestReport.html

todor-ivanov · 2024-12-10T11:27:14Z

Hi @amaltaro,
I just found a misbehavior that occurs when we validate only the changed arguments: it lets some unsupported changes sneak in. For example, changing only the SiteBlacklist and leaving the SiteWhitelist untouched would let a workflow change propagate a collision between the two lists. I created an extra fix for that: 666b5f3. It works [1], but unfortunately it also affects stat arguments [2]. I need to think of a way out of this.

[1]

[10/Dec/2024:11:19:34]  Updating request "tivanov_TaskChain_LumiMask_multiRun_SiteListsTest_v7_241210_093515_7169" with these user-provided args: {'RequestPriority': 200000, 'SiteWhitelist': ['T1_US_FNAL', 'T2_CH_CERN'], 'SiteBlacklist': 'T2_CH_CERN'}
[10/Dec/2024:11:19:35]  Error: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Service/Request.py", line 190, in validate
    self._validateRequestBase(param, safe, validate_request_update_args, requestName)
  File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Service/Request.py", line 102, in _validateRequestBase
    workload, r_args = valFunc(args, self.config, self.reqmgr_db_service, param)
  File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Utils/Validation.py", line 114, in validate_request_update_args
    workload.validateArgumentsPartialUpdate(request_args)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkload.py", line 2048, in validateArgumentsPartialUpdate
    validateArgumentsUpdate(schema, argumentDefinition, optionKey=None)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkloadTools.py", line 294, in validateArgumentsUpdate
    validateSiteLists(arguments)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkloadTools.py", line 255, in validateSiteLists
    raise WMSpecFactoryException(msg)
WMCore.WMSpec.WMSpecErrors.WMSpecFactoryException: <@========== WMException Start ==========@>
Exception Class: WMSpecFactoryException
Message: Validation failed: The same site cannot be white and blacklisted: ['T2_CH_CERN']
	ClassName : None
	ModuleName : WMCore.WMSpec.WMWorkloadTools
	MethodName : validateSiteLists
	ClassInstance : None
	FileName : /usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkloadTools.py
	LineNumber : 255
	ErrorNr : 0

Traceback: 

<@---------- WMException End ----------@>

[10/Dec/2024:11:19:35]  SERVER REST ERROR WMCore.ReqMgr.DataStructs.RequestError.InvalidSpecParameterValue 1d6e1dca69646015783171f13d9b11eb (Invalid spec parameter value: Validation failed: The same site cannot be white and blacklisted: ['T2_CH_CERN'])
[10/Dec/2024:11:19:35]    Traceback (most recent call last):
[10/Dec/2024:11:19:35]      File "/usr/local/lib/python3.8/site-packages/WMCore/REST/Server.py", line 749, in default
[10/Dec/2024:11:19:35]        return self._call(RESTArgs(list(args), kwargs))
[10/Dec/2024:11:19:35]      File "/usr/local/lib/python3.8/site-packages/WMCore/REST/Server.py", line 828, in _call
[10/Dec/2024:11:19:35]        v(apiobj, request.method, api, param, safe)
[10/Dec/2024:11:19:35]      File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Service/Request.py", line 225, in validate
[10/Dec/2024:11:19:35]        raise InvalidSpecParameterValue(msg) from None
[10/Dec/2024:11:19:35]    WMCore.ReqMgr.DataStructs.RequestError.InvalidSpecParameterValue: InvalidSpecParameterValue 1d6e1dca69646015783171f13d9b11eb [HTTP 400, APP 1102, MSG "Invalid spec parameter value: Validation failed: The same site cannot be white and blacklisted: ['T2_CH_CERN']", INFO None, ERR None]

[2]

[10/Dec/2024:11:20:25]  Updating request "tivanov_ReReco_LumiMask_SiteListsTest_v7_241210_093438_2280" with these user-provided args: {'total_jobs': 0, 'input_events': 0, 'input_lumis': 0, 'input_num_files': 0, 'RequestName': 'tivanov_ReReco_LumiMask_SiteListsTest_v7_241210_093438_2280'}
[10/Dec/2024:11:20:25]  Error: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Service/Request.py", line 190, in validate
    self._validateRequestBase(param, safe, validate_request_update_args, requestName)
  File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Service/Request.py", line 102, in _validateRequestBase
    workload, r_args = valFunc(args, self.config, self.reqmgr_db_service, param)
  File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Utils/Validation.py", line 114, in validate_request_update_args
    workload.validateArgumentsPartialUpdate(request_args)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkload.py", line 2048, in validateArgumentsPartialUpdate
    validateArgumentsUpdate(schema, argumentDefinition, optionKey=None)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkloadTools.py", line 292, in validateArgumentsUpdate
    validateUnknownArgs(arguments, argumentDefinition)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkloadTools.py", line 323, in validateUnknownArgs
    raise WMSpecFactoryException(msg)
WMCore.WMSpec.WMSpecErrors.WMSpecFactoryException: <@========== WMException Start ==========@>
Exception Class: WMSpecFactoryException
Message: There are unknown/unsupported arguments in your request spec: ['total_jobs', 'input_lumis', 'input_num_files', 'input_events']
	ClassName : None
	ModuleName : WMCore.WMSpec.WMWorkloadTools
	MethodName : validateUnknownArgs
	ClassInstance : None
	FileName : /usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkloadTools.py
	LineNumber : 323
	ErrorNr : 0

Traceback: 

<@---------- WMException End ----------@>
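The collision described above can be reproduced with a small sketch. checkSiteLists and mergeArgs are hypothetical helpers written for this illustration; the real check is validateSiteLists in WMCore.WMSpec.WMWorkloadTools, whose exact behavior is not reproduced here. The point is only that the changed arguments pass on their own, while the merged full set fails.

```python
def asList(value):
    """Accept either a single site name or a list of site names."""
    return [value] if isinstance(value, str) else list(value)


def checkSiteLists(arguments):
    """Simplified stand-in for the site list validation."""
    overlap = sorted(set(asList(arguments.get("SiteWhitelist", []))) &
                     set(asList(arguments.get("SiteBlacklist", []))))
    if overlap:
        raise ValueError(f"The same site cannot be white and blacklisted: {overlap}")


def mergeArgs(currentArgs, changedArgs):
    """Overlay the changed arguments on top of the current ones."""
    merged = dict(currentArgs)
    merged.update(changedArgs)
    return merged


currentArgs = {"SiteWhitelist": ["T1_US_FNAL", "T2_CH_CERN"], "SiteBlacklist": []}
changedArgs = {"SiteBlacklist": "T2_CH_CERN"}

# Validating only the diff passes, because the whitelist is not part of it.
checkSiteLists(changedArgs)

# Validating the merged set exposes the collision.
try:
    checkSiteLists(mergeArgs(currentArgs, changedArgs))
    collisionDetected = False
except ValueError:
    collisionDetected = True
```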

anpicci · 2024-12-10T11:32:19Z

@todor-ivanov could you ping us once you provide the fix for the latest issue?

todor-ivanov · 2024-12-10T13:14:41Z

hi @anpicci

@todor-ivanov could you ping us once you provide the fix for the latest issue?

Yes, I will. Most probably after the training session today.

todor-ivanov · 2024-12-11T18:02:54Z

Hi @anpicci @amaltaro, here is the promised solution: Skip validating update stat arguments
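A minimal sketch of the idea, assuming the stat keys are the four seen in log [2] above. The real list is the imported ALLOWED_STAT_KEYS and may differ, and isStatUpdate is a hypothetical helper name, not the function used in the commit.

```python
# Assumption: taken from the failing request in log [2]; the real
# ALLOWED_STAT_KEYS constant in WMCore may contain other keys.
STAT_KEYS = {"total_jobs", "input_events", "input_lumis", "input_num_files"}


def isStatUpdate(requestArgs):
    """True when the request carries only stat keys (plus RequestName)."""
    keys = set(requestArgs) - {"RequestName"}
    return bool(keys) and keys <= STAT_KEYS


def needsValidation(requestArgs):
    """Stat-only updates skip the argument validation; everything else is validated."""
    return not isStatUpdate(requestArgs)
```

Being explicit about the "only stat keys" condition means a request that mixes stat keys with, say, SiteWhitelist would still go through validation.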

dmwm-bot · 2024-12-11T18:15:10Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 tests no longer failing
- 1 changes in unstable tests
Python3 Pylint check: failed
- 13 warnings and errors that must be fixed
- 17 warnings
- 306 comments to review
Pycodestyle check: succeeded
- 25 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/195/artifact/artifacts/PullRequestReport.html

todor-ivanov · 2024-12-12T15:50:44Z

@amaltaro I do not know if you have started looking into this PR, but I need your opinion at least on: #12099 (comment)

So far the code works just as expected with the fix for stat update arguments. I've tested it. On top of it here is the state of one such WQE:

Before Sitelists update:

{"_id":"2c07fe74e23bcfb33a6fe27115db496b","_rev":"3-bc9c8a587f56eecaefe67dfa12cc9375","timestamp":1733826411.0451522,"updatetime":1733826766.7949412,"thunker_encoded_json":true,"type":"WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement","WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement":{"Inputs":{"/JetHT/Run2012C-v1/RAW#e759da30-f693-11e1-847a-842b2b4671d8":["T2_CH_CERN","T2_CH_CERN_P5","T2_CH_CERN_HLT"]},"ParentFlag":false,"ParentData":{},"NumberOfLumis":51,"NumberOfFiles":2,"NumberOfEvents":6320,"Jobs":13,"OpenForNewData":false,"NoInputUpdate":false,"NoPileupUpdate":false,"Status":"Acquired","RequestName":"tivanov_TaskChain_LumiMask_multiRun_SiteListsTest_v7_241210_093515_7169","TaskName":"HLTD","Dbs":"https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader","SiteWhitelist":["T1_US_FNAL","T2_CH_CERN"],"SiteBlacklist":[],"StartPolicy":"Block","EndPolicy":{"policyName":"SingleShot"},"Priority":200000,"PileupData":{},"ProcessedInputs":[],"RejectedInputs":[],"ParentQueueId":"tivanov_TaskChain_LumiMask_multiRun_SiteListsTest_v7_241210_093515_7169","SubscriptionId":null,"EventsWritten":0,"FilesProcessed":0,"PercentComplete":0,"PercentSuccess":0,"TeamName":"testbed-vocms0290","ACDC":{},"ChildQueueUrl":"http://vocms0290.cern.ch:5984","ParentQueueUrl":"https://cmsweb-test1.cern.ch/couchdb/workqueue","WMBSUrl":null,"NumOfFilesAdded":0,"Mask":null,"TimestampFoundNewData":1733826410,"CreationTime":1733826411.0451522}}

Upon Sitelists updates:

{"_id":"2c07fe74e23bcfb33a6fe27115db496b","_rev":"4-c121956f972ed71d0435d05680ece8bb","timestamp":1733826411.0451522,"updatetime":1733826766.7949412,"thunker_encoded_json":true,"type":"WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement","WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement":{"Inputs":{"/JetHT/Run2012C-v1/RAW#e759da30-f693-11e1-847a-842b2b4671d8":["T2_CH_CERN","T2_CH_CERN_P5","T2_CH_CERN_HLT"]},"ParentFlag":false,"ParentData":{},"NumberOfLumis":51,"NumberOfFiles":2,"NumberOfEvents":6320,"Jobs":13,"OpenForNewData":false,"NoInputUpdate":false,"NoPileupUpdate":false,"Status":"Acquired","RequestName":"tivanov_TaskChain_LumiMask_multiRun_SiteListsTest_v7_241210_093515_7169","TaskName":"HLTD","Dbs":"https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader","SiteWhitelist":["T2_CH_CERN"],"SiteBlacklist":["T1_US_FNAL"],"StartPolicy":"Block","EndPolicy":{"policyName":"SingleShot"},"Priority":200000,"PileupData":{},"ProcessedInputs":[],"RejectedInputs":[],"ParentQueueId":"tivanov_TaskChain_LumiMask_multiRun_SiteListsTest_v7_241210_093515_7169","SubscriptionId":null,"EventsWritten":0,"FilesProcessed":0,"PercentComplete":0,"PercentSuccess":0,"TeamName":"testbed-vocms0290","ACDC":{},"ChildQueueUrl":"http://vocms0290.cern.ch:5984","ParentQueueUrl":"https://cmsweb-test1.cern.ch/couchdb/workqueue","WMBSUrl":null,"NumOfFilesAdded":0,"Mask":null,"TimestampFoundNewData":1733826410,"CreationTime":1733826411.0451522}}

In short, both ReqMgr and the WorkQueue elements have been updated after the arguments were properly validated, not only for format but also for value type and content. The validation process applies only to actions that involve no status update.
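The two documents above are long, so a quick way to see what changed is to diff the element fields. The sketch below uses a trimmed excerpt of the two documents (only four fields copied from them); changedFields is a hypothetical helper written for this illustration.

```python
def changedFields(before, after):
    """Return {field: (old, new)} for every field whose value differs."""
    return {key: (before.get(key), after.get(key))
            for key in sorted(set(before) | set(after))
            if before.get(key) != after.get(key)}


# Trimmed excerpts of the WorkQueueElement payloads shown above.
elementBefore = {"Status": "Acquired", "Priority": 200000,
                 "SiteWhitelist": ["T1_US_FNAL", "T2_CH_CERN"], "SiteBlacklist": []}
elementAfter = {"Status": "Acquired", "Priority": 200000,
                "SiteWhitelist": ["T2_CH_CERN"], "SiteBlacklist": ["T1_US_FNAL"]}

diff = changedFields(elementBefore, elementAfter)
for field, (old, new) in diff.items():
    print(f"{field}: {old} -> {new}")
```

Only SiteWhitelist and SiteBlacklist differ in this excerpt; in the full documents the _rev field changes as well, since CouchDB wrote a new revision.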

I am still keeping all those commits unsquashed, only because I am waiting for your answer to the question above and because I want the different steps of the solution to stay visible during the review. So if you think no further optimisations are needed, I am ready to squash them and get the PR ready for merge. If you think I should push further and optimise these calls so that they happen in one go, I am ready to do that as well.

dmwm-bot · 2024-12-12T18:19:22Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 changes in unstable tests
Python3 Pylint check: failed
- 13 warnings and errors that must be fixed
- 17 warnings
- 304 comments to review
Pycodestyle check: succeeded
- 25 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/198/artifact/artifacts/PullRequestReport.html

Fix bad updateElementsByWorkflow argument

todor-ivanov · 2024-12-12T18:35:06Z

And to put #12099 (comment) in perspective, this is what I am talking about: Update workload args only once

And here is the updated WQE:

Before:

{"_id":"d70157fad5e8778da29c600630bf8521","_rev":"3-b9e5e4642aba0e18b0644a626821d2c9","timestamp":1733826378.292129,"updatetime":1733826385.1671724,"thunker_encoded_json":true,"type":"WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement","WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement":{"Inputs":{"/NoBPTX/Run2018D-12Nov2019_UL2018-v1/MINIAOD#a9c70800-f936-4c6c-a3a8-c6fa4c9421d0":["T2_CH_CERN","T2_CH_CERN_P5","T2_CH_CERN_HLT"]},"ParentFlag":false,"ParentData":{},"NumberOfLumis":15,"NumberOfFiles":1,"NumberOfEvents":2839,"Jobs":1,"OpenForNewData":false,"NoInputUpdate":false,"NoPileupUpdate":false,"Status":"Acquired","RequestName":"tivanov_ReReco_LumiMask_SiteListsTest_v7_241210_093504_2777","TaskName":"DataProcessing","Dbs":"https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader","SiteWhitelist":["T1_US_FNAL","T2_CH_CERN"],"SiteBlacklist":[],"StartPolicy":"Block","EndPolicy":{"policyName":"SingleShot"},"Priority":600000,"PileupData":{},"ProcessedInputs":[],"RejectedInputs":[],"ParentQueueId":"tivanov_ReReco_LumiMask_SiteListsTest_v7_241210_093504_2777","SubscriptionId":null,"EventsWritten":0,"FilesProcessed":0,"PercentComplete":0,"PercentSuccess":0,"TeamName":"testbed-vocms0290","ACDC":{},"ChildQueueUrl":"http://vocms0290.cern.ch:5984","ParentQueueUrl":"https://cmsweb-test1.cern.ch/couchdb/workqueue","WMBSUrl":null,"NumOfFilesAdded":0,"Mask":null,"TimestampFoundNewData":1733826378,"CreationTime":1733826378.292129}}

After:

{"_id":"d70157fad5e8778da29c600630bf8521","_rev":"4-49f3b4a5c45a199783d2347b1a608d3a","timestamp":1733826378.292129,"updatetime":1733826385.1671724,"thunker_encoded_json":true,"type":"WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement","WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement":{"Inputs":{"/NoBPTX/Run2018D-12Nov2019_UL2018-v1/MINIAOD#a9c70800-f936-4c6c-a3a8-c6fa4c9421d0":["T2_CH_CERN","T2_CH_CERN_P5","T2_CH_CERN_HLT"]},"ParentFlag":false,"ParentData":{},"NumberOfLumis":15,"NumberOfFiles":1,"NumberOfEvents":2839,"Jobs":1,"OpenForNewData":false,"NoInputUpdate":false,"NoPileupUpdate":false,"Status":"Acquired","RequestName":"tivanov_ReReco_LumiMask_SiteListsTest_v7_241210_093504_2777","TaskName":"DataProcessing","Dbs":"https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader","SiteWhitelist":["T2_CH_CERN"],"SiteBlacklist":["T1_US_FNAL"],"StartPolicy":"Block","EndPolicy":{"policyName":"SingleShot"},"Priority":600000,"PileupData":{},"ProcessedInputs":[],"RejectedInputs":[],"ParentQueueId":"tivanov_ReReco_LumiMask_SiteListsTest_v7_241210_093504_2777","SubscriptionId":null,"EventsWritten":0,"FilesProcessed":0,"PercentComplete":0,"PercentSuccess":0,"TeamName":"testbed-vocms0290","ACDC":{},"ChildQueueUrl":"http://vocms0290.cern.ch:5984","ParentQueueUrl":"https://cmsweb-test1.cern.ch/couchdb/workqueue","WMBSUrl":null,"NumOfFilesAdded":0,"Mask":null,"TimestampFoundNewData":1733826378,"CreationTime":1733826378.292129}}

So it works!

@amaltaro - your turn

dmwm-bot · 2024-12-12T18:42:08Z

Jenkins results:

Python3 Unit tests: succeeded
- 2 changes in unstable tests
Python3 Pylint check: failed
- 13 warnings and errors that must be fixed
- 17 warnings
- 304 comments to review
Pycodestyle check: succeeded
- 25 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/199/artifact/artifacts/PullRequestReport.html

dmwm-bot · 2024-12-13T07:34:11Z

Jenkins results:

Python3 Unit tests: succeeded
- 4 changes in unstable tests
Python3 Pylint check: failed
- 13 warnings and errors that must be fixed
- 17 warnings
- 304 comments to review
Pycodestyle check: succeeded
- 25 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/201/artifact/artifacts/PullRequestReport.html

amaltaro

@todor-ivanov these changes are looking good in general. I left a few comments along the code that would be better to get them addressed though.

amaltaro · 2024-12-13T11:29:36Z

src/python/WMCore/ReqMgr/Service/Request.py

@@ -413,6 +414,8 @@ def _handleNoStatusUpdate(self, workload, request_args, dn):
        request_args will be ignored.
        """
        reqArgs = deepcopy(request_args)
+        reqStatus = workload.getStatus()
+        cherrypy.log(f"CurrentRequest status: {reqStatus}")


I think this log record should be removed before this PR gets merged.

Actually, I just noticed that by moving it a few lines up we now lose the request name in the information delivered with that message. Before, it came from the previous messages, but now that it sits in front of all of them I'd rather evolve it into a normal introductory message signaling that we enter the _handleNoStatusUpdate method.

amaltaro · 2024-12-13T11:37:13Z

src/python/WMCore/ReqMgr/Service/Request.py

-        if reqArgsNothandled:
+        try:
+            # Commit all Global WorkQueue changes per workflow in a single go:
+            self.gq_service.updateElementsByWorkflow(workload, reqArgs, status=['Available', 'Negotiating', 'Acquired'])


Without a deep inspection of the code, this looks incorrect. Why are we trying to execute updateElementsByWorkflow() even before some of the methods in the except block (e.g. _handleAssignmentStateTransition()) and/or before claiming that there might be "unhandled arguments"?

A call to updateElementsByWorkflow() should only be made if we are certain that WQEs need to be updated. The global workqueue can be a scale-limiting component of the system, so we should use it wisely.

The WMWorkloadUnhandledException is telling you if there have been any unhandled arguments. It is thrown when we call workload.updateWorkloadArgs from within updateElementsByWorkflow - here: https://github.com/dmwm/WMCore/pull/12099/files#diff-d758570ce7f6baddaeefb0bdb4e2015b06e5b7772bf929248b686adc64286233R316

And I did it this way because of your request, during our discussion of the issue, to have the workflow and the workqueue elements updated in a single push.
It was not like that until last night. And I asked you at least twice, trying to give all the needed pointers in my comments here: #12099 (comment) and here: #12099 (comment), before I proceeded with unifying those two steps in this commit: 93d0c3b (which is why I still keep all the commits unsquashed).

And I do not see how we would recognize whether there are any workqueue elements to be updated until we query the workqueue and see if there are any. It was a sequential and mandatory step in this method even before I made it all happen in one go. We were still always going through it, as long as we were called.

Usually the parameter reqArgsDiff would be the one telling us whether there is anything to be done at this stage within this method, but I remember that in the past you asked me to avoid calling the validation twice at the head of this method. Regardless, in all cases of a call to update a request with no RequestStatus update we still always return the reqArgsDiff from here:

WMCore/src/python/WMCore/ReqMgr/Utils/Validation.py, line 111 in 44d2a8a:

    return workload, reqArgsDiff

instead of the originally passed request_args and then any actions should have been halted at the very top of this method here:

WMCore/src/python/WMCore/ReqMgr/Service/Request.py, lines 417 to 419 in 44d2a8a:

        if not reqArgs:
            cherrypy.log(f"Nothing to be changed at this stage for {workload.name()}")
            return 'OK'

So if there was nothing to be updated in the workflow itself - ergo in its WQE elements of the given status, we should never reach that point in the code. But if there have been any change with the workflow arguments instead, we will have to proceed with updating the WQE elements as well.

On top of that we update only the ones in the selected statuses, rather than all of them. I think we are safe in what we are doing here.

Given that these changes are now being applied on top of your previous incomplete changes, it is a bit harder to see how things were implemented.

I don't really understand why we call _handleAssignmentStateTransition() from this _handleNoStatusUpdate() method, as it says that there is not supposed to be any status transition. Can you please explain that?

In addition, I believe that this call updateElementsByWorkflow() could become not super cheap (few secs?) if workqueue is loaded with documents. So, IMO we should only call it if really necessary. To put it in a different way, a request getting assigned (hence in assignment-approved) does not have workqueue elements.

I don't really understand why we call _handleAssignmentStateTransition() from this _handleNoStatusUpdate() method, as it says that there is not supposed to be any status transition. Can you please explain that?

Because if you have called _handleNoStatusUpdate for a workflow that is in assignment_approved, the request will carry the full set of parameters that you get during the assignment_approved state transition. That will blow up, because we decided to still support only a limited set of parameters for a no-status update: only the site list related ones and RequestPriority (which was yet again a wrong decision). This is the root cause of the BUG you reported in the first implementation, when you did not accept the code (even though I provided this same solution on the very next day). So this was the way out of the situation: to avoid rewriting everything that is already in _handleAssignmentStateTransition and assimilating it into workload.updateWorkloadArgs, and since the _handleAssignmentStateTransition method has nothing to do with the state transition action itself, it was safe enough to call it directly here. If you want, I will take this method and rewrite it in the context of workload.updateWorkloadArgs. I do not mind. Actually, I think this is the proper way to do it, and I have expressed that in the past.

And what is the relationship of a workflow in assignment_approved with updateElementsByWorkflow()? None, the workflow has not been assigned yet. Calling workqueue is a waste of resources.

If we had an issue (maybe that was with ACDC), making a call to WorkQueue won't resolve that issue. This implementation is wrong. If you want to keep it, keep it.

Alan, make up your mind.....

You asked me to reduce the calls to workload and push it all in one go. I twisted my mind to implement it. Before I did it I asked for your feedback twice... now you tell me this is wrong.

If we want to have it all correct, we must stop putting extra background knowledge into the algorithms we follow and tying ourselves in a figure 8 knot of conditions.

In order to avoid confusion and blame .... Please put the correct set of conditions here in simple English under which the work queue elements update must be called. And I will implement it.

As discussed over Zoom:

_handleAssignmentStateTransition is not performing any status transition;

and the exception WMWorkloadUnhandledException is not meant to be raised by updateElementsByWorkflow(), but by its internal call workload.updateWorkloadArgs(updateParams), if needed.
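The two points agreed over Zoom can be sketched as follows. All names here are stand-ins defined locally for illustration (the real exception, workload helper and service live in WMCore), and the fallback string merely marks where the except block would call _handleAssignmentStateTransition.

```python
class WMWorkloadUnhandledException(Exception):
    """Local stand-in for the real WMCore exception."""


class FakeWorkload:
    """Hypothetical workload that only knows three arguments."""
    HANDLED = {"RequestPriority", "SiteWhitelist", "SiteBlacklist"}

    def __init__(self):
        self.args = {}

    def updateWorkloadArgs(self, reqArgs):
        unhandled = sorted(set(reqArgs) - self.HANDLED)
        if unhandled:
            raise WMWorkloadUnhandledException(f"Unhandled arguments: {unhandled}")
        self.args.update(reqArgs)


def updateElementsByWorkflow(workload, updateParams):
    # The exception, if any, originates in this inner call and is simply
    # propagated to the caller; it is not raised by this function itself.
    workload.updateWorkloadArgs(updateParams)
    return "workqueue elements updated"


def handleNoStatusUpdate(workload, reqArgs):
    try:
        return updateElementsByWorkflow(workload, reqArgs)
    except WMWorkloadUnhandledException:
        # Here the real code would fall back to _handleAssignmentStateTransition,
        # which sets the assignment parameters without changing the status.
        return "handled as assignment arguments"
```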

Thank you for explaining this Todor. I have no more concerns on this code.

amaltaro · 2024-12-13T11:51:49Z

src/python/WMCore/Services/WorkQueue/WorkQueue.py

@@ -284,6 +284,41 @@ def updatePriority(self, wf, priority):
            wmspec.saveCouch(self.hostWithAuth, self.db.name, dummy_values)
        return

+    def updateElementsByWorkflow(self, workload, updateParams, status=None):


Instead of having two different methods for the same thing, I would suggest to converge on only one implementation for this.
Valentin already provided this site specific method: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/WorkQueue/WorkQueue.py#L240

So I would suggest to either refactor this one and change to make sure it works with WMAgent WorkflowComponent, or adapt this PR to use the method already provided by Valentin.

I quickly exchanged a message or two with @vkuznet through Mattermost, and he kindly agreed to my request to keep the current method from this PR and to adapt his own code. Thank you @vkuznet for this!

Sounds good. Valentin, when you make that pull request, please link it to the actual GH issue, which I think is #12039

amaltaro · 2024-12-13T11:56:43Z

src/python/WMCore/WMSpec/WMWorkload.py

@@ -68,6 +81,57 @@ class WMWorkloadHelper(PersistencyHelper):

    def __init__(self, wmWorkload=None):
        self.data = wmWorkload
+        self.settersMap = {}
+
+    def updateWorkloadArgs(self, reqArgs):


We need unit tests for this method.

amaltaro · 2024-12-13T11:56:56Z

src/python/WMCore/WMSpec/WMWorkload.py

@@ -1176,6 +1240,25 @@ def getDbsUrl(self):

        return getattr(self.data.request.schema, "DbsUrl")

+    def setStatus(self, status):


We need unit tests for this method as well

amaltaro · 2024-12-13T11:57:02Z

src/python/WMCore/WMSpec/WMWorkload.py

+        self.data.request.status = status
+        return
+
+    def getStatus(self):


We need unit tests for this method as well
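A possible shape for such tests, sketched against a stand-in class so that it runs without WMCore. StatusHolder is hypothetical; the setter body mirrors the diff above, while the getter body is an assumption, since the diff shows only its signature.

```python
import unittest
from types import SimpleNamespace


class StatusHolder:
    """Stand-in for WMWorkloadHelper, reduced to the two accessors under test."""

    def __init__(self):
        self.data = SimpleNamespace(request=SimpleNamespace(status=None))

    def setStatus(self, status):
        self.data.request.status = status

    def getStatus(self):
        # Assumption: the getter simply returns what the setter stored.
        return self.data.request.status


class StatusAccessorsTest(unittest.TestCase):
    def testDefaultIsNone(self):
        self.assertIsNone(StatusHolder().getStatus())

    def testRoundTrip(self):
        holder = StatusHolder()
        for status in ("assignment-approved", "acquired", "running-open"):
            holder.setStatus(status)
            self.assertEqual(holder.getStatus(), status)
```

In the real test suite the stand-in would be replaced by a workload created through the usual test helpers, and the module would be run with the standard unittest runner.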

amaltaro · 2024-12-13T12:01:10Z

src/python/WMCore/WMSpec/WMWorkload.py

@@ -1971,6 +2054,26 @@ def validateArgumentForAssignment(self, schema):
        validateArgumentsUpdate(schema, argumentDefinition)
        return

+    def validateArgumentsPartialUpdate(self, schema):


It is probably a good idea to write tests for this one as well.

dmwm-bot · 2024-12-13T14:22:27Z

Jenkins results:

Python3 Unit tests: succeeded
- 4 changes in unstable tests
Python3 Pylint check: failed
- 13 warnings and errors that must be fixed
- 17 warnings
- 304 comments to review
Pycodestyle check: succeeded
- 25 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/205/artifact/artifacts/PullRequestReport.html

dmwm-bot · 2024-12-17T16:18:28Z

Jenkins results:

Python3 Unit tests: succeeded
- 1 tests no longer failing
- 1 tests added
- 2 changes in unstable tests
Python3 Pylint check: failed
- 28 warnings and errors that must be fixed
- 18 warnings
- 452 comments to review
Pycodestyle check: succeeded
- 59 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/224/artifact/artifacts/PullRequestReport.html

todor-ivanov requested a review from amaltaro September 17, 2024 14:54

todor-ivanov mentioned this pull request Sep 18, 2024

Remake input data placement upon site list changes #12040

Closed

amaltaro requested changes Sep 19, 2024

todor-ivanov force-pushed the feature_Sitelists_UpdateWorkqueueElements_fix-12038 branch from 3ea670b to 099da8b Compare September 26, 2024 11:04

todor-ivanov requested a review from amaltaro September 26, 2024 13:25

todor-ivanov requested a review from anpicci September 26, 2024 15:15

amaltaro requested changes Sep 27, 2024

anpicci approved these changes Oct 14, 2024

todor-ivanov mentioned this pull request Oct 21, 2024

Separate state transition validation from nonState transition valdations #12120

Merged

todor-ivanov force-pushed the feature_Sitelists_UpdateWorkqueueElements_fix-12038 branch from 1f78a64 to b05def4 Compare November 19, 2024 15:03

todor-ivanov requested review from anpicci, vkuznet and amaltaro December 7, 2024 09:54

todor-ivanov added 2 commits December 9, 2024 11:46

Call reqMgr api only once && Preserve RequestStatus in the workload.

45de5af

Add arguments validation for partial workflow parameters update

9c0a81d

todor-ivanov force-pushed the feature_Sitelists_UpdateWorkqueueElements_fix-12038 branch from a44e47c to 9c0a81d Compare December 9, 2024 10:47

Use imported ALLOWED_STAT_KEYS during validation

4478201

todor-ivanov commented Dec 9, 2024

vkuznet approved these changes Dec 9, 2024

Validate full set of arguments instead of only reqargsDiff

666b5f3

Skip validating update stat arguments

fcc7d6c

todor-ivanov added the PR: squashing needed label Dec 12, 2024

Update workload args only once

93d0c3b

Fix bad updateElementsByWorkflow argument

todor-ivanov force-pushed the feature_Sitelists_UpdateWorkqueueElements_fix-12038 branch from c83b52f to 93d0c3b Compare December 12, 2024 18:27

Be explicit in the check for stat update arguments during validation

8928f8a

amaltaro requested changes Dec 13, 2024

Review Comments

7fddd87

Unit tests

b94267a

    ALLOWED_ACTIONS_FOR_STATUS = {
        "new": ["RequestPriority"],
        "assignment-approved": ["RequestPriority", "Team", "SiteWhitelist", "SiteBlacklist",
                                "AcquisitionEra", "ProcessingString", "ProcessingVersion",
                                "Dashboard", "MergedLFNBase", "TrustSitelists",
                                "UnmergedLFNBase", "MinMergeSize", "MaxMergeSize",
                                "MaxMergeEvents", "BlockCloseMaxWaitTime",
                                "BlockCloseMaxFiles", "BlockCloseMaxEvents", "BlockCloseMaxSize",
                                "SoftTimeout", "GracePeriod",
                                "TrustPUSitelists", "CustodialSites",
                                "NonCustodialSites", "Override",
                                "SubscriptionPriority"],
        "assigned": ["RequestPriority"],
        "staging": ["RequestPriority", "SiteWhitelist", "SiteBlacklist"],
        "staged": ["RequestPriority"],
        "acquired": ["RequestPriority", "SiteWhitelist", "SiteBlacklist"],
        "running-open": ["RequestPriority", "SiteWhitelist", "SiteBlacklist"],
        "running-closed": ["RequestPriority"],
        "failed": [],
        "force-complete": [],
        "completed": [],
        "closed-out": [],
        "announced": [],
        "aborted": [],
        "aborted-completed": [],
        "rejected": [],
        "normal-archived": [],
        "aborted-archived": [],
        "rejected-archived": [],
    }

    self.workqueue = WorkQueue(self.config.WorkQueueManager.couchurl,
                               self.config.WorkQueueManager.dbname)

    def getAvailableWorkflows(self):
        """Get the workflows that have all their elements
           available in the workqueue"""
        data = self.db.loadView('WorkQueue', 'elementsDetailByWorkflowAndStatus',
                                {'reduce': False, 'stale': 'update_after'})
        availableSet = set((x['value']['RequestName'], x['value']['Priority']) for x in data.get('rows', []) if
                           x['key'][1] == 'Available')
        notAvailableSet = set((x['value']['RequestName'], x['value']['Priority']) for x in data.get('rows', []) if
                              x['key'][1] != 'Available')
        return availableSet - notAvailableSet
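The set difference in getAvailableWorkflows is easier to follow on sample data. The sketch below repeats the same logic as a free function over hand-made rows; the row layout (a key whose second entry is the status, and a value carrying RequestName and Priority) is inferred from the snippet above, and the workflow names are invented for the example.

```python
def availableWorkflows(rows):
    """Return (RequestName, Priority) pairs whose elements are all 'Available'."""
    available = {(row["value"]["RequestName"], row["value"]["Priority"])
                 for row in rows if row["key"][1] == "Available"}
    notAvailable = {(row["value"]["RequestName"], row["value"]["Priority"])
                    for row in rows if row["key"][1] != "Available"}
    # A workflow with even one element in another status is dropped.
    return available - notAvailable


# Invented sample rows: wfA has two Available elements, wfB has a mix.
sampleRows = [
    {"key": ["wfA", "Available"], "value": {"RequestName": "wfA", "Priority": 200000}},
    {"key": ["wfA", "Available"], "value": {"RequestName": "wfA", "Priority": 200000}},
    {"key": ["wfB", "Available"], "value": {"RequestName": "wfB", "Priority": 600000}},
    {"key": ["wfB", "Acquired"], "value": {"RequestName": "wfB", "Priority": 600000}},
]

fullyAvailable = availableWorkflows(sampleRows)
```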
