Add setter methods map to WMWorkload and call it for all reqArg parameters #12099

Open
wants to merge 13 commits into
base: master

todor-ivanov
Contributor

@todor-ivanov todor-ivanov commented Sep 17, 2024

Fixes #12038

Status

Ready

Description

This PR adds the ability to call all the required WMWorkload setter methods from a full reqArgs dictionary passed by the upper-level caller, setting all parameters at once rather than calling every setter method one by one. This is achieved by building a map between setter methods and the request arguments they handle, and then validating that the arguments passed in the reqArgs dictionary match the signature of the setter method to be called. In the current implementation only a small set of methods is mapped to request parameters:

  • setSiteWhitelist
  • setSiteBlacklist
  • setPriority
    (These are mostly single-argument methods, but that is fine for the time being, because they fully cover the functionality we need to plug in here.)
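
For readers who want to see the shape of the idea, here is a minimal sketch of a map between request arguments and setter methods. `ToyWorkload` and `SetterTuple` are hypothetical stand-ins written for this illustration, not the actual WMWorkload code, and the error type is simplified to `ValueError`.

```python
import inspect
from collections import namedtuple

# Hypothetical named tuple mirroring the (reqArg, setterFunc, setterSignature) idea
SetterTuple = namedtuple("SetterTuple", ["reqArg", "setterFunc", "setterSignature"])


class ToyWorkload:
    """Illustrative stand-in; not the real WMWorkload class."""

    def __init__(self):
        self.priority = None
        self.siteWhitelist = []
        self.siteBlacklist = []
        self.settersMap = {}
        for reqArg, func in (("RequestPriority", self.setPriority),
                             ("SiteWhitelist", self.setSiteWhitelist),
                             ("SiteBlacklist", self.setSiteBlacklist)):
            self.settersMap[reqArg] = SetterTuple(reqArg, func, inspect.signature(func))

    def setPriority(self, priority):
        self.priority = priority

    def setSiteWhitelist(self, siteWhitelist):
        self.siteWhitelist = list(siteWhitelist)

    def setSiteBlacklist(self, siteBlacklist):
        self.siteBlacklist = list(siteBlacklist)

    def updateWorkloadArgs(self, reqArgs):
        # First pass: reject unknown arguments and calls that do not fit the signature
        for reqArg, argValue in reqArgs.items():
            if reqArg not in self.settersMap:
                raise ValueError(f"Unsupported reqArg: {reqArg}")
            self.settersMap[reqArg].setterSignature.bind(argValue)
        # Second pass: call every mapped setter
        for reqArg, argValue in reqArgs.items():
            self.settersMap[reqArg].setterFunc(argValue)


workload = ToyWorkload()
workload.updateWorkloadArgs({"RequestPriority": 1000, "SiteWhitelist": ["T1_US_FNAL"]})
```

In this sketch, supporting another single-argument parameter would only require one more entry in the tuple of (reqArg, setter) pairs.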

This change opens a path for updating all possible workqueue elements in a single go. In the WorkQueue service, an additional method was developed that fetches all possible workqueue elements and updates them with the full set of arguments provided through the reqArgs dictionary, at the cost of a single database action per WorkQueue element rather than 3 (or more) separate database calls to update each element parameter separately. Once all workqueue elements are updated with the new parameters, the WMSpec copy of the given workflow at the workqueue is also updated in a single push using the same logic.
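
The difference in database actions can be illustrated with a toy in-memory store that counts writes. Everything below (`CountingStore`, the element ids, the two helper functions) is hypothetical and only models the counting argument; it is not the WorkQueue or CouchDB API.

```python
class CountingStore:
    """Toy in-memory store that counts write operations (hypothetical)."""

    def __init__(self, docs):
        self.docs = {doc["id"]: dict(doc) for doc in docs}
        self.writes = 0

    def update(self, docId, **params):
        self.docs[docId].update(params)
        self.writes += 1


def updateOneByOne(store, ids, reqArgs):
    # One write per element and per parameter
    for docId in ids:
        for key, value in reqArgs.items():
            store.update(docId, **{key: value})


def updateInSinglePush(store, ids, reqArgs):
    # One write per element, carrying all parameters together
    for docId in ids:
        store.update(docId, **reqArgs)


reqArgs = {"RequestPriority": 1000, "SiteWhitelist": ["T1_US_FNAL"], "SiteBlacklist": []}
elements = [{"id": "e1"}, {"id": "e2"}]

separate = CountingStore(elements)
updateOneByOne(separate, ["e1", "e2"], reqArgs)

single = CountingStore(elements)
updateInSinglePush(single, ["e1", "e2"], reqArgs)
```

With two elements and three parameters, the first store records six writes and the second records two, while both end up with identical documents.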

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

None

External dependencies / deployment changes

None

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 10 warnings and errors that must be fixed
    • 9 warnings
    • 195 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 16 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15218/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
  • Python3 Pylint check: failed
    • 11 warnings and errors that must be fixed
    • 9 warnings
    • 212 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15219/artifact/artifacts/PullRequestReport.html

@todor-ivanov
Contributor Author

@amaltaro Please take a look at this PR. I think it fully covers all the requirements and addresses our concerns about affecting scalability through increased database calls.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 7 warnings and errors that must be fixed
    • 9 warnings
    • 211 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15220/artifact/artifacts/PullRequestReport.html


# Commit the changes of the current workload object to the database:
workload.saveCouchUrl(workload.specUrl())

# Commit all Global WorkQueue changes per workflow in a single go:
self.gq_service.updateElementsByWorkflow(workload.name(), reqArgs, status=['Available'])

Contributor

With this call, you will actually change the behavior of priority to get only applied to Available elements.

In addition, do you ensure now that reqArgs will always be only the supported parameters?

Contributor Author

With this call, you will actually change the behavior of priority to get only applied to Available elements.

Well, does it even matter? As we discussed a few days ago: do we actually know whether we continue to update any WorkQueue element's parameters at the agent once it is acquired from the GlobalWorkQueue? If we do, then it matters for the site lists as well. IIRC, during our discussion we decided to stay safe and update only Available WQEs, rather than risk race conditions related to input data location etc. If we do not care, I can remove the requirement for status=['Available']. But if we care to propagate the priority information down to the agent for every WQE status (Acquired et al.), while updating site information only at the Global WQ, then we simply cannot have a mechanism to do it in a single push; the two calls must be separated. So we cannot have both:

  • a single DB action for updating WQE
  • different WQE status requirement based on request parameters
    (A way out of this is to hardcode it in the WorkQueue module logic of updateElementsByWorkflow, but you know my opinion on hardcoding requirements in the code.)

In addition, do you ensure now that reqArgs will always be only the supported parameters?

As long as we follow the same logic as we did while validating the allowed parameters in the first set of actions in these calls (meaning to call validate_request_update_args before calling the put method and hence _handleNoStatusUpdate) then we are keeping the same logic as before - hence following the map of allowed actions from

ALLOWED_ACTIONS_FOR_STATUS = {
    "new": ["RequestPriority"],
    "assignment-approved": ["RequestPriority", "Team", "SiteWhitelist", "SiteBlacklist",
                            "AcquisitionEra", "ProcessingString", "ProcessingVersion",
                            "Dashboard", "MergedLFNBase", "TrustSitelists",
                            "UnmergedLFNBase", "MinMergeSize", "MaxMergeSize",
                            "MaxMergeEvents", "BlockCloseMaxWaitTime",
                            "BlockCloseMaxFiles", "BlockCloseMaxEvents", "BlockCloseMaxSize",
                            "SoftTimeout", "GracePeriod",
                            "TrustPUSitelists", "CustodialSites",
                            "NonCustodialSites", "Override",
                            "SubscriptionPriority"],
    "assigned": ["RequestPriority"],
    "staging": ["RequestPriority", "SiteWhitelist", "SiteBlacklist"],
    "staged": ["RequestPriority"],
    "acquired": ["RequestPriority", "SiteWhitelist", "SiteBlacklist"],
    "running-open": ["RequestPriority", "SiteWhitelist", "SiteBlacklist"],
    "running-closed": ["RequestPriority"],
    "failed": [],
    "force-complete": [],
    "completed": [],
    "closed-out": [],
    "announced": [],
    "aborted": [],
    "aborted-completed": [],
    "rejected": [],
    "normal-archived": [],
    "aborted-archived": [],
    "rejected-archived": [],
}
For this logic we mostly needed the 3 setter methods that I have already put in the reqArgs-to-setter-method map at updateWorkloadArgs:

self.settersMap['RequestPriority'] = setterTuple('RequestPriority', self.setPriority, inspect.signature(self.setPriority))
self.settersMap['SiteBlacklist'] = setterTuple('SiteBlacklist', self.setSiteBlacklist, inspect.signature(self.setSiteBlacklist))
self.settersMap['SiteWhitelist'] = setterTuple('SiteWhitelist', self.setSiteWhitelist, inspect.signature(self.setSiteWhitelist))
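
As a rough illustration of how a per-status allow-list can screen a reqArgs dictionary before any setter is reached, consider the sketch below. `unsupportedArgs` is a hypothetical helper written for this example (it is not `validate_request_update_args`), and the dictionary is only an excerpt of the map quoted above.

```python
# Excerpt of the map quoted above; only three statuses are reproduced here
ALLOWED_ACTIONS_FOR_STATUS = {
    "new": ["RequestPriority"],
    "staging": ["RequestPriority", "SiteWhitelist", "SiteBlacklist"],
    "failed": [],
}


def unsupportedArgs(status, reqArgs):
    """Return the sorted reqArgs keys that are not allowed for the given status."""
    allowed = set(ALLOWED_ACTIONS_FOR_STATUS.get(status, []))
    return sorted(arg for arg in reqArgs if arg not in allowed)


okForStaging = unsupportedArgs("staging", {"RequestPriority": 1, "SiteWhitelist": []})
rejectedForNew = unsupportedArgs("new", {"RequestPriority": 1, "SiteWhitelist": []})
rejectedForFailed = unsupportedArgs("failed", {"RequestPriority": 1})
```

An empty result means every argument is allowed for that status; a non-empty one lists what the caller should reject.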

If we want to be fully exhaustive about setters for every possible reqArg per status, we may also include in this map the methods for all actions allowed in the assignment-approved status:

    "assignment-approved": ["RequestPriority", "Team", "SiteWhitelist", "SiteBlacklist",
                            "AcquisitionEra", "ProcessingString", "ProcessingVersion",
                            "Dashboard", "MergedLFNBase", "TrustSitelists",
                            "UnmergedLFNBase", "MinMergeSize", "MaxMergeSize",
                            "MaxMergeEvents", "BlockCloseMaxWaitTime",
                            "BlockCloseMaxFiles", "BlockCloseMaxEvents", "BlockCloseMaxSize",
                            "SoftTimeout", "GracePeriod",
                            "TrustPUSitelists", "CustodialSites",
                            "NonCustodialSites", "Override",
                            "SubscriptionPriority"],

But for some of them the logic may need to change, because some of those setters take more than a single argument. At the same time, we already know that for assignment-approved we would never call _handleNoStatusUpdate. That is why it seems safe to me to proceed with only those 3 setters mapped. Of course, if you ask me, I am completely up for implementing the whole logic here in a more generic way for all status updates, then getting rid of a big chunk of code covering custom cases, and only making the proper calls to this generic method from upstream modules (e.g. Request in the current case). But I do not think we will have the time to do this during this line of development.

Contributor

We might be able to distribute this update, but before answering how we can proceed, can you please check the JobUpdater code to check:

  1. in which databases are the WQEs updated?
  2. does it use any status filter? If so, which statuses?

Contributor Author

@todor-ivanov todor-ivanov Sep 26, 2024

hi @amaltaro,
ok, Here it is:

which in this case is (taken from one agent configuration):

config.WorkQueueManager.dbname = 'workqueue'

So the inbox db is indeed configured at the WorkQueueManager component at the agent, but since I do not see any actual actions on it during this update from the updatepriority method, I do not think we even touch this DB while updating the WQE priority at the agents.

  • I found no filters being applied to any of the WQEs, but the workflows considered are the ones returned by:
    workflowsToCheck = self.workqueue.getAvailableWorkflows()

    and:
    def getAvailableWorkflows(self):
        """Get the workflows that have all their elements
           available in the workqueue"""
        data = self.db.loadView('WorkQueue', 'elementsDetailByWorkflowAndStatus',
                                {'reduce': False, 'stale': 'update_after'})
        availableSet = set((x['value']['RequestName'], x['value']['Priority']) for x in data.get('rows', []) if
                           x['key'][1] == 'Available')
        notAvailableSet = set((x['value']['RequestName'], x['value']['Priority']) for x in data.get('rows', []) if
                              x['key'][1] != 'Available')
        return availableSet - notAvailableSet
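
To make the effect of that set difference concrete, here is a small standalone sketch run on made-up rows. The row layout (status in `key[1]`, `RequestName` and `Priority` under `value`) follows the snippet above; the workflow names, the first key element and the non-Available status value are invented for the example.

```python
def availableWorkflows(rows):
    """Keep (name, priority) pairs that appear only with Available elements."""
    available = {(row["value"]["RequestName"], row["value"]["Priority"])
                 for row in rows if row["key"][1] == "Available"}
    notAvailable = {(row["value"]["RequestName"], row["value"]["Priority"])
                    for row in rows if row["key"][1] != "Available"}
    return available - notAvailable


# Made-up view rows: wfA has only Available elements, wfB has a mix
rows = [
    {"key": ["wfA", "Available"], "value": {"RequestName": "wfA", "Priority": 100}},
    {"key": ["wfA", "Available"], "value": {"RequestName": "wfA", "Priority": 100}},
    {"key": ["wfB", "Available"], "value": {"RequestName": "wfB", "Priority": 200}},
    {"key": ["wfB", "Running"], "value": {"RequestName": "wfB", "Priority": 200}},
]

result = availableWorkflows(rows)
```

Here `result` contains only `("wfA", 100)`, because wfB also has an element outside the Available status.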

# call the proper setter methods bellow.

# populate the current instance settersMap
self.settersMap['RequestPriority'] = setterTuple('RequestPriority', self.setPriority, inspect.signature(self.setPriority))

Contributor

Given that you are already checking the parameter name and associating it to the actual function (plus inspecting its signature), isn't it more clear to simply call the actual method according to the parameter?

Contributor Author

But this is exactly what happens here. It is a few lines below, but it is still the same:

                self.settersMap[reqArg].setterFunc(argValue)

Contributor

Yes, I understand it. What I am saying is that creating a map and passing a pointer to a function does not look as friendly as a classic if-elif dealing with the 3 parameters.

Contributor Author

@todor-ivanov todor-ivanov Sep 26, 2024

Creating a map gives the option of following the same pattern for adding as many setter calls (workflow parameter setters) as one may wish, by providing a generic method for validating and assembling the calls without changing the logic of the rest of the function (i.e. not forking into hundreds of elifs and falling into the trap of code growing with the number of custom use cases). Here, adding a new setter is just a matter of adding one more line to the map, following the same pattern as the rest of the map, i.e.:

a named tuple of: ('reqArg', 'setterFunc', 'setterSignature') 

So it is a pretty universal way, I'd say, and it serves exactly the purpose needed here.

Contributor

I understand the rationale for implementing something generic, but right now I am not aware of any other spec parameters that will have to be changed, so it's definitely not hundreds of future elifs.

In addition, I am very much a follower of the KISS principle, whenever possible and fit. To me, this overly complex code (personally I don't know any of those inspect methods) can easily lead to a bug, or to buggy development.

Contributor

@amaltaro , @anpicci , @todor-ivanov

Given that Python (or any programming language) offers multiple ways to achieve the same outcome, evaluating the best approach can often be subjective, depending on factors such as experience, proficiency, or personal style. Consequently, these discussions can become matters of personal preference.

For example, choosing between a full for-loop and a list comprehension is a matter of choice. While we can debate which one might be "better" based on various criteria, both serve the same functional purpose. In this instance, I feel this particular conversation is similar, and rather than debating stylistic choices, the focus should be on whether the code achieves its intended purpose. Otherwise, we risk undermining the role of both developer and reviewer by favoring one approach over another, which I find counterproductive.

If there are concerns about correctness or functionality, I would recommend incorporating unit tests for validation instead of continuing to focus on the specific implementation style.

Contributor

This is not about coding style and personal preferences; I see it as a matter of code readability and sustainability.

Anyhow, thank you for sharing your perspective on this matter. I won't bug people anymore on this.

Contributor Author

Alan,

This:

        try:
            if 'RequestPriority' in reqArgs:
                self.setPriority(reqArgs['RequestPriority'])
            if 'SiteBlacklist' in reqArgs:
                self.setSiteBlacklist(reqArgs['SiteBlacklist'])
            if 'SiteWhitelist' in reqArgs:
                self.setSiteWhitelist(reqArgs['SiteWhitelist'])
        except Exception as ex:
            msg = f"Failure to update workload parameters. Details: {str(ex)}"
            raise WMWorkloadException(msg) from None

is simply not equal to this:

        for reqArg, argValue in reqArgs.items():
            if not self.settersMap.get(reqArg, None):
                msg = f"Unsupported or missing setter method for updating reqArg: {reqArg}."
                raise WMWorkloadException(msg)
            try:
                self.settersMap[reqArg].setterSignature.bind(argValue)
            except TypeError as ex:
                msg = f"Setter's method signature does not match the method calls we currently support: Error: req{str(ex)}"
                raise WMWorkloadException(msg) from None

        # Now go through the reqArg again and call every setter method according to the map
        for reqArg, argValue in reqArgs.items():
            try:
                self.settersMap[reqArg].setterFunc(argValue)
            except Exception as ex:
                currFrame = inspect.currentframe()
                argsInfo = inspect.getargvalues(currFrame)
                argVals = {arg: argsInfo.locals.get(arg) for arg in argsInfo.args}
                msg = f"Failure while calling setter method {self.settersMap[reqArg].setterFunc.__name__} "
                msg += f"With arguments: {argVals}"
                msg += f"Full exception string: {str(ex)}"
                raise WMWorkloadException(msg) from None

In the end, if the call to the setter method is wrong, both implementations will fail the call, but the latter is much more descriptive. It recognizes (at least 3) different possible failure conditions and forwards the proper message for later debugging, while the former just fails the call and masks all the relevant information from whoever has to chase eventual problems. That was the exact reason why I went for this implementation. I know you'd always prefer the shorter version, but sometimes it hides valuable information.
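
For anyone unfamiliar with the inspect calls involved, the standalone sketch below shows what `Signature.bind` does and does not check. The `setPriority` function here is a one-line stand-in defined only for this example.

```python
import inspect


def setPriority(priority):
    """One-line stand-in for a single-argument setter."""
    return priority


sig = inspect.signature(setPriority)

# bind() succeeds when the arguments fit the signature
bound = sig.bind(5000)
boundArgs = dict(bound.arguments)

# bind() raises TypeError when they do not, without calling the function
try:
    sig.bind(5000, "extra")
    bindError = None
except TypeError as ex:
    bindError = str(ex)

# bind() checks only the shape of the call, not the value or its type
stillBinds = dict(sig.bind("not-a-number").arguments)
```

Note that `bind` validates only how the arguments map onto parameters; checking the values themselves remains a separate step.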

Contributor Author

One more thing: since this is supposed to be a method triggered by a user's call rather than an operation triggered by the internals of our system, the latter method inspects the stack and gives you the values of all parameters in the call, meaning it shows you the exact (eventual) user mistake.
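
A minimal sketch of that stack inspection, using a hypothetical `describeCall` function: `inspect.currentframe()` returns the frame of the function in which it is called, and `inspect.getargvalues()` exposes that function's argument names and current values.

```python
import inspect


def describeCall(reqArg, argValue):
    """Return the names and values of this function's own arguments."""
    frame = inspect.currentframe()
    argsInfo = inspect.getargvalues(frame)
    # argsInfo.args holds the argument names, argsInfo.locals their current values
    return {arg: argsInfo.locals.get(arg) for arg in argsInfo.args}


callInfo = describeCall("RequestPriority", 1000)
```

Because the frame belongs to the function doing the inspection, the reported values are those of that function's arguments rather than of any function it calls.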

Contributor

Talking about inspection and validation. Validation of the user parameters is not supposed to happen at this layer (at the setter call). Validation should be performed in the Service/Request class (which is likely calling the Validation utils).

With that said, I just noticed that we are not validating the site lists, as we usually do for workflow assignment.
So we should add that validation in here as well, to protect the system and us from unneeded debugging. For reference, this is how we define the site lists and the validation function:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/StdBase.py#L1147

@todor-ivanov todor-ivanov force-pushed the feature_Sitelists_UpdateWorkqueueElements_fix-12038 branch from 3ea670b to 099da8b Compare September 26, 2024 11:04
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 7 warnings and errors that must be fixed
    • 9 warnings
    • 211 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15240/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
  • Python3 Pylint check: failed
    • 7 warnings and errors that must be fixed
    • 9 warnings
    • 211 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15241/artifact/artifacts/PullRequestReport.html

@todor-ivanov
Contributor Author

hi @amaltaro I have addressed your comments - Mostly, I removed the WQE status filter, based on the investigation I did on the JobUpdater component. You may take another look.

Contributor

@amaltaro amaltaro left a comment

Todor, please find more comments along the code.

Once we are about to converge on this development, I think it is important to test it in a Dev environment (both for central services and WMAgent).


# Update all WorkQueue elements with the parameters provided in a single push
if elementsToUpdate:
    self.updateElements(*elementsToUpdate, **updateParams)

Contributor

I believe I've seen you answering my question about these updateParams.
Are we completely 100% sure that the only parameters that will reach this update are one and/or a combination of:

  • RequestPriority
  • SiteWhitelist
  • SiteBlacklist
    ?
    Of course, also considering that we only want to update these values when there is an actual update to the value.

In addition, I would like you to test this in WMAgent with a crafted workflow/scenario, please.

Contributor Author

I believe I've seen you answering my question about these updateParams

Yes the answer is the second one in line, here from this comment: #12099 (comment)

In addition, I would like you to test this in WMAgent with a crafted workflow/scenario, please.

I had it tested. With a full validation campaign, not a single workflow, in my dev cluster.

Contributor

I could not find an exact answer to the question above.
Additionally, the code you are deleting had an else statement populating a list of not handled arguments:
https://github.com/dmwm/WMCore/pull/12099/files#diff-120ee6838284a3d1c1799f511da7f147179d0a955f87d0da6fc8b58a8b66c794L440
which makes me believe that it can receive parameters other than only those 3.

About the validation, to properly validate these changes, we need to actually trigger specific scenarios and actions. The standard "campaign-style" validation is not going to expose any issues with this code.

Contributor

@anpicci anpicci Oct 14, 2024

@todor-ivanov @amaltaro I don't know if you have already converged on a common ground, but here is my opinion about map vs if-elif-else approach.

Considering that I am not as experienced as both of you with the WMCore software, to me the map approach looks clearer and more understandable than the if-elif-else approach. Indeed, I can read which parameters are of interest at the very beginning, and from these lines I can see that only one parameter among RequestPriority, SiteWhitelist, SiteBlacklist can be modified. The whole process doesn't require more intellectual effort than the previous if-elif-else approach, and in addition it sounds to me more robust and easier to debug (where by "debug" I also mean "actually fix a potential issue").

Regarding the possible concerns:

I believe I've seen you answering my question about these updateParams. Are we completely 100% sure that the only parameters that will reach this update are one and/or a combination of:
* RequestPriority
* SiteWhitelist
* SiteBlacklist
?
Of course, also considering that we only want to update these values when there is an actual update to the value.

Given the original code, we can assume that this is a complete list of parameters that are supposed to reach this update, provided that what we want to get with this PR doesn't require other parameters to be properly updated

Additionally, the code you are deleting had an else statement populating a list of not handled arguments:
https://github.com/dmwm/WMCore/pull/12099/files#diff-120ee6838284a3d1c1799f511da7f147179d0a955f87d0da6fc8b58a8b66c794L440
which makes me believe that it can receive parameters other than only those 3.

This occurrence should be handled by the try-except in these lines, right @todor-ivanov? @amaltaro do you have any concern that this approach could fail to prevent updating parameters other than RequestPriority, SiteWhitelist, and SiteBlacklist?

Contributor

@anpicci I think you meant to send this reply in this thread instead https://github.com/dmwm/WMCore/pull/12099/files#r1766081573 ?

Contributor Author

@todor-ivanov todor-ivanov Dec 9, 2024

To answer again the original question here:

With the latest changes in this commit: dfbda2a it is now 100% sure that if anything but a supported argument update reaches that point, we will raise the proper exception and it will be handled accordingly in the caller method

'reduce': False})

# Fetch only a list of WorkQueue element Ids && Filter them by allowed status
if status:

Contributor

I like that we are not breaking the current behavior of RequestPriority update.

However, thinking twice about this, if we want to make these updates more efficient, we could update it only for elements in Available, Negotiating and Acquired, from the list of potential statuses for WQE:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/DataStructs/WorkQueueElement.py#L14

If you change it, could you please also update the Issue description (because it says only Available).
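
A status filter of that kind could look roughly like the sketch below. The element layout (`id` and `Status` keys), the `selectElementIds` helper and the sample data are assumptions made for illustration; they are not the WorkQueue implementation.

```python
# Statuses named in the comment above
UPDATABLE_STATUSES = ("Available", "Negotiating", "Acquired")


def selectElementIds(elements, status=None):
    """Return element ids, optionally restricted to the given statuses."""
    if not status:
        # No filter requested: every element is selected
        return [element["id"] for element in elements]
    allowed = set(status)
    return [element["id"] for element in elements if element["Status"] in allowed]


# Sample elements with made-up ids
elements = [
    {"id": "e1", "Status": "Available"},
    {"id": "e2", "Status": "Negotiating"},
    {"id": "e3", "Status": "Acquired"},
    {"id": "e4", "Status": "Running"},
    {"id": "e5", "Status": "Done"},
]

filteredIds = selectElementIds(elements, status=UPDATABLE_STATUSES)
allIds = selectElementIds(elements)
```

Passing no status keeps the unfiltered behavior, so a caller can choose either mode with the same call.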

Contributor Author

ok

Contributor Author

done

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
  • Python3 Pylint check: failed
    • 7 warnings and errors that must be fixed
    • 9 warnings
    • 211 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15249/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Can one of the admins verify this patch?

@anpicci
Contributor

anpicci commented Oct 14, 2024

@todor-ivanov apart from my comment on @amaltaro's review, I only suggest resolving the conflicts on src/python/WMCore/ReqMgr/Service/Request.py, if you have not already planned to address that.

@anpicci
Contributor

anpicci commented Oct 17, 2024

@vkuznet thank you for the feedback.
@todor-ivanov @amaltaro , I propose to proceed as follows:

  • let's stick with Todor's original proposal to use maps;
  • @amaltaro will focus his review on checking that this implementation is valid and consistent with the rest of the system;
  • at the same time, @todor-ivanov provides documentation, both for the external packages introduced in the PR and in the form of docstrings and in-line comments where the developer and the reviewer find them necessary, so that the code is easier to maintain;
  • once the Jenkins tests pass and the PR is merged, @todor-ivanov is responsible for fixing issues that arise during future release validations and are ascribable to this PR, or at least for investigating them and providing guidance on resolving them.

I want to clarify that I would like to adopt these accountability principles to every issue, meaning that this is not a special treatment outlined only for this PR. I will follow up on this during the next group meeting.

Thanks!

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 8 warnings and errors that must be fixed
    • 16 warnings
    • 260 comments to review
  • Pycodestyle check: succeeded
    • 27 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/87/artifact/artifacts/PullRequestReport.html

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 9 warnings and errors that must be fixed
    • 16 warnings
    • 260 comments to review
  • Pycodestyle check: succeeded
    • 27 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/88/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor

amaltaro commented Dec 3, 2024

It looks like I forgot to update this PR with some previous reviews that have been done here and there, so here we go.

There are some tasks pending from the other related PR, which have been described/discussed in this thread:
#12120 (review)

@dmwm-bot

dmwm-bot commented Dec 7, 2024

Jenkins results:

  • Python3 Unit tests: succeeded
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 10 warnings and errors that must be fixed
    • 16 warnings
    • 255 comments to review
  • Pycodestyle check: succeeded
    • 20 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/162/artifact/artifacts/PullRequestReport.html

@dmwm-bot

dmwm-bot commented Dec 7, 2024

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 14 warnings and errors that must be fixed
    • 17 warnings
    • 307 comments to review
  • Pycodestyle check: succeeded
    • 25 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/163/artifact/artifacts/PullRequestReport.html

@todor-ivanov
Contributor Author

todor-ivanov commented Dec 7, 2024

Hi @amaltaro @vkuznet @anpicci
This Review call is mostly for @amaltaro and also to update the others on the latest development triggered by his review comment here: #12099 (comment) linking to this one: #12120 (review) , which we agreed to address in the current issue, rather than prolonging the transient development in the previous one.

So here is a quick summary:

  • With this commit: Restore unhandled arguments mechanism
    I am restoring the previous behavior of stopping any workflow changes when an update call contains unsupported parameters, while still signaling which parameters we do not handle.
    NOTE: any non-status update parameters for changes of workflows in assignment_approved status are still treated separately

  • With this commit: Call reqMgr api only once && Preserve RequestStatus in the workload.
    I am addressing @amaltaro's request to reduce the calls to the ReqMgr APIs to a single one, made only on demand and as early as possible, during the validation step. The workflow status is then preserved in the .request section of the workload object and is accessible through the relevant methods throughout the rest of the process.

  • With this commit: Add arguments validation for partial workflow parameters update
    I am addressing the request to validate all the values for those partial workflow parameter updates.
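
The "call only once and preserve the status" idea from the second bullet can be sketched as follows. This is a minimal illustration and not the code from the commit: `ToyReqMgrClient`, `ToyWorkload`, `validate` and `handleUpdate` are hypothetical stand-ins, and only the `setStatus`/`getStatus` accessors and the `.request.status` location mirror what this PR adds to the workload helper.

```python
from types import SimpleNamespace


class ToyReqMgrClient:
    """Hypothetical stand-in for a ReqMgr lookup; counts how often it is called."""

    def __init__(self, statuses):
        self.statuses = statuses
        self.calls = 0

    def getRequestStatus(self, requestName):
        self.calls += 1
        return self.statuses[requestName]


class ToyWorkload:
    """Hypothetical stand-in for the workload helper; status lives under .request."""

    def __init__(self, name):
        self._name = name
        self.data = SimpleNamespace(request=SimpleNamespace(status=None))

    def name(self):
        return self._name

    def setStatus(self, status):
        self.data.request.status = status

    def getStatus(self):
        return self.data.request.status


def validate(workload, client):
    """Validation step: the only place that queries the client."""
    workload.setStatus(client.getRequestStatus(workload.name()))
    return workload


def handleUpdate(workload):
    """Later step: reads the preserved status, with no further lookup."""
    return f"{workload.name()} is in status {workload.getStatus()}"
```

Every step after validation reads the status from the workload object, so the number of remote lookups stays at one no matter how many later steps need it.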

@todor-ivanov todor-ivanov force-pushed the feature_Sitelists_UpdateWorkqueueElements_fix-12038 branch from a44e47c to 9c0a81d Compare December 9, 2024 10:47
@dmwm-bot

dmwm-bot commented Dec 9, 2024

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 17 warnings
    • 307 comments to review
  • Pycodestyle check: succeeded
    • 25 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/168/artifact/artifacts/PullRequestReport.html

@dmwm-bot

dmwm-bot commented Dec 9, 2024

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 17 warnings
    • 306 comments to review
  • Pycodestyle check: succeeded
    • 25 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/171/artifact/artifacts/PullRequestReport.html


# Commit the changes of the current workload object to the database:
workload.saveCouchUrl(workload.specUrl())

# Commit all Global WorkQueue changes per workflow in a single go:
self.gq_service.updateElementsByWorkflow(workload.name(), reqArgs, status=['Available', 'Negotiating', 'Acquired'])
Contributor Author

@todor-ivanov todor-ivanov Dec 9, 2024

Hi @amaltaro, the line that I was talking about during the WMCore meeting is this one. It triggers a second call to workload.updateWorkloadArgs internally, similar to what is done on line 431 of WMCore.ReqMgr.Service.Request in the current PR.

The internal call happens at line 316 of WMCore.Services.WorkQueue, again in the very same PR.

You asked me to implement all the changes to both ReqMgr (in this case through the workload object) and all WQEs in a single push. It is now possible. All that needs to happen is to substitute the first call from above with the second call from the current line. I have the code change prepared. Just let me know if I should proceed with it, or if we should play it safe and still do the action in two steps.

Contributor

@vkuznet vkuznet left a comment

I’ve reviewed the code provided in this PR and, from a coding perspective, I don’t have any specific suggestions for improvement. I noticed the ongoing discussion between Alan and Todor on various related topics; however, I believe my input may not add significant value to that part of the conversation.

From the standpoint of pure code review, the implementation looks good and is ready to be merged. I don’t have a strong preference between Todor's map/namedtuple/setter approach or Alan's if/else flow, as I view this as a matter of personal or team preference.

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 17 warnings
    • 306 comments to review
  • Pycodestyle check: succeeded
    • 25 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/179/artifact/artifacts/PullRequestReport.html

@todor-ivanov
Contributor Author

todor-ivanov commented Dec 10, 2024

hi @amaltaro
I just found a misbehavior when we validate only the changed arguments: it allows some unsupported changes to sneak in. For example, changing only the SiteBlacklist and leaving the SiteWhitelist untouched would allow a workflow change to propagate a collision between the two lists. I created an extra fix for that: 666b5f3. It works [1], but unfortunately it also affects the stat arguments [2]. I need to think of a way out of this.

[1]

[10/Dec/2024:11:19:34]  Updating request "tivanov_TaskChain_LumiMask_multiRun_SiteListsTest_v7_241210_093515_7169" with these user-provided args: {'RequestPriority': 200000, 'SiteWhitelist': ['T1_US_FNAL', 'T2_CH_CERN'], 'SiteBlacklist': 'T2_CH_CERN'}
[10/Dec/2024:11:19:35]  Error: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Service/Request.py", line 190, in validate
    self._validateRequestBase(param, safe, validate_request_update_args, requestName)
  File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Service/Request.py", line 102, in _validateRequestBase
    workload, r_args = valFunc(args, self.config, self.reqmgr_db_service, param)
  File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Utils/Validation.py", line 114, in validate_request_update_args
    workload.validateArgumentsPartialUpdate(request_args)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkload.py", line 2048, in validateArgumentsPartialUpdate
    validateArgumentsUpdate(schema, argumentDefinition, optionKey=None)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkloadTools.py", line 294, in validateArgumentsUpdate
    validateSiteLists(arguments)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkloadTools.py", line 255, in validateSiteLists
    raise WMSpecFactoryException(msg)
WMCore.WMSpec.WMSpecErrors.WMSpecFactoryException: <@========== WMException Start ==========@>
Exception Class: WMSpecFactoryException
Message: Validation failed: The same site cannot be white and blacklisted: ['T2_CH_CERN']
	ClassName : None
	ModuleName : WMCore.WMSpec.WMWorkloadTools
	MethodName : validateSiteLists
	ClassInstance : None
	FileName : /usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkloadTools.py
	LineNumber : 255
	ErrorNr : 0

Traceback: 

<@---------- WMException End ----------@>

[10/Dec/2024:11:19:35]  SERVER REST ERROR WMCore.ReqMgr.DataStructs.RequestError.InvalidSpecParameterValue 1d6e1dca69646015783171f13d9b11eb (Invalid spec parameter value: Validation failed: The same site cannot be white and blacklisted: ['T2_CH_CERN'])
[10/Dec/2024:11:19:35]    Traceback (most recent call last):
[10/Dec/2024:11:19:35]      File "/usr/local/lib/python3.8/site-packages/WMCore/REST/Server.py", line 749, in default
[10/Dec/2024:11:19:35]        return self._call(RESTArgs(list(args), kwargs))
[10/Dec/2024:11:19:35]      File "/usr/local/lib/python3.8/site-packages/WMCore/REST/Server.py", line 828, in _call
[10/Dec/2024:11:19:35]        v(apiobj, request.method, api, param, safe)
[10/Dec/2024:11:19:35]      File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Service/Request.py", line 225, in validate
[10/Dec/2024:11:19:35]        raise InvalidSpecParameterValue(msg) from None
[10/Dec/2024:11:19:35]    WMCore.ReqMgr.DataStructs.RequestError.InvalidSpecParameterValue: InvalidSpecParameterValue 1d6e1dca69646015783171f13d9b11eb [HTTP 400, APP 1102, MSG "Invalid spec parameter value: Validation failed: The same site cannot be white and blacklisted: ['T2_CH_CERN']", INFO None, ERR None]

[2]

[10/Dec/2024:11:20:25]  Updating request "tivanov_ReReco_LumiMask_SiteListsTest_v7_241210_093438_2280" with these user-provided args: {'total_jobs': 0, 'input_events': 0, 'input_lumis': 0, 'input_num_files': 0, 'RequestName': 'tivanov_ReReco_LumiMask_SiteListsTest_v7_241210_093438_2280'}
[10/Dec/2024:11:20:25]  Error: Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Service/Request.py", line 190, in validate
    self._validateRequestBase(param, safe, validate_request_update_args, requestName)
  File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Service/Request.py", line 102, in _validateRequestBase
    workload, r_args = valFunc(args, self.config, self.reqmgr_db_service, param)
  File "/usr/local/lib/python3.8/site-packages/WMCore/ReqMgr/Utils/Validation.py", line 114, in validate_request_update_args
    workload.validateArgumentsPartialUpdate(request_args)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkload.py", line 2048, in validateArgumentsPartialUpdate
    validateArgumentsUpdate(schema, argumentDefinition, optionKey=None)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkloadTools.py", line 292, in validateArgumentsUpdate
    validateUnknownArgs(arguments, argumentDefinition)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkloadTools.py", line 323, in validateUnknownArgs
    raise WMSpecFactoryException(msg)
WMCore.WMSpec.WMSpecErrors.WMSpecFactoryException: <@========== WMException Start ==========@>
Exception Class: WMSpecFactoryException
Message: There are unknown/unsupported arguments in your request spec: ['total_jobs', 'input_lumis', 'input_num_files', 'input_events']
	ClassName : None
	ModuleName : WMCore.WMSpec.WMWorkloadTools
	MethodName : validateUnknownArgs
	ClassInstance : None
	FileName : /usr/local/lib/python3.8/site-packages/WMCore/WMSpec/WMWorkloadTools.py
	LineNumber : 323
	ErrorNr : 0

Traceback: 

<@---------- WMException End ----------@>
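
The collision described in this comment can be illustrated with a small sketch: checking only the changed arguments misses an overlap with a value that is already stored, while checking the changes merged over the current values catches it. The helper names below (`_asList`, `checkSiteListOverlap`, `validatePartialUpdate`) are hypothetical and are not the WMCore functions from the traceback; the error text is copied from log [1] purely for familiarity.

```python
def _asList(value):
    """Accept either a single site name or a list of names."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def checkSiteListOverlap(args):
    """Raise if a site appears in both lists (toy version of the check)."""
    white = set(_asList(args.get("SiteWhitelist")))
    black = set(_asList(args.get("SiteBlacklist")))
    overlap = sorted(white & black)
    if overlap:
        raise ValueError(f"The same site cannot be white and blacklisted: {overlap}")


def validatePartialUpdate(currentArgs, changedArgs):
    """Validate the changed arguments on top of the currently stored ones."""
    merged = {**currentArgs, **changedArgs}
    checkSiteListOverlap(merged)
    return merged
```

With a stored whitelist of `['T1_US_FNAL', 'T2_CH_CERN']`, a change of only `{'SiteBlacklist': 'T2_CH_CERN'}` passes `checkSiteListOverlap` on its own but is rejected by `validatePartialUpdate`.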

@anpicci
Contributor

anpicci commented Dec 10, 2024

@todor-ivanov may you ping us when you will provide the fix to the latest issue?

@todor-ivanov
Contributor Author

hi @anpicci

@todor-ivanov may you ping us when you will provide the fix to the latest issue?

Yes, I will. Most probably after the training session today.

@todor-ivanov
Contributor Author

todor-ivanov commented Dec 11, 2024

hi @anpicci @amaltaro here is the solution promised: Skip validating update stat arguments
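
One way to picture "skip validating update stat arguments" is a guard that recognizes a statistics-only update before the validator runs. This is only a sketch under assumptions: the set of stat keys is taken from the request in log [2] above, and `isStatUpdate` and `validateUpdate` are hypothetical names; the actual commit may implement the skip differently.

```python
# Stat keys as they appear in the failing request from log [2] (assumed complete).
STAT_ARGS = frozenset({"total_jobs", "input_events", "input_lumis", "input_num_files"})


def isStatUpdate(requestArgs):
    """True when every key other than RequestName is a statistics argument."""
    keys = set(requestArgs) - {"RequestName"}
    return bool(keys) and keys <= STAT_ARGS


def validateUpdate(requestArgs, validator):
    """Run the validator only for non-stat updates; return whether it ran."""
    if isStatUpdate(requestArgs):
        return False
    validator(requestArgs)
    return True
```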

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 17 warnings
    • 306 comments to review
  • Pycodestyle check: succeeded
    • 25 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/195/artifact/artifacts/PullRequestReport.html

@todor-ivanov
Contributor Author

@amaltaro I do not know if you have started looking into this PR, but I need your opinion at least on: #12099 (comment)

So far the code works just as expected with the fix for stat update arguments. I've tested it. On top of it here is the state of one such WQE:

  • Before Sitelists update:
{"_id":"2c07fe74e23bcfb33a6fe27115db496b","_rev":"3-bc9c8a587f56eecaefe67dfa12cc9375","timestamp":1733826411.0451522,"updatetime":1733826766.7949412,"thunker_encoded_json":true,"type":"WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement","WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement":{"Inputs":{"/JetHT/Run2012C-v1/RAW#e759da30-f693-11e1-847a-842b2b4671d8":["T2_CH_CERN","T2_CH_CERN_P5","T2_CH_CERN_HLT"]},"ParentFlag":false,"ParentData":{},"NumberOfLumis":51,"NumberOfFiles":2,"NumberOfEvents":6320,"Jobs":13,"OpenForNewData":false,"NoInputUpdate":false,"NoPileupUpdate":false,"Status":"Acquired","RequestName":"tivanov_TaskChain_LumiMask_multiRun_SiteListsTest_v7_241210_093515_7169","TaskName":"HLTD","Dbs":"https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader","SiteWhitelist":["T1_US_FNAL","T2_CH_CERN"],"SiteBlacklist":[],"StartPolicy":"Block","EndPolicy":{"policyName":"SingleShot"},"Priority":200000,"PileupData":{},"ProcessedInputs":[],"RejectedInputs":[],"ParentQueueId":"tivanov_TaskChain_LumiMask_multiRun_SiteListsTest_v7_241210_093515_7169","SubscriptionId":null,"EventsWritten":0,"FilesProcessed":0,"PercentComplete":0,"PercentSuccess":0,"TeamName":"testbed-vocms0290","ACDC":{},"ChildQueueUrl":"http://vocms0290.cern.ch:5984","ParentQueueUrl":"https://cmsweb-test1.cern.ch/couchdb/workqueue","WMBSUrl":null,"NumOfFilesAdded":0,"Mask":null,"TimestampFoundNewData":1733826410,"CreationTime":1733826411.0451522}}
  • Upon Sitelists updates:
{"_id":"2c07fe74e23bcfb33a6fe27115db496b","_rev":"4-c121956f972ed71d0435d05680ece8bb","timestamp":1733826411.0451522,"updatetime":1733826766.7949412,"thunker_encoded_json":true,"type":"WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement","WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement":{"Inputs":{"/JetHT/Run2012C-v1/RAW#e759da30-f693-11e1-847a-842b2b4671d8":["T2_CH_CERN","T2_CH_CERN_P5","T2_CH_CERN_HLT"]},"ParentFlag":false,"ParentData":{},"NumberOfLumis":51,"NumberOfFiles":2,"NumberOfEvents":6320,"Jobs":13,"OpenForNewData":false,"NoInputUpdate":false,"NoPileupUpdate":false,"Status":"Acquired","RequestName":"tivanov_TaskChain_LumiMask_multiRun_SiteListsTest_v7_241210_093515_7169","TaskName":"HLTD","Dbs":"https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader","SiteWhitelist":["T2_CH_CERN"],"SiteBlacklist":["T1_US_FNAL"],"StartPolicy":"Block","EndPolicy":{"policyName":"SingleShot"},"Priority":200000,"PileupData":{},"ProcessedInputs":[],"RejectedInputs":[],"ParentQueueId":"tivanov_TaskChain_LumiMask_multiRun_SiteListsTest_v7_241210_093515_7169","SubscriptionId":null,"EventsWritten":0,"FilesProcessed":0,"PercentComplete":0,"PercentSuccess":0,"TeamName":"testbed-vocms0290","ACDC":{},"ChildQueueUrl":"http://vocms0290.cern.ch:5984","ParentQueueUrl":"https://cmsweb-test1.cern.ch/couchdb/workqueue","WMBSUrl":null,"NumOfFilesAdded":0,"Mask":null,"TimestampFoundNewData":1733826410,"CreationTime":1733826411.0451522}}

In short, both ReqMgr and the WorkQueue elements have been updated after the arguments were properly validated, not only for format but also for value type and content. The validation process concerns only update actions without a status change.
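
The difference between the two WQE documents above can be checked mechanically rather than by eye. The sketch below uses a hypothetical helper, `diffElements`, and trimmed copies of the two documents that keep only a handful of fields from the originals.

```python
WQE_KEY = "WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement"


def diffElements(beforeDoc, afterDoc):
    """Return {field: (old, new)} for the element fields that differ."""
    before, after = beforeDoc[WQE_KEY], afterDoc[WQE_KEY]
    return {key: (before.get(key), after.get(key))
            for key in sorted(set(before) | set(after))
            if before.get(key) != after.get(key)}


# Trimmed copies of the documents shown above (most fields omitted).
beforeDoc = {
    "_id": "2c07fe74e23bcfb33a6fe27115db496b",
    WQE_KEY: {"Status": "Acquired", "Priority": 200000,
              "SiteWhitelist": ["T1_US_FNAL", "T2_CH_CERN"], "SiteBlacklist": []},
}
afterDoc = {
    "_id": "2c07fe74e23bcfb33a6fe27115db496b",
    WQE_KEY: {"Status": "Acquired", "Priority": 200000,
              "SiteWhitelist": ["T2_CH_CERN"], "SiteBlacklist": ["T1_US_FNAL"]},
}

changed = diffElements(beforeDoc, afterDoc)
```

Here `changed` contains only the `SiteWhitelist` and `SiteBlacklist` entries, which matches what the update was expected to touch.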

I am still keeping all those commits unsquashed, only because I am waiting for your answer to the above question and to preserve the different steps of the solution to be visible during the review process. So if you think no further optimisations are needed I am ready to squash them and get it ready for merge. If you think I should push further for optimizing these calls to make them in one go I am ready to do that as well.

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 17 warnings
    • 304 comments to review
  • Pycodestyle check: succeeded
    • 25 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/198/artifact/artifacts/PullRequestReport.html

Fix bad updateElementsByWorkflow argument
@todor-ivanov todor-ivanov force-pushed the feature_Sitelists_UpdateWorkqueueElements_fix-12038 branch from c83b52f to 93d0c3b Compare December 12, 2024 18:27
@todor-ivanov
Contributor Author

And to put it: #12099 (comment) in perspective, this is what I am talking about: Update workload args only once

And here are the WQE updated:

  • Before:
{"_id":"d70157fad5e8778da29c600630bf8521","_rev":"3-b9e5e4642aba0e18b0644a626821d2c9","timestamp":1733826378.292129,"updatetime":1733826385.1671724,"thunker_encoded_json":true,"type":"WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement","WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement":{"Inputs":{"/NoBPTX/Run2018D-12Nov2019_UL2018-v1/MINIAOD#a9c70800-f936-4c6c-a3a8-c6fa4c9421d0":["T2_CH_CERN","T2_CH_CERN_P5","T2_CH_CERN_HLT"]},"ParentFlag":false,"ParentData":{},"NumberOfLumis":15,"NumberOfFiles":1,"NumberOfEvents":2839,"Jobs":1,"OpenForNewData":false,"NoInputUpdate":false,"NoPileupUpdate":false,"Status":"Acquired","RequestName":"tivanov_ReReco_LumiMask_SiteListsTest_v7_241210_093504_2777","TaskName":"DataProcessing","Dbs":"https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader","SiteWhitelist":["T1_US_FNAL","T2_CH_CERN"],"SiteBlacklist":[],"StartPolicy":"Block","EndPolicy":{"policyName":"SingleShot"},"Priority":600000,"PileupData":{},"ProcessedInputs":[],"RejectedInputs":[],"ParentQueueId":"tivanov_ReReco_LumiMask_SiteListsTest_v7_241210_093504_2777","SubscriptionId":null,"EventsWritten":0,"FilesProcessed":0,"PercentComplete":0,"PercentSuccess":0,"TeamName":"testbed-vocms0290","ACDC":{},"ChildQueueUrl":"http://vocms0290.cern.ch:5984","ParentQueueUrl":"https://cmsweb-test1.cern.ch/couchdb/workqueue","WMBSUrl":null,"NumOfFilesAdded":0,"Mask":null,"TimestampFoundNewData":1733826378,"CreationTime":1733826378.292129}}
  • After:
{"_id":"d70157fad5e8778da29c600630bf8521","_rev":"4-49f3b4a5c45a199783d2347b1a608d3a","timestamp":1733826378.292129,"updatetime":1733826385.1671724,"thunker_encoded_json":true,"type":"WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement","WMCore.WorkQueue.DataStructs.WorkQueueElement.WorkQueueElement":{"Inputs":{"/NoBPTX/Run2018D-12Nov2019_UL2018-v1/MINIAOD#a9c70800-f936-4c6c-a3a8-c6fa4c9421d0":["T2_CH_CERN","T2_CH_CERN_P5","T2_CH_CERN_HLT"]},"ParentFlag":false,"ParentData":{},"NumberOfLumis":15,"NumberOfFiles":1,"NumberOfEvents":2839,"Jobs":1,"OpenForNewData":false,"NoInputUpdate":false,"NoPileupUpdate":false,"Status":"Acquired","RequestName":"tivanov_ReReco_LumiMask_SiteListsTest_v7_241210_093504_2777","TaskName":"DataProcessing","Dbs":"https://cmsweb-testbed.cern.ch/dbs/int/global/DBSReader","SiteWhitelist":["T2_CH_CERN"],"SiteBlacklist":["T1_US_FNAL"],"StartPolicy":"Block","EndPolicy":{"policyName":"SingleShot"},"Priority":600000,"PileupData":{},"ProcessedInputs":[],"RejectedInputs":[],"ParentQueueId":"tivanov_ReReco_LumiMask_SiteListsTest_v7_241210_093504_2777","SubscriptionId":null,"EventsWritten":0,"FilesProcessed":0,"PercentComplete":0,"PercentSuccess":0,"TeamName":"testbed-vocms0290","ACDC":{},"ChildQueueUrl":"http://vocms0290.cern.ch:5984","ParentQueueUrl":"https://cmsweb-test1.cern.ch/couchdb/workqueue","WMBSUrl":null,"NumOfFilesAdded":0,"Mask":null,"TimestampFoundNewData":1733826378,"CreationTime":1733826378.292129}}

So it works!

@amaltaro - your turn

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 17 warnings
    • 304 comments to review
  • Pycodestyle check: succeeded
    • 25 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/199/artifact/artifacts/PullRequestReport.html

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 4 changes in unstable tests
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 17 warnings
    • 304 comments to review
  • Pycodestyle check: succeeded
    • 25 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/201/artifact/artifacts/PullRequestReport.html

Contributor

@amaltaro amaltaro left a comment

@todor-ivanov these changes are looking good in general. I left a few comments along the code, though, and it would be better to get them addressed.

@@ -413,6 +414,8 @@ def _handleNoStatusUpdate(self, workload, request_args, dn):
request_args will be ignored.
"""
reqArgs = deepcopy(request_args)
reqStatus = workload.getStatus()
cherrypy.log(f"CurrentRequest status: {reqStatus}")
Contributor

I think this log record should be removed before this PR gets merged.

Contributor Author

why?

Contributor Author

Actually, I just noticed that by moving it a few lines above we are now missing the request name in the information delivered with that message. Before, it was coming from previous messages, but now, since it sits in front of all of them, I'd rather evolve it into the normal introductory message signaling that we enter the _handleNoStatusUpdate method.

if reqArgsNothandled:
try:
# Commit all Global WorkQueue changes per workflow in a single go:
self.gq_service.updateElementsByWorkflow(workload, reqArgs, status=['Available', 'Negotiating', 'Acquired'])
Contributor

Without a deep inspection of the code, this looks incorrect. Why are we trying to execute updateElementsByWorkflow() even before some of the methods in the except block (e.g. _handleAssignmentStateTransition() and/or claiming that there might be "unhandled arguments")?

A call to updateElementsByWorkflow() should only be made if we are certain that WQEs need to be updated. The global workqueue can be a scale-limiting component of the system, so we should use it wisely.

Contributor Author

The WMWorkloadUnhandledException is telling you if there have been any unhandled arguments. It is thrown when we call workload.updateWorkloadArgs from within updateElementsByWorkflow - here: https://github.com/dmwm/WMCore/pull/12099/files#diff-d758570ce7f6baddaeefb0bdb4e2015b06e5b7772bf929248b686adc64286233R316

And I did it so because of your request during our discussion of the issue to have the workflow and the workqueue elements being updated in a single push.
It was not like that just until last night. And I've asked you at least twice ... trying to make all the needed pointers in my comments here: #12099 (comment) and here: #12099 (comment) before I proceed with unifying those two steps with this commit: 93d0c3b (hence why I still keep all the commits not squashed)

Contributor Author

And I do not see how we would recognize whether there are any workqueue elements to be updated until we query the workqueue and see if there are any. It has been a sequential and mandatory step in this method even before I made it all happen in one go. We were still always going through it, as long as we were called.

Usually the parameter reqArgsDiff was the one that would have told us if there was anything to be done at this stage within this method, but I remember in the past you asked me to avoid calling the validation twice at the head of this method here. But regardless... , in all the cases of a call to update a request with no RequestStatus update we still always return the reqArgsDiff from here:

return workload, reqArgsDiff
instead of the originally passed request_args and then any actions should have been halted at the very top of this method here:
if not reqArgs:
    cherrypy.log(f"Nothing to be changed at this stage for {workload.name()}")
    return 'OK'

So if there was nothing to be updated in the workflow itself - ergo in its WQE elements of the given status, we should never reach that point in the code. But if there have been any change with the workflow arguments instead, we will have to proceed with updating the WQE elements as well.

On top of that we update only the ones in the selected statuses, rather than all of them. I think we are safe in what we are doing here.
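
The two safeguards described in this reply, the early exit on an empty argument diff and the restriction to elements in selected statuses, can be sketched as below. This is a toy model and not the WorkQueue service code: `applyToElements` is a hypothetical name, elements are plain dictionaries, and only the three status names come from the call shown in this PR.

```python
def applyToElements(elements, reqArgsDiff,
                    statuses=("Available", "Negotiating", "Acquired")):
    """Apply reqArgsDiff to elements in the given statuses; return how many changed."""
    if not reqArgsDiff:
        # Nothing changed in the workflow, so no element needs to be touched.
        return 0
    updated = 0
    for element in elements:
        if element["Status"] in statuses:
            element.update(reqArgsDiff)
            updated += 1
    return updated
```

An element in any other status (for example `Done`) is left as it is, and an empty diff returns before the element list is even inspected.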

Contributor

Given that these changes are now being applied on top of your previous incomplete changes, it is a bit harder to see how things were implemented.

I don't really understand why we call _handleAssignmentStateTransition() from this _handleNoStatusUpdate() method, as it says that there is not supposed to be any status transition. Can you please explain that?

In addition, I believe that this call updateElementsByWorkflow() could become not super cheap (few secs?) if workqueue is loaded with documents. So, IMO we should only call it if really necessary. To put it in a different way, a request getting assigned (hence in assignment-approved) does not have workqueue elements.

Contributor Author

I don't really understand why we call _handleAssignmentStateTransition() from this _handleNoStatusUpdate() method, as it says that there is not supposed to be any status transition. Can you please explain that?

Because if you call _handleNoStatusUpdate for a workflow that is in assignment_approved, the request arrives with the full set of parameters that you get during the assignment_approved state transition. That will blow up, because we decided to still support only a limited set of parameters for no status update: only the site-list-related ones and RequestPriority (which was yet again a wrong decision). This is the root cause of the BUG you reported in the first implementation, for which you did not accept the code back then (even though I provided this same solution on the very next day). So the way out of the situation was this: in order to avoid rewriting everything that is already there in _handleAssignmentStateTransition and assimilating it into workload.updateWorkloadArgs, and since the _handleAssignmentStateTransition method has nothing to do with the state transition action itself, it was safe enough to call it directly here. If you want, I will take this method and rewrite it in the context of workload.updateWorkloadArgs. I do not mind. Actually I think this is the proper way to do it, and I've expressed that in the past.

Contributor

And what is the relationship of a workflow in assignment_approved with updateElementsByWorkflow()? None, the workflow has not been assigned yet. Calling workqueue is a waste of resources.

If we had an issue (maybe that was with ACDC), making a call to WorkQueue won't resolve that issue. This implementation is wrong. If you want to keep it, keep it.

Contributor Author

Alan, make up your mind.....

You asked me to reduce the calls to workload and push it all in one go. I twisted my mind to implement it. Before I did it I asked for your feedback twice.... now you tell me this is wrong.

Contributor Author

If we want to have it all correct we must stop putting extra background knowledge in the algorithms we follow ... and tying ourselves in a figure 8 knot of conditions.

Contributor Author

@todor-ivanov todor-ivanov Dec 13, 2024

In order to avoid confusion and blame .... Please put the correct set of conditions here in simple English under which the work queue elements update must be called. And I will implement it.

Contributor

As discussed over Zoom:

  • _handleAssignmentStateTransition is not performing any status transition;
  • and the exception WMWorkloadUnhandledException is not meant to be raised by updateElementsByWorkflow(), but by its internal call workload.updateWorkloadArgs(updateParams), if needed.

Thank you for explaining this Todor. I have no more concerns on this code.
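
The second point, that the exception originates in the internal workload call and merely passes through the workqueue method, can be shown with a toy model. All names here (`ToyUnhandledArgsError`, `ToyWorkload`, `toyUpdateElementsByWorkflow`) are hypothetical stand-ins; the supported argument set follows the thread's description of site lists plus RequestPriority, and the real classes may behave differently in detail.

```python
class ToyUnhandledArgsError(Exception):
    """Hypothetical stand-in for the unhandled-arguments exception."""

    def __init__(self, unhandled):
        super().__init__(f"Unhandled arguments: {unhandled}")
        self.unhandled = unhandled


class ToyWorkload:
    """Hypothetical workload that accepts only a limited set of arguments."""

    supported = {"SiteWhitelist", "SiteBlacklist", "RequestPriority"}

    def __init__(self):
        self.args = {}

    def updateWorkloadArgs(self, reqArgs):
        unhandled = sorted(set(reqArgs) - self.supported)
        if unhandled:
            # Refuse the whole update and report what could not be handled.
            raise ToyUnhandledArgsError(unhandled)
        self.args.update(reqArgs)


def toyUpdateElementsByWorkflow(workload, updateParams, elements):
    """Update the workload first, then the elements; raises nothing itself."""
    workload.updateWorkloadArgs(updateParams)  # may raise before any element is touched
    for element in elements:
        element.update(updateParams)
    return len(elements)
```

A caller that wraps `toyUpdateElementsByWorkflow` in `try`/`except ToyUnhandledArgsError` sees the list of unhandled arguments, and because the workload call comes first, neither the workload nor any element has been modified at that point.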

@@ -284,6 +284,41 @@ def updatePriority(self, wf, priority):
wmspec.saveCouch(self.hostWithAuth, self.db.name, dummy_values)
return

def updateElementsByWorkflow(self, workload, updateParams, status=None):
Contributor

Instead of having two different methods for the same thing, I would suggest to converge on only one implementation for this.
Valentin already provided this site specific method: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/WorkQueue/WorkQueue.py#L240

So I would suggest to either refactor this one and change to make sure it works with WMAgent WorkflowComponent, or adapt this PR to use the method already provided by Valentin.

Contributor Author

We quickly exchanged a message or two with @vkuznet through Mattermost and he kindly agreed on my request to keep the current method from this PR here and him adapting his code. Thank you @vkuznet for this!

Contributor

Sounds good. Valentin, when you make that pull request, please link it to the actual GH issue, which I think is #12039.

@@ -68,6 +81,57 @@ class WMWorkloadHelper(PersistencyHelper):

def __init__(self, wmWorkload=None):
self.data = wmWorkload
self.settersMap = {}

def updateWorkloadArgs(self, reqArgs):
Contributor

We need unit tests for this method.

@@ -1176,6 +1240,25 @@ def getDbsUrl(self):

return getattr(self.data.request.schema, "DbsUrl")

def setStatus(self, status):
Contributor

We need unit tests for this method as well

self.data.request.status = status
return

def getStatus(self):
Contributor

We need unit tests for this method as well

@@ -1971,6 +2054,26 @@ def validateArgumentForAssignment(self, schema):
validateArgumentsUpdate(schema, argumentDefinition)
return

def validateArgumentsPartialUpdate(self, schema):
Contributor

It is probably a good idea to write tests for this one as well.

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 4 changes in unstable tests
  • Python3 Pylint check: failed
    • 13 warnings and errors that must be fixed
    • 17 warnings
    • 304 comments to review
  • Pycodestyle check: succeeded
    • 25 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/205/artifact/artifacts/PullRequestReport.html

@dmwm-bot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 28 warnings and errors that must be fixed
    • 18 warnings
    • 452 comments to review
  • Pycodestyle check: succeeded
    • 59 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/224/artifact/artifacts/PullRequestReport.html

Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update workqueue elements upon workflow site list change
6 participants