Change how CUDA runtime and capabilities are defined in the task and Condor #11689
base: master
Conversation
Jenkins results:
Jenkins results:
Fix reference variable in WMTask
fix WMTask unit tests
Jenkins results:
In order to perform full testing of these changes, we need to have changes at the SI level (see initial description).
Hi @amaltaro
Things look good to me, but I've left a few comments inline. You may wish to take a look.
src/python/Utils/Utilities.py
Outdated
""" | ||
if not isinstance(versionList, list): | ||
return versionList | ||
versionList.sort(key=StrictVersion) |
Other than the good documentation provided with this function, which I admit gives useful info, why do we need it? The `list.sort` method works in place, so executing this same line outside of the function would do just fine. We could still keep a NOTE comment with the information from the docstring, though.
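For illustration, a minimal sketch of the in-place alternative (the capability values and the NOTE text are illustrative, not the original docstring):

```python
from distutils.version import StrictVersion

# Hypothetical call site: a list of CUDA capability versions as strings.
capabilities = ["7.5", "6.0", "8.6"]

# list.sort works in place, so the helper can be collapsed to one line.
# NOTE: sorting with StrictVersion compares versions numerically, so e.g.
# "2.10" sorts after "2.2", which a plain lexicographic sort would get wrong.
capabilities.sort(key=StrictVersion)
print(capabilities)  # ['6.0', '7.5', '8.6']
```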
    defaultRes = 0
    # get an ordered list of the versions and use the very first element
    capabilities = orderVersionList(capabilities)
    if not capabilities:
If `capabilities` is not a `list` or `None` (which is the default for the method) but rather a dictionary or a string, then, because `orderVersionList` returns its argument as-is when it is not a list, this protection won't stop the execution and the next line will break.
All the GPU parameters are validated at workflow creation time (StdBase), so there shouldn't be any type other than list or None at this level.
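For illustration, a small sketch of the scenario described above (the trailing `return` in the helper and the element access afterwards are assumptions based on the diff and the inline comment):

```python
from distutils.version import StrictVersion

def orderVersionList(versionList):
    # mirrors the diff above: non-list input is handed back untouched
    if not isinstance(versionList, list):
        return versionList
    versionList.sort(key=StrictVersion)
    return versionList

# The concern: a plain string passes through untouched, survives the
# "if not capabilities" guard, and indexing then yields a single character.
capabilities = orderVersionList("7.5")   # not a list, returned as-is
if not capabilities:                     # "7.5" is truthy, so the guard does not fire
    capabilities = None
first = capabilities[0]                  # "7" -- downstream version parsing would break
```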
    for comparison/job matchmaking purposes.
    Version conversion formula is: (1000 * major + 10 * medium + minor)
    :param capabilities: a list of string versions
    :return: an integer with the version value; 0 in case of failure
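For illustration, a standalone sketch that mirrors only the formula from this docstring (the real method is `cudaCapabilityToSingleVersion` elsewhere in this PR; taking the smallest capability after ordering follows the inline comment in the earlier diff):

```python
from distutils.version import StrictVersion

def capabilityToSingleVersion(capabilities):
    """Illustrative only: smallest capability -> (1000 * major + 10 * medium + minor)."""
    if not isinstance(capabilities, list) or not capabilities:
        return 0  # the failure default discussed below
    capabilities.sort(key=StrictVersion)
    parts = [int(piece) for piece in capabilities[0].split(".")]
    major, medium, minor = (parts + [0, 0])[:3]  # pad a missing medium/minor with 0
    return 1000 * major + 10 * medium + minor

print(capabilityToSingleVersion(["3.2", "1.2", "1.4"]))  # 1020, matching the PR description example
```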
What is the order here... is `0` the least significant value, or vice versa?
I expect the job matchmaking to be `available_version >= required_version`, so 0 here would mean that any CUDA capability is supported by the job. I am not sure this answers your question; if not, I might have misunderstood it.
Partially, it does.
What bothers me here is that the default value is again `0`, and it overlaps with the value returned in case of an error/failure. That would be fine if 0 could never be reached in the matchmaking condition, but since it is included, a failure and the default would trigger the same behavior.
Yes, it can come either from a version-to-integer conversion or from an error. However, this code will only be executed if GPUs are required, and if they are, the GPU parameters have been validated and the version is enforced to match something like `r"^\d+\.\d+(\.\d+)?$"`. If the version provided is 0.0, then yes, we get a "valid" single version of 0, which we could also get if an empty list had been provided. But then again, everything is validated at the spec level.
I could change it to None and then mark the classad as undefined. But then it would be a job requiring GPUs while asking for an undefined CUDA capability, and I am not sure whether that could trigger other errors and unwanted behavior. Please let me know if you have any suggestions.
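For illustration, making the overlap concrete with the validation pattern quoted above and the illustrative converter sketched earlier in this thread:

```python
import re

VERSION_RE = r"^\d+\.\d+(\.\d+)?$"

print(bool(re.match(VERSION_RE, "0.0")))   # True: "0.0" passes the spec-level validation
print(capabilityToSingleVersion(["0.0"]))  # 0 -- a "valid" conversion result...
print(capabilityToSingleVersion([]))       # 0 -- ...identical to the failure default
```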
Alan, I do not want to say it... but usually there are many `if`s in the justification of unhealthy code behavior. Overlapping default and failure return values is not good practice in general, let alone type modification. I know we are supposed to be protected on another level, and we should not enter this bit of code under bad conditions, but let's imagine something changes in the future and we change our policy regarding GPU job execution as well... Then we may enter this code under unexpected conditions, or because of a bug in the request validation code... anything may happen.
So I'd go for returning a `None` value in case of a failure and letting the upstream code deal with it. But that's up to you. If you think we are 100% safe and this function will only ever be called under the best conditions at any possible moment in the future, we can leave it as it is.
I don't really have a strong preference here. I can make it return None by default, add a check in SimpleCondorPlugin for None, and map that to an undefined classad. Sounds good?
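For illustration, one possible shape for that check (sketch only; `ad`, `job` and `classad` are the SimpleCondorPlugin objects, and whether an unquoted undefined literal is the right way to express an undefined `My.*` submit attribute is an assumption to verify):

```python
minimalCapability = self.cudaCapabilityToSingleVersion(job['gpuRequirements']['CUDACapabilities'])
if minimalCapability is None:
    # leave the classad attribute undefined: written without classad.quote so it
    # becomes the UNDEFINED literal rather than the string "undefined" (assumption)
    ad['My.CUDACapability'] = "undefined"
else:
    ad['My.CUDACapability'] = classad.quote(str(minimalCapability))
```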
sounds good, yes
thank you @amaltaro
minimalCapability = self.cudaCapabilityToSingleVersion(job['gpuRequirements']['CUDACapabilities'])
ad['My.CUDACapability'] = classad.quote(str(minimalCapability))
ad['My.OriginalCUDACapability'] = classad.quote(str(cudaCapabilities))
cudaRuntime = ','.join(sorted(job['gpuRequirements']['CUDARuntime']))
Why do we need to sort the `cudaRuntime` variable as well?
I am actually not sure whether this is needed or not. I was going to say it keeps the condor classad values consistent among different jobs of the same workflow/task, but it could be that the original list is already consistent/ordered. I'd keep it around in case the original list can come out with scrambled versions.
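For illustration, the consistency argument with hypothetical runtime values:

```python
# The same set of runtimes, arriving in different orders from different jobs,
# still produces a single canonical classad string once sorted.
runtimes_job_a = ["12.0", "11.2"]
runtimes_job_b = ["11.2", "12.0"]
print(','.join(sorted(runtimes_job_a)))  # 11.2,12.0
print(','.join(sorted(runtimes_job_b)))  # 11.2,12.0
```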
Empty list should return None as well
remove empty line
Jenkins results:
Jenkins results:
@todor-ivanov the last 2 commits provide changes based on your review. Please have another look.
things look good
thanks @amaltaro
@mapellidario @belforte Hi Dario, Stefano, these are the latest changes that we are trying to commission and deploy in production for GPU jobs. I just updated the initial description, but please let me know if you have any questions. Note that this is not yet in production and we need to discuss/plan the required changes at the SI layer.
thanks Alan, @novicecpp will be back on Aug 24 and will be able to look at integrating these changes in CRAB as well. It would be nice to have some examples to test, and of course what we dearly miss is users!
test this please
Jenkins results:
Can one of the admins verify this patch?
Fixes #11595
Status
In development
Description
The following is provided in this PR:
- `GPUMemoryMB`: use the largest value among the steps within a task
- `CUDARuntime`: changed from a simple string to a list of strings with a union of the CUDA runtime versions
- `CUDACapabilities`: still a list of strings, but now with a union of the CUDA capabilities versions
- Conversion of the CUDA capabilities to a single integer version (`cudaCapabilityToSingleVersion`), using the formula `(1000 * major + 10 * medium + minor)`, where `1.2.3` would be major=1, medium=2, minor=3. See [1] for further context.
- `CUDACapability` classad changed to a single integer (actually string represented). For instance, it changes from `"1.2,3.2,1.4"` to `"1020"`. Matchmaking then can be done with something like: `Node CUDACapability >= Job CUDACapability`, after converting the capability version to a single integer to be compatible with these changes.
- `CUDARuntime` classad changed to be a comma separated list of CUDA runtimes. For instance, it changes from `"1.2"` to `"1.2,3.2"`. Matchmaking then can be done with something like: `stringListSubsetMatch(job_list_cudaruntime, node_list_cudaruntime)` (a rough Python sketch of both matchmaking checks follows the backward-compatibility note below). NOTE that this needs an HTCondor upgrade to >= 10.0.6
- New `OriginalCUDACapability` classad with the actual comma separated list of CUDA capabilities
Is it backward compatible (if not, which system it affects?)
NO (in the sense that job matchmaking will have to be updated)
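For illustration only, a pure-Python sketch of the two matchmaking checks described above (attribute values are hypothetical; the real expressions are evaluated by HTCondor/GlideinWMS, and `stringListSubsetMatch` is assumed to test whether its first list is a subset of the second):

```python
# Job side, as produced by this PR.
job_cuda_capability = 1020            # from capabilities "1.2,3.2,1.4" -> 1020
job_cuda_runtimes = "1.2,3.2"         # comma separated CUDARuntime classad

# Node/machine side (hypothetical advertised values).
node_cuda_capability = 7000
node_cuda_runtimes = "1.2,3.2,11.2"

# Equivalent of: Node CUDACapability >= Job CUDACapability
capability_match = node_cuda_capability >= job_cuda_capability

# Equivalent of: stringListSubsetMatch(job_list_cudaruntime, node_list_cudaruntime)
runtime_match = set(job_cuda_runtimes.split(",")) <= set(node_cuda_runtimes.split(","))

print(capability_match and runtime_match)  # True for these sample values
```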
Related PRs
Complement to #11588 such that hybrid GPU workflows can be supported.
External dependencies / deployment changes
Submission Infrastructure needs to update HTCondor to >= 10.0.6 and GlideinWMS needs to update the job matchmaking expression for CUDARuntime and CUDACapabilities.
SI ticket: https://its.cern.ch/jira/browse/CMSSI-79
[1]
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART____VERSION.html