-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change how CUDA runtime and capabilities are defined in the task and Condor #11689
base: master
Are you sure you want to change the base?
Changes from 4 commits
0235cd0
3643e42
06f5b59
3e269ae
ec3777c
cee6f27
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -8,7 +8,7 @@ | |
from builtins import object, str, bytes | ||
from future.utils import viewvalues | ||
|
||
from Utils.Utilities import decodeBytesToUnicode | ||
from Utils.Utilities import decodeBytesToUnicode, orderVersionList | ||
from WMCore.WMException import WMException | ||
from WMCore.WMRuntime.Tools.Scram import ARCH_TO_OS, SCRAM_TO_ARCH | ||
|
||
|
@@ -181,3 +181,31 @@ def scramArchtoRequiredArch(scramArch=None): | |
archs = defaultArch | ||
|
||
return archs | ||
|
||
@staticmethod | ||
def cudaCapabilityToSingleVersion(capabilities=None): | ||
""" | ||
Given a list of CUDA capabilities (with strings in a version style), | ||
finds the smallest version required and convert it to a single integer | ||
for comparison/job matchmaking purposes. | ||
Version conversion formula is: (1000 * major + 10 * medium + minor) | ||
:param capabilities: a list of string versions | ||
:return: an integer with the version value; 0 in case of failure | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is the order here ... is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I expect the job matchmaking to be There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Partially it does. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, it can return from a version to integer conversion or from an error. However, this code will only be executed if GPUs are required, and if they are, then there are validated GPU parameters and the version is enforced to be something like I could change it to None and then mark the classad as undefined. But then it will be a job requiring GPU but asking for an undefined CUDA capability. I am not sure whether that could trigger other errors and unwanted behavior or not. Please let me know if you have any suggestions. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Alan, I do not want to say it... but usually there are many So I'd bet for returning a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't really have a strong preference here. I can make it return None by default, have a check in SingleCondorPlugin for None and map that to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sounds good, yes |
||
|
||
For further details: | ||
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART____VERSION.html | ||
""" | ||
defaultRes = 0 | ||
# get an ordered list of the versions and use the very first element | ||
capabilities = orderVersionList(capabilities) | ||
if not capabilities: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If capabilities is not a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. All the GPU parameters are validated at workflow creation (StdBase), so there shouldn't be a type different than list or None at this level |
||
return defaultRes | ||
|
||
smallestVersion = capabilities[0] | ||
smallestVersion = smallestVersion.split(".") | ||
# deal with versions like: "1", "1.2" and "1.2.3" | ||
for _i in range(0, 3 - len(smallestVersion)): | ||
smallestVersion.append(0) | ||
|
||
intVersion = int(smallestVersion[0]) * 1000 + int(smallestVersion[1]) * 10 + int(smallestVersion[2]) | ||
return intVersion |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -566,8 +566,11 @@ def getJobParameters(self, jobList): | |
if job.get('gpuRequirements', None): | ||
ad['My.GPUMemoryMB'] = str(job['gpuRequirements']['GPUMemoryMB']) | ||
cudaCapabilities = ','.join(sorted(job['gpuRequirements']['CUDACapabilities'])) | ||
ad['My.CUDACapability'] = classad.quote(str(cudaCapabilities)) | ||
ad['My.CUDARuntime'] = classad.quote(job['gpuRequirements']['CUDARuntime']) | ||
minimalCapability = self.cudaCapabilityToSingleVersion(job['gpuRequirements']['CUDACapabilities']) | ||
ad['My.CUDACapability'] = classad.quote(str(minimalCapability)) | ||
ad['My.OriginalCUDACapability'] = classad.quote(str(cudaCapabilities)) | ||
cudaRuntime = ','.join(sorted(job['gpuRequirements']['CUDARuntime'])) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why do we need to sot the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am actually not sure whether this is needed or not. I was going to say to keep condor classad values consistent among different jobs of the same workflow/task, but it could be that this is already consistent/ordered in the original list. I'd keep it around in case the original list can come out with scrambled versions. |
||
ad['My.CUDARuntime'] = classad.quote(str(cudaRuntime)) | ||
else: | ||
ad['My.GPUMemoryMB'] = undefined | ||
ad['My.CUDACapability'] = undefined | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the good documentation provided with this function, which I admit gives good info, why do we need it. The
list.sort
method works in place, so executing this same line outside of the function would do just fine. We can still make aNOTE
comment providing the information from the docstring though.