-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NRPE: add check: check_sriov_numvfs.py #670
base: master
Are you sure you want to change the base?
Conversation
Check to verify the number of virtual functions (VFs) active on network interfaces. Based on /sys/class/net/<interface>/device/sriov_numvfs and /sys/class/net/<interface>/device/sriov_totalvfs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great start, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, changes look good. I've shared a couple of suggestions. One changes what is considered "unknown" and what is "critical". The other is a readability suggestion.
if error_msg: | ||
print("CRITICAL: {} problems detected\n".format(len(error_msg)) + "\n".join(error_msg)) | ||
sys.exit(2) | ||
except: # noqa: E722 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not opposed to a catch-all scope for unexpected exceptions. However, we should consider adding a previous specific list of exceptions that we think could be raised (and maybe raise a critical error when they happen). Alerting is triggered in Nagios when unknown/critical/recover events happen, so this change would only help in sharing a more meaningful message (a sentence vs a python traceback).
e.g. PermissionError, FileNotFoundError, AssertionError.
Side note: instead of using "assert ...", an exception class inherited from Exception (e.g. ArgsFormatError) could be raised when parsing CLI arguments (and caught here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the suggestions, I changed it to match your suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good at the high level, but a few comments inline.
|
||
|
||
def get_interface_setting(file): | ||
"""Return the value content of a setting file as int""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstrings should also document the function arguments and return expectations (including exceptions raised).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added parameter and return docstrings to all functions now, please verify.
|
||
|
||
def check_interface_numvfs(iface, numvfs): | ||
"""Check SR-IOV numvfs config""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as docstring comment as above
|
||
|
||
def parse_sriov_numvfs(device_numvfs): | ||
"""Parse parameters and check format""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above
iface = str(device_numvfs.split(":")[0]) | ||
numvfs = int(device_numvfs.split(":")[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
device_numvfs
doesn't need to be split again as you already have it in parts
. You may also want to consider just unpacking the split into iface and numvfs directly. Yes, if there are too many ':' characters, it will raise a ValueError but its essentially the len(parts) != 2 check. Also, FYI the parens on the return are superfluous
try:
iface, numvfs = device_numvfs.split(':')
numvfs = int(numvfs)
if len(iface) < 3 or numvfs < 0:
raise ArgsFormatError(msg)
return iface, numvfs
except ValueError:
raise ArgsFormatError(msg)
Alternatively, you could use regexes as well - though that is arguably more or less complex, depending on your love of regex:
DEV_VF_PATTERN = r'(?P<iface>[0-9a-z]{3,}):(?P<numvfs>\d+)'
...
match = re.match(DEV_VF_PATTERN, device_numvfs)
if match is None:
raise ArgsFormatError(msg)
return match.group('iface'), int(match.group('numvfs'))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion! I changed it to use regex.
len(parts) != 2 or # exactly 2 parts | ||
len(parts[0]) < 3 or # interface name should be at least 3 chars | ||
len(parts[1]) < 1 or # numvfs should be at least 1 char | ||
int(parts[1]) < 0 # numvfs should be a valid int > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any particular reason you are relying on a ValueError here rather than using an ArgsFormatError? Its not necessarily wrong, and conveys information - but arguably a non-int value here is an ArgsFormatError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No particular reason. Due to the change to regex this is an ArgsFormatError now.
|
||
DEVICE_TEMPLATE = "/sys/class/net/{0}" | ||
SRIOV_NUMVFS_TEMPLATE = "/sys/class/net/{0}/device/sriov_numvfs" | ||
SRIOV_TOTALVFS_TEMPLATE = "/sys/class/net/{0}/device/sriov_totalvfs" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate a bit more on why this is a critical alert when the requested VFs cannot be provided by the card? It seems conceivable to me that 2 cards with different capabilities across different units have the same interface name but one can serve 128 VFs and another 64. Setting the value to 128 maxes out both available VFs but it is fewer than 128*2.
@wolsen Thank you for the review! I improved the MR based on your comments. Regarding the sriov_totalvfs: My assumption is that the number of virtual functions that was requested should be available for use. If the requested numvfs > sriov_totalvfs then the alert will have two messages:
The second message is just a hint for an operator that the reason might be the hardware or hardware configuration that does not support the requested number of VFs. This might be helpful if the cluster hardware is heterogeneous or a NIC has been replace with a different model that supports less VFs (we had a ase like this in the past). I can convert it to a Warning or remove it entirely if this behaviour is not desired but it might lead to a situation when VMs can not be deployed because there are no more VFs available. |
To explore it further, for these given checks, what actions are available to the operator with to avoid getting spammed with critical alerts when the heterogeneous scenario occurs? |
If the hardware is heterogeneous by design then an operator would have to create separate applications with different settings. We already have cases like this where were we have different type of machines due to different NICs or customer needs for the machines (with/without SR-IOV). There we deploy separate applications due to differences in hardware related settings (e.g. nova-compute and ovn-chassis). An alternative would be to make this behavior configurable - essentially disabling checks 1. and 2. from #670 (comment). |
I think if its known in advance, then the heterogeneous by design could be accounted for. However, in the case of an existing deployment having to consider this I think it places a burden on the operator in this scenario. One option is certainly to make this configurable and this is a worthwhile discussion as to whether it should be configurable and what the default value should be. However I think there's value in the other checks that are part of this PR and I don't want to hold those up for the discussion on this particular check. I propose that you break out the case where numvfs expected > total_vfs_sriov into another pull request and we can have appropriate discourse separately. Does that sound reasonable? |
Only the "VFs are enabled" check would remain.
Because the second check would still be CRITICAL. If that is needed to get this PR to move forward then I can remove the checks 2 and 3. |
Check to verify the number of virtual functions (VFs) active on network interfaces. Based on /sys/class/net//device/sriov_numvfs and /sys/class/net//device/sriov_totalvfs.
This check is intended to be used in various networking charms like ovn-chassis, therefore it is added here to avoid replicating the code.
The goal of the check is to identify network interface configuration problems regarding virtual functions.
Documentation: