Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NRPE: add check: check_sriov_numvfs.py #670

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

Conversation

SPFZ
Copy link
Contributor

@SPFZ SPFZ commented Jan 13, 2022

Check to verify the number of virtual functions (VFs) active on network interfaces. Based on /sys/class/net//device/sriov_numvfs and /sys/class/net//device/sriov_totalvfs.

This check is intended to be used in various networking charms like ovn-chassis, therefore it is added here to avoid replicating the code.

The goal of the check is to identify network interface configuration problems regarding virtual functions.

Documentation:

Check to verify the number of virtual functions (VFs) active on network
interfaces. Based on /sys/class/net/<interface>/device/sriov_numvfs and
/sys/class/net/<interface>/device/sriov_totalvfs.

    usage: check_sriov_numvfs.py [-h] sriov_numvfs [sriov_numvfs ...]

    Check sriov numvfs configuration

    positional arguments:
    sriov_numvfs  format: <interface>:<numvfs>

For each interfaces:numvfs pair given it verifies that:
    1. VFs are enabled
    2. VFs are configured are the same as specified in the check parameter
    3. VFs configured do not exceed the maximum available on the interface

Example: ./check_sriov_numvfs.py ens3f0:32 ens3f1:32 ens6f0:32 ens6f1:32

Check to verify the number of virtual functions (VFs) active on network
interfaces. Based on /sys/class/net/<interface>/device/sriov_numvfs and
/sys/class/net/<interface>/device/sriov_totalvfs.
Copy link
Contributor

@jtroup jtroup left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great start, thanks.

charmhelpers/contrib/network/files/__init__.py Outdated Show resolved Hide resolved
charmhelpers/contrib/charmsupport/nrpe.py Outdated Show resolved Hide resolved
charmhelpers/contrib/charmsupport/nrpe.py Outdated Show resolved Hide resolved
charmhelpers/contrib/charmsupport/nrpe.py Show resolved Hide resolved
tests/contrib/network/files/__init__.py Outdated Show resolved Hide resolved
tests/contrib/network/files/test_check_sriov_numvfs.py Outdated Show resolved Hide resolved
tests/contrib/network/files/test_check_sriov_numvfs.py Outdated Show resolved Hide resolved
tests/contrib/network/files/test_check_sriov_numvfs.py Outdated Show resolved Hide resolved
tests/contrib/network/files/test_check_sriov_numvfs.py Outdated Show resolved Hide resolved
Copy link
Contributor

@auria auria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, changes look good. I've shared a couple of suggestions. One changes what is considered "unknown" and what is "critical". The other is a readability suggestion.

if error_msg:
print("CRITICAL: {} problems detected\n".format(len(error_msg)) + "\n".join(error_msg))
sys.exit(2)
except: # noqa: E722
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed to a catch-all scope for unexpected exceptions. However, we should consider adding a previous specific list of exceptions that we think could be raised (and maybe raise a critical error when they happen). Alerting is triggered in Nagios when unknown/critical/recover events happen, so this change would only help in sharing a more meaningful message (a sentence vs a python traceback).

e.g. PermissionError, FileNotFoundError, AssertionError.

Side note: instead of using "assert ...", an exception class inherited from Exception (e.g. ArgsFormatError) could be raised when parsing CLI arguments (and caught here).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestions, I changed it to match your suggestion.

charmhelpers/contrib/network/files/check_sriov_numvfs.py Outdated Show resolved Hide resolved
auria
auria previously approved these changes Jan 18, 2022
Copy link
Contributor

@auria auria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thank you.

Copy link
Collaborator

@wolsen wolsen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good at the high level, but a few comments inline.



def get_interface_setting(file):
"""Return the value content of a setting file as int"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstrings should also document the function arguments and return expectations (including exceptions raised).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added parameter and return docstrings to all functions now, please verify.



def check_interface_numvfs(iface, numvfs):
"""Check SR-IOV numvfs config"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as docstring comment as above



def parse_sriov_numvfs(device_numvfs):
"""Parse parameters and check format"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

Comment on lines 106 to 107
iface = str(device_numvfs.split(":")[0])
numvfs = int(device_numvfs.split(":")[1])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

device_numvfs doesn't need to be split again as you already have it in parts. You may also want to consider just unpacking the split into iface and numvfs directly. Yes, if there are too many ':' characters, it will raise a ValueError but its essentially the len(parts) != 2 check. Also, FYI the parens on the return are superfluous

try:
    iface, numvfs = device_numvfs.split(':')
    numvfs = int(numvfs)
    if len(iface) < 3 or numvfs < 0:
        raise ArgsFormatError(msg)
    return iface, numvfs
except ValueError:
    raise ArgsFormatError(msg)

Alternatively, you could use regexes as well - though that is arguably more or less complex, depending on your love of regex:

DEV_VF_PATTERN = r'(?P<iface>[0-9a-z]{3,}):(?P<numvfs>\d+)'
...
match = re.match(DEV_VF_PATTERN, device_numvfs)
if match is None:
    raise ArgsFormatError(msg)
return match.group('iface'), int(match.group('numvfs'))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion! I changed it to use regex.

len(parts) != 2 or # exactly 2 parts
len(parts[0]) < 3 or # interface name should be at least 3 chars
len(parts[1]) < 1 or # numvfs should be at least 1 char
int(parts[1]) < 0 # numvfs should be a valid int > 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any particular reason you are relying on a ValueError here rather than using an ArgsFormatError? Its not necessarily wrong, and conveys information - but arguably a non-int value here is an ArgsFormatError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No particular reason. Due to the change to regex this is an ArgsFormatError now.


DEVICE_TEMPLATE = "/sys/class/net/{0}"
SRIOV_NUMVFS_TEMPLATE = "/sys/class/net/{0}/device/sriov_numvfs"
SRIOV_TOTALVFS_TEMPLATE = "/sys/class/net/{0}/device/sriov_totalvfs"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate a bit more on why this is a critical alert when the requested VFs cannot be provided by the card? It seems conceivable to me that 2 cards with different capabilities across different units have the same interface name but one can serve 128 VFs and another 64. Setting the value to 128 maxes out both available VFs but it is fewer than 128*2.

@SPFZ
Copy link
Contributor Author

SPFZ commented Jan 20, 2022

@wolsen Thank you for the review! I improved the MR based on your comments.

Regarding the sriov_totalvfs: My assumption is that the number of virtual functions that was requested should be available for use.

If the requested numvfs > sriov_totalvfs then the alert will have two messages:

  1. Number of VFs on interface does not match expected (because numvfs != sriov_numvfs L95)
  2. Maximum number of VFs available on interface is lower than the expected (because numvfs > sriov_totalvfs L103)

The second message is just a hint for an operator that the reason might be the hardware or hardware configuration that does not support the requested number of VFs. This might be helpful if the cluster hardware is heterogeneous or a NIC has been replace with a different model that supports less VFs (we had a ase like this in the past). I can convert it to a Warning or remove it entirely if this behaviour is not desired but it might lead to a situation when VMs can not be deployed because there are no more VFs available.

@wolsen
Copy link
Collaborator

wolsen commented Jan 21, 2022

@wolsen Thank you for the review! I improved the MR based on your comments.

Regarding the sriov_totalvfs: My assumption is that the number of virtual functions that was requested should be available for use.

If the requested numvfs > sriov_totalvfs then the alert will have two messages:

  1. Number of VFs on interface does not match expected (because numvfs != sriov_numvfs L95)
  2. Maximum number of VFs available on interface is lower than the expected (because numvfs > sriov_totalvfs L103)

The second message is just a hint for an operator that the reason might be the hardware or hardware configuration that does not support the requested number of VFs. This might be helpful if the cluster hardware is heterogeneous or a NIC has been replace with a different model that supports less VFs (we had a ase like this in the past). I can convert it to a Warning or remove it entirely if this behaviour is not desired but it might lead to a situation when VMs can not be deployed because there are no more VFs available.

To explore it further, for these given checks, what actions are available to the operator with to avoid getting spammed with critical alerts when the heterogeneous scenario occurs?

@SPFZ
Copy link
Contributor Author

SPFZ commented Jan 21, 2022

To explore it further, for these given checks, what actions are available to the operator with to avoid getting spammed with critical alerts when the heterogeneous scenario occurs?

If the hardware is heterogeneous by design then an operator would have to create separate applications with different settings. We already have cases like this where were we have different type of machines due to different NICs or customer needs for the machines (with/without SR-IOV). There we deploy separate applications due to differences in hardware related settings (e.g. nova-compute and ovn-chassis).

An alternative would be to make this behavior configurable - essentially disabling checks 1. and 2. from #670 (comment).

@wolsen
Copy link
Collaborator

wolsen commented Jan 24, 2022

To explore it further, for these given checks, what actions are available to the operator with to avoid getting spammed with critical alerts when the heterogeneous scenario occurs?

If the hardware is heterogeneous by design then an operator would have to create separate applications with different settings. We already have cases like this where were we have different type of machines due to different NICs or customer needs for the machines (with/without SR-IOV). There we deploy separate applications due to differences in hardware related settings (e.g. nova-compute and ovn-chassis).

An alternative would be to make this behavior configurable - essentially disabling checks 1. and 2. from #670 (comment).

I think if its known in advance, then the heterogeneous by design could be accounted for. However, in the case of an existing deployment having to consider this I think it places a burden on the operator in this scenario. One option is certainly to make this configurable and this is a worthwhile discussion as to whether it should be configurable and what the default value should be.

However I think there's value in the other checks that are part of this PR and I don't want to hold those up for the discussion on this particular check. I propose that you break out the case where numvfs expected > total_vfs_sriov into another pull request and we can have appropriate discourse separately.

Does that sound reasonable?

@SPFZ
Copy link
Contributor Author

SPFZ commented Jan 25, 2022

However I think there's value in the other checks that are part of this PR and I don't want to hold those up for the discussion on this particular check. I propose that you break out the case where numvfs expected > total_vfs_sriov into another pull request and we can have appropriate discourse separately.
Does that sound reasonable?

Only the "VFs are enabled" check would remain.

For each interfaces:numvfs pair given it verifies that:
1. VFs are enabled
2. VFs are configured are the same as specified in the check parameter
3. VFs configured do not exceed the maximum available on the interface

Because the second check would still be CRITICAL.
E.g. card supports 64 VFs, check is set to 128 VFs then numvfs=64 != sriov_numvfs=128
https://github.com/SPFZ/charm-helpers/blob/master/charmhelpers/contrib/network/files/check_sriov_numvfs.py#L95

If that is needed to get this PR to move forward then I can remove the checks 2 and 3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants