
NVIDIA Nic Configuration Operator

NVIDIA Nic Configuration Operator provides a Kubernetes API (Custom Resource Definitions) for configuring firmware (FW) on NVIDIA NICs in a coordinated manner. It deploys a configuration daemon on each of the desired nodes to configure the NVIDIA NICs there. The operator uses the maintenance operator to prepare a node for maintenance before applying the configuration.




Deploy latest from project sources

# Clone project
git clone ; cd nic-configuration-operator

# Install Operator
helm install -n nic-configuration-operator --create-namespace --set operator.image.tag=latest nic-configuration ./deployment/nic-configuration-operator-chart

# View deployed resources
kubectl -n nic-configuration-operator get all


Refer to the helm values documentation for more information.

Deploy last release from OCI repo

helm install -n nic-configuration-operator --create-namespace nic-configuration-operator oci://



The NicConfigurationTemplate CRD is used to request FW configuration for a subset of devices.

Nic Configuration Operator will select NIC devices in the cluster that match the template's selectors and apply the configuration spec to them.

If more than one template matches a single device, none of them is applied and the error is reported in the status of every matching template.
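The matching rule above can be sketched as follows. This is a minimal illustration in Python, not the operator's implementation: the dictionary layout and the helper names template_matches and select_template are hypothetical, and so is the assumption that a device must satisfy every selector a template sets (with serial numbers compared case-insensitively).

```python
def template_matches(template, device):
    """Return True if the device satisfies every selector the template sets."""
    # nicType is the only mandatory selector.
    if template["nicType"] != device["type"]:
        return False
    # Optional selectors (assumed semantics): a missing or empty list matches anything.
    pci = template.get("pciAddresses")
    if pci and not set(pci) & set(device["pciAddresses"]):
        return False
    serials = template.get("serialNumbers")
    if serials and device["serialNumber"].lower() not in {s.lower() for s in serials}:
        return False
    return True


def select_template(templates, device):
    """Return (template_name, error); a template is applied only on a unique match."""
    matching = sorted(t["name"] for t in templates if template_matches(t, device))
    if len(matching) == 1:
        return matching[0], None
    if not matching:
        return None, None
    # More than one match: apply none and report the conflict.
    return None, "device matches multiple templates: " + ", ".join(matching)


device = {
    "type": "101b",
    "serialNumber": "mt2232t13210",
    "pciAddresses": ["0000:04:00.0", "0000:04:00.1"],
}
templates = [
    {"name": "by-type", "nicType": "101b"},
    {"name": "by-serial", "nicType": "101b", "serialNumbers": ["MT2232T13210"]},
]
print(select_template(templates, device))
```

Both hypothetical templates match the device here, so the sketch returns no template and an error string naming both of them.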

For more information, refer to the api-reference.


Warning (ResetToDefault): in NIC Configuration Operator template v0.1.14, the FW reset flow isn't supported on BF2/BF3 DPUs (not SuperNics).

apiVersion: configuration.net.nvidia.com/v1alpha1
kind: NicConfigurationTemplate
metadata:
   name: connectx6-config
   namespace: nic-configuration-operator
spec:
   nodeSelector:
      feature.node.kubernetes.io/network-sriov.capable: "true"
   nicSelector:
      # nicType selector is mandatory, the rest are optional. Only a single type can be specified.
      nicType: 101b
      pciAddresses:
         - "0000:03:00.0"
         - "0000:04:00.0"
      serialNumbers:
         - "MT2116X09299"
   resetToDefault: false # if set, the template is ignored and the device configuration is reset
   template:
      numVfs: 2
      linkType: Ethernet
      pciPerformanceOptimized:
         enabled: true
         maxAccOutRead: 44
         maxReadRequest: 4096
      roceOptimized:
         enabled: true
         qos:
            trust: dscp
            pfc: "0,0,0,1,0,0,0,0"
      gpuDirectOptimized:
         enabled: true
         env: Baremetal
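The pfc value in the example is a list of eight 0/1 flags, one per priority, so "0,0,0,1,0,0,0,0" enables Priority Flow Control on priority 3 only. A small sketch of how such a string can be built; the helper name pfc_string is hypothetical and not part of the operator:

```python
def pfc_string(enabled_priorities, num_priorities=8):
    """Build a comma-separated PFC mask with one 0/1 entry per priority."""
    enabled = set(enabled_priorities)
    out_of_range = sorted(p for p in enabled if not 0 <= p < num_priorities)
    if out_of_range:
        raise ValueError(f"priority out of range: {out_of_range}")
    return ",".join("1" if p in enabled else "0" for p in range(num_priorities))


print(pfc_string([3]))  # 0,0,0,1,0,0,0,0
```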

Configuration details

  • numVfs: configures SR-IOV VFs via nvconfig.
    • This is a mandatory parameter.
    • E.g. if numVfs=2 then SRIOV_EN=1 and SRIOV_NUM_OF_VFS=2.
    • If numVfs=0 then SRIOV_EN=0 and SRIOV_NUM_OF_VFS=0.
  • linkType: configures the link type for all ports of the NIC.
    • This is a mandatory parameter.
    • E.g. if linkType=Infiniband then LINK_TYPE_P1=IB is set, and LINK_TYPE_P2=IB as well if a second PCI function is present.
  • pciPerformanceOptimized: performs PCI performance optimizations. If enabled, the following happens by default:
    • Set the MAX_ACC_OUT_READ nvconfig parameter to 0 (use device defaults)
    • Set PCI max read request size for each PF to 4096 (note: this is a runtime config and is not persistent)
    • Users can override values via maxAccOutRead and maxReadRequest


According to the PRM, setting MAX_ACC_OUT_READ to zero enables the auto mode, which applies the best suitable optimizations. However, there is a bug in certain FW versions, where the zero value is not available. In this case, until the fix is available, MAX_ACC_OUT_READ will not be set and a warning event will be emitted for this device's CR.

  • roceOptimized: performs RoCE related optimizations. If enabled performs the following by default:
    • Nvconfig set for both ports (can be applied from PF0)
      • Conditionally applied for second port if present
        • ROCE_CC_PRIO_MASK_P1=255, ROCE_CC_PRIO_MASK_P2=255
        • CNP_DSCP_P1=4, CNP_DSCP_P2=4
        • CNP_802P_PRIO_P1=6, CNP_802P_PRIO_P2=6
    • Configure pfc (Priority Flow Control) for priority 3 and set trust to dscp on each PF
      • Non-persistent (needs to be reapplied after each boot)
      • Users can override values via trust and pfc parameters
    • Can only be enabled with linkType=Ethernet
  • gpuDirectOptimized: performs GPUDirect optimizations. Currently, only optimizations for the Baremetal environment are supported. If enabled, performs the following:
    • Set nvconfig ATS_ENABLED=0
    • Can only be enabled when pciPerformanceOptimized is enabled
  • rawNvConfig: raw NVConfig parameters to apply to the NIC.
    • Both the numeric values and their string aliases, supported by NVConfig, are allowed (e.g. REAL_TIME_CLOCK_ENABLE=False, REAL_TIME_CLOCK_ENABLE=0).
    • For per-port parameters (suffix _P1, _P2), parameters with the _P2 suffix are ignored if the device is single-port.
  • If a configuration is not set in spec, its non-volatile configuration parameters (if any) should be set to device default.
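The mapping from template fields to nvconfig parameters described in the list above can be sketched roughly as follows. This is an illustration only, not the operator's code: the function name nvconfig_params and the dictionary layout are hypothetical, and the ETH value for Ethernet links is an assumption taken from mlxconfig conventions (the list above only shows IB). Runtime-only settings (max read request size, trust, pfc) are left out because they are not nvconfig parameters.

```python
def nvconfig_params(template, dual_port=True):
    """Translate a template dict into the nvconfig parameters listed above."""
    ports = ("P1", "P2") if dual_port else ("P1",)
    params = {}

    # numVfs: SR-IOV is enabled only when at least one VF is requested.
    num_vfs = template["numVfs"]
    params["SRIOV_EN"] = 1 if num_vfs > 0 else 0
    params["SRIOV_NUM_OF_VFS"] = num_vfs

    # linkType: applied to every port that is present.
    link_type = template["linkType"]
    link_value = {"Infiniband": "IB", "Ethernet": "ETH"}[link_type]  # ETH is assumed
    for port in ports:
        params[f"LINK_TYPE_{port}"] = link_value

    pci = template.get("pciPerformanceOptimized", {})
    if pci.get("enabled"):
        # 0 means "use device defaults" unless the user overrides it.
        params["MAX_ACC_OUT_READ"] = pci.get("maxAccOutRead", 0)

    roce = template.get("roceOptimized", {})
    if roce.get("enabled"):
        if link_type != "Ethernet":
            raise ValueError("roceOptimized requires linkType=Ethernet")
        for port in ports:
            params[f"ROCE_CC_PRIO_MASK_{port}"] = 255
            params[f"CNP_DSCP_{port}"] = 4
            params[f"CNP_802P_PRIO_{port}"] = 6

    gpu = template.get("gpuDirectOptimized", {})
    if gpu.get("enabled"):
        if not pci.get("enabled"):
            raise ValueError("gpuDirectOptimized requires pciPerformanceOptimized")
        params["ATS_ENABLED"] = 0

    return params


example = {
    "numVfs": 2,
    "linkType": "Ethernet",
    "pciPerformanceOptimized": {"enabled": True, "maxAccOutRead": 44},
}
print(nvconfig_params(example, dual_port=False))
```

The two validation errors mirror the constraints stated in the list: roceOptimized needs an Ethernet link, and gpuDirectOptimized needs pciPerformanceOptimized to be enabled.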


The NicDevice CRD is created automatically by the configuration daemon and represents a specific NVIDIA NIC on a specific K8s node. The name of the device combines the node name, device type and its serial number for easier tracking.
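As a rough illustration of the naming scheme, the sketch below joins the three parts with hyphens and lowercases the result, since Kubernetes object names must be lowercase. The helper name nic_device_name is hypothetical, and the exact format used by the daemon is assumed from the example name co-node-25-101b-mt2232t13210.

```python
def nic_device_name(node, device_type, serial_number):
    """Combine node name, device type and serial number into a lowercase name."""
    return "-".join([node, device_type, serial_number]).lower()


print(nic_device_name("co-node-25", "101b", "MT2232T13210"))  # co-node-25-101b-mt2232t13210
```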

ConfigUpdateInProgress status condition can be used for tracking the state of the FW configuration update on a specific device. If an error occurs during FW configuration update, it will be reflected in this field.

For more information, refer to the api-reference.

Example NicDevice

apiVersion: configuration.net.nvidia.com/v1alpha1
kind: NicDevice
metadata:
   name: co-node-25-101b-mt2232t13210
   namespace: nic-configuration-operator
spec:
   configuration:
      template:
         linkType: Ethernet
         numVfs: 8
         pciPerformanceOptimized:
            enabled: true
status:
   conditions:
      - reason: UpdateSuccessful
        status: "False"
        type: ConfigUpdateInProgress
   firmwareVersion: 20.42.1000
   node: co-node-25
   partNumber: mcx632312a-hdat
   ports:
      - networkInterface: enp4s0f0np0
        pci: "0000:04:00.0"
        rdmaInterface: mlx5_0
      - networkInterface: enp4s0f1np1
        pci: "0000:04:00.1"
        rdmaInterface: mlx5_1
   psid: mt_0000000225
   serialNumber: mt2232t13210
   type: 101b
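The ConfigUpdateInProgress condition in the example can be read programmatically once the object is loaded into a dictionary. A minimal sketch, assuming the usual Kubernetes layout with conditions under status.conditions; the helper names are hypothetical and how the object is fetched is left out:

```python
def config_update_condition(nic_device):
    """Return the ConfigUpdateInProgress condition of a NicDevice dict, or None."""
    for condition in nic_device.get("status", {}).get("conditions", []):
        if condition.get("type") == "ConfigUpdateInProgress":
            return condition
    return None


def is_update_in_progress(nic_device):
    """True only if the condition exists and its status is the string "True"."""
    condition = config_update_condition(nic_device)
    return condition is not None and condition.get("status") == "True"


nic_device = {
    "status": {
        "conditions": [
            {"reason": "UpdateSuccessful", "status": "False", "type": "ConfigUpdateInProgress"}
        ]
    }
}
condition = config_update_condition(nic_device)
print(is_update_in_progress(nic_device), condition["reason"])
```

For the example device the update is not in progress and the reason is UpdateSuccessful.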

Implementation details:

The NicDevice CRD is created and reconciled by the configuration daemon. The reconciliation logic scheme can be found here.

Provisioning a storage class for NIC FW upgrade

To enable the NIC FW upgrade feature, the nicFirmwareStorage.pvcName parameter must be provided in the helm chart. You can either create a new PVC or use an existing one. Firmware binaries are provisioned by a provisioner controller, which watches for NICFirmwareSource objects and places the binaries in a shared volume backed by the given storage class. Node agents must make sure that the referenced NICFirmwareSource object is fully reconciled (status.state == Success) before proceeding with the firmware update.
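The readiness check described above (status.state == Success) can be sketched as a simple polling loop. This is only an illustration of the gating logic, not the node agent's implementation: the function names, the timeout and interval defaults, and the fetch callable (which stands in for however the object is read from the cluster) are all hypothetical.

```python
import time


def firmware_source_ready(source):
    """True once the firmware source object reports status.state == "Success"."""
    return source.get("status", {}).get("state") == "Success"


def wait_for_firmware_source(fetch, timeout=300.0, interval=5.0,
                             sleep=time.sleep, clock=time.monotonic):
    """Poll fetch() until the object is ready; return False if the timeout expires."""
    deadline = clock() + timeout
    while True:
        if firmware_source_ready(fetch()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Usage with a stand-in fetch function that becomes ready on the second poll.
responses = iter([{}, {"status": {"state": "Success"}}])
print(wait_for_firmware_source(lambda: next(responses), interval=0.0))
```

Passing sleep and clock as parameters keeps the loop easy to exercise without real delays.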

Example of the storage class deployment

To set up persistent NFS storage in the cluster, the example from the CSI NFS Driver repository can be used. After deploying the NFS server and the NFS CSI driver, deploy the storage class. The name of the storage class can then be passed to the NIC Configuration Operator helm chart.