Unable to run neuron device plugin on EKS with containerd only #794

Closed
bryantbiggs opened this issue Nov 21, 2023 · 6 comments

bryantbiggs (Contributor) commented Nov 21, 2023

I am trying to set up the Neuron device plugin on EKS with a custom AL2023 AMI, but I am getting the following error:

k logs -n kube-system neuron-device-plugin-daemonset-ql6gf
which: no docker-runc in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2023/11/21 23:19:02 kubeconfig /etc/kubernetes/kubelet.conf failed to find due to stat /etc/kubernetes/kubelet.conf: no such file or directory
2023/11/21 23:19:02 Device Plugin is in Driver Mode:true
neuron-device-plugin: 2023/11/21 23:19:02 Fetching Cores.
neuron-device-plugin: 2023/11/21 23:19:02 Core list: [0 1]
neuron-device-plugin: 2023/11/21 23:19:02 Starting FS watcher.
neuron-device-plugin: 2023/11/21 23:19:02 Starting OS watcher.
neuron-device-plugin: 2023/11/21 23:19:32 Get "https://172.20.0.1:443/api/v1/nodes/ip-10-0-23-61.ec2.internal": dial tcp 172.20.0.1:443: i/o timeout

In the containerd journalctl logs I am seeing the following lines, but I haven't been able to track down any information on them so far:

Nov 21 23:19:02 ip-10-0-23-61.ec2.internal oci_neuron_hook[46443]: add devices at cnt root rootfs for pid:46437
Nov 21 23:19:02 ip-10-0-23-61.ec2.internal oci_neuron_hook[46443]: No devices were specified

I am not installing any Docker components; I am only using containerd (the default since EKS 1.24). I have installed the following on the AMI:

  • oci-add-hooks
  • aws-neuronx-dkms-2.*
  • aws-neuronx-oci-hook-2.*

The source AMI is ami-0d4df6583e939a1c4 (us-east-1), the latest Amazon Linux 2023 minimal image; all of the EKS components (kubelet, containerd, etc.) have been installed and validated.

The containerd config in use:

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
disabled_plugins = [
  "io.containerd.internal.v1.opt",
  "io.containerd.snapshotter.v1.aufs",
  "io.containerd.snapshotter.v1.devmapper",
  "io.containerd.snapshotter.v1.native",
  "io.containerd.snapshotter.v1.zfs",
]

[grpc]
  address = "/run/containerd/containerd.sock"

[plugins."io.containerd.grpc.v1.cri"]
  sandbox_image = "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.8"

  [plugins."io.containerd.grpc.v1.cri".cni]
    bin_dir  = "/opt/cni/bin"
    conf_dir = "/etc/cni/net.d"

  [plugins."io.containerd.grpc.v1.cri".containerd]
    default_runtime_name    = "neuron"
    discard_unpacked_layers = true

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.neuron]
      runtime_type = "io.containerd.runc.v2"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.neuron.options]
        BinaryName    = "/opt/aws/neuron/bin/oci_neuron_hook_wrapper.sh"
        SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".registry]
  config_path = "/etc/containerd/certs.d"

The Neuron device plugin daemonset in use:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  annotations:
    deprecated.daemonset.template.generation: "2"
  creationTimestamp: "2023-11-21T21:01:19Z"
  generation: 2
  name: neuron-device-plugin-daemonset
  namespace: kube-system
  resourceVersion: "792379"
  uid: 0b4af0df-0531-46da-aa29-b4e58db8cf76
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: neuron-device-plugin-ds
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: neuron-device-plugin-ds
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node.kubernetes.io/instance-type
                operator: In
                values:
                - trn1.32xlarge
                - inf1.6xlarge
                - inf2.xlarge
                - inf1.2xlarge
                - inf1.xlarge
                - trn1.2xlarge
                - inf1.24xlarge
                - inf2.4xlarge
                - trn1n.32xlarge
                - inf2.8xlarge
                - inf2.24xlarge
                - inf2.48xlarge
      automountServiceAccountToken: true
      containers:
      - env:
        - name: KUBECONFIG
          value: /etc/kubernetes/kubelet.conf
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        image: public.ecr.aws/neuron/neuron-device-plugin:2.16.18.0
        imagePullPolicy: Always
        name: neuron-device-plugin
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/device-plugins
          mountPropagation: None
          name: device-plugin
        - mountPath: /run
          mountPropagation: None
          name: infa-map
      dnsPolicy: ClusterFirst
      enableServiceLinks: true
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      shareProcessNamespace: false
      terminationGracePeriodSeconds: 30
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        key: aws.amazon.com/neuron
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet/device-plugins
          type: ""
        name: device-plugin
      - hostPath:
          path: /run
          type: ""
        name: infa-map
  updateStrategy:
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
    type: RollingUpdate
status:
  currentNumberScheduled: 1
  desiredNumberScheduled: 1
  numberMisscheduled: 0
  numberReady: 0
  numberUnavailable: 1
  observedGeneration: 2
  updatedNumberScheduled: 1

With clusterrole:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  creationTimestamp: "2023-11-18T18:51:07Z"
  name: neuron-device-plugin
  resourceVersion: "8369"
  uid: 9b166553-307b-47e7-9cad-c5a7870bb3e7
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - patch
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - update
  - patch
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - nodes/status
  verbs:
  - patch
  - update
The neuron kernel module is loaded on the node:

[root@ip-10-0-23-61 ~]# lsmod | grep neuron
neuron                274432  0

From dmesg:

[    8.910802] neuron: loading out-of-tree module taints kernel.
[    8.910867] neuron: module verification failed: signature and/or required key missing - tainting kernel
[    8.913939] Neuron Driver Started with Version:2.14.5.0-191a3a6ffdceb767d689674ecdb6f863627ce5ae
bryantbiggs (Contributor, Author):

I believe these:

[    8.910802] neuron: loading out-of-tree module taints kernel.
[    8.910867] neuron: module verification failed: signature and/or required key missing - tainting kernel

are related to aws-neuron/aws-neuron-driver#6, but they are not part of the issue described here.

james-aws (Contributor):

@bryantbiggs thanks for your report. We will investigate and respond soon.

james-aws (Contributor):

@bryantbiggs when using EKS it is unnecessary to install aws-neuronx-oci-hook. The only package that needs to be installed on worker nodes is the driver, aws-neuronx-dkms.

Could you please try again, only installing aws-neuronx-dkms? Also let me know the instance type in your node group if you continue to encounter issues.
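For anyone following along, here is a minimal sketch of installing only the driver on an AL2023-based AMI. The repository URL, GPG key path, and package name follow the public Neuron documentation, but verify them against the current docs before use:

```shell
# Add the AWS Neuron yum repository (AL2023 uses dnf).
sudo tee /etc/yum.repos.d/neuron.repo > /dev/null <<'EOF'
[neuron]
name=Neuron YUM Repository
baseurl=https://yum.repos.neuron.amazonaws.com
enabled=1
metadata_expire=0
EOF
sudo rpm --import https://yum.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB

# Install only the kernel driver; no OCI hook packages are needed on EKS.
sudo dnf install -y aws-neuronx-dkms

# Verify the module is loaded (it loads automatically on Neuron instances).
lsmod | grep neuron
```

Note that aws-neuronx-oci-hook and oci-add-hooks are intentionally absent from this sketch, per the guidance above.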

bryantbiggs (Contributor, Author):

@james-aws - just to clarify, the aws-neuronx-oci-hook and oci-add-hooks packages are no longer required at all?

Does that also mean the containerd config.toml should not be updated, or is there a different config to use now that the hooks aren't needed?

default_runtime_name = "neuron"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.neuron]
   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.neuron.options]
      BinaryName = "/opt/aws/neuron/bin/oci_neuron_hook_wrapper.sh"

https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/tutorial-oci-hook.html?highlight=containerd#for-containerd-runtime-setup-containerd-to-use-oci-neuron-oci-runtime

james-aws (Contributor):

When using EKS, aws-neuronx-oci-hook and oci-add-hooks are not required. There is also no need to modify the containerd config.toml on worker nodes.
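Concretely, that means the "neuron" runtime and the BinaryName override pointing at oci_neuron_hook_wrapper.sh can simply be dropped from config.toml. A sketch of the equivalent stanza using the stock runc runtime (the runtime names and options here are standard containerd CRI settings, not Neuron-specific):

```toml
# Default CRI runtime config with no Neuron OCI hook: the stock
# runc v2 shim handles all containers, including the device plugin.
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name    = "runc"
  discard_unpacked_layers = true

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
      SystemdCgroup = true
```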

adammw (Contributor) commented Oct 9, 2024

> when using eks aws-neuronx-oci-hook and add-oci-hooks are not required

I assume you mean that they are pre-installed on the supplied EKS AMIs, since if you are supplying your own AMI, running on EKS makes no difference.
