This section summarizes the steps that may be needed during the entire lifecycle of PMEM in a cluster, starting with the initial preparations and ending with decommissioning the hardware. The other sections explain each step in more detail.
When setting up a cluster, the administrator must install the PMEM hardware on nodes and configure some or all of those instances of PMEM for usage by PMEM-CSI (prerequisites). Nodes where PMEM-CSI is supposed to run must have a certain label in Kubernetes.
The administrator must install PMEM-CSI, using the PMEM-CSI operator (recommended) or with scripts and YAML files in the source code. The default install settings should work for most clusters. Some clusters don't use /var/lib/kubelet as the data directory for kubelet. On those, the corresponding PMEM-CSI setting must be changed accordingly because otherwise kubelet does not find PMEM-CSI. The operator has an option for that in its API (kubeletDir in the DeploymentSpec); the YAML files can be edited or modified with kustomize.
A PMEM-CSI installation can use either direct device mode or LVM device mode, but not both. It is possible to install PMEM-CSI twice on the same cluster with different modes, with these restrictions:
- The driver names must be different.
- The installations must run on different nodes by using different node labels, or the "usage" parameter of the LVM mode driver installation on a node must be set so that it leaves space available for the direct mode driver installation on that same node.
The administrator must decide which storage classes shall be available to users of the cluster. A storage class references a driver installation by name, which indirectly determines the device mode. A storage class also chooses which filesystem is used (xfs or ext4) and enables Kata Containers support.
Optionally, the administrator can enable the scheduler extensions and monitoring of resource usage via the metrics support.
It is recommended to enable the scheduler extensions and use volumeBindingMode: WaitForFirstConsumer as in the pmem-storageclass-late-binding.yaml example. This ensures that pods get scheduled onto nodes that have sufficient RAM, CPU and PMEM. Without the scheduler extensions, it is random whether the scheduler picks a node that has PMEM available, and immediate binding (the default volume binding mode) might work better. However, pods then might not be able to run when the nodes where their volumes were created are overloaded.
Starting with Kubernetes 1.21, PMEM-CSI uses storage capacity tracking to handle Pod scheduling and the scheduler extensions are not needed anymore. WaitForFirstConsumer still is the recommended volume binding mode.
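As a minimal sketch of the recommended late binding setup, the following Python snippet builds a StorageClass manifest and prints it as JSON, which kubectl accepts in place of YAML. It is not a copy of pmem-storageclass-late-binding.yaml. The function name is hypothetical, and the driver name pmem-csi.intel.com is assumed to match the installed driver.

```python
import json


def late_binding_storage_class(name, driver_name, erase_after=True):
    """Return a StorageClass manifest (as a dict) that uses late binding."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        # The provisioner must be the driver name of the PMEM-CSI installation.
        "provisioner": driver_name,
        # Delay volume creation until a Pod that uses the claim is scheduled.
        "volumeBindingMode": "WaitForFirstConsumer",
        # Storage class parameters are always strings.
        "parameters": {"eraseAfter": "true" if erase_after else "false"},
    }


if __name__ == "__main__":
    manifest = late_binding_storage_class(
        "pmem-csi-sc-late-binding", "pmem-csi.intel.com"
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON could then be piped into kubectl apply -f -.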
Optionally, the log output format can be changed from the default "text" format (= the traditional glog format) to "json" (= output via zap) for easier processing.
When using the operator, existing PMEM-CSI installations can be upgraded seamlessly by installing a newer version of the operator. Downgrading by installing an older version is also supported, but may need manual work which will be documented in the release notes.
When using YAML files, the only reliable way of up- or downgrading is to remove the installation and install anew.
Users can then create PMEM volumes via persistent volume claims that reference the storage classes or via ephemeral inline volumes.
A node should only be removed from a cluster after ensuring that there is no pod running on it which uses PMEM and that there is no persistent volume (PV) on it. This can be checked via kubectl get -o yaml pv and looking for a nodeAffinity entry that references the node, or via metrics data for the node. When a node or even the entire PMEM-CSI driver installation is removed too soon, attempts to remove pods or volumes via the Kubernetes API will fail. Administrators can recover from that by force-deleting PVs for which the underlying hardware has already been removed.
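The nodeAffinity check described above can be automated. The following sketch assumes the PV list was obtained with kubectl get pv -o json and parsed into a dict. The function name, the sample data and the label key used in it are illustrative; the code matches the node name under any key.

```python
def pvs_on_node(pv_list, node_name):
    """Return the names of PVs whose nodeAffinity mentions node_name."""
    names = []
    for pv in pv_list.get("items", []):
        required = pv.get("spec", {}).get("nodeAffinity", {}).get("required", {})
        for term in required.get("nodeSelectorTerms", []):
            expressions = term.get("matchExpressions", [])
            if any(node_name in e.get("values", []) for e in expressions):
                names.append(pv["metadata"]["name"])
                break  # one matching term is enough for this PV
    return names


# Hypothetical sample in the shape of "kubectl get pv -o json" output.
SAMPLE = {
    "items": [
        {
            "metadata": {"name": "pvc-1"},
            "spec": {
                "nodeAffinity": {
                    "required": {
                        "nodeSelectorTerms": [
                            {
                                "matchExpressions": [
                                    {
                                        "key": "example.com/node",
                                        "operator": "In",
                                        "values": ["worker1"],
                                    }
                                ]
                            }
                        ]
                    }
                }
            },
        },
        {"metadata": {"name": "pvc-2"}, "spec": {}},
    ]
}

if __name__ == "__main__":
    print(pvs_on_node(SAMPLE, "worker1"))
```

An empty result for a node suggests that no PV is pinned to it, but pods using ephemeral inline volumes still have to be checked separately.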
By default, PMEM-CSI wipes volumes after usage (eraseAfter), so shredding PMEM hardware after decommissioning it is optional.
The recommended minimum Linux kernel version for running the PMEM-CSI driver is 4.15. See Persistent Memory Programming for more details about supported kernel versions.
Persistent memory device(s) are required for operation. However, some development and testing can be done using QEMU-emulated persistent memory devices. See the "QEMU and Kubernetes" section for the commands that create such a virtual test cluster.
The PMEM-CSI driver needs pre-provisioned regions on the NVDIMM device(s). The PMEM-CSI driver itself intentionally leaves that to the administrator who then can decide how much and how PMEM is to be used for PMEM-CSI.
Beware that the PMEM-CSI driver will run without errors on a node where PMEM was not prepared for it. It will then report zero local storage for that node, something that currently is only visible in the log files.
When running the Kubernetes cluster and PMEM-CSI on bare metal, the ipmctl utility can be used to create regions. App Direct Mode has two configuration options: interleaved or non-interleaved. In the non-interleaved configuration, one region is created per NVDIMM. In such a configuration, a PMEM-CSI volume cannot be larger than one NVDIMM.
Example of creating regions without interleaving, using all NVDIMMs:
$ ipmctl create -goal PersistentMemoryType=AppDirectNotInterleaved
Alternatively, multiple NVDIMMs can be combined to form an interleaved set. This causes the data to be striped over multiple NVDIMM devices for improved read/write performance and allows one region (and thus a PMEM-CSI volume) to be larger than a single NVDIMM.
Example of creating regions in interleaved mode, using all NVDIMMs:
$ ipmctl create -goal PersistentMemoryType=AppDirect
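The effect of the two configurations on the largest possible volume can be illustrated with a bit of arithmetic. This is a simplified sketch: it assumes that all NVDIMMs end up in one interleave set and it ignores metadata overhead. The function name is hypothetical.

```python
GIB = 1024 ** 3


def max_volume_size(dimm_sizes, interleaved):
    """Upper bound in bytes for a single volume, ignoring metadata overhead."""
    if not dimm_sizes:
        return 0
    # Interleaved: one region spans all NVDIMMs.
    # Non-interleaved: each NVDIMM is its own region, so the largest one wins.
    return sum(dimm_sizes) if interleaved else max(dimm_sizes)


if __name__ == "__main__":
    dimms = [128 * GIB] * 4
    print(max_volume_size(dimms, interleaved=False) // GIB, "GiB")
    print(max_volume_size(dimms, interleaved=True) // GIB, "GiB")
```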
If the operating system on the nodes does not provide ipmctl, then it can also be run inside a container, using the PMEM-CSI image. The same invocation works with podman instead of docker.
$ sudo docker run --privileged --rm -u 0:0 docker.io/intel/pmem-csi-driver:canary ipmctl help
Intel(R) Optane(TM) Persistent Memory Command Line Interface
Usage: ipmctl <verb>[<options>][<targets>][<properties>]
...
When running inside virtual machines, each virtual machine typically
already gets access to one region and ipmctl
is not needed inside
the virtual machine. Instead, that region must be made available for
use with PMEM-CSI because when the virtual machine comes up for the
first time, the entire region is already allocated for use as a single
block device:
$ ndctl list -RN
{
"regions":[
{
"dev":"region0",
"size":34357641216,
"available_size":0,
"max_available_extent":0,
"type":"pmem",
"persistence_domain":"unknown",
"namespaces":[
{
"dev":"namespace0.0",
"mode":"raw",
"size":34357641216,
"sector_size":512,
"blockdev":"pmem0"
}
]
}
]
}
$ ls -l /dev/pmem*
brw-rw---- 1 root disk 259, 0 Jun 4 16:41 /dev/pmem0
Labels must be initialized in such a region, which must be performed once after the first boot:
$ ndctl disable-region region0
disabled 1 region
$ ndctl init-labels nmem0
initialized 1 nmem
$ ndctl enable-region region0
enabled 1 region
$ ndctl list -RN
[
{
"dev":"region0",
"size":34357641216,
"available_size":34357641216,
"max_available_extent":34357641216,
"type":"pmem",
"iset_id":10248187106440278,
"persistence_domain":"unknown"
}
]
$ ls -l /dev/pmem*
ls: cannot access '/dev/pmem*': No such file or directory
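Whether a region still needs this preparation can also be decided from the JSON that ndctl list -RN prints. The sketch below handles both output shapes shown above (an object with a regions key and a plain list) and, defensively, a single region object. The helper names are hypothetical and the embedded samples are shortened versions of the outputs above.

```python
import json


def regions_of(ndctl_output):
    """Return the list of regions from `ndctl list -RN` JSON output."""
    data = json.loads(ndctl_output)
    if isinstance(data, dict):
        # Either {"regions": [...]} or a single region object.
        return data["regions"] if "regions" in data else [data]
    return data


def needs_preparation(region):
    """True if all space is taken and at least one namespace is in raw mode."""
    has_raw = any(ns.get("mode") == "raw" for ns in region.get("namespaces", []))
    return region.get("available_size", 0) == 0 and has_raw


BEFORE = (
    '{"regions":[{"dev":"region0","size":34357641216,"available_size":0,'
    '"namespaces":[{"dev":"namespace0.0","mode":"raw","size":34357641216}]}]}'
)
AFTER = '[{"dev":"region0","size":34357641216,"available_size":34357641216}]'

if __name__ == "__main__":
    for label, text in (("before", BEFORE), ("after", AFTER)):
        for region in regions_of(text):
            print(label, region["dev"], "needs preparation:", needs_preparation(region))
```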
On some virtual machines, for example VMware® vSphere, the persistent
memory does not support setting labels and the ndctl init-labels nmem0
command above would fail. What can be done in that case is to
convert the existing namespace from "raw" to "fsdax" mode and
then run PMEM-CSI in LVM mode. Direct mode is not possible because
it depends on creating additional namespaces which in turn depends
on support for labels. The command for conversion is:
$ ndctl create-namespace --force --reconfig=namespace0.0 --mode=fsdax --name=pmem-csi
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":67643637760,
"uuid":"9fa6976c-ab57-491b-a00c-e52d092a4fa8",
"sector_size":512,
"align":2097152,
"blockdev":"pmem0",
"name":"pmem-csi"
}
Note the pmem-csi name for the namespace: this is how PMEM-CSI in LVM mode knows that it is allowed to use this namespace. When the VM provides only "legacy PMEM", ndctl silently drops that name. In that case, the volume group has to be created manually:
$ vgcreate --force bus0region0fsdax /dev/pmem0
See automatic node setup below for instructions on how to automate this conversion.
This section assumes that a Kubernetes cluster is already available with at least one node that has persistent memory device(s). For development or testing, it is also possible to use a cluster that runs on QEMU virtual machines, see the "QEMU and Kubernetes" section.
- Make sure that the alpha feature gates CSINodeInfo and CSIDriverRegistry are enabled
The method to configure alpha feature gates may vary, depending on the Kubernetes deployment. It may not be necessary anymore when the feature has reached beta state, which depends on the Kubernetes version.
- Label the cluster nodes that provide persistent memory device(s)
PMEM-CSI manages PMEM on those nodes that have a certain label. For historic reasons, the default in the YAML files and the operator settings is to use a label storage with the value pmem.
Such a label can be set for each node manually with:
$ kubectl label node <your node> storage=pmem
Alternatively, the Node Feature Discovery (NFD) add-on can be used to label nodes automatically. In that case, the default PMEM-CSI node selector has to be changed to "feature.node.kubernetes.io/memory-nv.dax": "true". The operator has the nodeSelector field for that. For the YAML files a kustomize patch can be used.
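A node selector matches a node only when every key in the selector is present on the node with exactly the same value. The following sketch shows that rule for the two label conventions mentioned above. The function name and the sample node labels are hypothetical.

```python
def matches_selector(node_labels, node_selector):
    """True if every selector entry is present on the node with the same value."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


DEFAULT_SELECTOR = {"storage": "pmem"}
NFD_SELECTOR = {"feature.node.kubernetes.io/memory-nv.dax": "true"}

if __name__ == "__main__":
    # A node that was labeled by NFD but not manually.
    labels = {
        "kubernetes.io/hostname": "worker1",
        "feature.node.kubernetes.io/memory-nv.dax": "true",
    }
    print("default selector:", matches_selector(labels, DEFAULT_SELECTOR))
    print("NFD selector:", matches_selector(labels, NFD_SELECTOR))
```

With the default selector, the driver would not run on such a node, which is why the selector has to be changed when relying on NFD.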
The PMEM-CSI driver can be deployed to a Kubernetes cluster either using the PMEM-CSI operator or by using the reference YAML files provided in the source code.
The PMEM-CSI operator facilitates deploying and managing the PMEM-CSI driver on a Kubernetes cluster.
If your cluster supports managing the operators using the Operator Lifecycle Manager, then it is recommended to install the PMEM-CSI operator from the OperatorHub. Follow the instructions shown by the "Install" button. When using this approach, the operator itself always runs with default parameters, in particular log output in "text" format.
If you run an OpenShift cluster, then it is recommended to install the PMEM-CSI operator by following the instructions shown by "Deploy & use" on RedhatCatalog. The recommended approach is "Installing from OperatorHub using the web console".
Alternatively, you can install the operator manually from YAML files. First install the PmemCSIDeployment CRD:
$ kubectl create -f https://github.com/intel/pmem-csi/raw/devel/deploy/crd/pmem-csi.intel.com_pmemcsideployments.yaml
Then install the PMEM-CSI operator itself:
$ kubectl create -f https://github.com/intel/pmem-csi/raw/devel/deploy/operator/pmem-csi-operator.yaml
The operator gets deployed in a namespace called 'pmem-csi' which gets created by that YAML file.
WARNING: This YAML file cannot be used to stop just the operator while
keeping the PMEM-CSI deployments running. That's because something like
kubectl delete -f pmem-csi-operator.yaml
will delete the pmem-csi
namespace which then also causes all PMEM-CSI deployments that might have
been created in that namespace to be deleted.
Once the operator is installed and running, it is ready to handle
PmemCSIDeployment
objects in the pmem-csi.intel.com
API group.
Refer to the PmemCSIDeployment CRD API
for a complete list of supported properties.
Here is a minimal example driver deployment created with a custom resource:
NOTE: nodeSelector
must match the node label that was set in the
installation and setup section. The PMEM-CSI
scheduler extender and
webhook are not enabled in this basic
installation. See below for
instructions about that.
$ kubectl create -f - <<EOF
apiVersion: pmem-csi.intel.com/v1beta1
kind: PmemCSIDeployment
metadata:
name: pmem-csi.intel.com
spec:
deviceMode: lvm
nodeSelector:
feature.node.kubernetes.io/memory-nv.dax: "true"
EOF
This uses the same pmem-csi.intel.com driver name as the YAML files in deploy and the node label created by NFD (see the hardware installation and setup section).
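When several clusters need slightly different settings, the custom resource can also be generated. The sketch below builds the same object as the example above and optionally sets kubeletDir for clusters with a non-default kubelet data directory. The function name is hypothetical, and accepting "direct" as a second deviceMode value is an assumption based on the two device modes described earlier; check the PmemCSIDeployment CRD API for the authoritative field list.

```python
import json


def pmem_csi_deployment(name, device_mode, node_selector, kubelet_dir=None):
    """Return a PmemCSIDeployment object as a dict."""
    # "lvm" appears in the example above; "direct" is assumed here.
    if device_mode not in ("lvm", "direct"):
        raise ValueError("unknown device mode: %r" % device_mode)
    spec = {"deviceMode": device_mode, "nodeSelector": dict(node_selector)}
    if kubelet_dir is not None:
        # Only needed when kubelet does not use /var/lib/kubelet.
        spec["kubeletDir"] = kubelet_dir
    return {
        "apiVersion": "pmem-csi.intel.com/v1beta1",
        "kind": "PmemCSIDeployment",
        "metadata": {"name": name},
        "spec": spec,
    }


if __name__ == "__main__":
    obj = pmem_csi_deployment(
        "pmem-csi.intel.com",
        "lvm",
        {"feature.node.kubernetes.io/memory-nv.dax": "true"},
    )
    print(json.dumps(obj, indent=2))
```

The JSON output can be passed to kubectl create -f - just like the YAML shown above.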
Once the above deployment installation is successful, we can see all the driver
pods in Running
state:
$ kubectl get pmemcsideployments
NAME DEVICEMODE NODESELECTOR IMAGE STATUS AGE
pmem-deployment lvm Running 50s
$ kubectl describe pmemcsideployment/pmem-csi.intel.com
Name: pmem-csi.intel.com
Namespace:
Labels: <none>
Annotations: <none>
API Version: pmem-csi.intel.com/v1beta1
Kind: PmemCSIDeployment
Metadata:
Creation Timestamp: 2020-10-07T07:31:58Z
Generation: 1
Managed Fields:
...
Resource Version: 1235740
Self Link: /apis/pmem-csi.intel.com/v1beta1/pmemcsideployments/pmem-csi.intel.com
UID: d8635490-53fa-4eec-970d-cd4c76f53b23
Spec:
Device Mode: lvm
Node Selector:
Storage: pmem
Status:
Conditions:
Last Update Time: 2020-10-07T07:32:00Z
Reason: Driver certificates are available.
Status: True
Type: CertsReady
Last Update Time: 2020-10-07T07:32:02Z
Reason: Driver deployed successfully.
Status: True
Type: DriverDeployed
Driver Components:
Component: Controller
Last Updated: 2020-10-08T07:45:13Z
Reason: 1 instance(s) of controller driver is running successfully
Status: Ready
Component: Node
Last Updated: 2020-10-08T07:45:11Z
Reason: All 3 node driver pod(s) running successfully
Status: Ready
Last Updated: 2020-10-07T07:32:21Z
Phase: Running
Reason: All driver components are deployed successfully
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NewDeployment 58s pmem-csi-operator Processing new driver deployment
Normal Running 39s pmem-csi-operator Driver deployment successful
$ kubectl get pod -n pmem-csi
NAME READY STATUS RESTARTS AGE
pmem-csi-intel-com-controller-79cd9f799d-rt54d 2/2 Running 0 51s
pmem-csi-intel-com-node-4x7cv 2/2 Running 0 50s
pmem-csi-intel-com-node-6grt6 2/2 Running 0 50s
pmem-csi-intel-com-node-msgds 2/2 Running 0 51s
pmem-csi-operator-749c7c7c69-k5k8n 1/1 Running 0 3m
- Get source code
PMEM-CSI uses Go modules and thus can be checked out and (if that should be desired) built anywhere in the filesystem. Pre-built container images are available and thus users don't need to build from source, but they may need some additional files for the following sections. To get the source code, use:
$ git clone https://github.com/intel/pmem-csi
- Choose a namespace
By default, setting up certificates as described in the next step will
automatically create a pmem-csi
namespace if it does not exist yet.
Later the driver will be installed in that namespace.
This can be changed by:
  - setting the TEST_DRIVER_NAMESPACE env variable to a different name when invoking setup-ca-kubernetes.sh and
  - modifying the deployment with kustomize as explained below.
- Set up certificates
Certificates are required as explained in Security for running the PMEM-CSI scheduler extender and webhook. If those are not used, then certificate creation can be skipped. However, the YAML deployment files always create the PMEM-CSI controller Deployment which needs the certificates. Without them, the pmem-csi-intel-com-controller pod cannot start, so it is recommended to create certificates or customize the deployment so that this Deployment is not created.
On OpenShift, certificates can be created automatically as described in https://docs.openshift.com/container-platform/4.6/security/certificates/service-serving-certificate.html. The PMEM-CSI operator uses that approach and therefore is a simpler way to install PMEM-CSI on OpenShift.
Certificates can be created by running the ./test/setup-ca-kubernetes.sh
script for your cluster.
This script requires "cfssl" tools which can be downloaded.
These are the steps for manual set-up of certificates:
- Download cfssl tools
$ curl -L https://pkg.cfssl.org/R1.2/cfssl_linux-amd64 -o _work/bin/cfssl --create-dirs
$ curl -L https://pkg.cfssl.org/R1.2/cfssljson_linux-amd64 -o _work/bin/cfssljson --create-dirs
$ chmod a+x _work/bin/cfssl _work/bin/cfssljson
- Run certificates set-up script
$ KUBECONFIG="<<your cluster kubeconfig path>>" PATH="$PWD/_work/bin:$PATH" ./test/setup-ca-kubernetes.sh
- Deploy the driver to Kubernetes
The deploy/kubernetes-<kubernetes version> directory contains pmem-csi*.yaml files which can be used to deploy the driver on that Kubernetes version. The files in the directory with the highest Kubernetes version might also work for more recent Kubernetes releases. All of these deployments use images published by Intel on Docker Hub.
For each Kubernetes version, four different deployment variants are provided:
- direct or lvm: one uses direct device mode, the other LVM device mode.
- testing: the variants with testing in the name enable debugging features and shouldn't be used in production.
For example, to deploy for production with LVM device mode onto Kubernetes 1.18, use:
$ kubectl create -f deploy/kubernetes-1.18/pmem-csi-lvm.yaml
The PMEM-CSI scheduler extender and webhook are not enabled in this basic installation. See below for instructions about that.
These variants were generated with kustomize. kubectl >= 1.14 includes some support for that. The sub-directories of deploy/kustomize-<kubernetes version> can be used as bases for kubectl kustomize. For example:
- Change namespace:
$ mkdir -p my-pmem-csi-deployment
$ cat >my-pmem-csi-deployment/kustomization.yaml <<EOF
namespace: pmem-driver
bases:
  - ../deploy/kubernetes-1.18/lvm
EOF
$ kubectl create --kustomize my-pmem-csi-deployment
- Configure how much PMEM is used by PMEM-CSI for LVM (see Namespace modes in LVM device mode):
$ mkdir -p my-pmem-csi-deployment
$ cat >my-pmem-csi-deployment/kustomization.yaml <<EOF
bases:
  - ../deploy/kubernetes-1.18/lvm
patchesJson6902:
  - target:
      group: apps
      version: v1
      kind: DaemonSet
      name: pmem-csi-node
      namespace: pmem-csi
    path: lvm-parameters-patch.yaml
EOF
$ cat >my-pmem-csi-deployment/lvm-parameters-patch.yaml <<EOF
# pmem-driver is in the container #0. Append arguments at the end.
- op: add
  path: /spec/template/spec/containers/0/args/-
  value: "-pmemPercentage=90"
EOF
$ kubectl create --kustomize my-pmem-csi-deployment
- Wait until all pods reach 'Running' status
$ kubectl get pods -n pmem-csi
NAME READY STATUS RESTARTS AGE
pmem-csi-intel-com-controller-79cd9f799d-rt54d 2/2 Running 0 3m15s
pmem-csi-intel-com-node-8kmxf 2/2 Running 0 3m15s
pmem-csi-intel-com-node-bvx7m 2/2 Running 0 3m15s
pmem-csi-intel-com-node-fbmpg 2/2 Running 0 3m15s
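The JSON patch used in the kustomize example above relies on the RFC 6902 "add" operation with a path that ends in "-", which appends to an array. The following sketch implements just that one operation to show the effect on the container arguments. It is not how kustomize is implemented; the function name and the existing argument "-v=3" in the sample are made up for illustration.

```python
import copy


def apply_add(document, path, value):
    """Apply a single RFC 6902 "add" operation and return the patched copy."""
    result = copy.deepcopy(document)
    # Split the JSON pointer and undo its escaping (~1 is "/", ~0 is "~").
    tokens = [t.replace("~1", "/").replace("~0", "~") for t in path.split("/")[1:]]
    target = result
    for token in tokens[:-1]:
        target = target[int(token)] if isinstance(target, list) else target[token]
    last = tokens[-1]
    if isinstance(target, list):
        if last == "-":
            target.append(value)  # "-" means: after the last element
        else:
            target.insert(int(last), value)
    else:
        target[last] = value
    return result


# Heavily shortened, hypothetical DaemonSet with one existing argument.
DAEMONSET = {
    "spec": {
        "template": {
            "spec": {"containers": [{"name": "pmem-driver", "args": ["-v=3"]}]}
        }
    }
}

if __name__ == "__main__":
    patched = apply_add(
        DAEMONSET, "/spec/template/spec/containers/0/args/-", "-pmemPercentage=90"
    )
    print(patched["spec"]["template"]["spec"]["containers"][0]["args"])
```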
Kubernetes cluster administrators must define some volume parameters like the filesystem type in storage classes. Users then reference those storage classes when requesting generic ephemeral inline or persistent volumes. The size of volumes can be chosen by users.
xfs and ext4 are supported filesystem types. In addition to the normal parameters defined by Kubernetes, PMEM-CSI supports the following custom parameters in a storage class:
key | meaning | optional | values |
---|---|---|---|
eraseAfter | Clear all data by overwriting with zeroes after use and before deleting the volume | Yes | true (default), false |
kataContainers | Prepare volume for use with DAX in Kata Containers. | Yes | false/0/f/FALSE (default), true/1/t/TRUE |
usage | Determine how a volume is going to be used. | Yes | AppDirect (default), FileIO |
By default, volumes are created for AppDirect enabled applications:
- The namespace mode is fsdax.
- Mount parameters include -o dax (= -o dax=always on newer kernels) which ensures that all files are automatically opened in DAX mode, i.e. reads and writes directly access the underlying PMEM.
This might not be ideal for traditional file IO because the page cache is bypassed, which may affect performance, and because applications have to be prepared to deal with partially written data sectors in case of crashes. When the goal is to run traditional applications, then usage=FileIO may be better:
- In direct mode, the namespace mode is sector.
- In LVM mode, the namespace mode is fsdax because currently PMEM-CSI doesn't support LVM on top of other namespaces.
- Mount parameters do not include -o dax.
kataContainers and usage=FileIO are mutually exclusive because the former is about making AppDirect available in Kata Containers. The normal volume passthrough can be used for usage=FileIO.
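The parameter rules above can be checked before a storage class is submitted. The following sketch is not the driver's own validation code; it only encodes the defaults, accepted values and the mutual exclusion as listed in the table and text above. The function names are hypothetical, and accepting the same boolean spellings for eraseAfter as for kataContainers is an assumption.

```python
TRUE_VALUES = {"true", "1", "t", "TRUE"}
FALSE_VALUES = {"false", "0", "f", "FALSE"}


def parse_bool(value, default):
    """Parse a storage class boolean, falling back to default when unset."""
    if value is None:
        return default
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError("not a boolean: %r" % value)


def check_parameters(params):
    """Validate PMEM-CSI storage class parameters and fill in the defaults."""
    erase_after = parse_bool(params.get("eraseAfter"), True)
    kata = parse_bool(params.get("kataContainers"), False)
    usage = params.get("usage", "AppDirect")
    if usage not in ("AppDirect", "FileIO"):
        raise ValueError("unknown usage: %r" % usage)
    if kata and usage == "FileIO":
        raise ValueError("kataContainers and usage=FileIO are mutually exclusive")
    return {"eraseAfter": erase_after, "kataContainers": kata, "usage": usage}


if __name__ == "__main__":
    print(check_parameters({}))
    print(check_parameters({"usage": "FileIO", "eraseAfter": "false"}))
```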
This section uses files from the common example directory. It is not necessary to check out the repository to use them.
Create a storage class with late binding, the recommended mode:
$ kubectl apply -f https://github.com/intel/pmem-csi/raw/devel/deploy/common/pmem-storageclass-late-binding.yaml
storageclass.storage.k8s.io/pmem-csi-sc-late-binding created
Then request a volume which uses that class:
$ kubectl apply -f https://github.com/intel/pmem-csi/raw/devel/deploy/common/pmem-pvc-late-binding.yaml
persistentvolumeclaim/pmem-csi-pvc-late-binding created
At this point, the volume is not yet getting created because of the late binding mode:
$ kubectl describe pvc/pmem-csi-pvc-late-binding
Name: pmem-csi-pvc-late-binding
Namespace: default
StorageClass: pmem-csi-sc-late-binding
Status: Pending
Volume:
Labels: <none>
Annotations: <none>
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode: Filesystem
Used By: <none>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal WaitForFirstConsumer 0s (x2 over 14s) persistentvolume-controller waiting for first consumer to be created before binding
The volume gets created once the first Pod starts to use it, on a node that is suitable for that Pod:
$ kubectl apply -f https://github.com/intel/pmem-csi/raw/devel/deploy/common/pmem-app-late-binding.yaml
pod/my-csi-app created
After a short while, the volume is created and the pod can run:
$ kubectl get pvc,pods -o wide
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/pmem-csi-pvc-late-binding Bound pvc-ade8dc48-a4c0-4f30-b479-84460a3e0591 4Gi RWO pmem-csi-sc-late-binding 55s
NAME READY STATUS RESTARTS AGE
pod/my-csi-app 1/1 Running 0 47s
The volume was mounted with dax=always, therefore all file operations and memory regions mapped from that volume into the address space of an application directly access the underlying PMEM:
$ kubectl exec my-csi-app -- df /data
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/ndbus0region0fsdax/pvc-7d-83241976933418f96748a1c18d500c6cba91c1dfaa87145b7893569c 4062912 16376 3820440 1% /data
$ kubectl exec my-csi-app -- mount |grep /data
/dev/ndbus0region0fsdax/pvc-7d-83241976933418f96748a1c18d500c6cba91c1dfaa87145b7893569c on /data type ext4 (rw,relatime,seclabel,dax=always)
A few things can go wrong when trying out the previous example.
Failures of the PMEM-CSI driver itself show up in kubectl get pods --all-namespaces as failed Pods and can be investigated with kubectl describe --namespace <driver namespace> pods/<pod name> and kubectl logs --namespace <driver namespace> <pod name> pmem-driver or the logs of one of the other containers in that Pod.
When using deployment files from the devel branch, the corresponding container canary image might not have been published yet. Better use the latest stable release.
Whether the driver pods run on the expected nodes can be checked with kubectl get pods --all-namespaces -o wide. Have nodes been labeled as expected by the driver deployment? Check with kubectl get nodes -o yaml.
The scheduler may have selected a node without PMEM for the Pod. This can happen on clusters where only some worker nodes have PMEM and the PMEM-CSI scheduler extensions are not enabled. It can be checked by looking at the selected-node annotation of the PVC:
$ kubectl get pvc/pmem-csi-pvc-late-binding -o yaml | grep ' volume.kubernetes.io/selected-node:'
volume.kubernetes.io/selected-node: pmem-csi-pmem-govm-worker2
The PMEM-CSI controller pod will detect this and ask the scheduler to pick a node anew by removing that annotation, but it is random whether the next choice is better and starting the Pod may get delayed.
To avoid this, enable the scheduler extensions.
The selected node may also have too little PMEM left for the volume. This, too, can only happen when the PMEM-CSI scheduler extensions are not enabled. Then volume creation is attempted repeatedly, potentially on different nodes, but fails with not enough space errors:
$ kubectl describe pvc/pmem-csi-pvc-late-binding
Name: pmem-csi-pvc-late-binding
Namespace: default
StorageClass: pmem-csi-sc-late-binding
Status: Pending
Volume:
Labels: <none>
Annotations: volume.beta.kubernetes.io/storage-provisioner: pmem-csi.intel.com
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode: Filesystem
Used By: my-csi-app
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal WaitForFirstConsumer 7m30s persistentvolume-controller waiting for first consumer to be created before binding
Normal WaitForPodScheduled 6m (x15 over 7m19s) persistentvolume-controller waiting for pod my-csi-app to be scheduled
Warning ProvisioningFailed 3m59s (x12 over 7m19s) pmem-csi.intel.com_pmem-csi-intel-com-node-nwkqv_cc2984e6-915f-4cf2-93a0-e143da407917 failed to provision volume with StorageClass "pmem-csi-sc-late-binding": rpc error: code = ResourceExhausted desc = Node CreateVolume: device creation failed: not enough space
Warning ProvisioningFailed 2m47s (x12 over 7m18s) pmem-csi.intel.com_pmem-csi-intel-com-node-9vlhf_6ac47898-58bf-45e1-b601-5d8f39d21f4e failed to provision volume with StorageClass "pmem-csi-sc-late-binding": rpc error: code = ResourceExhausted desc = Node CreateVolume: device creation failed: not enough space
Normal ExternalProvisioning 2m23s (x28 over 7m19s) persistentvolume-controller waiting for a volume to be created, either by external provisioner "pmem-csi.intel.com" or manually created by system administrator
Normal Provisioning 2m11s (x14 over 7m18s) pmem-csi.intel.com_pmem-csi-intel-com-node-9vlhf_6ac47898-58bf-45e1-b601-5d8f39d21f4e External provisioner is provisioning volume for claim "default/pmem-csi-pvc-late-binding"
Normal Provisioning 107s (x16 over 7m19s) pmem-csi.intel.com_pmem-csi-intel-com-node-nwkqv_cc2984e6-915f-4cf2-93a0-e143da407917 External provisioner is provisioning volume for claim "default/pmem-csi-pvc-late-binding"
The scheduler extensions prevent these useless attempts on nodes with insufficient PMEM. When none of the available nodes have sufficient PMEM, the attempt to schedule the example Pod fails:
$ kubectl describe pod/my-csi-app
Name: my-csi-app
Namespace: default
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 12s (x2 over 12s) default-scheduler 0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 only 63484MiB of PMEM available, need 400GiB.
This is usually the result of not preparing the node(s) as described in persistent memory pre-provisioning.
One way of checking this is to look at the logs of the PMEM-CSI driver on a node. In this case, region0 was completely unused and the driver was configured to use 50% of that for an LVM volume group:
$ kubectl get pods --all-namespaces -l app.kubernetes.io/name=pmem-csi-node -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pmem-csi pmem-csi-intel-com-node-d2mfh 3/3 Running 0 75s 192.168.200.66 pmem-csi-pmem-govm-worker3 <none> <none>
pmem-csi pmem-csi-intel-com-node-jkbgz 3/3 Running 0 75s 192.168.133.134 pmem-csi-pmem-govm-worker1 <none> <none>
pmem-csi pmem-csi-intel-com-node-th56d 3/3 Running 0 75s 192.168.220.67 pmem-csi-pmem-govm-worker2 <none> <none>
$ kubectl logs -n pmem-csi pmem-csi-intel-com-node-jkbgz pmem-driver
I0623 07:15:18.710690 1 main.go:73] "PMEM-CSI started." version="v0.9.0-188-gd451ec6f3-dirty"
I0623 07:15:18.711645 1 pmd-lvm.go:328] "LVM-New/setupNS: Checking region for fsdax namespaces" region="region0" percentage=50 size="64Gi" available="64Gi" max-available-extent="64Gi" may-use="32Gi"
I0623 07:15:18.712251 1 pmd-lvm.go:361] "LVM-New/setupNS: Create fsdax namespace" size="32Gi"
I0623 07:15:19.041186 1 region.go:282] "LVM-New/setupNS/CreateNamespace: Namespace created" region="region0" namespace="namespace0.1" usable-size="32254Mi" raw-size="32Gi" uuid="c3e6fe52-d3f2-11eb-b33e-c2b1549139a7"
I0623 07:15:19.079791 1 pmd-lvm.go:422] "LVM-New/setupVG/setupVGForNamespace: Creating new volume group" vg="ndbus0region0fsdax"
I0623 07:15:19.130041 1 mount_linux.go:163] Detected OS without systemd
I0623 07:15:19.130661 1 server.go:54] "GRPC Server: Listening for connections" endpoint="unix:///csi/csi.sock"
I0623 07:15:19.180760 1 pmem-csi-driver.go:305] "PMEM-CSI ready." capacity="32252Mi maximum volume size, 32252Mi available, 32252Mi managed, 64Gi total"
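The numbers in this log can be reproduced with simple arithmetic. The sketch below computes the share of a region that the configured percentage allows and the difference between the raw and the usable namespace size, using only values taken from the log above. The function name is hypothetical, and any alignment rules that the driver may apply are ignored.

```python
GI = 1024 ** 3
MI = 1024 ** 2


def may_use(region_size, percentage):
    """Bytes of a region that may be used for the given percentage."""
    return region_size * percentage // 100


if __name__ == "__main__":
    # Values from the log: size="64Gi", percentage=50.
    allowed = may_use(64 * GI, 50)
    print("may use:", allowed // GI, "Gi")
    # raw-size="32Gi" versus usable-size="32254Mi" for the new namespace.
    overhead_mi = allowed // MI - 32254
    print("namespace overhead:", overhead_mi, "Mi")
```

The gap between raw and usable size is space that the fsdax namespace reserves for its own metadata, which is why the driver reports slightly less than 32Gi as available.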
In a production environment, the metrics support could be used to monitor available PMEM per node.
The expectation is that the scripts which bring up nodes can be adapted to prepare the PMEM for usage by PMEM-CSI as explained earlier. But this might not always be easy.
For the case of converting an existing "raw" namespace to "fsdax" mode there is a possibility to do the conversion through a deployed PMEM-CSI driver:
- Install PMEM-CSI in LVM mode without preparing nodes. At this point only the central controller pod will run.
- For each node that has one or more raw namespaces that all need to be converted, set the <driver name>/convert-raw-namespaces label (usually pmem-csi.intel.com/convert-raw-namespaces) to force.
- This will cause the pods of the pmem-csi-intel-com-node-setup DaemonSet to run on those nodes. Those pods then will convert the namespaces, create the LVM volume group, remove the convert-raw-namespaces label (i.e. the pods will only run once) and add the normal label that enables the PMEM-CSI node driver pods to run.
- The normal node driver pods start up and then are ready to provision volumes.
WARNING: the raw namespaces will be converted even when they are active. If data was stored on them, it will be lost after the conversion.
The output of a successful conversion will look like this:
I0623 07:32:52.773207 1 main.go:73] "PMEM-CSI started." version="v0.9.0-188-gd451ec6f3-dirty"
I0623 07:32:52.774386 1 convert.go:79] "ForceConvertRawNamespaces/convert: checking for namespaces"
I0623 07:32:52.774871 1 convert.go:81] "ForceConvertRawNamespaces/convert: checking" bus="{\"dev\":\"ndbus0\",\"dimms\":[{}],\"provider\":\"ACPI.NFIT\",\"regions\":[{}]}"
I0623 07:32:52.775316 1 convert.go:83] "ForceConvertRawNamespaces/convert: checking" region="{\"available_size\":0,\"dev\":\"region0\",\"mappings\":[{}],\"max_available_extent\":0,\"namespaces\":[{}],\"size\":68719476736,\"type\":\"pmem\"}"
I0623 07:32:52.775444 1 convert.go:90] "ForceConvertRawNamespaces/convert: checking" namespace="{\"blockdev\":\"pmem0\",\"dev\":\"namespace0.0\",\"enabled\":true,\"id\":0,\"mode\":\"raw\",\"name\":\"\",\"size\":68719476736,\"uuid\":\"1711a2a0-358d-4b14-a43c-8efa1a9f7154\"}"
I0623 07:32:52.775550 1 convert.go:99] "ForceConvertRawNamespaces/convert: converting raw namespace" namespace="{\"blockdev\":\"pmem0\",\"dev\":\"namespace0.0\",\"enabled\":true,\"id\":0,\"mode\":\"raw\",\"name\":\"\",\"size\":68719476736,\"uuid\":\"1711a2a0-358d-4b14-a43c-8efa1a9f7154\"}"
I0623 07:32:53.397897 1 convert.go:127] "ForceConvertRawNamespaces/convert: setting up volume group" namespace="{\"blockdev\":\"pmem0\",\"dev\":\"namespace0.0\",\"enabled\":false,\"id\":0,\"mode\":\"fsdax\",\"name\":\"\",\"size\":18446744073709551615,\"uuid\":\"00000000-0000-0000-0000-000000000000\"}" vg="ndbus0region0fsdax"
I0623 07:32:53.434094 1 pmd-lvm.go:422] "ForceConvertRawNamespaces/convert/setupVGForNamespace: Creating new volume group" vg="ndbus0region0fsdax"
I0623 07:32:53.457108 1 convert.go:133] "ForceConvertRawNamespaces/convert: converted to fsdax namespace" namespace="{\"blockdev\":\"pmem0\",\"dev\":\"namespace0.0\",\"enabled\":false,\"id\":0,\"mode\":\"fsdax\",\"name\":\"\",\"size\":18446744073709551615,\"uuid\":\"00000000-0000-0000-0000-000000000000\"}" vg="ndbus0region0fsdax"
I0623 07:32:53.457148 1 convert.go:75] "ForceConvertRawNamespaces/convert: successful" converted=1
I0623 07:32:53.479512 1 convert.go:172] "ForceConvertRawNamespaces/havePMEM: Volume group will be used by PMEM-CSI in LVM mode" vg="ndbus0region0fsdax"
I0623 07:32:53.523412 1 convert.go:200] "ForceConvertRawNamespaces/relabel: Change node labels" node="pmem-csi-pmem-govm-master" patch="{\"metadata\":{\"labels\":{\"pmem-csi.intel.com/convert-raw-namespaces\": null, \"feature.node.kubernetes.io/memory-nv.dax\": \"true\"}}}"
I0623 07:32:53.523605 1 pmem-csi-driver.go:326] "Raw namespace conversion is done, waiting for termination signal."
I0623 07:33:03.954098 1 pmem-csi-driver.go:344] "Caught signal, terminating." signal="terminated"
I0623 07:33:05.016426 1 main.go:93] "PMEM-CSI stopped."
It terminates once Kubernetes notices that the pod is no longer needed. This usually happens quickly, so a log monitoring solution may be needed to see this output because kubectl logs does not work for pods that were already deleted.
The DaemonSet contains some information which is available longer:
$ kubectl describe daemonsets/pmem-csi-intel-com-node-setup
Name: pmem-csi-intel-com-node-setup
Selector: app.kubernetes.io/instance=pmem-csi.intel.com,app.kubernetes.io/name=pmem-csi-node-setup,pmem-csi.intel.com/deployment=lvm-production
Node-Selector: pmem-csi.intel.com/convert-raw-namespaces=force
Labels: app.kubernetes.io/component=node-setup
app.kubernetes.io/instance=pmem-csi.intel.com
app.kubernetes.io/name=pmem-csi-node-setup
app.kubernetes.io/part-of=pmem-csi
pmem-csi.intel.com/deployment=lvm-production
Annotations: deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Scheduled with Up-to-date Pods: 0
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status: 0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app.kubernetes.io/component=node-setup
app.kubernetes.io/instance=pmem-csi.intel.com
app.kubernetes.io/name=pmem-csi-node-setup
app.kubernetes.io/part-of=pmem-csi
pmem-csi.intel.com/deployment=lvm-production
pmem-csi.intel.com/webhook=ignore
Service Account: pmem-csi-intel-com-node-setup
Containers:
pmem-driver:
Image: 172.17.42.1:5001/pmem-csi-driver:canary
Port: <none>
Host Port: <none>
Command:
/usr/local/bin/pmem-csi-driver
-v=3
-logging-format=text
-mode=force-convert-raw-namespaces
-nodeSelector={"storage":"pmem"}
-nodeid=$(KUBE_NODE_NAME)
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulCreate 5m47s daemonset-controller Created pod: pmem-csi-intel-com-node-setup-fr9b8
Normal SuccessfulDelete 5m45s daemonset-controller Deleted pod: pmem-csi-intel-com-node-setup-fr9b8
If conversion fails, the pod will exit with an error and then get restarted automatically by Kubernetes to retry the conversion until it succeeds.
It is considered a user error if conversion is requested for a node which has nothing to convert. To make that obvious, the pod will print an error and then exit with an error. That way, the pod continues to exist and the log can be inspected to identify the problem.
Kata Containers support gets enabled via the kataContainers storage class parameter. PMEM-CSI then creates a filesystem inside a partition inside a file. When such a volume is used inside Kata Containers, the Kata Containers runtime makes sure that the filesystem is mounted on an emulated NVDIMM device with full DAX support.
On the host, PMEM-CSI will try to mount through a loop device with -o dax but proceed without -o dax when the kernel does not support that. Currently, Linux up to and including 5.4 does not support it, and it is unclear when that support will be added. In other words, on the host such volumes are usable, but only without DAX.
When disabled, volumes support DAX on the host and are usable without DAX inside Kata Containers.
Raw block volumes are only supported with kataContainers: false. Attempts to create them with kataContainers: true are rejected.
At the moment (= Kata Containers 1.11.0-rc0), only Kata Containers with QEMU enable the special support for such volumes. Without QEMU or with older releases of Kata Containers, the volume is still usable through the normal remote filesystem support (9p or virtio-fs). Support for Cloud Hypervisor is in progress.
With Kata Containers for QEMU, the VM must be configured appropriately to allow adding the PMEM volumes to its address space. This can be done globally by setting the memory_offset in the configuration-qemu.toml file or per-pod by setting the io.katacontainers.config.hypervisor.memory_offset annotation in the pod metadata. In both cases, the value has to be large enough for all PMEM volumes used by the pod, otherwise pod creation will fail with an error similar to this:
Error: container create failed: QMP command failed: not enough space, currently 0x8000000 in use of total space for memory devices 0x3c100000
Note:
- The offset is currently (= Kata Containers 2.1.0) limited to 32 bit, which implies that volumes cannot be larger than 4GiB. An enhancement request for Kata Containers is pending.
- A newer version is also needed for a fix of issue #2018.
- kata-deploy, at least in Kata Containers 2.1.0, does not enable the memory_offset annotation, leading to "failed to create containerd task: annotation io.katacontainers.config.hypervisor.memory_offset is not enabled" errors.
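The size limit in the note above can be checked with a little arithmetic. The following is a minimal sketch, not part of PMEM-CSI: the helper names are hypothetical, and it assumes that the required offset is simply the sum of the sizes of all PMEM volumes used by the pod and that the 32-bit limit means the value must fit into an unsigned 32-bit integer.

```python
GIB = 1024 ** 3

# Assumption: "limited to 32 bit" means an unsigned 32-bit value.
MAX_MEMORY_OFFSET = 2 ** 32 - 1


def required_memory_offset(volume_sizes):
    """Return the offset needed for a pod, assuming it is the sum of all volume sizes."""
    return sum(volume_sizes)


def fits_32bit_offset(volume_sizes):
    """Check whether the required offset can be expressed with 32 bits."""
    return required_memory_offset(volume_sizes) <= MAX_MEMORY_OFFSET


if __name__ == "__main__":
    # Two small volumes fit, a combination beyond 4GiB does not.
    print(fits_32bit_offset([1 * GIB, 2 * GIB]))
    print(fits_32bit_offset([3 * GIB, 2 * GIB]))
```

The resulting number would be used as the value for memory_offset, either in configuration-qemu.toml or in the pod annotation.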
The examples for usage of Kata Containers with
ephemeral and
persistent volumes use the pod
label. They assume that the kata-qemu
runtime class is
installed.
For the QEMU test cluster,
setup-kata-containers.sh
can be
used to install Kata Containers. However, this currently only works on
Clear Linux because on Fedora, the Docker container runtime is used
and Kata Containers does not support that one.
This is the original implementation of ephemeral inline volumes for CSI drivers in Kubernetes. It is currently available as a beta feature in Kubernetes.
Volume requests embedded in the pod spec with the csi field are provisioned as ephemeral volumes. The volume request can use the fields below as volumeAttributes:
key | meaning | optional | values |
---|---|---|---|
size | Size of the requested ephemeral volume as Kubernetes memory string ("1Mi" = 1024*1024 bytes, "1e3K" = 1000000 bytes) | No | |
eraseAfter | Clear all data by overwriting with zeroes after use and before deleting the volume | Yes | true (default), false |
kataContainers | Prepare volume for use in Kata Containers. | Yes | false/0/f/FALSE (default), true/1/t/TRUE |
Try out ephemeral volume usage with the provided example application.
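To illustrate how the attributes from the table fit into a pod spec, here is a minimal sketch that builds the volume entry as a Python dictionary. The function names are hypothetical. The accepted boolean spellings are an assumption: they mirror what Go's strconv.ParseBool accepts, which matches the values listed in the table, but the driver may be stricter or more lenient.

```python
# Assumption: boolean attributes follow the spellings accepted by Go's strconv.ParseBool.
TRUE_VALUES = {"1", "t", "T", "TRUE", "true", "True"}
FALSE_VALUES = {"0", "f", "F", "FALSE", "false", "False"}


def parse_bool(value):
    """Translate a boolean attribute string into a Python bool."""
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"not a valid boolean attribute value: {value!r}")


def ephemeral_volume(name, size, erase_after="true", kata_containers="false",
                     driver="pmem-csi.intel.com"):
    """Build one entry for the 'volumes' list of a pod spec."""
    if not size:
        raise ValueError("size is the only mandatory attribute")
    # Validate early instead of waiting for the driver to reject the request.
    parse_bool(erase_after)
    parse_bool(kata_containers)
    return {
        "name": name,
        "csi": {
            "driver": driver,
            # volumeAttributes is a string-to-string map in the Kubernetes API.
            "volumeAttributes": {
                "size": size,
                "eraseAfter": erase_after,
                "kataContainers": kata_containers,
            },
        },
    }


if __name__ == "__main__":
    print(ephemeral_volume("my-csi-volume", "1Mi"))
```

The driver name must match the name of the PMEM-CSI installation that is supposed to provide the volume.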
This approach was introduced in Kubernetes 1.19 with the goal of using such volumes for PMEM-CSI instead of the older approach. In contrast to CSI ephemeral inline volumes, no changes are needed in CSI drivers, so PMEM-CSI already fully supports this if the cluster has the feature enabled. See pmem-app-generic-ephemeral.yaml for an example.
When using generic ephemeral inline volumes together with storage capacity tracking, the PMEM-CSI scheduler extensions are not needed anymore.
Applications can use volumes provisioned by PMEM-CSI as raw block devices. Such volumes use the same "fsdax" namespace mode as filesystem volumes and therefore are block devices. That mode only supports dax (= mmap(MAP_SYNC)) through a filesystem. Pages mapped on the raw block device go through the Linux page cache. Applications have to format and mount the raw block volume themselves if they want dax. The advantage then is that they have full control over that part.
For provisioning a PMEM volume as raw block device, one has to create a PersistentVolumeClaim with volumeMode: Block. See example PVC and application for usage reference.
That example demonstrates how to handle some details:
- mkfs.ext4 needs -b 4096 to produce volumes that support dax; without it, the automatic block size detection may end up choosing an unsuitable value depending on the volume size.
- Kubernetes bug #85624 must be worked around to format and mount the raw block device.
NOTE: this section provides an in-depth explanation that makes no assumptions about how the cluster works. For simpler install instructions on OpenShift see below.
The PMEM-CSI scheduler extender and admission webhook are provided by the PMEM-CSI controller. They need to be enabled during deployment via the --schedulerListen=[<listen address>]:<port> parameter. The listen address is optional and can be left out. The port is where an HTTPS server will run. The YAML files already enable this. The operator has the controllerTLSSecret and mutatePods properties in the DeploymentSpec.
The controller needs TLS certificates which must be created in advance. The YAML files expect them in a secret called pmem-csi-intel-com-controller-secret and will not work without one. The operator is more flexible and creates a driver without the controller by default. This can be changed by setting the controllerTLSSecret field in the PmemCSIDeployment API.
That secret must contain the following data items:
- ca.crt: root CA certificate
- tls.key: secret key of the webhook
- tls.crt: public key of the webhook
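The layout of such a secret can be sketched as follows. This is only an illustration under assumptions: the function name is hypothetical, the pmem-csi namespace is taken from the examples in this document, and the certificate material is expected to exist already as PEM-encoded bytes. Kubernetes stores the values of the data field base64-encoded.

```python
import base64


def controller_secret(ca_crt, tls_key, tls_crt,
                      name="pmem-csi-intel-com-controller-secret",
                      namespace="pmem-csi"):
    """Build a Secret manifest with the three data items expected by the controller."""

    def encode(data):
        # Secret data values must be base64-encoded strings.
        return base64.b64encode(data).decode("ascii")

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "data": {
            "ca.crt": encode(ca_crt),
            "tls.key": encode(tls_key),
            "tls.crt": encode(tls_crt),
        },
    }


if __name__ == "__main__":
    import json

    # Placeholder bytes instead of real PEM files.
    manifest = controller_secret(b"<ca>", b"<key>", b"<cert>")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be passed to kubectl create -f because kubectl accepts JSON manifests as well as YAML.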
The webhook certificate must include host names that match how the
webhooks are going to be called by the kube-apiserver
(i.e. pmem-csi-intel-com-scheduler.pmem-csi.svc
for a
deployment with the pmem-csi.intel.com
driver name in the pmem-csi
namespace) and by the kube-scheduler
(might be the same service name, through some external
load balancer or 127.0.0.1
when using the node port workaround
described below).
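When preparing the certificate, it can help to list the host names up front. The sketch below derives the service host name from the driver name and namespace. The derivation rule (dots in the driver name replaced by dashes, followed by -scheduler) is an assumption inferred from the single example given above, and both helper names are hypothetical.

```python
def scheduler_service_host(driver_name, namespace):
    """Return the in-cluster host name of the scheduler service.

    Assumption: the service name is the driver name with dots replaced
    by dashes plus the "-scheduler" suffix.
    """
    service = driver_name.replace(".", "-") + "-scheduler"
    return f"{service}.{namespace}.svc"


def certificate_names(driver_name, namespace, use_node_port=False):
    """List the names and addresses that the webhook certificate should cover."""
    names = [scheduler_service_host(driver_name, namespace)]
    if use_node_port:
        # With the node port workaround, kube-scheduler connects via localhost.
        names.append("127.0.0.1")
    return names


if __name__ == "__main__":
    print(certificate_names("pmem-csi.intel.com", "pmem-csi", use_node_port=True))
```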
To enable the PMEM-CSI scheduler extender, a configuration file and an
additional --config
parameter for kube-scheduler
must be added to
the cluster control plane, or, if there is already such a
configuration file, one new entry must be added to the extenders
array. A full example is presented below.
The kube-scheduler
must be able to connect to the PMEM-CSI
controller via the urlPrefix
in its configuration. In some clusters
it is possible to use cluster DNS and thus a symbolic service name. If
that is the case, then deploy the scheduler
service as-is
and use https://pmem-csi-scheduler.default.svc
as urlPrefix
. If
the PMEM-CSI driver is deployed in a namespace, replace default
with
the name of that namespace.
In a cluster created with kubeadm, kube-scheduler
is unable to use
cluster DNS because the pod it runs in is configured with
hostNetwork: true
and without dnsPolicy
. Therefore the cluster DNS
servers are ignored. There also is no special dialer as in other
clusters. As a workaround, the PMEM-CSI service can be exposed via a
fixed node port like 32000 on all nodes. Then
https://127.0.0.1:32000
needs to be used as urlPrefix
. Here's how
the service can be created with that node port:
$ mkdir my-scheduler
$ cat >my-scheduler/kustomization.yaml <<EOF
bases:
  - ../deploy/kustomize/scheduler
patchesJson6902:
  - target:
      version: v1
      kind: Service
      name: pmem-csi-intel-com-scheduler
      namespace: pmem-csi
    path: node-port-patch.yaml
EOF
$ cat >my-scheduler/node-port-patch.yaml <<EOF
- op: add
  path: /spec/ports/0/nodePort
  value: 32000
- op: add
  path: /spec/type
  value: NodePort
EOF
$ kubectl create --kustomize my-scheduler
When the node port is not needed, the scheduler service can be created directly with:
kubectl create --kustomize deploy/kustomize/scheduler
How to (re)configure kube-scheduler
depends on the cluster. With
kubeadm it is possible to set all necessary options in advance before
creating the master node with kubeadm init
. A running cluster can
be modified with kubeadm upgrade
.
One additional
complication with kubeadm is that kube-scheduler
by default doesn't
trust any root CA. The following kubeadm config file solves
this together with enabling the scheduler configuration by
bind-mounting the root certificate that was used to sign the certificate used
by the scheduler extender into the location where the Go
runtime will find it. It works for Kubernetes <= 1.18:
$ sudo mkdir -p /var/lib/scheduler/
$ sudo cp _work/pmem-ca/ca.pem /var/lib/scheduler/ca.crt
# https://github.com/kubernetes/kubernetes/blob/52d7614a8ca5b8aebc45333b6dc8fbf86a5e7ddf/staging/src/k8s.io/kube-scheduler/config/v1alpha1/types.go#L38-L107
$ sudo sh -c 'cat >/var/lib/scheduler/scheduler-policy.cfg' <<EOF
{
"kind" : "Policy",
"apiVersion" : "v1",
"extenders" :
[{
"urlPrefix": "https://<service name or IP>:<port>",
"filterVerb": "filter",
"prioritizeVerb": "prioritize",
"nodeCacheCapable": true,
"weight": 1,
"managedResources":
[{
"name": "pmem-csi.intel.com/scheduler",
"ignoredByScheduler": true
}]
}]
}
EOF
# https://github.com/kubernetes/kubernetes/blob/52d7614a8ca5b8aebc45333b6dc8fbf86a5e7ddf/staging/src/k8s.io/kube-scheduler/config/v1alpha1/types.go#L38-L107
$ sudo sh -c 'cat >/var/lib/scheduler/scheduler-config.yaml' <<EOF
apiVersion: kubescheduler.config.k8s.io/v1alpha1
kind: KubeSchedulerConfiguration
schedulerName: default-scheduler
algorithmSource:
  policy:
    file:
      path: /var/lib/scheduler/scheduler-policy.cfg
clientConnection:
  # This is where kubeadm puts it.
  kubeconfig: /etc/kubernetes/scheduler.conf
EOF
$ cat >kubeadm.config <<EOF
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
scheduler:
  extraVolumes:
    - name: config
      hostPath: /var/lib/scheduler
      mountPath: /var/lib/scheduler
      readOnly: true
    - name: cluster-root-ca
      hostPath: /var/lib/scheduler/ca.crt
      mountPath: /etc/ssl/certs/ca.crt
      readOnly: true
  extraArgs:
    config: /var/lib/scheduler/scheduler-config.yaml
EOF
$ kubeadm init --config=kubeadm.config
In Kubernetes 1.19, the configuration API of the scheduler changed. The corresponding commands for Kubernetes >= 1.19 are:
$ sudo mkdir -p /var/lib/scheduler/
$ sudo cp _work/pmem-ca/ca.pem /var/lib/scheduler/ca.crt
# https://github.com/kubernetes/kubernetes/blob/1afc53514032a44d091ae4a9f6e092171db9fe10/staging/src/k8s.io/kube-scheduler/config/v1beta1/types.go#L44-L96
$ sudo sh -c 'cat >/var/lib/scheduler/scheduler-config.yaml' <<EOF
apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
clientConnection:
  # This is where kubeadm puts it.
  kubeconfig: /etc/kubernetes/scheduler.conf
extenders:
  - urlPrefix: https://<service name or IP>:<port>
    filterVerb: filter
    prioritizeVerb: prioritize
    nodeCacheCapable: true
    weight: 1
    managedResources:
      - name: pmem-csi.intel.com/scheduler
        ignoredByScheduler: true
EOF
$ cat >kubeadm.config <<EOF
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
scheduler:
  extraVolumes:
    - name: config
      hostPath: /var/lib/scheduler
      mountPath: /var/lib/scheduler
      readOnly: true
    - name: cluster-root-ca
      hostPath: /var/lib/scheduler/ca.crt
      mountPath: /etc/ssl/certs/ca.crt
      readOnly: true
  extraArgs:
    config: /var/lib/scheduler/scheduler-config.yaml
EOF
$ kubeadm init --config=kubeadm.config
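Because the scheduler configuration is nested, generating it from a script avoids indentation mistakes. The following sketch produces the same structure as the scheduler-config.yaml for Kubernetes >= 1.19 shown above. The function name and the example URL (the node port workaround with port 32000) are illustrative assumptions. The output is JSON, which is also valid YAML.

```python
import json


def scheduler_config(url_prefix, kubeconfig="/etc/kubernetes/scheduler.conf",
                     resource="pmem-csi.intel.com/scheduler"):
    """Build a KubeSchedulerConfiguration with one extender entry for PMEM-CSI."""
    return {
        "apiVersion": "kubescheduler.config.k8s.io/v1beta1",
        "kind": "KubeSchedulerConfiguration",
        "clientConnection": {"kubeconfig": kubeconfig},
        "extenders": [
            {
                "urlPrefix": url_prefix,
                "filterVerb": "filter",
                "prioritizeVerb": "prioritize",
                "nodeCacheCapable": True,
                "weight": 1,
                "managedResources": [
                    {"name": resource, "ignoredByScheduler": True},
                ],
            },
        ],
    }


if __name__ == "__main__":
    # Example value for the node port workaround described earlier.
    print(json.dumps(scheduler_config("https://127.0.0.1:32000"), indent=2))
```

If a configuration file exists already, only the entry in the extenders list needs to be merged into it.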
It is possible to stop here without enabling the pod admission webhook. To enable that as well, continue as follows.
First of all, it is recommended to exclude all system pods from passing through the webhook. This ensures that they can still be created even when PMEM-CSI is down:
$ kubectl label ns kube-system pmem-csi.intel.com/webhook=ignore
This special label is configured in the provided webhook definition. On Kubernetes >= 1.15, it can also be used to let individual pods bypass the webhook by adding that label. The CA gets configured explicitly, which is supported for webhooks.
$ mkdir my-webhook
$ cat >my-webhook/kustomization.yaml <<EOF
bases:
  - ../deploy/kustomize/webhook
patchesJson6902:
  - target:
      group: admissionregistration.k8s.io
      version: v1
      kind: MutatingWebhookConfiguration
      name: pmem-csi-intel-com-hook
    path: webhook-patch.yaml
EOF
$ cat >my-webhook/webhook-patch.yaml <<EOF
- op: replace
  path: /webhooks/0/clientConfig/caBundle
  value: $(base64 -w 0 _work/pmem-ca/ca.pem)
EOF
$ kubectl create --kustomize my-webhook
NOTE: The scheduler extensions are only needed on OpenShift 4.6 and 4.7. On OpenShift 4.8, storage capacity tracking can and should be used instead.
The operator should be used on OpenShift. When creating the deployment, set controllerTLSSecret to the special string -openshift-:
$ kubectl create -f - <<EOF
apiVersion: pmem-csi.intel.com/v1beta1
kind: PmemCSIDeployment
metadata:
  name: pmem-csi.intel.com
spec:
  deviceMode: lvm
  nodeSelector:
    feature.node.kubernetes.io/memory-nv.dax: "true"
  controllerTLSSecret: -openshift-
EOF
The webhook and the API server then get configured by the operator with certificates created automatically by OpenShift.
The scheduler must be configured manually, using the same API as for configuring scheduler policies. This can be done before or after deploying the PMEM-CSI driver. The configuration change can be left in place after removing PMEM-CSI because it will then be ignored. However, without this step, pods that use PMEM-CSI volumes will not get scheduled.
Communication between the kube-scheduler and PMEM-CSI will be done via HTTP and a service that listens on a dynamically allocated host port. This approach is necessary because:
- kube-scheduler uses the host network and thus cannot connect to a service that is only available inside the cluster, and
- there is no API for configuring TLS certificates.
First, define the service inside the namespace where the PMEM-CSI operator runs:
oc apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: pmem-csi-intel-com-http-scheduler
  namespace: pmem-csi
spec:
  selector:
    app.kubernetes.io/name: pmem-csi-controller
    app.kubernetes.io/instance: pmem-csi.intel.com # This must be the name of the PMEM-CSI deployment.
  type: NodePort
  ports:
    - targetPort: 8001
      port: 80
EOF
Then create a scheduler policy. If such a policy already exists, the extenders section below must be added to it.
oc create -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: scheduler-policy
  namespace: openshift-config
data:
  policy.cfg: |
    {
      "kind" : "Policy",
      "apiVersion" : "v1",
      "extenders" : [
        { "urlPrefix": "http://127.0.0.1:$(oc get service/pmem-csi-intel-com-http-scheduler -n pmem-csi -o jsonpath={.spec.ports[*].nodePort})",
          "filterVerb": "filter",
          "prioritizeVerb": "prioritize",
          "nodeCacheCapable": true,
          "weight": 1,
          "managedResources": [
            { "name": "pmem-csi.intel.com/scheduler",
              "ignoredByScheduler": true
            }
          ]
        }
      ]
    }
EOF
Finally, activate the usage of that policy by updating the existing scheduler/cluster object. If a policy was set already, this command will fail with "The request is invalid", in which case the existing policy config map must be edited.
$ oc patch scheduler/cluster --type json \
--patch '[{"op":"test","path":"/spec/policy/name","value":""}, {"op":"replace","path":"/spec/policy/name","value":"scheduler-policy"}]'
scheduler.config.openshift.io/cluster patched
This causes schedulers to be restarted with a new configuration:
$ oc get events -n openshift-kube-scheduler-operator
...
14m Normal ConfigMapCreated deployment/openshift-kube-scheduler-operator Created ConfigMap/policy-configmap -n openshift-kube-scheduler because it was missing
14m Normal RevisionTriggered deployment/openshift-kube-scheduler-operator new revision 7 triggered by "configmap/policy-configmap has changed"
14m Normal ConfigMapCreated deployment/openshift-kube-scheduler-operator Created ConfigMap/revision-status-7 -n openshift-kube-scheduler because it was missing
14m Normal ConfigMapCreated deployment/openshift-kube-scheduler-operator Created ConfigMap/kube-scheduler-pod-7 -n openshift-kube-scheduler because it was missing
...
13m Normal OperatorStatusChanged deployment/openshift-kube-scheduler-operator Status for clusteroperator/kube-scheduler changed: Progressing changed from True to False ("NodeInstallerProgressing: 1 nodes are at revision 8"),Available message changed from "StaticPodsAvailable: 1 nodes are active; 1 nodes are at revision 6; 0 nodes have achieved new revision 8" to "StaticPodsAvailable: 1 nodes are active; 1 nodes are at revision 8"
13m Normal ConfigMapUpdated deployment/openshift-kube-scheduler-operator Updated ConfigMap/revision-status-8 -n openshift-kube-scheduler:
cause by changes in data.status
13m Normal PodCreated deployment/openshift-kube-scheduler-operator Created Pod/revision-pruner-8-tt-87fkd-master-0 -n openshift-kube-scheduler because it was missing
$ oc get pods -n openshift-kube-scheduler -l app=openshift-kube-scheduler
NAME READY STATUS RESTARTS AGE
openshift-kube-scheduler-tt-87fkd-master-0 3/3 Running 0 11m
$ oc exec -ti -n openshift-kube-scheduler openshift-kube-scheduler-tt-87fkd-master-0 -c kube-scheduler -- cat /etc/kubernetes/static-pod-resources/configmaps/policy-configmap/policy.cfg
{
"kind" : "Policy",
"apiVersion" : "v1",
"extenders" : [
{ "urlPrefix": "https://127.0.0.1:30674",
"filterVerb": "filter",
"prioritizeVerb": "prioritize",
"nodeCacheCapable": true,
"weight": 1,
"managedResources": [
{ "name": "pmem-csi.intel.com/scheduler",
"ignoredByScheduler": true
}
]
}
]
}
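The oc patch command used earlier to activate the policy combines a test and a replace operation, which is why it refuses to overwrite an existing policy name. The following minimal sketch emulates that behavior for plain nested dictionaries. It is an illustration only: it implements just the two operations needed here, ignores JSON Pointer escaping and arrays, and the class and function names are hypothetical.

```python
import copy


class PatchError(Exception):
    """Raised when an operation of the patch cannot be applied."""


def _resolve(doc, path):
    """Return the parent dictionary and the final key for a path like /spec/policy/name."""
    keys = path.strip("/").split("/")
    parent = doc
    for key in keys[:-1]:
        parent = parent[key]
    return parent, keys[-1]


def apply_patch(doc, operations):
    """Apply 'test' and 'replace' operations atomically and return the patched copy."""
    result = copy.deepcopy(doc)  # the original stays untouched if any operation fails
    for op in operations:
        parent, key = _resolve(result, op["path"])
        if op["op"] == "test":
            if key not in parent or parent[key] != op["value"]:
                raise PatchError(f"test failed for {op['path']}")
        elif op["op"] == "replace":
            if key not in parent:
                raise PatchError(f"nothing to replace at {op['path']}")
            parent[key] = op["value"]
        else:
            raise PatchError(f"unsupported operation {op['op']}")
    return result


PATCH = [
    {"op": "test", "path": "/spec/policy/name", "value": ""},
    {"op": "replace", "path": "/spec/policy/name", "value": "scheduler-policy"},
]

if __name__ == "__main__":
    unset = {"spec": {"policy": {"name": ""}}}
    print(apply_patch(unset, PATCH))
```

With an object that already references some other policy, the test operation fails and nothing gets changed, which corresponds to the error case described for the oc patch command.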
Kubernetes 1.19 introduces support for publishing and using storage capacity information for pod scheduling. It became beta in 1.21. PMEM-CSI must be deployed differently to use this feature:
- external-provisioner must be told to publish storage capacity information via command line arguments.
- A flag in the CSI driver information must be set for the Kubernetes scheduler, otherwise it ignores that information when considering pods with unbound volumes.
The deployments for Kubernetes >= 1.21 do this automatically. The alpha API in 1.19 and 1.20 is no longer supported.
Metrics support is controlled by command line options of the PMEM-CSI driver binary and of the CSI sidecars. Annotations and named container ports make it possible to discover these data scraping endpoints. The metrics kustomize base adds all of that to the pre-generated deployment files. The operator also enables the metrics support.
Access to metrics data is not restricted (no TLS, no client authorization) because the metrics data is not considered confidential and access control would just make client configuration unnecessarily complex.
PMEM-CSI exposes metrics data about the Go runtime, Prometheus, CSI method calls, and PMEM-CSI:
Name | Type | Explanation |
---|---|---|
build_info | gauge | A metric with a constant '1' value labeled by version. |
scheduler_request_duration_seconds | histogram | Latencies for PMEM-CSI scheduler HTTP requests by operation ("mutate", "filter", "status") and method ("post"). |
scheduler_in_flight_requests | gauge | Currently pending PMEM-CSI scheduler HTTP requests. |
scheduler_requests_total | counter | Number of HTTP requests to the PMEM-CSI scheduler, regardless of operation and method. |
scheduler_response_size_bytes | histogram | Histogram of response sizes for PMEM-CSI scheduler requests, regardless of operation and method. |
csi_[sidecar\|plugin]_operations_seconds | histogram | gRPC call duration and error code, for sidecar to driver (aka plugin) communication. |
go_* | | Go runtime information |
pmem_amount_available | gauge | Remaining amount of PMEM on the host that can be used for new volumes. |
pmem_amount_managed | gauge | Amount of PMEM on the host that is managed by PMEM-CSI. |
pmem_amount_max_volume_size | gauge | The size of the largest PMEM volume that can be created. |
pmem_amount_total | gauge | Total amount of PMEM on the host. |
process_* | | Process information |
promhttp_metric_handler_requests_in_flight | gauge | Current number of scrapes being served. |
promhttp_metric_handler_requests_total | counter | Total number of scrapes by HTTP status code. |
This list is tentative and may still change as long as metrics support is alpha. To see all available data, query a container. Different containers provide different data. For example, the controller provides:
$ kubectl port-forward -n pmem-csi $(kubectl get pods -n pmem-csi -o name -l app.kubernetes.io/name=pmem-csi-controller) 10010
Forwarding from 127.0.0.1:10010 -> 10010
Forwarding from [::1]:10010 -> 10010
And in another shell:
$ curl --silent http://localhost:10010/metrics | grep '# '
# HELP build_info A metric with a constant '1' value labeled by version.
# TYPE build_info gauge
...
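The same filtering can be done programmatically. The sketch below extracts the HELP and TYPE comment lines from the Prometheus text format and collects them per metric. The function name is hypothetical and the sample input is just the excerpt shown above, not complete driver output.

```python
def parse_metric_metadata(text):
    """Collect help text and type per metric from '# HELP' and '# TYPE' lines."""
    metrics = {}
    for line in text.splitlines():
        # Metadata lines look like: "# HELP <name> <text>" or "# TYPE <name> <type>".
        parts = line.split(" ", 3)
        if len(parts) < 4 or parts[0] != "#":
            continue
        _, keyword, name, rest = parts
        if keyword == "HELP":
            metrics.setdefault(name, {})["help"] = rest
        elif keyword == "TYPE":
            metrics.setdefault(name, {})["type"] = rest
    return metrics


SAMPLE = """\
# HELP build_info A metric with a constant '1' value labeled by version.
# TYPE build_info gauge
"""

if __name__ == "__main__":
    for name, info in parse_metric_metadata(SAMPLE).items():
        print(name, info.get("type"), info.get("help"))
```

In practice the input would be the body returned by the /metrics endpoint, for example the output of the curl command above without the grep filter.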
An extension of the scrape config is
necessary for Prometheus. When deploying
Prometheus via Helm,
that file can be added to the default configuration with the -f
parameter. The following example works for the QEMU-based
cluster and Helm v3.1.2. In a real
production deployment, some kind of persistent storage should be
provided. The URL can be used instead of
the file name, too.
$ helm install prometheus stable/prometheus \
--set alertmanager.persistentVolume.enabled=false,server.persistentVolume.enabled=false \
-f deploy/prometheus.yaml
NAME: prometheus
LAST DEPLOYED: Tue Aug 18 18:04:27 2020
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
The Prometheus server can be accessed via port 80 on the following DNS name from within your cluster:
prometheus-server.default.svc.cluster.local
Get the Prometheus server URL by running these commands in the same shell:
export POD_NAME=$(kubectl get pods --namespace default -l "app=prometheus,component=server" -o jsonpath="{.items[0].metadata.name}")
kubectl --namespace default port-forward $POD_NAME 9090
#################################################################################
###### WARNING: Persistence is disabled!!! You will lose your data when #####
###### the Server pod is terminated. #####
#################################################################################
...
After running this kubectl port-forward command, it is possible to access the Prometheus web interface and run some queries there. Here are some examples for the QEMU test cluster with two volumes created on node pmem-csi-pmem-govm-worker2.
Available PMEM as a fraction of the managed PMEM (1 = 100%):
pmem_amount_available / pmem_amount_managed
Result variable | Value | Tags |
---|---|---|
none | 0.7332986065893997 | instance = 10.42.0.1:10010, job = pmem-csi-containers, kubernetes_namespace = default, kubernetes_pod_container_name = pmem-driver, kubernetes_pod_name = pmem-csi-node-dfkrw, kubernetes_pod_node_name = pmem-csi-pmem-govm-worker2, node = pmem-csi-pmem-govm-worker2 |
none | 1 | instance = 10.36.0.1:10010, job = pmem-csi-containers, kubernetes_namespace = default, kubernetes_pod_container_name = pmem-driver, kubernetes_pod_name = pmem-csi-node-z5vnp, kubernetes_pod_node_name = pmem-csi-pmem-govm-worker3, node = pmem-csi-pmem-govm-worker3 |
none | 1 | instance = 10.44.0.1:10010, job = pmem-csi-containers, kubernetes_namespace = default, kubernetes_pod_container_name = pmem-driver, kubernetes_pod_name = pmem-csi-node-zzmsd, kubernetes_pod_node_name = pmem-csi-pmem-govm-worker1, node = pmem-csi-pmem-govm-worker1 |
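The query result is a ratio between 0 and 1. When a percentage is wanted, for example in a report generated outside of Prometheus, the conversion is a single multiplication. A minimal sketch with a hypothetical helper name:

```python
def available_percent(available_bytes, managed_bytes):
    """Return the share of managed PMEM that is still available, in percent."""
    if managed_bytes <= 0:
        raise ValueError("managed_bytes must be positive")
    return 100.0 * available_bytes / managed_bytes


if __name__ == "__main__":
    # The ratio reported for worker2 above corresponds to roughly 73 percent.
    print(round(100.0 * 0.7332986065893997, 1))
```

The same scaling can be done directly in the query by multiplying the expression with 100.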
Number of CreateVolume calls in nodes:
pmem_csi_node_operations_seconds_count{method_name="/csi.v1.Controller/CreateVolume"}
Result variable | Value | Tags |
---|---|---|
pmem_csi_node_operations_seconds_count | 2 | driver_name = pmem-csi.intel.com, grpc_status_code = OK, instance = 10.42.0.1:10010, job = pmem-csi-containers, kubernetes_namespace = default, kubernetes_pod_container_name = pmem-driver, kubernetes_pod_name = pmem-csi-node-dfkrw, kubernetes_pod_node_name = pmem-csi-pmem-govm-worker2, method_name = /csi.v1.Controller/CreateVolume, node = pmem-csi-pmem-govm-worker2 |
PmemCSIDeployment is a cluster-scoped Kubernetes resource in the pmem-csi.intel.com API group. It describes how a PMEM-CSI driver instance is to be created.
The operator will create objects in the namespace in which the operator itself runs if the object type is namespaced.
The name of the deployment object is also used as CSI driver name. This ensures that the name is unique and immutable. However, name clashes with other CSI drivers are still possible, so the name should meet the CSI requirements:
- domain name notation format, including a unique top-level domain
- 63 characters or less, beginning and ending with an alphanumeric character ([a-z0-9A-Z]) with dashes (-), dots (.), and alphanumerics between.
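The second requirement can be checked mechanically before creating a deployment. The following sketch encodes only the length and character rules from the list above. It does not verify that the name uses a unique top-level domain, and the function name is hypothetical.

```python
import re

# Alphanumeric at both ends, with dashes, dots and alphanumerics in between.
_NAME_RE = re.compile(r"[A-Za-z0-9]([A-Za-z0-9.-]*[A-Za-z0-9])?")


def is_valid_driver_name(name):
    """Check length and character rules for a CSI driver name."""
    return len(name) <= 63 and _NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("pmem-csi.intel.com", "-pmem-csi.intel.com", "x" * 64):
        print(candidate, is_valid_driver_name(candidate))
```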
The name is also used as prefix for the names of all objects created for the deployment and for the local /var/lib/<name> state directory on each node.
Although the operator allows running multiple PMEM-CSI driver deployments, one has to take extreme care of such deployments by ensuring that not more than one driver ends up running on the same node(s). Nodes on which a PMEM-CSI driver could run can be configured by using the nodeSelector property of the DeploymentSpec.
NOTE: Starting from release v0.9.0, reconciling of the Deployment CRD in the pmem-csi.intel.com/v1alpha1 API group is not supported by the PMEM-CSI operator anymore. Such resources in the cluster must be migrated manually to the new PmemCSIDeployment API.
The current API for PmemCSIDeployment resources is:
Field | Type | Description |
---|---|---|
apiVersion | string | pmem-csi.intel.com/v1beta1 |
kind | string | PmemCSIDeployment |
metadata | ObjectMeta | Object metadata, name used for CSI driver and as prefix for sub-objects |
spec | DeploymentSpec | Specification of the desired behavior of the deployment |
The specification fields below are valid in all API versions unless noted otherwise in the description.
The default values are used by the operator when no value is set for a field explicitly. Those defaults can change over time and are not part of the API specification.
Field | Type | Description | Default Value |
---|---|---|---|
image | string | PMEM-CSI docker image name used for the deployment | the same image as the operator1 |
provisionerImage | string | CSI provisioner docker image name | latest external provisioner stable release image2 |
nodeRegistrarImage | string | CSI node driver registrar docker image name | latest node driver registrar stable release image2 |
pullPolicy | string | Docker image pull policy. either one of Always , Never , IfNotPresent |
IfNotPresent |
logLevel | integer | PMEM-CSI driver logging level | 3 |
logFormat | text | log output format | "text" or "json" 3 |
deviceMode | string | Device management mode to use. Supports one of lvm or direct |
lvm |
controllerTLSSecret | string | Name of an existing secret in the driver's namespace which contains ca.crt, tls.crt and tls.key data for the scheduler extender and pod mutation webhook. A controller is started if (and only if) this secret is specified. Alternatively, the special string -openshift- can be used on OpenShift to let OpenShift create the necessary secrets. |
empty |
controllerReplicas | int | Number of concurrently running controller pods. | 1 |
mutatePods | Always/Try/Never | Defines how a mutating pod webhook is configured if a controller is started. The field is ignored if the controller is not enabled. "Never" disables pod mutation. "Try" configured it so that pod creation is allowed to proceed even when the webhook fails. "Always" requires that the webhook gets invoked successfully before creating a pod. | Try |
schedulerNodePort | int or string | If non-zero, the scheduler service is created as a NodeService with that fixed port number. Otherwise that service is created as a cluster service. The number must be from the range reserved by Kubernetes for node ports. This is useful if the kube-scheduler cannot reach the scheduler extender via a cluster service. | 0 |
controllerResources | ResourceRequirements | Describes the compute resource requirements for the controller pod. <sup>4</sup>Deprecated and only available in v1alpha1. | |
nodeResources | ResourceRequirements | Describes the compute resource requirements for the pods running on node(s). <sup>4</sup>Deprecated and only available in v1alpha1. | |
controllerDriverResources | ResourceRequirements | Describes the compute resource requirements for the controller driver container running on the master node. Available since v1beta1. | |
nodeDriverResources | ResourceRequirements | Describes the compute resource requirements for the driver container running on worker node(s). Available since v1beta1. | |
provisionerResources | ResourceRequirements | Describes the compute resource requirements for the external provisioner sidecar container. Available since v1beta1. | |
nodeRegistrarResources | ResourceRequirements | Describes the compute resource requirements for the driver registrar sidecar container running on worker node(s). Available since v1beta1. | |
registryCert | string | Encoded tls certificate signed by a certificate authority used for driver's controller registry server | generated by operator self-signed CA |
nodeControllerCert | string | Encoded tls certificate signed by a certificate authority used for driver's node controllers | generated by operator self-signed CA |
registryKey | string | Encoded RSA private key used for signing by registryCert | generated by the operator |
nodeControllerKey | string | Encoded RSA private key used for signing by nodeControllerCert | generated by the operator |
caCert | string | Certificate of the CA by which the registryCert and nodeControllerCert are signed | self-signed certificate generated by the operator |
nodeSelector | string map | Labels to use for selecting Nodes on which PMEM-CSI driver should run. | { "storage": "pmem" } |
pmemPercentage | integer | Percentage of PMEM space to be used by the driver on each node. This is only valid for a driver deployed in lvm mode. This field can be modified, but by that time the old value may have been used already. Reducing the percentage is not supported. | 100 |
labels | string map | Additional labels for all objects created by the operator. Can be modified after the initial creation, but removed labels will not be removed from existing objects because the operator cannot know which labels it needs to remove and which it has to leave in place. | |
kubeletDir | string | Kubelet's root directory path | /var/lib/kubelet |
maxUnavailable | int or string | Maximum number of node drivers that are allowed to be down during a rolling update, given as an absolute number or as a percentage of the total number of nodes with the driver | 1 |
1 To use the operator's own container image as the default driver image, the operator pod must be started with the following environment variables set to appropriate values:
- POD_NAME: Name of the operator pod. The operator can determine the namespace of the pod by itself.
- OPERATOR_NAME: Name of the operator container. If not set, it defaults to "pmem-csi-operator".
2 Image versions depend on the Kubernetes release. The operator dynamically chooses suitable image versions. Users have to take care of that themselves when overriding the values.
3 In PMEM-CSI 0.9.0, "json" output is only available for the PMEM-CSI container. The sidecars are still producing plain text messages. This may change in the future. Also, the migration from formatted log messages (= printf style) to structured log messages (message plus key/value pairs) is not complete.
4 Pod level resource requirements (nodeResources and controllerResources) are deprecated in favor of per-container resource requirements (nodeDriverResources, nodeRegistrarResources, controllerDriverResources and provisionerResources).
WARNING: although all fields can be modified and changes will be propagated to the deployed driver, not all changes are safe. In particular, changing the deviceMode will not work when there are active volumes.
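The two restrictions named in this section (no deviceMode change while volumes are active, no reduction of pmemPercentage) can be checked before an edit is applied. The following is a minimal sketch, not part of PMEM-CSI: `risky_spec_changes` is a hypothetical helper, both specs are plain dictionaries holding the fields from the table above, and whether active volumes exist is assumed to be determined elsewhere.

```python
def risky_spec_changes(old_spec, new_spec, has_active_volumes):
    """Return warnings for spec edits that the documentation flags as unsafe.

    Hypothetical helper; only the two cases named in the text are covered.
    """
    warnings = []

    # deviceMode: the default is not resolved here, so pass specs in which
    # the field is set explicitly.
    old_mode = old_spec.get("deviceMode")
    new_mode = new_spec.get("deviceMode")
    if has_active_volumes and old_mode != new_mode:
        warnings.append(
            f"deviceMode changes from {old_mode!r} to {new_mode!r} "
            "while volumes are active"
        )

    # pmemPercentage: 100 is the documented default; reducing it is not supported.
    old_pct = old_spec.get("pmemPercentage", 100)
    new_pct = new_spec.get("pmemPercentage", 100)
    if new_pct < old_pct:
        warnings.append(
            f"pmemPercentage is reduced from {old_pct} to {new_pct}"
        )

    return warnings
```

An empty result only means that neither of these two cases applies; it does not prove that a change is safe.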
A PMEM-CSI Deployment's status field is a DeploymentStatus object, which carries the detailed state of the driver deployment. It comprises deployment conditions, driver component status, and a phase field. The phase of a PMEM-CSI deployment is a high-level summary of where the PmemCSIDeployment is in its lifecycle.
The possible phase values and their meanings are:
Value | Meaning |
---|---|
empty string | A new deployment. |
Running | The operator has determined that the driver is usable<sup>1</sup>. |
Failed | For some reason, the PmemCSIDeployment failed and cannot be progressed. The failure reason is placed in the DeploymentStatus.Reason field. |
1 This check has not been implemented yet. Instead, the deployment goes straight to Running after creating sub-resources.
PMEM-CSI DeploymentStatus has an array of conditions through which the PMEM-CSI Deployment has or has not passed. Below are the possible condition types and their meanings:
Condition type | Meaning |
---|---|
CertsReady | Driver certificates/secrets are available. |
CertsVerified | Verified that the provided certificates are valid. |
DriverDeployed | All the components required for the PMEM-CSI deployment have been deployed. |
PMEM-CSI DeploymentStatus has an array of components of type DriverStatus in which the operator records a brief status of each driver component. This is useful for checking whether all the driver instances of a deployment are ready. Below are the fields of DriverStatus and their meanings:
Field | Meaning |
---|---|
component | Represents the driver component type; one of Controller or Node. |
status | Represents the state of the component; one of Ready or NotReady. A component becomes Ready if all the instances of the driver component are running. Otherwise, it is NotReady. |
reason | A brief message that explains why the component is in this state. |
lastUpdateTime | Time at which the status was last updated. |
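Because the Running phase is currently reached right after the sub-resources have been created, the component status is the more telling signal for readiness. The following is a minimal sketch of combining both, assuming the status has already been loaded into a dictionary (for example from JSON output of the resource). `deployment_ready` is a hypothetical helper, and the dictionary key under which the DriverStatus array appears is an assumption that can be overridden.

```python
def deployment_ready(status, components_key="components"):
    """Return (ready, reasons) for a DeploymentStatus given as a dict.

    Hypothetical helper. components_key is an assumed name for the key that
    holds the DriverStatus array; adjust it to match the actual object.
    """
    reasons = []

    # An empty phase means a new deployment, Failed means it cannot progress.
    phase = status.get("phase", "")
    if phase != "Running":
        reasons.append(f"phase is {phase!r}, not 'Running'")

    # Each DriverStatus entry carries component, status and reason fields.
    for comp in status.get(components_key, []):
        if comp.get("status") != "Ready":
            reason = comp.get("reason", "no reason given")
            reasons.append(f"{comp.get('component')}: {reason}")

    return (not reasons, reasons)
```

The returned reasons repeat the `reason` text recorded by the operator, so they can be logged directly.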
The PMEM-CSI operator posts events on the progress of a PmemCSIDeployment. If the deployment is in the Failed state, then one can look into the event(s) using kubectl describe on that deployment for the detailed failure reason.
In addition to the metrics data provided by the controller-runtime, the PMEM-CSI operator exposes the following metrics data about active PmemCSIDeployment custom resources and their sub-objects:
Name | Type | Explanation |
---|---|---|
pmem_csi_deployment_reconcile | counter | Counter that is incremented each time a PmemCSIDeployment CR goes through a reconcile loop, labeled with the deployment name and uid. |
pmem_csi_deployment_sub_resource_created_at | gauge | Timestamp at which a sub resource of the PmemCSIDeployment CR was created by the operator. Labeled by resource details ("name", "namespace", "group", "version", "kind", "uid", "ownedBy"). |
pmem_csi_deployment_sub_resource_updated_at | gauge | Timestamp at which a sub resource of the PmemCSIDeployment CR was updated by the operator. Labeled by resource details ("name", "namespace", "group", "version", "kind", "uid", "ownedBy"). |
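To look at these values outside of a monitoring stack, the operator's metrics output can be filtered for the names in the table. The following is a minimal sketch that assumes the metrics have already been fetched as text in the usual Prometheus exposition format; `pmem_csi_samples` is a hypothetical helper, and how the metrics endpoint is reached is not covered here.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# One label pair inside the braces; escape sequences are kept as-is.
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def pmem_csi_samples(text, prefix="pmem_csi_deployment_"):
    """Return (name, labels, value) tuples for metrics starting with prefix.

    Hypothetical helper; comment lines (# HELP, # TYPE) and metrics from
    other sources such as the controller-runtime are skipped.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if not match or not match.group("name").startswith(prefix):
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples
```

The gauge values are timestamps, so comparing `..._created_at` with `..._updated_at` for the same labels shows whether a sub resource has been changed since it was created.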
Report a bug by filing a new issue.
Before making your first contribution, be sure to read the development documentation for guidance on code quality and branches.
Contribute by opening a pull request.
Learn about pull requests.
Reporting a Potential Security Vulnerability: If you have discovered a potential security vulnerability in PMEM-CSI, please send an e-mail to [email protected]. For issues related to Intel Products, please visit Intel Security Center.
It is important to include the following details:
- The projects and versions affected
- Detailed description of the vulnerability
- Information on known exploits
Vulnerability information is extremely sensitive. Please encrypt all security vulnerability reports using our PGP key.
A member of the Intel Product Security Team will review your e-mail and contact you to collaborate on resolving the issue. For more information on how Intel works to resolve security issues, see: vulnerability handling guidelines.