No affinity on pv when using topology #566

Open
creativie opened this issue Aug 24, 2022 · 8 comments

Comments

@creativie

creativie commented Aug 24, 2022

Hi
The IBM block CSI driver does not add nodeAffinity information to the PV when using CSI topology. As a result, pods can be scheduled onto nodes outside the topologies described in the secret's "supported_topologies", followed by unsuccessful attempts to mount the PV there.

Versions:
ibm-block-csi-operator.v1.10.0
Openshift 4.11.0 (k8s v1.24.0+9546431)

@kasserater
Member

Can you reproduce this with the latest 1.11.3 release?

@dje4om

dje4om commented Jan 30, 2025

Hi,

We also observe this behavior with topologies. We are currently running ibm-block CSI 1.12.0 on an OpenShift 4.17 cluster.
We have 2 zones with 2 different SVCs (1 per zone) and corresponding storage pools; a zone1 node can't access zone2 storage and vice versa.

The topology is respected on creation, but we expect this affinity to also be taken into account when rescheduling occurs. Instead, pods still try to schedule on the other zone, where it is technically impossible (and intentional) for them to find their volume.
We expect these pods to stay in Pending state if no workers are available in the expected zone.
As far as I know, there should be an affinity on the PV, like @creativie said; it should be something like this:

spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.block.csi.ibm.com/zone
          operator: In
          values:
          - zone1

Are we missing something?

@lechapitre
Contributor

lechapitre commented Jan 31, 2025

@creativie / @dje4om

IBM CSI pods filter access to systems/volumes in both the host definer and volume operations. Each node on the cluster should only be able to access volumes allowed by the zoning labels.

If you want your own pods to have filtered access - you need to do that with K8S topology (node affinity), not CSI topology.

Below is an example - can you please confirm that you defined topology similarly?

If not - I'll need the secret file (with any confidential information replaced by fake values, of course), the label information from the node description, the PV definition used by the pod you're trying to run, and how the pod configures access to the mount.

In the following example of one element in a topology-aware secret definition (taken from the IBM CSI documentation):

  • storage with address demo-management-address-1 is accessible with username/password demo-username-1/demo-password-1

  • But ONLY BY CSI pods running on nodes that have two labels defined (topology.block.csi.ibm.com/demo-region = demo-region-1, topology.block.csi.ibm.com/demo-zone = demo-zone-1).

  • Note those are node labels, not annotations (I'm noting explicitly because K8S zoning uses node annotations)

       "demo-management-id-1": {
         "username": "demo-username-1",
         "password": "demo-password-1",
         "management_address": "demo-management-address-1",
         "supported_topologies": [
           {
             "topology.block.csi.ibm.com/demo-region": "demo-region-1",
             "topology.block.csi.ibm.com/demo-zone": "demo-zone-1"
           }
         ]
       },
    

If you define your pod with a volume mount definition - the corresponding PV must conform to CSI zoning restrictions. To allow your pod to run only on nodes that have access to a specific PV - you also need to use K8S zoning (node affinity), for example as in the sketch below.
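
A minimal sketch of that (the pod name, image and claim name are placeholders I made up; the labels are the demo labels from the secret example above):

# Sketch only: a pod restricted to nodes carrying the (assumed) demo zone label,
# mounting a PVC whose PV is only reachable from those nodes.
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.block.csi.ibm.com/demo-zone
            operator: In
            values:
            - demo-zone-1
  containers:
  - name: app
    image: registry.example.com/demo-app:latest   # placeholder image
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: demo-pvc   # placeholder claim name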

Of course - CSI cannot control or define affinity of user pods.

If you defined your pods as above and still see an issue - I'll need more details as requested above.

Thanks!

@dje4om

dje4om commented Jan 31, 2025

Hi @lechapitre, thanks for your reply!

I agree that it is not the role of the CSI to define pod affinity or topology constraints. In our case, that is natively done through Kubernetes topology labels; this one is used: topology.kubernetes.io/zone
These are labels on nodes, not annotations: https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesiozone. That page also covers usage with PersistentVolume and topology awareness.

Here is the setup about node labels for topology usage:

oc get nodes -l node-role.kubernetes.io/worker= -L topology.kubernetes.io/zone -L topology.block.csi.ibm.com/region -L topology.block.csi.ibm.com/zone
NAME               STATUS   ROLES                  AGE   VERSION   ZONE    REGION         ZONE        
openshift-worker1  Ready    worker                 41d   v1.30.5   zone1   myregion       zone1
openshift-worker2  Ready    worker                 41d   v1.30.5   zone2   myregion       zone2

By the way, why use CSI-specific labels instead of the native ones? We expected something consistent here. Maybe I'm missing a use case, but we would appreciate being able to rely on our own labels or the native ones, as other CSI drivers allow.

We use the topology-aware setup according to the documentation. Here is the relevant secret configuration; as you can see, a different management_address is configured per zone. As you said, we agree with and expect this behaviour from a topology-awareness feature:

Each node on the cluster should only be able to access volumes allowed by the zoning labels

Also, a volume can only be remapped onto another worker in the same zone.

{
    "id-zone1": {
        "username": "user_zone1",
        "password": "xxxxxxxxxxxxxxxxxxx",
        "management_address": "10.0.0.1",
        "supported_topologies": [
            {
                "topology.block.csi.ibm.com/region": "myregion",
                "topology.block.csi.ibm.com/zone": "zone1"
            }
        ]
    },
    "id-zone2": {
        "username": "user_zone2",
        "password": "xxxxxxxxxxxxxxxxxxx",
        "management_address": "10.10.0.1",
        "supported_topologies": [
            {
                "topology.block.csi.ibm.com/region": "myregion",
                "topology.block.csi.ibm.com/zone": "zone2"
            }
        ]
    }
}

To summarize a "user story": a pod is created in zone1 (Kubernetes decides this for us, or we explicitly specify the zone), it claims a PVC via a topology-aware StorageClass in the correct zone (zone1), and the CSI creates the volume in the expected zone. Then assume we lose the zone1 workers and the pod is allowed to schedule in the other zone, zone2 (a configuration error or whatever). It then tries to find its volume, which was created in zone1; of course it does not find it, and the pod ends up in error instead of Pending state.
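
For context, the kind of topology-aware StorageClass we mean looks roughly like this (a sketch only, not our exact definition: the class name is made up, we have one class per zone, and vendor-specific parameters such as pool/secret references are omitted):

# Sketch of a per-zone, topology-aware StorageClass (zone1 shown);
# vendor-specific parameters are intentionally left out.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: block-zone1
provisioner: block.csi.ibm.com
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: topology.block.csi.ibm.com/zone
    values:
    - zone1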

Some points about your answer :

  • Actually, we consider it the role of the CSI to ensure that created PVs are only accessible from the expected zone, and to indirectly inform pods of that. I mean that if a pod tries to go to zone1 (a node with the native zone1 label and also the IBM zone1 label) while the PV indicates it is in zone2, the pod will not try to schedule there. In my understanding this is the role of PersistentVolume.spec.nodeAffinity, and it may solve this behaviour.

  • I am not sure I understand why you mention the host-definer here. By the way, there is an issue with the topology-awareness secret: "malformed node or string" on host-definer with topology-awareness enabled #662, see #662 (comment)

  • On the pod side, the volume mount handling is totally transparent; we rely on the StorageClass and the CSI, even if scheduling is badly configured

Thanks!

@lechapitre
Contributor

@dje4om ,

As suspected - your use case is not supported by CSI.

The purpose of topology in IBM CSI is to restrict access to volumes from certain nodes. It is not meant to limit resource usage.

It is unclear to me why you need IBM CSI zoning in your use case in the first place. If I understand correctly, you want the pod to run on either node selected by OpenShift, and this pod should have access to the same volume regardless of the node selected. Why did you define zoning?

I mentioned host-definer because it also needs to be topology aware - if only node1 is able to access volume1 - then only node1 ports need to be defined on volume1 storage.

I don't know the historical reason for IBM CSI using its own labels - but it does allow more flexibility: only a subset of pods, restricted by affinity, can have access to certain volumes (restricted by IBM CSI labels).

@dje4om

dje4om commented Feb 3, 2025

Hello @lechapitre,

Our case is a very standard one; the confusion probably comes from the way I tried to describe how to reproduce it.

We need zoning because both workers are physical nodes (this is a simplified example) located in different datacenters, each with its own storage array; this is why we need CSI topology.

We absolutely don't want a pod to be able to move from one zone to another, yet that is what actually happens. Of course, the pod can't run (CrashLoopBackOff) because it can't find its volume, but it tried, and that is the issue here.
The nodeAffinity field on the PersistentVolume is a potential improvement to prevent this natively.

I'm wondering how you define Deployments or StatefulSets to ensure the topology is respected - for example, do you add an explicit zone constraint to the pod template, something like the sketch below?
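
A minimal sketch of what I have in mind (all names, the image and the storage class are placeholders; it pins the pod template to zone1 via the native zone label):

# Sketch only: StatefulSet pinned to zone1 via the native Kubernetes zone label.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo-db
spec:
  serviceName: demo-db
  replicas: 1
  selector:
    matchLabels:
      app: demo-db
  template:
    metadata:
      labels:
        app: demo-db
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: zone1
      containers:
      - name: db
        image: registry.example.com/demo-db:latest   # placeholder image
        volumeMounts:
        - name: data
          mountPath: /var/lib/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: block-zone1   # placeholder per-zone class
      resources:
        requests:
          storage: 10Gi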

@lechapitre
Contributor

@dje4om ,

So the pod starts to run on node1 with access to volume1.
Then the pod is rescheduled to node2 and crashes when it tries to access volume1 - but you expect it to fail (or use volume2 instead) - correct?

Can you explain how K8S/OpenShift allows the pod to be rescheduled on node2 if the pod affinity is "node1 only"? (which you should have defined yourself with node annotations, not IBM CSI labels).

You asked about our implementation - we use the labels such that our pods (CSI and host-definer) only "see" volumes which are allowed by label. If topology doesn't allow a node instance of CSI/host-definer to access a volume - then it's simply as if the storage doesn't exist on that node.

@dje4om

dje4om commented Feb 10, 2025

Yes, but we don't expect it to fail or to create a new volume; we expect it to remain Pending.

We observed this behaviour on a StatefulSet with one replica during a node maintenance that we had to force because of a PDB. We expected the pod to remain Pending, but it kept crash-looping even after the node returned from maintenance.
I will work on an explicit example and scenario to help you reproduce this issue.
Of course, it could be a configuration issue on the pod side (as I've already said), but it would certainly be a prevention mechanism that improves reliability.

FYI, I have since checked another CSI driver (from a well-known vendor) that supports topologies, and there this nodeAffinity configuration is properly set on the PV.

Here is some additional information: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#node-affinity
