No affinity on pv when using topology #566

Open
creativie opened this issue Aug 24, 2022 · 8 comments

Comments

@creativie

creativie commented Aug 24, 2022

Hi
The IBM block CSI driver does not add nodeAffinity information to the PV when using CSI topology. As a result, pods can be scheduled onto nodes outside the topologies described in the secret's "supported_topologies", followed by unsuccessful attempts to mount the PV there.

Versions:
ibm-block-csi-operator.v1.10.0
Openshift 4.11.0 (k8s v1.24.0+9546431)

@kasserater
Member

Can you reproduce this with the latest 1.11.3 release?

@dje4om

dje4om commented Jan 30, 2025

Hi,

We also observe this behavior with topologies. We are currently running ibm-block CSI 1.12.0 on an OpenShift 4.17 cluster.
We have 2 zones with 2 different SVCs (1 per zone) and corresponding storage pools; a zone1 node can't access zone2 storage and vice versa.

The topology is respected on creation, but we expect this affinity to also be taken into account when rescheduling occurs. Instead, pods still try to schedule on the other zone, where it is technically impossible (and intentional) for them to find their volume.
We expect these pods to stay in Pending state if no workers are available in the expected zone.
As far as I know, there should be an affinity on the PV, like @creativie said; it should be something like this:

spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.block.csi.ibm.com/zone
          operator: In
          values:
          - zone1

Are we missing something?

@lechapitre
Contributor

lechapitre commented Jan 31, 2025

@creativie / @dje4om

IBM CSI pods filter access to systems/volumes in both the host definer and volume operations. Each node on the cluster should only be able to access volumes allowed by the zoning labels.

If you want your own pods to have filtered access - you need to do that with K8S topology (node affinity), not CSI topology.

Below is an example - can you please confirm that you defined topology similarly?

If not - I'll need the secret file (with any confidential information replaced by fake values, of course), the label information from the node description, the PV definition used by the pod you're trying to run, and how the pod configures access to the mount.

In the following example of one element in a topology-aware secret definition (taken from the IBM CSI documentation):

  • storage with address demo-management-address-1 is accessible with username/password demo-username-1/demo-password-1

  • But ONLY BY CSI pods running on nodes that have two labels defined (topology.block.csi.ibm.com/demo-region = demo-region-1, topology.block.csi.ibm.com/demo-zone = demo-zone-1).

  • Note those are node labels, not annotations (I'm noting explicitly because K8S zoning uses node annotations)

       "demo-management-id-1": {
         "username": "demo-username-1",
         "password": "demo-password-1",
         "management_address": "demo-management-address-1",
         "supported_topologies": [
           {
             "topology.block.csi.ibm.com/demo-region": "demo-region-1",
             "topology.block.csi.ibm.com/demo-zone": "demo-zone-1"
           }
         ]
       },
    

If you define your pod with a volume mount definition - the corresponding PV must conform to CSI zoning restrictions. To allow your pod to run only on nodes that have access to a specific PV - you also need to use K8S zoning (node affinity), for example as in the sketch below.
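
A minimal sketch of that (the pod name, image and claim name are placeholders I made up; the labels are the demo labels from the secret example above):

# Sketch only: a pod restricted to nodes carrying the (assumed) demo zone label,
# mounting a PVC whose PV is only reachable from those nodes.
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.block.csi.ibm.com/demo-zone
            operator: In
            values:
            - demo-zone-1
  containers:
  - name: app
    image: registry.example.com/demo-app:latest   # placeholder image
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: demo-pvc   # placeholder claim name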

Of course - CSI cannot control or define affinity of user pods.

If you defined your pods as above and still see an issue - I'll need more details as requested above.

Thanks!

@dje4om

dje4om commented Jan 31, 2025

Hi @lechapitre, thanks for your reply!

I agree that it is not the role of the CSI to define pod affinity or topology constraints. In our case, that is natively done through Kubernetes topology labels; this one is used: topology.kubernetes.io/zone
These are labels on nodes, not annotations: https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesiozone. That page also covers usage with PersistentVolume and topology awareness.

Here is the setup about node labels for topology usage:

oc get nodes -l node-role.kubernetes.io/worker= -L topology.kubernetes.io/zone -L topology.block.csi.ibm.com/region -L topology.block.csi.ibm.com/zone
NAME               STATUS   ROLES                  AGE   VERSION   ZONE    REGION         ZONE        
openshift-worker1  Ready    worker                 41d   v1.30.5   zone1   myregion       zone1
openshift-worker2  Ready    worker                 41d   v1.30.5   zone2   myregion       zone2

By the way, why use CSI-specific labels instead of the native ones? We expected something consistent here. Maybe I'm missing a use case, but we would appreciate being able to rely on our own labels or the native ones, as other CSI drivers allow.

We use the topology-aware setup according to the documentation. Here is the relevant secret configuration; as you can see, a different management_address is configured per zone. As you said, we agree with and expect this behaviour from a topology-awareness feature:

Each node on the cluster should only be able to access volumes allowed by the zoning labels

Also, a volume can only be remapped onto another worker in the same zone.

{
    "id-zone1": {
        "username": "user_zone1",
        "password": "xxxxxxxxxxxxxxxxxxx",
        "management_address": "10.0.0.1",
        "supported_topologies": [
            {
                "topology.block.csi.ibm.com/region": "myregion",
                "topology.block.csi.ibm.com/zone": "zone1"
            }
        ]
    },
    "id-zone2": {
        "username": "user_zone2",
        "password": "xxxxxxxxxxxxxxxxxxx",
        "management_address": "10.10.0.1",
        "supported_topologies": [
            {
                "topology.block.csi.ibm.com/region": "myregion",
                "topology.block.csi.ibm.com/zone": "zone2"
            }
        ]
    }
}

To summarize a "user story": a pod is created in zone1 (Kubernetes decides this for us, or we explicitly specify the zone), it claims a PVC via a topology-aware StorageClass in the correct zone (zone1), and the CSI creates the volume in the expected zone. Then assume we lose the zone1 workers and the pod is allowed to schedule in the other zone, zone2 (a configuration error or whatever). It then tries to find its volume, which was created in zone1; of course it does not find it, and the pod ends up in error instead of Pending state.
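
For context, the kind of topology-aware StorageClass we mean looks roughly like this (a sketch only, not our exact definition: the class name is made up, we have one class per zone, and vendor-specific parameters such as pool/secret references are omitted):

# Sketch of a per-zone, topology-aware StorageClass (zone1 shown);
# vendor-specific parameters are intentionally left out.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: block-zone1
provisioner: block.csi.ibm.com
volumeBindingMode: WaitForFirstConsumer
allowedTopologies:
- matchLabelExpressions:
  - key: topology.block.csi.ibm.com/zone
    values:
    - zone1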

Some points about your answer :

  • Actually, we consider it the role of the CSI to ensure that created PVs are only accessible from the expected zone, and to indirectly inform pods of that. I mean that if a pod tries to go to zone1 (a node with the native zone1 label and also the IBM zone1 label) while the PV indicates it is in zone2, the pod will not try to schedule there. In my understanding this is the role of PersistentVolume.spec.nodeAffinity, and it may solve this behaviour.

  • I am not sure I understand why you mention the host-definer here. By the way, there is an issue with the topology-awareness secret: "malformed node or string" on host-definer with topology-awareness enabled #662, see #662 (comment)

  • On the pod side, the volume mount handling is totally transparent; we rely on the StorageClass and the CSI, even if scheduling is badly configured

Thanks!

@lechapitre
Contributor

@dje4om ,

As suspected - your use case is not supported by CSI.

The purpose of topology in IBM CSI is to restrict access to volumes from certain nodes. It is not meant to limit resource usage.

It is unclear to me why you need IBM CSI zoning in your use case in the first place. If I understand correctly, you want the pod to run on either node selected by OpenShift, and this pod should have access to the same volume regardless of the node selected. Why did you define zoning?

I mentioned host-definer because it also needs to be topology aware - if only node1 is able to access volume1 - then only node1 ports need to be defined on volume1 storage.

I don't know the historical reason for IBM CSI using its own labels - but it does allow more flexibility: only a subset of pods, restricted by affinity, can have access to certain volumes (restricted by IBM CSI labels).

@dje4om

dje4om commented Feb 3, 2025

Hello @lechapitre,

Our case is a very standard one; the confusion probably comes from the way I tried to describe how to reproduce it.

We need zoning because both workers are physical nodes (this is a simplified example) located in different datacenters, each with its own storage array; this is why we need CSI topology.

We absolutely don't want a pod to be able to move from one zone to another, yet that is what actually happens. Of course, the pod can't run (CrashLoopBackOff) because it can't find its volume, but it tried, and that is the issue here.
The nodeAffinity field on the PersistentVolume is a potential improvement to prevent this natively.

I'm wondering how you define Deployments or StatefulSets to ensure the topology is respected - for example, do you add an explicit zone constraint to the pod template, something like the sketch below?
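
A minimal sketch of what I have in mind (all names, the image and the storage class are placeholders; it pins the pod template to zone1 via the native zone label):

# Sketch only: StatefulSet pinned to zone1 via the native Kubernetes zone label.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo-db
spec:
  serviceName: demo-db
  replicas: 1
  selector:
    matchLabels:
      app: demo-db
  template:
    metadata:
      labels:
        app: demo-db
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: zone1
      containers:
      - name: db
        image: registry.example.com/demo-db:latest   # placeholder image
        volumeMounts:
        - name: data
          mountPath: /var/lib/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: block-zone1   # placeholder per-zone class
      resources:
        requests:
          storage: 10Gi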

@lechapitre
Contributor

@dje4om ,

So the pod starts to run on node1 with access to volume1.
Then the pod is rescheduled to node2 and crashes when it tries to access volume1 - but you expect it to fail (or use volume2 instead) - correct?

Can you explain how K8S/OpenShift allows the pod to be rescheduled on node2 if the pod affinity is "node1 only"? (which you should have defined yourself with node annotations, not IBM CSI labels).

You asked about our implementation - we use the labels such that our pods (CSI and host-definer) only "see" volumes which are allowed by label. If topology doesn't allow a node instance of CSI/host-definer to access a volume - then it's simply as if the storage doesn't exist on that node.

@dje4om

dje4om commented Feb 10, 2025

Yes, but we don't expect it to fail or to create a new volume; we expect it to remain Pending.

We observed this behaviour on a StatefulSet with one replica during a node maintenance that we had to force because of a PDB. We expected the pod to remain Pending, but it kept crash-looping even after the node returned from maintenance.
I will work on an explicit example and scenario to help you reproduce this issue.
Of course, it could be a configuration issue on the pod side (as I've already said), but it would certainly be a prevention mechanism that improves reliability.

FYI, I have since checked another CSI driver (from a well-known vendor) that supports topologies, and there this nodeAffinity configuration is properly set on the PV.

Here is some additional information: https://kubernetes.io/docs/concepts/storage/persistent-volumes/#node-affinity
