[BUG] - JupyterLab Pods Fail to Mount Conda PV When on CUSTOM ami_type (AWS Only) #2832

Open
kenafoster opened this issue Nov 7, 2024 · 4 comments
kenafoster commented Nov 7, 2024

Describe the bug

After creating a node group using a custom AMI (a new feature as of 2024.9.1), I have not been able to successfully launch a JupyterLab pod on the new image.

Note that the ami_id value references an AMI of type AL2_x86_64_GPU, which is also used by a working (non-custom) GPU-backed node group elsewhere in the cluster.

amazon_web_services:
  node_groups:
    gpu-tesla-g4:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 5
      single_subnet: false
      gpu: true
      launch_template:
        ami_id: ami-xxxxxxxxxxxxxxx

Expected behavior

The JupyterLab pod should start up successfully.

OS and architecture in which you are running Nebari

Locally, it's macOS on Apple Silicon; I don't think that matters.

How to Reproduce the problem?

Create a node group with a custom image as in the description above, then try to launch a JupyterLab pod using a profile that forces the scheduler to place the pod on a node in that node group (see the profile sketch below).
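
For illustration, a minimal sketch of such a profile (not our exact config) that pins the pod to the node group via kubespawner_override. The node_selector key/value and the GPU limit are assumptions; the actual label depends on how Nebari names and labels the node group in your cluster:

profiles:
  jupyterlab:
    - display_name: GPU (custom AMI)
      description: JupyterLab pinned to the gpu-tesla-g4 node group
      kubespawner_override:
        # Assumed label: EKS managed node groups carry
        # eks.amazonaws.com/nodegroup=<node group name>, but the exact
        # value depends on how Nebari names the group.
        node_selector:
          eks.amazonaws.com/nodegroup: gpu-tesla-g4
        # KubeSpawner's extra_resource_limits requests the GPU so the
        # pod only fits on GPU-capable nodes.
        extra_resource_limits:
          nvidia.com/gpu: 1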

Command output

The JupyterLab startup logs below show that the KubeSpawner successfully triggers the scale-up for the custom node group and the pod is scheduled on it, but then the pod fails because it can't connect to the PV conda-store-dev-share.


2024-11-07T16:08:27.209362Z [Warning] 0/4 nodes are available: 1 node(s) were unschedulable, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
2024-11-07T16:08:58Z [Normal] pod triggered scale-up: [{eks-gpu-tesla-g4-c4c981a7-90d8-83c4-7f17-cad152930e82 0->1 (max: 5)}]
2024-11-07T16:11:09Z [Normal] pod didn't trigger scale-up: 5 node(s) didn't match Pod's node affinity/selector, 1 Insufficient nvidia.com/gpu
2024-11-07T16:11:25.708605Z [Normal] Successfully assigned dev/jupyter-xxx to ip-10-10-x-x.us-gov-west-1.compute.internal

2024-11-07T16:14:26Z [Warning] MountVolume.SetUp failed for volume "conda-store-dev-share" : mount failed: exit status 32 Mounting command: mount Mounting arguments: -t nfs 172.20.129.35:/ /var/lib/kubelet/pods/e6bfa491-198b-4a84-bade-183b1d6246f6/volumes/kubernetes.io~nfs/conda-store-dev-share Output: mount.nfs: Connection timed out
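
For context, a rough sketch of what the PV referenced above presumably looks like, reconstructed from the mount event (the name, NFS server IP, and path come from the log line; capacity and access mode are placeholders):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: conda-store-dev-share
spec:
  capacity:
    storage: 100Gi      # placeholder
  accessModes:
    - ReadOnlyMany      # placeholder
  nfs:
    # ClusterIP of the conda-store NFS Service; the kubelet on the
    # worker node must be able to reach this IP over NFS for the
    # mount to succeed.
    server: 172.20.129.35
    path: /

The mount.nfs: Connection timed out output suggests the node in the custom-AMI node group cannot reach that ClusterIP from the host, while nodes in the default node groups can.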

Versions and dependencies used.

Nebari 2024.9.1

EKS version 1.29

Compute environment

AWS

Integrations

No response

Anything else?

We believe the issue may be related to how the node is bootstrapped in the launch template's user data, as there are some discrepancies compared to the default template. The one flag we have already tried adding manually is --dns-cluster-ip $K8S_CLUSTER_DNS_IP, since the PV the pod cannot connect to uses an NFS endpoint exposed via a ClusterIP service. However, that didn't fix the issue (see the sketch below for where custom node startup logic lives in the config).
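
For reference, a hedged sketch of where custom node startup logic can be injected in the Nebari config, assuming the launch_template block also accepts a pre_bootstrap_command field as described in the Nebari launch-template docs. Note that --dns-cluster-ip itself is an argument to /etc/eks/bootstrap.sh in the generated user data, not something this field sets directly:

amazon_web_services:
  node_groups:
    gpu-tesla-g4:
      instance: g4dn.xlarge
      gpu: true
      launch_template:
        ami_id: ami-xxxxxxxxxxxxxxx
        # Assumed field: runs before the EKS bootstrap step in the
        # node's user data; useful for logging what the node can
        # reach during startup.
        pre_bootstrap_command: |
          #!/bin/bash
          echo "pre-bootstrap: node starting up" >> /var/log/nebari-prebootstrap.log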

@kenafoster added the type: bug 🐛 and needs: triage 🚦 labels on Nov 7, 2024
@viniciusdc added the provider: AWS and area: terraform 💾 labels and removed the needs: triage 🚦 label on Nov 7, 2024
dcmcand commented Nov 8, 2024

@kenafoster do you have an AMI ID that could be used to reproduce?

kenafoster commented Nov 8, 2024

ami-0f164f6722f3427d3 for AL2_x86_64_GPU in GovCloud; ami-03eb56f0cdfb4a82d for AL2_x86_64 (non-GPU) in GovCloud.

EDIT: must be GovCloud West, i.e. the us-gov-west-1 region.

viniciusdc commented

There is a high chance that the network interface needs to be correctly configured within the nodes when they are set up via the launch_template. Things that need to be checked here:

  • Validate the configuration of the default launch template that AWS applies automatically and compare it with the one generated for the launch_template nodes;
    • In particular, compare the eth0 interface configuration and the instance metadata settings.

viniciusdc commented

This error was mitigated in 2024.11.1rc2; a proper fix will be introduced in 2024.11.2.
