[BUG] - JupyterLab Pods Fail to Mount Conda PV When on CUSTOM ami_type (AWS Only) #2832

Open
kenafoster opened this issue Nov 7, 2024 · 4 comments
kenafoster commented Nov 7, 2024

Describe the bug

After creating a node group using a custom AMI (a new feature as of 2024.9.1), I have not been able to successfully launch a JupyterLab pod on the new image.

Note that the ami_id value references an AMI of type AL2_x86_64_GPU, which is also used by a working (non-custom) GPU-backed node group elsewhere in the cluster.

amazon_web_services:
  node_groups:
    gpu-tesla-g4:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 5
      single_subnet: false
      gpu: true
      launch_template:
        ami_id: ami-xxxxxxxxxxxxxxx

Expected behavior

The JupyterLab pod should start up successfully.

OS and architecture in which you are running Nebari

Locally, it's macOS on Apple Silicon; I don't think that matters.

How to Reproduce the problem?

Create a node group with a custom image as in the description above, then try to launch a JupyterLab pod using a profile that forces the scheduler to place the pod on a node in that node group (see the profile sketch below).
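
For illustration, a minimal sketch of such a profile (not our exact config) that pins the pod to the node group via kubespawner_override. The node_selector key/value and the GPU limit are assumptions; the actual label depends on how Nebari names and labels the node group in your cluster:

profiles:
  jupyterlab:
    - display_name: GPU (custom AMI)
      description: JupyterLab pinned to the gpu-tesla-g4 node group
      kubespawner_override:
        # Assumed label: EKS managed node groups carry
        # eks.amazonaws.com/nodegroup=<node group name>, but the exact
        # value depends on how Nebari names the group.
        node_selector:
          eks.amazonaws.com/nodegroup: gpu-tesla-g4
        # KubeSpawner's extra_resource_limits requests the GPU so the
        # pod only fits on GPU-capable nodes.
        extra_resource_limits:
          nvidia.com/gpu: 1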

Command output

The JupyterLab startup logs below show that the KubeSpawner successfully triggers the scale-up for the custom node group and the pod is scheduled on it, but then the pod fails because it can't connect to the PV conda-store-dev-share.


2024-11-07T16:08:27.209362Z [Warning] 0/4 nodes are available: 1 node(s) were unschedulable, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
2024-11-07T16:08:58Z [Normal] pod triggered scale-up: [{eks-gpu-tesla-g4-c4c981a7-90d8-83c4-7f17-cad152930e82 0->1 (max: 5)}]
2024-11-07T16:11:09Z [Normal] pod didn't trigger scale-up: 5 node(s) didn't match Pod's node affinity/selector, 1 Insufficient nvidia.com/gpu
2024-11-07T16:11:25.708605Z [Normal] Successfully assigned dev/jupyter-xxx to ip-10-10-x-x.us-gov-west-1.compute.internal

2024-11-07T16:14:26Z [Warning] MountVolume.SetUp failed for volume "conda-store-dev-share" : mount failed: exit status 32 Mounting command: mount Mounting arguments: -t nfs 172.20.129.35:/ /var/lib/kubelet/pods/e6bfa491-198b-4a84-bade-183b1d6246f6/volumes/kubernetes.io~nfs/conda-store-dev-share Output: mount.nfs: Connection timed out
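
For context, a rough sketch of what the PV referenced above presumably looks like, reconstructed from the mount event (the name, NFS server IP, and path come from the log line; capacity and access mode are placeholders):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: conda-store-dev-share
spec:
  capacity:
    storage: 100Gi      # placeholder
  accessModes:
    - ReadOnlyMany      # placeholder
  nfs:
    # ClusterIP of the conda-store NFS Service; the kubelet on the
    # worker node must be able to reach this IP over NFS for the
    # mount to succeed.
    server: 172.20.129.35
    path: /

The mount.nfs: Connection timed out output suggests the node in the custom-AMI node group cannot reach that ClusterIP from the host, while nodes in the default node groups can.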

Versions and dependencies used.

Nebari 2024.9.1

EKS version 1.29

Compute environment

AWS

Integrations

No response

Anything else?

We believe the issue may be related to how the node is bootstrapped in the launch template's user data, as there are some discrepancies compared to the default template. The one flag we have already tried adding manually is --dns-cluster-ip $K8S_CLUSTER_DNS_IP, since the PV the pod cannot connect to uses an NFS endpoint exposed via a ClusterIP service. However, that didn't fix the issue (see the sketch below for where custom node startup logic lives in the config).
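
For reference, a hedged sketch of where custom node startup logic can be injected in the Nebari config, assuming the launch_template block also accepts a pre_bootstrap_command field as described in the Nebari launch-template docs. Note that --dns-cluster-ip itself is an argument to /etc/eks/bootstrap.sh in the generated user data, not something this field sets directly:

amazon_web_services:
  node_groups:
    gpu-tesla-g4:
      instance: g4dn.xlarge
      gpu: true
      launch_template:
        ami_id: ami-xxxxxxxxxxxxxxx
        # Assumed field: runs before the EKS bootstrap step in the
        # node's user data; useful for logging what the node can
        # reach during startup.
        pre_bootstrap_command: |
          #!/bin/bash
          echo "pre-bootstrap: node starting up" >> /var/log/nebari-prebootstrap.log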

@kenafoster added the type: bug 🐛 and needs: triage 🚦 labels on Nov 7, 2024
@viniciusdc added the provider: AWS and area: terraform 💾 labels and removed the needs: triage 🚦 label on Nov 7, 2024
dcmcand commented Nov 8, 2024

@kenafoster do you have an AMI ID that could be used to reproduce?

kenafoster commented Nov 8, 2024

ami-0f164f6722f3427d3 for AL2_x86_64_GPU in GovCloud; ami-03eb56f0cdfb4a82d for AL2_x86_64 (non-GPU) in GovCloud.

EDIT: must be GovCloud West, i.e. the us-gov-west-1 region.

viniciusdc commented

There is a high chance that the network interface needs to be correctly configured within the nodes when they are set up via the launch_template. Things that need to be checked here:

  • Validate the configuration of the default launch template that AWS applies automatically and compare it with the one generated for the launch_template nodes;
    • In particular, compare the eth0 interface configuration and the instance metadata settings.

viniciusdc commented

This error was mitigated in 2024.11.1rc2; a proper fix will be introduced in 2024.11.2.
