Describe the bug
After creating a node group that uses a custom AMI (a new feature as of 2024.9.1), I have not been able to successfully launch a JupyterLab pod on the new image.
Note that the ami_id value references an AMI of type AL2_x86_64_GPU, which is also used by a working (non-custom) GPU-backed node group elsewhere in the cluster.
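For reference, this is roughly how we sanity-checked that the custom ami_id really is the EKS-optimized AL2 GPU image for our Kubernetes version. This is a hedged sketch: the AMI ID is a placeholder, and the SSM parameter path is the one documented for commercial regions, so it may need adjusting if it is not published in your partition.

```bash
# Recommended EKS-optimized AL2 GPU AMI for Kubernetes 1.29 (if the parameter exists in this partition)
aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.29/amazon-linux-2-gpu/recommended/image_id \
  --region us-gov-west-1 \
  --query 'Parameter.Value' --output text

# Inspect the AMI actually referenced by the custom node group (placeholder AMI ID)
aws ec2 describe-images \
  --image-ids ami-0123456789abcdef0 \
  --query 'Images[0].{Name:Name,Description:Description}'
```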
Expected behavior
The JupyterLab pod should start up successfully.
OS and architecture in which you are running Nebari
Locally, macOS on Apple Silicon, though I don't think that matters here.
How to Reproduce the problem?
Create a node group with a custom image as in the description above, then try to launch a JupyterLab pod using a profile that requires the scheduler to place the pod on a node in this node group.
Command output
The JupyterLab startup logs below show that KubeSpawner successfully triggers the scale-up for the custom node group and the pod is scheduled on it, but the pod then fails because it cannot connect to the PV conda-store-dev-share.
2024-11-07T16:08:27.209362Z [Warning] 0/4 nodes are available: 1 node(s) were unschedulable, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
2024-11-07T16:08:58Z [Normal] pod triggered scale-up: [{eks-gpu-tesla-g4-c4c981a7-90d8-83c4-7f17-cad152930e82 0->1 (max: 5)}]
2024-11-07T16:11:09Z [Normal] pod didn't trigger scale-up: 5 node(s) didn't match Pod's node affinity/selector, 1 Insufficient nvidia.com/gpu
2024-11-07T16:11:25.708605Z [Normal] Successfully assigned dev/jupyter-xxx to ip-10-10-x-x.us-gov-west-1.compute.internal
2024-11-07T16:14:26Z [Warning] MountVolume.SetUp failed for volume "conda-store-dev-share": mount failed: exit status 32 Mounting command: mount Mounting arguments: -t nfs 172.20.129.35:/ /var/lib/kubelet/pods/e6bfa491-198b-4a84-bade-183b1d6246f6/volumes/kubernetes.io~nfs/conda-store-dev-share Output: mount.nfs: Connection timed out
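For additional context: the failing mount targets a ClusterIP (172.20.129.35), which a node can only reach once kube-proxy has programmed its iptables rules there. A rough way to check this from the affected node is sketched below, assuming SSH/SSM access to the node; the IP, port, and node name are taken from the logs above (the node name is still the redacted placeholder).

```bash
# Is the NFS ClusterIP reachable from the node at all? (2049 = NFS)
timeout 5 bash -c 'cat < /dev/null > /dev/tcp/172.20.129.35/2049' && echo reachable || echo timed-out

# Has kube-proxy written any rules for this service on the node?
sudo iptables-save | grep 172.20.129.35

# Are kube-proxy and aws-node actually running on the new node?
kubectl get pods -n kube-system -o wide \
  --field-selector spec.nodeName=ip-10-10-x-x.us-gov-west-1.compute.internal
```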
Versions and dependencies used.
Nebari 2024.9.1
EKS version 1.29
Compute environment
AWS
Integrations
No response
Anything else?
We believe the issue may be related to how the node is bootstrapped in the user data of the launch template, as there are some discrepancies. The one flag we have already tried adding manually is --dns-cluster-ip $K8S_CLUSTER_DNS_IP, since the PV the pod cannot connect to is an NFS endpoint exposed via a ClusterIP service. However, that did not fix the issue. A sketch of the bootstrap invocation we are referring to is shown below.
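This is roughly the shape of the user data being compared; it is a hedged sketch, not the actual template Nebari renders. The cluster name, API endpoint, CA, DNS IP, and node label values are placeholders.

```bash
#!/bin/bash
set -ex

# Placeholder values -- the real launch template injects these
CLUSTER_NAME="dev"
API_SERVER_URL="https://EXAMPLE.gr7.us-gov-west-1.eks.amazonaws.com"
B64_CLUSTER_CA="BASE64_ENCODED_CA"
K8S_CLUSTER_DNS_IP="172.20.0.10"

# /etc/eks/bootstrap.sh ships with the EKS-optimized AL2 AMIs and joins the node to the cluster.
# --dns-cluster-ip is the flag we tried adding manually.
/etc/eks/bootstrap.sh "${CLUSTER_NAME}" \
  --apiserver-endpoint "${API_SERVER_URL}" \
  --b64-cluster-ca "${B64_CLUSTER_CA}" \
  --dns-cluster-ip "${K8S_CLUSTER_DNS_IP}" \
  --kubelet-extra-args "--node-labels=eks.amazonaws.com/nodegroup=gpu-tesla-g4"
```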
There is a high chance that the network adapter needs to be correctly configured on the nodes when using the launch_template to set them up. Things that need to be checked here (see the sketch after this list):
- Validate the configuration of the default launch template that AWS generates automatically and compare it with the one used by the launch_template nodes.
- Compare the eth0 interface configuration and the instance metadata settings.
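As a starting point for that comparison, a rough sketch is below; the launch template IDs are placeholders (the default template ID for a managed node group can be read from the node group's describe output), and the metadata check assumes IMDSv2 is enforced.

```bash
# Dump and decode the user data of both launch templates, then diff them.
dump_user_data () {
  aws ec2 describe-launch-template-versions \
    --launch-template-id "$1" --versions '$Latest' \
    --query 'LaunchTemplateVersions[0].LaunchTemplateData.UserData' \
    --output text | base64 --decode
}

dump_user_data lt-CUSTOM_TEMPLATE_ID  > custom-user-data.txt
dump_user_data lt-DEFAULT_TEMPLATE_ID > default-user-data.txt
diff -u default-user-data.txt custom-user-data.txt

# On the node itself: confirm eth0 and instance metadata look sane.
ip addr show eth0
TOKEN=$(curl -sX PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/local-ipv4
```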