Is AMI for OS: AL2023, GPU support: true, architecture: x86_64 missing? #1526

shashiranjan84 · 2024-12-13T18:52:08Z

What happened?

I do not see an entry for the AL2023 GPU-optimized AMI here, but AWS does provide an optimized AL2023 AMI for NVIDIA.

I am trying to update the K8s version from 1.29 to 1.31, and I have also updated Pulumi EKS from 2.2.1 to 3.4.0.

Example

const cluster = new eks.Cluster(
      `${regionalNamespace}-cluster`,
      {
        name: `${regionalNamespace}`,
        version: '1.31',
        vpcId: ...,
        privateSubnetIds: ...,
        publicSubnetIds: ...,
        enabledClusterLogTypes: ['api', 'audit', 'authenticator'],
        tags: projectTags,
        endpointPrivateAccess: true,
        endpointPublicAccess: true,
        nodeAssociatePublicIpAddress: false,
        providerCredentialOpts: {
          profileName: aws.config.profile,
        },
        roleMappings: [
          ...
        ],
        instanceType: 'g5.2xlarge',
        gpu: true,
        nodeRootVolumeSize: 200,
        ...,
      },
      { provider: ... },
    );

Output of `pulumi about`

CLI          
Version      3.142.0
Go Version   go1.23.3
Go Compiler  gc

Host     
OS       debian
Version  11.7
Arch     x86_64

Backend        
Name           fv-az1490-728
URL            s3://staging-pulumi-state-io
User           root
Organizations  
Token type     personal

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).


flostadler · 2024-12-16T17:22:16Z

Hey @shashiranjan84, sorry you're running into this! AL2023 with GPU support was added to EKS after we added AL2023 support to the provider.
I'll track this as an enhancement to add the missing AMI type. In the meantime, you can pass the AMI ID to the node group component to use AL2023 with GPU support, or alternatively use the Bottlerocket operating system.

shashiranjan84 · 2024-12-16T17:27:50Z

Thanks @flostadler. We deploy in multiple regions, so I was trying to avoid hardcoding any AMI ID.

flostadler · 2024-12-16T18:06:25Z

You do not have to hardcode the AMI id. You can retrieve the region specific AMI from SSM Parameter Store like this:

const ami = pulumi.interpolate`/aws/service/eks/optimized-ami/${cluster.eksCluster.version}/amazon-linux-2023/x86_64/nvidia/recommended/image_id`.apply(name =>
  aws.ssm.getParameter({ name }, { async: true })
).apply(result => result.value);
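
For multi-region setups it can help to build the parameter name in one place and resolve it once per regional provider. The following is only a minimal sketch: `al2023NvidiaAmiParameter` is a hypothetical helper name, not part of the provider, and the path layout is copied from the snippet above.

```typescript
// Hypothetical helper (not part of the provider): builds the SSM parameter
// name for the recommended AL2023 x86_64 NVIDIA AMI of a Kubernetes version.
function al2023NvidiaAmiParameter(k8sVersion: string): string {
  // Expect a "major.minor" version such as "1.31".
  if (!/^\d+\.\d+$/.test(k8sVersion)) {
    throw new Error(`expected a major.minor Kubernetes version, got "${k8sVersion}"`);
  }
  return `/aws/service/eks/optimized-ami/${k8sVersion}/amazon-linux-2023/x86_64/nvidia/recommended/image_id`;
}
```

The returned string can then be passed as `name` to `aws.ssm.getParameter` together with the provider of each region, so no AMI ID is hardcoded.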

flostadler · 2024-12-17T08:36:47Z

AWS added two new gpu capable optimized AMIs for AL2023. One is for Nvidia based instances, the other is for Neuron based instances (trn1, inf1, etc.).

Adding support for the Nvidia based one is rather easy, but before adding Neuron support we'll need to extend the AMI selection to be instance type aware. So far it's only architecture aware.
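
To illustrate what "instance type aware" selection could look like, here is a rough sketch. It is not the provider's implementation: the function name and the instance family lists are illustrative assumptions and far from exhaustive. The AMI type strings follow the naming used by the EKS `CreateNodegroup` API.

```typescript
// AMI types as named by the EKS CreateNodegroup API (AL2023 subset only).
type Al2023AmiType =
  | "AL2023_x86_64_STANDARD"
  | "AL2023_ARM_64_STANDARD"
  | "AL2023_x86_64_NVIDIA"
  | "AL2023_x86_64_NEURON";

// Assumption: illustrative, non-exhaustive family lists.
const NEURON_FAMILIES: string[] = ["trn1", "inf1", "inf2"];
const NVIDIA_FAMILIES: string[] = ["g4dn", "g5", "g6", "p3", "p4d", "p5"];

// Extracts the family from an instance type, e.g. "g5" from "g5.2xlarge".
function instanceFamily(instanceType: string): string {
  const dot = instanceType.indexOf(".");
  return dot === -1 ? instanceType : instanceType.slice(0, dot);
}

// Hypothetical selection: architecture first, then accelerator family.
function selectAl2023AmiType(
  instanceType: string,
  arch: "x86_64" | "arm64",
): Al2023AmiType {
  if (arch === "arm64") {
    // This sketch does not model accelerated arm64 instances.
    return "AL2023_ARM_64_STANDARD";
  }
  const family = instanceFamily(instanceType);
  if (NEURON_FAMILIES.indexOf(family) !== -1) {
    return "AL2023_x86_64_NEURON";
  }
  if (NVIDIA_FAMILIES.indexOf(family) !== -1) {
    return "AL2023_x86_64_NVIDIA";
  }
  return "AL2023_x86_64_STANDARD";
}
```

With only architecture awareness, `g5.2xlarge` and `trn1.2xlarge` would both resolve to the same x86_64 AMI type, which is why the Neuron case needs the extra lookup.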

shashiranjan84 · 2024-12-17T15:35:15Z

> AWS added two new gpu capable optimized AMIs for AL2023. One is for Nvidia based instances, the other is for Neuron based instances (trn1, inf1, etc.).
>
> Adding support for the Nvidia based one is rather easy, but before adding Neuron support we'll need to extend the AMI selection to be instance type aware. So far it's only architecture aware.

Makes sense

shashiranjan84 · 2024-12-17T15:44:46Z

> You do not have to hardcode the AMI id. You can retrieve the region specific AMI from SSM Parameter Store like this:
>
> const ami = pulumi.interpolate`/aws/service/eks/optimized-ami/${cluster.eksCluster.version}/amazon-linux-2023/x86_64/nvidia/recommended/image_id`.apply(name =>
>   aws.ssm.getParameter({ name }, { async: true })
> ).apply(result => result.value);

I was trying to switch to the AL2023 GPU AMI, but after updating I am not seeing any nodes. I was expecting a rolling update of the nodes, but now there are no nodes at all.

const EKS_VERSION = '1.29';
const ami = pulumi.interpolate`/aws/service/eks/optimized-ami/${EKS_VERSION}/amazon-linux-2023/x86_64/nvidia/recommended/image_id`.apply((name) =>
      aws.ssm.getParameter({ name,  }, { async: true, provider: ... }),
    ).apply((result) => result.value);

const cluster = new eks.Cluster(
      `${regionalNamespace}-cluster`,
      {
        name: `${regionalNamespace}`,
        version: EKS_VERSION,
        vpcId: ...,
        privateSubnetIds: ...,
        publicSubnetIds: ...,
        enabledClusterLogTypes: ['api', 'audit', 'authenticator'],
        tags: projectTags,
        endpointPrivateAccess: true,
        endpointPublicAccess: true,
        nodeAssociatePublicIpAddress: false,
        providerCredentialOpts: {
          profileName: aws.config.profile,
        },
        roleMappings: [
          ...
        ],
        instanceType: 'g5.2xlarge',
        // gpu: true,
        nodeAmiId: ami,
        nodeRootVolumeSize: 200,
        ...,
      },
      { provider: ... },
    );

shashiranjan84 · 2024-12-17T16:04:00Z

For context, we are currently on Kubernetes version 1.29 and trying to upgrade to 1.31. The EKS and Kubernetes plugin versions are 2.2.1 and 4.8.1 respectively, and we are also planning to upgrade those to the latest. What would be the best migration approach to avoid downtime?

flostadler · 2024-12-19T18:13:34Z

@shashiranjan84 the EKS provider 2.x.x does not support AL2023 and Bottlerocket. You'll need to upgrade to version 3 of the provider.

Self-managed node groups (like the cluster's default node group) generally require more careful handling to guarantee updates without downtime. If possible, I'd recommend switching to either managed node groups or EKS Auto Mode instead.

I'd recommend first upgrading to EKS provider version 3 by following this guide: https://www.pulumi.com/registry/packages/eks/how-to-guides/v3-migration. It shouldn't replace your existing node groups if you set the operatingSystem to AL2.

shashiranjan84 · 2024-12-19T23:14:11Z

After updating to EKS 3.5 and updating the default node group (after setting the AMI ID), we are seeing this error at the end of the deployment:

 Error: unknown resource type urn:pulumi:main.bob-infra.staging::bob-infra::eks:index:Cluster$eks:index:VpcCni::us-east-1-staging-bob-vpc-cni: Error: unknown resource type urn:pulumi:main.bob-infra.staging::bob-infra::eks:index:Cluster$eks:index:VpcCni::us-east-1-staging-bob-vpc-cni

flostadler · 2024-12-20T09:30:15Z

@shashiranjan84 this sounds like a separate problem. Can you please open another issue for this and include code and steps to reproduce it? Thanks a lot!
Given that you're on an older 2.x version, you might have to update to a more recent 2.x version first (e.g. v2.8.1). I suspect this is related to bug #1087, which was fixed in v2.7.2.

Anyways, let's take this to a new issue and we can dig into it! Feel free to tag me there

shashiranjan84 · 2024-12-21T17:35:35Z

@flostadler I also noticed that when I create a managed node group with an amiId, the worker node instances do not have the same AMI ID.

const EKS_VERSION = '1.30';
const ami = pulumi.interpolate`/aws/service/eks/optimized-ami/${EKS_VERSION}/amazon-linux-2023/x86_64/nvidia/recommended/image_id`.apply((name) =>
      aws.ssm.getParameter({ name,  }, { async: true, provider: ... }),
    ).apply((result) => result.value);

const cluster = new eks.Cluster(
      `${regionalNamespace}-cluster`,
      {
        name: `${regionalNamespace}`,
        version: EKS_VERSION,
        vpcId: ...,
        privateSubnetIds: ...,
        publicSubnetIds: ...,
        enabledClusterLogTypes: ['api', 'audit', 'authenticator'],
        tags: projectTags,
        endpointPrivateAccess: true,
        endpointPublicAccess: true,
        nodeAssociatePublicIpAddress: false,
        providerCredentialOpts: {
          profileName: aws.config.profile,
        },
        roleMappings: [
          ...
        ],
        instanceType: 'g5.2xlarge',
        // gpu: true,
        nodeAmiId: ami,
        nodeRootVolumeSize: 200,
        ...,
      },
      { provider: ... },
    );

    new eks.ManagedNodeGroup(
      `${regionalNamespace}-ng`,
      {
        cluster,
        amiId:  ami,
        gpu: true,
        instanceTypes: ['g5.xlarge'],
        ignoreScalingChanges: true,
        scalingConfig: {
          minSize: 1,
          maxSize: 2,
          desiredSize: DESIRED_RESOURCE_COUNT[config.env].node.desiredCapacity,
        },
        diskSize: 200,
        nodeRole: cluster.instanceRoles[0],
        enableIMDSv2: true,
        labels: {
          'transcend.io/k8s_version': '1.31'
        },
        tags: {
          ...projectTags,
          'k8s.io/cluster-autoscaler/enabled': 'true',
          [`k8s.io/cluster-autoscaler/${regionalNamespace}`]: 'owned',
        },
      },
      { provider: MAIN_REGION_PROVIDERS[mainRegion] },
    );

Is that expected?

flostadler · 2024-12-23T09:06:51Z

@shashiranjan84 Only the nodes that are part of the managed node group will have that AMI ID. The nodes that are part of the default self-managed node group will have a different AMI ID.
If you see something different, please open another issue so we can dig into it. Thanks!

This change adds support for the AL2023 x86_64 GPU optimized AMI. See [AWS docs](https://docs.aws.amazon.com/eks/latest/userguide/retrieve-ami-id.html) for a list of supported AMIs. The AMI type (`AL2023_x86_64_NVIDIA`) is taken from the [AWS API schema](https://docs.aws.amazon.com/eks/latest/APIReference/API_CreateNodegroup.html#AmazonEKS-CreateNodegroup-request-amiType). Note: adding support for the Neuron based AMI type is tracked in #1526. This will require making the AMI selection instance type aware. Relates to #1526

shashiranjan84 · 2024-12-23T14:35:00Z

Here I am explicitly providing the same AMI ID for both the self-managed and the managed node group, assuming both will come up with the same GPU-optimized AMI. Even when I hardcoded the AMI ID in the managed node group, the worker node in the managed group showed a different AMI ID, as if the AMI ID property were being ignored completely.

flostadler · 2024-12-24T09:20:15Z

FYI, version v3.6.0 was released with support for NVIDIA-based GPUs on AL2023. I'm closing this issue for now and have opened #1561 to track adding Neuron support.

@shashiranjan84 I'll continue looking into the other issues you've opened, but you no longer need to use the AMI override. The provider should now select the appropriate AMI for all instances with NVIDIA GPUs.

shashiranjan84 · 2024-12-24T13:20:53Z

Thanks a lot!

shashiranjan84 added the kind/bug (Some behavior is incorrect or out of spec) and needs-triage (Needs attention from the triage team) labels Dec 13, 2024

flostadler added the kind/enhancement (Improvements or new features) label and removed the needs-triage and kind/bug labels Dec 16, 2024

flostadler mentioned this issue Dec 17, 2024

Add Nvidia GPU optimized AL2023 AMI #1534

Merged

flostadler added the resolution/fixed This issue was fixed label Dec 24, 2024

flostadler self-assigned this Dec 24, 2024

flostadler closed this as completed Dec 24, 2024
