Is AMI for OS: AL2023, GPU support: true, architecture: x86_64 missing? #1526

Closed
shashiranjan84 opened this issue Dec 13, 2024 · 15 comments
Labels
kind/enhancement Improvements or new features resolution/fixed This issue was fixed

@shashiranjan84

shashiranjan84 commented Dec 13, 2024

What happened?

I do not see an entry for the AL2023 GPU-optimized AMI here, but I do see that AWS has an optimized AMI for NVIDIA.


I am trying to upgrade Kubernetes from 1.29 to 1.31, and I have also updated Pulumi EKS from 2.2.1 to 3.4.0.

Example

const cluster = new eks.Cluster(
      `${regionalNamespace}-cluster`,
      {
        name: `${regionalNamespace}`,
        version: '1.31',
        vpcId: ...,
        privateSubnetIds: ...,
        publicSubnetIds: ...,
        enabledClusterLogTypes: ['api', 'audit', 'authenticator'],
        tags: projectTags,
        endpointPrivateAccess: true,
        endpointPublicAccess: true,
        nodeAssociatePublicIpAddress: false,
        providerCredentialOpts: {
          profileName: aws.config.profile,
        },
        roleMappings: [
          ...
        ],
        instanceType: 'g5.2xlarge',
        gpu: true,
        nodeRootVolumeSize: 200,
        ...,
      },
      { provider: ... },
    );

Output of pulumi about

CLI          
Version      3.142.0
Go Version   go1.23.3
Go Compiler  gc

Host     
OS       debian
Version  11.7
Arch     x86_64

Backend        
Name           fv-az1490-728
URL            s3://staging-pulumi-state-io
User           root
Organizations  
Token type     personal

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@shashiranjan84 shashiranjan84 added kind/bug Some behavior is incorrect or out of spec needs-triage Needs attention from the triage team labels Dec 13, 2024
@flostadler
Contributor

Hey @shashiranjan84, sorry you're running into this! AL2023 with GPU support was added to EKS after we added AL2023 support to the provider.
I'll track this as an enhancement to add the missing AMI type. In the meantime, you can pass the AMI ID to the node group component to use AL2023 with GPU support, or alternatively use the Bottlerocket operating system.

@flostadler flostadler added kind/enhancement Improvements or new features and removed needs-triage Needs attention from the triage team kind/bug Some behavior is incorrect or out of spec labels Dec 16, 2024
@shashiranjan84
Author

Thanks @flostadler. We deploy in multiple regions, so I was trying to avoid hardcoding any AMI ID.

@flostadler
Contributor

You do not have to hardcode the AMI id. You can retrieve the region specific AMI from SSM Parameter Store like this:

const ami = pulumi.interpolate`/aws/service/eks/optimized-ami/${cluster.eksCluster.version}/amazon-linux-2023/x86_64/nvidia/recommended/image_id`.apply(name =>
  aws.ssm.getParameter({ name }, { async: true })
).apply(result => result.value);
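
For multi-region setups it can help to build the parameter name in one place. A minimal sketch, assuming the path layout from the snippet above; the helper name `al2023NvidiaAmiParameter` is hypothetical and not part of the provider:

```typescript
// Hypothetical helper (not a Pulumi or AWS API). The path layout is copied
// from the SSM parameter name in the snippet above.
function al2023NvidiaAmiParameter(kubernetesVersion: string): string {
  // Assumption: the path expects a "major.minor" version such as "1.31".
  if (!/^\d+\.\d+$/.test(kubernetesVersion)) {
    throw new Error(`expected a "major.minor" Kubernetes version, got "${kubernetesVersion}"`);
  }
  return `/aws/service/eks/optimized-ami/${kubernetesVersion}/amazon-linux-2023/x86_64/nvidia/recommended/image_id`;
}

// The result is what would be passed as `name` to aws.ssm.getParameter,
// using a provider for the target region.
console.log(al2023NvidiaAmiParameter("1.31"));
```

The parameter name is the same in every region; the lookup returns the region-specific AMI ID because it goes through that region's provider.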

@flostadler
Contributor

AWS added two new gpu capable optimized AMIs for AL2023. One is for Nvidia based instances, the other is for Neuron based instances (trn1, inf1, etc.).

Adding support for the Nvidia based one is rather easy, but before adding Neuron support we'll need to extend the AMI selection to be instance type aware. So far it's only architecture aware.
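
To illustrate what "instance type aware" selection could look like, here is a rough sketch. The family lists are illustrative and incomplete, and the function name is hypothetical; this is not how the provider implements it:

```typescript
type AmiVariant = "standard" | "nvidia" | "neuron";

// Assumption: small, hand-picked family lists for illustration only.
const NVIDIA_FAMILIES = new Set(["g4dn", "g5", "p3", "p4d", "p5"]);
const NEURON_FAMILIES = new Set(["trn1", "inf1", "inf2"]);

// Picks an AMI variant from the instance family (the part before the first dot).
function amiVariantFor(instanceType: string): AmiVariant {
  const [family = ""] = instanceType.toLowerCase().split(".");
  if (NVIDIA_FAMILIES.has(family)) return "nvidia";
  if (NEURON_FAMILIES.has(family)) return "neuron";
  return "standard";
}

console.log(amiVariantFor("g5.2xlarge"), amiVariantFor("trn1.2xlarge"), amiVariantFor("m5.large"));
```

Matching on the exact family rather than a prefix avoids misclassifying families that share a leading letter but use different accelerators.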

@shashiranjan84
Author

> AWS added two new gpu capable optimized AMIs for AL2023. One is for Nvidia based instances, the other is for Neuron based instances (trn1, inf1, etc.).
>
> Adding support for the Nvidia based one is rather easy, but before adding Neuron support we'll need to extend the AMI selection to be instance type aware. So far it's only architecture aware.

Makes sense

@shashiranjan84
Author

> You do not have to hardcode the AMI id. You can retrieve the region specific AMI from SSM Parameter Store like this:
>
> const ami = pulumi.interpolate`/aws/service/eks/optimized-ami/${cluster.eksCluster.version}/amazon-linux-2023/x86_64/nvidia/recommended/image_id`.apply(name =>
>   aws.ssm.getParameter({ name }, { async: true })
> ).apply(result => result.value);

I was trying to switch to the AL2023 GPU AMI, but after updating I am not seeing any nodes. I was expecting a rolling update of the nodes, but now there are no nodes at all.

const EKS_VERSION = '1.29';
const ami = pulumi.interpolate`/aws/service/eks/optimized-ami/${EKS_VERSION}/amazon-linux-2023/x86_64/nvidia/recommended/image_id`.apply((name) =>
      aws.ssm.getParameter({ name,  }, { async: true, provider: ... }),
    ).apply((result) => result.value);

const cluster = new eks.Cluster(
      `${regionalNamespace}-cluster`,
      {
        name: `${regionalNamespace}`,
        version: EKS_VERSION,
        vpcId: ...,
        privateSubnetIds: ...,
        publicSubnetIds: ...,
        enabledClusterLogTypes: ['api', 'audit', 'authenticator'],
        tags: projectTags,
        endpointPrivateAccess: true,
        endpointPublicAccess: true,
        nodeAssociatePublicIpAddress: false,
        providerCredentialOpts: {
          profileName: aws.config.profile,
        },
        roleMappings: [
          ...
        ],
        instanceType: 'g5.2xlarge',
        // gpu: true,
        nodeAmiId: ami,
        nodeRootVolumeSize: 200,
        ...,
      },
      { provider: ... },
    );


@shashiranjan84
Author

To give some context, we are currently on Kubernetes 1.29 and trying to upgrade to 1.31. The EKS and Kubernetes plugin versions are 2.2.1 and 4.8.1 respectively, and we are also planning to upgrade them to the latest versions. What would be the best migration approach to avoid downtime?

@flostadler
Contributor

@shashiranjan84 the EKS provider 2.x.x does not support AL2023 and Bottlerocket. You'll need to upgrade to version 3 of the provider.

Self-managed node groups (like the cluster's default node group) generally require more careful handling to guarantee zero-downtime updates. If possible, I'd recommend switching to either managed node groups or EKS Auto Mode instead.

I'd recommend first upgrading to EKS provider version 3 by following this guide: https://www.pulumi.com/registry/packages/eks/how-to-guides/v3-migration. It shouldn't replace your existing node groups if you set the operatingSystem to AL2.
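
The version constraint described here can be written down as a small pre-flight check. This is only a sketch of the rule as stated in this thread (2.x: AL2 only; 3.x: AL2023 and Bottlerocket as well); the function and type names are hypothetical:

```typescript
type NodeOs = "AL2" | "AL2023" | "Bottlerocket";

// Hypothetical check, not a provider API. Encodes the rule from this thread:
// AL2023 and Bottlerocket need provider major version 3 or later.
function supportsOperatingSystem(providerVersion: string, os: NodeOs): boolean {
  const [majorText = ""] = providerVersion.replace(/^v/, "").split(".");
  const major = Number.parseInt(majorText, 10);
  if (Number.isNaN(major)) {
    throw new Error(`cannot parse provider version "${providerVersion}"`);
  }
  // Assumption: AL2 is available in both 2.x and 3.x, as discussed above.
  if (os === "AL2") return true;
  return major >= 3;
}

console.log(supportsOperatingSystem("2.2.1", "AL2023")); // 2.x cannot use AL2023
console.log(supportsOperatingSystem("3.4.0", "AL2023"));
```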

@shashiranjan84
Author

After updating to EKS 3.5 and updating the default node group (after setting the AMI ID), we are seeing this error at the end of the deployment:

 Error: unknown resource type urn:pulumi:main.bob-infra.staging::bob-infra::eks:index:Cluster$eks:index:VpcCni::us-east-1-staging-bob-vpc-cni: Error: unknown resource type urn:pulumi:main.bob-infra.staging::bob-infra::eks:index:Cluster$eks:index:VpcCni::us-east-1-staging-bob-vpc-cni

@flostadler
Contributor

flostadler commented Dec 20, 2024

@shashiranjan84 this sounds like a separate problem. Can you please open another issue for this and include code and steps to reproduce it? Thanks a lot!
Given that you're on an older 2.x version, you might first have to update to a more recent one (e.g. v2.8.1). I suspect this is related to bug #1087, which was fixed in v2.7.2.
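
A quick way to check whether an installed version already contains a given fix is a plain version comparison. The sketch below uses hypothetical helper names and only handles simple `major.minor.patch` versions with an optional `v` prefix:

```typescript
// Parses "2.7.2" or "v2.7.2" into numeric parts; pre-release tags are not handled.
function parseVersion(version: string): [number, number, number] {
  const match = /^v?(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (!match) {
    throw new Error(`not a plain major.minor.patch version: "${version}"`);
  }
  return [Number(match[1]), Number(match[2]), Number(match[3])];
}

// Returns true when `installed` is the same as or newer than `required`.
function isAtLeast(installed: string, required: string): boolean {
  const [aMajor, aMinor, aPatch] = parseVersion(installed);
  const [bMajor, bMinor, bPatch] = parseVersion(required);
  if (aMajor !== bMajor) return aMajor > bMajor;
  if (aMinor !== bMinor) return aMinor > bMinor;
  return aPatch >= bPatch;
}

// 2.2.1 (the version mentioned earlier in this thread) predates v2.7.2.
console.log(isAtLeast("2.2.1", "2.7.2"));
console.log(isAtLeast("v2.8.1", "2.7.2"));
```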

Anyways, let's take this to a new issue and we can dig into it! Feel free to tag me there

@shashiranjan84
Author

@flostadler I also noticed that when I create a managed node group with an amiId, the worker node instances do not have the same AMI ID.

const EKS_VERSION = '1.30';
const ami = pulumi.interpolate`/aws/service/eks/optimized-ami/${EKS_VERSION}/amazon-linux-2023/x86_64/nvidia/recommended/image_id`.apply((name) =>
      aws.ssm.getParameter({ name,  }, { async: true, provider: ... }),
    ).apply((result) => result.value);

const cluster = new eks.Cluster(
      `${regionalNamespace}-cluster`,
      {
        name: `${regionalNamespace}`,
        version: EKS_VERSION,
        vpcId: ...,
        privateSubnetIds: ...,
        publicSubnetIds: ...,
        enabledClusterLogTypes: ['api', 'audit', 'authenticator'],
        tags: projectTags,
        endpointPrivateAccess: true,
        endpointPublicAccess: true,
        nodeAssociatePublicIpAddress: false,
        providerCredentialOpts: {
          profileName: aws.config.profile,
        },
        roleMappings: [
          ...
        ],
        instanceType: 'g5.2xlarge',
        // gpu: true,
        nodeAmiId: ami,
        nodeRootVolumeSize: 200,
        ...,
      },
      { provider: ... },
    );

    new eks.ManagedNodeGroup(
      `${regionalNamespace}-ng`,
      {
        cluster,
        amiId:  ami,
        gpu: true,
        instanceTypes: ['g5.xlarge'],
        ignoreScalingChanges: true,
        scalingConfig: {
          minSize: 1,
          maxSize: 2,
          desiredSize: DESIRED_RESOURCE_COUNT[config.env].node.desiredCapacity,
        },
        diskSize: 200,
        nodeRole: cluster.instanceRoles[0],
        enableIMDSv2: true,
        labels: {
          'transcend.io/k8s_version': '1.31'
        },
        tags: {
          ...projectTags,
          'k8s.io/cluster-autoscaler/enabled': 'true',
          [`k8s.io/cluster-autoscaler/${regionalNamespace}`]: 'owned',
        },
      },
      { provider: MAIN_REGION_PROVIDERS[mainRegion] },
    );

Is that expected?

@flostadler
Contributor

@shashiranjan84 Only the nodes that are part of the managed node group will have that AMI ID. The nodes that are part of the default self-managed node group will have a different AMI ID.
If you see something different, please open another issue so we can dig into it. Thanks!

flostadler added a commit that referenced this issue Dec 23, 2024
This change adds support for the AL2023 x86_64 GPU optimized AMI. See
[AWS docs](https://docs.aws.amazon.com/eks/latest/userguide/retrieve-ami-id.html)
for a list of supported AMIs.

The AMI type (`AL2023_x86_64_NVIDIA`) is taken from the [AWS API
schema](https://docs.aws.amazon.com/eks/latest/APIReference/API_CreateNodegroup.html#AmazonEKS-CreateNodegroup-request-amiType).

Note: adding support for the Neuron based AMI type is tracked in
#1526. This will require
making the AMI selection instance type aware.

Relates to #1526
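
As an illustration of the AMI type named in the commit message, the mapping from architecture and GPU flag to an EKS `amiType` value could be sketched as follows. The function name is hypothetical and the sketch covers only the AL2023 types it lists, not Neuron:

```typescript
type Arch = "x86_64" | "arm64";

// Hypothetical mapping, not the provider's code. The returned strings are
// amiType values from the EKS CreateNodegroup API.
function al2023AmiType(arch: Arch, gpu: boolean): string {
  if (gpu) {
    // Assumption: this sketch only knows the x86_64 NVIDIA type from the commit above.
    if (arch !== "x86_64") {
      throw new Error("this sketch only covers the x86_64 NVIDIA AMI type");
    }
    return "AL2023_x86_64_NVIDIA";
  }
  return arch === "x86_64" ? "AL2023_x86_64_STANDARD" : "AL2023_ARM_64_STANDARD";
}

console.log(al2023AmiType("x86_64", true));
```
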
@shashiranjan84
Author

Here I am explicitly providing the same node AMI ID for both the self-managed and the managed node group, assuming they will both come up with the same GPU-optimized AMI. Even when I hardcoded the AMI ID in the managed node group, the worker node in the managed group showed a different AMI ID, as if the AMI ID property were completely ignored.

@flostadler flostadler added the resolution/fixed This issue was fixed label Dec 24, 2024
@flostadler flostadler self-assigned this Dec 24, 2024
@flostadler
Contributor

flostadler commented Dec 24, 2024

FYI, v3.6.0 was released with support for NVIDIA-based GPUs on AL2023. I'm closing this issue for now and have opened #1561 for adding Neuron support.

@shashiranjan84 I'll continue looking into the other issues you've opened, but you won't need the AMI override anymore. The provider should now select the appropriate AMI for all instances with NVIDIA GPUs.

@shashiranjan84
Author

Thanks a lot!
