Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: High init duration when deployed through SAM. #6715

Closed
AmruthjithSunil opened this issue Feb 20, 2024 · 5 comments
Closed

Bug: High init duration when deployed through SAM. #6715

AmruthjithSunil opened this issue Feb 20, 2024 · 5 comments
Labels
blocked/close-if-inactive Blocked for >14 days with no response, will be closed if still inactive after 7 days blocked/more-info-needed More info is needed from the requester. If no response in 14 days, it will become stale. stage/bug-repro The issue/bug needs to be reproduced

Comments

@AmruthjithSunil
Copy link

AmruthjithSunil commented Feb 20, 2024

Description:

When deployed through sam the init duration is very high(5s) but when we deploy manually it takes less than 1s. Functionally its working fine in sam deployment. Only issue is the high init duration. We changed only layers in sam with the layer we had in manual deployment then it was working as expected(low init duration).

Steps to reproduce:

  • Deploy an AWS Lambda function using AWS SAM.
  • Measure the initialization duration of the Lambda function.
  • Deploy the same Lambda function manually using the AWS CLI.
  • Measure the initialization duration of the manually deployed Lambda function.
  • Compare the initialization durations between the two deployment methods.

sam deploy commands:
1 sam build --use-container -t ./template.yaml
2 sam deploy --config-env prod

with and without docker(--use-container) we tried
We have 3 more custom util py files in layer.

template.yaml

Transform: AWS::Serverless-2016-10-31
Description:  stack

Parameters:
  environment:
    Type: String
    Default: dev
    Description: The environment to deploy to
    AllowedValues:
      - dev
      - prod
    ConstraintDescription: must specify dev or prod

Globals:
  Function:
    Environment:
      Variables:
        environment: !Sub ${environment}
        region_name: !Ref AWS::Region
    EphemeralStorage:
      Size: 512
    Timeout: 5
    Handler: lambda_function.lambda_handler
    Runtime: python3.11
    Architectures:
      - arm64
    Layers:
      - !Ref LayerBase
    MemorySize: 256

Resources:

  LayerBase:
    Type: AWS::Serverless::LayerVersion
    Properties:
      LayerName: -base
      Description:  base layer
      ContentUri: './utils'
      CompatibleRuntimes:
        - python3.11
      CompatibleArchitectures:
        - arm64
    Metadata:
      PackageType: Zip
      BuildMethod: python3.11
      BuildArchitecture: arm64
    #  BuildPlatform: Amazon Linux 2
  
  
  CustomPolicyForSecrets:
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: CustomPolicyForSecrets
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - secretsmanager:GetSecretValue
            Resource: '*'
      Roles:
        - !Ref BasicLambdaRole
        - !Ref PriviligedLambdaRole
    
  LogPolicyForLambda:
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: LogPolicyForLambda
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - logs:CreateLogGroup
              - logs:CreateLogStream
              - logs:PutLogEvents
            Resource: !Sub arn:aws:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/${AWS::StackName}-*
      Roles:
        - !Ref BasicLambdaRole
        - !Ref PriviligedLambdaRole
  
  BasicLambdaRole:
    Type: AWS::IAM::Role
    Description: Role to Read secret managers
    # refer LogPolicyForLambda to this role
    Properties:
      RoleName: MyEmptyRole
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole

  PriviligedLambdaRole:
    Type: AWS::IAM::Role
    Description: Role for SQS access + secretsmanager access
    Properties:
      Policies:
        - PolicyName: CustomPolicyForSQS
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - sqs:*
                Resource: '*'
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole

    
  PostChats:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: ./post__chats
      Description: "Main function which handles user's chats with  AI"
      MemorySize: 512
      Timeout: 90
      PackageType: Zip
      Handler: lambda_function.lambda_handler
      Runtime: python3.11
      EphemeralStorage:
        Size: 512
      EventInvokeConfig:
        MaximumEventAgeInSeconds: 21600
        MaximumRetryAttempts: 2
      AutoPublishAlias: live
      DeploymentPreference:
        Type: AllAtOnce
      ProvisionedConcurrencyConfig:
        ProvisionedConcurrentExecutions: 2
      Role: !GetAtt PriviligedLambdaRole.Arn
      Events:
        PostApiEvent:
          Type: Api
          Properties:
            Path: /chats-sam/post
            Method: POST

  FetchChats:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: ./fetch__chats
      Description: "Function to handle fetch chats with  AI"
      EventInvokeConfig:
        MaximumEventAgeInSeconds: 21600
        MaximumRetryAttempts: 2
      Role: !GetAtt BasicLambdaRole.Arn
      Events:
        FetchApiEvent:
          Type: Api
          Properties:
            Path: /chats-sam/fetch
            Method: POST
      Timeout: 10
      PackageType: Zip

We tried both arm and x86 architectures and different python versions.

requirementes.txt

numpy==1.24.4
openai==1.3.8
pgvector==0.2.4
psycopg==3.1.6
psycopg-binary==3.1.6
requests==2.31.0
pytz
phonenumbers

Observed result:

Test Event Name
test

Response
{
"statusCode": 200,
"status": "success",
"headers": {
"Access-Control-Allow-Origin": "*"
},
"messages": "[]"
}

Function Logs
START RequestId: 6ae5c841-833f-4d0a-abd1-dcf4c19ec834 Version: $LATEST
[INFO] 2024-02-20T06:51:59.375Z 6ae5c841-833f-4d0a-abd1-dcf4c19ec834 Received event: {
"user_id": "9fcb1fec0d522dcfcfaeb64da8d30b4a",
"organization_id": 11
}
END RequestId: 6ae5c841-833f-4d0a-abd1-dcf4c19ec832
REPORT RequestId: 6ae5c841-833f-4d0a-abd1-dcf4c19ec834 Duration: 15.53 ms Billed Duration: 16 ms Memory Size: 256 MB Max Memory Used: 256 MB Init Duration: 4970.89 ms

Request ID
6ae5c841-833f-4d0a-abd1-dcf4c19ec832

Expected result:

Init Duration should be less than 1000ms

Additional environment details (Ex: Windows, Mac, Amazon Linux etc)

  1. OS: tried Ubuntu, Mac
  2. sam --version: SAM CLI, version 1.109.0
  3. AWS region: US-east-2
{
  "version": "1.109.0",
  "system": {
    "python": "3.11.3",
    "os": "Linux-6.5.0-18-generic-x86_64-with-glibc2.35"
  },
  "additional_dependencies": {
    "docker_engine": "Not available",
    "aws_cdk": "Not available",
    "terraform": "Not available"
  },
  "available_beta_feature_env_vars": [
    "SAM_CLI_BETA_FEATURES",
    "SAM_CLI_BETA_BUILD_PERFORMANCE",
    "SAM_CLI_BETA_TERRAFORM_SUPPORT",
    "SAM_CLI_BETA_RUST_CARGO_LAMBDA"
  ]
}
@AmruthjithSunil AmruthjithSunil added the stage/needs-triage Automatically applied to new issues and PRs, indicating they haven't been looked at. label Feb 20, 2024
@lucashuy
Copy link
Contributor

lucashuy commented Feb 21, 2024

Hi, I was not able to reproduce your slow cold start with a project deployed via SAM CLI:

SAM CLI

  1. sam init a Python 3.11 hello world template
  2. Modified the template to add the following resource:
  HelloWorldLayer:
    Type: AWS::Serverless::LayerVersion
    Properties:
      ContentUri: ./layer
      CompatibleRuntimes:
        - python3.11
      CompatibleArchitectures:
        - x86_64
    Metadata:
      BuildMethod: python3.11

The only thing that is inside of ./layer is a requirements.txt file with the same dependencies noted in the original issue.

  1. Modified the hello world function to add the following new properties:
EphemeralStorage:
  Size: 512
Layers:
  - !Ref HelloWorldLayer
  1. Update the handler to import a package that is defined from the layer (just import, not using any methods)
  2. Ran sam build -u
  3. Ran sam deploy
  4. Ran sam remote invoke
$ sam remote invoke --stack-name sam-app-6715-2
Invoking Lambda Function HelloWorldFunction                                                                                                                         
START RequestId: ee5bfe82-1b21-4c20-af40-fc9eaea477c3 Version: $LATEST
END RequestId: ee5bfe82-1b21-4c20-af40-fc9eaea477c3
REPORT RequestId: ee5bfe82-1b21-4c20-af40-fc9eaea477c3  Duration: 1.43 ms       Billed Duration: 2 ms   Memory Size: 128 MB     Max Memory Used: 76 MB  Init Duration: 679.37 ms
{"statusCode": 200, "body": "{\"message\": \"hello world\"}"}%

AWS CLI

When creating a Lambda manually via the AWS CLI:

  1. Zipped the function and layer built directory from previous sam build
  2. Create a Lambda layer:
aws lambda publish-layer-version --layer-name manual-layer3 --zip-file fileb://hellolayer.zip --compatible-runtimes python3.11 --compatible-architectures x86_64
  1. Create a Lambda function:
aws lambda create-function --function-name manual3 --role <some role that is able to `InvokeFunction`> --layers <layer ARN from previous command> --zip-file fileb://hellofunction.zip --handler app.lambda_handler --runtime python3.11 --ephemeral-storage Size=512
  1. Invoke the Lambda function
$ aws lambda invoke --function-name manual3 /dev/stdout       
{"statusCode": 200, "body": "{\"message\": \"hello world\"}"}%

Inspecting the Cloudwatch logs for the function, we can see that the init time is roughly the same as one the one deployed with SAM CLI:

REPORT RequestId: 44daa145-cf6f-4ccb-a86c-379107756e76	Duration: 1.54 ms	Billed Duration: 2 ms	Memory Size: 128 MB	Max Memory Used: 76 MB	Init Duration: 665.80 ms

Is there a reproducible project that you can share somewhere where we can reproduce this? Something interesting to point out, is that in your function runtime logs it looks like the memory is being maxed out. This may have something to do with the slow init times?

@lucashuy lucashuy added stage/bug-repro The issue/bug needs to be reproduced and removed stage/needs-triage Automatically applied to new issues and PRs, indicating they haven't been looked at. labels Feb 21, 2024
@jysheng123 jysheng123 added the blocked/more-info-needed More info is needed from the requester. If no response in 14 days, it will become stale. label Mar 4, 2024
@AmruthjithSunil
Copy link
Author

AmruthjithSunil commented Mar 5, 2024

Found the issue. It was because of same module importing twice in a python file which was in the layer.

import phonenumbers
from phonenumbers import geocoder, carrier, timezone

removed second line. Then init duration during warm up reduced from 5s to 0.5s. Also Max Memory Used gone down from 250MB to 70MB.

@hnnasit
Copy link
Contributor

hnnasit commented Mar 6, 2024

Hi @AmruthjithSunil, glad you found the issue. Does that address your concerns when running sam commands now?

@hnnasit hnnasit added the blocked/close-if-inactive Blocked for >14 days with no response, will be closed if still inactive after 7 days label Mar 6, 2024
@mildaniel
Copy link
Contributor

Closing as the issue seems to be resolved.

Copy link
Contributor

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked/close-if-inactive Blocked for >14 days with no response, will be closed if still inactive after 7 days blocked/more-info-needed More info is needed from the requester. If no response in 14 days, it will become stale. stage/bug-repro The issue/bug needs to be reproduced
Projects
None yet
Development

No branches or pull requests

5 participants