
Add LPDAAC S3 credential rotation dynamic tiler lambda (for HLS) #25

Open
abarciauskas-bgse opened this issue Feb 21, 2022 · 18 comments

@abarciauskas-bgse
Contributor

The dynamic tiler may be requesting data from Earthdata Cloud buckets, such as the HLS data provided by LP DAAC. The tiler needs some sort of credentials to request those files. This could be done by storing URS credentials in a .netrc file, but @sharkinsspatial has created EDL credential rotation for direct S3 access, which should be faster than authenticating through URS for each request: https://github.com/NASA-IMPACT/edl-credential-rotation. We should probably re-use this approach in our backend API.
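For context, a rough sketch of the two access patterns being compared (illustrative only; the environment variable names and the way credentials are delivered are assumptions, not the edl-credential-rotation implementation):

```python
import os

import rasterio
from rasterio.session import AWSSession

# Option 1: per-request URS authentication over HTTPS, driven by a .netrc
# entry for urs.earthdata.nasa.gov plus GDAL's cookie handling. Each request
# pays the cost of the Earthdata Login redirect.
netrc_env = rasterio.Env(
    GDAL_HTTP_NETRC="YES",
    GDAL_HTTP_COOKIEFILE="/tmp/urs_cookies.txt",
    GDAL_HTTP_COOKIEJAR="/tmp/urs_cookies.txt",
)

# Option 2: direct S3 reads using temporary credentials kept fresh by a
# rotation lambda; the EDL_* variable names here are hypothetical.
s3_env = rasterio.Env(
    session=AWSSession(
        aws_access_key_id=os.environ["EDL_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["EDL_SECRET_ACCESS_KEY"],
        aws_session_token=os.environ["EDL_SESSION_TOKEN"],
    )
)
```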

anayeaye self-assigned this Mar 21, 2022
@anayeaye
Collaborator

I deployed a separate edl-credential-rotation stack for delta-backend-dev. In the raster-api handler in delta-backend-dev, a small change is needed to use the EDL AWS session credentials from the lambda environment.

With these changes the raster-api is able to pick up and use the credentials; however, the API is currently deployed in an isolated subnet, which prevents us from accessing the external S3 files. Unfortunately the CDK VPC configuration is blocking the deployment of private-with-NAT lambdas due to poor CIDR block planning. This change management plan includes steps to resolve the VPC issue.

feature/edl-4-rasterapi contains the lambda changes as well as some minor changes to GDAL environment variables.

@sharkinsspatial

@anayeaye @vincentsarago is helping investigate some GDAL optimizations for our use cases that will most likely affect https://github.com/NASA-IMPACT/delta-backend/compare/feature/edl-4-rasterapi#diff-08a35aa423ced1c2c9aeb17d6a439c22744578a3b1cbdfee77f2f26be39554c1. Is this a good location to ping you with updates as we learn more?

@anayeaye
Collaborator

@sharkinsspatial thanks--this is a great place for updates!

@abarciauskas-bgse
Contributor Author

@anayeaye The change management plan looks great; we should add it to a VEDA project folder so we can re-use or reference it in the future. Thanks for writing it up.

A few questions below, but I think we want to send this ASAP to the front end developers (Daniel, Ricardo, Hanbyul), the data publishers (Iksha, Slesa), and the ESA development team that has been using the staging API, so they are aware staging may go down for 1-2 days next week. Do you agree?

Questions about the change management plan:

Two resource changes are needed for the delta backend stack that cannot be implemented with a simple CDK deployment.

Can we make it clear here that the plan is to deploy a new stack and, once we have verified it is operational, to update the domain name servers to point to the new stack endpoints?

The pgstac database needs to be upgraded to a new schema that will allow us to ingest temporally dense data like CMIP6.

Can we make it clear that we are upgrading pgstac, which is a schema for the PostgreSQL database in RDS (as opposed to the version of PostgreSQL itself), from version XX to XX, with a link to https://github.com/stac-utils/pgstac? Also, can we add that we will be creating a snapshot of the existing database and using it to restore the existing datasets to the new database and schema?

Confirm database snapshot retention period is adequate for this transition work

I think adequate here just means that there is no risk of changes having been made to the database between the date of the most recent snapshot and when we use it to populate the new database. Is that your definition as well?

Test

What types of tests will you run?

@anayeaye
Collaborator

anayeaye commented Apr 1, 2022

@abarciauskas-bgse Thank you for your change management review comments! I have updated the document and agree that we need to share with the wider VEDA team ASAP. As for staging going down, I think this plan ensures that staging will not be down for more than an hour or two, but we will have a window during which new data ingests would be lost; the dev stack work should give us a good estimate of how long that will be.

I am not sure that the RDS restore plan is even viable (I hope it is!). I think that I can test it tomorrow and then tighten up the dates for sharing.

@sharkinsspatial

@manilmaskey brought up a valid question in today's IMPACT meeting that made me consider the fact that we should have a broader strategy for cross-account bucket access with the DAACs. I adopted the temporary S3 credential rotation strategy for the HLS tiler because our delivery timelines for integration with the FIRMS application were extremely short, and this didn't leave adequate time to coordinate with LPDAAC on a large administrative change.

@tracetechnical and I chatted a bit about this today and, given the frequent maintenance windows and periodic instability of EDL, it would be a good idea to have someone from IMPACT engage directly with the relevant DAACs and check whether cross-account policies with read access can be enabled for all roles in our accounts. There are several approaches for tackling this, but it would be good to first determine whether this is feasible from a policy perspective. cc @abarciauskas-bgse @anayeaye

@anayeaye
Collaborator

Still pushing this EDL service forward as a temporary solution until cross account policies are established. PR #50 handles the VPC CIDR range limitations that were preventing us from adding the private-with-nat subnets needed to render HLS data on the map.

Currently there is not an edl-login-service deployed for the delta-backend (I took it down while navigating the VPC changes). The feature branch for the delta-backend raster-api changes needed for EDL is still open but will need to be caught up when we come back to this issue.

@anayeaye
Collaborator

anayeaye commented May 3, 2022

This work is on hold; we should consider an alternate tiler for HLS data. PR #56 documents how credential rotation was added to a test delta-backend stack and why it cannot be used as-is (tl;dr: a single tiler can serve either HLS or our own hosted COGs, but not both).

@anayeaye
Collaborator

anayeaye commented May 6, 2022

Noting a possible solution from @abarciauskas-bgse, @vincentsarago, and @sharkinsspatial to the issue raised in PR #56:

Add an additional tiler to the delta-backend deployment that will receive EDL tokens and use the dataset configuration or collection metadata to choose what tiler is used.

abarciauskas-bgse changed the title from "Add S3 credential rotation to our dynamic tiler lambda" to "Add LPDAAC S3 credential rotation dynamic tiler lambda (for HLS)" on May 9, 2022
@anayeaye
Collaborator

anayeaye commented May 9, 2022

@abarciauskas-bgse Here are some notes about what I think the delta-backend can do to support HLS for the trilateral release. I think that the second scenario is what you are proposing and I can get started on it if I have the right idea...

Short term trilateral release commitment

Two possible short term solutions exist in which we provide a us-west delta-backend stack deployed with a snapshot of the staging database (and redeploy as needed to add the latest staging-stack ingests). In both, the CloudFormation stack will have a new name (like delta-backend-west), with the possibility of moving custom domain API users over to this new us-west backend in the future (i.e. cutting over traffic from https://staging-stac.delta-backend.xyz to this new backend).

Scenario 1 (single delta backend in us-west only supporting LPDAAC-CLD)

  1. Deploy latest delta-backend to us-west-2 with a LPDAAC credential rotation service using a snapshot of the latest staging pgstac database.
  2. Provide tiler base url and obtain contact list for events that cause this url to change (custom domain is fixed but some VPN users will need the raw API Gateway url). This tiler will work for LPDAAC map layers and fail for other STAC collections.

Scenario 2 (multiple tilers, one delta backend deployed in us-west)

Deploy latest delta backend with 3 provider-dedicated tilers

  • Primary titiler serves public S3 COGs and COGs co-located with the delta-backend stack (what is already running today)
  • LPDAAC-CLD tiler for HLS data (and any others hosted in LPDAAC-CLD)
  • ORNLDAAC-CLD tiler for GEDI L4B
  1. Add new construct(s) to delta-backend for 2 additional tilers that will support LPDAAC and ORNL (low effort, essentially copy-paste and add new resource names).
  2. Deploy latest delta-backend to us-west-2 with a LPDAAC (and possibly ORNL) credential rotation service using a snapshot of the latest staging pgstac database.
  3. Provide tiler base url and obtain contact list for events that cause this url to change (custom domain is fixed but some VPN users will need the raw API Gateway url)

Work required

  • Integrate raster-api env variable changes to use the session credentials obtained by the credential rotation service (we have already done this in the us-west backend, PR #56)
  • Deploy credential rotation service(s) (we have also already verified the edl-credential-rotation service for the delta-backend)
  • Support required: keeping the us-west stack postgres database up to date with what is ingested in staging, and providing updates to system users/developers when the tiler API gateway url changes (not expected, but possible, especially if we switch from scenario 1 to 2 midstream)
  • [scenario 2 only] Add new dedicated raster-api/tiler constructs to the delta-backend deployment (a rough sketch of this follows below)
  • [ORNL only] Minor adjustment to edl-credential-rotation to parametrize the auth url for ORNL-CLOUD. This can be completed in sequence after the HLS tiler is running.
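To make the "new dedicated raster-api/tiler constructs" item concrete, a minimal CDK sketch, assuming CDK v2 Python, a local raster_api asset directory, and hypothetical provider identifiers (this is not the actual delta-backend code):

```python
from aws_cdk import App, Duration, Stack, aws_lambda
from constructs import Construct


class ProviderTilers(Stack):
    """One dedicated tiler lambda per external data provider."""

    def __init__(self, scope: Construct, stack_id: str, **kwargs) -> None:
        super().__init__(scope, stack_id, **kwargs)

        for provider in ("lpdaac", "ornldaac"):  # hypothetical identifiers
            # Each tiler is essentially a copy of the primary raster-api with
            # its own resource names; the matching credential rotation stack
            # is expected to keep this lambda's credentials up to date.
            aws_lambda.Function(
                self,
                f"{provider}-raster-api",
                runtime=aws_lambda.Runtime.PYTHON_3_9,
                handler="handler.handler",
                code=aws_lambda.Code.from_asset("raster_api"),
                memory_size=1536,
                timeout=Duration.seconds(30),
                environment={"PROVIDER": provider},
            )


app = App()
ProviderTilers(app, "delta-backend-west-tilers")
app.synth()
```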

@abarciauskas-bgse
Contributor Author

To summarize my conversation with @anayeaye yesterday, I believe we want to deliver a parameterized endpoint so that clients can still use the same API endpoint for visualization but pass a parameter identifying the data provider. The reasoning behind this is that, while many datasets will live in the "VEDA data store bucket", other datasets in our API will be maintained by other "data providers", most likely DAACs. While we will probably need some things to be true for all VEDA data providers (in that we have some way of accessing the data from our systems), I think we will have different backend implementations for making requests to these providers, such as different S3 credentials.

In order to make this work we need to:

  • Add a data provider field to collections, at least when it is different from "VEDA" (data that this program maintains in our buckets)
  • Inform clients that when making requests for items and collections with a specified provider, certain endpoints (such as /cog/tiles) should include the provider parameter
  • Our endpoint for /cog/tiles should take a parameter (?provider=lpdaac) and then route that request to a specific tiler endpoint which has credentials for that provider

What do you think about this approach, @anayeaye @vincentsarago @sharkinsspatial?
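A rough sketch of that routing idea (not an implementation; the tiler base URLs and provider names are placeholders):

```python
from typing import Optional

from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import RedirectResponse

app = FastAPI()

# Placeholder mapping from the provider parameter to the dedicated tiler that
# holds credentials for that provider.
PROVIDER_TILERS = {
    "lpdaac": "https://lpdaac-tiler.example.com",
    "ornldaac": "https://ornldaac-tiler.example.com",
}
DEFAULT_TILER = "https://veda-tiler.example.com"  # VEDA-hosted COGs


@app.get("/cog/tiles/{z}/{x}/{y}")
def route_tile(z: int, x: int, y: int, request: Request, provider: Optional[str] = None):
    """Forward the tile request to the tiler that can read this provider's data."""
    if provider is None:
        base = DEFAULT_TILER
    elif provider in PROVIDER_TILERS:
        base = PROVIDER_TILERS[provider]
    else:
        raise HTTPException(status_code=400, detail=f"unknown provider: {provider}")
    # Preserve the original query string (url, rescale, colormap, etc.).
    return RedirectResponse(f"{base}/cog/tiles/{z}/{x}/{y}?{request.url.query}")
```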

@vincentsarago
Contributor

Our endpoint for /cog/tiles should take a parameter (?provider=lpdaac) and then route that request to a specific tiler endpoint which has credentials for that provider

@abarciauskas-bgse the problem with this approach is that we assume we will get credentials on each tile request, which might not be possible (throttling). cc @sharkinsspatial

@abarciauskas-bgse
Contributor Author

When you say we will get credentials, do you mean get the AWS credentials? I was still anticipating that we use the AWS EDL credential rotation lambda, which I think can include an AWS session key; not sure if that helps with throttling.

@vincentsarago
Contributor

@abarciauskas-bgse oh, so every 30 minutes or so we get credentials for multiple providers, then on each user request we use one of the available credentials?

@abarciauskas-bgse
Contributor Author

There are multiple lambdas, one for each provider, and each gets new credentials every 30 minutes.

@sharkinsspatial

@abarciauskas-bgse We have a few options here. Due to restrictions on Lambda reserved environment variable keys (https://docs.aws.amazon.com/lambda/latest/dg/configuration-envvars.html), the credential environment variables AWS_ACCESS_KEY, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY are not set at the Lambda environment variable level. Instead we set ACCESS_KEY, ACCESS_KEY_ID, and SECRET_ACCESS_KEY, which are then mapped to the correct environment variable keys at handler instantiation (https://github.com/NASA-IMPACT/delta-backend/pull/56/files#diff-c6579356c48fc61c45cac3e22a45ce276b7dcf42ebe1cf4c0a5417fc22fca4ccR6-R11). We'll have to confirm the Lambda context caching mechanics with @vincentsarago, but you could also theoretically have a single Lambda whose handler sets these based on a request query parameter, such as
?provider=lpdaac -> os.environ["AWS_SECRET_ACCESS_KEY"] = os.environ["LPDAAC_SECRET_ACCESS_KEY"].
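Roughly, that remapping at handler instantiation looks like this (a sketch, not the exact PR #56 code; whether a session token is also required depends on the rotation service):

```python
import os

# Copy the non-reserved variables set by the credential rotation service into
# the reserved AWS_* names that boto3/rasterio/GDAL expect. Runs once when the
# handler module is imported.
for src, dst in (
    ("ACCESS_KEY_ID", "AWS_ACCESS_KEY_ID"),
    ("SECRET_ACCESS_KEY", "AWS_SECRET_ACCESS_KEY"),
    ("SESSION_TOKEN", "AWS_SESSION_TOKEN"),  # assumed; not listed above
):
    if src in os.environ:
        os.environ[dst] = os.environ[src]

# A single multi-provider lambda could instead do this per request, e.g.
# provider="lpdaac" -> copy from LPDAAC_SECRET_ACCESS_KEY and friends.
```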

Additionally, all of these environment settings can be injected more explicitly in the Mangum application via rasterio.Env(session=session), which could also be modified to use an explicit provider query parameter on a per-request basis.

@sharkinsspatial

Also linking Patrick's document here for reference, which outlines potential longer-term strategies around this issue: https://docs.google.com/document/d/18GyoMZj0I2HKAXwqyeziO0ISbOwHxo1TN4eAlR4mH3U/view.

@vincentsarago
Contributor

@sharkinsspatial FYI, we don't use os.environ in impact-tiler; instead we create an AWSSession which we forward to the rasterio Env: https://github.com/NASA-IMPACT/impact-tiler/blob/master/infrastructure/lambda/cog_application.py#L43-L50

This is done at the app creation level but could in theory also be done at the request level.
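For reference, a sketch of that pattern (not the actual impact-tiler code, which is at the link above; the environment variable names are assumptions):

```python
import os

import boto3
import rasterio
from rasterio.session import AWSSession

# Build the session once at app creation from the rotated credentials, so no
# reserved AWS_* lambda environment variables need to be touched.
aws_session = AWSSession(
    boto3.Session(
        aws_access_key_id=os.environ["ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["SECRET_ACCESS_KEY"],
        aws_session_token=os.environ.get("SESSION_TOKEN"),
    )
)


def read_preview(src_path: str):
    # The same rasterio.Env(session=...) call could instead be built inside
    # the request handler from a provider-specific session, as noted above.
    with rasterio.Env(session=aws_session):
        with rasterio.open(src_path) as src:
            return src.read(1, out_shape=(256, 256))
```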
