Add AWS provisioning to raster api cdk workflow for ingestion pipelines #22
@jvntf Which resources need to be created in the VPC? Checkout this CDK code in the APT stack. It does much of what you're describing above:
Thanks @leothomas! I will check out this code.
@jvntf @leothomas Are there defined estimates for
The architectural requirements you have here (a Lambda which needs public internet access and simultaneously needs to communicate with a db on a private subnet) have been widely discussed across projects (I believe @leothomas and I had some discussions surrounding this issue for APT). I'm not very experienced with AWS networking intricacies, but one concern with the approach used in APT is that, given the potentially large volume of data transferred, there is the possibility of substantial NAT Gateway data processing and data transfer costs. @abarciauskas-bgse, @wildintellect, and Phil Varner prepared a very detailed document discussing some of this case: https://docs.google.com/document/d/1uYr6XnEQY9Bx7_uamia9aGimQW99L52IXg6LIrsdH2A/edit#

I'd be interested in us landing on a canonical decision for how we handle this case. In previous projects we have taken the low-security approach of placing the RDS instance in public subnets to avoid NAT Gateway charges for the massive data transfers involved in downloading Sentinel-2 data from ESA (https://github.com/NASA-IMPACT/hls-sentinel2-downloader-serverless). That database only stores non-sensitive, ephemeral log data, so this was an acceptable tradeoff, but I'm unsure what the optimal approach is here. If the answer to question 1 above will always be S3, then I believe this is a moot point and an S3 VPC Endpoint should eliminate the NAT Gateway overhead costs, but someone with more experience in this area might have better details.
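To make the cost concern above concrete, here is a back-of-envelope calculation. The prices are assumptions based on us-east-1 list pricing at the time of writing (roughly $0.045 per GB processed plus roughly $0.045 per NAT-gateway-hour); check current AWS pricing before relying on these numbers.

```python
# Rough NAT Gateway cost model. Prices are assumptions (us-east-1
# list pricing at time of writing); verify against current AWS pricing.
NAT_GB_PROCESSING_USD = 0.045  # per GB processed through the gateway
NAT_HOURLY_USD = 0.045         # per gateway-hour

def monthly_nat_cost(gb_transferred: float, hours: float = 730) -> float:
    """Approximate monthly NAT Gateway cost for a given data volume."""
    return gb_transferred * NAT_GB_PROCESSING_USD + hours * NAT_HOURLY_USD

# e.g. pulling 10 TB of Sentinel-2 data through a NAT gateway in a month:
print(round(monthly_nat_cost(10_000), 2))  # ~ 482.85 (USD)
```

At that scale the per-GB processing charge dominates, which is why routing the bulk transfer around the NAT gateway (public subnet download, or an S3 gateway endpoint) matters.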
cc @edkeeble for reference as we investigate architecture options.
@sharkinsspatial I did some digging this afternoon on the feasibility of using a single lambda function associated with both the private and public subnets in the VPC (to attempt to bypass the NAT gateway charges when downloading large amounts of data). As @wildintellect already stated, this is not possible (never should have doubted you!).

While you can assign a lambda to multiple subnets, this is essentially defining a pool of subnets to which the lambda might connect. Each time the lambda is invoked, a random subnet is selected from that pool and a network interface is created within that subnet. Hence, all subnets in the pool must be functionally identical. If we expect the ingestion process to download large amounts of data from the Internet, we could use a two-step approach:
There are any number of ways to handle the hand-off in the above approach: we could use Step Functions, an S3 trigger for the second function, or the first function could invoke the second function directly.
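The two-step approach can be sketched as two handlers chained together. This uses in-memory stand-ins for S3 and the database so the example is self-contained; in a real deployment the first function would run in a public subnet (internet access, no NAT charges for the bulk download into S3) and the second in the private subnet alongside RDS, reading via an S3 VPC endpoint. All names here are hypothetical.

```python
# In-memory stand-ins so the sketch runs without AWS credentials.
FAKE_S3: dict = {}   # stand-in for an S3 bucket
FAKE_DB: list = []   # stand-in for the RDS table

def download_handler(event: dict, context=None) -> dict:
    """Step 1 (public subnet): fetch source data and stage it in S3."""
    key = event["key"]
    data = b"...bytes fetched from the upstream provider..."  # placeholder
    FAKE_S3[key] = data
    # Returning the key lets Step Functions, an S3 event trigger, or a
    # direct invoke hand off to the second function.
    return {"key": key}

def load_handler(event: dict, context=None) -> dict:
    """Step 2 (private subnet): read the staged object, insert into the DB."""
    data = FAKE_S3[event["key"]]
    FAKE_DB.append({"key": event["key"], "size": len(data)})
    return {"inserted": 1}

# Chained directly, one of the hand-off options mentioned above:
out = download_handler({"key": "scenes/S2A_tile_001.zip"})
load_handler(out)
```

Because the two functions never share a network interface, each can live in the subnet type it actually needs.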
@edkeeble @wildintellect Thanks so much for investigating this. It is great to have some clarity around this Lambda limitation. I fully agree with a two-step approach. I don't know what the scale / rate of ingestion would be like for this service, but my guess is that @bitner would recommend periodic batch loads of large
This might be slight overkill, but it gives us a nice throttle to control how we are interacting with the database. I've been considering refactoring HLS to use a similar approach, as our error reprocessing logic sometimes results in making thousands of near-simultaneous db inserts (which would make @bitner 😢).
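The throttling idea above can be sketched as a queue consumer that drains messages in fixed-size batches and issues one bulk insert per batch, instead of one insert per event. `queue.Queue` stands in for SQS here, and the names are hypothetical.

```python
import queue

BATCH_SIZE = 100

def drain_in_batches(q, batch_size=BATCH_SIZE):
    """Yield lists of up to `batch_size` messages until the queue is empty."""
    while not q.empty():
        batch = []
        while len(batch) < batch_size and not q.empty():
            batch.append(q.get_nowait())
        yield batch

def bulk_insert(batch):
    """Stand-in for a single multi-row INSERT against the database."""
    return len(batch)

# Simulate 250 reprocessing events landing on the queue at once:
q = queue.Queue()
for i in range(250):
    q.put({"record_id": i})

inserted = sum(bulk_insert(b) for b in drain_in_batches(q))
print(inserted)  # 250 records in 3 bulk inserts rather than 250 single inserts
```

The batch size becomes the throttle knob: the database sees a bounded number of round trips regardless of how many events arrive simultaneously.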
As a reminder to myself and the group: if we successfully use this multi-lambda, multi-subnet approach, we should make sure to demo it to both the Development Seed Earthdata team and the NASA IMPACT team as a proposed best practice for cost-effective handling of external data transfer in Lambdas while maintaining an RDS instance in a private VPC subnet.
For the cloud optimized data pipelines, we need certain resources to be created within the VPC in order for the ingestion Lambdas to work properly (until now this has been done manually).
At least one Lambda task needs to write to the database, while also maintaining access to the internet. Following this guide, we should create the following when deploying the raster api cdk construct:
The lambda that needs to acc
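A minimal sketch of what the CDK construct could create, assuming Python CDK v2. All construct names and parameters here are illustrative, not this project's actual stack: a VPC with public and private subnets, an S3 gateway endpoint so private-subnet traffic to S3 bypasses the NAT gateway, and the ingestion Lambda placed in the private subnets next to the database.

```python
# Hypothetical sketch (Python CDK v2); names are illustrative only.
from aws_cdk import Stack, aws_ec2 as ec2, aws_lambda as _lambda

class IngestionVpcStack(Stack):
    def __init__(self, scope, construct_id, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        vpc = ec2.Vpc(
            self, "IngestionVpc",
            subnet_configuration=[
                ec2.SubnetConfiguration(
                    name="public", subnet_type=ec2.SubnetType.PUBLIC),
                ec2.SubnetConfiguration(
                    name="private",
                    subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS),
            ],
        )

        # Gateway endpoint so S3 traffic from private subnets bypasses
        # the NAT gateway (and its per-GB processing charges).
        vpc.add_gateway_endpoint(
            "S3Endpoint",
            service=ec2.GatewayVpcEndpointAwsService.S3,
        )

        # The ingestion Lambda that writes to the database sits in the
        # private subnets, alongside RDS.
        _lambda.Function(
            self, "IngestFn",
            runtime=_lambda.Runtime.PYTHON_3_11,
            handler="index.handler",
            code=_lambda.Code.from_asset("lambda/"),
            vpc=vpc,
            vpc_subnets=ec2.SubnetSelection(
                subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS),
        )
```

This is infrastructure configuration only, so the exact subnet layout and Lambda placement would need to match whatever the raster api cdk construct already defines.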
cc @anayeaye @abarciauskas-bgse