Implement saving/loading dataset metadata from S3 #76

leothomas · 2020-11-04T16:31:07Z

Dynamically extracting the dataset domain requires an S3 scan each time a spotlight is requested, which can be slow (roughly 5-10 seconds). Since this data is not updated often, there is no need to regenerate it on every request.

Possible solution 1: a Lambda function regularly (daily? weekly? hourly?) scans the S3 bucket and stores the dataset metadata as a JSON file. Whenever the /dataset endpoint is hit, the API reads the file and returns the data. An EventBridge rule triggers the Lambda at the desired interval.

Pro:
- User never has to wait for the "processing" to complete when requesting dataset metadata
Con:
- Have to implement a separate Lambda + EventBridge rule in CDK stack
- Have to manually re-trigger the lambda to force refresh
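
A rough sketch of what the scheduled Lambda in solution 1 could look like. The bucket name, the metadata key, and the idea that a dataset is identified by its top-level key prefix are all assumptions made for illustration, not taken from the actual code base. The boto3 calls used are `get_paginator("list_objects_v2")` and `put_object`.

```python
import json
from collections import defaultdict

# Hypothetical names, chosen only for this sketch.
BUCKET = "example-data-bucket"
METADATA_KEY = "dataset-metadata.json"


def build_metadata(s3_client, bucket):
    """Scan the whole bucket and group object keys by top-level prefix.

    Assumption: each dataset lives under its own top-level "folder".
    """
    datasets = defaultdict(list)
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # "Contents" is absent from a page that returns no objects.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            # Skip the metadata file itself and any key without a prefix.
            if key == METADATA_KEY or "/" not in key:
                continue
            dataset_id = key.split("/", 1)[0]
            datasets[dataset_id].append(key)
    return {name: sorted(keys) for name, keys in datasets.items()}


def handler(event, context, s3_client=None):
    """Lambda entry point, meant to be invoked by the scheduled rule."""
    if s3_client is None:
        import boto3  # imported lazily so the module loads without boto3

        s3_client = boto3.client("s3")
    metadata = build_metadata(s3_client, BUCKET)
    # Write the result back to the bucket as a single JSON file.
    s3_client.put_object(
        Bucket=BUCKET,
        Key=METADATA_KEY,
        Body=json.dumps(metadata).encode("utf-8"),
        ContentType="application/json",
    )
    return {"datasets": len(metadata)}
```

With this shape, forcing a refresh means invoking `handler` by hand, which is the second con listed above.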

Possible solution 2: the /dataset endpoint attempts to read dataset metadata from a file. If the file does not exist, or is older than the max interval (hour? day? week?), the endpoint scans the bucket, updates the metadata, writes back to the file and returns the updated data to the user. If the file does exist and is not older than the max interval, return the data from the file.

Pro:
- All code modifications are contained within the API (no need to create separate Lambda + EventBridge rule)
- Trigger refresh by deleting the dataset metadata files
Con:
- One request per refresh interval will still have to scan the full S3 bucket
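
The read-through logic in solution 2 could be sketched roughly as follows. The function name, the one-day default interval, and the `load` / `scan` / `save` callables are hypothetical; in the API they would wrap the S3 read of the metadata file, the full bucket scan, and the S3 write respectively.

```python
import time

MAX_AGE_SECONDS = 24 * 60 * 60  # assumed max interval: one day


def get_dataset_metadata(load, scan, save, max_age=MAX_AGE_SECONDS, now=time.time):
    """Return cached metadata, rescanning only when it is missing or stale.

    load() -> (metadata, written_at) or None if the file does not exist
    scan() -> fresh metadata from a full bucket scan (the slow path)
    save(metadata, written_at) -> persists the metadata file
    """
    cached = load()
    current = now()
    if cached is not None:
        metadata, written_at = cached
        if current - written_at <= max_age:
            # File exists and is fresh enough: serve it directly.
            return metadata
    # File missing or too old: this request pays for the full scan.
    metadata = scan()
    save(metadata, current)
    return metadata
```

Deleting the stored file makes `load()` return `None`, so the next request rescans the bucket, which matches the "trigger refresh by deleting" point above.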


leothomas · 2020-11-04T16:32:27Z

Scheduling Lambda executions using EventBridge

drewbo · 2020-11-04T17:16:32Z

I think solution 1 with a daily refresh is great. We can also schedule the executions via a CloudWatch rule.

leothomas · 2021-01-15T23:38:17Z

Closing this issue as this feature has been implemented: #95

leothomas assigned leothomas, olafveerman and drewbo Nov 4, 2020

leothomas closed this as completed Jan 15, 2021
