Implement saving/loading dataset metadata from S3 #76

Closed
leothomas opened this issue Nov 4, 2020 · 3 comments
@leothomas

Dynamically extracting the dataset domain requires an S3 scan each time a spotlight is requested, which can be slow (roughly 5-10 seconds). Since this data is rarely updated, there is no need to regenerate it on every request.

Possible solution 1: a Lambda function regularly (daily? weekly? hourly?) scans the S3 bucket and stores the dataset metadata as a JSON file. Whenever the /dataset endpoint is hit, the API reads the file and returns the data. An EventBridge rule triggers the Lambda at the desired interval.

Pro:
- User never has to wait for the "processing" to complete when requesting dataset metadata
Con:
- Have to implement a separate Lambda + EventBridge rule in CDK stack
- Have to manually re-trigger the Lambda to force a refresh
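
A minimal sketch of what the scheduled refresh in solution 1 could look like, not the actual implementation. It assumes that the first path segment of each object key identifies the dataset, and that the metadata lives at a hypothetical key `dataset-metadata.json`; the function names are illustrative too. The S3 client is passed in (for example `boto3.client("s3")`), so the logic can be exercised without AWS access.

```python
import json
from collections import defaultdict


def build_metadata(keys):
    """Group object keys by their first path segment (assumed to be the dataset id)."""
    datasets = defaultdict(list)
    for key in keys:
        dataset_id, _, rest = key.partition("/")
        if rest:  # ignore top-level objects that belong to no dataset
            datasets[dataset_id].append(rest)
    return {name: sorted(files) for name, files in sorted(datasets.items())}


def refresh_metadata(s3_client, bucket, metadata_key="dataset-metadata.json"):
    """Scan the whole bucket once and write the result back as a JSON file.

    `metadata_key` is a hypothetical name chosen for this sketch.
    """
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        # "Contents" is absent from a page that holds no objects
        keys.extend(obj["Key"] for obj in page.get("Contents", []))

    metadata = build_metadata(k for k in keys if k != metadata_key)
    s3_client.put_object(
        Bucket=bucket,
        Key=metadata_key,
        Body=json.dumps(metadata).encode("utf-8"),
        ContentType="application/json",
    )
    return metadata
```

The Lambda handler would then only need to call `refresh_metadata` with the bucket name, and the /dataset endpoint would read the JSON file instead of scanning.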

Possible solution 2: the /dataset endpoint attempts to read the dataset metadata from a file. If the file does not exist, or is older than the max interval (hour? day? week?), the endpoint scans the bucket, updates the metadata, writes it back to the file, and returns the updated data to the user. If the file exists and is not older than the max interval, the endpoint returns the data from the file.

Pro:
- All code modifications are contained within the API (no need to create separate Lambda + EventBridge rule)
- Trigger refresh by deleting the dataset metadata files
Con:
- One request per max interval will still have to wait for a full scan of the S3 bucket
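
The read-through logic of solution 2 can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: `store` stands in for the metadata file (any dict-like object here, S3 reads and writes in practice), `scan_bucket` is a hypothetical callable that performs the slow scan, and the one-day `MAX_AGE_SECONDS` is just one of the intervals floated above.

```python
import json
import time

MAX_AGE_SECONDS = 24 * 60 * 60  # assumed max interval of one day
METADATA_KEY = "dataset-metadata.json"  # hypothetical file name


def get_dataset_metadata(store, scan_bucket, max_age=MAX_AGE_SECONDS, now=time.time):
    """Return cached metadata if it is fresh enough, otherwise rescan and rewrite it."""
    cached = store.get(METADATA_KEY)
    if cached is not None:
        record = json.loads(cached)
        if now() - record["generated_at"] <= max_age:
            return record["datasets"]  # file exists and is not too old

    # File is missing or stale: this request pays for the full scan
    datasets = scan_bucket()
    store[METADATA_KEY] = json.dumps({"generated_at": now(), "datasets": datasets})
    return datasets
```

Deleting the entry from `store` forces the next request to rescan, which matches the "delete the file to refresh" behaviour listed under Pro.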

@drewbo

drewbo commented Nov 4, 2020

I think solution 1 with a daily refresh is great. We can also schedule the executions via a CloudWatch rule.

@leothomas

Closing this issue as this feature has been implemented: #95
