Dynamically extracting the dataset domain requires an S3 scan each time a spotlight is requested, which is slow (roughly 5-10 seconds). Since this data is rarely updated, there is no need to regenerate it on every request.
Possible solution 1: a Lambda function regularly (daily? weekly? hourly?) scans the S3 bucket and stores the dataset metadata as a JSON file. Whenever the /dataset endpoint is hit, the API reads the file and returns the data. An EventBridge rule triggers the Lambda at the desired interval.
Pro:
- User never has to wait for the "processing" to complete when requesting dataset metadata
Con:
- Have to implement a separate Lambda + EventBridge rule in CDK stack
- Have to manually re-trigger the Lambda to force a refresh
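A minimal sketch of what the scheduled Lambda in solution 1 might look like. The bucket name, the metadata key, and the grouping of objects by top-level prefix are all assumptions made for illustration; the real domain extraction logic would replace `build_metadata`. The S3 client is passed in so the scan can be exercised without AWS access, and only the standard boto3 calls `get_paginator("list_objects_v2")` and `put_object` are used.

```python
import json
from datetime import datetime, timezone

# Hypothetical names, not taken from the issue.
BUCKET = "example-dataset-bucket"
METADATA_KEY = "dataset-metadata.json"


def build_metadata(s3, bucket, skip_key):
    """Scan every object in the bucket and group keys by top-level prefix.

    Grouping by prefix is a placeholder for the real domain extraction.
    """
    datasets = {}
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key == skip_key:
                continue  # do not index the metadata file itself
            dataset_id = key.split("/", 1)[0]
            datasets.setdefault(dataset_id, []).append(key)
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "datasets": datasets,
    }


def refresh(s3, bucket=BUCKET, key=METADATA_KEY):
    """Run the scan once and write the result back to the bucket as JSON."""
    metadata = build_metadata(s3, bucket, skip_key=key)
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(metadata).encode("utf-8"),
        ContentType="application/json",
    )
    return metadata


def handler(event, context):
    """Lambda entry point, invoked by the EventBridge schedule."""
    import boto3  # imported here so the module loads without boto3 installed

    metadata = refresh(boto3.client("s3"))
    return {"dataset_count": len(metadata["datasets"])}
```

With this shape, the /dataset endpoint only needs a single `get_object` call on the metadata key, and forcing a refresh amounts to invoking the Lambda by hand.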
Possible solution 2: the /dataset endpoint attempts to read the dataset metadata from a file. If the file does not exist, or is older than the max interval (hour? day? week?), the endpoint scans the bucket, updates the metadata, writes it back to the file, and returns the updated data to the user. If the file exists and is not older than the max interval, the endpoint returns the data from the file.
Pro:
- All code modifications are contained within the API (no need to create separate Lambda + EventBridge rule)
- Trigger refresh by deleting the dataset metadata files
Con:
- One request per max interval will still have to scan the full S3 bucket
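The read-through logic of solution 2 could be sketched roughly as follows. This version uses a local file as a stand-in for wherever the metadata file would actually live; with an S3 object, the age check would compare against the object's `LastModified` instead. `get_dataset_metadata`, `scan_bucket`, and the one-day `MAX_AGE_SECONDS` are hypothetical names and values, not part of the existing API.

```python
import json
import os
import time

MAX_AGE_SECONDS = 24 * 60 * 60  # assumed max interval of one day


def get_dataset_metadata(cache_path, scan_bucket, max_age=MAX_AGE_SECONDS, now=None):
    """Return cached metadata if it is fresh, otherwise rescan and rewrite it.

    scan_bucket is any zero-argument callable that performs the slow S3 scan
    and returns JSON-serialisable metadata.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(cache_path)
        if age <= max_age:
            with open(cache_path, encoding="utf-8") as f:
                return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        pass  # missing or unreadable file: fall through to a rescan

    metadata = scan_bucket()
    # Write to a temporary file first so readers never see a partial file.
    tmp_path = f"{cache_path}.tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(metadata, f)
    os.replace(tmp_path, cache_path)
    return metadata
```

Deleting the file forces the next request to rescan, which matches the refresh mechanism listed under Pro. Note that concurrent requests arriving while the file is stale would each trigger their own scan unless some locking is added.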