Cache subtitles in S3 storage #287
base: main
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@           Coverage Diff            @@
##            main    #287    +/-    ##
========================================
- Coverage   1.54%   1.50%   -0.05%
========================================
  Files         11      11
  Lines       1102    1132     +30
  Branches     162     170      +8
========================================
  Hits          17      17
- Misses      1085    1115     +30
I'm a bit puzzled about this issue/PR now.
The original goal I expressed in the issue was to avoid being blocked by a yt-dlp ban when subtitles have not changed. That goal is not reached in this PR, since we still need to get the list of subtitles with yt-dlp.
At the same time, caching the list of subtitles as well is probably not desirable: we usually do not cache responses to API calls in S3, but rather resources which take time or compute to regenerate (re-encoded videos and images). The idea of caching subtitles in S3 was probably already a deviation from this usual behavior.
To help sort this out, can you please share some data showing whether this change makes the scraper run faster?
The reason I used yt-dlp to get the list of subtitles is that the scraper doesn't know which language subtitles need to be downloaded. Without the
To avoid calling yt-dlp entirely in this scenario, we could save two zipped files in the S3 cache:
WDYT?
This seems like a potential improvement, but does it really help the scraper run faster? It also has a drawback: we would not know when to invalidate these cached files in order to update them.
Let's pause this issue/PR to let me reflect a bit on this.
This PR modifies the download_subtitles method to cache subtitles in S3. The modified method now works as follows:
1. Get the list of available subtitle languages with yt-dlp (e.g., en, fr, de) and store them in requested_subtitle_keys.
2. Iterate over requested_subtitle_keys and attempt to download each subtitle file from the S3 cache. If a file is successfully downloaded from the S3 cache, remove the corresponding key from requested_subtitle_keys.
3. If any keys remain in requested_subtitle_keys, download the subtitles using yt-dlp.

Close #277
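
For readers skimming the thread, the cache-then-fallback flow described above can be sketched roughly as follows. This is not the PR's actual code: the function name download_subtitles_with_cache and the three injected callables (list_subtitle_keys, fetch_from_cache, fetch_with_ytdlp) are hypothetical stand-ins for the yt-dlp and S3 calls, so the sketch runs without either dependency. It also makes the reviewer's point visible: step 1 still goes through yt-dlp even when every subtitle file is already cached.

```python
from typing import Callable, Iterable, List, Tuple


def download_subtitles_with_cache(
    video_id: str,
    list_subtitle_keys: Callable[[str], Iterable[str]],   # hypothetical: wraps the yt-dlp listing call
    fetch_from_cache: Callable[[str, str], bool],         # hypothetical: True if the S3 download succeeded
    fetch_with_ytdlp: Callable[[str, List[str]], None],   # hypothetical: wraps the yt-dlp download call
) -> Tuple[List[str], List[str]]:
    """Return (keys served from the cache, keys downloaded with yt-dlp)."""
    # Step 1: the list of languages still comes from yt-dlp.
    requested_subtitle_keys = list(list_subtitle_keys(video_id))

    # Step 2: try the cache for each key; drop the keys that were found.
    served_from_cache: List[str] = []
    for key in list(requested_subtitle_keys):  # iterate over a copy while removing
        if fetch_from_cache(video_id, key):
            requested_subtitle_keys.remove(key)
            served_from_cache.append(key)

    # Step 3: fall back to yt-dlp only for the keys the cache did not have.
    if requested_subtitle_keys:
        fetch_with_ytdlp(video_id, requested_subtitle_keys)

    return served_from_cache, requested_subtitle_keys
```

With stub callables where only the en subtitle is cached, the function serves en from the cache and passes fr and de to the yt-dlp fallback in a single call.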