Update download_subtitles method to cache subtitles in S3 storage
dan-niles committed Aug 4, 2024
1 parent 60b85b5 commit d34a734
Showing 2 changed files with 55 additions and 5 deletions.
1 change: 1 addition & 0 deletions CHANGELOG
@@ -9,6 +9,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Changed
 - Disable preloading of subtitles in video.js in `zimui` (#38)
+- Update `download_subtitles` method to cache subtitles in S3 storage (#277)
 
 ## [3.0.0] - 2024-07-29
 
59 changes: 54 additions & 5 deletions scraper/src/youtube2zim/scraper.py
@@ -882,21 +882,70 @@ def add_video_subtitles_to_zim(self, video_id: str):
     def download_subtitles(self, video_id, options):
         """download subtitles for a video"""
 
+        def get_subtitle_s3_key(code: str) -> str:
+            return f"subtitles/{video_id}/subtitle.{code}.vtt"
+
+        def get_subtitle_path(code: str) -> str:
+            return options["y2z_videos_dir"].joinpath(f"{video_id}/video.{code}.vtt")
+
         options_copy = options.copy()
-        options_copy.update({"skip_download": True, "writethumbnail": False})
+        options_copy.update(
+            {"skip_download": True, "writethumbnail": False, "listsubs": True}
+        )
+
+        # Fetch the list of requested subtitles
         try:
             with yt_dlp.YoutubeDL(options_copy) as ydl:
-                ydl.download([video_id])
+                info = ydl.extract_info(video_id, download=False)
+                requested_subtitles = (
+                    info.get("requested_subtitles", {}) if info else None
+                )
+                if not requested_subtitles:
+                    return True
+                requested_subtitle_keys = list(requested_subtitles.keys())
+        except Exception as e:
+            logger.error(f"Could not fetch subtitles for {video_id}: {e}")
+            return False
+
+        # Download subtitles from cache if available
+        if self.s3_storage:
+            for subtitle_key in requested_subtitles:
+                subtitle_path = get_subtitle_path(subtitle_key)
+                s3_key = get_subtitle_s3_key(subtitle_key)
+                logger.debug(
+                    f"Attempting to download subtitles for {video_id} from cache..."
+                )
+                if self.download_from_cache(s3_key, subtitle_path, ""):
+                    requested_subtitle_keys.remove(subtitle_key)
+
+        # Download subtitles using yt-dlp
+        try:
+            if len(requested_subtitle_keys) > 0:
+                options_copy.update(
+                    {"sublangs": requested_subtitle_keys, "listsubs": False}
+                )
+                with yt_dlp.YoutubeDL(options_copy) as ydl:
+                    ydl.download([video_id])
+        except Exception:
+            logger.error(f"Could not download subtitles for {video_id}")
+        else:
+            # upload to cache only if everything went well
+            if self.s3_storage:
+                for subtitle_key in requested_subtitle_keys:
+                    subtitle_path = get_subtitle_path(subtitle_key)
+                    s3_key = get_subtitle_s3_key(subtitle_key)
+                    logger.debug(f"Uploading subtitle for {video_id} to cache ...")
+                    self.upload_to_cache(s3_key, subtitle_path, "")
+
+            # save subtitle keys to local cache for generating JSON files later
             subtitles_list = self.fetch_video_subtitles_list(video_id)
-            # save subtitles to cache for generating JSON files later
             save_json(
                 self.subtitles_cache_dir,
                 video_id,
                 subtitles_list.dict(by_alias=True),
             )
             self.add_video_subtitles_to_zim(video_id)
-        except Exception:
-            logger.error(f"Could not download subtitles for {video_id}")
+        return True
 
     def download_video_files_batch(self, options, videos_ids):
         """download video file and thumbnail for all videos in batch
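
Note on the change: the cache-first flow in `download_subtitles` (look up each requested subtitle in the cache, fetch only the misses, then upload the newly fetched files once the download has succeeded) can be sketched in isolation. This is a minimal illustration, not the scraper's code. `DictCache`, `fetch_subtitles`, and `fake_downloader` are hypothetical stand-ins for the S3 storage, the method itself, and yt-dlp; only the key layout `subtitles/{video_id}/subtitle.{code}.vtt` is taken from the diff.

```python
from typing import Callable, Dict, List, Optional


class DictCache:
    """Hypothetical in-memory stand-in for the S3 cache."""

    def __init__(self) -> None:
        self.objects: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self.objects.get(key)

    def put(self, key: str, content: str) -> None:
        self.objects[key] = content


def subtitle_key(video_id: str, code: str) -> str:
    # Same key layout as get_subtitle_s3_key in the diff.
    return f"subtitles/{video_id}/subtitle.{code}.vtt"


def fetch_subtitles(
    video_id: str,
    codes: List[str],
    cache: DictCache,
    downloader: Callable[[str, List[str]], Dict[str, str]],
) -> Dict[str, str]:
    """Return subtitle content per language code, preferring the cache."""
    result: Dict[str, str] = {}
    missing: List[str] = []

    # Step 1: take whatever the cache already has.
    for code in codes:
        cached = cache.get(subtitle_key(video_id, code))
        if cached is not None:
            result[code] = cached
        else:
            missing.append(code)

    # Step 2: download only the languages that were not cached.
    if missing:
        downloaded = downloader(video_id, missing)
        # Step 3: upload to the cache only after the download succeeded;
        # if downloader raises, nothing is uploaded.
        for code in missing:
            cache.put(subtitle_key(video_id, code), downloaded[code])
            result[code] = downloaded[code]

    return result


# Usage with a fake downloader that records what it was asked for.
calls: List[List[str]] = []


def fake_downloader(video_id: str, codes: List[str]) -> Dict[str, str]:
    calls.append(list(codes))
    return {code: f"WEBVTT ({video_id}, {code})" for code in codes}


cache = DictCache()
cache.put(subtitle_key("abc", "en"), "WEBVTT cached en")
subs = fetch_subtitles("abc", ["en", "fr"], cache, fake_downloader)
```

As in the diff, the sketch tracks cache misses in a separate list instead of removing items from the collection being iterated, and it uploads only the entries that were actually downloaded.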
