HoloSubs Search

Tool for searching transcriptions of vtuber videos.

Uses:

  • Metadata from Holodex
  • Subtitles and audio from YouTube and some archive sites
  • pyannote-audio for speaker diarization
  • Whisper for transcription

example.png

Setup

  • Use Python 3.11+
  • Install dependencies with python3.11 -m pip install -r requirements.txt
  • Set the HOLODEX_API_KEY environment variable before starting
  • (Optional) Set the HUGGINGFACE_TOKEN environment variable

See env_config.py for configurable ENV variables.
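
As an illustration of the required versus optional variables listed above, here is a minimal sketch of an early configuration check. It is not the project's env_config.py; the function name `load_config` and the returned dictionary keys are assumptions made for this example.

```python
import os


def load_config(env=None):
    """Read the two variables named in Setup, failing early if the required one is missing."""
    # Hypothetical helper, not part of holo_subs_search.
    env = os.environ if env is None else env
    api_key = env.get("HOLODEX_API_KEY")
    if not api_key:
        raise RuntimeError("HOLODEX_API_KEY is not set")
    return {
        "holodex_api_key": api_key,
        # Optional: stays None when the variable is absent.
        "huggingface_token": env.get("HUGGINGFACE_TOKEN"),
    }


# A plain dict stands in for the real environment here.
config = load_config({"HOLODEX_API_KEY": "example-key"})
```

Passing a dictionary instead of reading `os.environ` directly keeps the check easy to try without touching the real environment.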

Quickstart

  • Fetch list of all Hololive channels

    python3.11 -m holo_subs_search --holodex-fetch-org-channels Hololive
  • Go to ./data/channels/ and delete the channels you don't care about. This will greatly limit the amount of data that will have to be downloaded, and will speed everything up.

  • Fetch the list of all videos for the channels you kept, including collabs on other channels.

    python3.11 -m holo_subs_search --holodex-refresh-videos
  • Fetch subtitles for all videos (This takes a while). Only English subtitles are downloaded by default.

    python3.11 -m holo_subs_search --youtube-fetch-subtitles
    python3.11 -m holo_subs_search --youtube-fetch-subtitles --youtube-fetch-subtitles-langs en jp id
  • (Optional) Fetch subtitles for membership videos. This requires you to have a membership and to be logged in to YouTube in your browser.

    python3.11 -m holo_subs_search --video-filter flags:includes:youtube-membership --youtube-fetch-subtitles --youtube-memberships UCHsx4Hqa-1ORjQTh9TYDhww --youtube-cookies-from-browser chrome

    If you plan to commit your data to a public git repo, run python3.11 -m holo_subs_search --storage-git-privacy public to automatically create .gitignore files that exclude all membership content from git.

  • Search

    python3.11 -m holo_subs_search --search "solo live"
    python3.11 -m holo_subs_search --search "solo.*?live" --search-regex
    python3.11 -m holo_subs_search --search "solo live" --search-subtitle-filter source:eq:youtube langs:includes:en
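
To show how the plain and regex searches above differ, here is a small sketch using Python's `re` module. It assumes `--search-regex` follows Python-style regular expression syntax, which this README does not confirm, and the subtitle lines are made up for illustration.

```python
import re

# Hypothetical subtitle lines, made up for this example.
lines = [
    "i'm preparing for my solo live next month",
    "the solo 3d live was amazing",
    "we went live solo yesterday",
]

# Same pattern as in the --search-regex example: "solo", then as few
# characters as possible, then "live".
pattern = re.compile(r"solo.*?live")

# Plain search: the exact phrase must appear.
plain_hits = [line for line in lines if "solo live" in line]

# Regex search: other words may sit between "solo" and "live".
regex_hits = [line for line in lines if pattern.search(line)]

print(plain_hits)
print(regex_hits)
```

The regex form also matches the second line, where "3d" sits between the two words, while neither form matches the third line because "live" comes before "solo".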

Audio Transcription

Some videos completely lack subtitles or the subtitles are VERY bad. In this case, you can try to download the audio and transcribe it yourself. The results can be better or worse depending on a lot of variables.

Most of these steps assume that you have Docker installed with Nvidia GPU support, a relatively powerful/new Nvidia GPU (like RTX 3090), and at least 64GB RAM (for loading long audio into memory). Everything should also be able to run on CPU, but you would have to use the *-cpu containers, and it would be a lot slower.

Check PERFORMANCE.md if you are curious about how long this can take.

Download all audio files

python3.11 -m holo_subs_search --youtube-fetch-audio

Diarize Audio

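Detecting which parts of the audio file contain speech is needed to prevent hallucinations later in the transcription step. As a bonus, this will also make it possible in the future to detect which lines were spoken by whom. This step corresponds to the --pyannote-diarize-audio flag used in the combined command under Everything Together.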
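
The idea of keeping only the speech parts can be sketched as follows. This is not the project's implementation: the `Segment` shape, the `merge_speech_regions` helper, and the `max_gap` threshold are assumptions made for illustration, and the timestamps are made up.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    speaker: str


def merge_speech_regions(segments, max_gap=0.5):
    """Merge segments separated by at most max_gap seconds into (start, end) regions."""
    regions = []
    for seg in sorted(segments, key=lambda s: s.start):
        if regions and seg.start - regions[-1][1] <= max_gap:
            # Overlapping or close enough: extend the previous region.
            regions[-1] = (regions[-1][0], max(regions[-1][1], seg.end))
        else:
            # Long silence (or first segment): start a new region.
            regions.append((seg.start, seg.end))
    return regions


# Hypothetical diarization output.
segments = [
    Segment(0.0, 4.2, "SPEAKER_00"),
    Segment(4.5, 9.0, "SPEAKER_01"),
    Segment(30.0, 35.0, "SPEAKER_00"),
]
regions = merge_speech_regions(segments)
print(regions)
```

Only the resulting regions would be passed on to transcription, so the long silent stretch between 9 and 30 seconds never reaches the model. The speaker labels are kept on the segments so that lines could later be attributed to speakers.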

Transcribe Audio

  • Start Whisper server

    docker compose up faster-whisper-server-cuda

    Alternatively, you should be able to use Whisper through the official OpenAI API by setting the WHISPER_BASE_URLS and WHISPER_API_KEYS environment variables and the --whisper-model parameter.

  • Transcribe the audio files into subtitles

    python3.11 -m holo_subs_search --whisper-transcribe-audio --whisper-langs multi

    This will try to automatically detect the language of audio segments, so the resulting transcription might contain multiple languages.

    Translations to a specific language can be created with --whisper-langs en, but the results are not that good, so this is not recommended. You can also use this parameter when Whisper detects the language incorrectly.

  • Search transcribed audio

    python3.11 -m holo_subs_search --search "solo live" --search-subtitle-filter source:eq:whisper

Everything Together

Here is a command that:

  • Refreshes list of videos
  • Does not process videos that already have subtitles generated by Whisper
  • Downloads subtitles and audio from YouTube and archive sites, including membership videos for channel UCHsx4Hqa-1ORjQTh9TYDhww
  • Diarizes and transcribes the audio into subtitles with Whisper
  • Deletes downloaded audio to free disk space

python3.11 -m holo_subs_search \
    --holodex-refresh-videos \
    --video-filter subtitle_sources:excludes:whisper \
    --youtube-memberships UCHsx4Hqa-1ORjQTh9TYDhww \
    --youtube-cookies-from-browser chrome \
    --youtube-fetch-subtitles \
    --youtube-fetch-subtitles-langs en \
    --youtube-fetch-audio \
    --youtube-clear-audio \
    --pyannote-diarize-audio \
    --whisper-transcribe-audio \
    --whisper-langs multi
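
For scheduled runs it can be convenient to assemble the same command from a script. The sketch below builds the argument list only from flags shown in this README; the `build_command` helper and its parameters are hypothetical, and whether the tool is sensitive to flag order is not something the README states, so the original order is kept.

```python
import shlex


def build_command(membership_channel=None, browser="chrome"):
    """Build the combined command as an argument list (hypothetical helper)."""
    cmd = [
        "python3.11", "-m", "holo_subs_search",
        "--holodex-refresh-videos",
        "--video-filter", "subtitle_sources:excludes:whisper",
    ]
    if membership_channel:
        # Membership videos need the channel ID and browser cookies.
        cmd += [
            "--youtube-memberships", membership_channel,
            "--youtube-cookies-from-browser", browser,
        ]
    cmd += [
        "--youtube-fetch-subtitles",
        "--youtube-fetch-subtitles-langs", "en",
        "--youtube-fetch-audio",
        "--youtube-clear-audio",
        "--pyannote-diarize-audio",
        "--whisper-transcribe-audio",
        "--whisper-langs", "multi",
    ]
    return cmd


cmd = build_command("UCHsx4Hqa-1ORjQTh9TYDhww")
# Print the shell-quoted command; to actually run it, pass the list to
# subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

Keeping the command as a list avoids shell quoting problems and makes the membership flags easy to leave out by calling `build_command()` without arguments.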

Processed Data

If you don't want to spend many hours or days downloading and processing everything, data for some channels can be found in the following repos:

Use --storage PATH to search data in the downloaded repo.

Development

Use pre-commit